Sequence Analysis by Additive Scales: DNA Structure for
Sequences and Repeats of All Lengths
Pierre Baldi
Dept. of Information and Computer Science
University of California, Irvine
Irvine, CA 92697-3425
Pierre-Francois Baisnee
Dept. of Information and Computer Science
University of California, Irvine
Irvine, CA 92697-3425
Abstract
Motivation: DNA structure plays an important role in a variety of biological processes. Different di- and trinucleotide scales have been proposed to capture various aspects of DNA structure including base stacking energy, propeller twist angle, protein deformability, bendability, and position preference. Yet, a general framework for the computational analysis and prediction of DNA structure is still lacking. Such a framework should in particular address the following issues: (1) construction of sequences with extremal properties; (2) quantitative evaluation of sequences with respect to a given genomic background; (3) automatic extraction of extremal sequences and profiles from genomic data bases; (4) distribution and asymptotic behavior as the length N of the sequences increases; and (5) complete analysis of correlations between scales.
Results: We develop a general framework for sequence analysis based on additive scales, structural or other, that addresses all these issues. We show how to construct extremal sequences and calibrate scores for automatic genomic and data base extraction. We show that distributions rapidly converge to normality as N increases. Pairwise correlations between scales depend both on background distribution and sequence length and rapidly converge to an analytically predictable asymptotic value. For di- and trinucleotide scales, normal behavior and asymptotic correlation values are attained over a characteristic window length of about 10-15 bp. With a uniform background distribution, pairwise correlations between empirically-derived scales remain relatively small and roughly constant at all lengths, except for propeller twist and protein deformability which are positively correlated. There is a positive (resp. negative) correlation between dinucleotide base stacking (resp. propeller twist and protein deformability) and AT-content that increases in magnitude with length. The framework is applied to the analysis of various DNA tandem repeats. We derive exact expressions for counting the number of repeat unit classes at all lengths. Tandem repeats are likely to result from a variety of different mechanisms, a fraction of which is likely to depend on profiles characterized by extreme structural features.
Contact: [email protected], [email protected]
1 Introduction
Evidence is mounting that DNA structural properties
beyond the double helical pattern play an important role
in a number of fundamental biological processes, both
under healthy and pathological conditions. This is not
too surprising if one realizes that meters of DNA must
be compacted into a nucleus that is only a few microns in
diameter while, at the same time, preserving the ability
to turn thousands of genes on and off in a precisely
orchestrated fashion. The three-dimensional structure of
DNA, as well as its organization into chromatin fibers,
seems to be essential to its functions and has been implicated in diverse phenomena ranging from protein binding
sites, to gene regulation, to triplet repeat expansion diseases. The goal of this work is to develop computational
methods for the structural analysis of DNA sequences.
While DNA structure is our primary motivation and area
of application, the framework we develop is completely
general and applies to sequences over any alphabet, including codon, RNA, and protein alphabets, whenever
local additive scales, as defined below, are available.
1.1 DNA Structure
DNA structure has been found to depend on the exact sequence of nucleotides, an effect that seems to
be caused largely by interactions between neighboring
base pairs (Ornstein et al., 1978; Satchwell et al., 1986;
Breslauer et al., 1986; Calladine et al., 1988; Goodsell
& Dickerson, 1994; Sinden, 1994; Brukner et al., 1995;
Hassan & Calladine, 1996; Hunter, 1996; Ponomarenko
et al., 1999; Fye & Benham, 1999). This means that different sequences can have different intrinsic structures,
or different propensities for forming particular structures. Periodic repetitions of bent DNA in phase with
the helical pitch, for instance, will cause DNA to assume a macroscopically curved structure. Flexible or
intrinsically curved DNA is energetically more favorable
to wrap around histones than rigid and unbent DNA,
and this has been shown to influence nucleosome positioning (Drew & Travers, 1985; Satchwell et al., 1986;
Simpson, 1991; Lu et al., 1994; Wolffe & Drew, 1995;
Baldi et al., 1996; Zhu & Thiele, 1996; Liu & Stein,
1997). In addition, the chromatin complex structure
of DNA and the positioning of nucleosomes along the
genome have been found to play an important (generally inhibitory) role in regulation of gene transcription (Pazin & Kadonaga, 1997; Tsukiyama & Wu, 1997;
Werner & Burley, 1997; Pedersen et al., 1998). Sequence-dependent DNA structure is often important for DNA
binding proteins, such as TBP (TATA-binding-protein)
(Parvin et al., 1995; Starr et al., 1995; Grove et al., 1996)
and gene regulation (Sheridan et al., 1998). While the
number of resolved structures of DNA-protein complexes
continues to grow in the PDB data base, the field of computational DNA structural analysis is clearly far behind
its protein cousin and completely lacks any degree of
systemicity. Most likely, most DNA structural signals
remain to be uncovered.
Author footnote: and Department of Biological Chemistry, College of Medicine, University of California, Irvine. To whom all correspondence should be addressed.
1.2 DNA Structural Scales
Based on many different empirical measurements or theoretical approaches, several models have been constructed that relate the nucleotide sequence to DNA flexibility and curvature (Ornstein et al., 1978; Satchwell et al., 1986; Goodsell & Dickerson, 1994; Sinden, 1994; Brukner et al., 1995; Hassan & Calladine, 1996; Hunter, 1996; Baldi et al., 1998; Ponomarenko et al., 1999). These models are typically in the form of dinucleotide or trinucleotide scales that assign a particular value to each di- or tri-nucleotide and its reverse complement. A non-exhaustive list of such scales includes:
1. The dinucleotide base stacking energy (BS) scale (Ornstein et al., 1978) expressed in kilocalories per mole. The scale is derived from approximate quantum mechanical calculations on crystal structures.
2. The dinucleotide propeller twist angle (PT) scale (Hassan & Calladine, 1996) measured in degrees. This scale is based on X-ray crystallography of DNA oligomers. Dinucleotides with a large negative propeller-twist angle tend to be more rigid than dinucleotides with a low negative propeller-twist angle.
3. The dinucleotide protein deformability (PD) scale (Olson et al., 1998) derived from empirical energy functions extracted from the fluctuations and correlations of structural parameters in DNA-protein crystal complexes. Dinucleotides with large PD values tend to be more flexible.
4. The trinucleotide bendability (B) model (Brukner et al., 1995) based on DNase I cutting frequencies. The enzyme DNase I preferably binds (to the minor groove) and cuts DNA that is bent, or bendable, towards the major groove (Lahm & Suck, 1991; Suck, 1994). Thus DNase I cutting frequencies on naked DNA can be interpreted as a quantitative measure of major groove compressibility or anisotropic bendability. These frequencies allow for the derivation of bendability parameters for the 32 complementary trinucleotide pairs. Large B values correspond to flexibility.
5. The trinucleotide position preference (PP) scale derived from experimental investigations of the positioning of DNA in nucleosomes. It has been found that certain trinucleotides have strong preference for being positioned in phase with the helical repeat. Depending on the exact rotational position, such triplets will have minor grooves facing either towards or away from the nucleosome core (Satchwell et al., 1986). Based on the premise that flexible sequences can occupy any rotational position on nucleosomal DNA, these preference values can be used as a triplet scale that measures DNA flexibility. Hence, in this model, all triplets with close to zero preference are assumed to be flexible, while triplets with preference for facing either in or out are taken to be more rigid and have larger PP values. Note that we do not use this scale as a measure of how well different triplets form nucleosomal DNA. Instead, the absolute value, or unsigned nucleosome positioning preference, is used here, as in (Pedersen et al., 1998), as a measure of DNA flexibility.
For completeness, all these scales are displayed in the Appendix.
In previous studies, we found these models useful (Baldi et al., 1996; Pedersen et al., 1998; Baldi et al., 1999), in particular for the detection of putative new structural signatures associated with an increase of bendability in downstream regions of RNA polymerase II promoters. A similar approach (Liao et al., 2000) was used to analyze the structure of insertion sites for P transposable elements in Drosophila melanogaster and suggest that the corresponding transposition mechanism recognizes a structural signature rather than a specific sequence motif.
With the exception of BS, all the models were determined by purely experimental observations of sequence-structure correlations. Additional scales capturing DNA properties related directly or indirectly to structure, such as enthalpy, or melting temperature, have also been proposed (Breslauer et al., 1986; Ponomarenko et al., 1999).
The primary focus of this work is not on assessing the merits and pitfalls of each model, but rather on the development of general methods for the systematic application of any scale to any sequence of any length, up to entire genomes, under the assumption that the scale can be used additively within a sliding window. In general, this assumption will provide a reasonable approximation, at least up to a certain length to be determined experimentally. In particular, we are interested in the
development of methods for the automatic recognition
of structural motifs associated with extremal features,
such as extreme stiffness or bendability. Calibration of
corresponding thresholds is expected to be useful in data
base searches and is conceptually similar, for instance,
to the calibration of thresholds for detecting sequence
homology. More generally, however, data base searches
may also be conducted on the basis of structural signatures or profiles that need not be extremal and could
be obtained from reasonable training sets. Certain protein binding sites, for instance, are highly degenerate at
the DNA sequence level, with low sequence homology,
while exhibiting at the same time a high degree of DNA
structural similarity. Similarly, periodic flexible triplets
in phase with the double helical pitch are necessary to
ensure long range curvature, for instance in nucleosome
regions.
Although several scales may agree on some structural
features, the fact remains that they may also display
divergent interpretations of some sequence elements.
While no final consensus regarding these models exists,
it is likely that each one provides a slightly different and
partially complementary view of DNA structure. Thus a
second goal of this work is the comparison of the models
in the limited sense of estimating the statistical correlation between different scales. In (Baldi et al., 1998)
it was shown that by and large many of the commonly
used scales exhibit low correlations measured at the level
of single di- or tri-nucleotides. Empirical measurements
of correlations between the scales over longer lengths in
Escherichia coli have recently revealed different unexplained patterns (Pedersen et al., 2000). Here we provide
a complete explanation of these phenomena and show how
correlations vary with background distribution and with
window length. Finally, while the methods introduced
can be applied to any DNA sequence, we focus here on a
particularly important class of DNA sequences, namely
DNA tandem repeats, where the general framework is
further specialized.
1.3 DNA repeats
Genomes, especially eukaryotic genomes, are replete with DNA repetitive regions (Jurka et al., 1992; Jeffreys, 1997; Jurka, 1998). Well over 30% of the human genome has been estimated to comprise repetitive DNA of some sort (Benson & Waterman, 1994), the exact function of which is often unknown. Such DNA arises through many different evolutionary and genetic mechanisms. Over 950 different classes of repeats (Jurka, 1998) have been catalogued. Two major groups of repeats exist: interspersed repeats and tandem repeats. While the methods to be developed can be applied to both groups, our analysis will focus on tandem repeats, consisting of two or more contiguous copies of a particular pattern of nucleotides. Tandem repeats may cover up to 10% of the human genome.
Tandem repeats vary widely, over several orders of magnitude, both in terms of the length of the repeating pattern and the number of more or less exact contiguous copies. Repeats are often polymorphic and therefore play a major role in linkage studies and DNA fingerprinting. In many cases, the genetic origin, the structure, and the function of these repetitive regions are poorly understood. There exist a few examples, however, where the repeats are known to play a biological role in both healthy and pathological conditions. Certain tandem repeats, for instance, have been associated with protein binding sites or interactions with transcription factors. An important advance in epigenetics research has been the realization that interactions between repeated DNA sequences can trigger the formation and the transmission of inactive genetic states and DNA modifications (Wolffe & Matzke, 1999). In several of these cases, the particular DNA-helical structural features of the repeat sequences seem to play an essential role.
Interest in tandem repeats has been heightened over the last few years by the discovery that several important degenerative disorders including Huntington disease, Myotonic Dystrophy, Fragile X Syndrome, and several forms of Ataxia, result from the abnormal expansion of particular DNA triplets (The Huntington's Disease Collaborative Research Group, 1993; Ashley & Warren, 1995; Ross, 1995; Gusella & MacDonald, 1996; Hardy & Gwinn-Hardy, 1998; Rubinsztein & Hayden, 1998; Baldi et al., 1999). The exact mechanism by which a triplet repeat mutation causes disease varies, as indicated by the fact that currently known repeat expansions are found in 5' UTRs, in 3' UTRs, in introns, and within coding sequences of various affected genes (Ashley & Warren, 1995; Gusella & MacDonald, 1996; Rubinsztein & Amos, 1998; Rubinsztein & Hayden, 1998). For instance, fragile X mental retardation is associated with an expanded CGG repeat in the 5' UTR of the FMR1 gene (Nelson, 1995; Eichler & Nelson, 1998). The 64 possible triplets can be clustered into 12 equivalence classes when shift and reverse complement operations are considered (see below). Currently only three repeat classes, CAG, CGG, and GAA, out of the possible twelve, are associated with triplet repeat disorders.
There is evidence that unusual structural features of the repeats play a role in their expansion (Wells, 1996; Pearson & Sinden, 1998a; Pearson & Sinden, 1998b; Moore et al., 1999). In (Baldi et al., 1999), the structural scales above were used to show that the triplet classes involved in the diseases have extreme structural characteristics of very high or very low flexibility. Methods to quantify the degree of extremality relative to other sequences, however, were not developed. Furthermore, other triplet or non-triplet repeats may play a role in diseases as well as other biological processes. Therefore the techniques need to be improved and extended to all classes of repeats.
Hence, given the importance of repeating patterns and the exponential growth of sequence data bases, our goal is also to develop new tools for the computational analysis of the structural properties of arbitrary repeats and begin to apply such techniques in a systematic and quantifiable way. Various algorithms for searching tandem repeats (Milosavljevic & Jurka, 1993; Benson & Waterman, 1994; Benson, 1999; Blanchard et al., 2000) have been developed. The techniques presented here can also be viewed as complementing such algorithms by introducing a structural perspective.
1.4 Organization
The remainder of the paper is organized as follows. In the next section we develop a general framework for the analysis of the score of a sequence (repetitive or not) under any additive scale. We determine the number of different sequence equivalence classes under circular permutation and reverse complement operations. We show how to determine and visualize maximal and minimal patterns and study the statistical properties of the scales, including intra-scale (mean and variance) and inter-scale (correlations) statistics for sequences of various lengths, as well as asymptotic normality. This framework is essential in order to compare the behavior of various scales, to locate a given sequence with respect to a comparable population, and to automatically set thresholds in data base searches. We then apply the general framework to the 5 structural models described above and various tandem repeats.
2 Methods and Theory
2.1 General Framework
The general framework we consider begins with an alphabet $\mathcal{A}$ of size $A$ and a scale $\mathcal{S}$ of length (or size) $S$. The scale is a function that assigns a value to any $S$-tuple of the alphabet, for instance in the form of a table with $A^S$ entries. In the results section, we deal exclusively with the nucleotide DNA alphabet ($A = 4$) and with DNA scales, such as dinucleotide ($S = 2$, e.g. propeller twist) or trinucleotide ($S = 3$, e.g. bendability) structural scales. The same framework, however, can readily be applied to other situations (e.g. amino acid alphabet with hydrophobicity scales). Given a primary sequence $s = X_1 X_2 \ldots X_N$ of length $N \ge S$ over $\mathcal{A}$, we assume that the scale $\mathcal{S}$ is approximately additive in the sense that the corresponding global property of the sequence $s$ can be estimated by "sliding" the scale along the sequence in the form
$$\mathcal{S}(s) = \mathcal{S}(X_1 \ldots X_S) + \mathcal{S}(X_2 \ldots X_{S+1}) + \ldots = \sum_{i=1}^{N-S+1} \mathcal{S}(X_i \ldots X_{i+S-1}) \qquad (1)$$
In practical applications, such a quantity can also be averaged over a window of length $W$ to get, for instance, a more homogeneous per base-pair value ($W \le N$). This averaging process does not concern us at this stage since it merely amounts to using a different scale, with a larger size. The form given in Equation 1 corresponds to a free boundary condition. The ideas to be developed can be applied to other boundary conditions, including periodic boundary conditions, where the sequence is wrapped around, as described below. With the proper modifications, the theory applies immediately to the case where the scales are shifted by more than one position at each step.
Consider now a repeat sequence $r$ consisting of a unit pattern or period $p = (X_1 \ldots X_P)$ of length $P$, and repetition number $R > 1$, so that $r = (X_1 \ldots X_P)^R$ with $N = PR \ge S$. Notice that the period is not uniquely determined since, for instance, $XXXX$ can be viewed as $(X)^4$, or as $(XX)^2$. In addition, we will assume that $P + S - 1 \le N$, or equivalently that $S \le (R - 1)P + 1$, so that the scale $\mathcal{S}$ is applied starting at least once from each letter in the repetitive unit, without exceeding the repetitive sequence boundary. In this case, $\mathcal{S}(r)$ has the form:
$$\mathcal{S}(r) = l\,\mathcal{S}(p) + \rho \qquad (2)$$
where $\mathcal{S}(p)$ is the contribution of the periodic unit
$$\mathcal{S}(p) = \mathcal{S}(X_1 \ldots X_S) + \mathcal{S}(X_2 \ldots X_{S+1}) + \ldots + \mathcal{S}(X_P X_1 \ldots X_{S-1}) = \sum_{i=1}^{P} \mathcal{S}(X_i \ldots X_{i+S-1}) \ [\mathrm{mod}\ P] \qquad (3)$$
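As an illustration of Equations 1 and 3, the following minimal Python sketch scores a sequence under free boundary conditions and a repeat unit under periodic boundary conditions. The dinucleotide table used here (the number of A or T letters in each dinucleotide) is a hypothetical toy scale, chosen only so that the arithmetic is easy to check by hand; it is not one of the structural scales listed in Section 1.2, and the function names are ours.

```python
from typing import Dict


def additive_score(seq: str, scale: Dict[str, float]) -> float:
    """Equation 1: slide the scale along seq with a free boundary."""
    S = len(next(iter(scale)))  # scale length, read off the table keys
    return sum(scale[seq[i:i + S]] for i in range(len(seq) - S + 1))


def unit_score(unit: str, scale: Dict[str, float]) -> float:
    """Equation 3: score of the repeat unit with indices taken modulo P."""
    S = len(next(iter(scale)))
    P = len(unit)
    return sum(
        scale["".join(unit[(i + j) % P] for j in range(S))]
        for i in range(P)
    )


# Hypothetical toy scale (an assumption, not a published scale):
# the AT count of each dinucleotide.
toy_scale = {a + b: float((a in "AT") + (b in "AT"))
             for a in "ACGT" for b in "ACGT"}

repeat = "CAG" * 3
total = additive_score(repeat, toy_scale)   # 8 overlapping dinucleotides
per_unit = unit_score("CAG", toy_scale)     # CA + AG + GC, wrapping around
```

With this toy scale the nine-letter repeat scores 6 and the unit CAG scores 2.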
The number $l$ of times the periodic unit is covered by $\mathcal{S}$ and its shifted version is given by:
$$l = \left\lfloor \frac{PR - S + 1}{P} \right\rfloor = R - \left\lceil \frac{S - 1}{P} \right\rceil \qquad (4)$$
Finally, if $lP + S - 1 = RP$ then the boundary tail $\rho$ is equal to 0. Otherwise
$$\rho = \mathcal{S}(X_{lP+1} \ldots X_{lP+S}) + \ldots + \mathcal{S}(X_{RP-S+1} \ldots X_{RP}) \qquad (5)$$
where indices can be taken modulo $P$, i.e. $X_{lP+1} = X_1$ and so forth. The sum in Equation 5 has at most $P - 1$ terms. In practice, at least in the case of DNA, only short scales are currently available and therefore in most cases, $S \le P + 1$. In this case, Equation 2 simplifies to:
$$\mathcal{S}(r) = (R - 1)\,\mathcal{S}(p) + \rho \qquad (6)$$
2.2 Equivalence Classes
In the special case of repetitive sequences, we also need to be able to count the number of different repeats with respect to a given scale. It is often the case that the scale $\mathcal{S}$ is characterized by some kind of invariance with respect to the sequences of length $S$ over $\mathcal{A}$. In the case of DNA, the structural scales we have are invariant with respect to the reverse complement. When looking at repeat sequences, this determines how many different repeat patterns of length $P$ need to be considered.
A triplet repeat, for instance, can be described in terms of different unit trinucleotides depending on what strand and triplet frame is chosen. Thus, the repeat CAGCAGCAG... can be said to be a repeat of the triplet CAG, and also of its reverse complement CTG. Ignoring repeat boundaries, however, the sequence can also be described as a repeat of the shifted triplet pairs AGC/GCT and GCA/TGC. In this way, the 64 different trinucleotides can be divided into 12 possible repeat classes. Of these 12 classes, only 10 are proper triplet repeat classes in the sense that they do not result from a repeat pattern of shorter length. The two classes associated with shorter patterns are obviously the triplet pairs AAA/TTT and CCC/GGG which are more precisely described as mono-nucleotide repeats. [For a generic alphabet $\mathcal{A}$, a reverse complement operation can be defined by introducing a one-to-one function $X \to \bar{X}$ from the alphabet to itself, satisfying $\bar{\bar{X}} = X$, so that the reverse complement of $X_1 \ldots X_N$ is defined to be $\bar{X}_N \ldots \bar{X}_1$.]
In the case of a DNA repeat with unit repeat length $P$, the number of classes and the number of elements in each equivalence class is dictated by the action of the group of transformations associated with the circular permutations and the reverse complement operations on the set of all possible strings of length $P$. AAA.../TTT... and CCC.../GGG... always give rise to two separate classes with 2 elements each. In general, a typical class will contain $2P$ elements associated with the $P$ permutations and the $P$ reverse complements. Classes containing fewer elements, however, can arise for instance as a result of sub-periodicity effects when $P$ is not prime, and of identical reverse complement effects. For instance, when $P = 4$, the class of ATAT contains only 2 elements since it is identical to its reverse complement and can be shifted circularly only once before returning to the original pattern. The number of classes can be counted using standard group theory arguments detailed in the Appendix. These arguments are not restricted to circular permutation and reverse complement operations, but apply to any group of transformations over any sequences.
The number of classes, when only circular permutations without reverse complement are taken into account, is given by
$$\frac{1}{P} \sum_{d \mid P} \varphi\!\left(\frac{P}{d}\right) A^d = \frac{1}{P} \sum_{1 \le k \le P} A^{(P,k)} \qquad (7)$$
where $(P, k)$ is the greatest common divisor (gcd) of $P$ and $k$. $\varphi(n)$ is the Euler function counting the number of integers between 1 and $n$ which are prime to $n$, i.e. without common divisors with $n$. If $p_1, \ldots, p_k$ is the list of distinct prime factors of $n$, then the Euler function can be expressed as:
$$\varphi(n) = n \prod_{i=1}^{k} \left(1 - \frac{1}{p_i}\right) \qquad (8)$$
When both circular permutations and reverse complement are taken into account, the number of classes for odd $P$ is given by
$$\frac{1}{2P} \sum_{d \mid P} \varphi\!\left(\frac{P}{d}\right) A^d = \frac{1}{2P} \sum_{1 \le k \le P} A^{(P,k)} \qquad (9)$$
When $P$ is even, the corresponding number of classes is
$$\frac{1}{2P} \left[ \sum_{d \mid P} \varphi\!\left(\frac{P}{d}\right) A^d + \frac{P}{2} A^{P/2} \right] \qquad (10)$$
or, equivalently,
$$\frac{1}{2P} \left[ \sum_{1 \le k \le P} A^{(P,k)} + \frac{P}{2} A^{P/2} \right] \qquad (11)$$
In particular, when $P$ is an odd prime, the number of different classes under periodic and reverse complement equivalence is
$$\frac{1}{2P} \left[ (P - 1)A + A^P \right] \qquad (12)$$
The number of classes which are new at a given length $P$, i.e. that do not result from the repetition of a shorter pattern of length dividing $P$, can easily be obtained by subtracting the corresponding counts for each divisor of $P$. When $P$ is prime, all classes are new except for the classes resulting from mono-letter repeats. Table 1 in the Results section exemplifies the application of Equations 9-12.
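The counts in Equations 9-11 can be checked numerically for short repeat units. The sketch below is only an illustration under our own naming: it evaluates the gcd form of the formulas for the DNA alphabet and compares the result with a brute-force enumeration that groups every string of length P with all circular shifts of itself and of its reverse complement. The formula function assumes, as is the case for DNA, that no letter is its own complement.

```python
from itertools import product
from math import gcd

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def n_classes_formula(P: int, A: int = 4) -> int:
    """Equations 9 and 11: classes under circular shift and reverse complement."""
    total = sum(A ** gcd(P, k) for k in range(1, P + 1))
    if P % 2 == 0:
        total += (P // 2) * A ** (P // 2)  # extra term for even P
    return total // (2 * P)


def n_classes_bruteforce(P: int) -> int:
    """Enumerate all DNA strings of length P and count distinct classes."""
    representatives = set()
    for letters in product("ACGT", repeat=P):
        unit = "".join(letters)
        revcomp = unit.translate(COMPLEMENT)[::-1]
        variants = [w[i:] + w[:i] for w in (unit, revcomp) for i in range(P)]
        representatives.add(min(variants))  # canonical member of the class
    return len(representatives)


counts = {P: n_classes_formula(P) for P in range(1, 7)}
```

For P = 3 both functions return the 12 triplet classes mentioned in the text; the brute-force version is exponential in P and is meant only as a cross-check for small units.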
2.3 Extremal Sequences and Automata
We are interested in the construction and recognition of sequences $s$ that are extremal for $\mathcal{S}$, i.e. such that $\mathcal{S}(s)$ is very large or very small relative to the other sequences of length $N$. For this, we attach to each scale a prefix automaton, or prefix graph. The prefix automaton can be described by a directed graph containing $A^{S-1}$ nodes, each labeled by a string of length $S - 1$ over $\mathcal{A}$ of the form $X_1 \ldots X_{S-1}$ (see Figure 1 for an example). Each node has $A$ directed outgoing connections. $X_1 \ldots X_{S-1}$ is connected to $X_2 \ldots X_{S-1} Y$, for each letter $Y$ in $\mathcal{A}$, hence the notion of prefix. The weight (or length) of the corresponding transition is provided by the entry associated with $X_1 X_2 \ldots X_{S-1} Y$ in the structural table. The $A$ nodes labeled $(X)^{S-1} = XXX \ldots X$ (mono-repeats) are the only ones to have a self-connection. Any sequence $s$ of length $N$ is trivially associated with the path: $X_1 \ldots X_{S-1} \to X_2 \ldots X_S \to \ldots \to X_{N-S+2} \ldots X_N$. The value of $\mathcal{S}(s)$ is found just by adding the weights of the corresponding connections.
As a result, sequences associated with maximal or minimal values of $\mathcal{S}(s)$ correspond to paths in the prefix graph with maximal or minimal total weight or length. These can easily be found by standard dynamic programming techniques which can also be extended to finding, for instance, the $k$ longest or shortest paths. A repeat pattern of length $P$ is a directed cycle in the prefix automaton graph. Notice that any path of length greater than $A^{S-1}$ must intersect itself at least once. Thus any cycle of length strictly greater than $A^{S-1}$ must be composed of non-intersecting cycles of length at most $A^{S-1}$. For instance, with a dinucleotide scale, any repeat unit of length greater than 4 must contain at least two cycles of length at most 4. Therefore in the study of repeats, we need only to study the properties of all non-intersecting directed cycles of length up to $A^{S-1}$ together with all possible ways of joining them.
In addition to dynamic programming techniques, it is also useful to tabulate the weights of all possible short cycles for at least two reasons. First, because longer patterns are built from shorter cycles. Second, at least in the case of DNA, many important existing repeats, such as triplet repeats, are based on a short repeating pattern.
While the prefix graph is useful for constructing extremal sequences and recognizing them as long as $A$, $S$ and $N$ are small, it is also necessary to develop more general techniques by which we can rapidly assess, for any sequence $s$, the magnitude of $\mathcal{S}(s)$ with respect to all the other comparable sequences. This is best achieved by viewing the sequences in a probabilistic context.
2.4 Probabilistic Modeling
Consider now that sequences are being generated by a random process. In order to fix the ideas, we take for simplicity a Markov model of order 0, i.e. we assume that sequences are generated by $N$ tosses of the same die with distribution $D = (p_X)$ over the alphabet $\mathcal{A}$. The same analysis, however, can easily be extended to other probabilistic models such as higher-order Markov models where distributions are defined, for instance, on pairs or triplets of letters.
From Equation 1, $\mathcal{S}(s)$ is now a random variable which is the sum of $N - S + 1$ random variables: $\mathcal{S}(s) = Y_1 + \ldots + Y_{N-S+1}$. By construction, all the variables $Y_i = \mathcal{S}(X_i \ldots X_{i+S-1})$ have the same distribution, but they are not independent. Rather they satisfy a form of local dependence, called "m-dependence" in statistics. More precisely, for $i < j$, $Y_i$ and $Y_j$ are independent if and only if $j - i \ge S$. Using the linearity of the expectation, we have:
$$E(\mathcal{S}(s)) = (N - S + 1)\,E(Y_i) \approx N \mu_S \qquad (13)$$
with
$$E(Y_i) = \sum_{X_1 \ldots X_S} \mathcal{S}(X_1 \ldots X_S)\, p(X_1) \cdots p(X_S) = \mu_S \qquad (14)$$
the sum being over all $A^S$ $S$-tuples of the alphabet. To situate an individual sequence with respect to the entire population, we need to calculate the variance. The variance also can be calculated explicitly by taking advantage of the local dependence of the variables $Y_i$. We have
$$\mathrm{Var}(\mathcal{S}(s)) = (N - S + 1)\,\mathrm{Var}(Y_i) + 2 \sum_{0 < j - i < S} \mathrm{Cov}(Y_i, Y_j) \qquad (15)$$
with the covariances $\mathrm{Cov}(Y_i, Y_j) = E[(Y_i - E(Y_i))(Y_j - E(Y_j))]$. As soon as $j - i \ge S$, $Y_i$ and $Y_j$ are independent and the corresponding covariance is 0. Thus, for any given scale $\mathcal{S}$, one needs only to tabulate the expectation $E(Y_i)$ and the $S$ relevant short-range covariances
$$C_k = C_k(\mathcal{S}) = \mathrm{Cov}(Y_i, Y_{i+k}) \qquad (16)$$
for $0 \le k < S$ ($C_0 = \mathrm{Var}(Y_i)$). Alternatively, by factoring out the variance of $Y_i$, Equation 15 can also be expressed in terms of the correlations
$$\mathrm{Var}(\mathcal{S}(s)) = \mathrm{Var}(Y_i) \Big[ N - S + 1 + 2 \sum_{0 < j - i < S} \mathrm{Cor}(Y_i, Y_j) \Big] \qquad (17)$$
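The quantities in Equations 14-16 can be tabulated by direct enumeration when A and S are small. The following sketch, with function names of our own choosing, computes the mean and the short-range covariances for an order-0 background, and then evaluates Equation 15 by counting explicitly how many pairs of windows lie at each lag k (there are K - k such pairs among K = N - S + 1 windows). The AT-count dinucleotide table and the uniform background are hypothetical toy inputs, not values taken from the scales discussed in this paper.

```python
from itertools import product
from math import prod
from typing import Dict, List, Tuple


def scale_moments(scale: Dict[str, float],
                  p: Dict[str, float]) -> Tuple[float, List[float]]:
    """Return mu_S (Eq. 14) and [C_0, ..., C_{S-1}] (Eq. 16), order-0 background."""
    S = len(next(iter(scale)))
    letters = sorted(p)
    mu = sum(scale["".join(t)] * prod(p[x] for x in t)
             for t in product(letters, repeat=S))
    C = []
    for k in range(S):
        # Y_i and Y_{i+k} jointly depend on S + k consecutive letters.
        e_prod = sum(prod(p[x] for x in t)
                     * scale["".join(t[:S])] * scale["".join(t[k:k + S])]
                     for t in product(letters, repeat=S + k))
        C.append(e_prod - mu * mu)
    return mu, C


def score_variance(N: int, S: int, C: List[float]) -> float:
    """Eq. 15 with the K - k pairs at lag k counted among K = N - S + 1 windows."""
    K = N - S + 1
    return K * C[0] + 2 * sum((K - k) * C[k] for k in range(1, min(S, K)))


# Hypothetical toy inputs (assumptions): AT-count scale, uniform background.
toy_scale = {a + b: float((a in "AT") + (b in "AT"))
             for a in "ACGT" for b in "ACGT"}
uniform = {x: 0.25 for x in "ACGT"}

mu, C = scale_moments(toy_scale, uniform)
var_10 = score_variance(10, 2, C)
```

For this toy scale the score of a sequence is a weighted count of A and T letters, so the result can be verified by hand: the mean per window is 1, the covariances are 0.5 and 0.25, and the variance at N = 10 is 8.5.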
To obtain the exact variance at each length N , it is
then only a matter of counting how many times each
6
when P = 2n. Periodic boundary conditions must be
used in the computation of the covariances Ck whenever
necessary (jkj > P ; S ).
For a periodic sequence r, where the period P as well
as S are small relative to the length N = RP we can
use:
type of covariance is present in the sequences and adjust
for any boundary eects as needed.
If N 2S ; 1, then
Var(S (s)) = (N ; S + 1)C0
+ 2
SX
;1
k=1
(N ; S ; k + 1)Ck
(18)
E (S (r)) RE (S (p))
(25)
and the approximation
If S N < 2S ; 1, then
Var(S (r)) RVar(S (p))
Var(S (s)) = (N ; S + 1)C0
(26)
$$\mathrm{Var}(S(s)) = (N-S+1)\,C_0 + 2\sum_{k=1}^{N-S}(N-S-k+1)\,C_k \tag{19}$$

It is worth noticing that, for fixed $S$, both the expectation and the variance are linear in $N$. In particular, for large $N$,

$$\mathrm{Var}(S(s)) \approx N\Big[C_0 + 2\sum_{k=1}^{S-1} C_k\Big] = N\Big[\sum_{k=-S+1}^{S-1} C_k\Big] = N\sigma_S \tag{20}$$

In the last equality, for obvious symmetry reasons, we let $C_{-k} = C_k$. This notation will prove to be useful below.
In the case of repetitive sequences, it is also useful to calculate the expectation of $S(p) = Y_1 + \ldots + Y_P$, and its variance, with periodic boundary conditions modulo $P$, i.e. assuming the variables $Y_1, \ldots, Y_P$ and the corresponding letters are arranged along a circle. Here both the expectation and the variance are directly proportional to $P$ and satisfy $E(S(p)) = \mu_S P$ and $\mathrm{Var}(S(p)) = \sigma_S P$. Clearly, for any $P$, $E(S(p)) = P\,E(Y_i)$, so

$$\mu_S = E(Y_i) \tag{21}$$

If $P \geq 2S-1$,

$$\sigma_S = C_0 + 2\sum_{k=1}^{S-1} C_k = \sum_{k=-S+1}^{S-1} C_k \tag{22}$$

When $S \leq P < 2S-1$, all variables along the circle are dependent and therefore $\sigma_S = \sigma_S(P)$ is given by

$$\sigma_S(P) = C_0 + 2\sum_{k=1}^{n} C_k = \sum_{k=-n}^{n} C_k \tag{23}$$

when $P = 2n+1$, and

$$\sigma_S(P) = C_0 + C_n + 2\sum_{k=1}^{n-1} C_k = C_n + \sum_{k=-n+1}^{n-1} C_k \tag{24}$$

when $P = 2n$.
For long repetitive sequences with period $P < S$, we can use the same approach with a larger period $P'$, a multiple of $P$, such that $S \leq P'$.

2.4.1 Central Limit Theorem
$S(s)$ consists of a sum of identical but non-independent random variables. Therefore standard central limit theorems for sums of independent random variables cannot be applied. Yet, because the dependencies are local, a sum $Z = Y_1 + \ldots + Y_K$ of $K$ $m$-dependent random variables $Y_i$ still approaches a normal distribution. This can be shown using the theorem in (Baldi & Rinott, 1989), which also provides a bound on the rate of convergence. Here we use the improved bound found in (Rinott & Dembo, 1996). We let $\max_i |Y_i - E(Y_i)| = B$, and $E(\sum_{i=1}^{K} |Y_i - E(Y_i)|)/K = \mu$. For all the scales to be considered, these constants are well defined and easy to compute. Under these assumptions,

$$\left| P\left(\frac{Z - E(Z)}{\sqrt{\mathrm{Var}(Z)}} \leq u\right) - \Phi(u) \right| \leq \frac{7K\mu\,(2S-1)^2 B^2}{[\mathrm{Var}(Z)]^{3/2}} \tag{27}$$

where $\Phi(u)$ is the normalized Gaussian distribution. The factor $(2S-1)$ represents the size of the clusters associated with $m$-dependence. For a fixed scale, such size is constant but the theorem remains true if $S$ grows slowly with $N$. Thus Equation 27 can readily be applied to $S(s)$ or $S(r)$ with $K = N-S+1$ or $K = RP$. From Equation 20, the variance of the sequences being considered is linear in their length: $\mathrm{Var}(S(s)) \approx \sigma_S N$, where $\sigma_S$ depends only on the scale $S$. Thus we obtain a convergence rate that scales at most like $1/\sqrt{N}$:

$$\left| P\left(\frac{Z - E(Z)}{\sqrt{\mathrm{Var}(Z)}} \leq u\right) - \Phi(u) \right| \leq \frac{C}{\sqrt{N}} \tag{28}$$

with $C \approx 7(2S-1)^2 \mu B^2 \sigma_S^{-3/2}$. The rate of this bound is known to be essentially optimal (similar to the Berry-Esseen theorems (Feller, 1971)).
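As a rough numerical illustration of Equations 19, 20 and 22 (a sketch under stated assumptions, not the authors' code), the following Python functions evaluate the exact variance of a sequence score and its large-N slope. The function names are hypothetical, and the autocovariances are assumed to be supplied as a list C with C[k] = Cov(Y_i, Y_{i+k}) for k = 0, ..., S-1.

```python
def exact_variance(N, S, C):
    """Equation 19: Var(S(s)) for a sequence of length N and a scale of length S.

    C[k] is the autocovariance Cov(Y_i, Y_{i+k}) for k = 0 .. S-1; larger lags
    contribute nothing because the corresponding windows no longer overlap.
    """
    K = N - S + 1  # number of overlapping windows Y_1 .. Y_K
    if K < 1:
        raise ValueError("sequence shorter than the scale")
    var = K * C[0]
    for k in range(1, min(S, K)):
        var += 2 * (K - k) * C[k]  # (N - S - k + 1) pairs at lag k
    return var


def sigma_linear(S, C):
    """Equations 20 and 22: sigma_S = C_0 + 2 * sum_{k=1}^{S-1} C_k."""
    return C[0] + 2 * sum(C[1:S])


if __name__ == "__main__":
    # Rounded base stacking coefficients (C_0, C_1) as listed in Table 11.
    C_base_stacking = [6.62, 0.31]
    print(sigma_linear(2, C_base_stacking))
    print(exact_variance(1000, 2, C_base_stacking) / 1000)
```

With the rounded base stacking inputs, `sigma_linear` gives about 7.24, which agrees with the tabulated 7.23 up to rounding of the inputs, and the exact variance divided by N approaches the same value as N grows.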
2.4.2 Normalized Distances and Extremal Sequences
The value of $S(s)$ or $S(r)$ of any sequence or repeat of length $N$ can be compared to the average value of a background population by computing a normalized Z-score of the form:

$$Z(s) = \frac{S(s) - N\mu_S}{\sqrt{N\sigma_S}} \tag{29}$$

A repeat $r$ with repeat unit length $P$ and $R$ repetitions ($N = RP$) can be compared to a background population of repeats, or to a background population of generic sequences. In the latter case, we have $S(r) \approx S(p)R$, or $S(r) \approx \mu' N$ with $\mu' = S(p)/P$. Therefore the Z-score

$$Z(r) = \frac{\mu' - \mu_S}{\sqrt{\sigma_S}}\,\sqrt{N} \tag{30}$$

grows with $N$ and is larger than the Z-score $Z(p)$ computed on the repeat unit. In other words, if a repeat unit displays extremal features when compared to other repeat units of the same length, its expansion will appear even more extreme compared to the background of all sequences of similar length.
The Z-scores can be used to assess how extreme a sequence is and to search databases for subsequences with extremal features. As in the case of alignments, this can also be done using extreme value distributions (Durbin et al., 1998). Note also that one can search a database using a structural profile rather than extreme values. The degree of similarity between two profiles can be measured, for instance, using the standard mean square error.

2.4.3 Correlations Between Scales
It is useful to have some information regarding the degree of correlation between two scales and how such correlation behaves at all sequence lengths. Consider then two scales $\mathcal{S}_1$ and $\mathcal{S}_2$ of lengths $S_1$ and $S_2$. Without any loss of generality assume that $S_1 \leq S_2$. For sequences $s$ of length $N$, we are interested in measuring the correlation between the random variables $\mathcal{S}_1(s) = Y_1 + \ldots + Y_{N-S_1+1}$, with $Y_i = \mathcal{S}_1(X_i \ldots X_{i+S_1-1})$, and $\mathcal{S}_2(s) = Z_1 + \ldots + Z_{N-S_2+1}$, with $Z_i = \mathcal{S}_2(X_i \ldots X_{i+S_2-1})$. We have:

$$\mathrm{Cor}(\mathcal{S}_1(s), \mathcal{S}_2(s)) = \frac{\mathrm{Cov}(\mathcal{S}_1(s), \mathcal{S}_2(s))}{\sqrt{\mathrm{Var}(\mathcal{S}_1(s))}\,\sqrt{\mathrm{Var}(\mathcal{S}_2(s))}} \tag{31}$$

Again only terms of the form $\mathrm{Cov}(Y_i, Z_j)$, where the distance between $i$ and $j$ is small, are non-zero. More precisely, non-zero terms can arise only if $0 \leq j-i \leq S_1-1$ or $0 \leq i-j \leq S_2-1$. It is sufficient to tabulate the finite set of $S_1+S_2-1$ covariances $\mathrm{Cov}(Y_i, Z_{i+k})$

$$C_k = C_k(\mathcal{S}_1, \mathcal{S}_2) = E[(Y_i - E(Y_i))(Z_{i+k} - E(Z_{i+k}))] \tag{32}$$

with $S_1 \leq S_2$ and $-S_2+1 \leq k \leq S_1-1$. These covariances can be used to compute correlations at all lengths by writing

$$\mathrm{Cov}(\mathcal{S}_1(s), \mathcal{S}_2(s)) = (N-S_2+1)\,\mathrm{Cov}(Y_i, Z_i) + \sum_{i \neq j} \mathrm{Cov}(Y_i, Z_j) \tag{33}$$

For large $N$ it is clear that, except for small boundary effects, each type of covariance occurs approximately $N$ times in the formula above. Therefore for large $N$, $\mathrm{Cov}(\mathcal{S}_1(s), \mathcal{S}_2(s))$ behaves approximately as

$$N\Big[C_0 + \sum_{k=-S_2+1,\,k \neq 0}^{S_1-1} C_k\Big] = N\Big[\sum_{k=-S_2+1}^{S_1-1} C_k\Big] \tag{34}$$

We have seen in Equation 20 that the variance of each scale is also asymptotically linear in the length $N$. Thus, as $N$ increases, the correlation $\mathrm{Cor}(\mathcal{S}_1(s), \mathcal{S}_2(s))$ rapidly converges to a constant given by:

$$\frac{\sum_{k=-S_2+1}^{S_1-1} C_k(\mathcal{S}_1, \mathcal{S}_2)}{\Big[\Big(\sum_{k=-S_1+1}^{S_1-1} C_k(\mathcal{S}_1)\Big)\Big(\sum_{k=-S_2+1}^{S_2-1} C_k(\mathcal{S}_2)\Big)\Big]^{1/2}} \tag{35}$$

In checking calculations on DNA scales (or other alphabets) that are invariant under the reverse complement operation, it is worth noticing that with a uniform distribution on the alphabet ($p_A = p_C = p_G = p_T = 0.25$), the correlations are symmetric. That is, for any $0 < k < S_1$ we have $C_k(\mathcal{S}_1, \mathcal{S}_2) = C_{-k}(\mathcal{S}_1, \mathcal{S}_2)$. This results immediately from the fact that the sum of the terms $\mathcal{S}_2(X_1 \ldots X_{S_2})\,\mathcal{S}_1(X_1 \ldots X_{S_1})$ and $\mathcal{S}_2(X_{S_2} \ldots X_1)\,\mathcal{S}_1(X_{S_2} \ldots X_{S_2-S_1+1})$ is equal to the sum of the terms $\mathcal{S}_2(X_1 \ldots X_{S_2})\,\mathcal{S}_1(X_{S_2-S_1+1} \ldots X_{S_2})$ and $\mathcal{S}_2(X_{S_2} \ldots X_1)\,\mathcal{S}_1(X_{S_1} \ldots X_1)$, and similarly for other degrees of overlap. The terms in the sums can be identically paired using the fact that $\mathcal{S}_1$ and $\mathcal{S}_2$ are assumed to be reverse-complement invariant. The result is not true if the scales, or the distribution, are not reverse-complement invariant.

3 Results
3.1 DNA Repeat Equivalence Classes
We wrote a program that cycles through all possible DNA sequences of length P, counting and listing all the classes that are equivalent under circular permutation and reverse complement operations. Because of this equivalence, in the case of scales that are reverse-complement invariant, it is sufficient to study the repeats of one representative member of each class. We ran the program up to length P = 12. The results are in complete agreement with Equations 9-12.
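The original program is not reproduced in the text. The following Python sketch shows one way such an enumeration could be written, assuming a brute-force pass over all 4^P sequences; the function names are hypothetical. Each sequence is mapped to the alphabetically smallest string among its circular permutations and those of its reverse complement, and a class is counted as new (proper) when its representative is not a repetition of a shorter pattern.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(seq):
    """Smallest string among all rotations of seq and of its reverse complement."""
    n = len(seq)
    rc = reverse_complement(seq)
    return min(s[i:] + s[:i] for s in (seq, rc) for i in range(n))


def equivalence_classes(P):
    """Map each class representative to the set of its members (brute force)."""
    classes = {}
    for letters in product("ACGT", repeat=P):
        seq = "".join(letters)
        classes.setdefault(canonical(seq), set()).add(seq)
    return classes


def is_proper(seq):
    """True if seq is not a repetition of a strictly shorter pattern."""
    n = len(seq)
    return not any(n % d == 0 and seq == seq[:d] * (n // d) for d in range(1, n))


def count_classes(P):
    """Return (total number of classes, number of new or proper classes)."""
    classes = equivalence_classes(P)
    return len(classes), sum(1 for rep in classes if is_proper(rep))


if __name__ == "__main__":
    for P in range(1, 7):
        print(P, count_classes(P))
```

The brute-force pass is practical only for short repeat units, since the number of sequences grows as 4^P. For prime P > 2 the totals can also be compared with the closed form (4^P - 4)/(2P) + 2 given in the text.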
Table 1: Number of repeat unit equivalence classes. New or proper classes are classes that do not contain a shorter periodic pattern.

Sequence length   Classes (total)   Classes (new)
 1                      2                 2
 2                      6                 4
 3                     12                10
 4                     39                33
 5                    104               102
 6                    366               350
 7                  1,172             1,170
 8                  4,179             4,140
 9                 14,572            14,560
10                 52,740            52,632
11                190,652           190,650
12                700,274           699,875

Figure 1: Dinucleotide prefix automaton for the propeller twist angle scale. The CAG repeat, for instance, is associated with the cycle C → A → G → C in the graph and has a total propeller twist value of -9.45 - 14.00 - 11.08 = -34.53. The corresponding reverse complement cycle is given by C → T → G → C. The triplet repeat class with the largest propeller twist value is CCC followed by CCG.
In Tables 2, 3, and 4 we list alphabetically all the members of each equivalence class for sequences of length 2, 3, and 4. When P = 4, for instance, one finds 39 classes: 27 classes with 8 elements, 8 classes with 4 elements, and 4 classes with 2 elements (27 x 8 + 8 x 4 + 4 x 2 = 256 sequences). Only 33 classes are new, in the sense that 6 classes are derived from patterns already encountered at P = 1 and P = 2. Likewise, when P > 2 is a prime number, the total number of classes is given by:

$$\frac{4^P - 4}{2P} + 2 \tag{36}$$

with two classes of size 2 associated with poly-A and poly-C, while all the remaining classes are new and contain 2P members.
In the Appendix, we provide tables in alphabetical order that allow one to invert Tables 3 and 4, i.e. to find the class associated with any given P-tuple (P = 3, 4).

3.2 Analysis of DNA Repeats by Dinucleotide Scales
In the case of dinucleotide scales, the prefix automaton contains 4 nodes (Figure 1). Each DNA sequence is associated with a path through the corresponding graph, and exact repeats are associated with cycles. All paths, including cycles, of length greater than 4 are composite in the sense that they contain a cycle of length 4 or less.
In Table 5, we list the dinucleotide scale values $S(X_1X_2) + S(X_2X_1)$ for the 6 equivalence classes associated with all 16 possible dinucleotide repeats of the form $(X_1X_2)^R$. For each scale, we list classes (represented by their first alphabetical member) and the corresponding scale value, in decreasing value order. The highest level of base stacking energy is achieved by the AT repeat class (-10.39) and the lowest by the CG repeat class (-24.28). The rankings of all possible dinucleotide repeats induced by the propeller twist and the protein deformability scales are identical with the exception of an inversion between the CC (-16.22 and 12.2) and CG (-21.11 and 16.1) classes at the high (flexible) end of the spectrum. At the opposite (stiff) end, we find the single letter repeat class AA (-37.32 and 5.8) followed by the
Table 2: Dinucleotide classes equivalent under circular permutation and reverse complement operations. Classes are numbered in alphabetical order vertically. Class members are listed in alphabetical order horizontally. Classes 1 and 5 are not proper dinucleotide classes.

Class Number   List of Members (Alphabetical Order)
1              AA TT
2              AC CA GT TG
3              AG CT GA TC
4              AT TA
5              CC GG
6              CG GC
Table 3: Trinucleotide classes equivalent under circular permutation and reverse complement operations. Classes are numbered in alphabetical order vertically. Class members are listed in alphabetical order horizontally.

Class Number   List of Members (Alphabetical Order)
 1             AAA TTT
 2             AAC ACA CAA GTT TGT TTG
 3             AAG AGA CTT GAA TCT TTC
 4             AAT ATA ATT TAA TAT TTA
 5             ACC CAC CCA GGT GTG TGG
 6             ACG CGA CGT GAC GTC TCG
 7             ACT AGT CTA GTA TAC TAG
 8             AGC CAG CTG GCA GCT TGC
 9             AGG CCT CTC GAG GGA TCC
10             ATC ATG CAT GAT TCA TGA
11             CCC GGG
12             CCG CGC CGG GCC GCG GGC
Table 4: Tetranucleotide classes equivalent under circular permutation and reverse complement operations. Classes are numbered in alphabetical order vertically. Class members are listed in alphabetical order horizontally.

Class Number   List of Members (Alphabetical Order)
 1             AAAA TTTT
 2             AAAC AACA ACAA CAAA GTTT TGTT TTGT TTTG
 3             AAAG AAGA AGAA CTTT GAAA TCTT TTCT TTTC
 4             AAAT AATA ATAA ATTT TAAA TATT TTAT TTTA
 5             AACC ACCA CAAC CCAA GGTT GTTG TGGT TTGG
 6             AACG ACGA CGAA CGTT GAAC GTTC TCGT TTCG
 7             AACT ACTA AGTT CTAA GTTA TAAC TAGT TTAG
 8             AAGC AGCA CAAG CTTG GCAA GCTT TGCT TTGC
 9             AAGG AGGA CCTT CTTC GAAG GGAA TCCT TTCC
10             AAGT ACTT AGTA CTTA GTAA TAAG TACT TTAC
11             AATC ATCA ATTG CAAT GATT TCAA TGAT TTGA
12             AATG ATGA ATTC CATT GAAT TCAT TGAA TTCA
13             AATT ATTA TAAT TTAA
14             ACAC CACA GTGT TGTG
15             ACAG AGAC CAGA CTGT GACA GTCT TCTG TGTC
16             ACAT ATAC ATGT CATA GTAT TACA TATG TGTA
17             ACCC CACC CCAC CCCA GGGT GGTG GTGG TGGG
18             ACCG CCGA CGAC CGGT GACC GGTC GTCG TCGG
19             ACCT AGGT CCTA CTAC GGTA GTAG TACC TAGG
20             ACGC CACG CGCA CGTG GCAC GCGT GTGC TGCG
21             ACGG CCGT CGGA CGTC GACG GGAC GTCC TCCG
22             ACGT CGTA GTAC TACG
23             ACTC AGTG CACT CTCA GAGT GTGA TCAC TGAG
24             ACTG AGTC CAGT CTGA GACT GTCA TCAG TGAC
25             AGAG CTCT GAGA TCTC
26             AGAT ATAG ATCT CTAT GATA TAGA TATC TCTA
27             AGCC CAGC CCAG CTGG GCCA GCTG GGCT TGGC
28             AGCG CGAG CGCT CTCG GAGC GCGA GCTC TCGC
29             AGCT CTAG GCTA TAGC
30             AGGC CAGG CCTG CTGC GCAG GCCT GGCA TGCC
31             AGGG CCCT CCTC CTCC GAGG GGAG GGGA TCCC
32             ATAT TATA
33             ATCC ATGG CATC CCAT GATG GGAT TCCA TGGA
34             ATCG CGAT GATC TCGA
35             ATGC CATG GCAT TGCA
36             CCCC GGGG
37             CCCG CCGC CGCC CGGG GCCC GCGG GGCG GGGC
38             CCGG CGGC GCCG GGCC
39             CGCG GCGC
proper dinucleotide repeat class AG (-27.48 and 6.6).
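The repeat unit values discussed in this section are sums of the scale over the P overlapping windows of the unit read along a circle. A minimal Python sketch of that computation follows, assuming the scale is stored as a dictionary keyed by k-mer; `repeat_unit_score` is a hypothetical helper, and the only scale entries included are the three propeller twist values quoted in the caption of Figure 1.

```python
# Propeller twist values for the three dinucleotide steps quoted in the
# Figure 1 caption; the full 16-entry scale is not reproduced here.
PROPELLER_TWIST_SUBSET = {"CA": -9.45, "AG": -14.00, "GC": -11.08}


def repeat_unit_score(unit, scale, k=2):
    """Sum the scale over the P circular windows of length k of a repeat unit.

    For k = 2 and unit X1 X2 X3 this is S(X1X2) + S(X2X3) + S(X3X1).
    """
    P = len(unit)
    circular = unit * k  # long enough for every window starting in the first copy
    return sum(scale[circular[i:i + k]] for i in range(P))


if __name__ == "__main__":
    # CAG repeat: windows CA, AG, GC.
    print(round(repeat_unit_score("CAG", PROPELLER_TWIST_SUBSET), 2))
```

Because the windows are read along a circle, every circular permutation of the unit gives the same value, which is why one representative per equivalence class suffices. A lookup that fails with a `KeyError` simply signals a step that is missing from the partial dictionary above.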
Table 5: Dinucleotide structural scale values for repeat unit p = X1X2 with P = 2. S(p) = S(X1X2) + S(X2X1).

Base Stacking    Propeller Twist    Protein Deformability
AT (-10.39)      CC (-16.22)        CG (16.1)
AA (-10.74)      CG (-21.11)        CC (12.2)
CC (-16.52)      AC (-22.55)        AC (12.1)
AG (-16.59)      AT (-26.86)        AT (7.9)
AC (-17.08)      AG (-27.48)        AG (6.6)
CG (-24.28)      AA (-37.32)        AA (5.8)

Table 6: Dinucleotide structural scale values for repeat unit p = X1X2X3 with P = 3. S(p) = S(X1X2) + S(X2X3) + S(X3X1). Repeat classes associated with triplet repeat expansion diseases (AAG, AGC, CCG; bold in the original) are marked with an asterisk.

Base Stacking     Propeller Twist    Protein Deformability
AAT  (-15.76)     CCC  (-24.33)      CCG* (22.2)
AAA  (-16.11)     CCG* (-29.22)      ACG  (18.9)
ACT  (-21.11)     ACC  (-30.66)      CCC  (18.3)
AAG* (-21.96)     AGC* (-34.53)      ACC  (18.2)
AAC  (-22.45)     AGG  (-35.59)      AGC* (15.9)
ATC  (-22.95)     ACG  (-36.61)      ATC  (15.9)
CCC  (-24.78)     ATC  (-37.94)      AAC  (15.0)
AGG  (-24.85)     ACT  (-38.95)      AGG  (12.7)
ACC  (-25.34)     AAC  (-41.21)      AAT  (10.8)
AGC* (-27.94)     AAT  (-45.52)      ACT  (10.7)
ACG  (-30.01)     AAG* (-46.14)      AAG* (9.5)
CCG* (-32.54)     AAA  (-55.98)      AAA  (8.7)

In Table 6, we list the dinucleotide scale values for the 12 equivalence classes associated with all possible triplet repeats of the form $(XYZ)^R$. In this special case, we recover the results of (Baldi et al., 1999). The high and low ends of the base stacking energy scale are occupied by the triplet classes AAT (-15.76) and CCG (-32.54) respectively. We find again a high degree of correlation between the propeller twist and protein deformability scales. If we exclude the classes AAA/TTT (-55.98) and CCC/GGG (-24.33), which are not proper triplet repeat classes, then the maximum and the minimum of the propeller twist spectrum are respectively occupied by the classes CCG (-29.22) and AAG (-46.14). A similar ranking with the same extremal triplets is observed with the protein deformability scale: CCG (22.2) occupies the high end, whereas AAA (8.7) and AAG (9.5) occupy the low end of the spectrum.
When considering all three dinucleotide scales, three minima and two maxima are occupied by two of the three repeat classes known to be involved in triplet repeat expansion diseases, namely AAG and CCG. GAA triplet (in the AAG class) expansion is associated with Friedreich's ataxia (Orr et al., 1993; Campuzano et al., 1996; Junck & Fink, 1996; Paulson et al., 1997; Koenig, 1998; Lee, 1998; Orr & Zoghbi, 1998; Paulson, 1998; Pulst, 1998; Stevanin et al., 1998). Abnormal GCC triplet (in the CCG class) expansion is associated with FRAXE mental retardation and abnormal expansion of the CGG triplet with Fragile X syndrome (FRAXA) (Nelson, 1995; Gusella & MacDonald, 1996; Eichler & Nelson, 1998; Skinner et al., 1998; Gecz & Mulley, 1999). The third triplet expansion disease related class, AGC, has average rank in all dinucleotide scales.
In Table 7, we list the scale values for the 39 equivalence classes associated with all possible tetranucleotide repeats of the form $(X_1X_2X_3X_4)^R$. The maximum of the base stacking scale is occupied by the dinucleotide repeat ATAT (-20.78) and the proper tetranucleotide repeat AAAT (-21.13). The minimum corresponds to CGCG (-48.56) followed by ACGC (-41.36). We again observe a substantial positive correlation between the values produced by the propeller twist and protein deformability scales together with a weaker negative correlation with respect to the base stacking energy scale. The high end of the propeller twist scale is occupied by CCCC (-32.44) and CCCG (-37.33) while that of the protein deformability scale is occupied by CGCG (32.2) and CCCG (28.3). The lowest values correspond for both scales to AAAA (-74.64 and 11.6) and AAAG (-64.80 and 12.4).
All repeat units of length greater than 4 are made up of shorter cyclic paths in the prefix automaton and therefore their properties can essentially be predicted from the previous three tables. For all lengths, for instance, the highest level of base stacking energy is achieved by the class ATATATAT... when P is even, and by the class AATATATAT... when P is odd. The lowest level is achieved by the class CGCGCG... when P is even, and CCGCGCG... when P is odd. For protein deformability, the maximal level is achieved by the class CGCGCG... when P is even, and by CCGCGCG... when P is odd. The lowest level is associated with poly-A (i.e. $(A)^P$). Poly-C and poly-A also give the absolute highest and lowest propeller twist angles at all lengths.

3.3 Analysis of DNA Repeats by Trinucleotide Scales
In the case of trinucleotide scales, the prefix automaton contains 16 nodes (Figure 2), each one labeled with a different dinucleotide. All paths, including cycles, of length greater than 16 are composite, i.e. contain at least one cycle of length 16 or less.
Table 7: Dinucleotide structural scale values for repeat unit p = X1X2X3X4 with P = 4. S(p) = S(X1X2) + S(X2X3) + S(X3X4) + S(X4X1).

Rank   Base Stacking     Propeller Twist    Protein Deformability
 1     ATAT (-20.78)     CCCC (-32.44)      CGCG (32.2)
 2     AAAT (-21.13)     CCCG (-37.33)      CCCG (28.3)
 3     AATT (-21.13)     CCGG (-37.33)      CCGG (28.3)
 4     AAAA (-21.48)     ACCC (-38.77)      ACGC (28.2)
 5     AACT (-26.48)     CGCG (-42.22)      ATGC (25.2)
 6     AAGT (-26.48)     AGCC (-42.64)      ACCG (25.0)
 7     AGAT (-26.98)     AGGC (-42.64)      ACGG (25.0)
 8     AAAG (-27.33)     ACGC (-43.66)      CCCC (24.4)
 9     ACAT (-27.47)     AGGG (-43.70)      ACCC (24.3)
10     AAAC (-27.82)     ACCG (-44.72)      ACAC (24.2)
11     AATC (-28.32)     ACGG (-44.72)      ACGT (23.0)
12     AATG (-28.32)     ATGC (-44.99)      AGCG (22.7)
13     ACCT (-29.37)     ACAC (-45.10)      ATCG (22.7)
14     AAGG (-30.22)     ATCC (-46.05)      AGCC (22.0)
15     AACC (-30.71)     ACCT (-47.06)      AGGC (22.0)
16     ATCC (-31.21)     ACGT (-48.08)      ATCC (22.0)
17     AGCT (-31.97)     AGCG (-48.59)      AACG (21.8)
18     CCCC (-33.04)     AACC (-49.32)      AACC (21.1)
19     AGGG (-33.11)     ACAT (-49.41)      ACAT (20.0)
20     AGAG (-33.18)     ACAG (-50.03)      AAGC (18.8)
21     AAGC (-33.31)     ACTC (-50.03)      AATC (18.8)
22     ACCC (-33.60)     ACTG (-50.03)      AATG (18.8)
23     ACAG (-33.67)     AGCT (-50.93)      AGGG (18.8)
24     ACTC (-33.67)     ATCG (-52.00)      ACAG (18.7)
25     ACTG (-33.67)     AAGC (-53.19)      ACTC (18.7)
26     ACAC (-34.16)     ATAT (-53.72)      ACTG (18.7)
27     ATGC (-34.30)     AAGG (-54.25)      AAAC (17.9)
28     ACGT (-34.53)     AGAT (-54.34)      ACCT (16.8)
29     AACG (-35.38)     AGAG (-54.96)      ATAT (15.8)
30     ATCG (-35.88)     AACG (-55.27)      AAGG (15.6)
31     AGCC (-36.20)     AATC (-56.60)      AGAT (14.5)
32     AGGC (-36.20)     AATG (-56.60)      AGCT (14.5)
33     ACCG (-38.27)     AACT (-57.61)      AAAT (13.7)
34     ACGG (-38.27)     AAGT (-57.61)      AATT (13.7)
35     CCCG (-40.80)     AAAC (-59.87)      AACT (13.6)
36     CCGG (-40.80)     AAAT (-64.18)      AAGT (13.6)
37     AGCG (-40.87)     AATT (-64.18)      AGAG (13.2)
38     ACGC (-41.36)     AAAG (-64.80)      AAAG (12.4)
39     CGCG (-48.56)     AAAA (-74.64)      AAAA (11.6)

Figure 2: Trinucleotide prefix automaton for the bendability scale. The 16 dinucleotide nodes are arranged on a circle for ease of display; the circle does not represent actual connections. The CAG repeat, for instance, is associated with the cycle CA → AG → GC → CA in the graph and has a total bendability value of 0.175 + 0.017 + 0.076 = 0.268. It is the highest bendability value for any triplet repeat. Other edges are not shown.

The trinucleotide scale values for all repeats with periodic unit length P = 2 are given in Table 8. The highest level of bendability is achieved by AT (0.364) and the lowest by AA (-0.548) and CG (-0.154). The highest level of position preference is achieved by AA (72) and CG (50), and the lowest by AG (17).

Table 8: Trinucleotide structural scale values for repeat unit p = X1X2 with P = 2. S(p) = S(X1X2X1) + S(X2X1X2).

Bendability      Position Preference
AT (0.364)       AA (72)
AG (0.058)       CG (50)
AC (0.034)       AT (26)
CC (-0.024)      CC (26)
CG (-0.154)      AC (23)
AA (-0.548)      AG (17)

The trinucleotide scale values for all repeats with periodic unit length P = 3 are given in Table 9 (see also (Baldi et al., 1999)). The highest level of bendability is achieved by the class AGC (0.268) and the lowest by AAA (-0.822) and ACC (-0.238). In fact only two classes of repeats (AGC and ATC) have positive bendability and are well separated from the rest. The highest level of position preference is achieved by the class AAA (108) followed by CCG (72), and the lowest by AAC and AGG (21). The class AGC, which contains the CAG repeat responsible for the majority of the known triplet repeat expansion diseases, has the highest bendability. It is the only repeat class for which all three shifted triplets have a high individual bendability. Moreover, this class has a relatively low position preference value, another sign of flexibility. Therefore one can hypothesize that long CAG repeats correspond to stretches of DNA that are highly flexible in all positions. Consistent with their high flexibility, CAG/CTG repeats have been found to have the highest affinity for histones among all possible triplet repeats (Wang & Griffith, 1994; Wang & Griffith, 1995; Godde & Wolffe, 1996). Other DNA sequences can adopt long range curvature only if they contain highly flexible triplets in phase with the helical pitch (roughly every 10.5 bp). The flexibility of extended CAG repeats has been verified experimentally (Chastain & Sinden, 1998). The CCG class, which contains the disease-related triplets CGG and GCC, is found at the high
(rigid) end of the position preference scale (72), exceeded only by poly-A. This class is also stiff according to the bendability scale (-0.106). This is consistent with the fact that CGG/CCG repeats seem completely unable to form nucleosomes (Wang et al., 1996; Godde et al., 1996). The AAG class, which contains the disease related triplet GAA, occupies the lower (flexible) end of the position preference scale (27). It is the second lowest considering that the last two classes have the same value (21). We also note that AAA/TTT is by far the stiffest of all possible repeats according to both scales. Such homopolymeric tracts are known from X-ray crystallography to be rigid and straight (Nelson et al., 1987) and they are bad candidates for nucleosome positioning. In fact, a number of promoters in yeast contain homopolymeric dA:dT elements. Studies in two different yeast species have shown that the homopolymeric elements destabilize nucleosomes and thereby facilitate the access of transcription factors bound nearby (Iyer & Struhl, 1995; Zhu & Thiele, 1996). Interestingly, the sequence of the IT15 gene involved in Huntington Disease has a repeat containing 18 adenine nucleotides at its 3' end.
Whereas the class CCG is extremely rigid according to the trinucleotide scales, it is extremely flexible according to the dinucleotide scales. Similarly, the predicted flexibility of the AAG class according to the position preference scale is in contradiction with the results obtained using all other di- or trinucleotide scales. Such discrepancies can result from imperfections of the scales, or from the fact that each scale captures a different facet of DNA structure. Dinucleotide and trinucleotide scales are in good agreement for CAG repeats and homopolymeric poly-A tracts.

Table 9: Trinucleotide structural scale values for repeat unit p = X1X2X3 with P = 3. S(p) = S(X1X2X3) + S(X2X3X1) + S(X3X1X2). Repeat classes associated with triplet repeat expansion diseases (AAG, AGC, CCG; bold in the original) are marked with an asterisk.

Bendability       Position Preference
AGC* (0.268)      AAA  (108)
ATC  (0.218)      CCG* (72)
AGG  (-0.013)     AAT  (63)
AAT  (-0.030)     ACG  (47)
CCC  (-0.036)     AGC* (40)
ACG  (-0.049)     CCC  (39)
ACT  (-0.068)     ACT  (35)
AAG* (-0.091)     ACC  (33)
CCG* (-0.106)     ATC  (33)
AAC  (-0.196)     AAG* (27)
ACC  (-0.238)     AAC  (21)
AAA  (-0.822)     AGG  (21)

The trinucleotide scale values for all repeats with periodic unit length P = 4 are given in Table 10. The highest level of bendability is achieved by the class ATAT (0.728), which is rather a dinucleotide repeat, followed by the proper tetranucleotide repeat ATGC (0.420). On the opposite end of the scale, we find AAAA (-1.096) and AAAC (-0.470). For the position preference scale, AAAA (144), AATT (100), CGCG (100), and AAAT (99) are on the higher end, AAAT being the first proper tetranucleotide repeat, while ACGG (23) occupies the lower end of the spectrum.

Table 10: Trinucleotide structural scale values for repeat unit p = X1X2X3X4 with P = 4. S(p) = S(X1X2X3) + S(X2X3X4) + S(X3X4X1) + S(X4X1X2).

Rank   Bendability       Position Preference
 1     ATAT (0.728)      AAAA (144)
 2     ATGC (0.420)      AATT (100)
 3     ACAT (0.335)      CGCG (100)
 4     AGGC (0.301)      AAAT (99)
 5     AGCT (0.214)      CCGG (94)
 6     AGAT (0.189)      AGCG (89)
 7     ACAG (0.183)      AGCT (86)
 8     ACTG (0.173)      CCCG (85)
 9     AGAG (0.116)      AGCC (80)
10     ACTC (0.082)      ATCG (76)
11     ACAC (0.068)      AATG (68)
12     AGCC (0.053)      AGGC (68)
13     AAGC (0.027)      AAAG (63)
14     ACCT (0.026)      ACGC (63)
15     AATG (0.011)      ATGC (62)
16     ACGC (0.006)      AAAC (57)
17     ACGT (-0.016)     AACG (57)
18     AGGG (-0.025)     AACT (55)
19     AGCG (-0.032)     AATC (54)
20     CCCC (-0.048)     AAGC (53)
21     CCGG (-0.058)     ATAT (52)
22     CCCG (-0.118)     CCCC (52)
23     AAGG (-0.162)     ACCG (49)
24     ACGG (-0.169)     AGAT (47)
25     AAGT (-0.171)     ACAC (46)
26     AATC (-0.181)     ACCC (46)
27     ACCG (-0.184)     ACTC (44)
28     ATCC (-0.209)     AAGT (43)
29     ATCG (-0.226)     ACAT (43)
30     AACT (-0.230)     ACCT (40)
31     ACCC (-0.250)     ATCC (38)
32     AACG (-0.278)     AGAG (34)
33     AAAT (-0.304)     AGGG (34)
34     CGCG (-0.308)     AACC (31)
35     AAAG (-0.365)     AAGG (31)
36     AATT (-0.424)     ACTG (29)
37     AACC (-0.468)     ACGT (28)
38     AAAC (-0.470)     ACAG (25)
39     AAAA (-1.096)     ACGG (23)

In order to find the most extreme repeats for a given scale at a given repeat unit length, one would have to explore scale values for repeat units up to length P = 16 (see Section 2.3). Because of particular values of the scale, in some cases the results tabulated above for values of P up to 4 only are sufficient. For instance, the most bendable repeat with P = 2n is always ATATAT..., while the least bendable is poly-A. Similarly, the highest value of the position preference scale is always occupied by poly-A. Extremal results can also be derived by dynamic programming. In many cases, however, a sequence of interest may have a very high or low score according to a given scale, without being the most extreme. The probabilistic theory provides the means to quantify directly how extreme any given sequence is with respect to a given family or background.

3.4 Probabilistic Analysis of DNA Scales
For simplicity, we first assume a uniform distribution $p_A = p_C = p_G = p_T$. In specific applications, other distributions can be used, such as the background distribution of a given genome or a given class of DNA sequences. We can then use Equations 21-24 to calculate the expectation and variance of $S(p)$ across all possible repeat unit patterns $p$ and all scales $S$. In particular, $E(S(p)) = \mu_S P$ and $\mathrm{Var}(S(p)) = \sigma_S P$.
In Table 11 we list the relevant coefficients for the dinucleotide scales.

Table 11: Basic intra-scale coefficients for dinucleotide scales with repeat unit length P ≥ (2S - 1) = 3.

                              BS       PT       PD
mu_S = E(Y_i)                -8.08   -12.59    4.96
C_0 = Var(Y_i)                6.62     9.68    9.62
C_1 = Cov(Y_i, Y_{i+1})       0.31     2.26   -2.19
sigma_S = Var(S(p))/P         7.23    14.20    5.23

In Table 12 we list the relevant coefficients for the trinucleotide scales.

Table 12: Basic intra-scale coefficients for trinucleotide scales with repeat unit length P ≥ (2S - 1) = 5.

                              B         PP
mu_S = E(Y_i)                -0.018     13.78
C_0 = Var(Y_i)                0.015    103.108
C_1 = Cov(Y_i, Y_{i+1})       0.0015    18.214
C_2 = Cov(Y_i, Y_{i+2})      -0.001     -2.558
sigma_S = Var(S(p))/P         0.015    134.42

To double-check the mathematical formulas, all the constants above were also obtained independently by exhaustive sampling.

3.4.1 Convergence to Normality
Using sampling methods we also studied the convergence of $S(s)$ and $S(p)$ to a normal distribution as the length N or P of the sequences is increased, as predicted by our central limit theorem. In practice, the convergence rate is very fast. As an example, a histogram of bendability values for repeat units of length 5, 10, and 15 is given in Figure 3. Similar results are observed with plain sequences.

Figure 3: Histogram of bendability values S(p) for all possible repeat units of length P = 5, 10, 15. Vertical dashed lines represent standard deviation units. Panel annotations: P=5 (µ=−0.0923, σ=0.0771); P=10 (µ=−0.185, σ=0.154); P=15 (µ=−0.277, σ=0.231).

3.4.2 Examples of Z-Scores for Disease Triplets
We have seen that the triplets involved in expansion diseases often tend to have extremal structural properties. This was assessed by computing the scores $S(p)$. We can now also compute Z-scores using Equations 29 and 30, as in Table 13.

Table 13: Z-scores for the repeats involved in HD and FRAXA for the bendability (B) and position preference (PP) scales. Z(p) is the Z-score of the repeat unit against the background of all possible repeat units of same length P = 3. Z(r) is the Z-score of a long repeat containing R repeat units against the background of all possible sequences of same length N = RP, including non repetitive sequences. Values of R are chosen at the characteristic low end and high end of each disease.

Disease          HD       FRAXA
Triplet          CAG      CGG
Scale            B        PP
Z(p)             1.60     1.56
R (low end)      36       200
Z(r)             9.00     21.48
R (high end)     121      1000
Z(r)             16.52    48.23
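The Z(r) entries of Table 13 can be checked approximately from Equations 29 and 30 and the tabulated constants. The sketch below is an illustration only, with hypothetical function names; it assumes the position preference coefficients of Table 12 (mu_S = 13.78, sigma_S = 134.42) together with the CCG class value 72 from Table 9, and the bendability coefficients (mu_S = -0.018, sigma_S = 0.015) together with the AGC class value 0.268. Because these inputs are rounded, agreement with the table is only expected to within a few tenths.

```python
from math import sqrt


def z_score_sequence(score, N, mu, sigma):
    """Equation 29: Z-score of a sequence score against sequences of length N."""
    return (score - N * mu) / sqrt(N * sigma)


def z_score_repeat(unit_score, P, R, mu, sigma):
    """Equation 30: Z-score of R copies of a repeat unit of length P."""
    N = R * P
    mu_prime = unit_score / P  # average contribution per position of the repeat
    return (mu_prime - mu) / sqrt(sigma) * sqrt(N)


if __name__ == "__main__":
    # CGG repeat (CCG class), position preference scale, R = 200 and R = 1000.
    for R in (200, 1000):
        print(R, round(z_score_repeat(72, 3, R, 13.78, 134.42), 2))
    # CAG repeat (AGC class), bendability scale, R = 36 and R = 121.
    for R in (36, 121):
        print(R, round(z_score_repeat(0.268, 3, R, -0.018, 0.015), 2))
```

Equation 30 is Equation 29 applied to the approximation S(r) = R S(p), so `z_score_sequence(R * unit_score, R * P, mu, sigma)` returns the same value as `z_score_repeat`.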
When repeat length is taken into consideration, disease causing repeats appear to be even more extreme because of the $\sqrt{N}$ factor in Equation 30. For example, a CGG repeat of length N = 3,000 (R = 1,000) observed in FRAXA patients is more than 48 standard deviations away from the mean position preference value of random uniform sequences of the same length.

3.4.3 Correlations Between Scales at Short Lengths
We can use Equations 31-35 to study the correlations between the scales at short lengths and asymptotically. Clearly we can also consider AT-content as a scale. It can be viewed, for instance, as a mononucleotide scale with value 1 for A and T, and 0 for C and G. This scale is trivially reverse-complement invariant and perfectly additive. We include it to see whether it correlates strongly with any of the structural scales, especially asymptotically.

Table 14: Correlations between the scales at a given position ($C_0(\mathcal{S}_1, \mathcal{S}_2)$). AT-content, Base Stacking, Propeller Twist, Protein Deformability, Bendability, Position Preference.

       AT      BS       PT       PD       B         PP
AT     1       0.478   -0.539   -0.294   -0.0098    0.040
BS             1       -0.294    0.043   -0.018    -0.089
PT                      1        0.668    0.249    -0.123
PD                               1        0.141    -0.036
B                                         1        -0.080
PP                                                  1

Correlations at a given position are given in Table 14. Consistent with (Baldi et al., 1998) and the results above, the correlations between the structural scales are very low with the exception of PT and PD (0.668). BS and PT also have non-trivial opposite correlations with respect to AT-content (0.48 and -0.54). Here correlations between a dinucleotide scale $\mathcal{S}_1$ and a trinucleotide scale $\mathcal{S}_2$ are computed using sums of the form $\sum_{X_1X_2X_3} \mathcal{S}_1(X_1X_2)\,\mathcal{S}_2(X_1X_2X_3)$. Because the third nucleotide does not appear in the dinucleotide scale, in (Baldi et al., 1998) correlations between dinucleotide and trinucleotide scales were also computed using, for the dinucleotide scale, the sum $\mathcal{S}_1(X_1X_2) + \mathcal{S}_1(X_2X_3)$. When considering neighboring dinucleotides, the correlation between BS and PT, for instance, increases its magnitude from -0.294 to -0.550. This effect must be caused by correlations that are present in runs of overlapping dinucleotides, but not in the single dinucleotides. Such a phenomenon may arise if the physical reality behind both scales is that the structure actually depends on more than a dinucleotide step, and this is very likely to be the case.

3.4.4 Correlations Between Scales: Asymptotic Values
If the same correlations are computed by shuffling the 4.6 Mbp of the Escherichia coli genome randomly over a length of 31 bp, one obtains the numbers given in Table 15 (Pedersen et al., 2000). The correlation between BS and PT is even stronger (-0.744) and so is the correlation of AT-content with BS and PT (0.899 and -0.882). Incidentally, when measured on the actual Escherichia coli genome the correlations are even stronger. For instance, the correlation between BS and PT becomes -0.825.

Table 15: Correlations between the scales measured over 31 bp random segments from Escherichia coli.

       AT      BS       PT       PD       B        PP
AT     1       0.899   -0.882   -0.777   -0.153    0.023
BS             1       -0.744   -0.805   -0.181   -0.029
PT                      1        0.801    0.370   -0.154
PD                               1        0.108    0.062
B                                         1       -0.206
PP                                                 1

These results are easily explained by the theory developed here. The asymptotic correlations between the scales computed using Equation 35 are displayed in Table 16. Because the Escherichia coli genome has a nucleotide distribution close to uniform, the results are indeed remarkably similar to Table 15, and would be identical up to sampling fluctuations if in Table 16 we had used the precise distribution for E. coli (A=0.2462, C=0.2542, G=0.2537, T=0.2459), instead of a uniform distribution.

Table 16: Asymptotic correlations between the scales using a uniform distribution.

       AT      BS       PT       PD       B        PP
AT     1       0.914   -0.891   -0.798   -0.167    0.046
BS             1       -0.757   -0.843   -0.193   -0.003
PT                      1        0.810    0.387   -0.175
PD                               1        0.113    0.051
B                                         1       -0.225
PP                                                 1

It is essential to notice that the asymptotic values do not require very long sequences but are approximately correct already at a length scale of 15-20 base pairs or so (Figure 4). Asymptotically, and with a uniform distribution, all the dinucleotide scales have strong positive or negative correlations with each other and with AT-content. Notice that this is not the case for the trinucleotide scales. It is also not necessarily the case if the correlations are measured with respect to other nucleotide distributions.
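The asymptotic correlation of Equation 35 only requires the finite set of lagged covariances $C_k$, which can be obtained by exhaustive enumeration of short windows. The following Python sketch illustrates this under an independent (here uniform) letter distribution. All function names are hypothetical. Since the numerical values of the structural scales are not listed in this section, the example uses only the AT-content scale defined in the text and two toy scales derived from it, for which the asymptotic correlations are known to be exactly -1 and 1.

```python
from itertools import product
from math import prod, sqrt

UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def cross_cov(f, kf, g, kg, k, probs):
    """Cov(Y_i, Z_{i+k}) for Y_i = f(X_i..X_{i+kf-1}), Z_j = g(X_j..X_{j+kg-1})."""
    a, b = max(0, -k), max(0, k)        # window offsets of Y and Z
    length = max(a + kf, b + kg)        # letters needed to cover both windows
    ey = ez = eyz = 0.0
    for letters in product("ACGT", repeat=length):
        w = "".join(letters)
        p = prod(probs[x] for x in w)   # independent letters
        y, z = f(w[a:a + kf]), g(w[b:b + kg])
        ey += p * y
        ez += p * z
        eyz += p * y * z
    return eyz - ey * ez


def lag_sum(f, kf, g, kg, probs):
    """Sum of C_k over all lags k = -kg+1 .. kf-1 at which the windows overlap."""
    return sum(cross_cov(f, kf, g, kg, k, probs) for k in range(-kg + 1, kf))


def asymptotic_correlation(f, kf, g, kg, probs=UNIFORM):
    """Equation 35: limit of Cor(S1(s), S2(s)) as the sequence length grows."""
    numerator = lag_sum(f, kf, g, kg, probs)
    return numerator / sqrt(lag_sum(f, kf, f, kf, probs) * lag_sum(g, kg, g, kg, probs))


def at_content(w):
    """Mononucleotide AT-content scale: 1 for A and T, 0 for C and G."""
    return float(sum(c in "AT" for c in w))


def gc_content(w):
    """Toy scale: 1 for C and G, 0 for A and T."""
    return float(sum(c in "CG" for c in w))


if __name__ == "__main__":
    print(asymptotic_correlation(at_content, 1, gc_content, 1))
    print(asymptotic_correlation(at_content, 1, at_content, 2))
```

The enumeration covers at most 4^(S1+S2-1) windows, so it stays small for di- and trinucleotide scales. Replacing `UNIFORM` by another letter distribution, such as the E. coli frequencies quoted above, changes only the `probs` argument.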
Figure 4: Rapid convergence of correlations between pairs of scales to the theoretically predicted asymptotic values as a function of string length N with uniform nucleotide distribution and free boundary conditions. Curves start at N = 2 for pairs of dinucleotide scales, and N = 3 for all other pairs. Correlations are calculated exactly up to N = 12, and using a random uniform sample of 70,000,000 points for N = 17, 22, 27, 32. (Curves shown: PT/PD, PT/B, PD/B, PD/PP, BS/PP, PT/PP, BS/B, B/PP, BS/PT, BS/PD; vertical axis: correlation; horizontal axis: N.)

Figure 5: Surface representing the asymptotic correlation between the bendability scale and the AT content scale for different distribution values of AT-content and A/T proportions. (Axes: AT content, A/(A+T) ratio, correlation AT/B.)
Figure 5 shows how the asymptotic correlation of
bendability with AT content varies with the underlying distribution. Similar surfaces for all other scales are
given in Figure 8 in the Appendix. Notice that in general
A/(A + T) ≈ 0.5 in eukaryotic genomic DNA as soon as
sufficiently long stretches of DNA are taken into consideration (Chargaff's second parity rule) (Prabhu, 1993;
Bell & Forsdyke, 1999). This is not necessarily the case
with, for instance, relatively short stretches of DNA,
synthetic DNA, or bacterial DNA that contains a
strong composition skew associated with independently
replicated regions.

3.5 Analysis of a Set of Expandable Repeats in Primate Genomes
Because triplets involved in disease expansions seem to
have extremal properties which may be related to the
expansion mechanism, it is worth testing whether this
is a fairly general feature of units associated with tandem repeats. Here we consider the large set of repeat
unit classes derived in (Jurka & Pethiyagoda, 1995) corresponding to frequently encountered tandem repeats of
multiple lengths (i.e. that are polymorphic) in primate
genomes. The data contains 501 unique classes of repeat
units ranging in length from P = 1 to P = 6, classified
into three categories: expandable, weakly expandable,
and non-expandable. These categories were derived from
simple statistical criteria calculated over a subset of the
GenBank data base. The first category, in particular,
contains 67 classes of patterns that were found to occur
in tandem repeats of total length N ≥ 12 with R ≥ 3 in
at least two different length sizes. The second category
includes 71 pattern classes that are found to occur in tandem repeats over 12 nucleotides long in only one length
size. The last category contains 363 pattern classes that
were found not to expand beyond 12 nucleotides. For
each pattern class, simple indicators are provided, such
as average length of repeats, relative abundance of class,
and expandability.
In Figure 6 we display the Z-scores for all the
repeat units in the first category, as a function of repeat unit length (P), computed with respect to repeat
units of the same length using Equation 29. The distributions of repeat classes with respect to each structural
scale are approximately symmetric and normal at
all lengths, showing no clear-cut bias towards extremal
values of any scale. When taking into account the relative abundance of the classes, as quantified in (Jurka &
Pethiyagoda, 1995) by using the number of nucleotides
occurring in corresponding repetitive sequences, we nevertheless observe that the most abundant class (poly-A,
33%) corresponds to the stiffest repeat at all lengths, for
all scales except BS (Figure 7). It is reasonable to assume that the structural properties of poly-A are related
to its abundance. The next most frequent repeats, however (AC = 17.94%, AG = 5.38%, and AAAT = 3.39%), do
not show a clear pattern of extremal values in the five
scales considered. Likewise, we do not find any obvious correlation between scale values and the relative
abundance or expandability indicators provided in (Jurka &
Pethiyagoda, 1995).
[Figure 6 plot: one panel per scale (BS, PT, PD, B, PP); horizontal axis: repeat unit length P (1 to 6);
vertical axis: normalized scale value (-5 to 5).]

Figure 6: Distribution of expandable repeat classes over
scale values. Horizontal axis represents repeat unit length
(P). Vertical axis represents scale value distances from expectation for all possible repeat units, normalized by the
corresponding standard deviation (see Equation 29). Dots represent a set of 67 expandable repeat classes. The number
of classes at each length 1-6 is as follows: 2, 4, 9, 18, 18,
and 16. Circles represent the most extreme repeat classes,
when considering the full population of repeat patterns at a
given length. Note that at each length P, only proper P-tuple repeats are represented in the set found in (Jurka &
Pethiyagoda, 1995), excluding repeats with shorter periodicity. For instance poly-A is only represented once as a single
letter repeat. It is worth noticing that most extreme positions are actually occupied at almost all lengths by mono-,
di-, or tri-nucleotide repeats. The set of expandable repeats
therefore actually covers the full range of each scale when taking into account patterns made up of shorter repeated units.
Only 5 circles out of 64 remain unmatched by an expandable
repeat class.

[Figure 7 plot: relative abundance (%) versus repeat unit length P (1 to 6), with labeled contributions from the
classes A, AC, AG, AT, AAAC, AAAG, AAAT, AAGG, AGAT, AAAAC, AAAAG, AAAAT, AGCCC, and AACCCT.]

Figure 7: Relative abundance of expandable repeat unit
classes in (Jurka & Pethiyagoda, 1995) as a function of repeat
unit length. The individual contributions of repeat classes totaling more than 1% relative abundance are shown.
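The normalization described in the caption of Figure 6 can be sketched as follows. Equation 29 is not reproduced in this part of the text, so the sketch rests on two loudly stated assumptions: the score of a repeat unit is the sum of a dinucleotide scale over the P steps of the tandemly repeated (circular) unit, and the Z-score is the distance from the mean over all 4^P units of the same length, divided by the population standard deviation. The names `unit_score` and `z_scores` are hypothetical; PT values are transcribed from Table 17.

```python
from itertools import product
from math import sqrt

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def expand(table):
    """Add the reverse-complement partner of every step, giving all 16 dinucleotides."""
    full = {}
    for step, value in table.items():
        full[step] = value
        full["".join(COMPLEMENT[b] for b in reversed(step))] = value
    return full

PT = expand({"AA": -18.66, "AC": -13.10, "AG": -14.00, "AT": -15.01, "CA": -9.45,
             "CC": -8.11, "CG": -10.03, "GA": -13.48, "GC": -11.08, "TA": -11.85})

def unit_score(unit, scale):
    """Sum of the scale over the P steps of a tandem repeat of `unit` (assumed circular)."""
    p = len(unit)
    return sum(scale[unit[i] + unit[(i + 1) % p]] for i in range(p))

def z_scores(p, scale):
    """Z-score of every repeat unit of length p relative to all 4**p units."""
    units = ["".join(u) for u in product("ACGT", repeat=p)]
    scores = [unit_score(u, scale) for u in units]
    mean = sum(scores) / len(scores)
    std = sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    return {u: (s - mean) / std for u, s in zip(units, scores)}

if __name__ == "__main__":
    z = z_scores(3, PT)
    for unit in ("AAA", "CAG", "CCG"):
        print(unit, round(z[unit], 2))
```

Because all shifted versions of a unit visit the same circular steps, every member of a shift class receives the same score under this assumption, which is consistent with treating repeats by class.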
4 Discussion
A general framework for sequence analysis in the presence of one or more additive scales has been developed.
The framework solves a number of open issues including: (1) construction of sequences with extremal properties;
(2) quantitative evaluation of sequences with respect to a given genomic background; (3) automatic extraction
of extremal sequences and profiles from genomic data bases; (4) rapid convergence to normal distributions
when N increases; and (5) complete analysis of correlations between scales and their rapid convergence towards
a fixed asymptotic value. The framework has been applied to DNA sequences and structural scales.
The fundamental requirement for the application of the framework is the additivity of the scale. This is
likely to be a reasonable approximation for many scales, at least over relatively short distances. The precise
nature of such distances, however, is an important open question that ultimately will have to be addressed
experimentally. The structural scales used here should be regarded only as a first-order approximation. The twist
angle between bases, for instance, is likely to depend on more than just the two neighboring bases (Dickerson,
1992). A better estimate could be derived using the tetranucleotide consisting of the two bases before and
after the twist angle. Unfortunately, the structure of all possible 256 tetranucleotides is not known and represents
a considerable experimental challenge. But the methods we have developed are independent of any particular
scale, approximation, or oligonucleotide length. They are readily applicable to new scales, tetranucleotide and
other, as well as to completely different scales defined over other alphabets (codons, RNA, proteins, etc.).
Furthermore, the methods are also applicable in conjunction with computationally-derived scales that are
parameterized and fitted to the data using neural network representations and other statistical machine
learning techniques.
The theory presented resolves the issue of correlation between scales. For each pair of scales there is not
a single correlation number because correlation depends both on background distribution and window length. Given
a fixed background distribution, the correlations rapidly converge to a fixed asymptotic value that can be
predicted mathematically. This value is attained here over a characteristic window length of about 10-15 bp, the
same over which normality is achieved, and corresponding to a few times the size of the longest of the two
scales being compared. This is the range over which local statistical fluctuations are stabilized, but it does not
correspond necessarily to the range over which the scales are additive.
With a uniform background distribution, for instance, the correlation between propeller twist and base
stacking varies monotonically from -0.294 to -0.757, as window size is increased from 2 to 10 or so. Thus the
increase in window length significantly changes the measured correlation from slightly negative to substantially
negative. An even more striking example is provided by the correlation between base stacking and protein
deformability, which varies from 0.043 to -0.843, under the same conditions.
Empirical determination of the proper window length for additivity and for measuring correlations may not be
easy. It must be noted, however, that these large variations are observed only with the theoretically-derived
base stacking energy scale (as well as the AT-content scale). In general, all other empirically-derived scales
exhibit pairwise correlations that do not vary dramatically with window length and are relatively small (Figure
4), except for propeller twist and protein deformability. Thus for the empirically-derived scales the local behavior
and the aggregate behavior over 10-15 bp are quite similar in terms of correlations, so that the precise selection
of a window length may not be a serious obstacle in this case.
Propeller twist and protein deformability are highly, but not perfectly, correlated over both very short and
intermediate distances. These scales were derived by crystallography of naked DNA and DNA in complex with
protein respectively. This suggests that DNA structures observed in protein-DNA complexes may to some extent
be determined at the DNA-sequence level, or at least that the structure of DNA in the complex has to be
consistent with the inherent structural features of the naked DNA.
In general, when substantial positive or negative correlation between two scales is observed, two different
sets of conclusions can be drawn. First, from a practical standpoint, it may be simpler and faster in data
base searches to use only one of the two scales, since the results provided by the second one are redundant.
Second, high correlation between two completely different experimental approaches attempting to quantify the
connection between DNA sequence and structure can be taken as a sign that the approaches are measuring the
same underlying reality. Thus correlation analysis, in addition to indicating which scale to use in practice,
may tell us something about their interpretation and validity.
It may at first seem strange that correlations depend also on background distribution, since structure is a
deterministic function of DNA sequence. In this sense, a uniform distribution may be the least biased. On the
other hand, if correlations are measured over a large number of sequences extracted from genomic data, it is
clear that the sequence composition influences the correlation. Similarly, if the scales are used to pull out
signals against a genomic background, it is important to take the statistical composition into consideration. In
this respect, it is worth noting that large scale genomic DNA is characterized by strand invariant compositions
(pA = pT and pC = pG) where correlations between empirically-derived scales tend to be smaller in absolute
value. The framework, however, applies as well to compositions that are not strand invariant.
We have modeled the background distribution using single nucleotide probabilities but it should be clear that
the same framework can accommodate more complex probabilistic models, such as Markov models of order
k. In fact, it is interesting to note that with a higher order Markov model, some of the correlations between
the scales measured in E. coli would be slightly higher. It is fair to suspect, however, that the structural
models currently available are somewhat noisy and therefore only marginal gains are to be expected at best from the
use of more refined probabilistic models.
Taken together, all these results indicate that with the exception of propeller twist and protein deformability,
the empirical scales we have used are by and large uncorrelated. DNA 3D structure is a complex phenomenon
that cannot be captured locally by a single number, but rather corresponds to a vector of properties. It is
therefore likely that the scales we have used represent different attempts at capturing various aspects of DNA
structure from different perspectives. This view is consistent with the simultaneous presence of predominantly low
and occasionally high correlations between pairs of scales. In particular, this provides a possible partial
explanation for the differences in interpretation the scales provide for some of the three extremal triplet
classes involved in triplet repeat expansion diseases.
The CAG-class of repeats is consistently found to be flexible according to all the models used here and in
agreement with experimental evidence (Chastain & Sinden, 1998). This class is special among triplet repeats,
being responsible for a large fraction of the currently known triplet repeat diseases (10 of the 13 mentioned
in (Baldi et al., 1999)). Furthermore, in a model study in E. coli, the CAG triplet repeat was found to be the
predominant genetic expansion product. In this study, the CAG-class was expanded at least nine times more
frequently than any other triplet (Ohshima et al., 1996). This is also the case in the primate data of (Jurka &
Pethiyagoda, 1995), as shown in Figure 7.
The CGG-class of repeats, on the other hand, seems to be very rigid, except for the propeller twist (and thus
protein deformability) scale. Better structural models may be needed to shed light on such a discrepancy.
However, it is important to remember that the models used in this work are based on mutually different and also
rather indirect investigations of DNA structure. Any single scale is likely to capture correctly only some
structural features of some sequence elements. For instance, the enzyme DNase I used to produce the bendability
scale preferentially binds and cuts sites where DNA is bent or bendable away from the minor groove. This
means that a high DNase I value can be caused by either a very flexible piece of DNA (isotropically flexible,
or anisotropically flexible in the right direction), or alternatively by a piece of DNA that is stiff but curved
with a compressed major groove.
The framework derived can be used to study how extreme tandem repeats and other sequences are. In several
cases, we find that commonly encountered repeat units have extremal structural properties. This is the
case for the most common repeat, poly-A or poly-T, but also for the repeats involved in triplet repeat expansion
diseases. It is essential to notice that these extremal properties pertain to the repeat unit class, e.g. the
triplet and its shifted versions, rather than the repeat unit alone. A triplet that is not extremal for a given
scale may become extremal once its two shifted versions are considered. For example, AGC has relatively
low bendability when taken alone, but corresponds to the most bendable class when GCA and CAG are taken
into account. When the large repetition numbers associated with repeat diseases are also taken into
consideration, the extremality of the corresponding DNA sequences with respect to the general background is even
more striking. Incidentally, the triplet repeat class is under-represented amongst primate DNA repeat
expansions (Figure 7), suggesting that special expansion mechanisms may be at work for P = 1 and P = 3, at least in
a fraction of the cases.
How the expansion of disease-causing triplets occurs, as well as several related puzzling questions such as why
expansion frequency depends on repeat length, remain poorly understood, although it is widely assumed that
unusual structural characteristics of the repeats may play a role. Several models have been proposed involving
base-slippage and alternative DNA structures during DNA replication, recombination, and repair (Wells,
1996; Pearson & Sinden, 1998a; Pearson & Sinden, 1998b; Moore et al., 1999). Growing evidence indicates that
the formation of hairpins in Okazaki fragments (during replication of the lagging strand) is probably involved in
the expansion process (Chen et al., 1995; Gacy et al., 1995; Wells, 1996; Mariappan et al., 1998;
Miret et al., 1998; Pearson & Sinden, 1998b). But many questions remain open to interpretation.
Our analysis of a large set of tandem repeats from primate genomes reveals that many repeat units do not
have salient structural extreme properties according to the models used here. The results suggest that tandem
repeats are likely to belong to different classes and result from a variety of different mechanisms, not all of
which involve extremal structural profiles. This is not to say that structural signatures, rather than extreme
patterns, may not be involved in other cases as suggested, for instance, in (Liao et al., 2000). Further evidence
for such a possibility is provided by the fact that among the most frequently expanded repeat classes with
3 < P < 7, substrings of mono repeats (particularly poly-A, which is stiff) seem to be present almost always
(Figure 7).
Although some of the equations derived are for exact repeats, it should be clear that the framework applies
immediately to situations where the repeat is not perfect, either because of small variations in the sequence
or because the repetition number R contains a fractional component. Cases are described in the literature (Orr
et al., 1993), for instance, where the G of a few isolated CAG triplets within a long CAG repeat region is
replaced by a T. Of interest, the CAT triplet belongs to the second highest bendability class, and therefore
the flexibility properties of such stretches are likely to be preserved. More generally, the scales could perhaps
form the basis of new alignment penalties in cases where structure, rather than sequence, is preserved.
The methods developed here can be applied more systematically to other repeats including telomeric repeats,
non-triplet disease-causing repeats, as well as to database searches for new putative disease-causing repeat
classes (Kleiderlein et al., 1998; Baldi et al., 1999). For instance, a repeated twelve-mer upstream of the
EPM1 gene displays intergenerational instability and has been associated with myoclonus epilepsy (Lalioti
et al., 1997; Lalioti et al., 1998). A similarly unstable, AT-rich, 42 bp repeat is involved in the fragile site
FRA10B (Hewett et al., 1998). In addition, a particular form of Muscular Dystrophy (FSHD) seems to be caused
by DNA contraction, rather than expansion. In FSHD, the repeat units are surprisingly long (3.3 Kb), and
located at the tip of the long arm of chromosome IV. The units are repeated 30 to 40 times in normal
individuals, and reduced down to 8 repeats in affected individuals (van Deutekom et al., 1993; Winokur et al., 1994;
Hewitt et al., 1994). Statistical analysis of long repeat units may benefit even more from the techniques
developed here.
Finally, these techniques can be applied to repeats associated with interspersed elements as well, and, more
broadly, to the analysis of structural signals, across entire genomes (Pedersen et al., 2000), or associated with
specific regions such as regulatory regions, protein binding sites, SARs (scaffold attachment regions), or
polytene bands in Drosophila. The methods are also being applied to phylogenetic questions and to whether DNA
structure may have had any influence on the origin of the genetic code. While all these problems would benefit
from improved structural models, the methods are now in place to work in conjunction with any new scale that
may become available in the near future with progress in DNA experimental techniques.
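The class-level argument made in the Discussion (AGC is unremarkable alone, but its class is the most bendable once GCA and CAG are included) can be checked directly against the bendability values of Table 18. The sketch below is an illustration under stated assumptions: the class score is taken to be the sum of B over the three shifted versions of the unit, and the helper names (`revcomp`, `class_name`, `class_scores`) are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def revcomp(word):
    return "".join(COMPLEMENT[b] for b in reversed(word))

# Bendability (B) values transcribed from Table 18, one per reverse-complement pair.
B_PAIRS = {
    "AAA": -0.274, "AAC": -0.205, "AAG": -0.081, "AAT": -0.280, "ACA": -0.006,
    "ACC": -0.032, "ACG": -0.033, "ACT": -0.183, "AGA": 0.027, "AGC": 0.017,
    "AGG": -0.057, "ATA": 0.182, "ATC": -0.110, "ATG": 0.134, "CAA": 0.015,
    "CAC": 0.040, "CAG": 0.175, "CCA": -0.246, "CCC": -0.012, "CCG": -0.136,
    "CGA": -0.003, "CGC": -0.077, "CTA": 0.090, "CTC": 0.031, "GAA": -0.037,
    "GAC": -0.013, "GCA": 0.076, "GCC": 0.107, "GGA": 0.013, "GTA": 0.025,
    "TAA": 0.068, "TCA": 0.194,
}
B = {}
for triplet, value in B_PAIRS.items():
    B[triplet] = value
    B[revcomp(triplet)] = value  # the scale is symmetric under reverse complement

def shifts(word):
    return [word[i:] + word[:i] for i in range(len(word))]

def class_name(word):
    """Alphabetically first member among all shifts of the word and of its reverse complement."""
    return min(shifts(word) + shifts(revcomp(word)))

def class_scores(scale):
    """Sum of the scale over the three shifted versions of each class representative."""
    scores = {}
    for triplet in scale:
        name = class_name(triplet)
        if name not in scores:
            scores[name] = sum(scale[s] for s in shifts(name))
    return scores

if __name__ == "__main__":
    ranking = sorted(class_scores(B).items(), key=lambda kv: kv[1], reverse=True)
    for name, score in ranking:
        print(name, round(score, 3))
```

With these values the AGC class (containing CAG) ranks first and the ATC class (containing CAT) second, in line with the statements in the text.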
Acknowledgements
The work of PB was initially supported by an NIH SBIR
grant to Net-ID, Inc., and currently by a Laurel Wilkening Faculty Innovation award at UCI. We would like to
thank Anders Gorm Pedersen and David Ussery for comments on the manuscript.

Appendix: Dinucleotide and Trinucleotide Scales
The 3 dinucleotide scales (Table 17) and 2 trinucleotide
scales (Table 18) used in the text.

Table 17: Dinucleotide scales.

Pair     BS       PT       PD
AA/TT    -5.37    -18.66   2.9
AC/GT    -10.51   -13.10   2.3
AG/CT    -6.78    -14.00   2.1
AT       -6.57    -15.01   1.6
CA/TG    -6.57    -9.45    9.8
CC/GG    -8.26    -8.11    6.1
CG       -9.69    -10.03   12.1
GA/TC    -9.81    -13.48   4.5
GC       -14.59   -11.08   4.0
TA       -3.82    -11.85   6.3

Table 18: Trinucleotide scales.

Triplet    B        PP
AAA/TTT    -0.274   36
AAC/GTT    -0.205   6
AAG/CTT    -0.081   6
AAT/ATT    -0.280   30
ACA/TGT    -0.006   6
ACC/GGT    -0.032   8
ACG/CGT    -0.033   8
ACT/AGT    -0.183   11
AGA/TCT    0.027    9
AGC/GCT    0.017    25
AGG/CCT    -0.057   8
ATA/TAT    0.182    13
ATC/GAT    -0.110   7
ATG/CAT    0.134    18
CAA/TTG    0.015    9
CAC/GTG    0.040    17
CAG/CTG    0.175    2
CCA/TGG    -0.246   8
CCC/GGG    -0.012   13
CCG/CGG    -0.136   2
CGA/TCG    -0.003   31
CGC/GCG    -0.077   25
CTA/TAG    0.090    18
CTC/GAG    0.031    8
GAA/TTC    -0.037   12
GAC/GTC    -0.013   8
GCA/TGC    0.076    13
GCC/GGC    0.107    45
GGA/TCC    0.013    5
GTA/TAC    0.025    6
TAA/TTA    0.068    20
TCA/TGA    0.194    8
Appendix: Enumeration of Equivalence Classes
Polya's enumeration theory solves a number of combinatorial questions related to the action of groups on sets.
The relevant set here is the set B of all words of length P over an alphabet of A letters, with $A^P$ elements.
The group G of interest is a subgroup of the group of all permutations of B. If we consider circular permutations
on one strand only, it is the circular group with P elements, generated by the single right shift operator
$\sigma$. If we consider also the reverse complement, then the group G is easily described as the set of all
permutations of the form $\sigma^k \rho^l$, where $\rho$ denotes the reverse complement permutation,
with $0 \le k \le P-1$, and $l = 0$ or $1$. Notice that this group is not commutative. However,
$\sigma^k \rho = \rho\, \sigma^{P-k}$.
In both cases, G acts on B and an equivalence relation $\omega_1 \sim \omega_2$ is defined if and only if there
exists a $g \in G$ such that $\omega_2 = g\,\omega_1$. The total number of classes or orbits is
given by the Burnside lemma and equal to:
$$\frac{1}{|G|} \sum_{g \in G} |B_g| \qquad (37)$$
where $B_g = \{x \in B \mid g(x) = x\}$ is the set of all the elements of B that are fixed by g.
We thus need to study $B_{\sigma^k}$ and $B_{\sigma^k \rho}$. A case by case inspection easily shows that:
If $(k, P) = 1$, then $|B_{\sigma^k}| = A$ since only sequences of one repeated letter are stable.
If $k \mid P$, then $|B_{\sigma^k}| = A^k$.
If $(k, P) = l$, then $|B_{\sigma^k}| = A^l$.
When G is the group of cyclic permutations, $|G| = P$. Putting these results together gives the right hand side
of Equation 7, which is equivalent to the left hand side after some algebra.
When we take into account the reverse complement, we have $|G| = 2P$. A case by case inspection again shows
that:
If P is odd, then $B_{\sigma^k \rho}$ is empty.
If P is even, then $B_{\sigma^k \rho}$ has $P/2$ degrees of freedom and therefore $|B_{\sigma^k \rho}| = A^{P/2}$.
This yields immediately the formula in Equations 9 and 10. If needed, it is also straightforward to count the
number of elements inside each type of equivalence class.

Appendix: Correlations as a Function of Background Distribution
Examples of surfaces of correlations between the AT-content scale and the other scales as a function of
background distribution are given in Figure 8. Similar curves can be obtained for any pair of scales.
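The class counts produced by the enumeration appendix can be checked by brute force for small P. The sketch below is a minimal illustration with hypothetical function names: it builds the orbits of all words of length P under cyclic shifts, optionally combined with the reverse complement, and compares the number of orbits with the Burnside average obtained by counting fixed words directly rather than by a closed formula.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def revcomp(word):
    return "".join(COMPLEMENT[b] for b in reversed(word))

def right_shift(word, k):
    """Circular right shift of word by k positions."""
    return word[-k:] + word[:-k] if k else word

def orbit(word, use_revcomp=True):
    """All words reachable from `word` by shifts (and the reverse complement, if enabled)."""
    members = {right_shift(word, k) for k in range(len(word))}
    if use_revcomp:
        rc = revcomp(word)
        members |= {right_shift(rc, k) for k in range(len(rc))}
    return frozenset(members)

def count_classes(p, use_revcomp=True):
    """Number of equivalence classes, by explicit orbit enumeration."""
    return len({orbit("".join(w), use_revcomp) for w in product("ACGT", repeat=p)})

def burnside_count(p, use_revcomp=True):
    """Burnside average (1/|G|) * sum over g of |B_g|, with |B_g| counted by brute force."""
    words = ["".join(w) for w in product("ACGT", repeat=p)]
    flips = (0, 1) if use_revcomp else (0,)
    group = [(k, l) for k in range(p) for l in flips]
    fixed = 0
    for k, l in group:
        for w in words:
            image = right_shift(revcomp(w) if l else w, k)
            if image == w:
                fixed += 1
    return fixed // len(group)

if __name__ == "__main__":
    for p in (3, 4):
        print(p, count_classes(p, False), count_classes(p, True), burnside_count(p, True))
```

For P = 3 and P = 4 the counts with reverse complement are 12 and 39, the numbers of classes listed in Tables 19 and 20. For even P the brute-force count also shows that only those elements combining a shift with the reverse complement that leave no position fixed contribute fixed words.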
Appendix: Repeat Classes for N = 3 and N = 4
Repeat classes for each trinucleotide (Table 19) and each
tetranucleotide (Table 20).

Table 19: Repeat class for each trinucleotide (N = 3).
Classes are numbered in alphabetical order. The first alphabetical member of each class (bold in the original)
is marked with an asterisk.

Triplet  Class    Triplet  Class    Triplet  Class
AAA*     1        CCG*     12       GTA      7
AAC*     2        CCT      9        GTC      6
AAG*     3        CGA      6        GTG      5
AAT*     4        CGC      12       GTT      2
ACA      2        CGG      12       TAA      4
ACC*     5        CGT      6        TAC      7
ACG*     6        CTA      7        TAG      7
ACT*     7        CTC      9        TAT      4
AGA      3        CTG      8        TCA      10
AGC*     8        CTT      3        TCC      9
AGG*     9        GAA      3        TCG      6
AGT      7        GAC      6        TCT      3
ATA      4        GAG      9        TGA      10
ATC*     10       GAT      10       TGC      8
ATG      10       GCA      8        TGG      5
ATT      4        GCC      12       TGT      2
CAA      2        GCG      12       TTA      4
CAC      5        GCT      8        TTC      3
CAG      8        GGA      9        TTG      2
CAT      10       GGC      12       TTT      1
CCA      5        GGG      11
CCC*     11       GGT      5
Table 20: Repeat class for each tetranucleotide (N = 4). Classes are numbered in alphabetical order. The first
alphabetical member of each class (bold in the original) is marked with an asterisk. Each of the six column
groups lists a quadruplet followed by its class.

Quad  Class  Quad  Class  Quad  Class  Quad  Class  Quad  Class  Quad  Class
AAAA* 1      AGGT  19     CCCG* 37     GAAC  6      GGTA  19     TCCT  9
AAAC* 2      AGTA  10     CCCT  31     GAAG  9      GGTC  18     TCGA  34
AAAG* 3      AGTC  24     CCGA  18     GAAT  12     GGTG  17     TCGC  28
AAAT* 4      AGTG  23     CCGC  37     GACA  15     GGTT  5      TCGG  18
AACA  2      AGTT  7      CCGG* 38     GACC  18     GTAA  10     TCGT  6
AACC* 5      ATAA  4      CCGT  21     GACG  21     GTAC  22     TCTA  26
AACG* 6      ATAC  16     CCTA  19     GACT  24     GTAG  19     TCTC  25
AACT* 7      ATAG  26     CCTC  31     GAGA  25     GTAT  16     TCTG  15
AAGA  3      ATAT* 32     CCTG  30     GAGC  28     GTCA  24     TCTT  3
AAGC* 8      ATCA  11     CCTT  9      GAGG  31     GTCC  21     TGAA  12
AAGG* 9      ATCC* 33     CGAA  6      GAGT  23     GTCG  18     TGAC  24
AAGT* 10     ATCG* 34     CGAC  18     GATA  26     GTCT  15     TGAG  23
AATA  4      ATCT  26     CGAG  28     GATC  34     GTGA  23     TGAT  11
AATC* 11     ATGA  12     CGAT  34     GATG  33     GTGC  20     TGCA  35
AATG* 12     ATGC* 35     CGCA  20     GATT  11     GTGG  17     TGCC  30
AATT* 13     ATGG  33     CGCC  37     GCAA  8      GTGT  14     TGCG  20
ACAA  2      ATGT  16     CGCG* 39     GCAC  20     GTTA  7      TGCT  8
ACAC* 14     ATTA  13     CGCT  28     GCAG  30     GTTC  6      TGGA  33
ACAG* 15     ATTC  12     CGGA  21     GCAT  35     GTTG  5      TGGC  27
ACAT* 16     ATTG  11     CGGC  38     GCCA  27     GTTT  2      TGGG  17
ACCA  5      ATTT  4      CGGG  37     GCCC  37     TAAA  4      TGGT  5
ACCC* 17     CAAA  2      CGGT  18     GCCG  38     TAAC  7      TGTA  16
ACCG* 18     CAAC  5      CGTA  22     GCCT  30     TAAG  10     TGTC  15
ACCT* 19     CAAG  8      CGTC  21     GCGA  28     TAAT  13     TGTG  14
ACGA  6      CAAT  11     CGTG  20     GCGC  39     TACA  16     TGTT  2
ACGC* 20     CACA  14     CGTT  6      GCGG  37     TACC  19     TTAA  13
ACGG* 21     CACC  17     CTAA  7      GCGT  20     TACG  22     TTAC  10
ACGT* 22     CACG  20     CTAC  19     GCTA  29     TACT  10     TTAG  7
ACTA  7      CACT  23     CTAG  29     GCTC  28     TAGA  26     TTAT  4
ACTC* 23     CAGA  15     CTAT  26     GCTG  27     TAGC  29     TTCA  12
ACTG* 24     CAGC  27     CTCA  23     GCTT  8      TAGG  19     TTCC  9
ACTT  10     CAGG  30     CTCC  31     GGAA  9      TAGT  7      TTCG  6
AGAA  3      CAGT  24     CTCG  28     GGAC  21     TATA  32     TTCT  3
AGAC  15     CATA  16     CTCT  25     GGAG  31     TATC  26     TTGA  11
AGAG* 25     CATC  33     CTGA  24     GGAT  33     TATG  16     TTGC  8
AGAT* 26     CATG  35     CTGC  30     GGCA  30     TATT  4      TTGG  5
AGCA  8      CATT  12     CTGG  27     GGCC  38     TCAA  11     TTGT  2
AGCC* 27     CCAA  5      CTGT  15     GGCG  37     TCAC  23     TTTA  4
AGCG* 28     CCAC  17     CTTA  10     GGCT  27     TCAG  24     TTTC  3
AGCT* 29     CCAG  27     CTTC  9      GGGA  31     TCAT  12     TTTG  2
AGGA  9      CCAT  33     CTTG  8      GGGC  37     TCCA  33     TTTT  1
AGGC* 30     CCCA  17     CTTT  3      GGGG  36     TCCC  31
AGGG* 31     CCCC* 36     GAAA  3      GGGT  17     TCCG  21
[Figure 8 plot: four surface panels (Correlation AT/BS, Correlation AT/PT, Correlation AT/PD, Correlation AT/PP),
each plotted against AT content and A/(A+T) ratio, both from 0 to 1; the correlation axis runs from -1 to 1.]

Figure 8: Surface representing the asymptotic correlation between AT-content and all the other scales for
different background distributions.
References
dependent bending propensity of DNA as revealed by DNase I:
parameters for trinucleotides. EMBO J., 14, 1812{1818.
Calladine, C. R., Drew, H. R. & McCall, M. J. (1988). The intrinsic
structure of DNA in solution. J. Mol. Biol., 201, 127{137.
Campuzano, V., Montermini, L., Molto, M. D., Pianese, L.,
Cossee, M., Cavalcanti, F., Montos, E., Rodius, F., Duclos, F.,
Monticelli, A., Zara, F., Canizares, J., Koutnikova, H., Bidichandani, S. I., Gellera, C., Brice, A., Trouillaas, P., Michele, G. D.,
Filla, A., Frutos, R. D., Palau, F., Patel, P. I., Donato, S. D.,
Mandel, J. L., Cocozza, S., Koenig, M. & Pandolfo, M. (1996).
Friedreich's ataxia: autosomal recessive disease caused by an intronic GAA triplet repeat expansion. Science , 271, 1423{1427.
Chastain, P. D. & Sinden, R. R. (1998). CTG repeats associated
with human genetic disease are inherently exible. J. Mol. Biol.,
275, 405{411.
Chen, X., Mariappan, S. V. S., Catasti, P., Ratli, R., Moyzis,
R. K., Ali, L., Smith, S. S., Bradbury, E. M. & Gupta, G. (1995).
Hairpins are formed by the single DNA strands of the fragile X
triplet repeats: structure and biological implications. Proc. Natl.
Acad. Sci. USA, 52, 5199{5203.
Dickerson, R. E. (1992). DNA structure from A to Z. Meth. Enz.,
211, 67{111.
Drew, H. R. & Travers, A. A. (1985). DNA bending and its relation
to nucleosome positioning. J. Mol. Biol., 186, 773{790.
Durbin, R., Eddy, S. R., Krogh, A. & Mitchison, G. (1998). Biological Sequence Analysis. Probabilstic Models of Proteins and
Nucleic Acids. Cambridge University Press, Cambridge, UK.
Eichler, E. E. & Nelson, D. L. (1998). The FRAXA fragile site
and fragile X syndrome. In Rubinsztein, D. C. & Hayden, M. R.,
(eds.) Analysis of triplet repeat disorders . BIOS Scientic Publishers Ltd., Oxford, UK., pp. 13{50.
Feller, W. (1971). An Introduction to Probability Theory and its
Applications, volume 2. John Wiley & Sons, New York. Second
Edition.
Ashley, C. T. & Warren, S. T. (1995). Trinucleotide repeat expansion and human disease. Ann. Rev. Genetics , 29, 703{728.
Baldi, P., Brunak, S., Chauvin, Y. & Krogh, A. (1996). Naturally occurring nuclesome positioning signals in human exons and
introns. J. Mol. Biol., 263, 503{510.
Baldi, P., Brunak, S., Chauvin, Y. & Pedersen, A. G. (1999).
Structural basis for triplet repeat disorders: a computational analyis. Bioinformatics , 15, 918{929.
Baldi, P., Chauvin, Y., Pedersen, A. G. & Brunak, S. (1998). Computational applications of DNA structural scales. In Proceedings of
the 1998 Conference on Intelligent Systems for Molecular Biology
(ISMB98). The AAI Press, Menlo Park, CA, pp. 35{42.
Baldi, P. & Rinott, Y. (1989). On normal approximations of distributions in terms of dependency graphs. The Annals of Probability ,
17, 1646{1650.
Bell, S. J. & Forsdyke, D. R. (1999). Accounting units in DNA.
Journal of Theoretical Biology , 197, 51{61.
Benson, G. (1999). Tandem repeats nder: a program to analyze
DNA sequences. Nucleic Acids Research , 27, 573{580.
Benson, G. & Waterman, M. S. (1994). A method for fast database
search for all k-nucleotide repeats. Nucleic Acids Research , 22,
4828{4836.
Blanchard, M. K., Chiapello, H. & Coward, E. (2000). Detecting
localized repeats in genomic sequences: a new strategy and its applications to Bacillus subtilis and Arabidopsis thaliana sequences.
Computers and Chemistry , 24, 57{70.
Breslauer, K. J., Frank, R., Blocker, H. & Marky, L. A. (1986).
Predicting DNA duplex stability from the base sequence. Proc.
Natl. Acad. Sci. USA, 83, 3746{3750.
Brukner, I., Sanchez, R., Suck, D. & Pongor, S. (1995). Sequence-
23
Fye, R. M. & Benham, C. J. (1999). Exact method for numerically
analyzing a model of local denaturation in superhelically stressed
DNA. Physical Review E , 59, 3408{3426.
Gacy, A. M., Goellner, G., Juranic, N., Macura, S. & McMurray,
C. T. (1995). Trinucleotide repeats that expand in human disease
form hairpin structures in vitro. Cell , 8, 533{540.
Gecz, J. & Mulley, J. C. (1999). Characterisation and expression
of a large, 13.7 kb fmr2 isoform. Eur. J. Hum. Genet., 7, 157{162.
Godde, J. S., Kass, S. U., Hirst, M. C. & Wole, A. P. (1996).
Nucleosome assembly on methylated CGG triplet repeats in the
fragile X mental retardation gene 1 promoter. J. Biol. Chem., 271,
24325{24328.
Godde, J. S. & Wole, A. P. (1996). Nucleosome assembly on CTG
triplet repeats. J. Biol. Chem., 271, 15222{15229.
Goodsell, D. S. & Dickerson, R. E. (1994). Bending and curvature
calculations in B-DNA. Nucl. Acids Res., 22, 5497{5503.
Grove, A., Galeone, A., Mayol, L. & Geiduschek, E. P. (1996).
Localized DNA exibility contributes to target site selection by
DNA-bending proteins. J. Mol. Biol., 260, 120{125.
Gusella, J. F. & MacDonald, M. E. (1996). Trinucleotide instability: a repeating theme in human inherited disorders. Ann. Rev.
in Medicine , 47, 201{209.
Hardy, J. & Gwinn-Hardy, K. (1998). Genetic classication of
primary neurodegenerative disease. Science , 282, 1075{1079.
Hassan, M. A. E. & Calladine, C. R. (1996). Propeller-twisting of
base-pairs and the conformational mobility of dinucleotide steps
in DNA. J. Mol. Biol., 259, 95{103.
Hewett, D. R., Handt, O., Hobson, L., Mangelsdorf, M., Eyre,
H. J., Baker, E., Sutherland, G. R., Schuffenhauer, S., Mao, J. I.
& Richards, R. I. (1998). FRA10B structure reveals common elements in repeat expansion and chromosomal fragile site genesis.
Mol. Cell, 1, 773–781.
Hewitt, J. E., Lyle, R., Clark, L. N., Valleley, E. M., Wright, T. J.,
Wijmenga, C., van Deutekom, J. C., Francis, F., Sharpe, P. T. &
et al., M. H. (1994). Analysis of the tandem repeat locus D4Z4
associated with facioscapulohumeral muscular dystrophy. Hum.
Mol. Genetics, 3, 1287–1295.
Hunter, C. A. (1996). Sequence-dependent DNA structure. Bioessays, 18, 157–162.
Iyer, V. & Struhl, K. (1995). Poly(dA:dT), a ubiquitous promoter element that stimulates transcription via its intrinsic DNA
structure. EMBO J., 14, 2570–2579.
Jeffreys, A. J. (1997). Spontaneous and induced minisatellite instability in the human genome. Clinical Science, 93, 383–390.
Junck, L. & Fink, J. K. (1996). Machado-Joseph disease and SCA3:
the genotype meets the phenotypes. Neurology, 46, 4–8.
Jurka, J. (1998). Repeats in genomic DNA: mining and meaning.
Curr. Opin. Struct. Biol., 8, 333–337.
Jurka, J. & Pethiyagoda, C. (1995). Simple repetitive DNA sequences from primates: compilation and analysis. J. Mol. Evol.,
40, 120–126.
Jurka, J., Walichiewicz, J. & Milosavljevic, A. (1992). Prototypic
sequences for human repetitive DNA. J. Mol. Evol., 35, 286–291.
Kleiderlein, J. J., Nisson, P. E., Jessee, J., Li, W., Becker, K. G.,
Derby, M. L., Ross, C. A. & Margolis, R. L. (1998). CCG repeats
in cDNAs from human brain. Hum. Genet., 103, 666–673.
Koenig, M. (1998). Friedreich's ataxia. In Rubinsztein, D. C. &
Hayden, M. R., (eds.) Analysis of triplet repeat disorders. BIOS
Scientific Publishers Ltd., Oxford, UK, pp. 219–238.
Lahm, A. & Suck, D. (1991). DNase I-induced DNA conformation:
2 Å structure of a DNase I-octamer complex. J. Mol. Biol., 222,
645–667.
Lalioti, M. D., Scott, H. S., Buresi, C., Rossier, C., Bottani, A.,
Morris, M. A., Malafosse, A. & Antonarakis, S. E. (1997). Dodecamer repeat expansion in cystatin B gene in progressive myoclonus epilepsy. Nature, 386, 847–851.
Lalioti, M. D., Scott, H. S., Genton, P., Grid, D., Ouazzani, R.,
M'Rabet, A., Ibrahim, S., Gouider, R., Dravet, C., Chkili, T., Bottani, A., Buresi, C., Malafosse, A. & Antonarakis, S. E. (1998). A
PCR amplification method reveals instability of the dodecamer repeat in progressive myoclonus epilepsy (EPM1) and no correlation
between the size of the repeat and age at onset. Am. J. Hum.
Genet., 62, 842–847.
Lee, C. C. (1998). Spinocerebellar ataxia type 6 (SCA6). In Rubinsztein, D. C. & Hayden, M. R., (eds.) Analysis of triplet repeat
disorders. BIOS Scientific Publishers Ltd., Oxford, UK, pp. 145–154.
Liao, G., Rehm, E. J. & Rubin, G. M. (2000). Insertion site preferences of the P transposable element in Drosophila melanogaster.
Proc. Natl. Acad. Sci. USA, 97, 3347–3351.
Liu, K. & Stein, A. (1997). DNA sequence encodes information
for nucleosome array formation. J. Mol. Biol., 270, 559–573.
Lu, Q., Wallrath, L. L. & Elgin, S. C. R. (1994). Nucleosome
positioning and gene regulation. J. Cell. Biochem., 55, 83–92.
Mariappan, S. V. S., Silks III, L. A., Bradbury, E. M. & Gupta,
G. (1998). Fragile X DNA triplet repeats, (GCC)n, form hairpins with single hydrogen-bonded cytosine·cytosine mispairs at
the CpG sites: isotope-edited nuclear magnetic resonance spectroscopy on (GCC)n with selective 15N4-labeled cytosine bases. J.
Mol. Biol., 283, 111–120.
Milosavljevic, A. & Jurka, J. (1993). Discovering simple DNA
sequences by the algorithmic significance method. CABIOS, 9,
407–411.
Miret, J. J., Pessoa-Brandao, L. & Lahue, R. S. (1998).
Orientation-dependent and sequence-specific expansions of
CTG/CAG trinucleotide repeats in Saccharomyces cerevisiae.
Proc. Natl. Acad. Sci. USA, 95, 12438–12443.
Moore, H., Greenwell, P. W., Liu, C. P., Arnheim, N. & Petes,
T. D. (1999). Triplet repeats form secondary structures that escape
DNA repair in yeast. Proc. Natl. Acad. Sci. USA, 96, 1504–1509.
Nelson, D. L. (1995). The fragile X syndrome. Semin. Cell Biol.,
6, 5–11.
Nelson, H. C. M., Finch, J. T., Luisi, B. F. & Klug, A. (1987).
The structure of an oligo(dA)-oligo(dT) tract and its biological
implications. Nature, 330, 221–226.
Ohshima, K., Kang, S. & Wells, R. D. (1996). CTG triplet repeats
from human hereditary diseases are dominant genetic expansion
products in Escherichia coli. J. Biol. Chem., 271, 1853–1856.
Olson, W. K., Gorin, A. A., Lu, X., Hock, L. M. & Zhurkin,
V. B. (1998). DNA sequence-dependent deformability deduced
from protein-DNA crystal complexes. Proc. Natl. Acad. Sci. USA,
95, 11163–11168.
Ornstein, R. L., Rein, R., Breen, D. L. & MacElroy, R. D. (1978).
An optimised potential function for the calculation of nucleic acid
interaction energies. I. Base stacking. Biopolymers, 17, 2341–2360.
Orr, H. T., Chung, M., Banfi, S., Kwiatkowski, T. J., Jr., Servadio, A.,
Beaudet, A. L., McCall, A. E., Duvick, L. A., Ranum, L. P. W.
& Zoghbi, H. Y. (1993). Expansion of an unstable trinucleotide
CAG repeat in spinocerebellar ataxia type 1. Nature Genet., 4,
221–226.
Orr, H. T. & Zoghbi, H. Y. (1998). Polyglutamine tract vs. protein
context in SCA1 pathogenesis. In Rubinsztein, D. C. & Hayden,
M. R., (eds.) Analysis of triplet repeat disorders. BIOS Scientific
Publishers Ltd., Oxford, UK, pp. 105–118.
Parvin, J. D., McCormick, R. J., Sharp, P. A. & Fisher, D. E.
(1995). Pre-bending of a promoter sequence enhances affinity for
the TATA-binding factor. Nature, 373, 724–727.
Paulson, H. L. (1998). Spinocerebellar ataxia type 3/Machado-Joseph disease. In Rubinsztein, D. C. & Hayden, M. R., (eds.)
Analysis of triplet repeat disorders. BIOS Scientific Publishers
Ltd., Oxford, UK, pp. 129–144.
Paulson, H. L., Perez, M. K., Trottier, Y., Trojanowski, J. Q.,
Subramony, S. H., Das, S. S., Vig, P., Mandel, J. L., Fischbeck,
K. H. & Pittman, R. N. (1997). Intranuclear inclusions of expanded
polyglutamine protein in spinocerebellar ataxia type 3. Neuron,
19, 333–344.
Pazin, M. J. & Kadonaga, J. T. (1997). SWI2/SNF2 and related
proteins: ATP-driven motors that disrupt protein-DNA interactions? Cell, 88, 737–740.
Pearson, C. E. & Sinden, R. R. (1998a). Slipped strand DNA,
dynamic mutations and human disease. In Wells, R. D. & Warren, S. T., (eds.) Genetic instabilities and hereditary neurological
diseases. Academic Press, New York, pp. 585–621.
Pearson, C. E. & Sinden, R. R. (1998b). Trinucleotide repeat DNA
structures: dynamic mutations from dynamic DNA. Curr. Opin.
Struct. Biol., 8, 321–330.
Pedersen, A. G., Baldi, P., Brunak, S. & Chauvin, Y. (1998). DNA
structure in human RNA polymerase II promoters. J. Mol. Biol.,
281, 663–673.
Pedersen, A. G., Jensen, L. J., Brunak, S., Staerfeldt, H. H. &
Ussery, D. W. (2000). A DNA structural atlas for Escherichia coli.
J. Mol. Biol. In press.
Ponomarenko, M. P., Ponomarenko, J. V., Frolov, A. S., Podkolodny, N. L., Savinkova, L. K., Kolchanov, N. A. & Overton,
G. C. (1999). Identification of sequence-dependent DNA sites interacting with proteins. Bioinformatics, 15, 687.
Prabhu, V. V. (1993). Symmetry observations in long nucleotide
sequences. Nucleic Acids Research, 21, 2797–2800.
Pulst, S.-M. (1998). Spinocerebellar ataxia type 2. In Rubinsztein,
D. C. & Hayden, M. R., (eds.) Analysis of triplet repeat disorders.
BIOS Scientific Publishers Ltd., Oxford, UK, pp. 119–128.
Rinott, Y. & Dembo, A. (1996). Some examples of normal approximations by Stein's method. In Aldous, D. & Pemantle, R., (eds.)
Random Discrete Structures. Springer Verlag, New York, NY, pp.
25–44.
Ross, C. A. (1995). When more is less: pathogenesis of glutamine
repeat neurodegenerative diseases. Neuron, 15, 493–496.
Rubinsztein, D. C. & Amos, B. (1998). Trinucleotide repeat mutation processes. In Rubinsztein, D. C. & Hayden, M. R., (eds.) Analysis of triplet repeat disorders. BIOS Scientific Publishers Ltd.,
Oxford, UK, pp. 257–268.
Rubinsztein, D. C. & Hayden, M. R. (1998). Introduction. In
Rubinsztein, D. C. & Hayden, M. R., (eds.) Analysis of triplet
repeat disorders. BIOS Scientific Publishers Ltd., Oxford, UK,
pp. 1–12.
Satchwell, S. C., Drew, H. R. & Travers, A. A. (1986). Sequence
periodicities in chicken nucleosome core DNA. J. Mol. Biol., 191,
659–675.
Sheridan, S. D., Benham, C. J. & Hatfield, G. W. (1998). Activation of gene expression by a novel DNA structural transmission
mechanism that requires supercoiling-induced DNA duplex destabilization in an upstream activating sequence. J. Biol. Chem.
Simpson, R. T. (1991). Nucleosome positioning: occurrence, mechanisms, and functional consequences. Prog. in Nucleic Acids Res.
and Mol. Biol., 40, 143–184.
Sinden, R. R. (1994). DNA structure and function. Academic
Press, San Diego, CA.
Skinner, J. A., Foss, G. S., Miller, W. J. & Davies, K. E. (1998).
Molecular studies of the fragile sites FRAXE and FRAXF. In
Rubinsztein, D. C. & Hayden, M. R., (eds.) Analysis of triplet
repeat disorders. BIOS Scientific Publishers Ltd., Oxford, UK,
pp. 51–60.
Starr, D. B., Hoopes, B. C. & Hawley, D. K. (1995). DNA bending
is an important component of site-specific recognition by the TATA
binding protein. J. Mol. Biol., 250, 434–446.
Stevanin, G., David, G., Abbas, N., Durr, A., Holmberg, M.,
Duyckaerts, C., Giunti, P., Cancel, G., Ruberg, M., Mandel, J.-L. & Brice, A. (1998). Spinocerebellar ataxia type 7 (SCA7). In
Rubinsztein, D. C. & Hayden, M. R., (eds.) Analysis of triplet
repeat disorders. BIOS Scientific Publishers Ltd., Oxford, UK,
pp. 155–168.
Suck, D. (1994). DNA recognition by DNase I. J. Mol. Recognition,
7, 65–70.
The Huntington's Disease Collaborative Research Group (1993). A
novel gene containing a trinucleotide repeat that is expanded and
unstable on Huntington's disease chromosomes. Cell, 72, 971–983.
Tsukiyama, T. & Wu, C. (1997). Chromatin remodeling and transcription. Curr. Opin. Gen. Dev., 7, 182–191.
van Deutekom, J. T., Wijmenga, C., van Tienhoven, E. A. E.,
Gruter, A. M., Hewitt, J. E., Padberg, G. W., van Ommen, G.
J. B., Hofker, M. H. & Frants, R. R. (1993). FSHD associated
DNA rearrangements are due to deletions of integral copies of a
3.3 kb tandemly repeated unit. Human Molecular Genetics, 2,
2037–2042.
Wang, Y.-H., Gellibolian, R., Shimizu, M., Wells, R. D. & Griffith, J. D. (1996). Long CCG triplet repeat blocks exclude
nucleosomes: a possible mechanism for the nature of fragile sites
in chromosomes. J. Mol. Biol., 263, 511–516.
Wang, Y.-H. & Griffith, J. D. (1994). Preferential nucleosome assembly at DNA triplet repeats from the myotonic dystrophy gene.
Science, 265, 669–671.
Wang, Y.-H. & Griffith, J. D. (1995). Expanded CTG triplet blocks
from the myotonic dystrophy gene create the strongest known natural nucleosome positioning elements. Genomics, 25, 570–573.
Wells, R. D. (1996). Molecular basis of triplet repeat diseases. J.
Biol. Chem., 271, 2875–2878.
Werner, M. H. & Burley, S. K. (1997). Architectural transcription
factors: proteins that remodel DNA. Cell, 88, 733–736.
Winokur, S. T., Bengtsson, U., Feddersen, J., Mathews, K. D.,
Weiffenbach, B., Bailey, H., Markovich, R. P., Murray, J. C., Wasmuth, J. J., Altherr, M. R. & Schutte, B. C. (1994). The DNA
rearrangement associated with facioscapulohumeral muscular dystrophy involves a heterochromatin-associated repetitive element:
implications for a role of chromatin structure in the pathogenesis
of the disease. Chromosome Research, 2, 225–234.
Wolffe, A. P. & Drew, H. R. (1995). DNA structure: implications
for chromatin structure and function. In Elgin, S. C. R., (ed.)
Chromatin structure and gene expression. IRL Press, Oxford, pp.
27–48.
Wolffe, A. P. & Matzke, M. A. (1999). Epigenetics: regulation
through repression. Science, 286, 481–486.
Zhu, Z. & Thiele, D. J. (1996). A specialized nucleosome modulates transcription factor access to a C. glabrata metal responsive
promoter. Cell, 87, 459–470.