Sequence Analysis by Additive Scales: DNA Structure for Sequences and Repeats of All Lengths

Pierre Baldi
Dept. of Information and Computer Science
University of California, Irvine
Irvine, CA 92697-3425

Pierre-Francois Baisnee
Dept. of Information and Computer Science
University of California, Irvine
Irvine, CA 92697-3425

Abstract

Motivation: DNA structure plays an important role in a variety of biological processes. Different di- and trinucleotide scales have been proposed to capture various aspects of DNA structure including base stacking energy, propeller twist angle, protein deformability, bendability, and position preference. Yet, a general framework for the computational analysis and prediction of DNA structure is still lacking. Such a framework should in particular address the following issues: (1) construction of sequences with extremal properties; (2) quantitative evaluation of sequences with respect to a given genomic background; (3) automatic extraction of extremal sequences and profiles from genomic data bases; (4) distribution and asymptotic behavior as the length N of the sequences increases; and (5) complete analysis of correlations between scales.

Results: We develop a general framework for sequence analysis based on additive scales, structural or other, that addresses all these issues. We show how to construct extremal sequences and calibrate scores for automatic genomic and data base extraction. We show that distributions rapidly converge to normality as N increases. Pairwise correlations between scales depend both on background distribution and sequence length and rapidly converge to an analytically predictable asymptotic value. For di- and trinucleotide scales, normal behavior and asymptotic correlation values are attained over a characteristic window length of about 10-15 bp. With a uniform background distribution, pairwise correlations between empirically-derived scales remain relatively small and roughly constant at all lengths, except for propeller twist and protein deformability which are positively correlated. There is a positive (resp. negative) correlation between dinucleotide base stacking (resp. propeller twist and protein deformability) and AT-content that increases in magnitude with length. The framework is applied to the analysis of various DNA tandem repeats. We derive exact expressions for counting the number of repeat unit classes at all lengths. Tandem repeats are likely to result from a variety of different mechanisms, a fraction of which is likely to depend on profiles characterized by extreme structural features.

Contact: [email protected], [email protected]

1 Introduction

Evidence is mounting that DNA structural properties beyond the double helical pattern play an important role in a number of fundamental biological processes, both under healthy and pathological conditions. This is not too surprising if one realizes that meters of DNA must be compacted into a nucleus that is only a few microns in diameter while, at the same time, preserving the ability to turn thousands of genes on and off in a precisely orchestrated fashion. The three-dimensional structure of DNA, as well as its organization into chromatin fibers, seems to be essential to its functions and has been implicated in diverse phenomena ranging from protein binding sites, to gene regulation, to triplet repeat expansion diseases.

The goal of this work is to develop computational methods for the structural analysis of DNA sequences. While DNA structure is our primary motivation and area of application, the framework we develop is completely general and applies to sequences over any alphabet, including codon, RNA, and protein alphabets, whenever local additive scales, as defined below, are available.
Author footnote: also at the Department of Biological Chemistry, College of Medicine, University of California, Irvine. To whom all correspondence should be addressed.

1.1 DNA Structure

DNA structure has been found to depend on the exact sequence of nucleotides, an effect that seems to be caused largely by interactions between neighboring base pairs (Ornstein et al., 1978; Satchwell et al., 1986; Breslauer et al., 1986; Calladine et al., 1988; Goodsell & Dickerson, 1994; Sinden, 1994; Brukner et al., 1995; Hassan & Calladine, 1996; Hunter, 1996; Ponomarenko et al., 1999; Fye & Benham, 1999). This means that different sequences can have different intrinsic structures, or different propensities for forming particular structures. Periodic repetitions of bent DNA in phase with the helical pitch, for instance, will cause DNA to assume a macroscopically curved structure. Flexible or intrinsically curved DNA is energetically more favorable to wrap around histones than rigid and unbent DNA, and this has been shown to influence nucleosome positioning (Drew & Travers, 1985; Satchwell et al., 1986; Simpson, 1991; Lu et al., 1994; Wolffe & Drew, 1995; Baldi et al., 1996; Zhu & Thiele, 1996; Liu & Stein, 1997). In addition, the chromatin complex structure of DNA and the positioning of nucleosomes along the genome have been found to play an important (generally inhibitory) role in the regulation of gene transcription (Pazin & Kadonaga, 1997; Tsukiyama & Wu, 1997; Werner & Burley, 1997; Pedersen et al., 1998). Sequence-dependent DNA structure is often important for DNA binding proteins, such as TBP (TATA-binding protein) (Parvin et al., 1995; Starr et al., 1995; Grove et al., 1996) and for gene regulation (Sheridan et al., 1998). While the number of resolved structures of DNA-protein complexes continues to grow in the PDB data base, the field of computational DNA structural analysis is clearly far behind its protein cousin and completely lacks any degree of systemicity.
Most likely, most DNA structural signals remain to be uncovered.

1.2 DNA Structural Scales

Based on many different empirical measurements or theoretical approaches, several models have been constructed that relate the nucleotide sequence to DNA flexibility and curvature (Ornstein et al., 1978; Satchwell et al., 1986; Goodsell & Dickerson, 1994; Sinden, 1994; Brukner et al., 1995; Hassan & Calladine, 1996; Hunter, 1996; Baldi et al., 1998; Ponomarenko et al., 1999). These models are typically in the form of dinucleotide or trinucleotide scales that assign a particular value to each di- or trinucleotide and its reverse complement. A non-exhaustive list of such scales includes:

1. The dinucleotide base stacking energy (BS) scale (Ornstein et al., 1978) expressed in kilocalories per mole. The scale is derived from approximate quantum mechanical calculations on crystal structures.

2. The dinucleotide propeller twist angle (PT) scale (Hassan & Calladine, 1996) measured in degrees. This scale is based on X-ray crystallography of DNA oligomers. Dinucleotides with a large negative propeller-twist angle tend to be more rigid than dinucleotides with low negative propeller-twist angle.

3. The dinucleotide protein deformability (PD) scale (Olson et al., 1998) derived from empirical energy functions extracted from the fluctuations and correlations of structural parameters in DNA-protein crystal complexes. Dinucleotides with large PD values tend to be more flexible.

4. The trinucleotide bendability (B) model (Brukner et al., 1995) based on DNase I cutting frequencies. The enzyme DNase I preferably binds (to the minor groove) and cuts DNA that is bent, or bendable, towards the major groove (Lahm & Suck, 1991; Suck, 1994). Thus DNase I cutting frequencies on naked DNA can be interpreted as a quantitative measure of major groove compressibility or anisotropic bendability. These frequencies allow for the derivation of bendability parameters for the 32 complementary trinucleotide pairs. Large B values correspond to flexibility.

5. The trinucleotide position preference (PP) scale derived from experimental investigations of the positioning of DNA in nucleosomes. It has been found that certain trinucleotides have a strong preference for being positioned in phase with the helical repeat. Depending on the exact rotational position, such triplets will have minor grooves facing either towards or away from the nucleosome core (Satchwell et al., 1986). Based on the premise that flexible sequences can occupy any rotational position on nucleosomal DNA, these preference values can be used as a triplet scale that measures DNA flexibility. Hence, in this model, all triplets with close to zero preference are assumed to be flexible, while triplets with a preference for facing either in or out are taken to be more rigid and have larger PP values. Note that we do not use this scale as a measure of how well different triplets form nucleosomal DNA. Instead, the absolute value, or unsigned nucleosome positioning preference, is used here, as in (Pedersen et al., 1998), as a measure of DNA flexibility.

For completeness, all these scales are displayed in the Appendix. In previous studies, we found these models useful (Baldi et al., 1996; Pedersen et al., 1998; Baldi et al., 1999), in particular for the detection of putative new structural signatures associated with an increase of bendability in downstream regions of RNA polymerase II promoters. A similar approach (Liao et al., 2000) was used to analyze the structure of insertion sites for P transposable elements in Drosophila melanogaster and suggests that the corresponding transposition mechanism recognizes a structural signature rather than a specific sequence motif. With the exception of BS, all the models were determined by purely experimental observations of sequence-structure correlations. Additional scales capturing DNA properties related directly or indirectly to structure, such as enthalpy or melting temperature, have also been proposed (Breslauer et al., 1986; Ponomarenko et al., 1999).

The primary focus of this work is not on assessing the merits and pitfalls of each model, but rather on the development of general methods for the systematic application of any scale to any sequence of any length, up to entire genomes, under the assumption that the scale can be used additively within a sliding window. In general, this assumption will provide a reasonable approximation, at least up to a certain length to be determined experimentally. In particular, we are interested in the development of methods for the automatic recognition of structural motifs associated with extremal features, such as extreme stiffness or bendability. Calibration of corresponding thresholds is expected to be useful in data base searches and is conceptually similar, for instance, to the calibration of thresholds for detecting sequence homology. More generally, however, data base searches may also be conducted on the basis of structural signatures or profiles that need not be extremal and could be obtained from reasonable training sets. Certain protein binding sites, for instance, are highly degenerate at the DNA sequence level, with low sequence homology, while exhibiting at the same time a high degree of DNA structural similarity. Similarly, periodic flexible triplets in phase with the double helical pitch are necessary to ensure long range curvature, for instance in nucleosome regions.

Although several scales may agree on some structural features, the fact remains that they may also display divergent interpretations of some sequence elements.
While no final consensus regarding these models exists, it is likely that each one provides a slightly different and partially complementary view of DNA structure. Thus a second goal of this work is the comparison of the models in the limited sense of estimating the statistical correlation between different scales. In (Baldi et al., 1998) it was shown that by and large many of the commonly used scales exhibit low correlations measured at the level of single di- or trinucleotides. Empirical measurements of correlations between the scales over longer lengths in Escherichia coli have recently revealed different unexplained patterns (Pedersen et al., 2000). Here we provide a complete explanation of these phenomena and show how correlations vary with background distribution and with window length. Finally, while the methods introduced can be applied to any DNA sequence, we focus here on a particularly important class of DNA sequences, namely DNA tandem repeats, where the general framework is further specialized.

1.3 DNA repeats

Genomes, especially eukaryotic genomes, are replete with DNA repetitive regions (Jurka et al., 1992; Jeffreys, 1997; Jurka, 1998). Well over 30% of the human genome has been estimated to comprise repetitive DNA of some sort (Benson & Waterman, 1994), the exact function of which is often unknown. Such DNA arises through many different evolutionary and genetic mechanisms. Over 950 different classes of repeats have been catalogued (Jurka, 1998). Two major groups of repeats exist: interspersed repeats and tandem repeats. While the methods to be developed can be applied to both groups, our analysis will focus on tandem repeats, consisting of two or more contiguous copies of a particular pattern of nucleotides. Tandem repeats may cover up to 10% of the human genome. Tandem repeats vary widely, over several orders of magnitude, both in terms of the length of the repeating pattern and the number of more or less exact contiguous copies. Repeats are often polymorphic and therefore play a major role in linkage studies and DNA fingerprinting.

In many cases, the genetic origin, the structure, and the function of these repetitive regions are poorly understood. There exist a few examples, however, where the repeats are known to play a biological role in both healthy and pathological conditions. Certain tandem repeats, for instance, have been associated with protein binding sites or interactions with transcription factors. An important advance in epigenetics research has been the realization that interactions between repeated DNA sequences can trigger the formation and the transmission of inactive genetic states and DNA modifications (Wolffe & Matzke, 1999). In several of these cases, the particular DNA-helical structural features of the repeat sequences seem to play an essential role.

Interest in tandem repeats has been heightened over the last few years by the discovery that several important degenerative disorders, including Huntington's disease, myotonic dystrophy, fragile X syndrome, and several forms of ataxia, result from the abnormal expansion of particular DNA triplets (The Huntington's Disease Collaborative Research Group, 1993; Ashley & Warren, 1995; Ross, 1995; Gusella & MacDonald, 1996; Hardy & Gwinn-Hardy, 1998; Rubinsztein & Hayden, 1998; Baldi et al., 1999). The exact mechanism by which a triplet repeat mutation causes disease varies, as indicated by the fact that currently known repeat expansions are found in 5' UTRs, in 3' UTRs, in introns, and within coding sequences of various affected genes (Ashley & Warren, 1995; Gusella & MacDonald, 1996; Rubinsztein & Amos, 1998; Rubinsztein & Hayden, 1998). For instance, fragile X mental retardation is associated with an expanded CGG repeat in the 5' UTR of the FMR1 gene (Nelson, 1995; Eichler & Nelson, 1998).

The 64 possible triplets can be clustered into 12 equivalence classes when shift and reverse complement operations are considered (see below). Currently only three repeat classes, CAG, CGG, and GAA, out of the possible twelve, are associated with triplet repeat disorders. There is evidence that unusual structural features of the repeats play a role in their expansion (Wells, 1996; Pearson & Sinden, 1998a; Pearson & Sinden, 1998b; Moore et al., 1999). In (Baldi et al., 1999), the structural scales above were used to show that the triplet classes involved in the diseases have extreme structural characteristics of very high or very low flexibility. Methods to quantify the degree of extremality relative to other sequences, however, were not developed. Furthermore, other triplet or non-triplet repeats may play a role in diseases as well as other biological processes. Therefore the techniques need to be improved and extended to all classes of repeats.

Hence, given the importance of repeating patterns and the exponential growth of sequence data bases, our goal is also to develop new tools for the computational analysis of the structural properties of arbitrary repeats and to begin to apply such techniques in a systematic and quantifiable way. Various algorithms for searching tandem repeats (Milosavljevic & Jurka, 1993; Benson & Waterman, 1994; Benson, 1999; Blanchard et al., 2000) have been developed. The techniques presented here can also be viewed as complementing such algorithms by introducing a structural perspective.

1.4 Organization

The remainder of the paper is organized as follows. In the next section we develop a general framework for the analysis of the score of a sequence (repetitive or not) under any additive scale. We determine the number of different sequence equivalence classes under circular permutation and reverse complement operations. We show how to determine and visualize maximal and minimal patterns and study the statistical properties of the scales, including intra-scale (mean and variance) and inter-scale (correlations) statistics for sequences of various lengths, as well as asymptotic normality. This framework is essential in order to compare the behavior of various scales, to locate a given sequence with respect to a comparable population, and to automatically set thresholds in data base searches. We then apply the general framework to the 5 structural models described above and to various tandem repeats.

2 Methods and Theory

2.1 General Framework

The general framework we consider begins with an alphabet $\mathcal{A}$ of size $A$ and a scale $\mathcal{S}$ of length (or size) $S$. The scale is a function that assigns a value to any $S$-tuple of the alphabet, for instance in the form of a table with $A^S$ entries. In the Results section, we deal exclusively with the nucleotide DNA alphabet ($A = 4$) and with DNA structural scales, such as dinucleotide scales with $S = 2$ (e.g. propeller twist) or trinucleotide scales with $S = 3$ (e.g. bendability). The same framework, however, can readily be applied to other situations (e.g. amino acid alphabet with hydrophobicity scales).

Given a primary sequence $s = X_1 X_2 \dots X_N$ of length $N \ge S$ over $\mathcal{A}$, we assume that the scale $\mathcal{S}$ is approximately additive, in the sense that the corresponding global property of the sequence $s$ can be estimated by "sliding" the scale along the sequence in the form

$$\mathcal{S}(s) = \mathcal{S}(X_1 \dots X_S) + \mathcal{S}(X_2 \dots X_{S+1}) + \dots = \sum_{i=1}^{N-S+1} \mathcal{S}(X_i \dots X_{i+S-1}) \tag{1}$$

In practical applications, such a quantity can also be averaged over a window of length $W$ ($W \le N$) to get, for instance, a more homogeneous per base-pair value. This averaging process does not concern us at this stage since it merely amounts to using a different scale, with a larger size. The form given in Equation 1 corresponds to a free boundary condition. The ideas to be developed can be applied to other boundary conditions, including periodic boundary conditions, where the sequence is wrapped around, as described below. With the proper modifications, the theory applies immediately to the case where the scales are shifted by more than one position at each step.

Consider now a repeat sequence $r$ consisting of a unit pattern or period $p = (X_1 \dots X_P)$ of length $P$, and repetition number $R > 1$, so that $r = (X_1 \dots X_P)^R$ with $N = PR \ge S$. Notice that the period is not uniquely determined since, for instance, $XXXX$ can be viewed as $(X)^4$, or as $(XX)^2$. In addition, we will assume that $P + S - 1 \le N$, or equivalently that $S \le (R-1)P + 1$, so that the scale $\mathcal{S}$ is applied starting at least once from each letter in the repetitive unit, without exceeding the repetitive sequence boundary. In this case, $\mathcal{S}(r)$ has the form:

$$\mathcal{S}(r) = l\,\mathcal{S}(p) + \rho \tag{2}$$

where $\rho$ denotes a boundary tail and $\mathcal{S}(p)$ is the contribution of the periodic unit

$$\mathcal{S}(p) = \mathcal{S}(X_1 \dots X_S) + \mathcal{S}(X_2 \dots X_{S+1}) + \dots + \mathcal{S}(X_P X_1 \dots X_{S-1}) = \sum_{i=1}^{P} \mathcal{S}(X_i \dots X_{i+S-1}) \quad [\mathrm{mod}\ P] \tag{3}$$

The number $l$ of times the periodic unit is covered by $\mathcal{S}$ and its shifted versions is given by:

$$l = \left\lfloor \frac{PR - S + 1}{P} \right\rfloor = R - \left\lceil \frac{S-1}{P} \right\rceil \tag{4}$$

Finally, if $lP + S - 1 = RP$ then the boundary tail $\rho$ is equal to 0. Otherwise

$$\rho = \mathcal{S}(X_{lP+1} \dots X_{lP+S}) + \dots + \mathcal{S}(X_{RP-S+1} \dots X_{RP}) \tag{5}$$

where indices can be taken modulo $P$, i.e. $X_{lP+1} = X_1$ and so forth. The sum in Equation 5 has at most $P - 1$ terms. In practice, at least in the case of DNA, only short scales are currently available and therefore in most cases $S \le P + 1$. In this case, Equation 2 simplifies to:

$$\mathcal{S}(r) = (R-1)\,\mathcal{S}(p) + \rho \tag{6}$$

2.2 Equivalence Classes

In the special case of repetitive sequences, we also need to be able to count the number of different repeats with respect to a given scale. It is often the case that the scale $\mathcal{S}$ is characterized by some kind of invariance over the $S$-tuples of $\mathcal{A}$. In the case of DNA, the structural scales we have are invariant with respect to the reverse complement. When looking at repeat sequences, this determines how many different repeat patterns of length $P$ need to be considered. A triplet repeat, for instance, can be described in terms of different unit trinucleotides depending on which strand and triplet frame is chosen. Thus, the repeat CAGCAGCAG... can be said to be a repeat of the triplet CAG, and also of its reverse complement CTG. Ignoring repeat boundaries, however, the sequence can also be described as a repeat of the shifted triplet pairs AGC/GCT and GCA/TGC. In this way, the 64 different trinucleotides can be divided into 12 possible repeat classes. Of these 12 classes, only 10 are proper triplet repeat classes in the sense that they do not result from a repeat pattern of shorter length. The two classes associated with shorter patterns are the triplet pairs AAA/TTT and CCC/GGG, which are more precisely described as mono-nucleotide repeats. [For a generic alphabet $\mathcal{A}$, a reverse complement operation can be defined by introducing a one-to-one function $X \to \bar{X}$ from the alphabet to itself, satisfying $\bar{\bar{X}} = X$, so that the reverse complement of $X_1 \dots X_N$ is defined to be $\bar{X}_N \dots \bar{X}_1$.]

In the case of a DNA repeat with unit repeat length $P$, the number of classes and the number of elements in each equivalence class are dictated by the action of the group of transformations associated with the circular permutations and the reverse complement operations on the set of all possible strings of length $P$. AAA.../TTT... and CCC.../GGG... always give rise to two separate classes with 2 elements each. In general, a typical class will contain $2P$ elements associated with the $P$ permutations and the $P$ reverse complements. Classes containing fewer elements, however, can arise, for instance as a result of sub-periodicity effects when $P$ is not prime, and of identical reverse complement effects. For instance, when $P = 4$, the class of ATAT contains only 2 elements since it is identical to its reverse complement and can be shifted circularly only once before returning to the original pattern.

The number of classes can be counted using standard group theory arguments detailed in the Appendix. These arguments are not restricted to circular permutation and reverse complement operations, but apply to any group of transformations over any sequences. The number of classes, when only circular permutations without reverse complement are taken into account, is given by

$$\frac{1}{P} \sum_{d \mid P} \phi\!\left(\frac{P}{d}\right) A^d = \frac{1}{P} \sum_{1 \le k \le P} A^{(P,k)} \tag{7}$$

where $(P,k)$ is the greatest common divisor (gcd) of $P$ and $k$, and $\phi(n)$ is the Euler function counting the number of positive integers not exceeding $n$ that are prime to $n$, i.e. without common divisors with $n$ other than 1. If $p_1, \dots, p_k$ is the list of distinct prime factors of $n$, then the Euler function can be expressed as:

$$\phi(n) = n \prod_{i=1}^{k} \left(1 - \frac{1}{p_i}\right) \tag{8}$$

When both circular permutations and reverse complement are taken into account, the number of classes for odd $P$ is given by

$$\frac{1}{2P} \sum_{d \mid P} \phi\!\left(\frac{P}{d}\right) A^d = \frac{1}{2P} \sum_{1 \le k \le P} A^{(P,k)} \tag{9}$$

When $P$ is even, the corresponding number of classes is

$$\frac{1}{2P} \left[ \sum_{d \mid P} \phi\!\left(\frac{P}{d}\right) A^d + \frac{P}{2} A^{P/2} \right] \tag{10}$$

or, equivalently,

$$\frac{1}{2P} \left[ \sum_{1 \le k \le P} A^{(P,k)} + \frac{P}{2} A^{P/2} \right] \tag{11}$$

In particular, when $P$ is an odd prime, the number of different classes under periodic and reverse complement equivalence is

$$\frac{1}{2P} \left[ (P-1)A + A^P \right] \tag{12}$$

For $P = 3$ and $A = 4$, for instance, this gives $(2 \cdot 4 + 64)/6 = 12$ classes, as stated above. The number of classes which are new at a given length $P$, i.e. that do not result from the repetition of a shorter pattern of length dividing $P$, can easily be obtained by subtracting the corresponding counts for each divisor of $P$.
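The counting formulas of this section are easy to check numerically. The following is a minimal sketch, not the derivation given in the Appendix: it evaluates Equations 7, 9 and 11 for the DNA alphabet and compares them with a direct enumeration of all repeat units. The function names are ours, and the formula with reverse complement assumes, as for DNA, that no letter is its own complement.

```python
from itertools import product
from math import gcd

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}  # DNA base pairing


def count_rotation_classes(P, A=4):
    """Equation 7: number of classes under circular permutation only."""
    return sum(A ** gcd(P, k) for k in range(1, P + 1)) // P


def count_repeat_classes(P, A=4):
    """Equations 9 (odd P) and 11 (even P): circular permutation plus
    reverse complement. Assumes no letter is its own complement."""
    total = sum(A ** gcd(P, k) for k in range(1, P + 1))
    if P % 2 == 0:
        total += (P // 2) * A ** (P // 2)
    return total // (2 * P)


def reverse_complement(seq):
    return "".join(COMPLEMENT[x] for x in reversed(seq))


def brute_force_classes(P):
    """Enumerate all 4**P units, merging rotations and reverse complements."""
    seen, classes = set(), 0
    for letters in product("ACGT", repeat=P):
        unit = "".join(letters)
        if unit in seen:
            continue
        classes += 1
        # The class of a unit is made of its rotations and of the
        # rotations of its reverse complement.
        for strand in (unit, reverse_complement(unit)):
            for shift in range(P):
                seen.add(strand[shift:] + strand[:shift])
    return classes


if __name__ == "__main__":
    for P in range(1, 7):
        print(P, count_rotation_classes(P), count_repeat_classes(P),
              brute_force_classes(P))
```

For P = 3 the closed form and the enumeration both return the 12 triplet classes mentioned in the text. The enumeration grows as 4^P and is only meant as a check for short units.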
When $P$ is prime, all classes are new except for the classes resulting from mono-letter repeats. Table 1 in the Results section exemplifies the application of Equations 9-12.

2.3 Extremal Sequences and Automata

We are interested in the construction and recognition of sequences $s$ that are extremal for $\mathcal{S}$, i.e. such that $\mathcal{S}(s)$ is very large or very small relative to the other sequences of length $N$. For this, we attach to each scale a prefix automaton, or prefix graph. The prefix automaton can be described by a directed graph containing $A^{S-1}$ nodes, each labeled by a string of length $S - 1$ over $\mathcal{A}$ of the form $X_1 \dots X_{S-1}$ (see Figure 1 for an example). Each node has $A$ directed outgoing connections: $X_1 \dots X_{S-1}$ is connected to $X_2 \dots X_{S-1} Y$, for each letter $Y$ in $\mathcal{A}$, hence the notion of prefix. The weight (or length) of the corresponding transition is provided by the entry associated with $X_1 X_2 \dots X_{S-1} Y$ in the structural table. The $A$ nodes labeled $(X)^{S-1} = XX \dots X$ (mono-repeats) are the only ones to have a self-connection. Any sequence $s$ of length $N$ is trivially associated with the path $X_1 \dots X_{S-1} \to X_2 \dots X_S \to \dots \to X_{N-S+2} \dots X_N$. The value of $\mathcal{S}(s)$ is found just by adding the weights of the corresponding connections. As a result, sequences associated with maximal or minimal values of $\mathcal{S}(s)$ correspond to paths in the prefix graph with maximal or minimal total weight or length. These can easily be found by standard dynamic programming techniques, which can also be extended to finding, for instance, the $k$ longest or shortest paths.

A repeat pattern of length $P$ is a directed cycle in the prefix graph. Notice that any path of length greater than $A^{S-1}$ must intersect itself at least once. Thus any cycle of length strictly greater than $A^{S-1}$ must be composed of non-intersecting cycles of length at most $A^{S-1}$. For instance, with a dinucleotide scale, any repeat unit of length greater than 4 must contain at least two cycles of length at most 4. Therefore in the study of repeats, we need only to study the properties of all non-intersecting directed cycles of length up to $A^{S-1}$ together with all possible ways of joining them. In addition to dynamic programming techniques, it is also useful to tabulate the weights of all possible short cycles, for at least two reasons. First, longer patterns are built from shorter cycles. Second, at least in the case of DNA, many important existing repeats, such as triplet repeats, are based on a short repeating pattern.

While the prefix graph is useful for constructing extremal sequences and recognizing them as long as $A$, $S$ and $N$ are small, it is also necessary to develop more general techniques by which we can rapidly assess, for any sequence $s$, the magnitude of $\mathcal{S}(s)$ with respect to all the other comparable sequences. This is best achieved by viewing the sequences in a probabilistic context.

2.4 Probabilistic Modeling

Consider now that sequences are being generated by a random process. In order to fix the ideas, we take for simplicity a Markov model of order 0, i.e. we assume that sequences are generated by $N$ tosses of the same die with distribution $D = (p_X)$ over the alphabet $\mathcal{A}$. The same analysis, however, can easily be extended to other probabilistic models, such as higher-order Markov models where distributions are defined, for instance, on pairs or triplets of letters. From Equation 1, $\mathcal{S}(s)$ is now a random variable which is the sum of $N - S + 1$ random variables: $\mathcal{S}(s) = Y_1 + \dots + Y_{N-S+1}$. By construction, all the variables $Y_i = \mathcal{S}(X_i \dots X_{i+S-1})$ have the same distribution, but they are not independent. Rather, they satisfy a form of local dependence, called "m-dependence" in statistics. More precisely, for $i < j$, $Y_i$ and $Y_j$ are independent if and only if $j - i \ge S$. Using the linearity of the expectation, we have:
$$E(\mathcal{S}(s)) = (N - S + 1)\,E(Y_i) \approx N \mu_S \tag{13}$$

with

$$E(Y_i) = \sum_{X_1 \dots X_S} \mathcal{S}(X_1 \dots X_S)\, p(X_1) \dots p(X_S) = \mu_S \tag{14}$$

the sum being over all $A^S$ $S$-tuples of the alphabet. To situate an individual sequence with respect to the entire population, we need to calculate the variance. The variance can also be calculated explicitly by taking advantage of the local dependence of the variables $Y_i$. We have

$$\mathrm{Var}(\mathcal{S}(s)) = (N - S + 1)\,\mathrm{Var}(Y_i) + 2 \sum_{0 < j - i < S} \mathrm{Cov}(Y_i, Y_j) \tag{15}$$

with the covariances $\mathrm{Cov}(Y_i, Y_j) = E[(Y_i - E(Y_i))(Y_j - E(Y_j))]$. As soon as $j - i \ge S$, $Y_i$ and $Y_j$ are independent and the corresponding covariance is 0. Thus, for any given scale $\mathcal{S}$, one needs only to tabulate the expectation $E(Y_i)$ and the $S$ relevant short-range covariances

$$C_k = C_k(\mathcal{S}) = \mathrm{Cov}(Y_i, Y_{i+k}) \tag{16}$$

for $0 \le k < S$ ($C_0 = \mathrm{Var}(Y_i)$). Alternatively, by factoring out the variance of $Y_i$, Equation 15 can also be expressed in terms of the correlations

$$\mathrm{Var}(\mathcal{S}(s)) = \mathrm{Var}(Y_i) \Big[ N - S + 1 + 2 \sum_{0 < j - i < S} \mathrm{Cor}(Y_i, Y_j) \Big] \tag{17}$$

To obtain the exact variance at each length $N$, it is then only a matter of counting how many times each type of covariance is present in the sequences and adjusting for any boundary effects as needed. If $N \ge 2S - 1$, then

$$\mathrm{Var}(\mathcal{S}(s)) = (N - S + 1)\,C_0 + 2 \sum_{k=1}^{S-1} (N - S - k + 1)\,C_k \tag{18}$$

If $S \le N < 2S - 1$, then

$$\mathrm{Var}(\mathcal{S}(s)) = (N - S + 1)\,C_0 + 2 \sum_{k=1}^{N-S} (N - S - k + 1)\,C_k \tag{19}$$

It is worth noticing that, for fixed $S$, both the expectation and the variance are linear in $N$. In particular, for large $N$

$$\mathrm{Var}(\mathcal{S}(s)) \approx N \Big[ C_0 + 2 \sum_{k=1}^{S-1} C_k \Big] = N \Big[ \sum_{k=-S+1}^{S-1} C_k \Big] = N \sigma_S^2 \tag{20}$$

In the last equality, for obvious symmetry reasons, we let $C_{-k} = C_k$. This notation will prove to be useful below.

In the case of repetitive sequences, it is also useful to calculate the expectation of $\mathcal{S}(p) = Y_1 + \dots + Y_P$, and its variance with periodic boundary conditions modulo $P$, i.e. assuming the variables $Y_1 \dots Y_P$ and the corresponding letters are arranged along a circle. Here both the expectation and the variance are directly proportional to $P$ and satisfy $E(\mathcal{S}(p)) = \mu_S P$ and $\mathrm{Var}(\mathcal{S}(p)) = \sigma_S^2 P$. Clearly, for any $P$, $E(\mathcal{S}(p)) = P\,E(Y_i)$ so

$$\mu_S = E(Y_i) \tag{21}$$

If $P \ge 2S - 1$,

$$\sigma_S^2 = C_0 + 2 \sum_{k=1}^{S-1} C_k = \sum_{k=-S+1}^{S-1} C_k \tag{22}$$

When $S \le P < 2S - 1$, all variables along the circle are dependent and therefore $\sigma_S^2 = \sigma_S^2(P)$ is given by

$$\sigma_S^2(P) = C_0 + 2 \sum_{k=1}^{n} C_k = \sum_{k=-n}^{n} C_k \tag{23}$$

when $P = 2n + 1$, and

$$\sigma_S^2(P) = C_0 + C_n + 2 \sum_{k=1}^{n-1} C_k = C_n + \sum_{k=-n+1}^{n-1} C_k \tag{24}$$

when $P = 2n$. Periodic boundary conditions must be used in the computation of the covariances $C_k$ whenever necessary ($|k| > P - S$). For a periodic sequence $r$, where the period $P$ as well as $S$ are small relative to the length $N = RP$, we can use:

$$E(\mathcal{S}(r)) \approx R\,E(\mathcal{S}(p)) \tag{25}$$

and the approximation

$$\mathrm{Var}(\mathcal{S}(r)) \approx R\,\mathrm{Var}(\mathcal{S}(p)) \tag{26}$$

For long repetitive sequences with period $P < S$, we can use the same approach with a larger period $P'$, multiple of $P$, so that $S \le P'$.

2.4.1 Central Limit Theorem

$\mathcal{S}(s)$ consists of a sum of identically distributed but non-independent random variables. Therefore standard central limit theorems for sums of independent random variables cannot be applied. Yet, because the dependencies are local, a sum $Z = Y_1 + \dots + Y_K$ of $K$ m-dependent random variables $Y_i$ still approaches a normal distribution. This can be shown using the theorem in (Baldi & Rinott, 1989), which also provides a bound on the rate of convergence. Here we use the improved bound found in (Rinott & Dembo, 1996). We let $\max_i |Y_i - E(Y_i)| = B$, and $E\big(\sum_{i=1}^{K} |Y_i - E(Y_i)|\big)/K = \beta$. For all the scales to be considered, these constants are well defined and easy to compute. Under these assumptions,

$$\left| P\!\left( \frac{Z - E(Z)}{\sqrt{\mathrm{Var}(Z)}} \le u \right) - \Phi(u) \right| \le \frac{7 K \beta}{[\mathrm{Var}(Z)]^{3/2}}\, (2S - 1)^2 B^2 \tag{27}$$

where $\Phi(u)$ is the normalized Gaussian distribution function. The factor $(2S - 1)$ represents the size of the clusters associated with m-dependence. For a fixed scale, such size is constant, but the theorem remains true if $S$ grows slowly with $N$. Thus Equation 27 can readily be applied to $\mathcal{S}(s)$ or $\mathcal{S}(r)$ with $K = N - S + 1$ or $K = RP$. From Equation 20, the variance of the sequences being considered is linear in their length: $\mathrm{Var}(\mathcal{S}(s)) \approx N \sigma_S^2$, where $\sigma_S^2$ depends only on the scale $\mathcal{S}$. Thus we obtain a convergence rate that scales at most like $1/\sqrt{N}$:

$$\left| P\!\left( \frac{Z - E(Z)}{\sqrt{\mathrm{Var}(Z)}} \le u \right) - \Phi(u) \right| \le \frac{C}{\sqrt{N}} \tag{28}$$

with $C \approx 7 (2S - 1)^2 B^2 \beta\, (\sigma_S^2)^{-3/2}$. The rate of this bound is known to be essentially optimal (similar to the Berry-Esseen theorems (Feller, 1971)).

2.4.2 Normalized Distances and Extremal Sequences

The value of $\mathcal{S}(s)$ or $\mathcal{S}(r)$ of any sequence or repeat of length $N$ can be compared to the average value of a background population by computing a normalized Z-score of the form:

$$Z(s) = \frac{\mathcal{S}(s) - N \mu_S}{\sqrt{N}\, \sigma_S} \tag{29}$$

A repeat $r$ with period unit length $P$ and repetition $R$ ($N = RP$) can be compared to a background population of repeats, or to a background population of generic sequences. In the latter case, we have $\mathcal{S}(r) \approx \mathcal{S}(p) R$ or $\mathcal{S}(r) \approx \mu' N$, with $\mu' = \mathcal{S}(p)/P$. Therefore the Z-score

$$Z(r) \approx \frac{\mu' - \mu_S}{\sigma_S} \sqrt{N} \tag{30}$$

grows with $N$ and is larger than the Z-score $Z(p)$ computed on the repeat unit. In other words, if a repeat unit displays extremal features when compared to other repeat units of the same length, its expansion will appear even more extreme compared to the background of all sequences of similar length.

The Z-scores can be used to assess how extreme a sequence is and to search data bases for subsequences with extremal features. As in the case of alignments, this can also be done using extreme value distributions (Durbin et al., 1998). Note also that one can search a data base using a structural profile rather than extreme values. The degree of similarity between two profiles can be measured, for instance, using the standard mean square error.

2.4.3 Correlations Between Scales

It is useful to have some information regarding the degree of correlation between two scales and how such correlation behaves at all sequence lengths. Consider then two scales $\mathcal{S}_1$ and $\mathcal{S}_2$ of length $S_1$ and $S_2$. Without any loss of generality assume that $S_1 \le S_2$. Let $Y_i = \mathcal{S}_1(X_i \dots X_{i+S_1-1})$ and $Z_j = \mathcal{S}_2(X_j \dots X_{j+S_2-1})$. Two such variables are computed on overlapping windows, and can therefore be dependent, only if $0 \le j - i \le S_1 - 1$ or $0 \le i - j \le S_2 - 1$. It is sufficient to tabulate the finite set of $S_1 + S_2 - 1$ covariances

$$C_k = C_k(\mathcal{S}_1, \mathcal{S}_2) = \mathrm{Cov}(Y_i, Z_{i+k}) = E[(Y_i - E(Y_i))(Z_{i+k} - E(Z_{i+k}))] \tag{32}$$

with $S_1 \le S_2$ and $-S_2 + 1 \le k \le S_1 - 1$. These covariances can be used to compute correlations at all lengths by writing

$$\mathrm{Cov}(\mathcal{S}_1(s), \mathcal{S}_2(s)) = (N - S_2 + 1)\,\mathrm{Cov}(Y_i, Z_i) + \sum_{i \ne j} \mathrm{Cov}(Y_i, Z_j) \tag{33}$$

For large $N$ it is clear that, except for small boundary effects, each type of covariance occurs approximately $N$ times in the formula above. Therefore for large $N$, $\mathrm{Cov}(\mathcal{S}_1(s), \mathcal{S}_2(s))$ behaves approximately as

$$N \Big[ C_0 + \sum_{k=-S_2+1,\ k \ne 0}^{S_1-1} C_k \Big] = N \Big[ \sum_{k=-S_2+1}^{S_1-1} C_k \Big] \tag{34}$$

We have seen in Equation 20 that the variance of each scale is also asymptotically linear in the length $N$. Thus, as $N$ increases, the correlation $\mathrm{Cor}(\mathcal{S}_1(s), \mathcal{S}_2(s))$ rapidly converges to a constant given by:

$$\frac{\sum_{k=-S_2+1}^{S_1-1} C_k(\mathcal{S}_1, \mathcal{S}_2)}{\Big[ \big( \sum_{k=-S_1+1}^{S_1-1} C_k(\mathcal{S}_1) \big) \big( \sum_{k=-S_2+1}^{S_2-1} C_k(\mathcal{S}_2) \big) \Big]^{1/2}} \tag{35}$$

In checking calculations on DNA scales (or other alphabets) that are invariant under the reverse complement operation, it is worth noticing that with a uniform distribution on the alphabet ($p_A = p_C = p_G = p_T = 0.25$), the covariances are symmetric. That is, for any $0 < k < S_1$ we have $C_k(\mathcal{S}_1, \mathcal{S}_2) = C_{-k}(\mathcal{S}_1, \mathcal{S}_2)$. This results immediately from the fact that the sum of the terms $\mathcal{S}_2(X_1 \dots X_{S_2})\,\mathcal{S}_1(X_1 \dots X_{S_1})$ and $\mathcal{S}_2(\bar{X}_{S_2} \dots \bar{X}_1)\,\mathcal{S}_1(\bar{X}_{S_2} \dots \bar{X}_{S_2-S_1+1})$ is equal to the sum of the terms $\mathcal{S}_2(X_1 \dots X_{S_2})\,\mathcal{S}_1(X_{S_2-S_1+1} \dots X_{S_2})$ and $\mathcal{S}_2(\bar{X}_{S_2} \dots \bar{X}_1)\,\mathcal{S}_1(\bar{X}_{S_1} \dots \bar{X}_1)$, and similarly for other degrees of overlap. The terms in the sums can be identically paired using the fact that $\mathcal{S}_1$ and $\mathcal{S}_2$ are assumed to be reverse-complement invariant. The result is not true if the scales, or the distribution, are not reverse-complement invariant.
For sequences s of length N , we are interested in measuring the correlation between the random variables S1 (s) = Y1 + : : : + YN ;S1 +1 , with Yi = S1 (Xi : : : Xi+S1 ;1 ), and S2 (s) = Z1 + : : : + ZN ;S2 +1 , with Zi = S2 (Xi : : : Xi+S2 ;1 ). We have: ); S2 (s)) Cor(S1 (s); S2 (s)) = p Cov(S1 (sp (31) Var(S1 (s)) Var(S2 (s)) 3 Results 3.1 DNA Repeat Equivalence Classes Again only terms of the form Cov(Yi ; Zj ), where the distance between i and j is small, are non-zero. More precisely, non-zero terms can arise only if 0 j ; i S1 ; 1 We wrote a program that cycles through all possible DNA sequences of length P counting and listing all 8 the classes that are equivalent under circular permutation and reverse complement operations. Because of this equivalence, in the case of scales that are reversecomplement invariant, it is sucient to study the repeats of one representative member of each class. We ran the program up to length P = 12. The results are in complete agreement with Equations 9-12. -18.66 -8.11 -13.10 -1 1 -13.48 3 .0 -1 0 -14.00 1 -13.48 .0 .0 8 5 -1 5 Sequence length 1 2 3 4 5 6 7 8 9 10 11 12 .8 or proper classes are classes that do not contain a shorter periodic pattern. 1 -1 Table 1: Number of repeat unit equivalence classes. New C -9.45 -14.00 A Classes (total) Classes (new) 2 2 6 4 12 10 39 33 104 102 366 350 1,172 1,170 4,179 4,140 14,572 14,560 52,740 52,632 190,652 190,650 700,274 699,875 G -13.10 T -9.45 -8.11 -18.66 Figure 1: Dinucleotide prex automata for the propeller twist angle scale. The CAG repeat, for instance, is associated with the cycle C ! A ! G ! C in the graph and has a total propeller twist value of ;9 45 + ;14 00 ; 11 08 = ;34 53. The corresponding reverse complement cycle is given by C ! T ! G ! C. The triplet repeat class with the largest propeller twist value is CCC followed by CCG. : In Tables 2, 3, and 4 we list alphabetically all the members of each equivalence class for sequences of length 2, 3, and 4. 
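The enumeration described in Section 3.1 can be reproduced in a few lines. The following Python sketch is not the authors' program; it is a minimal reconstruction, assuming that each class is represented by its alphabetically first member and that a class is "new" (proper) when its representative is not a repetition of a shorter pattern. All function names are illustrative.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(unit):
    """Alphabetically first member of the class of `unit`.

    The class contains every circular permutation of the unit and of its
    reverse complement.
    """
    rc = reverse_complement(unit)
    return min(w[i:] + w[:i] for w in (unit, rc) for i in range(len(unit)))


def is_proper(unit):
    """True if `unit` is not a repetition of a shorter pattern."""
    n = len(unit)
    return not any(n % d == 0 and unit == unit[:d] * (n // d)
                   for d in range(1, n))


def repeat_classes(p):
    """Map each class representative of length p to its number of members."""
    return Counter(canonical("".join(t)) for t in product("ACGT", repeat=p))


def count_classes(p):
    """Return (total number of classes, number of new classes) at length p."""
    classes = repeat_classes(p)
    return len(classes), sum(1 for rep in classes if is_proper(rep))
```

Under these assumptions, `count_classes(p)` for p = 1 to 6 gives the first six rows of Table 1, and `Counter(repeat_classes(4).values())` gives the class sizes at P = 4. The cost grows as 4^P canonicalizations, so this brute-force approach is only practical for short units.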
When $P = 4$, for instance, one finds 39 classes: 27 classes with 8 elements, 8 classes with 4 elements, and 4 classes with 2 elements. Only 33 classes are new, in the sense that 6 classes are derived from patterns already encountered at $P = 1$ and $P = 2$. Likewise, when $P > 2$ is a prime number, the total number of classes is given by:

$$\frac{4^P - 4}{2P} + 2 \tag{36}$$

with two classes of size 2 associated with poly-A and poly-C, while all the remaining classes are new and contain $2P$ members.

In the Appendix, we provide tables in alphabetical order that allow one to invert Tables 3 and 4, i.e. to find the class associated with any given $P$-tuple ($P = 3, 4$).

Table 2: Dinucleotide classes equivalent under circular permutation and reverse complement operations. Classes are numbered in alphabetical order vertically. Class members are listed in alphabetical order horizontally. Classes 1 and 5 are not proper dinucleotide classes.

Class Number   List of Members (Alphabetical Order)
1              AA TT
2              AC CA GT TG
3              AG CT GA TC
4              AT TA
5              CC GG
6              CG GC

Table 3: Trinucleotide classes equivalent under circular permutation and reverse complement operations. Classes are numbered in alphabetical order vertically. Class members are listed in alphabetical order horizontally.

Class Number   List of Members (Alphabetical Order)
1              AAA TTT
2              AAC ACA CAA GTT TGT TTG
3              AAG AGA CTT GAA TCT TTC
4              AAT ATA ATT TAA TAT TTA
5              ACC CAC CCA GGT GTG TGG
6              ACG CGA CGT GAC GTC TCG
7              ACT AGT CTA GTA TAC TAG
8              AGC CAG CTG GCA GCT TGC
9              AGG CCT CTC GAG GGA TCC
10             ATC ATG CAT GAT TCA TGA
11             CCC GGG
12             CCG CGC CGG GCC GCG GGC

Table 4: Tetranucleotide classes equivalent under circular permutation and reverse complement operations. Classes are numbered in alphabetical order vertically. Class members are listed in alphabetical order horizontally.

Class Number   List of Members (Alphabetical Order)
1              AAAA TTTT
2              AAAC AACA ACAA CAAA GTTT TGTT TTGT TTTG
3              AAAG AAGA AGAA CTTT GAAA TCTT TTCT TTTC
4              AAAT AATA ATAA ATTT TAAA TATT TTAT TTTA
5              AACC ACCA CAAC CCAA GGTT GTTG TGGT TTGG
6              AACG ACGA CGAA CGTT GAAC GTTC TCGT TTCG
7              AACT ACTA AGTT CTAA GTTA TAAC TAGT TTAG
8              AAGC AGCA CAAG CTTG GCAA GCTT TGCT TTGC
9              AAGG AGGA CCTT CTTC GAAG GGAA TCCT TTCC
10             AAGT ACTT AGTA CTTA GTAA TAAG TACT TTAC
11             AATC ATCA ATTG CAAT GATT TCAA TGAT TTGA
12             AATG ATGA ATTC CATT GAAT TCAT TGAA TTCA
13             AATT ATTA TAAT TTAA
14             ACAC CACA GTGT TGTG
15             ACAG AGAC CAGA CTGT GACA GTCT TCTG TGTC
16             ACAT ATAC ATGT CATA GTAT TACA TATG TGTA
17             ACCC CACC CCAC CCCA GGGT GGTG GTGG TGGG
18             ACCG CCGA CGAC CGGT GACC GGTC GTCG TCGG
19             ACCT AGGT CCTA CTAC GGTA GTAG TACC TAGG
20             ACGC CACG CGCA CGTG GCAC GCGT GTGC TGCG
21             ACGG CCGT CGGA CGTC GACG GGAC GTCC TCCG
22             ACGT CGTA GTAC TACG
23             ACTC AGTG CACT CTCA GAGT GTGA TCAC TGAG
24             ACTG AGTC CAGT CTGA GACT GTCA TCAG TGAC
25             AGAG CTCT GAGA TCTC
26             AGAT ATAG ATCT CTAT GATA TAGA TATC TCTA
27             AGCC CAGC CCAG CTGG GCCA GCTG GGCT TGGC
28             AGCG CGAG CGCT CTCG GAGC GCGA GCTC TCGC
29             AGCT CTAG GCTA TAGC
30             AGGC CAGG CCTG CTGC GCAG GCCT GGCA TGCC
31             AGGG CCCT CCTC CTCC GAGG GGAG GGGA TCCC
32             ATAT TATA
33             ATCC ATGG CATC CCAT GATG GGAT TCCA TGGA
34             ATCG CGAT GATC TCGA
35             ATGC CATG GCAT TGCA
36             CCCC GGGG
37             CCCG CCGC CGCC CGGG GCCC GCGG GGCG GGGC
38             CCGG CGGC GCCG GGCC
39             CGCG GCGC

3.2 Analysis of DNA Repeats by Dinucleotide Scales

In the case of dinucleotide scales, the prefix automaton contains 4 nodes (Figure 1). Each DNA sequence is associated with a path through the corresponding graph, and exact repeats are associated with cycles. All paths, including cycles, of length greater than 4 are composite in the sense that they contain a cycle of length 4 or less.

In Table 5, we list the dinucleotide scale values $\mathcal{S}(X_1 X_2) + \mathcal{S}(X_2 X_1)$ for the 6 equivalence classes associated with all 16 possible dinucleotide repeats of the form $(X_1 X_2)^R$. For each scale, we list classes (represented by their first alphabetical member) and the corresponding scale value, in decreasing value order. The highest level of base stacking energy is achieved by the AT repeat class (-10.39) and the lowest by the CG repeat class (-24.28). The rankings of all possible dinucleotide repeats induced by the propeller twist and the protein deformability scales are identical with the exception of an inversion between the CC (-16.22 and 12.2) and CG (-21.11 and 16.1) classes at the high (flexible) end of the spectrum. At the opposite (stiff) end, we find the single letter repeat class AA (-37.32 and 5.8) followed by the proper dinucleotide repeat class AG (-27.48 and 6.6).

Table 5: Dinucleotide structural scale values for repeat unit $p = X_1 X_2$ with $P = 2$. $\mathcal{S}(p) = \mathcal{S}(X_1 X_2) + \mathcal{S}(X_2 X_1)$.

Base Stacking    Propeller Twist    Protein Deformability
AT (-10.39)      CC (-16.22)        CG (16.1)
AA (-10.74)      CG (-21.11)        CC (12.2)
CC (-16.52)      AC (-22.55)        AC (12.1)
AG (-16.59)      AT (-26.86)        AT (7.9)
AC (-17.08)      AG (-27.48)        AG (6.6)
CG (-24.28)      AA (-37.32)        AA (5.8)

In Table 6, we list the dinucleotide scale values for the 12 equivalence classes associated with all possible triplet repeats of the form $(XYZ)^R$. In this special case, we find the results of (Baldi et al., 1999). The high and low ends of the base stacking energy scale are occupied by the triplet classes AAT (-15.76) and CCG (-32.54) respectively. We find again a high degree of correlation between the propeller twist and protein deformability scales. If we exclude the classes AAA/TTT (-55.98) and CCC/GGG (-24.33), which are not proper triplet repeat classes, then the maximum and the minimum of the propeller twist spectrum are respectively occupied by the classes CCG (-29.22) and AAG (-46.14). A similar ranking with the same extremal triplets is observed with the protein deformability scale: CCG (22.2) occupies the high end, whereas AAA (8.7) and AAG (9.5) occupy the low end of the spectrum.

When considering all three dinucleotide scales, three minima and two maxima are occupied by two of the three repeat classes known to be involved in triplet repeat expansion diseases, namely AAG and CCG. GAA triplet (in the AAG class) expansion is associated with Friedreich's ataxia (Orr et al., 1993; Campuzano et al., 1996; Junck & Fink, 1996; Paulson et al., 1997; Koenig, 1998; Lee, 1998; Orr & Zoghbi, 1998; Paulson, 1998; Pulst, 1998; Stevanin et al., 1998). Abnormal GCC triplet (in the CCG class) expansion is associated with FRAXE mental retardation and abnormal expansion of the CGG triplet with Fragile X syndrome (FRAXA) (Nelson, 1995; Gusella & MacDonald, 1996; Eichler & Nelson, 1998; Skinner et al., 1998; Gecz & Mulley, 1999). The third triplet expansion disease related class, AGC, has average rank in all dinucleotide scales.

Table 6: Dinucleotide structural scale values for repeat unit $p = X_1 X_2 X_3$ with $P = 3$. $\mathcal{S}(p) = \mathcal{S}(X_1 X_2) + \mathcal{S}(X_2 X_3) + \mathcal{S}(X_3 X_1)$. Repeat classes associated with triplet repeat expansion diseases are marked with an asterisk.

Base Stacking     Propeller Twist    Protein Deformability
AAT  (-15.76)     CCC  (-24.33)      CCG* (22.2)
AAA  (-16.11)     CCG* (-29.22)      ACG  (18.9)
ACT  (-21.11)     ACC  (-30.66)      CCC  (18.3)
AAG* (-21.96)     AGC* (-34.53)      ACC  (18.2)
AAC  (-22.45)     AGG  (-35.59)      AGC* (15.9)
ATC  (-22.95)     ACG  (-36.61)      ATC  (15.9)
CCC  (-24.78)     ATC  (-37.94)      AAC  (15.0)
AGG  (-24.85)     ACT  (-38.95)      AGG  (12.7)
ACC  (-25.34)     AAC  (-41.21)      AAT  (10.8)
AGC* (-27.94)     AAT  (-45.52)      ACT  (10.7)
ACG  (-30.01)     AAG* (-46.14)      AAG* (9.5)
CCG* (-32.54)     AAA  (-55.98)      AAA  (8.7)

In Table 7, we list the scale values for the 39 equivalence classes associated with all possible tetranucleotide repeats of the form $(X_1 X_2 X_3 X_4)^R$. The maximum of the base stacking scale is occupied by the dinucleotide repeat ATAT (-20.78) and the proper tetranucleotide repeat AAAT (-21.13). The minimum corresponds to CGCG (-48.56) followed by ACGC (-41.36). We again observe a substantial positive correlation between the values produced by the propeller twist and protein deformability scales together with a weaker negative correlation with respect to the base stacking energy scale. The high end of the propeller twist scale is occupied by CCCC (-32.44) and CCCG (-37.33) while that of the protein deformability scale is occupied by CGCG (32.2) and CCCG (28.3). The lowest values correspond for both scales to AAAA (-74.64 and 11.6) and AAAG (-64.80 and 12.4).

Table 7: Dinucleotide structural scale values for repeat unit $p = X_1 X_2 X_3 X_4$ with $P = 4$. $\mathcal{S}(p) = \mathcal{S}(X_1 X_2) + \mathcal{S}(X_2 X_3) + \mathcal{S}(X_3 X_4) + \mathcal{S}(X_4 X_1)$.

Base Stacking     Propeller Twist    Protein Deformability
ATAT (-20.78)     CCCC (-32.44)      CGCG (32.2)
AAAT (-21.13)     CCCG (-37.33)      CCCG (28.3)
AATT (-21.13)     CCGG (-37.33)      CCGG (28.3)
AAAA (-21.48)     ACCC (-38.77)      ACGC (28.2)
AACT (-26.48)     CGCG (-42.22)      ATGC (25.2)
AAGT (-26.48)     AGCC (-42.64)      ACCG (25.0)
AGAT (-26.98)     AGGC (-42.64)      ACGG (25.0)
AAAG (-27.33)     ACGC (-43.66)      CCCC (24.4)
ACAT (-27.47)     AGGG (-43.70)      ACCC (24.3)
AAAC (-27.82)     ACCG (-44.72)      ACAC (24.2)
AATC (-28.32)     ACGG (-44.72)      ACGT (23.0)
AATG (-28.32)     ATGC (-44.99)      AGCG (22.7)
ACCT (-29.37)     ACAC (-45.10)      ATCG (22.7)
AAGG (-30.22)     ATCC (-46.05)      AGCC (22.0)
AACC (-30.71)     ACCT (-47.06)      AGGC (22.0)
ATCC (-31.21)     ACGT (-48.08)      ATCC (22.0)
AGCT (-31.97)     AGCG (-48.59)      AACG (21.8)
CCCC (-33.04)     AACC (-49.32)      AACC (21.1)
AGGG (-33.11)     ACAT (-49.41)      ACAT (20.0)
AGAG (-33.18)     ACAG (-50.03)      AAGC (18.8)
AAGC (-33.31)     ACTC (-50.03)      AATC (18.8)
ACCC (-33.60)     ACTG (-50.03)      AATG (18.8)
ACAG (-33.67)     AGCT (-50.93)      AGGG (18.8)
ACTC (-33.67)     ATCG (-52.00)      ACAG (18.7)
ACTG (-33.67)     AAGC (-53.19)      ACTC (18.7)
ACAC (-34.16)     ATAT (-53.72)      ACTG (18.7)
ATGC (-34.30)     AAGG (-54.25)      AAAC (17.9)
ACGT (-34.53)     AGAT (-54.34)      ACCT (16.8)
AACG (-35.38)     AGAG (-54.96)      ATAT (15.8)
ATCG (-35.88)     AACG (-55.27)      AAGG (15.6)
AGCC (-36.20)     AATC (-56.60)      AGAT (14.5)
AGGC (-36.20)     AATG (-56.60)      AGCT (14.5)
ACCG (-38.27)     AACT (-57.61)      AAAT (13.7)
ACGG (-38.27)     AAGT (-57.61)      AATT (13.7)
CCCG (-40.80)     AAAC (-59.87)      AACT (13.6)
CCGG (-40.80)     AAAT (-64.18)      AAGT (13.6)
AGCG (-40.87)     AATT (-64.18)      AGAG (13.2)
ACGC (-41.36)     AAAG (-64.80)      AAAG (12.4)
CGCG (-48.56)     AAAA (-74.64)      AAAA (11.6)

All repeat units of length greater than 4 are made up of shorter cyclic paths in the prefix automaton and therefore their properties can essentially be predicted from the previous three tables. For all lengths, for instance, the highest level of base stacking energy is achieved by the class ATATATAT... when $P$ is even, and by the class AATATATAT... when $P$ is odd. The lowest level is achieved by the class CGCGCG... when $P$ is even, and CCGCGCG... when $P$ is odd. For protein deformability, the maximal level is achieved by the class CGCGCG... when $P$ is even, and by CCGCGCG... when $P$ is odd. The lowest level is associated with poly-A (i.e. $(A)^P$). Poly-C and poly-A give also the absolute highest and lowest propeller twist angles at all lengths.

3.3 Analysis of DNA Repeats by Trinucleotide Scales

In the case of trinucleotide scales, the prefix automaton contains 16 nodes (Figure 2), each one labeled with a different dinucleotide. All paths, including cycles, of length greater than 16 are composite, i.e. contain at least one cycle of length 16 or less.

Figure 2: Trinucleotide prefix automata for the bendability scale. Circle is used for ease of display but does not represent actual connections. The CAG repeat, for instance, is associated with the cycle CA → AG → GC → CA in the graph and has a total bendability value of 0.175 + 0.017 + 0.076 = 0.268. It is the highest bendability value for any triplet repeat. Other edges are not shown.

The trinucleotide scale values for all repeats with periodic unit length $P = 2$ are given in Table 8. The highest level of bendability is achieved by AT (0.364) and the lowest by AA (-0.548) and CG (-0.154). The highest level of position preference is achieved by AA (72) and CG (50), and the lowest by AG (17).

Table 8: Trinucleotide structural scale values for repeat unit $p = X_1 X_2$ with $P = 2$. $\mathcal{S}(p) = \mathcal{S}(X_1 X_2 X_1) + \mathcal{S}(X_2 X_1 X_2)$.

Bendability     Position Preference
AT (0.364)      AA (72)
AG (0.058)      CG (50)
AC (0.034)      AT (26)
CC (-0.024)     CC (26)
CG (-0.154)     AC (23)
AA (-0.548)     AG (17)

The trinucleotide scale values for all repeats with periodic unit length $P = 3$ are given in Table 9 (see also (Baldi et al., 1999)). The highest level of bendability is achieved by the class AGC (0.268) and the lowest by AAA (-0.822) and ACC (-0.238). In fact only two classes of repeats (AGC and ATC) have positive bendability and are well separated from the rest. The highest level of position preference is achieved by the class AAA (108) followed by CCG (72), and the lowest by AGG and AAC (21).

Table 9: Trinucleotide structural scale values for repeat unit $p = X_1 X_2 X_3$ with $P = 3$. $\mathcal{S}(p) = \mathcal{S}(X_1 X_2 X_3) + \mathcal{S}(X_2 X_3 X_1) + \mathcal{S}(X_3 X_1 X_2)$. Repeat classes associated with triplet repeat expansion diseases are marked with an asterisk.

Bendability       Position Preference
AGC* (0.268)      AAA  (108)
ATC  (0.218)      CCG* (72)
AGG  (-0.013)     AAT  (63)
AAT  (-0.030)     ACG  (47)
CCC  (-0.036)     AGC* (40)
ACG  (-0.049)     CCC  (39)
ACT  (-0.068)     ACT  (35)
AAG* (-0.091)     ACC  (33)
CCG* (-0.106)     ATC  (33)
AAC  (-0.196)     AAG* (27)
ACC  (-0.238)     AAC  (21)
AAA  (-0.822)     AGG  (21)

The class AGC, which contains the CAG repeat responsible for the majority of the known triplet repeat expansion diseases, has the highest bendability. It is the only repeat class for which all three shifted triplets have a high individual bendability. Moreover, this class has a relatively low position preference value, another sign of flexibility. Therefore one can hypothesize that long CAG repeats correspond to stretches of DNA that are highly flexible in all positions. Consistently with their high flexibility, CAG/CTG repeats have been found to have the highest affinity for histones among all possible triplet repeats (Wang & Griffith, 1994; Wang & Griffith, 1995; Godde & Wolffe, 1996). Other DNA sequences can adopt long range curvature only if they contain highly flexible triplets in phase with the helical pitch (roughly every 10.5 bp). The flexibility of extended CAG repeats has been verified experimentally (Chastain & Sinden, 1998).

The CCG class, which contains the disease-related triplets CGG and GCC, is found at the high (rigid) end of the position preference scale (72), exceeded only by poly-A. This class is also stiff according to the bendability scale (-0.106). This is consistent with the fact that CGG/CCG repeats seem completely unable to form nucleosomes (Wang et al., 1996; Godde et al., 1996). The AAG class, which contains the disease related triplet GAA, occupies the lower (flexible) end of the position preference scale (27). It is the second lowest considering that the last two classes have the same value (21).

We also note that AAA/TTT is by far the stiffest of all possible repeats according to both scales. Such homopolymeric tracts are known from X-ray crystallography to be rigid and straight (Nelson et al., 1987) and they are bad candidates for nucleosome positioning. In fact, a number of promoters in yeast contain homopolymeric dA:dT elements. Studies in two different yeast species have shown that the homopolymeric elements destabilize nucleosomes and thereby facilitate the access of transcription factors bound nearby (Iyer & Struhl, 1995; Zhu & Thiele, 1996). Interestingly, the sequence of the IT15 gene involved in Huntington Disease has a repeat containing 18 adenine nucleotides at its 3' end.

Whereas the class CCG is extremely rigid according to the trinucleotide scales, it is extremely flexible according to the dinucleotide scales. Similarly, the predicted flexibility of the AAG class according to the position preference scale is in contradiction with the results obtained using all other di- or trinucleotide scales. Such discrepancies can result from imperfections of the scales, or from the fact that each scale captures a different facet of DNA structure. Dinucleotide and trinucleotide scales are in good agreement for CAG repeats and homopolymeric poly-A tracts.

The trinucleotide scale values for all repeats with periodic unit length $P = 4$ are given in Table 10. The highest level of bendability is achieved by the class ATAT (0.728), which is rather a dinucleotide repeat, followed by the proper tetranucleotide repeat ATGC (0.420). On the opposite end of the scale, we find AAAA (-1.096) and AAAC (-0.470). For the position preference scale, AAAA (144), AATT (100), CGCG (100), AAAT (99) are on the higher end, AAAT being the first proper tetranucleotide repeat, while ACGG (23) occupies the lower end of the spectrum.

Table 10: Trinucleotide structural scale values for repeat unit $p = X_1 X_2 X_3 X_4$ with $P = 4$. $\mathcal{S}(p) = \mathcal{S}(X_1 X_2 X_3) + \mathcal{S}(X_2 X_3 X_4) + \mathcal{S}(X_3 X_4 X_1) + \mathcal{S}(X_4 X_1 X_2)$.

Bendability       Position Preference
ATAT (0.728)      AAAA (144)
ATGC (0.420)      AATT (100)
ACAT (0.335)      CGCG (100)
AGGC (0.301)      AAAT (99)
AGCT (0.214)      CCGG (94)
AGAT (0.189)      AGCG (89)
ACAG (0.183)      AGCT (86)
ACTG (0.173)      CCCG (85)
AGAG (0.116)      AGCC (80)
ACTC (0.082)      ATCG (76)
ACAC (0.068)      AATG (68)
AGCC (0.053)      AGGC (68)
AAGC (0.027)      AAAG (63)
ACCT (0.026)      ACGC (63)
AATG (0.011)      ATGC (62)
ACGC (0.006)      AAAC (57)
ACGT (-0.016)     AACG (57)
AGGG (-0.025)     AACT (55)
AGCG (-0.032)     AATC (54)
CCCC (-0.048)     AAGC (53)
CCGG (-0.058)     ATAT (52)
CCCG (-0.118)     CCCC (52)
AAGG (-0.162)     ACCG (49)
ACGG (-0.169)     AGAT (47)
AAGT (-0.171)     ACAC (46)
AATC (-0.181)     ACCC (46)
ACCG (-0.184)     ACTC (44)
ATCC (-0.209)     AAGT (43)
ATCG (-0.226)     ACAT (43)
AACT (-0.230)     ACCT (40)
ACCC (-0.250)     ATCC (38)
AACG (-0.278)     AGAG (34)
AAAT (-0.304)     AGGG (34)
CGCG (-0.308)     AACC (31)
AAAG (-0.365)     AAGG (31)
AATT (-0.424)     ACTG (29)
AACC (-0.468)     ACGT (28)
AAAC (-0.470)     ACAG (25)
AAAA (-1.096)     ACGG (23)

In order to find the most extreme repeats for a given scale at a given repeat unit length, one would have to explore scale values for repeat units up to length $P = 16$ (see Section 2.3). Because of particular values of the scale, in some cases the results tabulated above for values of $P$ up to 4 only are sufficient. For instance, the most bendable repeat with $P = 2n$ is always ATATAT..., while the least bendable is poly-A. Similarly, the highest value of the position preference scale is always occupied by poly-A. Extremal results can also be derived by dynamic programming. In many cases, however, a sequence of interest may have a very high or low score according to a given scale, without being the most extreme. The probabilistic theory provides the means to quantify directly how extreme any given sequence is with respect to a given family or background.

3.4.1 Convergence to Normality

Using sampling methods we also studied the convergence of $\mathcal{S}(s)$ and $\mathcal{S}(p)$ to a normal distribution as the length $N$ or $P$ of the sequences is increased, as predicted by our central limit theorem. In practice, the convergence rate is very fast. As an example, a histogram of bendability values for repeat units of length 5, 10, and 15 is given in Figure 3.
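The linear dependence of the expectation and variance of $\mathcal{S}(p)$ on $P$ (Equations 21 and 22) can be checked exactly for small $P$ by enumerating all $4^P$ repeat units. The sketch below does this for a dinucleotide scale ($S = 2$) under a uniform background and circular boundary conditions. `TOY_SCALE` is a made-up scale used only for illustration; it is not one of the structural scales discussed here, and all function names are hypothetical. Exact rational arithmetic is used so that the comparison does not depend on floating-point rounding.

```python
from fractions import Fraction
from itertools import product

ALPHABET = "ACGT"
S = 2  # scale length (dinucleotide)

# Hypothetical dinucleotide scale, for illustration only.
TOY_SCALE = {a + b: Fraction((3 * i + 5 * j * j) % 7 - 3)
             for i, a in enumerate(ALPHABET)
             for j, b in enumerate(ALPHABET)}


def unit_score(unit, scale=TOY_SCALE, s=S):
    """S(p): sum of the scale over the P circular windows of the unit."""
    doubled = unit + unit[:s - 1]
    return sum(scale[doubled[i:i + s]] for i in range(len(unit)))


def mean_var(values):
    """Population mean and variance of a collection of exact values."""
    values = list(values)
    n = len(values)
    mean = sum(values) / n
    return mean, sum((v - mean) ** 2 for v in values) / n


def local_coefficients(scale=TOY_SCALE):
    """Return (E(Y_i), C0, C1) for a dinucleotide scale, uniform background."""
    mu, c0 = mean_var(scale.values())
    triples = ["".join(t) for t in product(ALPHABET, repeat=3)]
    # C1 = Cov(Y_i, Y_{i+1}): two windows sharing exactly one letter.
    c1 = sum(scale[t[:2]] * scale[t[1:]] for t in triples) / len(triples) - mu ** 2
    return mu, c0, c1


def exhaustive_moments(p):
    """Exact mean and variance of S(p) over all 4**p repeat units."""
    return mean_var(unit_score("".join(t)) for t in product(ALPHABET, repeat=p))
```

For any $P \ge 2S - 1 = 3$, `exhaustive_moments(P)` should return exactly `P * mu` and `P * (c0 + 2 * c1)`, with `mu, c0, c1 = local_coefficients()`. The case $P = 2$ is deliberately excluded: there the two circular windows share both letters, which is the situation handled by Equation 24.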
Similar results are observed with plain sequences. 300 200 3.4 Probabilistic Analysis of DNA Scales P=5 µ=−0.0923 σ=0.0771 P=10 µ=−0.185 σ=0.154 P=15 µ=−0.277 σ=0.231 100 0 4 x 10 For simplicity, we rst assume a uniform distribution pA = pC = pG = pT . In specic applications, other distributions can be used, such as the background distribution of a given genome or a given class of DNA sequences. We can then use Equations 21-24 to calculate the expectation and variance of S (p) across all possible repeat unit patterns p and all scales S . In particular, E (S (p)) = S P and Var(S (p)) = S P . In Table 11 we list the relevant coecients for the dinucleotide scales. 2.5 2 1.5 1 0.5 0 6 x 10 10 5 0 Figure 3: Histogram of bendability values S ( ) for all possip ble repeat units of length = 5 10 15. Vertical dashed lines represent standard deviation units. P Table 11: Basic intra-scale coecients for dinucleotide scales with repeat unit length (2 ; 1) = 3. P S = E (Yi ) C0 = Var(Yi ) C1 = Cov((Yi Yi+1 ) S = Var(S (p)) ; ; S 3.4.2 Examples of Z-Scores for Disease Triplets BS PT PD -8.08 -12.59 4.96 6.62 9.68 9.62 0.31 2.26 -2.19 7.23 14.20 5.23 We have seen that the triplets involved in expansion diseases often tend to have extremal structural properties. This was assessed by computing the scores S (p). We can now also compute Z-scores using Equations 29 and 30, as in Table 13. In Table 12 we list the relevant coecients for the trinucleotide scales. Table 13: Z-scores for the repeats involved in HD and FRAXA for the bendability and propeller twist scales. ( ) is the Z-score of the repeat unit against the background of all possible repeat units of same length = 3. ( ) is the Z-score of a long repeat containing repeat units against the background of all possible sequences of same length = , including non repetitive sequences. Values of are chosen at the characteristic low end and high end of each disease. 
Z p Table 12: Basic intra-scale coecients for trinucleotide scales with repeat unit length (2 ; 1) = 5. P P S Z r R N B PP S = E (Yi ) -0.018 13.78 C0 = Var(Yi ) 0.015 103.108 C1 = Cov(Yi Yi+1 ) 0.0015 18.214 C2 = Cov(Yi Yi+2 ) -0.001 -2.558 S = Var(S (p) 0.015 134.42 R Disease Triplet Scale Z (p) R (low end) Z (r) R (high end) Z (r) To double-check the mathematical formula, all the constants above were also obtained independently by exhaustive sampling. 14 HD CAG B 1.60 36 9.00 121 16.52 FRAXA CGG PP 1.56 200 21.48 1000 48.23 RP 3.4.4 Correlations Between Scales: Asymptotic Values When repeat length is taken into consideration, disease causing repeats p appear to be even more extreme because of the N factor in Equation 30. For example, a CGG repeat of length N = 3; 000 (R = 1; 000) observed in FRAXA patients is more than 48 standard deviations away from the mean propeller twist value of random uniform sequences of the same length. If the same correlations are computed by shuing the 4.6 Mbp of the Escherichia coli genome randomly over a length of 31bp, one obtains the numbers given in Table 15 (Pedersen et al., 2000). The correlation between BS and PT is even higher (-0.744) and so is the correlation of AT-content with BS and PT (0.899 and -0.882). Incidentally, when measured on the actual Escherichia coli genome the correlations are even higher. For instance, the correlation between BS and PT becomes -0.825. 3.4.3 Correlations Between Scales at Short Lengths We can use Equations 31-35 to study the correlations between the scales at short lengths and asymptotically. Clearly we can also consider AT-content as a scale. It can be viewed, for instance, as a mononucleotide scale with value 1 for A and T, and 0 for C and G. This scale is trivially reverse-complement invariant and perfectly additive. We include it to see whether it correlates strongly with any of the structural scales, especially asymptotically. 
Table 14: Correlations between the scales at a given position ($C_0(S_1, S_2)$). AT-content (AT), Base Stacking (BS), Propeller Twist (PT), Protein Deformability (PD), Bendability (B), Position Preference (PP).

      AT        BS        PT        PD        B         PP
AT    1
BS    0.478     1
PT   -0.539    -0.294     1
PD   -0.294     0.043     0.668     1
B    -0.0098   -0.018     0.249     0.141     1
PP    0.040    -0.089    -0.123    -0.036    -0.080     1

Correlations at a given position are given in Table 14. Consistent with (Baldi et al., 1998) and the results above, the correlations between the structural scales are very low, with the exception of PT and PD (0.668). BS and PT also have non-trivial, opposite correlations with AT-content (0.48 and -0.54). Here correlations between a dinucleotide scale $S_1$ and a trinucleotide scale $S_2$ are computed using sums of the form $\sum_{X_1 X_2 X_3} S_1(X_1 X_2) S_2(X_1 X_2 X_3)$. Because the third nucleotide does not appear in the dinucleotide scale, in (Baldi et al., 1998) correlations between dinucleotide and trinucleotide scales were also computed using, for the dinucleotide scale, the sum $S_1(X_1 X_2) + S_1(X_2 X_3)$. When considering neighboring dinucleotides, the correlation between BS and PT, for instance, increases in magnitude from -0.294 to -0.550. This effect must be caused by correlations that are present in runs of overlapping dinucleotides, but not in the single dinucleotides. Such a phenomenon may arise if the physical reality behind both scales is that the structure actually depends on more than a dinucleotide step, and this is very likely to be the case.

Table 15: Correlations between the scales measured over 31 bp random segments from Escherichia coli.

      AT        BS        PT        PD        B         PP
AT    1
BS    0.899     1
PT   -0.882    -0.744     1
PD   -0.777    -0.805     0.801     1
B    -0.153    -0.181     0.370     0.108     1
PP    0.023    -0.029    -0.154     0.062    -0.206     1

These results are easily explained by the theory developed here. The asymptotic correlations between the scales, computed using Equation 35, are displayed in Table 16. Because the Escherichia coli genome has a nucleotide distribution close to uniform, the results are indeed remarkably similar to Table 15, and would be identical up to sampling fluctuations if in Table 16 we had used the precise distribution for E. coli (A=0.2462, C=0.2542, G=0.2537, T=0.2459) instead of a uniform distribution.

Table 16: Asymptotic correlations between the scales using a uniform distribution.

      AT        BS        PT        PD        B         PP
AT    1
BS    0.914     1
PT   -0.891    -0.757     1
PD   -0.798    -0.843     0.810     1
B    -0.167    -0.193     0.387     0.113     1
PP    0.046    -0.003    -0.175     0.051    -0.225     1

It is essential to notice that the asymptotic values do not require very long sequences but are approximately correct already at a length scale of 15-20 base pairs or so (Figure 4). Asymptotically, and with a uniform distribution, all the dinucleotide scales have strong positive or negative correlations with each other and with AT-content. Notice that this is not the case for the trinucleotide scales. It is also not necessarily the case if the correlations are measured with respect to other nucleotide distributions.

[Figure 4 plot: correlation versus string length N; curves labeled PT/PD, PT/B, AT/B, PD/B, PD/PP, BS/PP, PT/PP, BS/B, B/PP, BS/PT, BS/PD.]

Figure 4: Rapid convergence of correlations between pairs of scales to the theoretically predicted asymptotic values as a function of string length with uniform nucleotide distribution and free boundary conditions. Curves start at N = 2 for pairs of dinucleotide scales, and N = 3 for all other pairs. Correlations are calculated exactly up to N = 12, and using a random uniform sample of 70,000,000 points for N = 17, 22, 27, 32.

Figure 5 shows how the asymptotic correlation of bendability with AT content varies with the underlying distribution. Similar surfaces for all other scales are given in Figure 8 in the Appendix. Notice that in general $A/(A+T) \approx 0.5$ in eukaryotic genomic DNA as soon as sufficiently long stretches of DNA are taken into consideration (Chargaff's second parity rule) (Prabhu, 1993; Bell & Forsdyke, 1999). This is not necessarily the case with, for instance, relatively short stretches of DNA, synthetic DNA, or with bacterial DNA that contains a strong composition skew associated with independently replicated regions.

[Figure 5 plot: surface of the correlation AT/B as a function of AT content and A/(A+T) ratio.]

Figure 5: Surface representing the asymptotic correlation between the bendability scale and the AT content scale for different distribution values of AT-content and A/T proportions.

3.5 Analysis of a Set of Expandable Repeats in Primate Genomes

Because triplets involved in disease expansions seem to have extremal properties which may be related to the expansion mechanism, it is worth testing whether this is a fairly general feature of units associated with tandem repeats. Here we consider the large set of repeat unit classes derived in (Jurka & Pethiyagoda, 1995) corresponding to frequently encountered tandem repeats of multiple lengths (i.e. that are polymorphic) in primate genomes. The data contains 501 unique classes of repeat units ranging in length from P = 1 to P = 6, classified into three categories: expandable, weakly expandable, and non-expandable. These categories were derived from simple statistical criteria calculated over a subset of the GenBank data base. The first category, in particular, contains 67 classes of patterns that were found to occur in tandem repeats of total length $N \ge 12$ with $R \ge 3$ in at least two different length sizes. The second category includes 71 pattern classes that are found to occur in tandem repeats over 12 nucleotides long in only one length size. The last category contains 363 pattern classes that were found not to expand beyond 12 nucleotides. For each pattern class, simple indicators are provided, such as average length of repeats, relative abundance of class, and expandability.

In Figure 6 we display the Z-scores for all the repeat units in the first category, as a function of repeat unit length (P), computed with respect to repeat units of the same length using Equation 29. The distributions of repeat classes with respect to each structural scale are approximately symmetric and normal at all lengths, showing no clear-cut bias towards extremal values of any scale. When taking into account the relative abundance of the classes, as quantified in (Jurka & Pethiyagoda, 1995) by using the number of nucleotides occurring in corresponding repetitive sequences, we nevertheless observe that the most abundant class (poly-A, 33%) corresponds to the stiffest repeat at all lengths, for all scales except BS (Figure 7). It is reasonable to assume that the structural properties of poly-A are related to its abundance. The next most frequent repeats, however (AC=17.94%, AG=5.38%, and AAAT=3.39%), do not show a clear pattern of extremal values in the five scales considered. Likewise, we do not find any obvious correlation between scale values and relative abundance or expandability indicators provided in (Jurka & Pethiyagoda, 1995).

[Figure 7 plot: relative abundance (%) versus repeat unit length (P); labeled classes: A, AC, AG, AT, AAAC, AAAG, AAAT, AAGG, AGAT, AAAAC, AAAAG, AAAAT, AGCCC, AACCCT.]

Figure 7: Relative abundance of expandable repeat unit classes in (Jurka & Pethiyagoda, 1995) as a function of repeat unit length. The individual contributions of repeat classes totaling more than 1% relative abundance are shown.

[Figure 6 plot: five panels (BS, PT, PD, B, PP), each showing normalized scale value versus repeat unit length P = 1 to 6.]

Figure 6: Distribution of expandable repeat classes over scale values. Horizontal axis represents repeat unit length (P). Vertical axis represents scale value distances from expectation for all possible repeat units, normalized by the corresponding standard deviation (see Equation 29).
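The normalization used for the vertical axis of Figure 6 can be illustrated with a short sketch. This is only a minimal example under stated assumptions, not a transcription of Equation 29: it assumes that the score of a repeat unit is the sum of a dinucleotide scale over the unit read circularly (so that all shifted versions of a unit receive the same score), and that expectation and standard deviation are taken over all $4^P$ units of the same length with equal weight. The propeller twist values are those of Table 17; the function names `expand`, `unit_score`, and `z_scores` are hypothetical.

```python
from itertools import product
from statistics import mean, pstdev

# Dinucleotide propeller twist values as listed in Table 17.
# Each strand-symmetric pair (e.g. AA/TT) shares a single value.
PT_PAIRS = {
    "AA/TT": -18.66, "AC/GT": -13.10, "AG/CT": -14.00, "AT": -15.01,
    "CA/TG": -9.45, "CC/GG": -8.11, "CG": -10.03, "GA/TC": -13.48,
    "GC": -11.08, "TA": -11.85,
}

def expand(pairs):
    """Map each of the 16 dinucleotides to its scale value."""
    return {d: v for key, v in pairs.items() for d in key.split("/")}

def unit_score(unit, scale):
    """Additive score of a repeat unit read circularly (assumed convention)."""
    p = len(unit)
    return sum(scale[unit[i] + unit[(i + 1) % p]] for i in range(p))

def z_scores(p, scale):
    """Z-score of every unit of length p against all 4**p units (uniform weights)."""
    units = ["".join(t) for t in product("ACGT", repeat=p)]
    scores = [unit_score(u, scale) for u in units]
    mu, sigma = mean(scores), pstdev(scores)
    return {u: (s - mu) / sigma for u, s in zip(units, scores)}

PT = expand(PT_PAIRS)
z3 = z_scores(3, PT)
# Report the most extreme (lowest) unit on this scale and its Z-score.
lowest = min(z3, key=z3.get)
print(lowest, round(z3[lowest], 2))
```

Under these assumptions a unit and its cyclic shifts (for instance AGC, GCA, and CAG) share one Z-score, which is in line with the class-level reading of extremality used in the text. A non-uniform background would require weighting each unit by its probability instead of using plain means.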
In Figure 6, dots represent a set of 67 expandable repeat classes. The number of classes at each length 1-6 is as follows: 2, 4, 9, 18, 18, and 16. Circles represent the most extreme repeat classes, when considering the full population of repeat patterns at a given length. Note that at each length P, only proper P-tuple repeats are represented in the set found in (Jurka & Pethiyagoda, 1995), excluding repeats with shorter periodicity. For instance, poly-A is only represented once as a single letter repeat. It is worth noticing that most extreme positions are actually occupied at almost all lengths by mono-, di-, or tri-nucleotide repeats. The set of expandable repeats therefore actually covers the full range of each scale when taking into account patterns made up of shorter repeated units. Only 5 circles out of 64 remain unmatched by an expandable repeat class.

4 Discussion

A general framework for sequence analysis in the presence of one or more additive scales has been developed. The framework solves a number of open issues including: (1) construction of sequences with extremal properties; (2) quantitative evaluation of sequences with respect to a given genomic background; (3) automatic extraction of extremal sequences and profiles from genomic data bases; (4) rapid convergence to normal distributions when N increases; and (5) complete analysis of correlations between scales and their rapid convergence towards a fixed asymptotic value. The framework has been applied to DNA sequences and structural scales.

The fundamental requirement for the application of the framework is the additivity of the scale. This is likely to be a reasonable approximation for many scales, at least over relatively short distances. The precise nature of such distances, however, is an important open question that ultimately will have to be addressed experimentally. The structural scales used here should be regarded only as a first order approximation. The twist angle between bases, for instance, is likely to depend on more than just the two neighboring bases (Dickerson, 1992). A better estimate could be derived using the tetranucleotide consisting of the two bases before and after the twist angle. Unfortunately, the structure of all possible 256 tetranucleotides is not known and represents a considerable experimental challenge. But the methods we have developed are independent of any particular scale, approximation, or oligonucleotide length. They are readily applicable to new scales, tetranucleotide and other, as well as to completely different scales defined over other alphabets (codons, RNA, proteins, etc.). Furthermore, the methods are also applicable in conjunction with computationally-derived scales that are parameterized and fitted to the data using neural network representations and other statistical machine learning techniques.

The theory presented resolves the issue of correlation between scales. For each pair of scales there is not a single correlation number because correlation depends both on background distribution and window length. Given a fixed background distribution, the correlations rapidly converge to a fixed asymptotic value that can be predicted mathematically. This value is attained here over a characteristic window length of about 10-15 bp, the same over which normality is achieved, and corresponding to a few times the size of the longest of the two scales being compared. This is the range over which local statistical fluctuations are stabilized, but it does not correspond necessarily to the range over which the scales are additive. With a uniform background distribution, for instance, the correlation between propeller twist and base stacking varies monotonically from -0.294 to -0.757, as window size is increased from 2 to 10 or so. Thus the increase in window length significantly changes the measured correlation from slightly negative to substantially negative. An even more striking example is provided by the correlation between base stacking and protein deformability, which varies from 0.043 to -0.843, under the same conditions. Empirical determination of the proper window length for additivity and for measuring correlations may not be easy. It must be noted, however, that these large variations are observed only with the theoretically-derived base stacking energy scale (as well as the AT-content scale). In general, all other empirically-derived scales exhibit pairwise correlations that do not vary dramatically with window length and are relatively small (Figure 4), except for propeller twist and protein deformability. Thus for the empirically-derived scales the local behavior and the aggregate behavior over 10-15 bp are quite similar in terms of correlations, so that the precise selection of a window length may not be a serious obstacle in this case.

Propeller twist and protein deformability are highly, but not perfectly, correlated, over both very short and intermediate distances. These scales were derived by crystallography of naked DNA and DNA in complex with protein respectively. This suggests that DNA structures observed in protein-DNA complexes may to some extent be determined at the DNA-sequence level, or at least that the structure of DNA in the complex has to be consistent with the inherent structural features of the naked DNA. In general, when substantial positive or negative correlation between two scales is observed, two different sets of conclusions can be drawn. First, from a practical standpoint, it may be simpler and faster in data base searches to use only one of the two scales, since the results provided by the second one are redundant. Second, high correlation between two completely different experimental approaches attempting to quantify the connection between DNA sequence and structure can be taken as a sign that the approaches are measuring the same underlying reality. Thus correlation analysis, in addition to indicating which scale to use in practice, may tell us something about their interpretation and validity.

It may at first seem strange that correlations depend also on background distribution, since structure is a deterministic function of DNA sequence. In this sense, a uniform distribution may be the least biased. On the other hand, if correlations are measured over a large number of sequences extracted from genomic data, it is clear that the sequence composition influences the correlation. Similarly, if the scales are used to pull out signals against a genomic background, it is important to take the statistical composition into consideration. In this respect, it is worth noting that large scale genomic DNA is characterized by strand invariant compositions ($p_A = p_T$ and $p_C = p_G$) where correlations between empirically-derived scales tend to be smaller in absolute value. The framework, however, applies as well to compositions that are not strand invariant.

We have modeled the background distribution using single nucleotide probabilities but it should be clear that the same framework can accommodate more complex probabilistic models, such as Markov models of order k. In fact, it is interesting to note that with a higher order Markov model, some of the correlations between the scales measured in E. coli would be slightly higher. It is fair to suspect, however, that the structural models currently available are somewhat noisy and therefore only marginal gains are to be expected at best from the use of more refined probabilistic models.

Taken together, all these results indicate that with the exception of propeller twist and protein deformability, the empirical scales we have used are by and large uncorrelated. DNA 3D structure is a complex phenomenon that cannot be captured locally by a single number, but rather corresponds to a vector of properties. It is therefore likely that the scales we have used represent different attempts at capturing various aspects of DNA structure from different perspectives. This view is consistent with the simultaneous presence of predominantly low and occasionally high correlations between pairs of scales. In particular, this provides a possible partial explanation for the differences in interpretation the scales provide for some of the three extremal triplet classes involved in triplet repeat expansion diseases. The CAG-class of repeats is consistently found to be flexible according to all the models used here and in agreement with experimental evidence (Chastain & Sinden, 1998). This class is special among triplet repeats, being responsible for a large fraction of the currently known triplet repeat diseases (10 of the 13 mentioned in (Baldi et al., 1999)). Furthermore, in a model study in E. coli, the CAG triplet repeat was found to be the predominant genetic expansion product. In this study, the CAG-class was expanded at least nine times more frequently than any other triplet (Ohshima et al., 1996). This is also the case in the primate data of (Jurka & Pethiyagoda, 1995), as shown in Figure 7. The CGG-class of repeats, on the other hand, seems to be very rigid, except for the propeller twist (and thus protein deformability) scale. Better structural models may be needed to shed light on such a discrepancy. However, it is important to remember that the models used in this work are based on mutually different and also rather indirect investigations of DNA structure. Any single scale is likely to capture correctly only some structural features of some sequence elements. For instance, the enzyme DNase I used to produce the bendability scale preferentially binds and cuts sites where DNA is bent or bendable away from the minor groove. This means that a high DNase I value can be caused by either a very flexible piece of DNA (isotropically flexible, or anisotropically flexible in the right direction), or alternatively by a piece of DNA that is stiff but curved with a compressed major groove.

The framework derived can be used to study how extreme tandem repeats and other sequences are. In several cases, we find that commonly encountered repeat units have extremal structural properties. This is the case for the most common repeat poly-A or poly-T but also for the repeats involved in triplet repeat expansion diseases. It is essential to notice that these extremal properties pertain to the repeat unit class, e.g. the triplet and its shifted versions, rather than the repeat unit alone. A triplet that is not extremal for a given scale may become extremal once its two shifted versions are considered. For example, AGC has relatively low bendability when taken alone, but corresponds to the most bendable class when GCA and CAG are taken into account. When the large repetition numbers associated with repeat diseases are also taken into consideration, the extremality of the corresponding DNA sequences with respect to the general background is even more striking. Incidentally, the triplet repeat class is under-represented amongst primate DNA repeat expansions (Figure 7), suggesting that special expansion mechanisms may be at work for P = 1 and P = 3, at least in a fraction of the cases.

How the expansion of disease causing triplets occurs, as well as several related puzzling questions such as why expansion frequency depends on repeat length, remain poorly understood, although it is widely assumed that unusual structural characteristics of the repeats may play a role. Several models have been proposed involving base-slippage and alternative DNA structures during DNA replication, recombination, and repair (Wells, 1996; Pearson & Sinden, 1998a; Pearson & Sinden, 1998b; Moore et al., 1999). Growing evidence indicates that the formation of hairpins in Okazaki fragments (during replication of the lagging strand) is probably involved in the expansion process (Chen et al., 1995; Gacy et al., 1995; Wells, 1996; Mariappan et al., 1998; Miret et al., 1998; Pearson & Sinden, 1998b). But many questions remain open to interpretation.

Our analysis of a large set of tandem repeats from primate genomes reveals that many repeat units do not have salient structural extreme properties according to the models used here. The results suggest that tandem repeats are likely to belong to different classes and result from a variety of different mechanisms, not all of which involve extremal structural profiles. This is not to say that structural signatures, rather than extreme patterns, may not be involved in other cases as suggested, for instance, in (Liao et al., 2000). Further evidence for such a possibility is provided by the fact that among the most frequently expanded repeat classes with 3 < P < 7, substrings of mono repeats (particularly poly-A, which is stiff) seem to be present almost always (Figure 7).

Although some of the equations derived are for exact repeats, it should be clear that the framework applies immediately to situations where the repeat is not perfect, either because of small variations in the sequence or because the repetition number R contains a fractional component. Cases are described in the literature (Orr et al., 1993), for instance, where the G of a few isolated CAG triplets within a long CAG repeat region is replaced by a T. Of interest, the CAT triplet belongs to the second highest bendability class, and therefore the flexibility properties of such stretches are likely to be preserved. More generally, the scales could perhaps form the basis of new alignment penalties in cases where structure, rather than sequence, is preserved.

The methods developed here can be applied more systematically to other repeats including telomeric repeats, non-triplet disease-causing repeats, as well as to database searches for new putative disease-causing repeat classes (Kleiderlein et al., 1998; Baldi et al., 1999). For instance, a repeated twelve-mer upstream of the EPM1 gene displays intergenerational instability and has been associated with myoclonus epilepsy (Lalioti et al., 1997; Lalioti et al., 1998). A similarly unstable, AT-rich, 42 bp-repeat is involved in the fragile site FRA10B (Hewett et al., 1998). In addition, a particular form of muscular dystrophy (FSHD) seems to be caused by DNA contraction, rather than expansion. In FSHD, the repeat units are surprisingly long (3.3 Kb), and located at the tip of the long arm of chromosome 4. The units are repeated 30 to 40 times in normal individuals, and reduced down to 8 repeats in affected individuals (van Deutekom et al., 1993; Winokur et al., 1994; Hewitt et al., 1994). Statistical analysis of long repeat units may benefit even more from the techniques developed here. Finally, these techniques can be applied to repeats associated with interspersed elements as well, and, more broadly, to the analysis of structural signals, across entire genomes (Pedersen et al., 2000), or associated with specific regions such as regulatory regions, protein binding sites, SARs (scaffold attachment regions), or polytene bands in Drosophila. The methods are also being applied to phylogenetic questions and to whether DNA structure may have had any influence on the origin of the genetic code. While all these problems would benefit from improved structural models, the methods are now in place to work in conjunction with any new scale that may become available in the near future with progress in DNA experimental techniques.

Table 18: Trinucleotide scales.
Triplet      B        PP
AAA/TTT     -0.274    36
AAC/GTT     -0.205     6
AAG/CTT     -0.081     6
AAT/ATT     -0.280    30
ACA/TGT     -0.006     6
ACC/GGT     -0.032     8
ACG/CGT     -0.033     8
ACT/AGT     -0.183    11
AGA/TCT      0.027     9
AGC/GCT      0.017    25
AGG/CCT     -0.057     8
ATA/TAT      0.182    13
ATC/GAT     -0.110     7
ATG/CAT      0.134    18
CAA/TTG      0.015     9
CAC/GTG      0.040    17
CAG/CTG      0.175     2
CCA/TGG     -0.246     8
CCC/GGG     -0.012    13
CCG/CGG     -0.136     2
CGA/TCG     -0.003    31
CGC/GCG     -0.077    25
CTA/TAG      0.090    18
CTC/GAG      0.031     8
GAA/TTC     -0.037    12
GAC/GTC     -0.013     8
GCA/TGC      0.076    13
GCC/GGC      0.107    45
GGA/TCC      0.013     5
GTA/TAC      0.025     6
TAA/TTA      0.068    20
TCA/TGA      0.194     8

Acknowledgements

The work of PB was initially supported by an NIH SBIR grant to Net-ID, Inc., and currently by a Laurel Wilkening Faculty Innovation award at UCI. We would like to thank Anders Gorm Pedersen and David Ussery for comments on the manuscript.

Appendix: Dinucleotide and Trinucleotide Scales

This appendix lists the 3 dinucleotide scales (Table 17) and the 2 trinucleotide scales (Table 18) used in the text.

Table 17: Dinucleotide scales.

Pair       BS        PT        PD
AA/TT     -5.37    -18.66      2.9
AC/GT    -10.51    -13.10      2.3
AG/CT     -6.78    -14.00      2.1
AT        -6.57    -15.01      1.6
CA/TG     -6.57     -9.45      9.8
CC/GG     -8.26     -8.11      6.1
CG        -9.69    -10.03     12.1
GA/TC     -9.81    -13.48      4.5
GC       -14.59    -11.08      4.0
TA        -3.82    -11.85      6.3

Appendix: Enumeration of Equivalence Classes

Polya's enumeration theory solves a number of combinatorial questions related to the action of groups on sets. The relevant set here is the set $B$ of all words of length $P$ over the alphabet $\mathcal{A}$ of size $A$, so that $B$ has $A^P$ elements. The group $G$ of interest is a subgroup of the group of all permutations of $B$. If we consider circular permutations on one strand only, it is the cyclic group with $P$ elements, generated by the single right shift operator $\sigma$. If we consider also the reverse complement, then the group $G$ is easily described as the set of all permutations of the form $\sigma^k \tau^l$, where $\tau$ denotes the reverse complement permutation, with $0 \le k \le P-1$, and $l = 0$ or $1$. Notice that this group is not commutative. However, $\sigma^k \tau = \tau \sigma^{P-k}$. In both cases, $G$ acts on $B$ and an equivalence relation $x_1 \sim x_2$ is defined if and only if there exists a $g \in G$ such that $x_2 = g x_1$. The total number of classes or orbits is given by the Burnside lemma and equal to:

$$\frac{1}{|G|} \sum_{g \in G} |B_g| \qquad (37)$$

where $B_g = \{x \in B \mid g(x) = x\}$ is the set of all the elements of $B$ that are fixed by $g$. We thus need to study $B_{\sigma^k}$ and $B_{\sigma^k \tau}$. A case by case inspection easily shows that: (1) if $\gcd(k, P) = 1$, then $|B_{\sigma^k}| = A$, since only sequences of one repeated letter are stable; (2) if $k \mid P$, then $|B_{\sigma^k}| = A^k$; (3) in general, if $\gcd(k, P) = l$, then $|B_{\sigma^k}| = A^l$. When $G$ is the group of cyclic permutations, $|G| = P$. Putting these results together gives the right hand side of Equation 7, which is equivalent to the left hand side after some algebra.

When we take into account the reverse complement, we have $|G| = 2P$. A case by case inspection again shows that: if $P$ is odd, then $B_{\sigma^k \tau}$ is empty; if $P$ is even, then $B_{\sigma^k \tau}$ is empty for half of the values of $k$ (those for which some position is mapped onto itself), while for the other half it has $P/2$ degrees of freedom and therefore $|B_{\sigma^k \tau}| = A^{P/2}$. This yields immediately the formula in Equations 9 and 10. For example, with $A = 4$, the count is $(64 + 4 + 4)/6 = 12$ for $P = 3$ and $(256 + 4 + 16 + 4 + 2 \cdot 16)/8 = 39$ for $P = 4$, in agreement with Tables 19 and 20. If needed, it is also straightforward to count the number of elements inside each type of equivalence class.

Appendix: Correlations as a Function of Background Distribution

Examples of surfaces of correlations between the AT-content scale and the other scales as a function of background distribution are given in Figure 8. Similar surfaces can be obtained for any pair of scales.

Appendix: Repeat Classes for N = 3 and N = 4

Repeat classes for each trinucleotide (Table 19) and each tetranucleotide (Table 20).

Table 19: Repeat class for each trinucleotide (N = 3). Classes are numbered in alphabetical order. The first alphabetical member of each class is in bold.
Triplet  Class    Triplet  Class    Triplet  Class
AAA       1       CCG      12       GTA       7
AAC       2       CCT       9       GTC       6
AAG       3       CGA       6       GTG       5
AAT       4       CGC      12       GTT       2
ACA       2       CGG      12       TAA       4
ACC       5       CGT       6       TAC       7
ACG       6       CTA       7       TAG       7
ACT       7       CTC       9       TAT       4
AGA       3       CTG       8       TCA      10
AGC       8       CTT       3       TCC       9
AGG       9       GAA       3       TCG       6
AGT       7       GAC       6       TCT       3
ATA       4       GAG       9       TGA      10
ATC      10       GAT      10       TGC       8
ATG      10       GCA       8       TGG       5
ATT       4       GCC      12       TGT       2
CAA       2       GCG      12       TTA       4
CAC       5       GCT       8       TTC       3
CAG       8       GGA       9       TTG       2
CAT      10       GGC      12       TTT       1
CCA       5       GGG      11
CCC      11       GGT       5

Table 20: Repeat class for each tetranucleotide (N = 4). Classes are numbered in alphabetical order. The first alphabetical member of each class is in bold. Entries are listed in alphabetical order, row by row, each quadruplet being followed by its class.

AAAA 1    AAAC 2    AAAG 3    AAAT 4    AACA 2    AACC 5    AACG 6    AACT 7
AAGA 3    AAGC 8    AAGG 9    AAGT 10   AATA 4    AATC 11   AATG 12   AATT 13
ACAA 2    ACAC 14   ACAG 15   ACAT 16   ACCA 5    ACCC 17   ACCG 18   ACCT 19
ACGA 6    ACGC 20   ACGG 21   ACGT 22   ACTA 7    ACTC 23   ACTG 24   ACTT 10
AGAA 3    AGAC 15   AGAG 25   AGAT 26   AGCA 8    AGCC 27   AGCG 28   AGCT 29
AGGA 9    AGGC 30   AGGG 31   AGGT 19   AGTA 10   AGTC 24   AGTG 23   AGTT 7
ATAA 4    ATAC 16   ATAG 26   ATAT 32   ATCA 11   ATCC 33   ATCG 34   ATCT 26
ATGA 12   ATGC 35   ATGG 33   ATGT 16   ATTA 13   ATTC 12   ATTG 11   ATTT 4
CAAA 2    CAAC 5    CAAG 8    CAAT 11   CACA 14   CACC 17   CACG 20   CACT 23
CAGA 15   CAGC 27   CAGG 30   CAGT 24   CATA 16   CATC 33   CATG 35   CATT 12
CCAA 5    CCAC 17   CCAG 27   CCAT 33   CCCA 17   CCCC 36   CCCG 37   CCCT 31
CCGA 18   CCGC 37   CCGG 38   CCGT 21   CCTA 19   CCTC 31   CCTG 30   CCTT 9
CGAA 6    CGAC 18   CGAG 28   CGAT 34   CGCA 20   CGCC 37   CGCG 39   CGCT 28
CGGA 21   CGGC 38   CGGG 37   CGGT 18   CGTA 22   CGTC 21   CGTG 20   CGTT 6
CTAA 7    CTAC 19   CTAG 29   CTAT 26   CTCA 23   CTCC 31   CTCG 28   CTCT 25
CTGA 24   CTGC 30   CTGG 27   CTGT 15   CTTA 10   CTTC 9    CTTG 8    CTTT 3
GAAA 3    GAAC 6    GAAG 9    GAAT 12   GACA 15   GACC 18   GACG 21   GACT 24
GAGA 25   GAGC 28   GAGG 31   GAGT 23   GATA 26   GATC 34   GATG 33   GATT 11
GCAA 8    GCAC 20   GCAG 30   GCAT 35   GCCA 27   GCCC 37   GCCG 38   GCCT 30
GCGA 28   GCGC 39   GCGG 37   GCGT 20   GCTA 29   GCTC 28   GCTG 27   GCTT 8
GGAA 9    GGAC 21   GGAG 31   GGAT 33   GGCA 30   GGCC 38   GGCG 37   GGCT 27
GGGA 31   GGGC 37   GGGG 36   GGGT 17   GGTA 19   GGTC 18   GGTG 17   GGTT 5
GTAA 10   GTAC 22   GTAG 19   GTAT 16   GTCA 24   GTCC 21   GTCG 18   GTCT 15
GTGA 23   GTGC 20   GTGG 17   GTGT 14   GTTA 7    GTTC 6    GTTG 5    GTTT 2
TAAA 4    TAAC 7    TAAG 10   TAAT 13   TACA 16   TACC 19   TACG 22   TACT 10
TAGA 26   TAGC 29   TAGG 19   TAGT 7    TATA 32   TATC 26   TATG 16   TATT 4
TCAA 11   TCAC 23   TCAG 24   TCAT 12   TCCA 33   TCCC 31   TCCG 21   TCCT 9
TCGA 34   TCGC 28   TCGG 18   TCGT 6    TCTA 26   TCTC 25   TCTG 15   TCTT 3
TGAA 12   TGAC 24   TGAG 23   TGAT 11   TGCA 35   TGCC 30   TGCG 20   TGCT 8
TGGA 33   TGGC 27   TGGG 17   TGGT 5    TGTA 16   TGTC 15   TGTG 14   TGTT 2
TTAA 13   TTAC 10   TTAG 7    TTAT 4    TTCA 12   TTCC 9    TTCG 6    TTCT 3
TTGA 11   TTGC 8    TTGG 5    TTGT 2    TTTA 4    TTTC 3    TTTG 2    TTTT 1

[Figure 8 plot: four surface panels (Correlation AT/BS, Correlation AT/PT, Correlation AT/PD, Correlation AT/PP), each as a function of AT content and A/(A+T) ratio.]

Figure 8: Surface representing the asymptotic correlation between AT-content and all the other scales for different background distributions.

References

Ashley, C. T. & Warren, S. T. (1995). Trinucleotide repeat expansion and human disease. Ann. Rev. Genetics, 29, 703-728.
Baldi, P., Brunak, S., Chauvin, Y. & Krogh, A. (1996). Naturally occurring nucleosome positioning signals in human exons and introns. J. Mol. Biol., 263, 503-510.
Baldi, P., Brunak, S., Chauvin, Y. & Pedersen, A. G. (1999). Structural basis for triplet repeat disorders: a computational analysis. Bioinformatics, 15, 918-929.
Baldi, P., Chauvin, Y., Pedersen, A. G. & Brunak, S. (1998). Computational applications of DNA structural scales. In Proceedings of the 1998 Conference on Intelligent Systems for Molecular Biology (ISMB98). The AAAI Press, Menlo Park, CA, pp. 35-42.
Baldi, P. & Rinott, Y. (1989). On normal approximations of distributions in terms of dependency graphs. The Annals of Probability, 17, 1646-1650.
Bell, S. J. & Forsdyke, D. R. (1999). Accounting units in DNA. Journal of Theoretical Biology, 197, 51-61.
Benson, G. (1999). Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research, 27, 573-580.
Benson, G. & Waterman, M. S. (1994). A method for fast database search for all k-nucleotide repeats. Nucleic Acids Research, 22, 4828-4836.
Blanchard, M. K., Chiapello, H. & Coward, E. (2000). Detecting localized repeats in genomic sequences: a new strategy and its applications to Bacillus subtilis and Arabidopsis thaliana sequences. Computers and Chemistry, 24, 57-70.
Breslauer, K. J., Frank, R., Blocker, H. & Marky, L. A. (1986). Predicting DNA duplex stability from the base sequence. Proc. Natl. Acad. Sci. USA, 83, 3746-3750.
Brukner, I., Sanchez, R., Suck, D. & Pongor, S. (1995). Sequence-dependent bending propensity of DNA as revealed by DNase I: parameters for trinucleotides. EMBO J., 14, 1812-1818.
Calladine, C. R., Drew, H. R. & McCall, M. J. (1988). The intrinsic structure of DNA in solution. J. Mol. Biol., 201, 127-137.
Campuzano, V., Montermini, L., Molto, M. D., Pianese, L., Cossee, M., Cavalcanti, F., Monros, E., Rodius, F., Duclos, F., Monticelli, A., Zara, F., Canizares, J., Koutnikova, H., Bidichandani, S. I., Gellera, C., Brice, A., Trouillas, P., Michele, G. D., Filla, A., Frutos, R. D., Palau, F., Patel, P. I., Donato, S. D., Mandel, J. L., Cocozza, S., Koenig, M. & Pandolfo, M. (1996). Friedreich's ataxia: autosomal recessive disease caused by an intronic GAA triplet repeat expansion. Science, 271, 1423-1427.
Chastain, P. D. & Sinden, R. R. (1998). CTG repeats associated with human genetic disease are inherently flexible. J. Mol. Biol., 275, 405-411.
Chen, X., Mariappan, S. V. S., Catasti, P., Ratliff, R., Moyzis, R. K., Ali, L., Smith, S. S., Bradbury, E. M. & Gupta, G. (1995). Hairpins are formed by the single DNA strands of the fragile X triplet repeats: structure and biological implications. Proc. Natl. Acad. Sci. USA, 92, 5199-5203.
Dickerson, R. E. (1992). DNA structure from A to Z. Meth. Enz., 211, 67-111.
Drew, H. R. & Travers, A. A. (1985). DNA bending and its relation to nucleosome positioning. J. Mol. Biol., 186, 773-790.
Durbin, R., Eddy, S. R., Krogh, A. & Mitchison, G. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK.
Eichler, E. E. & Nelson, D. L. (1998). The FRAXA fragile site and fragile X syndrome. In Rubinsztein, D. C. & Hayden, M. R., (eds.) Analysis of triplet repeat disorders. BIOS Scientific Publishers Ltd., Oxford, UK, pp. 13-50.
Feller, W. (1971). An Introduction to Probability Theory and its Applications, volume 2. John Wiley & Sons, New York. Second Edition.
Fye, R. M. & Benham, C. J. (1999). Exact method for numerically analyzing a model of local denaturation in superhelically stressed DNA. Physical Review E, 59, 3408-3426.
Gacy, A. M., Goellner, G., Juranic, N., Macura, S. & McMurray, C. T. (1995). Trinucleotide repeats that expand in human disease form hairpin structures in vitro. Cell, 81, 533-540.
Gecz, J. & Mulley, J. C. (1999). Characterisation and expression of a large, 13.7 kb FMR2 isoform. Eur. J. Hum. Genet., 7, 157-162.
Godde, J. S., Kass, S. U., Hirst, M. C. & Wolffe, A. P. (1996). Nucleosome assembly on methylated CGG triplet repeats in the fragile X mental retardation gene 1 promoter. J. Biol. Chem., 271, 24325-24328.
Godde, J. S. & Wolffe, A. P. (1996). Nucleosome assembly on CTG triplet repeats. J. Biol. Chem., 271, 15222-15229.
Goodsell, D. S. & Dickerson, R. E. (1994). Bending and curvature calculations in B-DNA. Nucl. Acids Res., 22, 5497-5503.
Grove, A., Galeone, A., Mayol, L. & Geiduschek, E. P. (1996). Localized DNA flexibility contributes to target site selection by DNA-bending proteins. J. Mol. Biol., 260, 120-125.
Gusella, J. F. & MacDonald, M. E. (1996). Trinucleotide instability: a repeating theme in human inherited disorders. Ann. Rev. in Medicine, 47, 201-209.
Hardy, J. & Gwinn-Hardy, K. (1998). Genetic classification of primary neurodegenerative disease. Science, 282, 1075-1079.
Hassan, M. A. E. & Calladine, C. R. (1996). Propeller-twisting of base-pairs and the conformational mobility of dinucleotide steps in DNA. J. Mol. Biol., 259, 95-103.
Hewett, D. R., Handt, O., Hobson, L., Mangelsdorf, M., Eyre, H. J., Baker, E., Sutherland, G. R., Schuffenhauer, S., Mao, J. I. & Richards, R. I. (1998). FRA10B structure reveals common elements in repeat expansion and chromosomal fragile site genesis. Mol. Cell, 1, 773-781.
Hewitt, J. E., Lyle, R., Clark, L. N., Valleley, E. M., Wright, T. J., Wijmenga, C., van Deutekom, J. C., Francis, F., Sharpe, P. T., Hofker, M. H. et al. (1994). Analysis of the tandem repeat locus D4Z4 associated with facioscapulohumeral muscular dystrophy. Hum. Mol. Genetics, 3, 1287-1295.
Hunter, C. A. (1996). Sequence-dependent DNA structure. Bioessays, 18, 157-162.
Iyer, V. & Struhl, K. (1995). Poly(dA:dT), a ubiquitous promoter element that stimulates transcription via its intrinsic DNA structure. EMBO J., 14, 2570-2579.
Jeffreys, A. J. (1997). Spontaneous and induced minisatellite instability in the human genome. Clinical Science, 93, 383-390.
Junck, L. & Fink, J. K. (1996). Machado-Joseph disease and SCA3: the genotype meets the phenotypes. Neurology, 46, 4-8.
Jurka, J. (1998). Repeats in genomic DNA: mining and meaning. Curr. Opin. Struct. Biol., 8, 333-337.
Jurka, J. & Pethiyagoda, C. (1995). Simple repetitive DNA sequences from primates: compilation and analysis. J. Mol. Evol., 40, 120-126.
Jurka, J., Walichiewicz, J. & Milosavljevic, A. (1992). Prototypic sequences for human repetitive DNA. J. Mol. Evol., 35, 286-291.
Kleiderlein, J. J., Nisson, P. E., Jessee, J., Li, W., Becker, K. G., Derby, M. L., Ross, C. A. & Margolis, R. L. (1998). CCG repeats in cDNAs from human brain. Hum. Genet., 103, 666-673.
Koenig, M. (1998). Friedreich's ataxia. In Rubinsztein, D. C. & Hayden, M. R., (eds.) Analysis of triplet repeat disorders. BIOS Scientific Publishers Ltd., Oxford, UK, pp. 219-238.
Lahm, A. & Suck, D. (1991). DNase I-induced DNA conformation: 2 Å structure of a DNase I-octamer complex. J. Mol. Biol., 222, 645-667.
Lalioti, M. D., Scott, H. S., Buresi, C., Rossier, C., Bottani, A., Morris, M. A., Malafosse, A. & Antonarakis, S. E. (1997). Dodecamer repeat expansion in cystatin B gene in progressive myoclonus epilepsy. Nature, 386, 847-851.
Lalioti, M. D., Scott, H. S., Genton, P., Grid, D., Ouazzani, R., M'Rabet, A., Ibrahim, S., Gouider, R., Dravet, C., Chkili, T., Bottani, A., Buresi, C., Malafosse, A. & Antonarakis, S. E. (1998). A PCR amplification method reveals instability of the dodecamer repeat in progressive myoclonus epilepsy (EPM1) and no correlation between the size of the repeat and age at onset. Am. J. Hum. Genet., 62, 842-847.
Lee, C. C. (1998). Spinocerebellar ataxia type 6 (SCA6). In Rubinsztein, D. C. & Hayden, M. R., (eds.) Analysis of triplet repeat disorders. BIOS Scientific Publishers Ltd., Oxford, UK, pp. 145-154.
Liao, G., Rehm, E. J. & Rubin, G. M. (2000). Insertion site preferences of the P transposable element in Drosophila melanogaster. Proc. Natl. Acad. Sci. USA, 97, 3347-3351.
Liu, K. & Stein, A. (1997). DNA sequence encodes information for nucleosome array formation. J. Mol. Biol., 270, 559-573.
Lu, Q., Wallrath, L. L. & Elgin, S. C. R. (1994). Nucleosome positioning and gene regulation. J. Cell. Biochem., 55, 83-92.
Mariappan, S. V. S., Silks III, L. A., Bradbury, E. M. & Gupta, G. (1998). Fragile X DNA triplet repeats, (GCC)n, form hairpins with single hydrogen-bonded cytosine.cytosine mispairs at the CpG sites: isotope-edited nuclear magnetic resonance spectroscopy on (GCC)n with selective 15N4-labeled cytosine bases. J. Mol. Biol., 283, 111-120.
Milosavljevic, A. & Jurka, J. (1993). Discovering simple DNA sequences by the algorithmic significance method. CABIOS, 9, 407-411.
Miret, J. J., Pessoa-Brandao, L. & Lahue, R. S. (1998). Orientation-dependent and sequence-specific expansions of CTG/CAG trinucleotide repeats in Saccharomyces cerevisiae. Proc. Natl. Acad. Sci. USA, 95, 12438-12443.
Moore, H., Greenwell, P. W., Liu, C. P., Arnheim, N. & Petes, T. D. (1999). Triplet repeats form secondary structures that escape DNA repair in yeast. Proc. Natl. Acad. Sci. USA, 96, 1504-1509.
Nelson, D. L. (1995). The fragile X syndrome. Semin. Cell Biol., 6, 5-11.
Nelson, H. C. M., Finch, J. T., Luisi, B. F. & Klug, A. (1987). The structure of an oligo(dA)-oligo(dT) tract and its biological implications. Nature, 330, 221-226.
Ohshima, K., Kang, S. & Wells, R. D. (1996). CTG triplet repeats from human hereditary diseases are dominant genetic expansion products in Escherichia coli. J. Biol. Chem., 271, 1853-1856.
Olson, W. K., Gorin, A. A., Lu, X., Hock, L. M. & Zhurkin, V. B. (1998). DNA sequence-dependent deformability deduced from protein-DNA crystal complexes. Proc. Natl. Acad. Sci. USA, 95, 11163-11168.
Ornstein, R. L., Rein, R., Breen, D. L. & MacElroy, R. D. (1978). An optimised potential function for the calculation of nucleic acid interaction energies. I. Base stacking. Biopolymers, 17, 2341-2360.
Orr, H. T., Chung, M., Banfi, S., Kwiatkowski, T. J., Jr., Servadio, A., Beaudet, A. L., McCall, A. E., Duvick, L. A., Ranum, L. P. W. & Zoghbi, H. Y. (1993). Expansion of an unstable trinucleotide CAG repeat in spinocerebellar ataxia type 1. Nature Genet., 4, 221-226.
Orr, H. T. & Zoghbi, H. Y. (1998). Polyglutamine tract vs. protein context in SCA1 pathogenesis. In Rubinsztein, D. C. & Hayden, M. R., (eds.) Analysis of triplet repeat disorders. BIOS Scientific Publishers Ltd., Oxford, UK, pp. 105-118.
Parvin, J. D., McCormick, R. J., Sharp, P. A. & Fisher, D. E. (1995). Pre-bending of a promoter sequence enhances affinity for the TATA-binding factor.
Nature , 373, 724{727. Paulson, H. L. (1998). Spinocerebellar ataxia type 3/machadojoseph disease. In Rubinsztein, D. C. & Hayden, M. R., (eds.) Analysis of triplet repeat disorders . BIOS Scientic Publishers Ltd., Oxford, UK., pp. 129{144. Paulson, H. L., Perez, M. K., Trottier, Y., Trojanowski, J. Q., Subramony, S. H., Das, S. S., Vig, P., Mandel, J. L., Fischbeck, K. H. & Pittman, R. N. (1997). Intranuclear inclusions of expanded polyglutamine protein in spinocerebellar ataxia type 3. Neuron , 19, 333{344. Pazin, M. J. & Kadonaga, J. T. (1997). SWI2/SNF2 and related proteins: ATP-driven motors that disrupt protein-DNA interactions? Cell , 88, 737{740. Pearson, C. E. & Sinden, R. R. (1998a). Slipped strand DNA, dynamic mutations and human disease. In Wells, R. D. & Warren, S. T., (eds.) Genetic instabilities and hereditary neurological diseases . Academic Press, New York, pp. 585{621. Pearson, C. E. & Sinden, R. R. (1998b). Trinucleotide repeat DNA structures: dynamic mutations from dynamic DNA. Cur. Opin. Struct. Biol., 8, 321{330. Pedersen, A. G., Baldi, P., Brunak, S. & Chauvin, Y. (1998). DNA structure in human RNA polymerase II promoters. J. Mol. Biol., 281, 663{673. Pedersen, A. G., Jensen, L. J., Brunak, S., Staerfeldt, H. H. & Ussery, D. W. (2000). A DNA structural atlas for escherichia coli. J. Mol. Biol.. In press. Ponomarenko, M. P., Ponomarenko, J. V., Frolov, A. S., Podkolodny, N. L., Savinkova, L. K., Kolchanov, N. A. & Overton, G. C. (1999). Identication of sequence-dependent DNA sites interacting with proteins. Bioinformatics , 15, 687. Prabhu, V. V. (1993). Symmetry observations in long nucleotide sequences. Nucleic Acids Research , 21, 2797{2800. Pulst, S.-M. (1998). Spinocerebellar ataxia type 2. In Rubinsztein, D. C. & Hayden, M. R., (eds.) Analysis of triplet repeat disorders . BIOS Scientic Publishers Ltd., Oxford, UK., pp. 119{128. Rinott, Y. & Dembo, A. (1996). Some examples of normal approximations by Stein's method. In Aldous, D. 
& Pemantle, R., (eds.) Random Discrete Structures . Springer Verlag, New York, NY, pp. 25{44. Ross, C. A. (1995). When more is less: pathogenesis of glutamine repeat neurodegenerative diseases. Neuron , 15, 493{496. Rubinsztein, D. C. & Amos, B. (1998). Trinucleotide repeat mutation processes. In Rubinsztein, D. C. & Hayden, M. R., (eds.) Analysis of triplet repeat disorders . BIOS Scientic Publishers Ltd., Oxford, UK., pp. 257{268. Rubinsztein, D. C. & Hayden, M. R. (1998). Introduction. In Rubinsztein, D. C. & Hayden, M. R., (eds.) Analysis of triplet repeat disorders . BIOS Scientic Publishers Ltd., Oxford, UK., pp. 1{12. Satchwell, S. C., Drew, H. R. & Travers, A. A. (1986). Sequence periodicities in chicken nucleosome core DNA. J. Mol. Biol., 191, 659{675. Sheridan, S. D., Benham, C. J. & Hateld, G. W. (1998). Activation of gene expression by a novel DNA structural transmission mechanism that requires supercoiling-induced DNA duplex destabilization in an upstream activating sequence. J. Biol. Chem.. Simpson, R. T. (1991). Nucleosome positioning: occurrence, mechanisms, and functional consequences. Prog. in Nucleic Acids Res. and Mol. Biol., 40, 143{184. Sinden, R. R. (1994). DNA structure and function. Academic Press, San Diego, CA. Skinner, J. A., Foss, G. S., Miller, W. J. & Davies, K. E. (1998). Molecular studies of the fragile sites FRAXE and FRAXF. In Rubinsztein, D. C. & Hayden, M. R., (eds.) Analysis of triplet repeat disorders . BIOS Scientic Publishers Ltd., Oxford, UK., pp. 51{60. Starr, D. B., Hoopes, B. C. & Hawley, D. K. (1995). DNA bending is an important component of site-specic recognition by the TATA binding protein. J. Mol. Biol., 250, 434{446. Stevanin, G., Daviid, G., Abbas, N., Durr, A., Holmberg, M., Duyckaerts, C., Giunti, P., Cancel, G., Ruberg, M., Mandel, J.L. & Brice, A. (1998). Spinocerebellar ataxia type 7 (SCA7). In Rubinsztein, D. C. & Hayden, M. R., (eds.) Analysis of triplet repeat disorders . 
BIOS Scientic Publishers Ltd., Oxford, UK., pp. 155{168. Suck, D. (1994). DNA recognition by Dnase I. J. Mol. Recognition , 7, 65{70. The Huntington's Disease Collaborative Research Group (1993). A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington's disease chromosomes. Cell , 72, 971{983. Tsukiyama, T. & Wu, C. (1997). Chromatin remodeling and transcription. Curr. Opin. Gen. Dev , 7, 182{191. van Deutekom, J. T., Wijmenga, C., van Tienhoven, E. A. E., Gruter, A. M., Hewitt, J. E., Padberg, G. W., van Ommen, G. J. B., Hofker, M. H. & Frants, R. R. (1993). FSHD associated DNA rearrangements are due to deletions of integral copies of a 3.3 kb tandemly repeated unit. Human Molecular Genetics , 2, 2037{2042. Wang, Y.-H., Gellibolian, R., Shimizu, M., Wells, R. D. & Grifth, J. D. (1996). Long CCG triplet repeat blocks exclude nucleosomes|a possible mechanism for the nature of fragile sites in chromosomes. J. Mol. Biol., 263, 511{516. Wang, Y.-H. & Grith, J. D. (1994). Preferential nucleosome assembly at DNA triplet repeats from the myotonic dystrophy gene. Science , 265, 669{671. Wang, Y.-H. & Grith, J. D. (1995). Expanded CTG triplet blocks from the myotonic dystrophy gene create the strongest known natural nucleosome positioning elements. Genomics , 25, 570{573. Wells, R. D. (1996). Molecular basis of triplet repeat diseases. J. Biol. Chem., 271, 2875{2878. Werner, M. H. & Burley, S. K. (1997). Architectural transcription factors: proteins that remodel DNA. Cell , 88, 733{736. Winokur, S. T., Bengtsson, U., Feddersen, J., Mathews, K. D., B.Weienbach, Bailey, H., Markovich, R. P., Murray, J. C., Wasmuth, J. J., Altherr, M. R. & Schutte, B. C. (1994). The DNA rearrangement associated with facioscapulohumeral muscular dystrophy involves a heterochromatin-associated repetitive element: implications for a role of chromatin structure in the pathogenesis of the disease. Chromosome Research , 2, 225{234. Wole, A. P. & Drew, H. R. (1995). 
DNA structure: implications for chromatin structure and function. In Elgin, S. C. R., (ed.) Chromatin structure and gene expression . IRL Press, Oxford, pp. 27{48. Wole, A. P. & Matzke, M. A. (1999). Epigenetics: regulation through repression. Science , 286, 481{486. Zhu, Z. & Thiele, D. J. (1996). A specialized nucleosome modulates transcription factor access to a c. glabrata metal responsive promoter. Cell , 87, 459{470. 25