Genome Informatics 13: 173–182 (2002)

Comparative Analysis of the Genomes of Cyanobacteria and Plants

Naoki Sato
Department of Molecular Biology, Saitama University, Shimo-Ohkubo 255, Saitama 338-8570, Japan

Abstract

The chloroplast genome originates from the genome of an ancestral cyanobacterial endosymbiont. Comparison of the genomes of cyanobacteria and plants has been made possible by advances in genome sequencing. I report here current results of our computational efforts to compare the genomes of cyanobacteria and plants and to trace the evolution of cyanobacteria, chloroplasts and plants. Cyanobacteria form a clearly defined monophyletic clade with a reasonable level of diversity and are ideal for testing various approaches to genome comparison. Analysis of short sequence features such as the genome signature was found to be useful in characterizing cyanobacterial genomes. Genome contents were compared by homology grouping of predicted protein coding sequences, rather than by orthologue-based comparison, to minimize the effects of multidomain proteins and large protein families, both of which are important in cyanobacterial genomes. Comparison of the genomes of six species of cyanobacteria suggests that there are a number of species-specific additions of protein genes, and this information is useful in reconstructing phylogenetic relationships. The homology groups in cyanobacteria were used as a reference to compare plants and non-photosynthetic organisms. The results suggest that 238 groups that are common to all organisms analyzed may define a minimal set of gene groups. In addition, only 80 groups are identified as gene groups that could not have been acquired by plants without cyanobacterial endosymbiosis. Further study is needed to identify plant genes of cyanobacterial origin.
Keywords: chloroplast, cyanobacterial evolution, comparative genomics, genome signature, homology group, plant genomics

1 Introduction

The chloroplast genome is believed to originate from the genome of an ancestor of extant cyanobacteria [21, 22, 25]. However, the situation is far more complicated than simple endosymbiosis. Since the smallest genome among present-day cyanobacteria is about 1.7 Mbp (Prochlorococcus marinus MED4), the original genome of the endosymbiont may have been of this order. In contrast, all known chloroplast genomes are between 80 and 200 kbp in size. The chloroplast genome encodes genes for rRNA and tRNA, as well as genes for photosystem proteins, subunits of ATP synthase, ribosome subunits, etc. The large difference in size between chloroplast genomes and cyanobacterial genomes is partly due to the vast loss of genes from the original endosymbiont genome early in the evolution of the chloroplast. Some of the endosymbiont genes were not simply lost but were transferred to the cell nucleus [1, 16]. These nuclear genes encode proteins that are targeted to the chloroplast. Each of these proteins is synthesized with a transit peptide at the N-terminus, which enables it to be recognized by the protein transport machinery, now called the translocon, located in the chloroplast envelope membrane. The nuclear-encoded chloroplast proteins include many proteins involved in photosynthesis as well as in gene expression within the chloroplast. In addition, the plant nuclear genome also includes genes that were transferred from the mitochondrial genome. There are cases in which an originally mitochondrial enzyme encoded in the nuclear genome is also targeted to the chloroplast. The T7-phage type RNA polymerase is a typical example [12].

Comparative genomics is a powerful tool for analyzing the relationships among different organisms.
The sequence of rRNA has been used to estimate the phylogeny of various organisms including cyanobacteria [9], but whole-genome data are expected to give a more reliable phylogeny [3, 6]. Different ways of using whole-genome data have been reported, such as genome signature analysis that uses dinucleotide relative abundance [4, 15], genome alignment [27] and comparison of genome content [23], among others. There are two major approaches to comparing genome content. One approach, which uses orthologues [23, 24, 27], is straightforward and easy to understand, but orthologues are usually difficult to define because of the presence of protein families as well as multi-domain proteins. Another approach, which uses homology groups [10] rather than individual genes, can tolerate the problem of gene families. The problem of multi-domain proteins (see below) is still to be solved.

I report here an initial analysis of the relationship between cyanobacteria and chloroplasts or plants. The traditional approach has been to enumerate enzymes or genes encoded in the plant nucleus or chloroplasts that are more similar to their counterparts in cyanobacteria than to their counterparts in eukaryotes or other bacteria. By this manual comparison, the core subunits and sigma factors of the chloroplast RNA polymerase have been demonstrated to be of cyanobacterial origin. Most of the ribosomal proteins, molecular chaperones, subunits of the translocon, and enzymes for pigment synthesis have also been shown to originate from their counterparts in the endosymbiont. However, this approach is not powerful enough to enumerate all the possible homologues. One problem is that previous efforts of this kind used only Synechocystis sp. PCC 6803 as a representative of cyanobacteria and Arabidopsis thaliana as a representative of plants [1, 16]. As I show below in this article, each cyanobacterial genome includes many species-specific genes.
We should first define the genes that are common to all cyanobacteria, and then compare them with the plant genes. In this comparison, we should also use various genomes of non-photosynthetic organisms as a negative control. The second problem is that there are many paralogues in the genomes of cyanobacteria and plants. This makes it difficult to identify orthologues between cyanobacteria and plants. In addition, there are a number of multi-domain proteins in cyanobacteria, especially in Anabaena sp. 7120 [19] (also called Nostoc sp. PCC 7120). Multi-domain proteins are especially abundant in filamentous strains of cyanobacteria, such as Anabaena sp. 7120 and Nostoc punctiforme PCC 73102 [17]. These proteins are difficult to classify, and no direct orthologues of these proteins can be found in other species of cyanobacteria or plants. In the homology group method, the multi-domain proteins are hidden within large homology groups. This problem should be solved in the future.

In the present report, I describe the results of an analysis of short sequence features as well as an analysis of homology groups in cyanobacteria. The homology group method was extended to compare cyanobacteria and plants.

2 Methods

2.1 Database

The following database sequences were used. Cyanobacteria: Synechocystis sp. PCC 6803 (Sy), GenBank AB001339 and Cyanobase Synecho.p.aa and Synecho.p.table [13]; Anabaena sp. 7120 (An), GenBank BA000019 and Cyanobase chromo.p.aa and chromo.p.table [14]; Nostoc punctiforme PCC 73102 (Np), JGI Web site (microbe4) [17]; Prochlorococcus marinus MED4 (Pm1), JGI Web site (microbe2); Prochlorococcus marinus MIT9313 (Pm2), JGI Web site (microbe2b); Synechococcus sp. WH8102 (S81), JGI Web site (2351364). The Synechocystis database was recently reannotated, but the data used in the present study were those from the initial release. The DNA sequence has remained unchanged since the first release.
The annotation of open reading frames has been updated recently, but there are no large changes in the open reading frames. The databases (finished contigs and draft CDS translations) of JGI [29] were retrieved in January 2002. Plant: Arabidopsis thaliana [26] all predicted protein sequences (Ath), faa file from the NCBI genome database site (mirrored in GenomeNet [28] under /pub/db/ncbi/genbank/genomes/A_thaliana/). Other organisms: Escherichia coli K-12 (Ec), GenBank U00096; Bacillus subtilis 168 (Bs), GenBank AL009126; Helicobacter pylori 26695 (Hp), GenBank AE000511; Saccharomyces cerevisiae all predicted protein sequences (Sc), *.faa and *.ptt files from the NCBI genome database site (mirrored in GenomeNet [28] under /pub/db/ncbi/genbank/genomes/S_cerevisiae/). Other draft bacterial sequences (Rhodopseudomonas palustris and Rhodobacter sphaeroides) were also retrieved from the FTP site of JGI [29]. In the following text, the genus names in roman type are used to denote the organisms listed above. The two- or three-character symbols such as Sy or Ath are also used where appropriate, for example in the tables and figures.

2.2 Basic Computational Analysis

All sequence manipulations and the analysis of short sequence features, including palindromes (command ‘sites’) and GC skew (command ‘genlist’), were done with the SISEQ package [20], version 130pre24. The dinucleotide relative abundance or genome signature analysis [4, 15] was performed with a separate program, ‘dinucf’. Phylogenetic analysis was performed with the Phylip package [7]. Most of the calculations were done on a Linux workstation with an Alpha 21264 processor and on MacOS X workstations with G4 processors.

2.3 Homology Group Analysis

Homology group analysis was based essentially on BLAST2 (version 2.2.1) scores.
All the predicted amino acid sequences of the three species of cyanobacteria (Sy, An and Np) were compared with one another by the BLASTP program [2] using a cutoff E value of 10^-8. The results were processed by PERL scripts to obtain homology groups. In this method, a protein joins a group if it is similar to at least one member of that group. Because of the presence of various multi-domain proteins, the largest group is very large and includes various unrelated proteins. However, this does not seriously influence the genome comparison, as described below. A similar analysis was also performed with all six species of cyanobacteria. Then, the proteins of the three species of cyanobacteria were searched for homologues among all the predicted protein sequences of the other three species of cyanobacteria (Pm1, Pm2, and S81), Ec, Bs, Sc (including the mitochondrial genome), and Ath (including the mitochondrial and chloroplast genomes) individually. The results were compiled according to the groups established with the three cyanobacterial sequences.

Table 1: Low and high frequency sequences in the genomes of Anabaena and selected bacteria. Obs/Exp is the ratio of the observed to the expected number of occurrences of each sequence. The expected value was estimated from the base composition of the sequence and the GC content of the genome. Dinucleotide relative abundance (DRA) was not considered in the estimation of expected values. An, Anabaena sp. PCC 7120; Np, Nostoc punctiforme; Sy, Synechocystis sp. PCC 6803; Hp, Helicobacter pylori; Ec, Escherichia coli K-12; Pm1, Prochlorococcus marinus MED4. RE, restriction enzyme. These values were calculated by the SISEQ package [20].
Low frequency RE sites (Obs/Exp):

Genome  Size (kbp)  GC%   AvaI (CYCGRG)  AvaII (GGWCC)  AvaIII (ATGCAT)
An      6,413       41.4  0.006          0.031          0.011
Np      9,216       41.5  0.113          0.088          0.366
Sy      3,573       47.7  0.534          0.418          0.243
Hp      1,667       38.9  0.255          0.130          0.639
Ec      4,639       50.8  0.259          0.303          0.765
Pm1     1,674       30.9  0.601          1.364          0.813

Other RE sites (Obs/Exp <0.05) and high-frequency palindromes (octamer sequence, Obs/Exp). The assignment of these entries to rows is not recoverable from this copy, so they are listed in their original order: AhaII (NarI, AatII), GCGATCGC 122.1; ApaLI, AvaIVP, BglII, GCGATCGC 120.7; BssHII, MluI; GCGATCGC; ScaI, XhoI, AatII, KpnI, GCGATCGC 70.1; AvrII, XbaI; (none); none; GCTGCAGC; ApaLI, AvaIVP, AvrII, BamHI, BanII (SacI), BglII, BsiWI (= SplI), BstBI (= AsuII), FspI, NcoI, NspI, PmlI, PstI, SacII, SalI, SphI; NcoI, NdeI, SacI; SalI, HpaI, SnaBI; 10.4; 13.3.

Table 2: Genome signature profile of photosynthetic prokaryotes and selected bacteria. Bold: >1.2; Underline: <0.85. Dinucleotide relative abundances (DRA) were calculated by the dinucf program according to the formulae of Karlin and Cardon [15]. The distance between a pair of genomes was calculated as the mean of the absolute differences in the 16 DRA values [4]. An, Anabaena sp. PCC 7120; Np, Nostoc punctiforme; Sy, Synechocystis sp. PCC 6803; Pm1, Prochlorococcus marinus MED4; S81, Synechococcus sp. WH8102; Bs, Bacillus subtilis; Ec, Escherichia coli K-12; Rp, Rhodopseudomonas palustris; Rs, Rhodobacter sphaeroides.
Dinucleotide relative abundances:

Species  GC    CG    AT    TA    TT/AA  CC/GG  TC/GA  CT/AG  TG/CA  AC/GT
An       1.16  0.78  0.93  0.84  1.15   1.11   0.92   0.99   1.09   0.89
Np       1.24  0.81  0.92  0.81  1.17   1.05   0.94   1.02   1.09   0.86
Sy       1.02  0.75  1.00  0.75  1.32   1.36   0.86   0.85   1.05   0.79
Pm1      1.17  0.51  0.92  0.79  1.17   1.28   1.08   1.09   1.00   0.72
S81      1.10  0.87  1.13  0.43  1.09   0.95   1.08   1.00   1.25   0.85
Bs       1.27  1.04  1.02  0.65  1.24   0.97   1.06   0.91   1.08   0.75
Ec       1.28  1.16  1.10  0.75  1.21   0.91   0.92   0.82   1.12   0.88
Rp       1.20  1.31  1.43  0.44  1.08   0.75   1.24   0.87   1.02   0.86
Rs       1.12  1.16  1.57  0.39  0.99   0.85   1.31   0.99   0.97   0.75

Distance (x1000):

      An   Np   Sy   Pm1  S81  Bs   Ec   Rp
Np    30
Sy    115  128
Pm1   109  109  135
S81   123  113  199  173
Bs    115  92   143  141  109
Ec    107  95   151  202  136  82
Rp    217  203  263  252  147  158  154
Rs    229  220  285  234  153  169  202  93

Figure 1: Phylogenetic tree estimated from dinucleotide relative abundance (DRA) distances. This tree was constructed by the neighbor program of the Phylip package.

3 Results and Discussion

3.1 Short Sequence Features of Cyanobacterial Genomes

As an initial characterization of the cyanobacterial genomes, their short sequence features were analyzed. The frequencies of short palindromic sequences such as restriction sites are summarized in Table 1. The Anabaena genome is virtually the only bacterial genome known to date in which many restriction sites are unusually underrepresented. There are only 18 AvaI sites in the whole genome, which is only 0.006 times the expected number. A number of other restriction sites are also highly underrepresented. I analyzed more than 40 bacterial genomes, but only Anabaena, Nostoc and Helicobacter showed this peculiarity. However, Nostoc has many more AvaI and AvaIII sites than does Anabaena. In contrast, Anabaena, Nostoc and Synechocystis share an identical high frequency palindrome, GCGATCGC. Again, this is the only palindrome of such high frequency found so far in bacterial genomes.
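The Obs/Exp ratios of Table 1 can be sketched as follows. This is a minimal illustration of the calculation described in the Table 1 caption (expected counts from the base composition of the site and the GC content of the genome, without any dinucleotide correction); it is not the SISEQ implementation. The function names, the restriction to the IUPAC symbols R, Y and W, and the toy sequence are assumptions made for this example.

```python
import re

# IUPAC symbols needed for the sites in Table 1 (R = A/G, Y = C/T, W = A/T).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT", "W": "AT"}


def site_probability(site, gc):
    """Probability of `site` at one position, assuming independent bases with
    P(G) = P(C) = gc/2 and P(A) = P(T) = (1 - gc)/2 (an assumed null model)."""
    p_base = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    p = 1.0
    for symbol in site:
        p *= sum(p_base[b] for b in IUPAC[symbol])
    return p


def observed_count(site, seq):
    """Count matches of `site` in `seq`, including overlapping matches."""
    pattern = "".join("[" + IUPAC[s] + "]" for s in site)
    return len(re.findall("(?=" + pattern + ")", seq))


def obs_exp(site, seq):
    """Observed/expected ratio of `site` in an uppercase DNA string `seq`."""
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    expected = site_probability(site, gc) * (len(seq) - len(site) + 1)
    return observed_count(site, seq) / expected


if __name__ == "__main__":
    toy = "GCGATCGC" + "ATGCTAGCTTAGGCAT" * 4  # hypothetical toy sequence
    print(obs_exp("GCGATCGC", toy))
```

Because the sites considered are palindromes, counting on one strand is sufficient. As a rough check against Table 1, `site_probability("CYCGRG", 0.414) * 6.413e6` gives about 2,900 expected AvaI sites for a genome of 6,413 kbp and 41.4% GC, so 18 observed sites correspond to an Obs/Exp of about 0.006.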
This palindrome is found in both coding and non-coding sequences, and the reason for its high frequency is still not clear. In E. coli and other bacteria, no palindromic sequence was found that occurs at a frequency higher than expected from random combination.

Table 2 shows the results of dinucleotide relative abundance (DRA) analysis of cyanobacteria, photosynthetic bacteria, and selected other bacteria. The DRA values (or genome signatures) are largely similar among cyanobacteria, but Synechococcus is divergent. This is clearly depicted in Fig. 1, a phylogenetic tree drawn using the differences in DRA as distances. Although DRA cannot be used as a marker of phylogenetic relationship in a strict sense, some relationship between phylogeny and DRA is confirmed.

I have analyzed the GC skew of various cyanobacterial genomes as another sequence feature. The cumulative GC skew - AT skew diagram has been used successfully to predict the replication origin of some bacteria [8]. The GC skew values are generally very small throughout the completely sequenced genomes of Anabaena and Synechocystis. Therefore, no clear-cut turning point in the cumulative GC skew - AT skew diagram was detected in these cyanobacteria. Currently, we have no means of identifying the replication origin of these cyanobacterial genomes. However, there are some regions with discontinuous GC skew values (results not shown). These regions are potential sites of recent horizontal gene transfer.

3.2 Homology Group Analysis of Cyanobacterial Genomes

As an initial attempt to compare the genome contents of cyanobacteria, all the predicted protein sequences of Anabaena, Nostoc and Synechocystis were grouped according to the BLASTP score, and the resulting groups are shown in Fig. 2A. It is clear that a large number of groups (1,343) are shared by the three species of cyanobacteria.
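The grouping step and the tally behind a diagram such as Fig. 2A can be sketched as follows. This is only an illustration under stated assumptions, not the PERL scripts used in the study: proteins are linked whenever a pairwise hit passes the E-value cutoff, groups are the resulting connected components (computed here with a union-find structure, which is my choice), and each group is then labelled by the set of species it contains. The input format, the function names and the use of `<=` at the cutoff are hypothetical.

```python
from collections import Counter, defaultdict


def homology_groups(proteins, hits, max_evalue=1e-8):
    """Group proteins by single linkage.

    proteins: dict mapping protein id -> species symbol (e.g. "An").
    hits: iterable of (query_id, subject_id, evalue) tuples (assumed format).
    Returns a list of sets of protein ids.
    """
    parent = {p: p for p in proteins}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, evalue in hits:
        if evalue <= max_evalue and query in parent and subject in parent:
            parent[find(query)] = find(subject)

    groups = defaultdict(set)
    for p in proteins:
        groups[find(p)].add(p)
    return list(groups.values())


def tally_by_species(groups, proteins):
    """Count groups by the set of species represented in each group."""
    return Counter(frozenset(proteins[p] for p in g) for g in groups)


if __name__ == "__main__":
    # Hypothetical toy data: five proteins from three species.
    proteins = {"a1": "An", "a2": "An", "n1": "Np", "s1": "Sy", "s2": "Sy"}
    hits = [("a1", "n1", 1e-20), ("n1", "s1", 1e-10), ("a2", "s2", 1e-3)]
    groups = homology_groups(proteins, hits)
    for species, count in tally_by_species(groups, proteins).items():
        print(sorted(species), count)
```

In the toy data, a1 and s1 end up in the same group only through n1. This chaining is what lets multi-domain proteins merge otherwise unrelated proteins into one very large group, as noted in the Methods.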
What is more surprising is that each genome contains species-specific groups, and the number of species-specific groups amounts to 20–30% of the total groups in each genome. Another point is that as many as 827 groups are shared by Anabaena and Nostoc, and these groups are presumably related to the ability of these two species of cyanobacteria to differentiate heterocysts, akinetes, and hormogonia.

This three-species comparison was then extended to include six species (Tables 3, 4, 5). Inclusion of the three additional species did not change the essential points described above. In each species of cyanobacteria, about 17–40% of the groups are unique. A close examination of the members of these groups suggests that some of the proteins are similar to predicted proteins of other bacteria. This indicates that at least some of the species-specific protein groups originate from genes imported by horizontal gene transfer [18]. Horizontal gene transfer has been identified mostly for genes of clearly defined function, but the data of the present study suggest that horizontal gene transfer might be more ubiquitous. However, it is not clear whether all of the species-specific additions were generated solely by horizontal gene transfer. Since there is an almost unlimited number of unknown bacteria in the world, horizontal gene transfer from these unidentified species could account for the species-specific additions, but this will be difficult to demonstrate because the number of unidentified bacterial species cannot be assessed exactly, and further work is needed to analyze the species-specific additions. Similar species-specific additions are also documented in various pairs of closely related bacteria [11].

The relationships among the six species of cyanobacteria are more clearly understood if the homology groups are used to construct a phylogenetic tree by the parsimony method (Fig. 3).
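One way to prepare homology groups for such an analysis is to treat each group as a binary character (present or absent in each species). The following sketch builds such a presence/absence matrix and computes the fraction of species-specific groups. It is an illustration only: the exact character coding used for Fig. 3 is not described in the text, and the function names and toy data are hypothetical.

```python
def presence_matrix(group_species, species_order):
    """Encode groups as binary characters.

    group_species: list of sets, each holding the species present in one group.
    Returns a dict mapping species -> string of '1'/'0', one character per group.
    """
    return {
        sp: "".join("1" if sp in g else "0" for g in group_species)
        for sp in species_order
    }


def unique_fraction(group_species, species):
    """Fraction of the groups found in `species` that occur in no other species."""
    total = sum(1 for g in group_species if species in g)
    unique = sum(1 for g in group_species if g == {species})
    return unique / total


if __name__ == "__main__":
    # Hypothetical toy data: four groups over three species.
    groups = [{"An", "Np"}, {"An"}, {"Sy"}, {"An", "Np", "Sy"}]
    for sp, row in presence_matrix(groups, ["An", "Np", "Sy"]).items():
        print(sp, row)
    print(unique_fraction(groups, "An"))
```

The percentages in the "Additions" column of Table 3 appear to follow the same arithmetic: for Anabaena, 815 of 3,036 groups gives 26.8%, and for Nostoc, 1,436 of 3,665 gives 39.2%.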
Fortunately, the results obtained with two different cutoff E values of BLASTP coincided perfectly in tree shape, although the actual numbers of species-specific additions and losses are significantly different. In this analysis, Anabaena and Nostoc form a group sister to Synechocystis, and all these species are paraphyletic to the other three species. Curiously, the two strains of P. marinus are not monophyletic, as suggested by the analysis using the 16S rRNA sequences [9]. The two genera Prochlorococcus and Synechococcus differ in the composition of photosynthetic accessory pigments (divinyl chlorophyll a, chlorophyll b and phycobiliproteins). On the other hand, the two strains of P. marinus also differ in their habitats: MED4 is a high light-adapted strain that lives in a shallow zone of the sea, while MIT9313 is a low light-adapted strain that lives in the deep ocean, more than 100 m below the surface. The genome sizes, and therefore the gene contents, of these strains are also different. More data will be needed to understand the phylogenetic relationship of Prochlorococcus and Synechococcus.

Figure 2: A. Schematic diagram of gene groups in Anabaena (An), Nostoc (Np) and Synechocystis (Sy). Similarity among the 5,381 predicted protein sequences (PPS) of An, 3,173 PPS of Sy, and 7,445 PPS of Np was scored by the blastp program, and the PPSs were clustered with a cutoff value of 10^-8. The resulting groups were classified according to the presence of homologues in these three genomes: 834 gene groups (green area) were found in An but not in Sy or Np, 827 groups (blue area) were shared by An and Np but not by Sy, 67 groups (uncolored area) were shared by An and Sy but not by Np, 71 groups (uncolored area) were shared by Np and Sy but not by An, and 1,343 groups (pale green area) were common to the three organisms. The area of each field is approximately proportional to the number of groups shared by the various combinations of organisms. B. Comparison of cyanobacteria, Arabidopsis and non-photosynthetic organisms. The 1,343 groups shared by An, Np and Sy were further classified with respect to the presence of homologues in various other organisms. The 784 groups (enclosed by a blue line) were shared also with Prochlorococcus marinus MED4 (Pm1), Prochlorococcus marinus MIT9313 (Pm2) and Synechococcus sp. WH8102 (S8102). The 628 groups (enclosed by a green line) were shared also with the Arabidopsis thaliana (Ath) genomes (nuclear, plastid and mitochondrial). The 511 groups were shared by the six cyanobacteria and Arabidopsis. The 267 gene groups were shared also with Escherichia coli, Bacillus subtilis, and Saccharomyces cerevisiae (nuclear and mitochondrial genomes). A fairly large number of groups (762) were shared with at least one of these non-photosynthetic organisms. Finally, 238 groups shown in the central shaded area were shared by all of the organisms analyzed.

3.3 Comparison of Homology Groups in Cyanobacteria, Plants and Other Organisms

The homology groups of the three species of cyanobacteria (Anabaena, Nostoc and Synechocystis) were used as a reference to compare the presence of homology groups in plants and other organisms (Fig. 2B). Here, the groups common to the three species of cyanobacteria (1,343 groups) were further classified by the presence of homologues in the other three species of cyanobacteria (Pm1, Pm2 and S81), the plant Arabidopsis, or the non-photosynthetic microorganisms (Ec, Bs and Sc). All six species of cyanobacteria shared 784 groups, of which 511 groups are also shared by Arabidopsis. The 238 groups that are shared by all the organisms analyzed are supposed to form the core building blocks of all organisms.
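The classification summarized in Fig. 2B amounts to set operations on the presence profile of each reference group. The sketch below shows this with toy data. The organism symbols follow the text, but the category labels, the function name and the example profiles are assumptions made for illustration, and the fully universal groups are reported separately here for clarity.

```python
from collections import Counter

CYANO6 = {"An", "Np", "Sy", "Pm1", "Pm2", "S81"}
NON_PHOTO = {"Ec", "Bs", "Sc"}
PLANT = "Ath"


def classify(presence):
    """Label one reference group by the organisms in which it has a homologue.

    presence: set of organism symbols (assumed input format).
    """
    if not CYANO6 <= presence:
        return "not shared by all six cyanobacteria"
    if PLANT not in presence:
        return "six cyanobacteria, no Arabidopsis homologue"
    if not presence & NON_PHOTO:
        return "photosynthetic organisms only"
    if NON_PHOTO <= presence:
        return "shared by all organisms"
    return "also in some non-photosynthetic organisms"


if __name__ == "__main__":
    # Hypothetical presence profiles for four reference groups.
    profiles = [
        CYANO6 | {"Ath"},
        CYANO6 | {"Ath", "Ec", "Bs", "Sc"},
        CYANO6 | {"Ath", "Ec"},
        {"An", "Np", "Sy", "Ec"},
    ]
    for label, count in Counter(classify(p) for p in profiles).items():
        print(count, label)
```

With the counts reported in this section, the 511 groups shared by the six cyanobacteria and Arabidopsis split into 431 groups that also have a homologue in at least one non-photosynthetic organism and 80 groups that do not (431 + 80 = 511).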
The 431 groups that are shared by the six cyanobacteria plus Arabidopsis and at least one of the three non-photosynthetic organisms should be further classified by detailed phylogenetic analysis. It should be noted that there are only 80 groups that are shared by the six cyanobacteria and Arabidopsis but not by non-photosynthetic organisms. These groups include, as expected, many proteins involved in photosynthesis and pigment biosynthesis, but also a number of proteins of unknown function. Some of the photosynthesis genes are still encoded in the chloroplast genome, but many have been transferred to the nucleus in Arabidopsis (supplementary information will be available). A complete survey of the genes that have been transferred to the nucleus will be possible once further genome information on algae and plants becomes available.

The small number of protein groups that are specific to photosynthetic organisms (80 groups) suggests that these are the only proteins that could not have been gained by plants without cyanobacterial endosymbiosis. In other words, all other proteins in the 431 groups also shared by non-photosynthetic organisms could have been present in plants as basic homology groups, or basic building blocks of eukaryotic heritage. This does not mean that the proteins in these groups are not of cyanobacterial origin. Some of them are in fact of endosymbiont origin, but we can hypothesize that homologues of these proteins could have been provided by the eukaryotic host cell. A good example of this situation has been reported for glycolytic enzymes, some of which are of cyanobacterial origin while others are not [5]. Another example is the carbon fixation protein, ribulose-1,5-bisphosphate carboxylase/oxygenase (RuBisCO), which is essential in photosynthesis. The large subunit of this enzyme is also found in some chemoautotrophs as well as in organisms that are not normally associated with carbon fixation, such as B. subtilis.
In many cases, the genes encoding RuBisCO are supposed to have been horizontally transferred. This consideration suggests that the homology group method can shed light on the set of building blocks of an organism, and is suitable for comparing very diverse organisms.

4 Conclusion and Prospects

The current analysis includes the genomes of six cyanobacteria, a plant, two bacteria and a eukaryotic microorganism. In the near future, genomes of 15 species of cyanobacteria will be available for bioinformatic analysis. In this respect, cyanobacteria are the only group of organisms in which a detailed comparison of genome contents can be made. Many attempts to compare all known bacterial genomes have been reported [3, 6, 10, 23, 24, 27], but bacteria are highly divergent and it is not currently possible to draw meaningful conclusions from the comparison of all bacteria. In contrast, cyanobacteria form a well-defined monophyletic clade, but they are more divergent than pairs of strains of a single species (e.g., E. coli K12 and O157, or different strains of Helicobacter pylori). In this respect, cyanobacteria are ideal for genomic comparison, and will provide a good test case for future comparison of all bacteria or all organisms.

The present study also provides data on the comparison of cyanobacteria and plants. Attempts to compare plants and cyanobacteria and to trace the origin of plant proteins have used a single cyanobacterium (Synechocystis) and a single plant (Arabidopsis) [1]. As I described in the present report, each cyanobacterium has its own species-specific genes. It is important to compare all available cyanobacterial genomes with the genome of Arabidopsis, which is currently the only plant whose genomic sequence is available with fairly reliable annotation. However, the draft sequences of the genomes of two varieties of rice have become available. The genome sequencing of a unicellular rhodophyte, Cyanidioschyzon merolae, will be complete soon.
The EST projects of a macrophytic rhodophyte Porphyra yezoensis, a green alga Chlamydomonas reinhardtii, a moss Physcomitrella patens, as well as various legumes and cereals will provide important information on expressed genes. Use of these algal and plant sequences, as well as sequences of non-photosynthetic organisms including human, will shed light on the origin and formation of plant genomes.

Development of software for the comparison of genome contents is also continuing. In the present study, the largest cluster, which contains many unrelated proteins that are linked by various multidomain proteins, was not decomposed into subclusters and analyzed in detail. Another program is being developed that can classify this complex cluster. However, comparison of a large number of genomes is limited by computer resources such as the size of memory and the speed of the processor. The number of available genome sequences is increasing faster than computer resources are developing. We will need a method of calculation that can run on limited hardware resources but can accommodate the daily increase in genome data.

Table 3: Comparison of six cyanobacterial genomes. All of the predicted protein sequences of the six cyanobacteria were clustered by the homology group method.

Species                                Total genes  Total groups  Additions (groups)  Additions (%)  Losses (groups)  Groups shared by all six
Anabaena sp. 7120 (An)                 5,371        3,036         815                 26.8           7                2,412
Synechocystis sp. PCC 6803 (Sy)        3,167        1,957         453                 23.1           32               1,666
Nostoc punctiforme PCC 73102 (Np)      7,431        3,665         1,436               39.2           16               3,074
Prochlorococcus marinus MED4 (Pm1)     1,689        1,255         208                 16.6           114              1,157
Prochlorococcus marinus MIT9313 (Pm2)  2,195        1,580         311                 19.7           20               1,256
Synechococcus sp. WH8102 (S81)         2,425        1,699         396                 23.3           14               1,370
Total                                  22,278       6,220                                                             763 groups

Table 4: Groups shared by two species

Species pair        Number of common groups
An - Sy             40
An - Np             736
Sy - Np             54
Pm1 - Pm2           32
Sy - S81            11
Pm1 - S81           25
Pm2 - S81           105
Other combinations  0-7

Table 5: Other major shared groups

Species combination  Number of common groups
An - Sy - Np         353
An - Sy - Np - S81   41
Pm1 - Pm2 - S81      79

Acknowledgments

This study was supported in part by Grants-in-Aid for Scientific Research from the Japan Society for the Promotion of Science (JSPS) (nos. 13440234, 12874104), and a Grant-in-Aid for Scientific Research on Priority Areas (C) “Genome Biology” from the Ministry of Education, Science, Sports, Culture, and Technology.

[Figure 3: two tree panels, labelled "BLAST threshold = 1 x 10^-8" and "BLAST threshold = 1 x 10^-20"; taxa An, Np, Sy, Pm1, Pm2 and S81; scale bars, 500 changes.]

Figure 3: Genome phylogenetic tree constructed from the homology groups. To evaluate the validity of the method, two different threshold values of the BLAST E value were used, but the resulting trees were essentially identical in shape. These trees were constructed by the parsimony method with the PAUP program. For the abbreviated names of species, see Table 3.

References

[1] Abdallah, F., Salamini, F., and Leister, D., A prediction of the size and evolutionary origin of the proteome of chloroplasts of Arabidopsis, Trends Plant Sci., 5:141–142, 2000.
[2] Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J., Gapped BLAST and PSI-BLAST: A new generation of protein database search programs, Nucleic Acids Res., 25:3389–3402, 1997.
[3] Bansal, A.K.
and Meyer, T.E., Evolutionary analysis by whole-genome comparisons, J. Bacteriol., 184:2261–2272, 2002.
[4] Campbell, A., Mrázek, J., and Karlin, S., Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA, Proc. Natl. Acad. Sci. USA, 96:9184–9189, 1999.
[5] Canback, B., Andersson, S.G.E., and Kurland, C.G., The global phylogeny of glycolytic enzymes, Proc. Natl. Acad. Sci. USA, 99:6097–6102, 2002.
[6] Eisen, J.A., Assessing evolutionary relationships among microbes from whole-genome analysis, Curr. Opinion Microbiol., 3:475–480, 2000.
[7] Felsenstein, J., Phylogenies from molecular sequences: Inference and reliability, Annu. Rev. Genet., 22:521–565, 1988.
[8] Grigoriev, A., Analyzing genomes with cumulative skew diagrams, Nucleic Acids Res., 26:2286–2290, 1998.
[9] Honda, D., Yokota, A., and Sugiyama, J., Detection of seven major evolutionary lineages in cyanobacteria based on the 16S rRNA gene sequence analysis with new sequences of five marine Synechococcus strains, J. Mol. Evol., 48:723–739, 1999.
[10] House, C.H. and Fitz-Gibbon, S.T., Using homolog groups to create a whole-genomic tree of free-living organisms: An update, J. Mol. Evol., 54:539–547, 2002.
[11] Janssen, P.J., Audit, B., and Ouzounis, C.A., Strain-specific genes of Helicobacter pylori: distribution, function and dynamics, Nucleic Acids Res., 29:4395–4404, 2001.
[12] Kabeya, Y., Hashimoto, K., and Sato, N., Identification and characterization of two phage-type RNA polymerase cDNAs in the moss Physcomitrella patens: Implication of recent evolution of nuclear-encoded RNA polymerase of plastids in plants, Plant Cell Physiol., 43:245–255, 2002.
[13] Kaneko, T., Sato, S., Kotani, H., Tanaka, A., Asamizu, E., Nakamura, Y., Miyajima, N., Hirosawa, M., Sugiura, M., Sasamoto, S., Kimura, T., Hosouchi, T., Matsuno, A., Muraki, A., Nakazaki, N., Naruo, K., Okumura, S., Shimpo, S., Takeuchi, C., Wada, T., Watanabe, A., Yamada, M., Yasuda, M., and Tabata, S., Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions, DNA Res., 3:109–136, 1996.
[14] Kaneko, T., Nakamura, Y., Wolk, C. P., Kuritz, T., Sasamoto, S., Watanabe, A., Iriguchi, M., Ishikawa, A., Kawashima, K., Kimura, T., Kishida, Y., Kohara, M., Matsumoto, M., Matsuno, A., Muraki, A., Nakazaki, N., Shimpo, S., Sugimoto, M., Takazawa, M., Yamada, M., Yasuda, M., and Tabata, S., Complete genomic sequence of the filamentous nitrogen-fixing cyanobacterium Anabaena sp. strain PCC 7120, DNA Res., 8:205–213, 2001.
[15] Karlin, S. and Cardon, L.R., Computational DNA sequence analysis, Annu. Rev. Microbiol., 48:619–654, 1994.
[16] Martin, W., Stoebe, B., Goremykin, V., Hansmann, S., Hasegawa, M., and Kowallik, K.V., Gene transfer to the nucleus and the evolution of chloroplasts, Nature, 393:162–165, 1998.
[17] Meeks, J.C., Elhai, J., Thiel, T., Potts, M., Larimer, F., Lamerdin, J., Predki, P., and Atlas, R., An overview of the genome of Nostoc punctiforme, a multicellular, symbiotic cyanobacterium, Photosynth. Res., 70:85–106, 2001.
[18] Ochman, H., Lawrence, J.G., and Groisman, E.A., Lateral gene transfer and the nature of bacterial innovation, Nature, 405:299–304, 2000.
[19] Ohmori, M., Ikeuchi, M., Sato, N., Wolk, P., Kaneko, T., Ogawa, T., Kanehisa, M., Goto, S., Kawashima, S., Okamoto, S., Yoshimura, H., Katoh, H., Fujisawa, T., Ehira, S., Kamei, A., Yoshihara, S., Narikawa, R., and Tabata, S., Characterization of genes encoding multi-domain proteins in the genome of the filamentous nitrogen-fixing cyanobacterium Anabaena sp. strain PCC 7120, DNA Res., 8:271–284, 2001.
[20] Sato, N., SISEQ: Manipulation of multiple sequence and large database files for common platforms, Bioinformatics, 16:180–181, 2000.
[21] Sato, N., Was the evolution of plastid genetic machinery discontinuous?, Trends Plant Sci., 6:151–155, 2001.
[22] Simpson, C.L. and Stern, D.B., The treasure trove of algal chloroplast genomes, surprises in architecture and gene content, and their functional implications, Plant Physiol., 129:957–966, 2002.
[23] Snel, B., Bork, P., and Huynen, M.A., Genome phylogeny based on gene content, Nature Genet., 21:108–110, 1999.
[24] Snel, B., Bork, P., and Huynen, M.A., Genomes in flux: The evolution of archaeal and proteobacterial gene content, Genome Res., 12:17–25, 2002.
[25] Stoebe, B. and Maier, U.-G., One, two, three: nature’s tool box for building plastids, Protoplasma, 219:123–130, 2002.
[26] The Arabidopsis Genome Initiative, Analysis of the genome sequence of the flowering plant Arabidopsis thaliana, Nature, 408:796–815, 2000.
[27] Wolf, Y.I., Rogozin, I.B., Kondrashov, A.S., and Koonin, E.V., Genome alignment, evolution of prokaryotic genome organization, and prediction of gene function using genome context, Genome Res., 11:356–372, 2001.
[28] GenomeNet web site. http://www.genome.ad.jp/ and ftp://ftp.genome.ad.jp/
[29] Joint Genome Institute web site. http://spider.jgi-psf.org/JGI_microbial/html/