An Alternative Method for RNASeq …

Step One
• Assemble reads into transcripts with any of several transcript assemblers
  – Trinity
  – SOAPdenovo-Trans
  – Oases
  – Newbler

Step Two
• Map transcripts to the reference genome with a spliced mapping tool
  – Requires GFF output
  – Requires the ability to map longer sequences
  – GeneSeqer-5.0 App maps using splicing models for different genomes
  – GMAP does spliced mapping

Step Three
• For transcripts from genomes with a reasonable reference sequence and annotation, add that annotation to the de novo transcripts where appropriate
  – Annotate transcripts: a new App in the DE
  – Matches exons to exons
  – Transfers feature IDs where exons match
  – Retains transcript names where there are no matches

Step Four
• Map reads to the genome with a spliced mapper
  – GSNAP
  – TopHat

Step Five
• Compare the BAM files from read mapping against the annotated transcripts (GFF format) and count hits against the exons using Bam to Counts (general)
• Each hit can be scored against whatever feature IDs are part of the transcript GFF file
• Output is counts against each gene, transcript, or exon, in tabular file format

Bam to Counts Output

Step Six
• Combine count files with Join multiple tab-delimited files
• Count the combined data and determine differential expression with EdgeR with Fisher’s Exact Test

Join multiple tab-delimited files output

EdgeR Output

How does this pipeline work?
• BAM files contain the mapping of reads to the genome
• Genome features are mapped to the genome; their locations are stored in the annotation.gff file and can be transferred to the transcripts
• The overlap of each read with features in the annotation is determined by Bam to Counts

SAM format

Col  Field  Type    Regexp/Range              Brief description
1    QNAME  String  [!-?A-~]{1,255}           Query template NAME
2    FLAG   Int     [0, 2^16-1]               bitwise FLAG
3    RNAME  String  \*|[!-()+-<>-~][!-~]*     Reference sequence NAME
4    POS    Int     [0, 2^29-1]               1-based leftmost mapping POSition
5    MAPQ   Int     [0, 2^8-1]                MAPping Quality
6    CIGAR  String  \*|([0-9]+[MIDNSHPX=])+   CIGAR string
7    RNEXT  String  \*|=|[!-()+-<>-~][!-~]*   Ref. name of the mate/next segment
8    PNEXT  Int     [0, 2^29-1]               Position of the mate/next segment
9    TLEN   Int     [-2^29+1, 2^29-1]         observed Template LENgth
10   SEQ    String  \*|[A-Za-z=.]+            segment SEQuence
11   QUAL   String  [!-~]+                    ASCII of Phred-scaled base QUALity+33

GFF file format

Fields must be tab-separated. Also, all but the final field in each feature line must contain a value; "empty" columns should be denoted with a '.'.

• seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix.
• source - name of the program that generated this feature, or the data source (database or project name).
• feature - feature type name, e.g. Gene, Variation, Similarity.
• start - start position of the feature, with sequence numbering starting at 1.
• end - end position of the feature, with sequence numbering starting at 1.
• score - a floating point value.
• strand - defined as + (forward) or - (reverse).
• frame - one of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
• attribute - a semicolon-separated list of tag-value pairs, providing additional information about each feature.
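
To make the two record layouts above concrete, here is a minimal Python sketch that splits one SAM alignment line into its 11 mandatory fields, derives the rightmost reference position from the CIGAR string, and parses a GTF-style attribute column of the `tag "value";` form. The names `SamRecord`, `parse_sam_line`, `reference_end`, and `parse_attributes` are illustrative assumptions and are not part of any App in the DE; real pipelines would normally rely on an established library rather than hand-rolled parsing.

```python
import re
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The 11 mandatory SAM fields, in column order."""
    qname: str
    flag: int
    rname: str
    pos: int    # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line: str) -> SamRecord:
    """Split a tab-separated SAM line; optional tags after column 11 are ignored."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


CIGAR_OP = re.compile(r"(\d+)([MIDNSHPX=])")


def reference_end(pos: int, cigar: str) -> int:
    """1-based inclusive end on the reference; M, D, N, = and X consume reference bases."""
    span = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MDN=X")
    return pos + span - 1


def parse_attributes(column: str) -> dict:
    """Parse GTF-style attributes such as: gene_id "ATMG00010"; exon_number "1";"""
    return dict(re.findall(r'(\S+) "([^"]*)"', column))


# Hypothetical usage with a spliced read shaped like the ones in the BAM sample;
# the whitespace-separated text is re-joined with tabs to mimic a real SAM line.
fields = ("SRR070572.869731 16 1 7953 50 35M248N6M * 0 0 "
          "AGAAGATAGGCAAAGACAAACTTCCACAACAGATGCTGAAT "
          "########################>7BB>BBBBB=BBBBCA").split()
rec = parse_sam_line("\t".join(fields))
end = reference_end(rec.pos, rec.cigar)   # 7953 + (35 + 248 + 6) - 1 = 8241
attrs = parse_attributes('gene_id "ATMG00010"; transcript_id "ATMG00010.1";')
```

The `N` operation in a CIGAR string marks a skipped region on the reference (an intron for a spliced read), which is why it counts toward the reference span even though no read bases align there.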
Bam to Counts (general)
• Open the bam folder of the TopHat Output
• Community Data/iplantcollaborative/genomeservices/builds/0.2.1/de_support/Arabidopsis_thaliana.TAIR10/annotation.gff
• Select the feature to count

Join multiple tab-delimited files

EdgeR with Fisher’s Exact Test

Sample of Hy5_1 Bam file contents
• SRR070572.8333644 0 1 3727 50 41M * 0 0 CGGAAACCATTGAAATCGGACGGTTTAGTGAAAATGGAGGA BBCBCCBB@BCBABACBB?@@@B=BB@B7B@?A?BB@A@B@ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:41 YT:Z:UU XS:A:+ NH:i:1
• SRR070572.869731 16 1 7953 50 35M248N6M * 0 0 AGAAGATAGGCAAAGACAAACTTCCACAACAGATGCTGAAT ########################>7BB>BBBBB=BBBBCA AS:i:-2 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:11G5C23 YT:Z:UU XS:A:- NH:i:1
• SRR070572.8944108 0 1 8247 50 41M * 0 0 CAGTTGCTGGATTAATTGCATTGTAGAGGACGTGTCTATAT BCCBCCBCCBCCCCBCCBCCCC@>C@BBBBBB8A@@BCBBA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:41 YT:Z:UU XS:A:- NH:i:1
• SRR070572.8381546 0 1 8301 50 25M91N16M * 0 0 GAAGGATTAAATCGATGAAAATAATCATGCGTTCACACTCG BAABBBCCBCBCABBCBB@ABBB@B>BAA>>>@AA@A@BAB AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:41 YT:Z:UU XS:A:- NH:i:1

Sample of annotation.gff
• Mt protein_coding exon 273 734 . . gene_id "ATMG00010"; transcript_id "ATMG00010.1"; exon_number "1"; gene_name "ORF153A"; transcript_name "ATMG00010.1"; seqedit "false";
• Mt protein_coding CDS 276 734 . 0 gene_id "ATMG00010"; transcript_id "ATMG00010.1"; exon_number "1"; gene_name "ORF153A"; transcript_name "ATMG00010.1"; protein_id "ATMG00010.1";
• Mt protein_coding start_codon 732 734 . 0 gene_id "ATMG00010"; transcript_id "ATMG00010.1"; exon_number "1"; gene_name "ORF153A"; transcript_name "ATMG00010.1";
• Mt protein_coding stop_codon 273 275 . 0 gene_id "ATMG00010"; transcript_id "ATMG00010.1"; exon_number "1"; gene_name "ORF153A"; transcript_name "ATMG00010.1";

So how does it work?
• BAM files map reads to their chromosomal locations
• GFF annotation files map gene features (exons, transcripts, CDS, etc.) to their chromosomal locations
• Bam to Counts (general) finds the overlap between the reads and the gene features
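
The overlap idea in the three points above can be sketched in a few lines of Python. This is not the implementation of Bam to Counts (general), only an illustration under stated assumptions: reads are given as (chromosome, position, CIGAR) tuples, features as (chromosome, start, end, feature ID) tuples with 1-based inclusive coordinates, and a read is counted once per feature ID no matter how many of that feature's intervals it touches. The function names and the exon coordinates for `geneA` and `geneB` are hypothetical; only the read positions and CIGAR strings are taken from the Hy5_1 sample.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHPX=])")


def aligned_blocks(pos, cigar):
    """Return 1-based inclusive (start, end) reference blocks, split at N (intron) gaps."""
    blocks = []
    start = cur = pos
    for n, op in CIGAR_OP.findall(cigar):
        n = int(n)
        if op in "MD=X":          # these operations consume reference bases
            cur += n
        elif op == "N":           # skipped region: close the block, jump the gap
            if cur > start:
                blocks.append((start, cur - 1))
            cur += n
            start = cur
        # I, S, H and P do not advance the reference position
    if cur > start:
        blocks.append((start, cur - 1))
    return blocks


def count_hits(reads, features):
    """Count, per feature ID, the reads with at least one block overlapping that feature."""
    counts = Counter()
    for rname, pos, cigar in reads:
        hit_ids = set()
        for b_start, b_end in aligned_blocks(pos, cigar):
            for seqname, f_start, f_end, feature_id in features:
                if seqname == rname and b_start <= f_end and f_start <= b_end:
                    hit_ids.add(feature_id)
        counts.update(hit_ids)    # each read adds at most 1 to any feature ID
    return counts


# Read positions and CIGAR strings from the Hy5_1 sample (chromosome "1").
reads = [("1", 3727, "41M"), ("1", 7953, "35M248N6M"),
         ("1", 8247, "41M"), ("1", 8301, "25M91N16M")]

# Hypothetical exon intervals, invented for illustration only.
features = [("1", 3631, 3913, "geneA"),
            ("1", 7942, 7987, "geneB"),
            ("1", 8236, 8325, "geneB")]

counts = count_hits(reads, features)
```

Splitting spliced reads at `N` gaps keeps a read from being counted against a feature that lies entirely inside an intron it spans. The nested loop is quadratic in reads times features, which is fine for a sketch; a tool working on full BAM files would more plausibly use a sorted or indexed interval structure.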