An Alternative Method for
RNASeq
…
Step One
• Assemble reads into transcripts with any of
several transcript assemblers
– Trinity
– SOAPdenovo-Trans
– Oases
– Newbler
Step Two
• Map transcripts to the reference genome with a
spliced mapping tool
– Requires GFF output
– Requires ability to map longer sequences
– GeneSeqer-5.0 App maps using splicing models
for different genomes
– GMAP does spliced mapping
Step Three
• For transcripts from genomes with a
reasonable reference sequence and
annotation, add that annotation to the de
novo transcripts where appropriate
– Annotate transcripts, a new App in the DE
– Matches exons to exons
– Transfers feature IDs where exons match
– Retains transcript names where there are no
matches
Step Four
• Map reads to the genome with a spliced
mapper
– GSNAP
– TopHat
Step Five
• Compare the BAM files from mapping the reads
with the annotated transcripts (GFF format) and
count hits against the exons using Bam to
Counts (general)
• Each hit can be scored against whichever
feature IDs are part of the transcript GFF file
• Output is counts against each gene, transcript,
or exon, in tabular file format
Bam to Counts Output
Step Six
• Combine count files with Join multiple tab-delimited files
• Use the combined counts to determine
differential expression with EdgeR with
Fisher’s Exact Test
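The join in Step Six can be pictured as merging per-sample count tables on a shared feature ID. Below is a minimal Python sketch, assuming each count file has two tab-separated columns (feature ID, count) and no header row. The function name `join_counts`, the sample name `WT_1`, and the counts are hypothetical illustrations, not the App's actual implementation or real data.

```python
import csv
import io

def join_counts(tables):
    """Merge {sample_name: file-like of 'feature<TAB>count' rows} on feature ID.

    Assumption of this sketch: a feature missing from a sample gets a count of 0.
    """
    samples = list(tables)
    merged = {}
    for name, handle in tables.items():
        for row in csv.reader(handle, delimiter="\t"):
            if not row:
                continue  # skip blank lines
            feature, count = row[0], int(row[1])
            merged.setdefault(feature, dict.fromkeys(samples, 0))[name] = count
    header = ["feature"] + samples
    rows = [[f] + [merged[f][s] for s in samples] for f in sorted(merged)]
    return header, rows

# Made-up count files for two samples (illustrative values only)
wt = io.StringIO("AT1G01010\t12\nAT1G01020\t7\n")
hy5 = io.StringIO("AT1G01010\t3\nAT1G01030\t5\n")

header, rows = join_counts({"WT_1": wt, "Hy5_1": hy5})
print("\t".join(header))
for r in rows:
    print("\t".join(map(str, r)))
```

The result is one row per feature and one column per sample, which is the shape a differential expression tool expects as input.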
Join multiple tab-delimited files output
EdgeR Output
How does this pipeline work?
• BAM files contain the mapping of reads to the
genome
• Genome features are mapped to the genome;
their locations are stored in the
annotation.gff file and can be transferred to
the transcripts
• The overlap of each read with features in the
annotation is determined by Bam to Counts
SAM format
Column -- Field -- Type -- Regexp/Range -- Brief description
1 -- QNAME -- String -- [!-?A-~]{1,255} -- Query template NAME
2 -- FLAG -- Int -- [0,2^16-1] -- bitwise FLAG
3 -- RNAME -- String -- \*|[!-()+-<>-~][!-~]* -- Reference sequence NAME
4 -- POS -- Int -- [0,2^29-1] -- 1-based leftmost mapping POSition
5 -- MAPQ -- Int -- [0,2^8-1] -- MAPping Quality
6 -- CIGAR -- String -- \*|([0-9]+[MIDNSHPX=])+ -- CIGAR string
7 -- RNEXT -- String -- \*|=|[!-()+-<>-~][!-~]* -- Ref. name of the mate/next segment
8 -- PNEXT -- Int -- [0,2^29-1] -- Position of the mate/next segment
9 -- TLEN -- Int -- [-2^29+1,2^29-1] -- observed Template LENgth
10 -- SEQ -- String -- \*|[A-Za-z=.]+ -- segment SEQuence
11 -- QUAL -- String -- [!-~]+ -- ASCII of Phred-scaled base QUALity+33
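To make the eleven mandatory columns concrete, here is a minimal Python sketch that splits one tab-separated SAM line into named fields and reads the optional TAG:TYPE:VALUE entries that follow them. The record is one of the sample reads from the Hy5_1 file; the names `SamRecord` and `parse_sam_line` are hypothetical helpers, and a real pipeline would more likely use an established SAM/BAM library.

```python
from typing import NamedTuple

class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: dict

def parse_sam_line(line):
    """Split one SAM alignment line into its 11 mandatory fields plus tags."""
    f = line.rstrip("\n").split("\t")
    tags = {}
    for tag in f[11:]:
        key, typ, value = tag.split(":", 2)
        tags[key] = int(value) if typ == "i" else value  # 'i' marks integer tags
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], tags)

line = "\t".join([
    "SRR070572.869731", "16", "1", "7953", "50", "35M248N6M", "*", "0", "0",
    "AGAAGATAGGCAAAGACAAACTTCCACAACAGATGCTGAAT",
    "########################>7BB>BBBBB=BBBBCA",
    "AS:i:-2", "XN:i:0", "XM:i:2", "XO:i:0", "XG:i:0",
    "NM:i:2", "MD:Z:11G5C23", "YT:Z:UU", "XS:A:-", "NH:i:1",
])

rec = parse_sam_line(line)
is_reverse = bool(rec.flag & 0x10)  # FLAG bit 0x10: read maps to the reverse strand
print(rec.qname, rec.rname, rec.pos, rec.cigar, "reverse" if is_reverse else "forward")
```

FLAG is a bit field, so individual properties are tested with a bitwise AND rather than by comparing the whole number.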
GFF file format
Fields must be tab-separated. Also, all but the final field in each feature line must
contain a value; "empty" columns should be denoted with a '.'
seqname - name of the chromosome or scaffold; chromosome names can be given
with or without the 'chr' prefix.
source - name of the program that generated this feature, or the data source
(database or project name)
feature - feature type name, e.g. Gene, Variation, Similarity
start - Start position of the feature, with sequence numbering starting at 1.
end - End position of the feature, with sequence numbering starting at 1.
score - A floating point value.
strand - defined as + (forward) or - (reverse).
frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first
base of a codon, '1' that the second base is the first base of a codon, and so on.
attribute - A semicolon-separated list of tag-value pairs, providing additional
information about each feature.
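The nine columns can be read with a few lines of Python. This is a minimal sketch, assuming the attribute column uses the `tag "value";` style seen in the annotation.gff sample; `parse_gff_line` is a hypothetical helper. The example line is the CDS record from that sample, and its strand value '-' is an assumption inferred from the start codon lying at the higher coordinates.

```python
import re

GFF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]

def parse_gff_line(line):
    """Parse one tab-separated feature line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(GFF_COLUMNS):
        raise ValueError("expected 9 tab-separated fields, got %d" % len(values))
    rec = dict(zip(GFF_COLUMNS, values))
    rec["start"] = int(rec["start"])
    rec["end"] = int(rec["end"])
    # Attribute column: semicolon-separated pairs of the form tag "value"
    rec["attribute"] = dict(re.findall(r'(\w+)\s+"([^"]*)"', rec["attribute"]))
    return rec

line = "\t".join([
    "Mt", "protein_coding", "CDS", "276", "734", ".", "-", "0",
    'gene_id "ATMG00010"; transcript_id "ATMG00010.1"; exon_number "1"; '
    'gene_name "ORF153A"; transcript_name "ATMG00010.1"; protein_id "ATMG00010.1";',
])

rec = parse_gff_line(line)
length = rec["end"] - rec["start"] + 1  # coordinates are 1-based and inclusive
print(rec["feature"], rec["attribute"]["gene_id"], length)
```

Because start and end are both 1-based and inclusive, the feature length is end - start + 1.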
Bam to Counts (general)
Open the bam folder of the TopHat output
Community Data/iplantcollaborative/genomeservices/builds/0.2.1/de_support/Arabidopsis_thaliana.TAIR10/annotation.gff
Select the feature to count
Join multiple tab-delimited files
EdgeR with Fisher’s Exact Test
Sample of Hy5_1 BAM file contents
• SRR070572.8333644 0 1 3727 50 41M * 0 0 CGGAAACCATTGAAATCGGACGGTTTAGTGAAAATGGAGGA BBCBCCBB@BCBABACBB?@@@B=BB@B7B@?A?BB@A@B@ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:41 YT:Z:UU XS:A:+ NH:i:1
• SRR070572.869731 16 1 7953 50 35M248N6M * 0 0 AGAAGATAGGCAAAGACAAACTTCCACAACAGATGCTGAAT ########################>7BB>BBBBB=BBBBCA AS:i:-2 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:11G5C23 YT:Z:UU XS:A:- NH:i:1
• SRR070572.8944108 0 1 8247 50 41M * 0 0 CAGTTGCTGGATTAATTGCATTGTAGAGGACGTGTCTATAT BCCBCCBCCBCCCCBCCBCCCC@>C@BBBBBB8A@@BCBBA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:41 YT:Z:UU XS:A:- NH:i:1
• SRR070572.8381546 0 1 8301 50 25M91N16M * 0 0 GAAGGATTAAATCGATGAAAATAATCATGCGTTCACACTCG BAABBBCCBCBCABBCBB@ABBB@B>BAA>>>@AA@A@BAB AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:41 YT:Z:UU XS:A:- NH:i:1
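Two of these reads are spliced: their CIGAR strings contain an N operation, which marks a skipped stretch of the reference (an intron). The Python sketch below turns a POS and CIGAR pair into the reference intervals the read actually covers. The helper name `aligned_blocks` is hypothetical, and treating a deletion (D) as part of the surrounding block is a simplifying choice of this sketch.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHPX=])")

def aligned_blocks(pos, cigar):
    """Return 1-based inclusive (start, end) reference blocks covered by a read."""
    blocks = []
    ref = pos            # next reference position to be consumed
    block_start = None   # start of the block currently being extended
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=XD":
            # Match, mismatch, or deletion: consumes reference inside a block
            if block_start is None:
                block_start = ref
            ref += length
        elif op == "N":
            # Skipped region: close the current block and jump over the gap
            if block_start is not None:
                blocks.append((block_start, ref - 1))
                block_start = None
            ref += length
        # I, S, H and P do not consume reference positions
    if block_start is not None:
        blocks.append((block_start, ref - 1))
    return blocks

for pos, cigar in [(3727, "41M"), (7953, "35M248N6M"), (8301, "25M91N16M")]:
    print(pos, cigar, aligned_blocks(pos, cigar))
```

For the last read, 25 bases align starting at 8301, the next 91 reference bases are skipped, and the remaining 16 bases align starting at 8417.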
Sample of annotation.gff
• Mt protein_coding exon 273 734 . - . gene_id "ATMG00010"; transcript_id "ATMG00010.1"; exon_number "1"; gene_name "ORF153A"; transcript_name "ATMG00010.1"; seqedit "false";
• Mt protein_coding CDS 276 734 . - 0 gene_id "ATMG00010"; transcript_id "ATMG00010.1"; exon_number "1"; gene_name "ORF153A"; transcript_name "ATMG00010.1"; protein_id "ATMG00010.1";
• Mt protein_coding start_codon 732 734 . - 0 gene_id "ATMG00010"; transcript_id "ATMG00010.1"; exon_number "1"; gene_name "ORF153A"; transcript_name "ATMG00010.1";
• Mt protein_coding stop_codon 273 275 . - 0 gene_id "ATMG00010"; transcript_id "ATMG00010.1"; exon_number "1"; gene_name "ORF153A"; transcript_name "ATMG00010.1";
So how does it work?
• BAM files map reads to their chromosomal
locations
• GFF annotation files map gene features (exons,
transcripts, CDS, etc.) to their chromosomal
locations
• Bam to Counts (general) finds the overlap
between the reads and the gene features
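The overlap step can be sketched in Python as an interval comparison. This is a minimal illustration under stated assumptions, not the Bam to Counts implementation. The read intervals are derived from the POS and CIGAR values of the four sample reads; the exon coordinates and IDs are hypothetical; strand is ignored; and each read counts at most once per feature.

```python
from collections import Counter

def overlaps(a_start, a_end, b_start, b_end):
    """True if two 1-based inclusive intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end

def count_hits(reads, features):
    """Count reads per feature ID; a read is a list of (chrom, start, end) blocks."""
    counts = Counter()
    for blocks in reads:
        hit_ids = set()  # a set, so a spliced read hits each feature only once
        for chrom, start, end in blocks:
            for f_chrom, f_start, f_end, f_id in features:
                if chrom == f_chrom and overlaps(start, end, f_start, f_end):
                    hit_ids.add(f_id)
        counts.update(hit_ids)
    return counts

# Reference blocks of the four sample reads (chromosome "1")
reads = [
    [("1", 3727, 3767)],
    [("1", 7953, 7987), ("1", 8236, 8241)],
    [("1", 8247, 8287)],
    [("1", 8301, 8325), ("1", 8417, 8432)],
]

# Hypothetical exon features: (chrom, start, end, feature ID)
features = [
    ("1", 3631, 3913, "exon_A"),
    ("1", 7942, 7987, "exon_B"),
    ("1", 8236, 8325, "exon_C"),
    ("1", 8417, 8464, "exon_D"),
]

counts = count_hits(reads, features)
for feature_id in sorted(counts):
    print(feature_id, counts[feature_id], sep="\t")
```

This nested loop compares every read block with every feature, which is fine for a toy example; tools working on whole genomes typically use sorted or indexed intervals instead.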