Question

3' Rna Seq Pipeline

1

Entering edit mode

11.9 years ago

k.nirmalraman ★ 1.1k

The following is a very loud brainstorming for a beginner

A custom library prepration protocol, which uses 3' poly T tail primer and 5' random primer to enrich for 3' ends. and then eventually adds the illumina adaptersat the the end...and performs a Illumina HTSEq single end sequencing... The rationale is that the group is not interested in splice variants but merely in quantification of the expression for differential analysis. (organism: mouse)

I was wondering what would be the pipeline that one would use... in this case!

The reads are of length 95 and I trim the barcodes and adapters away and a fine stretch of 35 bases usually when mapped using bowtie yields 70% total reads aligned to transcriptome.

Bowtie Alignment: bowtie -v 3 --best -t ~/RNAseq/RefLibIndex/mouse -p 10 -f <inputfile.fasta> <output.map>

My output file contains entries in the following format with refseqID. outputfile entry format

After I extract the raw read counts thru a custome script from the bowtie output file and I map the refseqID to genesymbol and extract the average/max for each genesymbol as raw counts and perform DESeq differential analysis.

I would like to know:

Is this approach valid, do u see any loop holes here?
To visualise this data in a genome browser to check for 3' enrichment. However, I see that there is no chromosome or index information in the file. I tried to get some information Visualizing Bowtie Output In Genome Browser. However, i couldnt succeed at that... any suggestions will be great... something like the following for visualization.. This one uses IGV
Is there any reason why only 70% is mapped to transcriptome, if so, do u think it is normal for this approach, do you think I should also try to align the sequences to genome?

Thanks!

rnaseq pipeline deseq bowtie visualization igv • 6.0k views

ADD COMMENT • link updated 10.0 years ago by Biostar 20 • written 11.9 years ago by k.nirmalraman ★ 1.1k

score 2 · Answer 1 · 2013-05-30

2

Entering edit mode

11.9 years ago

bow ▴ 790

I would still try to map to the genome. Genome alignment (as opposed to transcriptome alignment) lets you map your reads to un-annotated transcripts. If using transcriptome alignment only, you're limited to the set of known transcripts. Also, how did you generate your reference transcriptome? How do you handle transcript isoforms? If you simply take whole sequences of the RefSeq sequences and use them as the reference, you will increase the mapping possibilities for each of your 'tag' (since different transcripts may share the most 3' exon). This may also explain why you're only getting 70% mapped reads.
It seems you have two choices here: rename your SAM/BAM reference indices for each transcript, replacing them with the chromosome name; or use your transcriptome as the genome browser's reference (instead of the usual full genome).
see 1.

ADD COMMENT • link 11.9 years ago by bow ▴ 790

0

Entering edit mode

Thanks!!

I have used the refseq sequences to build the transcriptome.... and as much as I understand refseq sequences are basically individual transcripts/isoforms (Correct me if I am wrong).. so this is taken care isnt it...? The reason why I compute average/max of each gene symbol is to derive a better representation of each gene from the data. I shall now try to map unmapped sequences to genome for sure.. thanks for that!
for two can u suggest a little more in detail how can I use the refseq sequences as reference? I am a bit lost here...

arent these just many fasta sequences....This would make make too many references sequences right?? (number of reference sequence = number of refseq Sequences)... unlike genome which is a single long sequence (basically number of sequence here = number of chromosomes, I would assume)...

ADD REPLY • link 11.9 years ago by k.nirmalraman ★ 1.1k

0

Entering edit mode

You can use tophat for alignment (http://tophat.cbcb.umd.edu/) Also take a look at Cufflinks for quantification (http://cufflinks.cbcb.umd.edu/)

ADD REPLY • link 11.9 years ago by Pablo ★ 1.9k

0

Entering edit mode

Hmm..yes, I think you did the right thing. Perhaps another thing to try is to check whether your reads contain adapter sequences (which may affect your mapping percentage since you're doing an end-to-end mapping and also to play around with the -v value.
In IGV, for example, I think you need to create a new .genome file (Genomes -> Create .genome file) out of your reference transcriptome. The same is possible in UCSC genome browser, although you need a local install and setup (which is considerably more complex). There will indeed be many reference sequences, but since your SAM records are indexed by these transcripts, the genome browser needs a reference build of these transcripts too.

ADD REPLY • link 11.9 years ago by bow ▴ 790