Could I please know if we can align reads to a reference transcriptome instead of a reference genome and assemble a transcriptome using TopHat/Cufflinks? Are there any potential advantages or disadvantages in doing so? Any ideas on using BWA (a non-spliced aligner) for this task?
Please bear with me if I have not put this properly. Thanks in advance for your suggestions.
-G/--GTF <GTF/GFF3 file>
Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file. If this option is provided, TopHat will first extract the transcript sequences and use Bowtie to align reads to this virtual transcriptome first. Only the reads that do not fully map to the transcriptome will then be mapped on the genome. The reads that did map on the transcriptome will be converted to genomic mappings (spliced as needed) and merged with the novel mappings and junctions in the final TopHat output.
-T/--transcriptome-only
Only align the reads to the transcriptome and report only those mappings as genomic mappings.
So you can choose whether to align to the genome only, to the transcriptome + genome, or to the transcriptome only.
There are some other options in TopHat connected to transcriptome mapping; I recommend checking them too.
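As a rough sketch of the three choices above, the snippet below only prints the corresponding TopHat command lines (a dry run), so it can be read and adapted without TopHat installed. The index prefix, GTF name, read file, and output directories are placeholders I made up, not values from this thread; only `-G`, `-T`, and `-o` are real TopHat options.

```shell
#!/bin/sh
# Placeholder inputs (assumptions, replace with your own files).
GENOME_INDEX="genome_index"   # Bowtie index prefix of the reference genome
GTF="genes.gtf"               # gene model annotation
READS="reads.fastq"           # single-end reads

# Print the TopHat command line for one of the three alignment modes.
tophat_cmd() {
    mode="$1"
    case "$mode" in
        # Genome only: no annotation supplied.
        genome)        echo "tophat -o out_genome $GENOME_INDEX $READS" ;;
        # Transcriptome first, then genome for reads that did not fully map.
        both)          echo "tophat -G $GTF -o out_both $GENOME_INDEX $READS" ;;
        # Transcriptome only: -T needs the annotation given with -G.
        transcriptome) echo "tophat -G $GTF -T -o out_transcriptome $GENOME_INDEX $READS" ;;
        *)             echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

# Dry run: show all three commands.
for m in genome both transcriptome; do
    tophat_cmd "$m"
done
```

To actually run one of them, something like `tophat_cmd transcriptome | sh` should work once TopHat and a Bowtie index are available. Note that even with `-T` the genome index is still needed, because the transcriptome hits are reported in genomic coordinates.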
I learned some important things from this post. I have a few questions that I want to discuss here.
I am working on RNA-seq analysis using TopHat, with the default mapping parameters plus a GTF file supplied via -G. As no reference is available for my species (Gossypium hirsutum), I picked a closely related species, Gossypium arboreum. The percentage of multi-mapped reads is rather high and there are few uniquely mapped reads. As this cotton is a polyploid species, I can't discard the multi-mapped reads. I ended up with poor results. My working command is as follows
Now I am working on another strategy, where I want to map to gene models rather than against the whole reference genome. Will providing -T (transcriptome only) map against the gene models only, or does it do something else? For transcriptome mapping, the command should be ...
As this cotton is a polyploid species, I can't discard the multi-mapped reads.
Even though it is polyploid, only one copy of each chromosome will be in the FASTA file. Hence, multi-mapped reads are not related to the ploidy of the genome.
And what do you mean by mapping to gene models? Do you want to map to the transcriptome of a closely related species?
-T/--transcriptome-only Only align the reads to the transcriptome and report only those mappings as genomic mappings.
Yes, you are right, there will be one copy of each chromosome in the FASTA file. But the reason for not filtering out the reads that multi-map to the genome is the extremely high number of repeats in it.
The TopHat manual says that the GTF file provided is used for the --transcriptome-index. Does "transcriptome" here mean the genes provided in the GTF file? Am I right, or is it something else?
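To put a number on the multi-mapped fraction discussed above, one possible sketch is to count read names by the `NH` tag (number of reported alignments), which TopHat writes into its SAM/BAM output. The function name and the small demo records are made up for illustration; paired-end mates share a read name, so this counts fragments rather than individual reads.

```shell
#!/bin/sh
# Count read names with one reported alignment (NH:i:1) versus several (NH > 1).
# Reads SAM text on stdin; header lines and unmapped records are skipped.
count_multimapped() {
    awk -F'\t' '
        /^@/ { next }                      # SAM header
        int($2 / 4) % 2 == 1 { next }      # flag bit 0x4: read unmapped
        {
            nh = 1                         # assume unique if NH is absent
            for (i = 12; i <= NF; i++)
                if ($i ~ /^NH:i:/) nh = substr($i, 6) + 0
            if (nh > 1) multi[$1] = 1; else uniq[$1] = 1
        }
        END {
            u = 0; m = 0
            for (r in uniq) u++
            for (r in multi) m++
            printf "unique=%d multi=%d\n", u, m
        }'
}

# Tiny made-up demo: r1 is unique, r2 has two records with NH:i:3, r3 is unmapped.
printf 'r1\t0\tchr1\t100\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1\n'  >  demo.sam
printf 'r2\t0\tchr1\t200\t1\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:3\n'   >> demo.sam
printf 'r2\t256\tchr2\t300\t1\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:3\n' >> demo.sam
printf 'r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n'                 >> demo.sam
count_multimapped < demo.sam   # prints: unique=1 multi=1
rm -f demo.sam
```

On real data the input would come from something like `samtools view accepted_hits.bam | count_multimapped`, assuming samtools is installed and the TopHat output file has its default name.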
For TopHat, check the manual page: http://tophat.cbcb.umd.edu/manual.shtml
-G/--GTF <GTF/GFF3 file>
Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file. If this option is provided, TopHat will first extract the transcript sequences and use Bowtie to align reads to this virtual transcriptome first. Only the reads that do not fully map to the transcriptome will then be mapped on the genome. The reads that did map on the transcriptome will be converted to genomic mappings (spliced as needed) and merged with the novel mappings and junctions in the final tophat output.
-T/--transcriptome-only
Only align the reads to the transcriptome and report only those mappings as genomic mappings.
So you can choose whether to align to the genome only, to the transcriptome + genome, or to the transcriptome only.
There are some other options in TopHat connected to transcriptome mapping; I recommend checking them too.
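One of those other options is `--transcriptome-index`, which, as far as I understand the manual, lets TopHat build the transcriptome (the transcript sequences extracted from the GTF) once and reuse it across runs instead of rebuilding it for every sample. The sketch below only prints the commands; the file names, sample names, and index path are placeholders, and building the index without reads is, I believe, only supported in recent TopHat 2 releases.

```shell
#!/bin/sh
# Placeholder names (assumptions, not taken from this thread).
GENOME_INDEX="genome_index"           # Bowtie index prefix of the genome
GTF="genes.gtf"                       # annotation used to extract transcripts
TX_INDEX="transcriptome_data/known"   # where the transcriptome index is stored

# Step 1: build the transcriptome index once; no reads are given here.
build_index_cmd() {
    echo "tophat -G $GTF --transcriptome-index=$TX_INDEX $GENOME_INDEX"
}

# Step 2: reuse the prebuilt index for a sample; -G is not repeated.
align_sample_cmd() {
    sample="$1"
    echo "tophat -o out_$sample --transcriptome-index=$TX_INDEX $GENOME_INDEX $sample.fastq"
}

# Dry run: print the build command, then one alignment command per sample.
build_index_cmd
for s in sample1 sample2; do
    align_sample_cmd "$s"
done
```

Adding `-T` to the step 2 command would restrict the alignment to that prebuilt transcriptome only.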
you might wanna post this as an answer...
Thank you, jockbanan. I will consider it.