I need to check which of the predicted de novo transcripts have a polyA tail at the 3' end. Are polyA tails filtered out at some step of TopHat/Cufflinks processing, or is this information retained? If it is retained, where do I find it? Thank you.
It's unclear what TopHat/Cufflinks has to do with the question as you are talking about de novo assembled transcripts. I would just start by using the grep command to get a feeling for whether there are any poly-A stretches. It is easy enough to write a script (Perl/Python/whatever) to look for them as well.
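A minimal sketch of such a script in Python, assuming the transcripts are in a plain FASTA file. The function names and the 10-base threshold are illustrative assumptions, not part of any tool mentioned in this thread. Because assembled transcripts may be reported in either orientation, it checks for a run of A's at the end of the sequence and for a run of T's at the start.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def trailing_a_run(seq):
    """Number of consecutive A's at the end of the sequence."""
    s = seq.upper()
    return len(s) - len(s.rstrip("A"))


def leading_t_run(seq):
    """Number of consecutive T's at the start of the sequence."""
    s = seq.upper()
    return len(s) - len(s.lstrip("T"))


def find_tailed(lines, min_tail=10):
    """Return (name, kind, run_length) for transcripts with a terminal run.

    min_tail is an arbitrary threshold chosen for illustration only.
    """
    hits = []
    for name, seq in read_fasta(lines):
        a_run = trailing_a_run(seq)
        t_run = leading_t_run(seq)
        if a_run >= min_tail:
            hits.append((name, "polyA_3prime", a_run))
        elif t_run >= min_tail:
            # Likely a polyA tail seen on the reverse-complemented strand.
            hits.append((name, "polyT_5prime", t_run))
    return hits
```

To run it on a file, open the FASTA and pass the handle, for example `with open("transcripts.fa") as fh: print(find_tailed(fh))`, where `transcripts.fa` is a placeholder file name.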
Often there are very few reads containing poly-A in HiSeq RNA-seq data, so you might not find much. For some reason, more poly-A tails seem to be left in MiSeq-produced RNA-seq data.
Most reads from bridge amplification wouldn't contain polyA stretches, nor would such reads ever make it past alignment. I think you are confusing de novo transcriptome assembly with ab initio isoform discovery.
Yes, I did see a lot of polyA stretches using grep, and TransDecoder also predicted ORFs within some of the predicted de novo transcripts.
Yes, software like Trinity that does ab initio transcript assembly is known to retain polyA tails. But Cufflinks also predicts new transcripts with de novo exons in intergenic regions, those with the 'u' class code. So I was trying to understand what happens when a polyA tail is encountered. If it's just discarded as unmappable, would all the reads covering the 3' ends of novel and known transcripts with polyA tails end up in the discarded bin? If my assumptions are correct, what would be the best way to use these discarded reads to determine whether the novel transcripts predicted by Cufflinks have polyA tails?
I considered an ab initio assembler like Trinity, but this would create issues in how to combine both approaches into a single paper, as I am sure there would be quite a lot of differences between their outputs, including the properties and number of predicted novel transcripts. I would appreciate advice. Thank you.
TopHat uses Bowtie to align as many reads as it can to the reference genome in an unspliced manner. It then rummages through any reads that didn't align, checks whether they span two contigs (exons) formed by the Bowtie alignment, and adds these to the SAM file in a spliced format. For reads with a bunch of foreign A's at the end that add no spanning information, it's unclear whether TopHat would really rescue them from the bin.
In this paper the authors manually identified and rescued polyA reads: http://www.biomedcentral.com/1471-2164/11/711
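As a rough illustration of what rescuing such reads could look like (this is a sketch, not the method used in that paper), the Python below scans unaligned reads for a trailing run of A's or a leading run of T's, trims it, and emits the remainder so it can be re-aligned. It assumes the unaligned reads have already been exported to a 4-line-per-record FASTQ file. The function names and both thresholds are hypothetical choices.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (name, seq, qual) from 4-line FASTQ records."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return
        yield record[0][1:], record[1], record[3]


def trim_tail(seq, qual, min_run=8, min_remaining=20):
    """Trim a terminal polyA (3' end) or polyT (5' end) run.

    Returns (seq, qual, kind, run_length), or None when there is no run of
    at least min_run bases or too little sequence is left to align.
    Both thresholds are illustrative assumptions.
    """
    s = seq.upper()
    a_run = len(s) - len(s.rstrip("A"))
    t_run = len(s) - len(s.lstrip("T"))
    if a_run >= min_run and a_run >= t_run:
        keep, kind, run = slice(0, len(s) - a_run), "A", a_run
    elif t_run >= min_run:
        keep, kind, run = slice(t_run, len(s)), "T", t_run
    else:
        return None
    if keep.stop - keep.start < min_remaining:
        return None
    return seq[keep], qual[keep], kind, run


def rescue_polya_reads(lines, min_run=8, min_remaining=20):
    """Yield trimmed FASTQ records for reads carrying a terminal run."""
    for name, seq, qual in read_fastq(lines):
        trimmed = trim_tail(seq, qual, min_run, min_remaining)
        if trimmed is not None:
            tseq, tqual, kind, run = trimmed
            # Record the trimmed run in the header for later inspection.
            yield f"@{name} poly{kind}={run}\n{tseq}\n+\n{tqual}\n"
```

The trimmed records could then be aligned again, and reads that land at the 3' end of a Cufflinks transcript would count as evidence for a polyA tail there. Whether that evidence is convincing depends on checking that the genome itself does not encode an A-rich stretch at that position.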
Another alignment tool I would suggest you look at is STAR. STAR replaces Bowtie/TopHat with a fast, sensitive spliced aligner. It explicitly mentions polyA tails:
I opened a separate post on how to use STAR/Cufflinks for identifying polyA tails of predicted novel transcripts: Identifying polyA tail sequences for predicted novel transcripts using STAR/Cufflinks
thank you
Thank you for the feedback. Your point and the point made by Jeremy shortly after are complementary, so I added a comment addressing both under the latter post. Thank you.