Hey Everyone,
I'm trying to figure out the appropriate TopHat settings for strand-specific, paired-end RNA-seq data. I've read several posts about this but am still uncertain about the right settings. I'm hoping someone can confirm that my understanding of the sense/antisense strands and the forward/reverse reads is correct. As I understand it:
1) mRNA is an exact match to the DNA coding sequence (aside from U and no introns) and matches the sense strand
2) in library prep (TruSeq stranded in my case) the first strand of the cDNA library, which is antisense to the original gene, is used for sequencing, while the second strand, which is dUTP-marked, gets degraded
3) for paired-end sequencing, after bridge PCR and sequencing, the sense strand becomes read 1 (forward read) and the antisense strand becomes read 2 (reverse read).
So, when running TopHat, read 1 (/1) is used for the forward read and read 2 (/2) for the reverse read, with the library type set to fr-firststrand?
My ultimate goal is to use the BAM files from TopHat to get raw counts in HTSeq, where I understand the appropriate strandedness setting is "reverse". Btw, I'm doing all this in Galaxy, as I'm not very proficient at coding. Thanks in advance for any reassurance.
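For anyone who finds the strand conventions easier to follow as code, here is a minimal Python sketch of how a read's alignment relates to the transcript strand in an fr-firststrand (dUTP-style) library. It assumes the usual convention that read 1 aligns to the strand opposite the transcript and read 2 aligns to the same strand as the transcript. The function name `transcript_strand` is made up for illustration and is not part of TopHat or HTSeq.

```python
def transcript_strand(is_read1: bool, aligned_reverse: bool) -> str:
    """Infer the transcript strand ('+' or '-') for one aligned read.

    Assumes an fr-firststrand (dUTP-style) library:
      - read 1 aligns to the strand opposite the transcript
      - read 2 aligns to the same strand as the transcript
    """
    read_on_plus = not aligned_reverse
    if is_read1:
        # Read 1 is antisense, so flip the strand it aligned to.
        return "-" if read_on_plus else "+"
    # Read 2 is sense, so keep the strand it aligned to.
    return "+" if read_on_plus else "-"


# Illustrative pair from a gene on the '+' strand:
# read 1 aligns in reverse, read 2 aligns forward.
pair = [transcript_strand(True, True), transcript_strand(False, False)]
print(pair)
```

Both mates of a properly oriented pair should point to the same transcript strand. That flip for read 1 is why a counting tool needs a "reverse" strandedness setting for this kind of library.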
Hi Istvan,
Thanks for the input. Regarding my use of forward and reverse, I am a little confused by their meaning. For my general understanding of the process, I tried to follow the sense and antisense strands through library prep and sequencing. I end up with the first sequenced read as matching the sense strand, but I think I might be misunderstanding something about the bridge PCR and sequencing, particularly because you say the sense transcripts should be present in the read 2 files. Either I'm missing something in the sequencing process or I don't understand how read 1 and read 2 are defined. Could you shed some light on this for me, please? Thanks in advance.
My reasoning:
first strand of cDNA is antisense (which is what remains after dUTP degradation) --> adapters are added to this strand and during bridge PCR the complement is created (sense strand) --> after further clustering all sense strands are washed away, leaving only antisense strands --> the sequencing process uses these antisense strands as a template to produce reads that are sense (and presumably these are the first reads in the fasta file (read 1; /1))
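One way to check this kind of reasoning is to follow a toy sequence through the steps. The sketch below is only an illustration under stated assumptions: the 10-base sequence is invented, and it takes as given the point from the reply that read 2 carries the sense sequence, which would make read 1 the antisense one. It does not model adapters, bridge PCR, or cluster chemistry.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]


# Toy sense sequence: the mRNA written with T instead of U (invented example).
mrna_as_dna = "ATGGCCTTAG"

# First-strand cDNA is the reverse complement of the mRNA (antisense).
# In a dUTP protocol this is the strand that survives.
first_strand_cdna = reverse_complement(mrna_as_dna)

# Assumption: read 1 has the same sequence as the first strand (antisense),
# and read 2 is read from the opposite end of the fragment (sense).
read_length = 5
read1 = first_strand_cdna[:read_length]
read2 = reverse_complement(first_strand_cdna)[:read_length]

print(first_strand_cdna, read1, read2)
```

Under these assumptions read 2 is a literal substring of the mRNA sequence, while read 1 only matches it after reverse complementing.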
I think this paper has a good explanation (though I can't check since I can't access it from here)
http://www.nature.com/nmeth/journal/v7/n9/abs/nmeth.1491.html
Hi cadeans,
Did you figure out why the reads in file 2 correspond to the sense direction of the transcript? I have looked through the article but still don't get it ...
Here is my reasoning: