Hi All,
It is my understanding that mapping repetitive sequences is error-prone, and I am gearing up to do RNA-seq looking at heterochromatin transcription (mostly repetitive sequences: retrotransposons, pseudogenes, and transposons). I will have a few cell-repair genes that I will use as a control (they should be differentially expressed in my mutant vs. control). So, does TopHat2 have mapping parameters for heterochromatic regions? All I can think of is tweaking some basic arguments, e.g., setting -N/--read-mismatches to 0. Or is there a more bioinformatics-specific approach that I should be aware of?
I am also confused about the following. I think TopHat takes a 20-nt 'seed' from any given read and tries to map it to the genome. A seed from a repetitive element (for example, a 20-nt seed of ATATATAT) can map in multiple places, so how does TopHat know which of the alignments for such a read it should keep and which it should discard?
Lastly, is it worth paying extra for paired-end sequencing when looking at these repetitive elements? So far we are thinking of single-end sequencing of strand-specific libraries.
Any ideas/advice are welcome. Thanks
-Gonzalo
You are tackling a difficult problem.
You may want to consider using an alternative aligner such as STAR, BBMap, or HISAT2 (instead of TopHat) at this point in time. You would want to get reads as long as you can (I assume you are planning to do Illumina sequencing). PE reads may also help in this case by anchoring some fragments in unique flanking sequence. The cost would be higher, but it is not going to be 2x that of single-end reads.
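To make the point about read length and paired ends concrete, here is a minimal toy sketch in Python. It is not how TopHat or any other aligner works internally; the genome, the repeat sequence, and the 40-nt fragment limit are all made up for illustration. It only shows that a read lying entirely inside a repeat matches every copy, while a read that spans into a unique flank, or a mate that lands in one, pins down a single location.

```python
def find_all(genome, query):
    """Return every 0-based start position where query matches genome exactly."""
    positions = []
    start = genome.find(query)
    while start != -1:
        positions.append(start)
        start = genome.find(query, start + 1)  # step by 1 so overlapping hits count
    return positions


# Made-up toy genome: the same 20-nt "repeat" occurs twice, with unique flanks.
FLANK_A = "GGCTAGCCTA"
REPEAT = "ACGTTGCAAGCTTAGGCATC"
FLANK_B = "CGTTGACCAGTTGACGGATC"
FLANK_C = "GACTTCAGGT"
GENOME = FLANK_A + REPEAT + FLANK_B + REPEAT + FLANK_C

# A read that lies entirely inside the repeat matches both copies.
inside_read = REPEAT
inside_hits = find_all(GENOME, inside_read)

# A read that runs from the repeat into unique flanking sequence matches once.
spanning_read = REPEAT[-10:] + FLANK_B[:10]
spanning_hits = find_all(GENOME, spanning_read)

# Paired-end idea: keep only the repeat hits that sit close enough to a
# uniquely placed mate. MAX_FRAGMENT is an arbitrary value for this toy case.
MAX_FRAGMENT = 40
mate_hits = find_all(GENOME, FLANK_C)
anchored_hits = [
    pos for pos in inside_hits
    if len(mate_hits) == 1 and abs(mate_hits[0] - pos) <= MAX_FRAGMENT
]

print("inside repeat:", inside_hits)
print("spanning flank:", spanning_hits)
print("anchored by mate:", anchored_hits)
```

Real aligners tolerate mismatches and gaps and score candidate placements, so treat this purely as an intuition aid for why longer and paired reads reduce ambiguity.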
Thanks for your reply and the suggestions.
The longest reads (250 bp) can be obtained from a HiSeq 2500 in rapid mode (I think). Due to my non-bioinformatics background, it would be easier for me to use TopHat, as I have used it in the past. If I go ahead with it, what are the critical arguments that can help me map repetitive sequences accurately?
I think: -N/--read-mismatches to 0
--no-mixed (for paired reads, report alignments only if both mates can be mapped) and --no-discordant (for paired reads, report only concordant mappings)
Anything else?
If, however, this is not good enough and STAR is the program that I should use, I can outsource the analysis of the dataset. At this point I am trying to see whether I could get a good analysis out of the TopHat/Cufflinks pipeline, taking into account the challenges of repetitive sequences.
Thanks again
You can start with TopHat and see what you get. If that works well enough then you can move on to your other experiments.
If that does not work adequately for some reason, then use the suggestion from @Brian in the thread "Best aligner for reporting all exact matches of multi-mapping short reads?" to restrict your alignments to perfect matches while using BBMap. If you have run TopHat on your own in the past then I am confident that you would be able to use BBMap. Follow that up with featureCounts to do the counting.
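As a starting point, a strict TopHat2 run could be assembled along these lines. This is only a sketch under assumptions: the index prefix, FASTQ names, and output directory are hypothetical, the library type depends on your kit (fr-firststrand is just a placeholder), and setting --read-edit-dist and --read-gap-length to 0 alongside -N 0 is my own guess at what "perfect matches only" should mean here. The script only builds and prints the command; it does not run TopHat.

```python
import shlex


def build_tophat_command(index_prefix, reads_1, reads_2=None,
                         out_dir="tophat_out",
                         library_type="fr-firststrand",
                         max_multihits=20):
    """Build a strict TopHat2 command line as a list of arguments."""
    cmd = [
        "tophat2",
        "-o", out_dir,
        "--library-type", library_type,   # assumption: check your library prep
        "-N", "0",                        # --read-mismatches: no mismatches
        "--read-gap-length", "0",         # no gapped bases in the read
        "--read-edit-dist", "0",          # must be >= -N, so 0 is allowed here
        "-g", str(max_multihits),         # --max-multihits: hits kept per read
    ]
    if reads_2 is not None:
        # Paired-end only: require both mates to map, and concordantly.
        cmd += ["--no-mixed", "--no-discordant"]
    cmd += [index_prefix, reads_1]
    if reads_2 is not None:
        cmd.append(reads_2)
    return cmd


# Hypothetical file names, for illustration only.
paired_cmd = build_tophat_command("genome_index", "mutant_R1.fastq", "mutant_R2.fastq")
single_cmd = build_tophat_command("genome_index", "mutant_R1.fastq")

print(shlex.join(paired_cmd))
print(shlex.join(single_cmd))
```

Keeping -g well above 1 means multi-mapped reads are still reported (up to that many placements), so you can decide later how to treat them instead of losing them at the alignment step.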
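Whichever aligner you end up with, it helps to check how many reads are perfect matches and how many of those are multi-mapped before counting. Below is a rough Python sketch that tallies this from SAM records. It is not BBMap's own perfect-match mode, just a post hoc check, and it assumes the aligner writes the optional NM (edit distance) and NH (number of reported hits) tags; a record without NH is treated as a single hit. The example records are made up.

```python
import re

# "Perfect" here means: edit distance 0 and a CIGAR made only of
# M (match) and N (skipped intron) operations, so no indels or clipping.
PERFECT_CIGAR = re.compile(r"^(\d+[MN])+$")


def summarize_sam(lines):
    """Tally primary SAM records by perfect/imperfect and unique/multi."""
    counts = {"perfect_unique": 0, "perfect_multi": 0,
              "imperfect": 0, "unmapped": 0}
    for line in lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag, cigar = int(fields[1]), fields[5]
        if flag & 0x4:
            counts["unmapped"] += 1
            continue
        if flag & 0x100:
            continue  # secondary alignment: count each read once, via its primary
        # Optional tags look like "NM:i:0"; keep tag name -> value as text.
        tags = {}
        for item in fields[11:]:
            name, _type, value = item.split(":", 2)
            tags[name] = value
        perfect = tags.get("NM") == "0" and PERFECT_CIGAR.match(cigar) is not None
        n_hits = int(tags.get("NH", 1))  # assumption: missing NH means one hit
        if not perfect:
            counts["imperfect"] += 1
        elif n_hits == 1:
            counts["perfect_unique"] += 1
        else:
            counts["perfect_multi"] += 1
    return counts


# Made-up records for illustration only.
SEQ = "ACGTTGCAAGCTTAGGCATC"
EXAMPLE_SAM = [
    "@HD\tVN:1.0",
    f"r1\t0\tchr1\t11\t50\t20M\t*\t0\t0\t{SEQ}\t*\tNM:i:0\tNH:i:1",
    f"r2\t0\tchr1\t11\t1\t20M\t*\t0\t0\t{SEQ}\t*\tNM:i:0\tNH:i:2",
    f"r2\t256\tchr1\t51\t1\t20M\t*\t0\t0\t{SEQ}\t*\tNM:i:0\tNH:i:2",
    f"r3\t0\tchr1\t11\t50\t20M\t*\t0\t0\t{SEQ}\t*\tNM:i:1\tNH:i:1",
    f"r4\t4\t*\t0\t0\t*\t*\t0\t0\t{SEQ}\t*",
]

summary = summarize_sam(EXAMPLE_SAM)
print(summary)
```

If most of the reads over your repeats fall into the multi-mapped bin, that is the signal to think about how the counting step should handle them, rather than a reason to discard them outright.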