Hi all, I am analyzing an RNA-seq data set, and the alignment results I have been getting are really baffling me. I have tried an exhaustive list of conditions/parameters, but none seems to improve my alignment rate significantly. Here are some details of my sample data:
- Data was obtained from total RNA from tumor samples using the Nugen Ovation Single Cell RNA-seq kit. We received ~80 million x 2 paired-end reads of 100 bp.
- I obtained a ballpark alignment rate of about 40-50% with TopHat2 using different parameters.
FastQC suggests high duplication rates. The quality seems OK (no red flags except a drop in quality to ~20 at the 3' end of reads). I have used TopHat2 for all my alignments using default settings. I have tried the following conditions:
- Trimming 8 bp from the forward reads (as suggested by the Nugen library prep kit).
- Trimming low-quality bases (quality < 20) at the ends.
- Using different trimming/clipping tools such as fastx, fastq-mcf from ea-utils, and trim-galore, discarding reads shorter than 20 bases after adapter/quality trimming.
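For anyone who wants to see the trimming conditions above spelled out, here is a minimal Python sketch. It is only an illustration, not what fastx, fastq-mcf, or trim-galore actually do internally (those tools use their own trimming algorithms). The function names `trim_read` and `trim_pair` are made up for this example, and Phred+33 quality encoding is assumed.

```python
def trim_read(seq, qual, head_clip=0, min_q=20, min_len=20, phred_offset=33):
    """Trim one read; return (seq, qual), or None if it ends up too short.

    Hypothetical helper for illustration only. Assumes Phred+33 qualities.
    """
    # Clip a fixed number of bases from the 5' end (e.g. 8 for forward reads).
    seq, qual = seq[head_clip:], qual[head_clip:]

    # Walk back from the 3' end while the base quality is below min_q.
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - phred_offset < min_q:
        end -= 1
    seq, qual = seq[:end], qual[:end]

    # Discard reads shorter than min_len after trimming.
    if len(seq) < min_len:
        return None
    return seq, qual


def trim_pair(fwd, rev, fwd_head_clip=8, **kwargs):
    """Trim a read pair; drop the whole pair if either mate is too short."""
    r1 = trim_read(fwd[0], fwd[1], head_clip=fwd_head_clip, **kwargs)
    r2 = trim_read(rev[0], rev[1], head_clip=0, **kwargs)
    if r1 is None or r2 is None:
        return None
    return r1, r2


if __name__ == "__main__":
    # 40 bp forward read whose last 10 bases have quality 2 ('#').
    fwd = ("A" * 40, "I" * 30 + "#" * 10)
    rev = ("C" * 40, "I" * 40)
    print(trim_pair(fwd, rev))
```

Dropping the whole pair when one mate fails keeps the two FASTQ files in sync, which paired-end aligners generally expect.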
I have also tried different --library-type settings for TopHat and changed the -r (--mate-inner-dist) option to reflect my fragment size. I suspect that my RNA prep has a significant rRNA fraction, and that removing the reads mapping to rRNA could improve the alignment rate.
I would appreciate any suggestions for improving my alignment. TopHat is my preferred aligner, as I have been using it for years on other datasets and it performs fairly robustly. However, I would be open to switching to another aligner if needed.
Thanks a lot for your help.
Make a short set of rRNA sequences in FASTA format, and filter your data with a program that lets you discard the reads matching them. One possibility is BBSplit. There is more information in this thread
By the way, this is a direct alternative to the approach discussed by Michael Ante. BBSplit uses BBMap to map the reads to the reference.
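As a rough sketch of the BBSplit suggestion, the snippet below only assembles a command line for paired-end input and prints it. The file names are placeholders, and `bbsplit_command` is a hypothetical helper written for this post. The `in1`/`in2`/`ref`/`outu1`/`outu2` parameters are the ones I would expect to use, but please check them against the help text of your BBMap version before running anything.

```python
def bbsplit_command(reads1, reads2, rrna_fasta, clean1, clean2):
    """Build a BBSplit command that keeps only reads NOT mapping to the rRNA set.

    Hypothetical helper: it only assembles arguments and does not run BBSplit.
    """
    return [
        "bbsplit.sh",
        f"in1={reads1}",      # forward reads
        f"in2={reads2}",      # reverse reads
        f"ref={rrna_fasta}",  # FASTA file with the rRNA sequences
        f"outu1={clean1}",    # forward reads that did not map to the reference
        f"outu2={clean2}",    # reverse reads that did not map to the reference
    ]


if __name__ == "__main__":
    # Placeholder file names; replace with your own paths.
    cmd = bbsplit_command(
        "sample_R1.fastq", "sample_R2.fastq", "rRNA.fa",
        "sample_clean_R1.fastq", "sample_clean_R2.fastq",
    )
    print(" ".join(cmd))
    # To actually run it (assuming bbsplit.sh is on your PATH):
    #   import subprocess; subprocess.run(cmd, check=True)
```

The "clean" output files would then go into the aligner in place of the original FASTQ files, and the drop in read count gives an estimate of the rRNA fraction.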