Low percentage of mapped reads in TopHat
0
0
Entering edit mode
7.3 years ago
Lucila ▴ 20

Hi all,

I have a dataset of 14 million 50bp single-end reads (Illumina), and used Tophat to map them to a genome (indexed with Bowtie). I do not have the gff of the genome. When I ran it, only 50% of the reads were mapped. I have other libraries (a total of 200 million reads) of the same organism but I ran just one as a first approach.

Does anyone know what could be happening? I used the default parameters, any ideas of which parameters should I modify to obtain better results?

Thank you in advance for your help.

Best, Lucila.

Tophat Bowtie RNA-Seq • 2.7k views
ADD COMMENT
1
Entering edit mode

Ribosomal RNA comes to mind. Did you check that?

ADD REPLY
0
Entering edit mode

Thank you for your answer! No, I didn't check that. Which is the best way of doing that? Perhaps a BLAST using the ribosomal RNA sequence of a related species could work? Anyway, the reads that come from the Ribosomal RNA should also map to the genome, don't they?

ADD REPLY
1
Entering edit mode

Just take a selection of reads (~10 may be enough, don't take directly from top of fastq file) and blast at NCBI against NT. You should be able to see if reads are hitting rDNA.

ADD REPLY
0
Entering edit mode

I did that and I do have contamination with Ribosomal RNA (not only of my organism of interest, also microbial ribosomal RNA). Do you have any ideas of why could this happen? Perhaps a problem during the preparation of the libraries? I would like to take out all these reads before making an assembly of my transcriptome. Do you think this is possible?

Thank you so much for all your help. Best, Lucila

ADD REPLY
1
Entering edit mode

Is there a reference (draft may be ok) available for your genome of interest? Separating reads that belong to your genome may be a better option rather than trying to identify/remove the rRNA reads. Ref this thread: A: querying blast database for rna seq reads

ADD REPLY
0
Entering edit mode

Thank you genomax! I have done that. I don't know if I am using it correctly, first I indexed the genome:

$ bbsplit.sh ref=genome.fasta usemodulo (I had to use usemodulo because I don't have enough memory in my machine)

And then I've done this to find the reads that map to the genome:

$ bbsplit.sh in=reads.fastq ref=genome.fasta out_genome=mapped.fastq outu=unmapped.fastq

I didn't change any parameter. So "mapped.fastq" are the reads I am interested in. I obtained this result:

https://drive.google.com/open?id=0By4Qjv3-cXCDZ0VlREp6b3BFNms

I don't understand clearly what "ambiguous" mean, since I put just one genome as reference (and not a lot of references). Does it mean that there is 20% of the reads that the program is not sure if they map or not? And where the program puts those reads, in "mapped" or "unmapped"?

Thank you again for all your help.

Best, Lucila

ADD REPLY
1
Entering edit mode

For this to answer it would be helpful to know which organism you are in and which tissue/body part/cell type you used for the RNA prep.

The simplest explanation could be that due to the high level of conservation of rRNAs, maybe some of your sequences simply false-positively mapped to microbes, but this is pure guessing from my side.

ADD REPLY
0
Entering edit mode

TopHat is now in maintenance mode and should be avoided if possible (even TopHat developers recommend that you use their new program called HISAT2). There are many other splice-aware aligners (STAR, BBMap etc) that are well suited for the task of RNAseq data analysis.

That said, if you have a sample of the reads that are not aligning, your first stop should be NCBI Blast site to see if they are still aligning to the expected species i.e. rule out possibility of contamination.

ADD REPLY
0
Entering edit mode

Just curious, if rRNAs are not the case, did you check quality of your reads (and probably further trim any adapters, etc)?

ADD REPLY
1
Entering edit mode

It is indeed rRNA. And yes, I trimmed the adapters and check the quality of the reads before running TopHat.

ADD REPLY
0
Entering edit mode

Sorry, I didn't read carefully. At least good to know that it's rRNAs, it's perhaps due the efficiency of the library prep or maybe sample quality (as I assume you didn't do Ribo depletion, but rather Poly-A based prep).

ADD REPLY

Login before adding your answer.

Traffic: 2647 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6