Hi,
I have been working on a Riboseq analysis and I have a problem: I did the first step as per my protocol - I aligned the reads to rRNA using bowtie2 to remove them from further analysis. There was a lot of rRNA in the libraries (~80%).
Then I took the unaligned reads, turned them back to R1 and R2 fastq files using /samtools fastq and re-aligned them to the genome with Tophat2 I added the parameters:
--no-mixed -g 1 --prefilter-multihits
BUT I seem to get very few reads. If I remove the --no-mixed option I get 3 times more reads, but they don't seem to be paired? I mean, if I look for reads where both pairs are mapped there are very few of them...
I did samtools sort -n to checked this and I get unpaired reads:
NB:18:H3:3:22612:15664:16256,
NB:18:H3:3:22612:15664:16256
NB:18:H3:3:22612:15664:16244 <=
NB:18:H3:3:22612:15664:16211
My next step is to remove PCR duplicates and for that I have a script that assumes we have both reads always (and why wouldn't be? I don't understand where one goes in this analysis...). I checked the unmapped,bam file but the missing reads are not there,,,
What am I missing?
Any help will be greatly appreciated
Also, for your information, you should know that the old 'Tuxedo' pipeline of Tophat and Cufflinks is no longer the "advisable" tool for RNA-seq analysis. The software is deprecated/ in low maintenance and should be replaced by HISAT2, StringTie and ballgown. See this paper: Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. There are also other alternatives, including alignment with STAR and bbmap, or pseudo-alignment using salmon.
Though I generally agree, most pipelines for Riboseq and papers I've encountered reference Tophat for riboseq analysis, specifically: https://pubmed.ncbi.nlm.nih.gov/28579404/
How did you do that ? Did you extract reads that do not map to the rDNA or the reads that do not map in proper pairs ? I suspect you took the first option, and recommend the second one in order to keep read pairs intact.
To get the reads I did:
Then
These were then re-aligned using TopHat and yes, this caused problems...
I didn't quite understand what you mean by - "reads that do not map in proper pairs" - does it not limit me to a subset? Or should I be running bowtie2 ith --un to have unaligned reads into a different file automatically?
I would appreciate if you specify what samtools/bowtie parameters you were referencing :) Thank you.
reads that do not map in proper pairs would be: (proper pair = both ends mapped on the same chromosome in a reasonable distance, defined by the parameters
--mate-inner-div
and--mate-std-dev
in tophat)to make sure that no secondary/supplementary alignment are included, I would use this instead:
But I think your flags look ok too. then, BEFORE running the samtools fastq commands, it is important to group reads by read name. Here I pipe the results of samtools collate directly into samtools fastq.
please let me know if it solves your issue.
It did solve the read names, but it seems to get very few usable reads... I think perhaps the flag is too restrictive
Left reads: Input : 1493587 Mapped : 224369 (15.0% of input) of these: 7305 ( 3.3%) have multiple alignments (1528579 have >1) Right reads: Input : 1822965 Mapped : 248684 (13.6% of input) of these: 7305 ( 2.9%) have multiple alignments (1199415 have >1) 14.3% overall read mapping rate.
Aligned pairs: 44817 of these: 7305 (16.3%) have multiple alignments 31670 (70.7%) are discordant alignments 0.9% concordant pair alignment rate.
You have an unusual high number of discordant alignments. This is weird and perhaps indicates that something got wrong in the process. I suggest that you look at a few discordantly aligned pairs (where mates aligned to different chromosomes) to try to understand what is going on. Where is the first read mapped and where is the second read mapped ? Also, what is your reference genome/transcriptome ? Is there a lot of redundancy in it ? Also, did you properly specify the
--library-type
parameter in tophat2 ?