Question

Riboseq - pairs unaligned to rRNA get mixed up

0

Entering edit mode

3.8 years ago

blur ▴ 280

Hi,

I have been working on a Riboseq analysis and I have a problem: I did the first step as per my protocol - I aligned the reads to rRNA using bowtie2 to remove them from further analysis. There was a lot of rRNA in the libraries (~80%).

Then I took the unaligned reads, turned them back to R1 and R2 fastq files using /samtools fastq and re-aligned them to the genome with Tophat2 I added the parameters:

 --no-mixed -g 1 --prefilter-multihits

BUT I seem to get very few reads. If I remove the --no-mixed option I get 3 times more reads, but they don't seem to be paired? I mean, if I look for reads where both pairs are mapped there are very few of them...

I did samtools sort -n to checked this and I get unpaired reads:

NB:18:H3:3:22612:15664:16256,
NB:18:H3:3:22612:15664:16256  
NB:18:H3:3:22612:15664:16244 <=
NB:18:H3:3:22612:15664:16211

My next step is to remove PCR duplicates and for that I have a script that assumes we have both reads always (and why wouldn't be? I don't understand where one goes in this analysis...). I checked the unmapped,bam file but the missing reads are not there,,,

What am I missing?

Any help will be greatly appreciated

riboseq bwa • 1.9k views

ADD COMMENT • link 3.8 years ago by blur ▴ 280

1

Entering edit mode

Also, for your information, you should know that the old 'Tuxedo' pipeline of Tophat and Cufflinks is no longer the "advisable" tool for RNA-seq analysis. The software is deprecated/ in low maintenance and should be replaced by HISAT2, StringTie and ballgown. See this paper: Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. There are also other alternatives, including alignment with STAR and bbmap, or pseudo-alignment using salmon.

ADD REPLY • link 3.8 years ago by Carlo Yague 8.9k

0

Entering edit mode

Though I generally agree, most pipelines for Riboseq and papers I've encountered reference Tophat for riboseq analysis, specifically: https://pubmed.ncbi.nlm.nih.gov/28579404/

ADD REPLY • link 3.8 years ago by blur ▴ 280

0

Entering edit mode

Then I took the unaligned reads

How did you do that ? Did you extract reads that do not map to the rDNA or the reads that do not map in proper pairs ? I suspect you took the first option, and recommend the second one in order to keep read pairs intact.

ADD REPLY • link 3.8 years ago by Carlo Yague 8.9k

0

Entering edit mode

To get the reads I did:

samtools view -Sh -f 77 file.sam > R1.sam
samtools view -Sh -f 141 file.sam > R2.sam

Then

/nadata/software/samtools-1.3.1/bin/samtools fastq  R1.sam > R1.fq
/nadata/software/samtools-1.3.1/bin/samtools fastq  R2.sam > R2.fq

These were then re-aligned using TopHat and yes, this caused problems...

I didn't quite understand what you mean by - "reads that do not map in proper pairs" - does it not limit me to a subset? Or should I be running bowtie2 ith --un to have unaligned reads into a different file automatically?

I would appreciate if you specify what samtools/bowtie parameters you were referencing :) Thank you.

ADD REPLY • link 3.8 years ago by blur ▴ 280

0

Entering edit mode

reads that do not map in proper pairs would be: (proper pair = both ends mapped on the same chromosome in a reasonable distance, defined by the parameters --mate-inner-div and --mate-std-dev in tophat)

samtools view -Sh -F 2 file.sam > unmapped.sam

to make sure that no secondary/supplementary alignment are included, I would use this instead:

 samtools view -Sh -F 2306 file.sam > unmapped.sam

But I think your flags look ok too. then, BEFORE running the samtools fastq commands, it is important to group reads by read name. Here I pipe the results of samtools collate directly into samtools fastq.

samtools collate unmapped.sam | samtools fastq - -1 R1.fq -2 R2.fq

please let me know if it solves your issue.

ADD REPLY • link 3.8 years ago by Carlo Yague 8.9k

0

Entering edit mode

It did solve the read names, but it seems to get very few usable reads... I think perhaps the flag is too restrictive

Left reads: Input : 1493587 Mapped : 224369 (15.0% of input) of these: 7305 ( 3.3%) have multiple alignments (1528579 have >1) Right reads: Input : 1822965 Mapped : 248684 (13.6% of input) of these: 7305 ( 2.9%) have multiple alignments (1199415 have >1) 14.3% overall read mapping rate.

Aligned pairs: 44817 of these: 7305 (16.3%) have multiple alignments 31670 (70.7%) are discordant alignments 0.9% concordant pair alignment rate.

ADD REPLY • link 3.8 years ago by blur ▴ 280

0

Entering edit mode

You have an unusual high number of discordant alignments. This is weird and perhaps indicates that something got wrong in the process. I suggest that you look at a few discordantly aligned pairs (where mates aligned to different chromosomes) to try to understand what is going on. Where is the first read mapped and where is the second read mapped ? Also, what is your reference genome/transcriptome ? Is there a lot of redundancy in it ? Also, did you properly specify the --library-type parameter in tophat2 ?

ADD REPLY • link 3.8 years ago by Carlo Yague 8.9k