Lower alignment rate when using collapsed reads
14 months ago
blz ▴ 40

Hello everyone!

I mapped an RNA-Seq dataset two ways: 1) using all reads, and 2) using collapsed reads.

I got a lower percentage of mapped reads in the second case and I can't understand why: collapsed reads represent the unique reads, so shouldn't they map to the same locations? The sequence is the same; only the quantity varies (many copies per read type in the first case, only one per read type in the second). Is there a simple explanation for this? (I used fastx_collapser to collapse the reads.)
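For reference, the collapsing step was a plain fastx_collapser run, roughly like this (a sketch, not the exact command; -Q33 is for Phred+33 quality encoding and may or may not be needed by your fastx build):

zcat seq_noAdapt.fastq.gz | fastx_collapser -Q33 -o seq_noAdapt_collapsed.fasta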

bowtie2 mapping with all reads

bowtie2 -p 6 -x genome_index -U seq_noAdapt.fastq.gz -S seq_noAdapt.sam 2>seq_noAdapt.log

27557790 reads; of these:
  27557790 (100.00%) were unpaired; of these:
    3488459 (12.66%) aligned 0 times
    4281314 (15.54%) aligned exactly 1 time
    19788017 (71.81%) aligned >1 times
87.34% overall alignment rate

bowtie2 mapping with collapsed reads

bowtie2 -p 6 -x genome_index -f seq_noAdapt_collapsed.fasta -S seq_noAdapt_collapsed.sam 2>seq_noAdapt_collapsed.log

1166266 reads; of these:
  1166266 (100.00%) were unpaired; of these:
    504804 (43.28%) aligned 0 times
    192571 (16.51%) aligned exactly 1 time
    468891 (40.20%) aligned >1 times
56.72% overall alignment rate

Some notes about the dataset

  • It's an old dataset, so the reads are single-end, 36 bp.
  • It's intended to identify short RNAs.
  • As you can see above, there is a high number of multi-mapping reads, and this is expected: the loci giving rise to them are repetitive (and the genome itself is very repetitive).

Thank you!

short_rna RNA-Seq_mapping collapsed_reads
14 months ago
LChart 4.7k

Sequences are collapsed if they're sufficiently identical. Garbage (random) reads won't collapse. If you collapse the "good" reads and don't collapse the "garbage", then the proportion of "garbage" must go up.
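For illustration, using the bowtie2 counts from the question: the ~24.07M reads that align collapse down to ~661k unique sequences (roughly a 36-fold reduction), whereas the ~3.49M reads that don't align collapse only to ~505k (roughly 7-fold). Because the unalignable reads shrink far less, their share of the total rises from 12.66% to 43.28%.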

14 months ago

It looks like fastx_collapser does not support mismatches. As per LChart's answer, you are enriching for low-quality reads by getting rid of exact duplicates. There are two ways of dealing with this. One is to use BBTools' Clumpify, which does allow mismatches when removing duplicates, although with such short reads you'd have to use non-default parameters such as a shorter kmer length (here, s is the number of mismatches allowed):

clumpify.sh in=reads.fq out=deduped.fq dedupe k=19 s=2 passes=6

Alternatively, you can deduplicate after mapping:

dedupebymapping.sh in=mapped.sam out=deduped.sam

This is more robust to errors than sequence-based deduplication, although it will still enrich for garbage because the unmapped reads don't get deduplicated. You can also do both!
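A minimal sketch of the combined approach (filenames and options are illustrative; the bowtie2 call mirrors the one in the question):

clumpify.sh in=reads.fq out=clumped.fq dedupe k=19 s=2 passes=6
bowtie2 -p 6 -x genome_index -U clumped.fq -S mapped.sam
dedupebymapping.sh in=mapped.sam out=deduped.sam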

However, in this case, I don't recommend any deduplication. The reason is that with 36 bp SE reads in RNA-seq data you expect a lot of duplicates, and there is no way to differentiate PCR duplicates from coincidental duplicates caused by high coverage. Duplicate removal in RNA-seq is questionable in general if you are doing anything quantitative (such as peak calling), but especially so if you only have SE reads. You can still remove optical duplicates, but not PCR duplicates.
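If you only want to drop optical duplicates, Clumpify can restrict itself to those (a sketch; it assumes standard Illumina read headers with tile/X/Y coordinates, and dupedist depends on the sequencer model):

clumpify.sh in=reads.fq out=opticalDeduped.fq dedupe optical dupedist=40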
