Question

Number of reads in different alignments

0

Entering edit mode

6.3 years ago

marongiu.luigi ▴ 730

Dear all,

Just to confirm: it is normal to have a different number of reads when aligning the same reads against two different genomes?

I think the answer is YES because the aligner generates different hits (reads) according to the reference genome. Also, the number of reads of the starting fastq file is different from that of the SAM file, thus SAM does not report a number that is tethered to the starting file.

For instance, I have:

x=$(zcat <file>_1.fq.gz | wc -l)
L=$(($x/4))
echo $L
1 190 389 447 # same number on the paired file
samtools flagstat <align_first_reference>.sam |  grep 'total' | cut -d+ -f1
2 383 795 990
samtools flagstat <align_second_reference>.sam |  grep 'total' | cut -d+ -f1
2 381 177 452

Thank you.

alignment SAM • 1.2k views

ADD COMMENT • link updated 21 months ago by Ram 44k • written 6.3 years ago by marongiu.luigi ▴ 730

0

Entering edit mode

You also need to account for secondary alignments that are bound to be different as well.

ADD REPLY • link 6.3 years ago by GenoMax 148k

score 3 · Accepted Answer · 2018-08-30

3

Entering edit mode

6.3 years ago

Joe 21k

If they're different genomes, of course. The sequence won't be exactly the same so it's to be expected that different numbers will map.

ADD COMMENT • link 6.3 years ago by Joe 21k

1

Entering edit mode

Also, filled gaps in newer versions of the reference may cause reads to become multimappers and potentially trigger the aligner to flag them as unmapped, given they align to too many regions. HISAT2 has such an issue if I remember correctly, producing notably different alignments when using hg19 vs hg38. What kind of references are you using?

ADD REPLY • link 6.3 years ago by ATpoint 86k