samtools flagstat returning 0% align reads
1
0
Entering edit mode
7.6 years ago

Hello,

I am using PASH 3.0 for aligning the .fastq file against GRCh37 genome (Using PASH is mandatory here). I am using the following command:

pash3 -g Homo_sapiens.GRCh37.66.fa -r input.fastq -o output.sam -3

When I use samtools flagstat to see the no.of read mapped to the genome I get a lot of warnings like: [W::sam_parse1] unrecognized reference name; treated as unmapped as explained in another Biostar question, this happens whenever you have a read mapping to a chromosome/contig that's not in the header but in my case as can be seen in output.sam file, every chromosome is there in the sam header file. I'm unable to understand why is it still happening!!

The output of

samtools flagstat output.sam

looks like this:

4942996 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 mapped (0.00%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

This is the header of my output.sam file

@HD     VN:1.0
@PG     ID:pash3        PN:Pash VN:3.01.02
@SQ     SN:chrMT dna:chromosome chromosome:GRCh37:MT:1:16569:1  LN:16569
@SQ     SN:chrX dna:chromosome chromosome:GRCh37:X:1:155270560:1        LN:155270560
@SQ     SN:chrY dna:chromosome chromosome:GRCh37:Y:2649521:59034049:1   LN:59373566
@SQ     SN:chr1 dna:chromosome chromosome:GRCh37:1:1:249250621:1        LN:249250621
@SQ     SN:chr2 dna:chromosome chromosome:GRCh37:2:1:243199373:1        LN:243199373
@SQ     SN:chr3 dna:chromosome chromosome:GRCh37:3:1:198022430:1        LN:198022430
@SQ     SN:chr4 dna:chromosome chromosome:GRCh37:4:1:191154276:1        LN:191154276
@SQ     SN:chr5 dna:chromosome chromosome:GRCh37:5:1:180915260:1        LN:180915260
@SQ     SN:chr6 dna:chromosome chromosome:GRCh37:6:1:171115067:1        LN:171115067
@SQ     SN:chr7 dna:chromosome chromosome:GRCh37:7:1:159138663:1        LN:159138663
@SQ     SN:chr8 dna:chromosome chromosome:GRCh37:8:1:146364022:1        LN:146364022
@SQ     SN:chr9 dna:chromosome chromosome:GRCh37:9:1:141213431:1        LN:141213431
@SQ     SN:chr10 dna:chromosome chromosome:GRCh37:10:1:135534747:1      LN:135534747
@SQ     SN:chr11 dna:chromosome chromosome:GRCh37:11:1:135006516:1      LN:135006516
@SQ     SN:chr12 dna:chromosome chromosome:GRCh37:12:1:133851895:1      LN:133851895
@SQ     SN:chr13 dna:chromosome chromosome:GRCh37:13:1:115169878:1      LN:115169878
@SQ     SN:chr14 dna:chromosome chromosome:GRCh37:14:1:107349540:1      LN:107349540
@SQ     SN:chr15 dna:chromosome chromosome:GRCh37:15:1:102531392:1      LN:102531392
@SQ     SN:chr16 dna:chromosome chromosome:GRCh37:16:1:90354753:1       LN:90354753
@SQ     SN:chr17 dna:chromosome chromosome:GRCh37:17:1:81195210:1       LN:81195210
@SQ     SN:chr18 dna:chromosome chromosome:GRCh37:18:1:78077248:1       LN:78077248
@SQ     SN:chr19 dna:chromosome chromosome:GRCh37:19:1:59128983:1       LN:59128983
@SQ     SN:chr20 dna:chromosome chromosome:GRCh37:20:1:63025520:1       LN:63025520
@SQ     SN:chr21 dna:chromosome chromosome:GRCh37:21:1:48129895:1       LN:48129895
@SQ     SN:chr22 dna:chromosome chromosome:GRCh37:22:1:51304566:1       LN:51304566
SRR1917223.2807307      0       chrMT   147     99      50M     *       0       0       CATTCTATTATTTATCGCACCTACGTTCAATATTTCAGGCGAACATACTT      CCCCCFFFFFFFGGGGGGGGGGHHGHHHHHHGHHHHHHHHGGGGGHHHHH
SRR1917223.4613215      16      chrMT   15      99      50M     *       0       0       CACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTC      FBCBHGHHBFFGHGEFEEFGHGHGGGCGGGGGGGGGCGFFFFBA3ABBAB
SRR1917223.2209200      16      chrMT   370     99      50M     *       0       0       CCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCA      GHEGGFBEHFFHGGFAFHGHHFDBHGGFGGGGGFGGFEDD?AFFFABABA

Can somebody guide me what am I doing wrong here and how can I overcome the issue. I have tried to use another alignment tool (Bowtie) for the same input.fastq file and when I get alignment statistics by samtools flagstat, I get around 60% of alignment mapped to the genome, so problem is not in the input.fastq file but the PASH 3.0 options. If somebody has faced the same problem then please share your experience. Thank you.

PASH 3.0 samtools alignment • 3.3k views
ADD COMMENT
1
Entering edit mode

It looks like the chromosome names in the sam alignments only have the first part of the chromosome name that is in the header, the part up to the first space in the chromosome name, samtools is probably looking for a perfect match.

ADD REPLY
2
Entering edit mode
7.6 years ago

I wonder if having spaces in your references sequence names is the problem; Maybe Bowtie is okay with them, but PASH is not. I'd fix that, and realign.

ADD COMMENT
0
Entering edit mode

Sorry, I couldn't understand, can you please point out about which spaces you are talking here?

ADD REPLY
0
Entering edit mode

For example, the mitochondrial sequence, '@SQ SN:chrMT dna:chromosome chromosome:GRCh37:MT:1:16569:1' is named 'chrMT dna:chromosome chromosome:GRCh37:MT:1:16569:1' in the header, and just 'chrMT' in the alignments you have shown above.

ADD REPLY
0
Entering edit mode

This is what has been done automatically by PASH. That's how my header of input.fastq looks like:

@SRR1917223.1 1/1
TCACAGTTATTGCCAATTTGGGCTAATTCTTGGTTTTGGAGCACAATATT
+
>>1>11BB3DD3FF11BFG311BF01DFDGG11BFFF0010B0B00B1DF
@SRR1917223.2 2/1
TTTTGTTTTGTTTTTTGGACATAACCTCAAGATTTTATTGTCCTCATAAT
+
>AAA1BCCF1CFGGFE001B1B11BBFE1111BFGG2DF1DAAFE1DE2D
@SRR1917223.3 3/1
TGCCCAGGCCACTCCTGCAGACATTGGGGCGTGGGCGTGGGGTCCCTGAG
+

and this is how the header pf output.sam file looks like, which I obtained from Bowtie:

@HD     VN:1.0  SO:unsorted
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr2 LN:243199373
@SQ     SN:chr3 LN:198022430
@SQ     SN:chr4 LN:191154276
@SQ     SN:chr5 LN:180915260
@SQ     SN:chr6 LN:171115067
@SQ     SN:chr7 LN:159138663
@SQ     SN:chr8 LN:146364022
@SQ     SN:chr9 LN:141213431
@SQ     SN:chr10        LN:135534747
@SQ     SN:chr11        LN:135006516
@SQ     SN:chr12        LN:133851895
@SQ     SN:chr13        LN:115169878
@SQ     SN:chr14        LN:107349540
@SQ     SN:chr15        LN:102531392
@SQ     SN:chr16        LN:90354753
@SQ     SN:chr17        LN:81195210
@SQ     SN:chr18        LN:78077248
@SQ     SN:chr19        LN:59128983
@SQ     SN:chr20        LN:63025520
@SQ     SN:chr21        LN:48129895
@SQ     SN:chr22        LN:51304566
@SQ     SN:chrX LN:155270560
@SQ     SN:chrY LN:59373566
@SQ     SN:chrM LN:16571
@PG     ID:Bowtie       VN:1.0.0        CL:"bowtie -p 4 -S -a -m 1 -v 1 hg19 input.fastq output.sam"
SRR1917223.1    16      chr10   54759061        255     50M     *       0       0       AATATTGTGCTCCAAAACCAAGAATTAGCCCAAATTGGCAATAACTGTGA      FD1B00B0B0100FFFB11GGDFD10FB113GFB11FF3DD3BB11>1>>      XA:i:0  MD:Z:50 NM:i:0
SRR1917223.4    0       chr2    27274573        255     50M     *       0       0       TGGGGGCCCGGGTGGAGGGCCCGGGTGCCGGGGCCCAAGGTGCGGCCTCG      111111>>>000A00000/AAE///B//B////>>E//>/BBF?/EEEF/      XA:i:0  MD:Z:50 NM:i:0

So, based on this can I say that if I remove the part between SN:chrM and LN:16571 then maybe samtools will report them as aligned reads in the PASH case as well??

ADD REPLY
1
Entering edit mode

This example does not match the one in the original post. Are both giving you the same error?

Spaces in chromosome names is an issue that is going to crop up time and again. I recently had a problem with featureCounts and a sam file which had spaces in chromosome names. (A work-around is to remove the spaces from names before/during alignment). A current issue is open on samtools github site for this.

ADD REPLY
0
Entering edit mode

I think that should work.

Are you using versions of the genome from different sources, though - your Bowtie version calls the mitochondrial sequence chrM and it has a length of 16571, whereas your other example calls the mitochondrial sequence chrMT and the sequence length is 16569? Don't mix the two versions.

ADD REPLY
0
Entering edit mode

For bowtie, I used the pre-compiled genome from UCSC while for PASH I was using GRCh37_r66 from Ensemble ftp site.

ADD REPLY
2
Entering edit mode

Go to the fasta that you downloaded for use with PASH, change all the spaces in the reference names to underscores. Remake the genome index, realign.

ADD REPLY

Login before adding your answer.

Traffic: 1659 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6