Different Qualities In Bam And Pileup Files
12.3 years ago
Mus Musculus ▴ 20

I have the following problem. I run samtools like this:

samtools mpileup -f ref.fasta -l contigs.list input.bam > output.pileup

The input reads have good base qualities, but in the pileup file the base qualities at SNP positions are poor. For example:

AAAATTTTG - reference

AAACTTTT     qqqqqqqq

AACTTTTG     qqqqqqqq

In the pileup file, the base quality at the SNP position is !!, while the base quality at ordinary positions is qq. This does not always happen, and not at every SNP position, but it happens very often. Thanks.

samtools snp • 4.3k views

Did you convert your base qualities at any point? Did you specify illumina quality?


Reads are actually SOLiD data. And we didn't convert anything.

12.3 years ago
Andreas ★ 2.5k

As Istvan correctly pointed out, your data looks odd.

But anyway, if you see differences between the quality scores in a pileup (samtools mpileup) and those of the actual reads (samtools view), this is very likely due to samtools' automatic BAQ (base alignment quality) computation, which can downgrade the quality score of a base that is likely to be misaligned. BAQ is switched on by default, but you can disable it with mpileup's -B option (not recommended, though). See Li, 2011.

Andreas
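
One way to check whether BAQ explains the difference is to run mpileup twice, once with and once without -B, and compare the quality columns. Below is a minimal Python sketch that only assembles the two command lines. The helper name build_mpileup_cmd is hypothetical, and the file names are the ones from the question.

```python
from typing import List, Optional


def build_mpileup_cmd(
    ref_fasta: str,
    bam: str,
    positions_file: Optional[str] = None,
    disable_baq: bool = False,
) -> List[str]:
    """Assemble a samtools mpileup argument list (hypothetical helper)."""
    cmd = ["samtools", "mpileup", "-f", ref_fasta]
    if positions_file is not None:
        # -l restricts the pileup to the listed positions/regions
        cmd += ["-l", positions_file]
    if disable_baq:
        # -B switches off the BAQ computation
        cmd.append("-B")
    cmd.append(bam)
    return cmd


with_baq = build_mpileup_cmd("ref.fasta", "input.bam", "contigs.list")
without_baq = build_mpileup_cmd(
    "ref.fasta", "input.bam", "contigs.list", disable_baq=True
)
print(" ".join(with_baq))
print(" ".join(without_baq))
```

If samtools is installed, each list can be passed to subprocess.run with stdout redirected to a file. The two outputs can then be compared at the SNP positions.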


I experienced something similar a while ago, and BAQ was the cause. So even if the qualities require conversion, don't be surprised if there are still differences.


Thank you, Andreas. The problem was the BAQ computation; it underestimates the qualities. As for our data, they were fine. The text below is just part of the SAM file: fields that are insignificant for my question were dropped.

12.3 years ago
Mus Musculus ▴ 20

View from bam file:

278_1780_882_F3 6004 GAAAGGACGGAGTGAACGAACTGATGGTTAACAAAGGC qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
1277_1512_1769_F3 6013 GAGTGAACGAACTGATGGTTAACAAAGGCCTCATCAAGGA qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
405_1727_636_F3 6014 AGTGAACGAACTGATGGTTAACAAAGGCCTCATCAAGGAATACCGTGA qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
235_427_1021_F3 6017 GAACGAACTGATGGTTAACAAAGGCCTCATCAAGGAATACCGTGA qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
561_1318_1376_F3 6023 ACTGATGGTTAACAAAGGCCTCATCAAGGAATACCGTGACTTTACCGA qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
1806_1724_117_F3 6024 CTGATGGTTAACAAAGGCCTCATCAAGGAATACCGTGACTTTACCG qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
282_1435_1468_F3 6025 TGATGGTTAACAAAGGCCTCATCAAGGAATACCGTGACTTTACC qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
1271_1653_1368_F3 6026 GATGGTTAACAAAGGCCTCATCAAGGAATACCGTGACTTTACCGAGCG qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
2340_30_76_F3 6026 GATGGTTAACAAAGGCCTCATCAAGGAATACCGTGACTTTACCGAGC qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
1008_1283_1513_F3 6052 AATACCGTGACTTTACCGAGCGTTGCTTCCAGGACATTACTCCCGAGG qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
1967_902_850_F3 6054 TACCGTGACTTTACCGAGCGTTGCTTCCAGGACATTACTCCCGAG qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
248_393_300_F3 6059 TGCATTTACCGAGCGTTGCTTCCAGGACATTACTC qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
1546_1682_68_F3 6059 TGACTTTACCGAGCGTTGCTTCCAGGACATTACTCCCGAGGAGCAGCA q!!qqqq!!qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq

Position in pileup file:

gi|260081398|gb|ACBX02000015.1| 6034 C 12 .AAaaAAaaAAA !!!!!!!!!!!!
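
To make the pileup line easier to read, here is a small Python sketch that splits it into its six columns and decodes the quality string. It assumes the standard pileup column order (sequence name, position, reference base, depth, read bases, base qualities) and the Phred+33 encoding. The function name parse_pileup_line is hypothetical, and the column boundaries in the example string are my reading of the line above.

```python
def parse_pileup_line(line: str, offset: int = 33) -> dict:
    """Split one whitespace-separated pileup line and decode its qualities."""
    chrom, pos, ref, depth, bases, quals = line.split()[:6]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "depth": int(depth),
        "bases": bases,
        # Phred score = ASCII code of the character minus the offset
        "quals": [ord(ch) - offset for ch in quals],
    }


line = "gi|260081398|gb|ACBX02000015.1| 6034 C 12 .AAaaAAaaAAA !!!!!!!!!!!!"
record = parse_pileup_line(line)
print(record["pos"], record["ref"], record["depth"])
print(record["quals"])  # every '!' decodes to 0 under Phred+33
```

A decoded quality of 0 at every read covering the position is the pattern one would expect if the qualities were capped after alignment rather than being low in the original reads.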

Please edit your question to add this to the main post, then delete this answer. Right now it is listed as an answer, and people will think your problem has been solved.

12.3 years ago

The q is not a valid quality in the Sanger encoding. At the same time, what you are listing is not a valid BAM file.


q is ASCII 113. Illumina and, I think, SOLiD qualities range from 64 to 126, so it looks like you need to convert your quality scores to the Phred scale. If I remember correctly, both BWA and Bowtie can be told to convert the quality range of the FASTQ. Then again, I have never worked with SOLiD data.


Current (two-year-old) Illumina and SOLiD systems use the Sanger (+33) encoding. Older Illumina systems were indeed on the +64 scale (not sure about SOLiD), but even on that scale the reported qualities typically end at 41 (they do not use the entire scale). The q would correspond to a quality of 49, which makes it a somewhat suspicious value even on the older scale. Tools may also behave strangely once they get codes outside the expected range. There is little error checking, perhaps rightfully so, as it would typically mean billions of wasted checks.
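
The arithmetic behind these numbers can be checked with a few lines of Python. This is only an illustration of the two offsets discussed here (+33 and +64); the function name phred_score is hypothetical.

```python
def phred_score(char: str, offset: int) -> int:
    """Decode one quality character: ASCII code minus the encoding offset."""
    return ord(char) - offset


for ch in ("q", "!"):
    print(
        repr(ch),
        "ascii:", ord(ch),
        "+33:", phred_score(ch, 33),
        "+64:", phred_score(ch, 64),
    )
```

With the +33 offset, q decodes to 80 and ! to 0. With the +64 offset, q decodes to 49, and ! falls below zero, so it cannot be a +64-encoded quality.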
