Interpretation Of Consensus Sequence From Samtools Pileup
13.6 years ago
Ian 6.1k

I have generated a consensus sequence using 'pileup' then 'pileup2fq' from samtools. Can anyone tell me exactly what determines whether the resulting sequence is in UPPER or lower case?

An example of the fastq is:

@header
GTTAAGATGAAACATTTACAGGATTTGATTGACGAACCTGATGAtttttcacaacccaat
ccatCTtagactagaaaggtaTTTACGGTTGCTaaacattgcgttatgtttaaGACCTCA
TGCCAATAGACTGTTTGAATTTTATGAactgtctcctttgggaaacttgttaagtcgtga
aastnnnnnnnnnnnnnnncaagggtacttggtcatcagatctaccgcaaaagctCAAGG
+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~r
oZMH!!!!!!!!!!!!!!!KZo~~Z~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Thanks.

13.6 years ago
Ian 6.1k

Thanks for viewing, but I found my answer in the samtools documentation:

samtools.pl pileup2fq -D100 > var-X.fq

File var-X.fq is in the multi-line FASTQ format. Bases in lowercase are filtered out due to repeats, being close to indels or insufficient/excessive read depth. The consensus file is essential to the estimate of mutation rate. The pileup2fq command applies fewer filters than varFilter and may not give identical results.
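To see which stretches of a consensus were filtered, it can help to list the lowercase runs directly. The following is only a minimal sketch, assuming the output is multi-line FASTQ with an `@` header line and a `+` separator as in the example in the question. The function names `read_multiline_fastq` and `lowercase_runs` are hypothetical helpers written for this illustration and are not part of samtools.

```python
import re


def read_multiline_fastq(lines):
    """Yield (name, sequence, quality) from multi-line FASTQ text lines."""
    it = (ln.rstrip("\r\n") for ln in lines)
    for line in it:
        if not line.startswith("@"):
            continue  # skip anything before the next record header
        name = line[1:]
        seq_parts = []
        for line in it:
            if line.startswith("+"):
                break  # separator reached: quality lines follow
            seq_parts.append(line.strip())
        seq = "".join(seq_parts)
        # Read quality by length, since a quality line may itself start with '@'.
        qual = ""
        while len(qual) < len(seq):
            nxt = next(it, None)
            if nxt is None:
                break  # truncated record
            qual += nxt.strip()
        yield name, seq, qual


def lowercase_runs(seq):
    """Return 0-based, half-open (start, end) spans of lowercase characters."""
    return [m.span() for m in re.finditer(r"[a-z]+", seq)]


# Tiny made-up record (not real data) to show the call pattern.
demo = ["@header", "ACGTac", "gtnnAC", "+", "~~~~~~", "~~!!~~"]
for name, seq, qual in read_multiline_fastq(demo):
    print(name, lowercase_runs(seq))  # prints: header [(4, 10)]
```

For a real file, pass an open file handle instead of the `demo` list. Note that the spans include `n` characters as well as lowercase called bases, so the two are not distinguished here.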

In this case, I guess you can mark your own answer as the best :-)

I wish I hadn't asked it now...

So, just to make sure: the lowercase letters were either covered but filtered out for the aforementioned reasons, or they were simply not covered by any of the reads?

In my experience, if positions are not covered at all, then you do not get any bases for them in the FASTQ file.
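One way to check this on your own data is to look for positions with no base call. In the example FASTQ in the question, the run of `n` carries the quality character `!`, which is the lowest value if the file uses Phred+33 encoding. Whether such positions correspond to uncovered sites is an assumption here, not something confirmed in this thread. The helper name `no_call_positions` is hypothetical and not part of samtools.

```python
def no_call_positions(seq, qual):
    """Return 0-based positions where the base is 'n'/'N' and the quality is '!'."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return [
        i
        for i, (base, q) in enumerate(zip(seq, qual))
        if base in "nN" and q == "!"
    ]


# Shortened, made-up fragment modelled on the example in the question.
print(no_call_positions("aastnnnca", "oZMH!!!KZ"))  # prints: [4, 5, 6]
```

The length check guards against passing a sequence and quality string from different records.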

I ran the command, but the lowercase letters are not filtered out at all. Why?

Hi Ian. Just curious: after converting to FASTQ format, how did you go on to estimate the mutation rate (which software/tools did you use)?

13.0 years ago

The lowercase letters are what is called 'soft masking' (bases in low-complexity regions such as repeats). I don't know whether this applies in your case, but some people give aligners a soft-masked reference genome in order to avoid alignments in these regions. Nowadays this is not seen as good practice, and I think aligners like BWA do not filter out soft-masked regions, according to this Biostars answer and similar posts.

Ian's answer is correct, not this one. Lowercase letters in samtools output do not mean the same thing as typical soft-masking: http://sourceforge.net/apps/mediawiki/samtools/index.php?title=SAM_protocol#Basic_Protocol_3:_Variant_Calling_with_SAMtools

+1 @Casey: You are right, Casey; I probably did not express myself properly. When I talked about soft-masking, I meant marking bases to be filtered out. Although the soft-masking rules can be slightly different, the concept is the same, as the samtools manual says: "Bases in lowercase are filtered out due to repeats, being close to indels or insufficient/excessive read depth."
