Question

Interpretation Of Consensus Sequence From Samtools Pileup

4

14.3 years ago

Ian 6.1k

I have generated a consensus sequence using 'pileup' then 'pileup2fq' from samtools. Can anyone tell me exactly what determines whether the resulting sequence is in UPPER or lower case?

An example of the FASTQ output is:

@header
GTTAAGATGAAACATTTACAGGATTTGATTGACGAACCTGATGAtttttcacaacccaat ccatCTtagactagaaaggtaTTTACGGTTGCTaaacattgcgttatgtttaaGACCTCA TGCCAATAGACTGTTTGAATTTTATGAactgtctcctttgggaaacttgttaagtcgtga aastnnnnnnnnnnnnnnncaagggtacttggtcatcagatctaccgcaaaagctCAAGG
+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~r oZMH!!!!!!!!!!!!!!!KZo~~Z~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Thanks.

samtools pileup • 6.5k views

ADD COMMENT • link updated 14.3 years ago by Pablo Marin-Garcia ★ 2.0k • written 14.3 years ago by Ian 6.1k

Istvan Albert · Answer 1 · 2011-04-05

3

14.3 years ago

Ian 6.1k

Thanks for viewing, but I found my answer here:

samtools.pl pileup2fq -D100 > var-X.fq

File var-X.fq is in the multi-line FASTQ format. Bases in lowercase are filtered out due to repeats, being close to indels or insufficient/excessive read depth. The consensus file is essential to the estimate of mutation rate. The pileup2fq command applies fewer filters than varFilter and may not give identical results.
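
To see how much of a consensus falls into each class, the bases can simply be counted by case. The following is a minimal sketch, not part of samtools: the function names `read_multiline_fastq` and `case_summary` and the tiny demo record are hypothetical, and the parser assumes the sequence and quality strings may wrap over several lines, as described above.

```python
from collections import Counter


def read_multiline_fastq(lines):
    """Yield (name, sequence, quality) from FASTQ text whose sequence and
    quality strings may be wrapped over several lines."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    i = 0
    while i < len(lines):
        name = lines[i][1:]  # drop the leading '@'
        i += 1
        seq_parts = []
        while i < len(lines) and not lines[i].startswith("+"):
            seq_parts.append(lines[i])
            i += 1
        seq = "".join(seq_parts)
        i += 1  # skip the '+' separator line
        # Read quality lines by length, because a quality line may itself
        # start with '@' or '+'.
        qual_parts, qual_len = [], 0
        while i < len(lines) and qual_len < len(seq):
            qual_parts.append(lines[i])
            qual_len += len(lines[i])
            i += 1
        yield name, seq, "".join(qual_parts)


def case_summary(seq):
    """Count uppercase (kept), lowercase (filtered) and n/N bases."""
    counts = Counter()
    for base in seq:
        if base in "nN":
            counts["n"] += 1
        elif base.isupper():
            counts["upper"] += 1
        else:
            counts["lower"] += 1
    return counts


# Made-up demo record, loosely modelled on the example in the question.
example_lines = [
    "@demo",
    "GTTAAGATGAtttttcac",
    "aastnnnnCAAGG",
    "+",
    "~" * 18,
    "oZMH!!!!KZo~~",
]

for name, seq, qual in read_multiline_fastq(example_lines):
    print(name, len(seq), dict(case_summary(seq)))
```

Note that ambiguity codes such as `s` are counted as lowercase here; whether they should be treated separately depends on the downstream analysis.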

ADD COMMENT • link updated 14.3 years ago by Istvan Albert 102k • written 14.3 years ago by Ian 6.1k

0

In this case, I guess you can mark your own answer as the best :-)

ADD REPLY • link 14.3 years ago by Neilfws 49k

0

I wish I hadn't asked it now....

ADD REPLY • link 14.3 years ago by Ian 6.1k

0

So, just to make sure: the lowercase letters were either covered but filtered out for the aforementioned reasons, or they were just not covered by any of the reads?

ADD REPLY • link 14.3 years ago by Doctoroots ▴ 810

0

In my experience, positions that are not covered at all do not produce any bases in the FASTQ file.

ADD REPLY • link 14.3 years ago by Ian 6.1k
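
For what it's worth, the run of `n` in the question's example appears to line up with `!` characters in the quality string. A minimal sketch for decoding such characters follows, assuming the common Phred+33 (Sanger) encoding; that offset is an assumption, and the helper name `phred33` and the shortened excerpt are illustrative only. Whether those `n` positions reflect missing coverage or something else is not settled in this thread.

```python
def phred33(qual):
    """Decode a quality string to integer scores, assuming an ASCII offset of 33."""
    return [ord(ch) - 33 for ch in qual]


# Shortened, illustrative excerpt modelled on the question's example.
seq = "aastnnncaag"
qual = "oZMH!!!KZo~"

for base, score in zip(seq, phred33(qual)):
    print(base, score)
```

Under that assumption `!` decodes to 0 and `~` to 93, so the `n` run carries the lowest possible quality while most other positions sit at the maximum.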

0

I ran the command, but the lowercase letters are not filtered out at all. Why?

ADD REPLY • link 13.6 years ago by Love ▴ 100
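
One possible reading of the documentation quoted in the answer is that pileup2fq only marks filtered bases by lowercasing them and does not remove them from the output. If that is the case and the filtered positions should be excluded downstream, they could be hard-masked afterwards. The helper below is a hypothetical sketch in Python, not a samtools feature:

```python
def hard_mask_lowercase(seq, mask_char="N"):
    """Replace every lowercase base with mask_char; leave other characters as-is."""
    return "".join(mask_char if base.islower() else base for base in seq)


print(hard_mask_lowercase("GATGAtttttcacCTtag"))
```

Masking keeps the coordinates intact, which matters if the consensus is later compared position by position against a reference.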

0

Hi Ian. Just curious: after converting to FASTQ format, how did you go on to estimate the mutation rate (which software/tools did you use)?

ADD REPLY • link 12.7 years ago by jackuser1979 ▴ 890

Ram · Answer 2 · 2011-12-02

0

13.6 years ago

Pablo Marin-Garcia ★ 2.0k

The lowercase letters are what is called 'soft masking' (bases in low-complexity regions such as repeats). I don't know whether this applies in your case, but some people provide a soft-masked reference genome to the aligner in order to avoid alignments in these regions. Nowadays this is not considered good practice, and I think that aligners like BWA do not filter out soft-masked regions, according to this Biostars answer and similar posts.

ADD COMMENT • link updated 5.8 years ago by Ram 45k • written 13.6 years ago by Pablo Marin-Garcia ★ 2.0k

1

Ian's answer is correct, not this one. Lowercase letters in samtools output do not mean the same thing as typical soft-masking: http://sourceforge.net/apps/mediawiki/samtools/index.php?title=SAM_protocol#Basic_Protocol_3:_Variant_Calling_with_SAMtools

ADD REPLY • link updated 5.8 years ago by Ram 45k • written 13.6 years ago by Casey Bergman 18k

0

+1 @Casey: You are right, Casey; I probably did not express myself properly. When I talked about soft-masking, I meant marking bases to be filtered out. Although the soft-masking rules can be slightly different, the concept is the same, as the samtools manual says: "Bases in lowercase are filtered out due to repeats, being close to indels or insufficient/excessive read depth"

ADD REPLY • link 13.6 years ago by Pablo Marin-Garcia ★ 2.0k