Some Help Understanding With Mpileup Output
3
1
Entering edit mode
12.3 years ago
Lee Katz ★ 3.2k

I produced the VCF file using mpileup with no options, so that the bases would be displayed. However, I don't understand some of the notations in the displayed bases. The first one is that I don't understand the ^KT notation (or ^Kt) here (questions continued after the example). What does it mean? The sam file shows that there should be 8 reads there, but there are 9 characters or 6 if ^K is one character. I believe that I have also seen ^E characters too but am not showing them here.

$ grep -m 5 '\^K' 1792tmp_q1_v2/pileup.vcf 
CP003069    1    N    3    ^KT^Kt^KT    GGI
CP003069    2    N    8    TtT^KT^Kt^Kt^Kt^Kt    BFIIEHDD
CP003069    3    N    12    TtTTtttt^KT^Kt^Kt^Kt    GIIIHFIIGIGI
CP003069    4    N    20    CcCCccccCccc^Kc^Kc^Kc^Kc^KC^Kc^KC^KC    GGIIEFGIHIHHIIGHHIGH
CP003069    5    N    27    AaAAaaaaAaaaaaaaAaAA^Ka^KA^KA^Ka^Ka^Ka^KA    BIIIHHGIHHHIIEBHHIGHIIHIEIG
$ samtools view 1792tmp_q1_v2/aln.sorted.bam|head|cut -f 2,3,4,5,6
16    CP003069    1    0    4M1I27M
16    CP003069    1    0    4M1I31M
16    CP003069    1    0    4M1I31M
16    CP003069    1    0    4M1I31M
0    CP003069    1    42    36M
16    CP003069    1    0    4M1I31M
16    CP003069    1    42    36M
0    CP003069    1    42    36M
0    CP003069    2    42    36M
16    CP003069    2    42    36M

Does the +1tG mean that there is a T insertion or does it mean that there is a T, and then there is an extra base, a G? If it is the first option, then is the repeating insertion, then G basecall, an indicator that there are multiple alleles present? In my case of a haploid, would it be an indicator of incorrect mapping?

CP003069    111111    N    117    G$G$g$G$GGGggG+1Tg+1tG+1Tg+1tG+1TG+1Tg+1tG+1Tg+1tG+1TG+1TG+1Tg+1tG+1TG+1TG+1TG+1TGg+1tG+1Tg+1tG+1Tg+1tg+1tg+1tg+1tg+1tg+1tg+1tG+1TG+1Tg+1tg+1tg+1tG+1TG+1TG+1TG+1TG+1Tg+1tg+1tg+1tg+1tg+1tG+1Tg+1tg+1tG+1TG+1TG+1Tg+1tg+1tG+1TG+1TG+1TG+1TG+1TG+1TG+1TG+1TG+1Tg+1tG+1Tg+1tg+1tG+1TG+1Tg+1tG+1TG+1TG+1Tg+1tg+1tg+1tg+1tG+1TG+1TG+1Tg+1tG+1TG+1TG+1TG+1Tg+1tg+1tg+1tG+1TG+1TG+1Tg+1tG+1Tg+1tG+1Tg+1tg+1tg+1tg+1tG+1Tg+1ttG+1Tg+1tg+1tg+1t^It^9g+1t^IT^9g+1t

And then last question (for now). Can you help me interpret -1NC? Or is it collapsing reads to one locus and/or bad mapping?

CP003069    111111    N    121    C$c$CcccCcCcccCcCcCCCCCCcccc-1nc-1nc-1nc-1nc-1nC-1NC-1Nc-1nc-1nc-1nc-1nc-1nc-1nC-1Nc-1nc-1nC-1NC-1NC-1Nc-1nC-1Nc-1nC-1Nc-1nC-1NC-1Nc-1nC-1NC-1NC-1NC-1NC-1Nc-1nc-1nc-1nC-1NC-1NC-1NC-1NC-1Nc-1nc-1nc-1nc-1nC-1NC-1NC-1NC-1NC-1Nc-1nC-1NC-1Nc-1nC-1NC-1Nc-1nc-1nc-1nC-1Nc-1nc-1nc-1nC-1Nc-1nc-1nC-1NC-1NC-1Nc-1nC-1Nc-1nc-1nCCCCcCCCCccCCCccCCCcC^@C^@C^@C

Thanks for helping me!!

mpileup samtools • 4.6k views
ADD COMMENT
3
Entering edit mode
12.3 years ago
matted 7.8k

Your questions are answered in the mpileup documentation on the samtools manual page, but in a relatively terse way with a lot of regular expressions.

The ^ and the letter indicates the start of the read and its mapping quality (more precisely, from the docs: "at the read base column, a symbol ‘^’ marks the start of a read. The ASCII of the character following '^' minus 33 gives the mapping quality. A symbol '$' marks the end of a read segment.").

The +1TG indicates an insertion (the +) of length 1 (the 1) with nucleotide T (the T). The G is a base call from another read, unrelated to the +1T. Similarly, the -1NC is a length-1 deletion of an N next to another read giving a C. I'm not sure why it's interleaved so well, though.

And again more precisely from the documentation: "A pattern ‘+[0-9]+[ACGTNacgtn]+’ indicates there is an insertion between this reference position and the next reference position. The length of the insertion is given by the integer in the pattern, followed by the inserted sequence. Similarly, a pattern ‘-[0-9]+[ACGTNacgtn]+’ represents a deletion from the reference."

ADD COMMENT
0
Entering edit mode

Ok, thanks. What does the K mean next to the ^ though (^K)?

ADD REPLY
0
Entering edit mode

I guess the K is the mapping quality of the entire read?

ADD REPLY
0
Entering edit mode

Yes. The ASCII value of K is 75, so following the instructions, subtract 33 to give a mapping quality of 42 [for the entire read, which starts at this position].

ADD REPLY
1
Entering edit mode
10.4 years ago

A couple of things. For one, .vcf stands for variable call format, it's a way of documenting SNVs. That's not what you are looking at there. You are looking at ordinary pileup output, but not a .vcf files. So I'd stop calling it that.

Two, pileups usually should not put all N's in the third column. You probably want to give it the reference file you used to align to, and then it will fill in the correct reference letter. If you thought you did that, but still have N's, you might have a mismatch between the reference you told mpileup, and the reference you actually used. Or, the reference index was made improperly. Unfortunately, mpileup will not tell you if that is the case, it simply proceeded without the proper index. You can remake the index, and check to make sure that the command executes with no errors, with the command

samtools faidx reference.fa

So I'd check to be sure you are using the right index, remake the reference index, then rerun mpileup.

ADD COMMENT
0
Entering edit mode

Do you know where can I download the reference fasta file for human genome build 19 which could be directly used by samtools? You know, some format of fasta file could not be used in samtools -f option.

ADD REPLY
0
Entering edit mode
10.4 years ago

I have one question, for the insertion +1TG, can we find its corresponding base-call quality score? I guess in the base quality column for CP003069 111111 N 117 listed above, there are only 117 characters. So could we find the base quality for insertions?

ADD COMMENT

Login before adding your answer.

Traffic: 2002 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6