Question

Help Understanding Pileup Format

2

Entering edit mode

14.5 years ago

Pi ▴ 520

Hello

I have been given some data in pileup format. Whilst it is beyond the scope of my work to go into the details of how the reads/alignments etc are generated I think it would be remiss of me not to try and understand a bit more about what is going on.

I have read the SAM tools manual and explanation of the pileup format but still don't understand a few things. Here is an example taken from the manual

seq1 272 T 24  ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
seq1 273 T 23  ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
seq1 274 T 23   ,.$....,,.,.,...,,,.,...    7<7;<;<<<<<<<<<=<;<;<<6
seq1 275 A 23  ,$....,,.,.,...,,,.,...^l.  <+;9*<<<<<<<<<=<<:;<<<<
seq1 276 G 22  ...T,,.,.,...,,,.,....  33;+<<7=7<<7<&<<1;<<6<
seq1 277 T 22  ....,,.,.,.C.,,,.,..G.  +7<;<<<<<<<&<=<<:;<<&<
seq1 278 G 23  ....,,.,.,...,,,.,....^k.   %38*<<;<7<<7<=<<<;<<<<<
seq1 279 C 23  A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<

I don't understand the significance of the $ and ^ characters. The docs say they mark the start and end of read segments. Take the first line. Looking at the $, is that saying the that there is a read (the third read) whose last base is position 272 and the last base in that read is '.' which is the character after the $? Also looking at the ^ in row 1, is it also saying that the 24th read starts at position 272 and its first base is '.' which is the character after ^+

It also says 'The ASCII of the character following `^' minus 33 gives the mapping quality.' What is the mapping quality? Is it the quality of the mapping of that read to the reference genome? So for row 1 the quality is 10 (43 for ascii code for + minus 33) and the mapping quality for row 4 is 16 (49 -33). I assumed this mapping quality information would be in the SAM file.

If i am looking at variant data, is it safe to discard the $ and ^ data as I don't need to reconstruct the read sequence from pileup? I also don't think I see how I can use the mapping quality of a read segment to assess the quality of my variant data (please correct me if I am wrong!!!) - I can only meaningfully use the base qualities in the next column.

Also, what is a reference skip which is mentioned in pileup user manual

Having never used Samtools I am guessing there will be utilities to extract this data directly from the pileup file rather than my writing a custom parser?

Thank you for your time

pileup • 19k views

ADD COMMENT • link updated 14.5 years ago by Nina ▴ 400 • written 14.5 years ago by Pi ▴ 520

score 6 · Answer 1 · 2011-05-19

6

Entering edit mode

14.5 years ago

Nina ▴ 400

First things first: samtools will call the variants for you, if you include the -v option when generating the pileup. There is also a script called varFilter which will sift through the pileup and only report snps that it believes are likely to be real.

see http://samtools.sourceforge.net/cns0.shtml

(Note that the pileup function is now deprecated and has been replaced with mpileup.)

The purpose of including the $ and ^ characters is so you can actually track the start and end of individual reads while reading through the pileup. So unless you want to do this, you can safely ignore the $ and ^ characters as long as you ALSO ignore the character that comes after ^.

Yes mapping quality is a measure of the "goodness" of the alignment. This quality is primarily based on how many best and second-best places the read could align, and in the case of paired end data, the mapping quality is determined using info about both ends of a pair.

One may wish to exclude reads with low mapping quality from snp calling because if the alignment algorithm is not confident the read is in the right spot, then our confidence in snps resulting from that alignment is also reduced.

Finally, a word of warning: samtools pileup include lots of filtering that's done be default, so be sure you read the docs and you understand how the filtering was done before using the output.

ADD COMMENT • link 14.5 years ago by Nina ▴ 400

1

Entering edit mode

Sorry I mixed up the -c and -v options in my answer. (-v means only report variants, -c means identify potential snps) As you have probably figured out, the samtools pileup file can have several formats depending on the options used to create it. If your pileup file really looks like the example you've provided, then it was not run with the -c option. If your file looks like the format described in the page I linked to, then you already have snp calls and you can run this through varFilter. The varFilter param that relates to mapping quality is -Q, which sets the min root mean square mapQ.

ADD REPLY • link 14.5 years ago by Nina ▴ 400

0

Entering edit mode

Thanks Nina for your reply. I didn't generate this data - i was just given it. I only have the pileup files - not the raw data. Is there a way to filter reads with low mapping quality from the pileup at this point or could/should it have been done before the pileup was created. I read the link you gave but I couldn't see how varFilter distinguishes spurious snps.

ADD REPLY • link 14.5 years ago by Pi ▴ 520

0

Entering edit mode

I should point out a sam file was created with BWA. Variants were called with pileup -c option. I don't have the raw data, only the pileup

ADD REPLY • link 14.5 years ago by Pi ▴ 520

0

Entering edit mode

thanks again. i have read the docs for pileup but cant find any details about the default filtering you refer to?

ADD REPLY • link 14.5 years ago by Pi ▴ 520

0

Entering edit mode

If you run "samtools pileup" with no other parameters, you will get a usage message with a full list of the param options and their default settings. You will see that the bitwise filtering is set with "-m INT" option (it will exclude reads with bits in INT)

In version 0.1.14 the default value for this INT is 0x704, which means if any of the following bitwise flags are set, the read will not be included in the pileup: 0x0400 (aka 1024 or "d") duplicate 0x0200 (aka 512 or "f") failed QC 0x0100 (aka 256 or "s") non primary alignment 0x0004 (aka 4 or "u") unmapped

ADD REPLY • link 14.5 years ago by Nina ▴ 400