Hello
I have been given some data in pileup format. Whilst it is beyond the scope of my work to go into the details of how the reads/alignments etc are generated I think it would be remiss of me not to try and understand a bit more about what is going on.
I have read the SAM tools manual and explanation of the pileup format but still don't understand a few things. Here is an example taken from the manual
seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
seq1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
seq1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
seq1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
seq1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<
seq1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
seq1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
I don't understand the significance of the $ and ^ characters. The docs say they mark the start and end of read segments. Take the first line. Looking at the $, is that saying the that there is a read (the third read) whose last base is position 272 and the last base in that read is '.' which is the character after the $? Also looking at the ^ in row 1, is it also saying that the 24th read starts at position 272 and its first base is '.' which is the character after ^+
It also says 'The ASCII of the character following `^' minus 33 gives the mapping quality.' What is the mapping quality? Is it the quality of the mapping of that read to the reference genome? So for row 1 the quality is 10 (43 for ascii code for + minus 33) and the mapping quality for row 4 is 16 (49 -33). I assumed this mapping quality information would be in the SAM file.
If i am looking at variant data, is it safe to discard the $ and ^ data as I don't need to reconstruct the read sequence from pileup? I also don't think I see how I can use the mapping quality of a read segment to assess the quality of my variant data (please correct me if I am wrong!!!) - I can only meaningfully use the base qualities in the next column.
Also, what is a reference skip which is mentioned in pileup user manual
Having never used Samtools I am guessing there will be utilities to extract this data directly from the pileup file rather than my writing a custom parser?
Thank you for your time
Sorry I mixed up the -c and -v options in my answer. (-v means only report variants, -c means identify potential snps) As you have probably figured out, the samtools pileup file can have several formats depending on the options used to create it. If your pileup file really looks like the example you've provided, then it was not run with the -c option. If your file looks like the format described in the page I linked to, then you already have snp calls and you can run this through varFilter. The varFilter param that relates to mapping quality is -Q, which sets the min root mean square mapQ.
Thanks Nina for your reply. I didn't generate this data - i was just given it. I only have the pileup files - not the raw data. Is there a way to filter reads with low mapping quality from the pileup at this point or could/should it have been done before the pileup was created. I read the link you gave but I couldn't see how varFilter distinguishes spurious snps.
I should point out a sam file was created with BWA. Variants were called with pileup -c option. I don't have the raw data, only the pileup
thanks again. i have read the docs for pileup but cant find any details about the default filtering you refer to?
If you run "samtools pileup" with no other parameters, you will get a usage message with a full list of the param options and their default settings. You will see that the bitwise filtering is set with "-m INT" option (it will exclude reads with bits in INT)
In version 0.1.14 the default value for this INT is 0x704, which means if any of the following bitwise flags are set, the read will not be included in the pileup: 0x0400 (aka 1024 or "d") duplicate 0x0200 (aka 512 or "f") failed QC 0x0100 (aka 256 or "s") non primary alignment 0x0004 (aka 4 or "u") unmapped