Question

samtools mpileup output documentation

4


8.3 years ago

bec ▴ 40

Hi,

Can anyone point me to documentation that explains all of the characters in the sequence column of samtools mpileup output?

(i.e. an explanation of every character in output that looks something like this: "gggccgggggggg**C+1G**^]T" in column 5.)

I've seen this post: Samtools Mpileup Output

But I'd like to know what the "+" (or "-") in the output means.

Thanks!

samtools mpileup output • 18k views

updated 8.2 years ago by James Bonfield ▴ 170 • written 8.3 years ago by bec ▴ 40

score 3 · Answer 1 · 2017-05-24

We really need to break the monolithic samtools manpage into sub-pages, but it does describe the pileup format and is likely to be more accurate than third-party rewrites. From http://www.htslib.org/doc/samtools.html:

In the pileup format (without -u or -g), each line represents a genomic position, consisting of chromosome name, 1-based coordinate, reference base, the number of reads covering the site, read bases, base qualities and alignment mapping qualities. Information on match, mismatch, indel, strand, mapping quality and start and end of a read are all encoded at the read base column. At this column, a dot stands for a match to the reference base on the forward strand, a comma for a match on the reverse strand, a '>' or '<' for a reference skip, ACGTN for a mismatch on the forward strand and acgtn for a mismatch on the reverse strand. A pattern \+[0-9]+[ACGTNacgtn]+ indicates there is an insertion between this reference position and the next reference position. The length of the insertion is given by the integer in the pattern, followed by the inserted sequence. Similarly, a pattern -[0-9]+[ACGTNacgtn]+ represents a deletion from the reference. The deleted bases will be presented as * in the following lines. Also at the read base column, a symbol ^ marks the start of a read. The ASCII of the character following ^ minus 33 gives the mapping quality. A symbol $ marks the end of a read segment.
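To make the description above concrete, here is a rough Python sketch that splits a read-bases string into labelled tokens. It is not code from samtools; the function name `parse_read_bases` and the token labels are my own, and it assumes well-formed input that uses only the symbols listed in the manpage text.

```python
import re

def parse_read_bases(bases):
    """Split a pileup read-bases string into (kind, value) tokens.

    Hypothetical helper for illustration; assumes well-formed input.
    """
    tokens = []
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            # '^' marks a read start; the next character encodes
            # the mapping quality as ASCII value minus 33.
            tokens.append(("read_start", ord(bases[i + 1]) - 33))
            i += 2
        elif c == "$":
            tokens.append(("read_end", None))
            i += 1
        elif c in "+-":
            # The integer after '+'/'-' gives the indel length;
            # exactly that many bases follow it.
            digits = re.match(r"[0-9]+", bases[i + 1:]).group()
            n = int(digits)
            start = i + 1 + len(digits)
            kind = "insertion" if c == "+" else "deletion"
            tokens.append((kind, bases[start:start + n]))
            i = start + n
        elif c in ".,":
            tokens.append(("match", "forward" if c == "." else "reverse"))
            i += 1
        elif c in "<>":
            tokens.append(("ref_skip", c))
            i += 1
        elif c == "*":
            # Placeholder for a base deleted on an earlier line.
            tokens.append(("deleted_base", c))
            i += 1
        else:
            # Upper case: forward-strand mismatch; lower case: reverse strand.
            tokens.append(("mismatch", c))
            i += 1
    return tokens

for token in parse_read_bases(".,A+2GT,^]T$"):
    print(token)
```

In this made-up example string, "+2GT" is read as a 2-base insertion of GT after the mismatching A, and "^]" is a read start with mapping quality 60 (the ASCII code of "]" is 93).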

score 1 · Answer 2 · 2017-05-24

This page can help:

https://en.wikipedia.org/wiki/Pileup_format

Specifically:

A sequence matching the regular expression \+[0-9]+[ACGTNacgtn]+ denotes an insertion of one or more bases starting from the next position
A sequence matching the regular expression -[0-9]+[ACGTNacgtn]+ denotes a deletion of one or more bases starting from the next position
* (asterisk) is a placeholder for a deleted base in a multiple basepair deletion that was mentioned in a previous line by the -[0-9]+[ACGTNacgtn]+ notation
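One caveat if you apply these regular expressions literally: the trailing [ACGTNacgtn]+ is greedy, so it can run past the end of the indel when the next read's base is a mismatch letter. The sketch below (the name `find_indels` is my own, and it assumes well-formed input) uses the integer to decide how many bases belong to the indel.

```python
import re

# Match only the sign and the length; the sequence is sliced by length.
INDEL_HEADER = re.compile(r"([+-])([0-9]+)")

def find_indels(bases):
    """Return (sign, sequence) pairs found in a read-bases string.

    Hypothetical helper; it does not special-case the mapping-quality
    character that follows '^'.
    """
    indels = []
    pos = 0
    while True:
        m = INDEL_HEADER.search(bases, pos)
        if m is None:
            break
        n = int(m.group(2))
        indels.append((m.group(1), bases[m.end():m.end() + n]))
        pos = m.end() + n  # continue after the indel sequence
    return indels

example = ".+1GT"  # made-up string: insertion of G, then a mismatch T
print(re.findall(r"\+[0-9]+[ACGTNacgtn]+", example))  # greedy pattern
print(find_indels(example))                           # length-bounded
```

On this example the greedy pattern returns "+1GT", while the length-bounded version reports a 1-base insertion of G and leaves the T as the next read's base.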