Question

Calculating Coverage From Pileup File To Find Gene Duplication Events

12.4 years ago

thecuriousbiologist ▴ 550

Hello,

I have a pileup file like the one below:

seq1 272 T 24  ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
seq1 273 T 23  ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
seq1 274 T 23  ,.$....,,.,.,...,,,.,...    7<7;<;<<<<<<<<<=<;<;<<6
seq1 275 A 23  ,$....,,.,.,...,,,.,...^l.  <+;9*<<<<<<<<<=<<:;<<<<

I have to find the gene coverage from this pileup file, and if the coverage of a gene is above a certain threshold, I want to treat that as a gene duplication event.

How can I go about solving this problem?

The only file that I have is the pileup file. I don't have a BAM file for this.

pileup coverage gene • 2.9k views

updated 12.4 years ago by Joseph Hughes ★ 3.0k • written 12.4 years ago by thecuriousbiologist ▴ 550

score 0 · Answer 1 · 2012-11-23

12.4 years ago

Joseph Hughes ★ 3.0k

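The 5th column lists the base seen in each read at that position. The symbols . and , mean a match to the reference on the forward and reverse strand respectively; A, C, G, T, N (upper case, forward strand) and a, c, g, t, n (lower case, reverse strand) are mismatches. A deleted base is shown as *, $ marks the end of a read, and ^ marks the start of a read; the single character right after ^ encodes the mapping quality of that read, so it is not a base and must be skipped. Indels that start after the position appear as + or - followed by a length and that many bases, and these should be skipped as well. So all you need to do in your favourite scripting language is count the . and , characters plus the ACGTN/acgtn characters in column 5, after skipping those annotations, and that will give you the coverage at that particular position.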
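As a rough sketch of that counting step in Python (the function name count_read_bases is made up for illustration, and reads showing a deletion as * are deliberately not counted here, which may differ from how column 4 is computed):

```python
def count_read_bases(bases: str) -> int:
    """Count reads that show a base (match or mismatch) in pileup column 5."""
    count = 0
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":
            # Read start: the next character is the read's mapping quality, not a base.
            i += 2
        elif ch in "+-":
            # Indel: sign, then a length, then that many inserted/deleted bases.
            i += 1
            digits = ""
            while i < len(bases) and bases[i].isdigit():
                digits += bases[i]
                i += 1
            i += int(digits) if digits else 0
        else:
            # '$' (read end) and '*' (deleted base) fall through without being counted.
            if ch in ".,ACGTNacgtn":
                count += 1
            i += 1
    return count


# Column 5 of the first two lines from the question (positions 272 and 273).
print(count_read_bases(",.$.....,,.,.,...,,,.,..^+."))  # 24
print(count_read_bases(",.....,,.,.,...,,,.,..A"))      # 23
```

For the four example lines the result agrees with the depth already given in column 4.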

Hope that helps, Joseph

12.4 years ago by Joseph Hughes ★ 3.0k

Thanks. Can I just use the 4th column directly to find the mean for specific regions, rather than looking at the 5th column?

Let's say I have a gene which covers the 2nd, 3rd and 4th positions in the above example (273 to 275). Can I not just add 23+23+23 and divide by 3? That would mean I have 23X coverage for this gene, is that correct?
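A minimal sketch of that calculation in Python, assuming whitespace-separated pileup lines with the depth in column 4. The function name mean_depth, the THRESHOLD value and the trimmed SAMPLE lines are all made up for illustration; the sum is divided by the region length rather than the number of lines, because pileup output usually leaves out positions with no coverage at all.

```python
def mean_depth(pileup_lines, chrom, start, end):
    """Mean of column 4 over chrom:start-end (1-based, inclusive).

    Positions missing from the pileup are treated as zero depth.
    """
    total = 0
    for line in pileup_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        if fields[0] == chrom and start <= int(fields[1]) <= end:
            total += int(fields[3])
    return total / (end - start + 1)


# The four positions from the question, trimmed to the first four columns.
SAMPLE = [
    "seq1 272 T 24",
    "seq1 273 T 23",
    "seq1 274 T 23",
    "seq1 275 A 23",
]

THRESHOLD = 40.0  # hypothetical cutoff; choose it from your own data
gene_mean = mean_depth(SAMPLE, "seq1", 273, 275)
print(gene_mean, gene_mean > THRESHOLD)  # 23.0 False
```

So for positions 273 to 275 the mean is 23.0, matching the 23X worked out by hand. What counts as a sensible threshold is left open here; comparing each gene against the typical depth of the rest of the sequence is one possible starting point.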