Question

Count References Read From Bam File

2

Entering edit mode

12.4 years ago

Gvj ▴ 470

I want to know how many reads (above certain score ) supporting References allele at each position in the genome from a BAM file. What I am doing is converting BAM to mpileup format and then parsing it. samtools mpileup -d 10000 -C 1 -E -f file.bam > file.mpileup

Of course mpileup file is too big that it need time to process. Is there any tool to get References allele count (EXCLUDING SNP ALLELE READ) above certain score ?

Thanks,

bam • 6.7k views

ADD COMMENT • link updated 12.4 years ago by swbarnes2 15k • written 12.4 years ago by Gvj ▴ 470

score 2 · Answer 1 · 2013-02-19

2

Entering edit mode

12.4 years ago

Irsan ★ 7.8k

in a samtools pileup file, the fifth column contains the read base at a particular nucleotide position. A dot means that the base in the read is the same as the forward strand, a comma means the read base is the same for the reverse strand. So when you count the occurences of a dot or a comma in the 5th column of the pileup file in each line you have the amount of reference bases in each read:

cut -f5 file.mpileup | sed 's/[\.,]//' | awk '{ print length }' > counts.tsv

ADD COMMENT • link 12.4 years ago by Irsan ★ 7.8k

0

Entering edit mode

Thanks Irsan. I read about pileup format ( http://samtools.sourceforge.net/pileup.shtml). Something which I couldn't find is the meaning of ">" and "<" symbols in 5th column.

I was looking for a method where I don't have to make mpileup file (Its really big and takes lot of time to make it), otherwise I was doing like

perl -ne '@a=split " ";$l=$a[4]=~tr/[.,]//; print "$a[0] $a[1] $l\n"' file.mpileup > file.mpileup.ref.count

ADD REPLY • link 12.4 years ago by Gvj ▴ 470

0

Entering edit mode

I found the answer for ">" and "<" at http://samtools.sourceforge.net/samtools.shtml it means for a reference skip I am not deleting this comment, so that someone could benefit out of it

ADD REPLY • link 12.4 years ago by Gvj ▴ 470

0

Entering edit mode

You dont have to save the output of samtools mpileup to a file, you can just redirect it to something that counts the amount of reference calls

samtools mpileup [your options] file.bam | cut -f5 | ...

ADD REPLY • link 12.4 years ago by Irsan ★ 7.8k

score 0 · Answer 2 · 2013-02-19

0

Entering edit mode

12.4 years ago

swbarnes2 15k

The other thing you could try is an all-points vcf. That should output the # of HQ reads at each position in one of the columns.

It's also large, so maybe pipe that into cut, or grep, or awk or something.

ADD COMMENT • link 12.4 years ago by swbarnes2 15k