I want to know how many reads (above certain score ) supporting References allele at each position in the genome from a BAM file. What I am doing is converting BAM to mpileup format and then parsing it.
samtools mpileup -d 10000 -C 1 -E -f file.bam > file.mpileup
Of course mpileup file is too big that it need time to process. Is there any tool to get References allele count (EXCLUDING SNP ALLELE READ) above certain score ?
Thanks,
Thanks Irsan. I read about pileup format ( http://samtools.sourceforge.net/pileup.shtml). Something which I couldn't find is the meaning of ">" and "<" symbols in 5th column.
I was looking for a method where I don't have to make mpileup file (Its really big and takes lot of time to make it), otherwise I was doing like
perl -ne '@a=split " ";$l=$a[4]=~tr/[.,]//; print "$a[0] $a[1] $l\n"' file.mpileup > file.mpileup.ref.count
I found the answer for ">" and "<" at http://samtools.sourceforge.net/samtools.shtml it means for a reference skip I am not deleting this comment, so that someone could benefit out of it
You dont have to save the output of samtools mpileup to a file, you can just redirect it to something that counts the amount of reference calls