Hi everyone,
I have a number of Human Heart RNASeq samples. I have generated a mpileup file on each of them using a bed file containing a list of positions for which I want the mpileup to be made. I have included the -B option to turn off the BAQ computation in mpileup.
Single sample mpileup:
samtools mpileup -uf hg19.fa -B -l snps_chr17.bed sample1.bam > sample1.mpileup
Multiple sample mpileup (all samples):
samtools mpileup -uf hg19.fa -B -l snps_chr17.bed -b bam_files_list.bam > all_samples.mpileup
Now, I want to output the frequency of nucleotides A, T, G and C at each of those positions. There is a perl script pileup2baseindel which generates exactly the output that I want, only that it is incorrect (inconsistent with what is seen in IGV). This program takes a single/multi-sample mpileup file and produces the following output. But like I said, the output is incorrect.
I have searched a lot but can't seem to find any program that does the same. Does anyone know any tool/script that would give me such an output with/without taking the mpileup file as an input. Suggestions on alternatives/options are welcomed.
[UPDATE] There was a bug in the perl script and I have updated the code, it works fine now. I have contacted the author and as soon as the author responds, I will send it to him so that he can update it. I would still like to get suggestions and try out other programs.
[UPDATE] Thanks everyone who helped me with this problem. Each of your suggestions worked but I accepted Chris Miller's answer recommending the use of bam-readcount because of its easy of use and the elaborate output that it generates. It directly takes, as input, a bam file as well as a list of positions in bed format for which you want the nucleotide frequency. So you don't need to create an intermediate mpileup file - as in pileup2baseindel or run the program against positions that you are not interested in - as in pysamstats.
Thanks!
Thanks! I'll give it a try and let you know!
I'm using Bam-readcount, and I notice that the Depth calculated for each position does not correspond at all to the coverage that I observe on IGV. It is much lower. Could someone tell me where does the error? Thank you
Hi Saad.chadi, did you resolve the problem? I'm trying to run Bam-readcount, but i have the same problem. If you have any solution, please let me know. Thank in you in advance for your time. Regards
I was able to run bam-readcount, but in addition to counts of A, C, G, T, N, got lines like these with additional count at the end. What do these fields mean? Thanks!
They are insertions and deletions, IIRC. Easy enough to pull up your reads in IGV and verify that.