This might be a bit duplicate question, but previous posts did not help me.
I have a reference sequence, I map reads on it and get .bam file. Now I would like to get number of reads overlapping every single base in my reference sequence (=coverage per base).
Secondly, I would like to get the number of bases at these positions, like:
pos A C G T
1 A:2 C:3 G:0 T:1
2 A:2 C:3 G:0 T:1
3 A:0 C:0 G:0 T:7
4 ..
..
I have had problem installing Bio::DB::Sam, so let me write my solution here in case someone is interested; First, recompile samtools with CSFLAG -fPIC, as suggested here: http://cpansearch.perl.org/src/LDS/Bio-SamTools-1.36/README This way you get those libbam.a and bam.h files. Then use just perl -MCPAN -e 'install Bio::DB::Sam' and as a path write way to newly compiled samtools. Enjoy library :-)
to do the first part, samtools depth is a direct way to get each position's coverage. much faster than processing the output of samtools mpileup. to do the second part, I would also go for samtools mpileup as mentioned here. if both things are needed I would then try to get all the information from a single samtools mpileup output parsing.
Both the coverage per base and the number of a particular base at each position can be obtained using samtools mpileup in.bam. This results in a file something like this:
Contig18263 110 N 1 T e
Contig18263 111 N 1 C c
Contig18263 112 N 1 T e
Contig18263 113 N 1 T d
Contig18263 114 N 1 T$ f
Contig18263 132 N 3 ^$T^*T^$T gfh
Contig18263 133 N 5 TTT^$T^$T hhhhf
Contig18263 134 N 7 AAAAA^!A^!A hhhhfeg
Contig18263 135 N 7 TTTTTTT hhhhfgf
Contig18263 136 N 7 GGGGGGG hhhhfgg
Contig18263 137 N 7 TTTTTTT hghhfhh
The 4th column gives you the per base coverage. Processing the 5th column can give you the number of bases at each position with something like (in Python) Not the best code but you get the idea:
for line in in.pileup:
linsplit=line.split()
print linsplit[0], linsplit[1], "A:"+str(linsplit[4].count('A')+linsplit[4].count('a')), "C:"+str(linsplit[4].count('C')+linsplit[4].count('c')), "G:"+str(linsplit[4].count('G')+linsplit[4].count('g')), "T:"+str(linsplit[4].count('T')+linsplit[4].count('t'))
One more question:) Now I use samtools mpileup -f reference.fasta -r contig aln.bam >mpileup command. However, results not always start from 1st base (which is desired behaviour in my case). Please, how could I fix this? Right now, I do not see any logic in where it starts reporting.
Well, that what I would expect, but unfortunately there are reads on those first positions (I can see them in .bam file). I guess I will try to contact authors of samtools.
I find bam-readcount to be very useful here - it directly works with the bam file (no need for mpileup). Use it like so:
bam-readcount -f reference.fa file.bam
What is also nice is that bam-readcount can take in a file with only the positions/regions you're interested in (eg -l file.bed) and spit out the split of bases/insertions/deletions at each position.
what did you try ?
I have found that samtools mpileup -f ref.fasta aln_sorted.bam does the first part for me.
check my answer here, I think this is what you are looking for.
I have had problem installing Bio::DB::Sam, so let me write my solution here in case someone is interested; First, recompile samtools with CSFLAG -fPIC, as suggested here: http://cpansearch.perl.org/src/LDS/Bio-SamTools-1.36/README This way you get those libbam.a and bam.h files. Then use just perl -MCPAN -e 'install Bio::DB::Sam' and as a path write way to newly compiled samtools. Enjoy library :-)
to do the first part,
samtools depth
is a direct way to get each position's coverage. much faster than processing the output ofsamtools mpileup
. to do the second part, I would also go forsamtools mpileup
as mentioned here. if both things are needed I would then try to get all the information from a singlesamtools mpileup
output parsing.