I have a set of illumina paired end reads (rone.fq and rtwo.fq), and I have mapped those reads to a reference (ref.fa).
bwa aln ref.fa rone.fq > rtwo.sai
bwa aln ref.fa rtwo.fq > rtwo.sai
bwa sampe ref.fa rone.sai rtwo.sai rone.fq rtwo.fq > aln.sam
#convert sam to bam
samtools view -bS aln.sam > aln.bam
#sort bam file
samtools view aln.bam aln_sorted
Now I am trying to get SNP and INDEL information using the following script
samtools mpileup -ugf ref.fa aln_sorted.bam >aln.bcf
#convert to vcf for viewing
bcftools view aln.bcf > aln.vcf
I get the following VCF (this is just one line of the VCF I actually get)
##fileformat=VCFv4.1
##samtoolsVersion=0.1.17 (r973:277)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT aln_sorted.bam
gi|110227054|gb|AE004091.2| 32 . A C,X 0 . DP=14;I16=9,1,1,0,352,12854,23,529,600,36000,29,841,159,3395,21,441 PL 0,10,189,30,192,202
I am assuming that samtools thinks this is a SNP, because the REF=A and ALT=C,X. My question is: How can I know the coverage depth of this SNP given this VCF? If this VCF does not have information enough to show the SNP coverage depth then how can I creat a VCF that shows that information? Thanks
Nope DP=14 means you have 14 reads that cover gi|110227054. The way you used samtools it will only report information on those loci that have a suggested mutation. use 'Samtools depth' if you want to know the depth for each loci.
I dont quite understand the DP value. For example if DP=14, does that mean that 14 reads support that polymorphism? If so 14 out of how many? I am confused because in position where samtools does not find SNPs the DP value would just mean the read coverage of that position. Thanks
Nope DP=14 means you have 14 reads that cover gi|110227054. The way you used samtools it will only report information for those lines that have a suggested mutation. use 'Samtools depth' if you want to know the depth for each loci.
SO I ran "samtools depth alnsorted.bam>alndepth" and the info I get about position 32 is "gi|110227054|gb|AE004091.2| 32 33". So from what you are saying I get that the coverage of that base is 32X and the reads that support the SNP is 14? Thanks so much again
see my edits. 32 are the raw coverage. 14 is the coverage containing only good reads. I16=9,1,1,0 describes which reads go where. If I am reading the cigar string right this is an insertion. Don't quote me on that.
see my edits above.
Ok I get it. Why do you say it is an insertion? I thought that was a substitution (because of the comma)