Well, I have called variants with samtools mpileup, where:
MAPQ=20 # min map quality
BASEQ=20 # min base quality
and then filtered them by:
MINCOV=5 # min coverage
MAXCOV=30 # maximum coverage
Now, I have to look for the SNP density within the region sequenced. For that, I obtain the depth file (samtools depth).
awk '{nucleotides++} END {print nucleotides}' depthfile.txt
(in order to know the real length)
It should be as simple as divide the number of SNP by the length. However, doing that I would obtain a wrong number of SNP/kb, because I've filtered SNPs by quality and coverage.
So, am I correct if I should filter the depth file by coverage? Quality as well? Or not?
Well, it depends a little bit on the question you are asking. But I would think you are asking the question of how many REAL SNPs do I see in my data per kb. If that is the case then absolutely you want to filter by depth and quality first. Those hard filtering approaches are meant to try and remove false-positives from your dataset. Otherwise you are essentially asking the question of the density of SNPs called by the variant caller per kb. Since that number includes false positives it is a less interesting and informative question.
Thanks for answering. Sure, my SNPs are filtered by quality and coverage. However, I call the command awk to the depth file in order to count how many bases exactly I have (length). So, you think I should count only the bases above 5 of coverage and below 30, such as I've done with SNPs?
I'm taking that to mean you are looking at some sort of targeted re-sequencing data then, either Exome or something like Amplicon sequencing, targeted panel, etc?
Part of this really depends on what you are sequencing. I normally would probably have just gone with the length of my targeted regions for sequencing, although it would skew your density if you had targeted regions that were not sequenced adequately. Otherwise yes, if you want to count bases where you had proper sequence coverage only I would use the same Depth/Quality thresholds for filtering as what you used for filtering SNPs.
If you are doing "Exome or something .." your SNPs/kb will be fairly arbitrary, as your fragments may not be representative of the whole chromosome/genome, therefore not useful for external reference. However as long as you are comparing consistently between your own samples, I don't think it really matters for what you use as your denominator.
cross posted in seqnaswers