Snp Density
1
0
Entering edit mode
10.9 years ago
int11ap1 ▴ 490

Hi guys,

Well, I have called variants with samtools mpileup, where:

MAPQ=20 # min map quality BASEQ=20 # min base quality

and then filtered them by:

MINCOV=5 # min coverage MAXCOV=30 # maximum coverage

Now, I have to look for the SNP density within the region sequenced. For that, I obtain the depth file (samtools depth).

awk '{nucleotides++} END {print nucleotides}' depthfile.txt
(in order to know the real length)

It should be as simple as divide the number of SNP by the length. However, doing that I would obtain a wrong number of SNP/kb, because I've filtered SNPs by quality and coverage.

So, am I correct if I should filter the depth file by coverage? Quality as well? Or not?

Thanks a lot, guys.

snp samtools coverage • 4.9k views
ADD COMMENT
0
Entering edit mode

cross posted in seqnaswers

ADD REPLY
0
Entering edit mode
10.9 years ago
DG 7.3k

Well, it depends a little bit on the question you are asking. But I would think you are asking the question of how many REAL SNPs do I see in my data per kb. If that is the case then absolutely you want to filter by depth and quality first. Those hard filtering approaches are meant to try and remove false-positives from your dataset. Otherwise you are essentially asking the question of the density of SNPs called by the variant caller per kb. Since that number includes false positives it is a less interesting and informative question.

ADD COMMENT
0
Entering edit mode

Thanks for answering. Sure, my SNPs are filtered by quality and coverage. However, I call the command awk to the depth file in order to count how many bases exactly I have (length). So, you think I should count only the bases above 5 of coverage and below 30, such as I've done with SNPs?

ADD REPLY
0
Entering edit mode

I'm taking that to mean you are looking at some sort of targeted re-sequencing data then, either Exome or something like Amplicon sequencing, targeted panel, etc?

Part of this really depends on what you are sequencing. I normally would probably have just gone with the length of my targeted regions for sequencing, although it would skew your density if you had targeted regions that were not sequenced adequately. Otherwise yes, if you want to count bases where you had proper sequence coverage only I would use the same Depth/Quality thresholds for filtering as what you used for filtering SNPs.

ADD REPLY
0
Entering edit mode

I agree with Dan.

If you are doing "Exome or something .." your SNPs/kb will be fairly arbitrary, as your fragments may not be representative of the whole chromosome/genome, therefore not useful for external reference. However as long as you are comparing consistently between your own samples, I don't think it really matters for what you use as your denominator.

ADD REPLY

Login before adding your answer.

Traffic: 1746 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6