Number of SNPs in VCF
2
0
Entering edit mode
7.0 years ago
misterie ▴ 110

Hey,

I have multisample VCF contains a information about 7 breed (in sum about 1000 individuals). Their ID are: SIM001, SIM002, ..., SIM034, FLV001, FLV002, FLV034 ...

I need to obtain a information about sum SNPs for every breed (SIMXXX is a one breed, FLV is a second breed and so on...).

Have you got a information how can do that? Thank you.

SNP vcftools • 4.2k views
ADD COMMENT
0
Entering edit mode

"... about sum SNPs ... "

it's not clear to me. Are you taking about variants ? SNPs ? genotypes, called-only genotypes ? non hom-ref genotypes ?

ADD REPLY
0
Entering edit mode

Sorry, my mistakes. I have only SNPs in my file. So I need. a total number of variants for each breeds (excluding 100% missing for breed). My mistake. Now is it clear?

ADD REPLY
0
Entering edit mode

Now is it clear?

no. The number of variants (=rows) doesn't change with the number of samples/genotypes (=columns). So the "number of variants" will not change.

ADD REPLY
0
Entering edit mode

hmm, how to explain it... I just need to get a information about total number of variants for every breed.

ADD REPLY
0
Entering edit mode

Do you mean the histogram of SNPs in VCF?

ADD REPLY
0
Entering edit mode
7.0 years ago
dyollluap ▴ 310

I think the easiest option is to grep each breed and output the count:

grep 'SIM001' mymultisample.vcf | wc

If you have a list of breeds in your multisample vcf, create a for loop:

for breed in $(<listofbreeds.txt); do grep '$breed' mymultisample.vcf |wc ; done
ADD COMMENT
0
Entering edit mode

I don't understand your command, the word SIM001 will be only displayed in the CHROM header like a sample isn't it ?

ADD REPLY
0
Entering edit mode

Based on previous q&a on the comments, there's an assumption of non-perfect compliance to vcf specification. My guess/interpretation without seeing any example lines, the breed is labeled in each row and the OP wants a count / sub-total of variants for each breed.

ADD REPLY
0
Entering edit mode
7.0 years ago
Russ ▴ 520

So if I understand correctly, you have a VCF file with about 1000 samples that represent 7 different breeds of animal. Each sample is named in a consistent fashion, with a prefix (e.g. "FLV" or "SIM") and then a number.

You could use GATKs SelectVariant tool to subset your VCF into 7 files, one for each breed, by using either the -select_expressions or the -sample_file parameters. I'm not super awesome with regex, but perhaps the following would work:

java -jar GenomeAnalysisTK.jar -R <reference.fa> -T SelectVariants -V <vcf_file> -se 'SIM.+' -o sim.vcf

Alternatively, it may just be easier to use the sample_file flag - make a file for each breed that has a list of the sample names, and then pass that to GATK:

 java -jar GenomeAnalysisTK.jar -R <reference.fa> -T SelectVariants -V <vcf_file> -sf <list_of_all_SIM_samples.txt> -o sim.vcf

And then just repeat for each breed.

ADD COMMENT

Login before adding your answer.

Traffic: 2013 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6