Question

how to extract DP for individuals from the same population and save them in a file

0

Entering edit mode

7.2 years ago

Ana ▴ 200

Hi everyone, I have a vcf-file with 20 millions SNPs for 315 individuals from 26 populations (pop1,pop2,...pop26). I am trying to extract the DP values for each individual and save them for each population in a seperate file to get a final output like this (each column is one individual):

$ head pop1
    .   7   6   .   3   .   5   .   .   .   .
    .   5   6   .   3   .   5   .   .   .   .
    10  6   4   3   5   .   6   13  4   .   10
    10  8   5   5   6   .   8   14  4   .   11
    8   12  5   .   8   .   3   10  3   6   3

What I am doing (which is NOT the best way to do that) is to extract individuals for each population by using vcf-tools using this command

--vcf file.vcf --keep pop1_inds  --recode --recode-INFO-all --out pop1.vcf

and then for each individual in each population, I was running this command separately:

grep -v "^#" pop1.vcf | cut -f 10 | cut -d ':' -f2

so for each individual I got something like this:

 $ head ind.pop1
    .
    .
    .
    6
    6
    4
    6
    5

and finally, I pasted all individuls DP file to gether `

paste ind1.pop1, ind2.pop1 .... > ind.pop1

` for each population to get the final output for pop1 that I showed above. I wonder is there any easier and faster way to do it? I do not want to run vcf-tools. I want to extract directly DP values for individuals from the same population and save them in a file for each population .. I would appreciate any help or suggestion to get this work done easier and faster .

vcf-file depth of coverage • 2.0k views

ADD COMMENT • link updated 7.2 years ago by Pierre Lindenbaum 164k • written 7.2 years ago by Ana ▴ 200

score 0 · Answer 1 · 2017-09-29

0

Entering edit mode

7.2 years ago

Russ ▴ 520

You could subset your original VCF file into 26 pop_X.vcf files (pop_1.vcf, pop_2.vcf, etc) using GATK SelectVariants:

java -jar GenomeAnalysisTK.jar -T SelectVariants -sf <list_of_pop_1_sample_names> -V original.vcf -o pop_1.vcf

and then use GATK VariantsToTable:

java -jar GenomeAnalysisTK.jar -T VariantsToTable -gf DP -V pop_1.vcf -o pop_1_depth.txt

ADD COMMENT • link 7.2 years ago by Russ ▴ 520

0

Entering edit mode

So vcf-tools is doing the same thing for me. I thought there might be a better way. So, lets say I extracted individuals from each population as a seperate file (po1.vcf , pop2.vcf). How can I run

grep -v "^#" pop1.vcf | cut -f 10 | cut -d ':' -f2

for each population file in a loop to get the "pop1" that I showed above? by running grep I get DP for each individual and I just paste them together. I s there easier way to do that and do it for all populations files together!

ADD REPLY • link 7.2 years ago by Ana ▴ 200

0

Entering edit mode

You can wrap your commands into a bash for loop. For the commands I gave, you could do something like:

for x in {1..26}; do java -jar GenomeAnalysisTK.jar -T SelectVariant -sf pop"$x"_inds -V file.vcf -o pop"$x".vcf; java -jar GenomeAnalysisTK.jar -T VariantsToTable -gf DP -V pop"$x".vcf -o pop"$x"_depth.txt; done

You could easily substitute your own commands in place of the GATK commands - just substitute the "$x" variable for the population number.

ADD REPLY • link 7.2 years ago by Russ ▴ 520

score 0 · Answer 2 · 2017-09-29

0

Entering edit mode

7.2 years ago

Pierre Lindenbaum 164k

using bioalcidaejdk:

java -jar dist/bioalcidaejdk.jar -e 'List<String> samples=Arrays.asList("SAMPLE1","SAMPLE1"); stream().forEach(V->{for(int i=0;i<samples.size();i++) print((i==0?"":"\t")+V.getGenotype(samples.get(i)).getDP()); println();});' input.vcf

ADD COMMENT • link 7.2 years ago by Pierre Lindenbaum 164k