how to extract DP for individuals from the same population and save them in a file
2
0
Entering edit mode
7.2 years ago
Ana ▴ 200

Hi everyone, I have a vcf-file with 20 millions SNPs for 315 individuals from 26 populations (pop1,pop2,...pop26). I am trying to extract the DP values for each individual and save them for each population in a seperate file to get a final output like this (each column is one individual):

$ head pop1
    .   7   6   .   3   .   5   .   .   .   .
    .   5   6   .   3   .   5   .   .   .   .
    10  6   4   3   5   .   6   13  4   .   10
    10  8   5   5   6   .   8   14  4   .   11
    8   12  5   .   8   .   3   10  3   6   3

What I am doing (which is NOT the best way to do that) is to extract individuals for each population by using vcf-tools using this command

--vcf file.vcf --keep pop1_inds  --recode --recode-INFO-all --out pop1.vcf

and then for each individual in each population, I was running this command separately:

grep -v "^#" pop1.vcf | cut -f 10 | cut -d ':' -f2

so for each individual I got something like this:

 $ head ind.pop1
    .
    .
    .
    6
    6
    4
    6
    5

and finally, I pasted all individuls DP file to gether `

paste ind1.pop1, ind2.pop1 .... > ind.pop1

` for each population to get the final output for pop1 that I showed above. I wonder is there any easier and faster way to do it? I do not want to run vcf-tools. I want to extract directly DP values for individuals from the same population and save them in a file for each population .. I would appreciate any help or suggestion to get this work done easier and faster .

vcf-file depth of coverage • 1.9k views
ADD COMMENT
0
Entering edit mode
7.2 years ago
Russ ▴ 520

You could subset your original VCF file into 26 pop_X.vcf files (pop_1.vcf, pop_2.vcf, etc) using GATK SelectVariants:

java -jar GenomeAnalysisTK.jar -T SelectVariants -sf <list_of_pop_1_sample_names> -V original.vcf -o pop_1.vcf

and then use GATK VariantsToTable:

java -jar GenomeAnalysisTK.jar -T VariantsToTable -gf DP -V pop_1.vcf -o pop_1_depth.txt
ADD COMMENT
0
Entering edit mode

So vcf-tools is doing the same thing for me. I thought there might be a better way. So, lets say I extracted individuals from each population as a seperate file (po1.vcf , pop2.vcf). How can I run

grep -v "^#" pop1.vcf | cut -f 10 | cut -d ':' -f2

for each population file in a loop to get the "pop1" that I showed above? by running grep I get DP for each individual and I just paste them together. I s there easier way to do that and do it for all populations files together!

ADD REPLY
0
Entering edit mode

You can wrap your commands into a bash for loop. For the commands I gave, you could do something like:

for x in {1..26}; do java -jar GenomeAnalysisTK.jar -T SelectVariant -sf pop"$x"_inds -V file.vcf -o pop"$x".vcf; java -jar GenomeAnalysisTK.jar -T VariantsToTable -gf DP -V pop"$x".vcf -o pop"$x"_depth.txt; done

You could easily substitute your own commands in place of the GATK commands - just substitute the "$x" variable for the population number.

ADD REPLY
0
Entering edit mode
7.2 years ago

using bioalcidaejdk:

java -jar dist/bioalcidaejdk.jar -e 'List<String> samples=Arrays.asList("SAMPLE1","SAMPLE1"); stream().forEach(V->{for(int i=0;i<samples.size();i++) print((i==0?"":"\t")+V.getGenotype(samples.get(i)).getDP()); println();});' input.vcf
ADD COMMENT

Login before adding your answer.

Traffic: 1846 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6