Hi everyone, I have a VCF file with 20 million SNPs for 315 individuals from 26 populations (pop1, pop2, ... pop26). I am trying to extract the DP values for each individual and save them in a separate file for each population, to get a final output like this (each column is one individual):
$ head pop1
. 7 6 . 3 . 5 . . . .
. 5 6 . 3 . 5 . . . .
10 6 4 3 5 . 6 13 4 . 10
10 8 5 5 6 . 8 14 4 . 11
8 12 5 . 8 . 3 10 3 6 3
What I am doing (which is NOT the best way to do it) is to extract the individuals for each population with vcf-tools, using this command:
vcftools --vcf file.vcf --keep pop1_inds --recode --recode-INFO-all --out pop1.vcf
and then for each individual in each population, I was running this command separately:
grep -v "^#" pop1.vcf | cut -f 10 | cut -d ':' -f2
so for each individual I got something like this:
$ head ind.pop1
.
.
.
6
6
4
6
5
and finally, I pasted all the individuals' DP files together
paste ind1.pop1 ind2.pop1 .... > ind.pop1
for each population to get the final output for pop1 that I showed above. I wonder whether there is an easier and faster way to do it. I do not want to run vcf-tools. I want to extract the DP values directly for individuals from the same population and save them in one file per population. I would appreciate any help or suggestion to get this done more easily and quickly.
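One possible way to skip both the per-population VCF and the per-individual files is a single awk pass over the original VCF. The sketch below is only an illustration under a few assumptions: the function name `extract_dp` is made up, the sample list (like `pop1_inds`) has one sample ID per line, and the VCF is uncompressed and tab-separated. It looks up the position of DP in the FORMAT column at every site instead of assuming it is always the second field.

```shell
#!/usr/bin/env bash
# extract_dp VCF SAMPLE_LIST   (hypothetical helper, not part of any tool)
# Prints one row per site and one space-separated DP value per sample
# listed in SAMPLE_LIST, in the order the samples appear in the VCF header.
extract_dp() {
    awk -F'\t' '
        # First file (sample list): remember the IDs of this population.
        NR == FNR { if ($1 != "") keep[$1] = 1; next }
        # Skip meta-information lines.
        /^##/ { next }
        # Header line: record which columns hold the wanted samples.
        /^#CHROM/ {
            for (i = 10; i <= NF; i++) if ($i in keep) cols[++n] = i
            next
        }
        {
            # Find the position of DP in the FORMAT column (column 9).
            dp = 0
            m = split($9, fmt, ":")
            for (j = 1; j <= m; j++) if (fmt[j] == "DP") dp = j
            line = ""
            for (k = 1; k <= n; k++) {
                cnt = split($(cols[k]), val, ":")
                # Print "." when DP is absent or the sample field is truncated.
                v = (dp > 0 && dp <= cnt && val[dp] != "") ? val[dp] : "."
                line = line (k > 1 ? " " : "") v
            }
            print line
        }
    ' "$2" "$1"
}
```

With the file names from the question, this would be called as `extract_dp file.vcf pop1_inds > pop1`. If bcftools is installed, `bcftools query -S pop1_inds -f '[%DP ]\n' file.vcf` should give an equivalent table, although I have not timed either approach on a file of this size.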
So vcf-tools is doing the same thing for me. I thought there might be a better way. So, let's say I extracted the individuals from each population into a separate file (pop1.vcf, pop2.vcf). How can I run the grep command for each population file in a loop to get the "pop1" output that I showed above? By running grep I get the DP for each individual and I just paste them together. Is there an easier way to do that, and to do it for all population files together?
You can wrap your commands into a bash for loop. For the commands I gave, you could do something like:
You could easily substitute your own commands in place of the GATK commands - just substitute the "$x" variable for the population number.
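As a rough sketch of such a loop, applied to the grep/cut approach from the question rather than to GATK: the function name `dp_per_population` and the `pop${x}.vcf` / `pop${x}` file naming are assumptions for illustration. It keeps the question's assumption that DP is the second colon-separated field of every sample, and it handles all sample columns in one go, so the per-individual `cut` and the final `paste` are not needed.

```shell
#!/usr/bin/env bash
# dp_per_population N_POPS   (hypothetical helper)
# For x = 1..N_POPS, reads pop${x}.vcf from the current directory and writes
# the DP table (one row per site, one column per individual) to pop${x}.
dp_per_population() {
    local n_pops=$1 x
    for x in $(seq 1 "$n_pops"); do
        # Drop header lines, keep only the sample columns (10 onward), then
        # take the second colon-separated field of each sample, which is
        # assumed to be DP as in the question.
        grep -v '^#' "pop${x}.vcf" \
            | cut -f 10- \
            | awk -F'\t' '{
                  line = ""
                  for (i = 1; i <= NF; i++) {
                      n = split($i, f, ":")
                      line = line (i > 1 ? " " : "") (n >= 2 ? f[2] : ".")
                  }
                  print line
              }' > "pop${x}"
    done
}
```

Running `dp_per_population 26` in the directory that holds pop1.vcf to pop26.vcf would then write pop1 to pop26. If the order of the FORMAT fields can differ between sites, the fixed second field is not safe and the DP position should be looked up per line instead.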