I dont't know if this is exactly what you want.
For each genotype you are interested run this:
$ bcftools query -i 'GT="0/1"' -f '%CHROM\t%POS\t0/1\t%REF\t%ALT\t[%SAMPLE,]\n' input.vcf|sed 's/,$//'|awk -v OFS="\t" '{counts = split($6, samples, ","); print $1, $2, $3, counts, $4, $5, $6}' > stats_01.csv
In this example bcftools query
looks for samples with genotype 0/1
and print out CHROM, POS, 0/1
, REF, ALT and a comma seperated list of sample names.
sed
removes the additional ,
at the end. awk
counts the number of sample names in each row and add this column.
fin swimmer
FYI: Cross posted on stackexchange.
FYI: I asked for help because I am learning and got myself stuck.
Hello ahmedakhokhar ,
I'm sorry if you find my comment harsh. But it was just the information to other people, who might want to help you, that you asked the exact question somewhere else.
It is frustrating to write a solution and see afterwards that your problem is already solved. This is why it is necessary to tell people where you asked your question as well.
No worries, I understand.
what is group(take) ? what is "shared mutations/GT." ?
I don't understand. Please provide a minimal example of input+output.
Explanation
shared mutations = the mutation positions present (shared) among different samples since not all mutations are present in all the samples. And, get an output file with a list of samples (name and count) with POS, COUNT and GT info.
DATA
"input.vcf" file is a normal vcf file with >300 samples as follow:
Expected "output.txt" of a list of samples (count) with POS and GT info as follow:
Thanks, please let me know if it's still not clear.
what if some samples are 0/1 and 1/1 on the same variant ?
it sounds like a xyz problem... what's your final aim ?
samples with 0/1 and 1/1 for the same positions with go as a separate entry but if I can get info of 1/1 only, that would be sufficient.
The final aim is to know which samples share which mutation (position) and to separate them into groups.