Explanation

Question

group samples based on shared mutations in a single multi samples vcf file

2

Entering edit mode

5.9 years ago

ahmedakhokhar ▴ 150

Hi, I am learning about vcf file formatting and sorry for asking a dumb question.

I have a multi-sample (>300) vcf file and I want to group (take) samples with shared mutations/GT.

Can someone suggest a tool/command/script solve this problem?

Thanks

R vcf python shell • 2.7k views

ADD COMMENT • link updated 5.9 years ago by Pierre Lindenbaum 164k • written 5.9 years ago by ahmedakhokhar ▴ 150

0

Entering edit mode

FYI: Cross posted on stackexchange.

ADD REPLY • link 5.9 years ago by finswimmer 16k

0

Entering edit mode

FYI: I asked for help because I am learning and got myself stuck.

ADD REPLY • link 5.9 years ago by ahmedakhokhar ▴ 150

0

Entering edit mode

Hello ahmedakhokhar ,

I'm sorry if you find my comment harsh. But it was just the information to other people, who might want to help you, that you asked the exact question somewhere else.

It is frustrating to write a solution and see afterwards that your problem is already solved. This is why it is necessary to tell people where you asked your question as well.

ADD REPLY • link 5.9 years ago by finswimmer 16k

0

Entering edit mode

No worries, I understand.

ADD REPLY • link 5.9 years ago by ahmedakhokhar ▴ 150

0

Entering edit mode

what is group(take) ? what is "shared mutations/GT." ?

I don't understand. Please provide a minimal example of input+output.

ADD REPLY • link 5.9 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Explanation

shared mutations = the mutation positions present (shared) among different samples since not all mutations are present in all the samples. And, get an output file with a list of samples (name and count) with POS, COUNT and GT info.

DATA

"input.vcf" file is a normal vcf file with >300 samples as follow:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2 ........SAMPLE300

Expected "output.txt" of a list of samples (count) with POS and GT info as follow:

POS GT COUNT SAMPLE

41230 0/1  6 SAMPLE3 SAMPLE4 SAMPLE57 SAMPLE90 SAMPLE17 SAMPLE49...

88099 1/0  4 SAMPLE13 SAMPLE40 SAMPLE5 SAMPLE101...

56737 1/1  5 SAMPLE11 SAMPLE414 SAMPLE7 SAMPLE90 SAMPLE78...

Thanks, please let me know if it's still not clear.

ADD REPLY • link updated 5.9 years ago by Pierre Lindenbaum 164k • written 5.9 years ago by ahmedakhokhar ▴ 150

0

Entering edit mode

what if some samples are 0/1 and 1/1 on the same variant ?

it sounds like a xyz problem... what's your final aim ?

ADD REPLY • link 5.9 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

samples with 0/1 and 1/1 for the same positions with go as a separate entry but if I can get info of 1/1 only, that would be sufficient.

The final aim is to know which samples share which mutation (position) and to separate them into groups.

ADD REPLY • link 5.9 years ago by ahmedakhokhar ▴ 150

1

Entering edit mode

5.9 years ago

Pierre Lindenbaum 164k

using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

 java -jar dist/bioalcidaejdk.jar -e 'stream().forEach(V->{V.getGenotypes().stream().filter(G->G.isCalled() && !G.isHomRef()).collect(Collectors.groupingBy(G->G.getType(),Collectors.mapping(G->G.getSampleName(),Collectors.toSet()))).forEach((X,Y)->println(V.getContig()+" "+V.getStart()+" "+X+" "+String.join(";",Y)));});' input.vcf

RF01 970 HOM_VAR S5
RF02 251 HET S3;S2
RF02 578 HOM_VAR S4
RF02 877 HET S1
RF02 1726 HET S3;S2
RF02 1962 HET S1
RF03 1221 HOM_VAR S3;S2
RF03 1242 HOM_VAR S4
RF03 1688 HOM_VAR S5
RF03 1708 HOM_VAR S5
RF03 2150 HET S1
RF03 2201 HET S3;S2
RF03 2315 HET S3;S4;S2
RF03 2573 HOM_VAR S3;S2
RF04 887 HET S1
RF04 991 HET S4
RF04 1241 HOM_VAR S4
RF04 1259 HET S4
RF04 1857 HET S1
RF04 1900 HOM_VAR S5
RF04 1920 HOM_VAR S1
RF05 41 HOM_VAR S4
RF05 499 HOM_VAR S3;S2
RF05 795 HET S4
RF05 879 HOM_VAR S3;S2
RF05 1297 HOM_VAR S3;S2
RF05 1339 HET S3;S2
RF06 517 HET S1
RF06 543 HOM_VAR S3;S2
RF06 668 HOM_VAR S5
RF06 695 HET S4
RF06 1129 HOM_VAR S4
RF07 98 HET S3;S2
RF07 225 HOM_VAR S3;S2
(...)

ADD COMMENT • link 5.9 years ago by Pierre Lindenbaum 164k

score 1 · Accepted Answer · 2019-01-17

1

Entering edit mode

5.9 years ago

finswimmer 16k

I dont't know if this is exactly what you want.

For each genotype you are interested run this:

$ bcftools query -i 'GT="0/1"' -f '%CHROM\t%POS\t0/1\t%REF\t%ALT\t[%SAMPLE,]\n' input.vcf|sed 's/,$//'|awk -v OFS="\t" '{counts = split($6, samples, ","); print $1, $2, $3, counts, $4, $5, $6}' > stats_01.csv

In this example bcftools query looks for samples with genotype 0/1 and print out CHROM, POS, 0/1, REF, ALT and a comma seperated list of sample names.

sed removes the additional , at the end. awk counts the number of sample names in each row and add this column.

fin swimmer

ADD COMMENT • link 5.9 years ago by finswimmer 16k

0

Entering edit mode

Perfect, that can work nicely, one more piece of help. The sample names start with either "A_" or "B_" (followed by individual names). How can I count the sub-population of samples? Thank you very much.

ADD REPLY • link 5.9 years ago by ahmedakhokhar ▴ 150