Imposing a human reference genome onto a vcf file?
1
0
Entering edit mode
7.1 years ago

Hello!

I have a vcf file that details the alleles of a few hundred individuals' exomes.

I would like to merge this with a human reference genome (exome) in order to see how my individuals' alleles differ from the reference, rather than differ amongst themselves.

Is this possible? I am not sure where to start... I have vcftools ready for analysis.

vcf genome sequencing SNP allele • 2.7k views
ADD COMMENT
0
Entering edit mode

in order to see how my individuals' alleles differ from the reference

the REF column should answer the question. Or give us an example of input/output.

ADD REPLY
0
Entering edit mode

I don't think there is a reference... having a reference is optional I think?

(There is mention of a reference 'Homo_sapiens_assembly19' in the header lines though)

Example:

CHROM   POS N_ALLELES   N_CHR   {ALLELE:COUNT}
1   865625  2   696 G:695   A:1
1   865665  2   698 G:691   A:7
1   865694  2   698 C:693   T:5
1   865734  2   698 G:697   A:1
1   865738  2   698 A:694   G:4
1   865746  2   698 G:697   A:1
1   866422  2   698 C:696   T:2
1   871219  2   698 C:696   T:2
1   874415  2   698 C:688   T:10
ADD REPLY
0
Entering edit mode

I have a vcf file that details the alleles of a few hundred individuals' exomes.

so you're wrong; this is NOT a vcf file.

ADD REPLY
0
Entering edit mode

The example I pasted is from the output of the '--counts' option in vcftools, acting on the vcf file.

The actual vcf file is huge... what part would you like to see?

ADD REPLY
1
Entering edit mode

So you have a vcf, which does contain a REF field?

to see how my individuals' alleles differ from the reference, rather than differ amongst themselves.

In addition, variant calling is usually performed by comparing to the reference genome. Sounds like you already got what you want.

ADD REPLY
0
Entering edit mode

Yes it appears I already had what I wanted... the source of my vcf data claimed otherwise, and I have little experience with these things, so was easily confused!

ADD REPLY
1
Entering edit mode
7.1 years ago

assuming your file is tab delimited and you have a reference genome indexed with samtools faidx with the correct chromosome names.

grep -v CHROM input.txt | while IFS='' read  L; do echo -ne "${L}\t" && echo "${L}" |awk '{printf("%s:%s-%s\n",$1,$2,$2);}' | xargs samtools faidx /path/to/human_hg19.fasta | grep -v '^>' | tr -d '\n'; echo ; done

UPDATE: from a real VCF and using bioalcidaejdk:

java -jar dist/bioalcidaejdk.jar -e 'stream().forEach(V->{println(V.getContig()+"\t"+V.getStart()+"\t"+V.getReference().getDisplayString()+"\t"+V.getAlleles().size()+"\t"+V.getGenotypes().stream().flatMap(G->G.getAlleles().stream()).filter(A->A.isCalled()).count()+"\t"+V.getGenotypes().stream().flatMap(G->G.getAlleles().stream()).filter(A->A.isCalled()).map(A->A.getDisplayString()).collect(Collectors.groupingBy(Function.identity(), Collectors.counting())));});' input.vcf




rotavirus   51  A   2   8   {A=6, G=2}
rotavirus   91  A   2   8   {A=7, T=1}
rotavirus   130 T   2   8   {C=1, T=7}
rotavirus   232 T   2   8   {A=1, T=7}
rotavirus   267 C   2   8   {C=7, G=1}
rotavirus   424 A   2   8   {A=7, G=1}
rotavirus   520 T   2   8   {A=1, T=7}
rotavirus   536 A   2   8   {A=6, T=2}
rotavirus   562 A   2   8   {A=7, G=1}
rotavirus   583 G   2   8   {C=1, G=7}
rotavirus   661 T   2   8   {A=1, T=7}
rotavirus   693 T   2   8   {T=6, G=2}
rotavirus   738 T   2   8   {A=1, T=7}
rotavirus   799 A   2   8   {A=6, C=2}
rotavirus   812 G   2   8   {T=2, G=6}
rotavirus   833 G   2   8   {A=2, G=6}
rotavirus   916 A   2   8   {A=6, T=2}
rotavirus   946 C   2   8   {A=1, C=7}
rotavirus   961 T   2   8   {A=1, T=7}
rotavirus   1044    A   2   8   {A=6, T=2}
rotavirus   1045    C   2   8   {C=6, G=2}
rotavirus   1054    C   2   8   {C=6, G=2}
rotavirus   1064    G   2   8   {A=2, G=6}
ADD COMMENT
0
Entering edit mode

Thank you very much for the effort, Pierre!

It turns out I was just confused. The source of my vcf data claimed that the first column in the --freq or --counts output was NOT the reference column, but it actually was. So there was no problem to begin with...

ADD REPLY

Login before adding your answer.

Traffic: 2596 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6