Question

convert ancestry.com text file to vcf

5.2 years ago

LTDavid ▴ 50

How do you convert a text file from Ancestry.com to vcf format?

I understand that I could convert from 23andMe to vcf with something like:

bcftools convert -c ID,CHROM,POS,AA -s SampleFile -f reference/Homo_sapiens.GRCh37.dna.primary_assembly.fa --tsv2vcf Data/SampleFile/AncestryDNA.txt -Oz -o Data/SampleFile.vcf.gz

However, Ancestry.com's files are slightly different from 23andMe files. Ancestry.com's files have five tab-delimited columns instead of the four that 23andMe uses.

rsid         chromosome  position  allele1  allele2
rs3131972    1           752721    A        G
rs114525117  1           759036    G        G
rs12124819   1           776546    A        A

I also tried a direct conversion, but something is wrong because it's not working:

cat SampleFile.zip|grep -v '#'|grep -v 'rsid'|awk -F'\t' '{ print $1"\t"$2"\t"$3"\t"$4$5; }'|sed s/\\t23\\t/\\tX\\t\/g |sed s/\\t24\\t/\\tY\\t\/g| grep -P -v '\t25\t' >> SampleFile.txt

With Ancestry.com, the text file inside the zip always has the same generic name (AncestryDNA.txt), so I would need to use the basename of the zip file, as I saved it, for the converted file name. For example:

SampleFile1.zip/AncestryDNA.txt > SampleFile1.txt
SampleFile2.zip/AncestryDNA.txt > SampleFile2.txt

I'm using these files for Beagle 5.1, which has an exception to the VCF format for male chromosomes:

Beagle uses Variant Call Format (VCF) 4.3 for input and output genotype data, except that Beagle requires male non-pseudoautosomal X-chromosome genotypes to be coded as homozygous diploid genotypes.

I'm using Ubuntu 18.04.3 LTS.

vcf AncestryDNA Ancestry.com • 4.9k views

updated 16 months ago by Ram 44k • written 5.2 years ago by LTDavid ▴ 50

Hi! Thank you for the nice and clear explanation! I have tried to reproduce your example using the same reference genome. However, I get lots of ALT values that are '.', and they do not agree with the example you produced. For instance, you have:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SampleFile23
1       752721  rs3131972       A       G       .       .       .       GT      0/1
1       759036  rs114525117     G       A       .       .       .       GT      1/0
...
22      51064818        rs762672        T       C       .       .       .       GT      1/1
22      51064898        rs1106788       G       A       .       .       .       GT      1/0

and I get:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SampleFile23
1       752721  rs3131972       A       G       .       .       .       GT      0/1
1       759036  rs114525117     G       .       .       .       .       GT      0/0
...
22      51064818   rs762672        T       C       .       .       .       GT    1/1
22      51064898   rs1106788       G       .       .       .       .       GT    0/0

Do you have any clue what I am missing?

Thanks a lot! Mariana

4.3 years ago by maribuon • 0

Hi, Mariana. I'm not sure.
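
One possible explanation, offered as a guess rather than a diagnosis: the sample rows in the question list rs114525117 as G G. With REF G that is homozygous reference, so there is no alternate allele to report, and ALT '.' with GT 0/0 would be expected. The 1/0 shown in the answer may simply come from a different sample. The sketch below illustrates that encoding for a biallelic SNP. `encode_gt` is a hypothetical helper written for illustration, not part of bcftools, and it assumes at most one non-reference allele.

```shell
# Hypothetical helper (not a bcftools command): print ALT and GT for one SNP,
# given REF and the two called alleles. Assumes at most one non-REF allele.
encode_gt() {
    ref=$1; a1=$2; a2=$3
    alt="."; g1=0; g2=0
    if [ "$a1" != "$ref" ]; then alt=$a1; g1=1; fi
    if [ "$a2" != "$ref" ]; then alt=$a2; g2=1; fi
    printf '%s\t%s/%s\n' "$alt" "$g1" "$g2"
}

encode_gt A A G   # rs3131972 from the question: ALT G, GT 0/1
encode_gt G G G   # rs114525117 from the question: ALT ., GT 0/0
```

If that is what is happening, the '.' entries would reflect the genotypes in the input file rather than a problem with the reference genome.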

3.9 years ago by LTDavid ▴ 50

score 1 · Answer 1 · 2019-09-19

I decided to try to convert the ancestry.com txt file to a 23andMe formatted txt file, which could then be used in the existing bcftools convert command. I got the conversion from the ancestry.com format to the 23andMe format working with this:

7z x SampleFile.zip ; mv AncestryDNA.txt SampleFile23.txt
gawk -i inplace -F'\t' '{ print $1"\t"$2"\t"$3"\t"$4$5; }' SampleFile23.txt
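
The gawk one-liner above joins the two allele columns but passes every line through, including any comment lines and the rsid column header. A slightly more defensive variant might look like the sketch below. `ancestry_to_23andme` is a hypothetical helper name. The chromosome handling (23 to X, 24 to Y, 25 dropped) simply mirrors the pipeline in the question, and the carriage-return stripping is an assumption in case the file has Windows line endings.

```shell
# Hypothetical helper: read AncestryDNA rows on stdin, write 23andMe-style
# rows (rsid, chromosome, position, genotype) on stdout.
ancestry_to_23andme() {
    awk -F'\t' 'BEGIN { OFS = "\t" }
        { sub(/\r$/, "") }         # drop a trailing carriage return, if any
        /^#/ { next }              # skip comment lines
        $1 == "rsid" { next }      # skip the column header row
        $2 == "25" { next }        # drop chromosome code 25, as the question pipeline does
        {
            if ($2 == "23") $2 = "X"
            else if ($2 == "24") $2 = "Y"
            print $1, $2, $3, $4 $5   # join allele1 and allele2 into one genotype
        }'
}
```

It could be used as `unzip -p SampleFile.zip AncestryDNA.txt | ancestry_to_23andme > SampleFile23.txt`, which also avoids extracting and renaming the file first.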

To download Homo_sapiens.GRCh37.75.dna.primary_assembly.fa for use in the bcftools convert command:

wget http://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz

To convert the new Ancestry.com file (formatted as 23andMe) to vcf format:

bcftools convert -c ID,CHROM,POS,AA -s SampleFile23 --haploid2diploid -f /home/reference/references/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa --tsv2vcf SampleFile23.txt -Oz -o SampleFile23.vcf.gz

This seems to work, producing records for chromosomes 1-22 (though I haven't compared the output to the original Ancestry.com zip file).

...
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SampleFile23
1       752721  rs3131972       A       G       .       .       .       GT      0/1
1       759036  rs114525117     G       A       .       .       .       GT      1/0
...
22      51064818        rs762672        T       C       .       .       .       GT      1/1
22      51064898        rs1106788       G       A       .       .       .       GT      1/0
...
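
Since the output has not yet been compared with the input, one rough check is to compare the number of genotype rows in the 23andMe-style text file with the number of records in the VCF. This is only a sketch: the two function names are hypothetical, and it assumes that non-data lines in the text file either start with # or are the rsid header row.

```shell
# Hypothetical helper: count data rows in a 23andMe-style text file on stdin.
count_tsv_rows() {
    awk -F'\t' '!/^#/ && $1 != "rsid" { n++ } END { print n + 0 }'
}

# Hypothetical helper: count variant records in uncompressed VCF text on stdin.
count_vcf_records() {
    awk '!/^#/ { n++ } END { print n + 0 }'
}
```

For example, `count_tsv_rows < SampleFile23.txt` and `gzip -dc SampleFile23.vcf.gz | count_vcf_records` can be compared. A large gap between the two numbers would suggest that rows were skipped during conversion and are worth a closer look.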

Okay, it looks like I provided a rudimentary answer to my own question.

And this seems to be working as a rudimentary script. The last line is still running, but the MergedSamples file has been created with enough content to see that it's merging.

#!/bin/bash

for file in inputs/*.zip; do
        echo "converting to vcf.gz: $file"
        # extract AncestryDNA.txt into the current directory and rename it after the zip
        7z x "$file"
        mv AncestryDNA.txt "${file%.zip}.txt"
        # join allele1 and allele2 into a single genotype column (23andMe layout)
        gawk -i inplace -F'\t' '{ print $1"\t"$2"\t"$3"\t"$4$5; }' "${file%.zip}.txt"
        bcftools convert -c ID,CHROM,POS,AA -s "${file%.zip}" \
                --haploid2diploid \
                -f /home/reference/references/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa \
                --tsv2vcf "${file%.zip}.txt" \
                -Oz -o "${file%.zip}.vcf.gz"
done

for file in inputs/*.vcf.gz; do
        echo "indexing sample vcf file $file"
        tabix "$file"
done

cd inputs
# merge once; the glob already expands to every sample, so no loop is needed
mkdir -p Results
bcftools merge -o Results/MergedSamples *.vcf.gz

The run time for this script was 32 minutes and 26.08 seconds on 4 vCPUs with 3.6 GB of memory, for 32 samples with about 700,000 SNPs each.
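
One detail worth noting in the script: `${file%.zip}` keeps the `inputs/` directory prefix, so the sample name passed to `-s` would be something like `inputs/SampleFile1` rather than `SampleFile1`. A minimal sketch of deriving a bare sample name while keeping the output path next to the zip follows. Both helper names are hypothetical.

```shell
# Hypothetical helper: bare sample name from a zip path,
# e.g. inputs/SampleFile1.zip -> SampleFile1
sample_name() {
    basename "$1" .zip
}

# Hypothetical helper: per-sample VCF path next to the zip,
# e.g. inputs/SampleFile1.zip -> inputs/SampleFile1.vcf.gz
vcf_path() {
    printf '%s.vcf.gz\n' "${1%.zip}"
}
```

Inside the loop, `-s "$(sample_name "$file")"` and `-o "$(vcf_path "$file")"` could then stand in for the two `${file%.zip}` expansions, so the merged file carries plain sample IDs.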