Question

convert GRCh38 FASTA .fa to .vcf

0

5.4 years ago

LTDavid ▴ 50

How do you convert the GRCh38.fa FASTA file to a VCF file?

The file is from Ensembl primary_assembly ftp://ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/dna/

FASTA vcf • 3.3k views

ADD COMMENT • link updated 5.4 years ago by Ahill ★ 2.0k • written 5.4 years ago by LTDavid ▴ 50

0

Perhaps you should first look into which information can be found in those file types, and think about what you aim to achieve. The current question doesn't make sense.

ADD REPLY • link 5.4 years ago by WouterDeCoster 47k

score 3 · Accepted Answer · 2019-07-26

3

5.4 years ago

Ahill ★ 2.0k

You can't convert a FASTA file like the ones you've linked to a VCF in any meaningful way. FASTA files contain sequences, but VCF format is designed to capture information about variations in sequence among individuals. Why are you trying to convert from FASTA to VCF?

ADD COMMENT • link 5.4 years ago by Ahill ★ 2.0k

0

I'm trying to take raw DNA files downloaded from AncestryDNA and 23andMe and determine relatedness among the samples. I'm using plantimals/2vcf to convert from (txt) zip to vcf.gz. The problem is that some markers that were in the zip file are missing from the VCF file. For example, the zip file I downloaded from Ancestry.com has markers rs369202065 and rs199476136, but they do not show up in the output VCF file (rs199476136 also does not show up in the GRCh37.p13 reference file).

2vcf uses GRCh37.p13 as its reference VCF, so I thought maybe I needed to update the reference file. I have GRCh38 from Ensembl as a FASTA file. I thought that if I could convert it to VCF and point 2vcf at it as the reference, all the markers in my data zip file would carry over to the new data VCF file.

(I then use bcftools to merge, Beagle 4.0 for family-based phasing, Beagle 5.0 for phasing as a comparison, Refined IBD for IBD detection, and IBD Relatedness Estimation for relatedness.) I'm open to suggestions for a better method, because my next task is to do the same for new samples being genotyped at my university lab rather than from AncestryDNA.

Note: The files come as zip files from AncestryDNA. When unzipped, they are txt files that all bear the same name (AncestryDNA.txt). 2vcf uses the name of the zip file to name the output VCF file; otherwise every output would be named AncestryDNA.vcf.gz.
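To see exactly which markers are in a raw download before conversion, something like the following rough sketch could list the rsIDs inside a zip. It assumes the text file is tab-separated, that comment lines start with `#`, and that the column header row starts with `rsid`; the function name `raw_marker_ids` is made up for illustration and is not part of 2vcf.

```python
import io
import zipfile
from typing import Set


def raw_marker_ids(zip_source) -> Set[str]:
    """Collect marker IDs (first column) from every .txt file in a zip.

    Assumptions (not confirmed by the thread): tab-separated columns,
    '#' comment lines, and a header row whose first field is 'rsid'.
    `zip_source` may be a path or a binary file-like object.
    """
    ids: Set[str] = set()
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".txt"):
                continue
            with archive.open(name) as handle:
                for raw in io.TextIOWrapper(handle, encoding="utf-8"):
                    line = raw.strip()
                    if not line or line.startswith("#"):
                        continue  # skip blank lines and comments
                    first = line.split("\t")[0]
                    if first.lower() == "rsid":
                        continue  # skip the column header row
                    ids.add(first)
    return ids
```

Calling `raw_marker_ids("my_download.zip")` (a placeholder file name) returns a set that can then be compared against the IDs present in the converted VCF.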

ADD REPLY • link 5.4 years ago by LTDavid ▴ 50

0

Yes, if you search the default 2vcf reference VCF on GitHub, those 2 markers (rs369202065 and rs199476136) are not present. The reference VCF appears to be ~4 years old (##fileDate=20151104, ##dbSNP_BUILD_ID=146), so if your raw data from Ancestry is more recent, missing rsIDs might not be surprising. The issue you filed with the code maintainer may be the best way to get a response. It looks like you would want to point 2vcf to an updated VCF that includes all the sites present in all of your input files. To make an updated VCF, you could start from the NCBI dbSNP database (not an Ensembl reference FASTA). Make sure you have full information on the genome version(s) and dbSNP version(s) that were used to generate the source files.
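A quick way to repeat that check yourself is sketched below: it reads the `##key=value` header lines and the ID column of a VCF, then reports which wanted rsIDs are absent. This is only an illustration, not how 2vcf works internally; the function names are made up, and the single data record in the demo is invented toy data.

```python
import gzip
from typing import Dict, Iterable, List, Set, Tuple


def open_text(path: str):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def scan_vcf(lines: Iterable[str]) -> Tuple[Dict[str, str], Set[str]]:
    """Return (header metadata, set of record IDs) from VCF lines."""
    meta: Dict[str, str] = {}
    ids: Set[str] = set()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            key, sep, value = line[2:].partition("=")
            if sep:
                meta.setdefault(key, value)  # keep first value per key
        elif line and not line.startswith("#"):
            fields = line.split("\t")
            if len(fields) > 2:
                # ID column may hold several ';'-separated IDs; '.' means none
                ids.update(i for i in fields[2].split(";") if i != ".")
    return meta, ids


def missing_ids(wanted: Iterable[str], reference_ids: Set[str]) -> List[str]:
    """Sorted list of wanted IDs that the reference does not contain."""
    return sorted(set(wanted) - reference_ids)


if __name__ == "__main__":
    # Toy in-memory VCF; the data record below is invented for the demo.
    demo = [
        "##fileDate=20151104",
        "##dbSNP_BUILD_ID=146",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "1\t100\trs0000001\tA\tG\t.\t.\t.",
    ]
    meta, ids = scan_vcf(demo)
    print(meta["dbSNP_BUILD_ID"], missing_ids(["rs0000001", "rs369202065"], ids))
```

For a real file, pass `open_text("reference.vcf.gz")` (a placeholder path) to `scan_vcf`. A full dbSNP VCF is very large, so holding every ID in a set may need a lot of memory; filtering for only the wanted IDs while streaming would be a lighter variant.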

ADD REPLY • link 5.4 years ago by Ahill ★ 2.0k

0

Thank you very much, Ahill. The NCBI dbSNP database is exactly what I needed. They even provide the latest release in both VCF and JSON formats on their About Reference SNP (rs) page. I greatly appreciate you asking for more details about why I was asking my question and looking into my reply. I was already able to run all the programs mentioned in my first reply. Now I have pointed 2vcf to the new reference and am running 2vcf again.

ADD REPLY • link 5.4 years ago by LTDavid ▴ 50