Modifying fasta file based on vcf information
4
1
Entering edit mode
9.5 years ago
Jautis ▴ 580

Hello, I'm trying to form a fasta file to represent the genome of a sub-population. Currently, I have the initial reference genome (fasta) and a vcf file with two individuals from the sub-population. How can I randomly substitute one of the genotype calls from the vcf into the reference genome at the designated position?

Thank you very much!

reference-genome vcf fasta • 7.6k views
ADD COMMENT
0
Entering edit mode

I know of GATK's AlternateReferenceMaker, but I have been unable to find how it decides which genotype to replace the reference with if multiple are listed. What I'm looking to do is randomly substitute one in (I'd also like to be able to just pick the most common, but that is a separate query).

ADD REPLY
0
Entering edit mode

Hello jautis,

I have the same task; could you guide me how to do it? How did you do this? I mean how to apply all the variations in a VCF file to the reference genome to create a sample genome?

I'm new in bioinformatics.

Thanks

ADD REPLY
1
Entering edit mode

I used the GATK method suggested by Ashutosh Pandey. It worked pretty well

ADD REPLY
3
Entering edit mode
9.5 years ago

You can use the FastaVariant class in the pyfaidx module if you're comfortable with a bit of Python. Sounds like you want something very specific so a script might be your best option.

ADD COMMENT
0
Entering edit mode

The FastaVariant tool does seem like it should suit my purposes, but I'm getting a stack smashing error when I try to run it. Is there a forum for the program where I can post this (I haven't been able to find one)?

ADD REPLY
0
Entering edit mode

Yes, please report this in the github repository issue tracker.

ADD REPLY
0
Entering edit mode

This is working perfectly for me. Thanks!

ADD REPLY
0
Entering edit mode

Since the documentation is very scarce, it's hard to infer that FastaVariant ignores indels. You can see that in the code.

ADD REPLY
2
Entering edit mode
9.5 years ago
Len Trigg ★ 1.6k

The simulation tools that are part of RTG Core should let you do this, depending on exactly what you want. E.g:

# RTG commands use a formatted structure (SDF) for random access to parts
# of the data, so format the reference
rtg format -o reference.sdf reference.fasta

# Annotate your population VCF with allele frequency information from the
# samples (if it doesn't already have it)
rtg vcfannotate -i input.vcf.gz -o popvariants.vcf.gz --fill-an-ac

# Generate a new VCF sample column, randomly generated based on allele
# frequency information in the population vcf, plus an SDF containing the
# genome of that individual
rtg samplesim -t reference.sdf -i popvariants.vcf.gz -o pop_plus_synthetic.vcf.gz --output-sdf synthetic_ind.sdf --sample synthetic_ind

# Extract FASTA from the SDF
rtg sdf2fasta -i synthetic_ind.sdf -o synthetic_ind.fasta.gz

The defaults assume a diploid organism, so the output will contain two copies of each chromosome. If you want to get fancy, you can configure your reference with information about autosomes/sex chromosomes and you will end up with the appropriate sequences in the output. We use these tools for simulating small populations/pedigrees of human genomes, including random population variants (popsim), founder individuals (samplesim), offspring (childsim), and de novo variants (denovosim).

ADD COMMENT
0
Entering edit mode

Thankyou, looking at the literature, that seems like it should work very well. However, I'm having issues getting rtg to run. I have an older version of java in /usr/bin/java and don't have the permissions to change it. Any idea how I can use a version of java saved in ~/java instead?

ADD REPLY
0
Entering edit mode

In the RTG installation directory there is a configuration file rtg.cfg where you can set RTG_JAVA to point at whichever version of java you want. (There is more information in the user manual, and if you have further questions, it may be more appropriate to post them to the rtg-users discussion group)

ADD REPLY
0
Entering edit mode

This might be an old question, but these days you can basically install any version of Java in a conda environment. Since your conda installation can be local, it will not require superuser rights.

ADD REPLY
0
Entering edit mode
9.5 years ago

If there are multiple variants at a position then it will randomly select one of the alleles or genotype calls. See the caveat section: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_fasta_FastaAlternateReferenceMaker.php

ADD COMMENT
0
Entering edit mode

I see that, but I would like the reference to still be an option as well. For example, if the reference was A and the individuals were A/A and A/C I would like a 75% chance alternate reference is A and 25% chance that it is C.

ADD REPLY
0
Entering edit mode
4.3 years ago
skanterakis ▴ 130

Just adding it here for reference:

bcftools consensus will do the job for SNPs and InDels

ADD COMMENT

Login before adding your answer.

Traffic: 1983 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6