Question

Suggestions needed for Python tools that can implement a VCF file into FASTA files

0

2.0 years ago

Rachayita ▴ 10

I am working on 1001 genomes from the 1001 Genomes Project. I wish to implement the VCF file into FASTA files, incorporating the indels etc., so that I can work on the genomes more easily. I know about GATK, but is there any other tool that can help me achieve this, preferably in a Python environment?

fasta vcf • 1.9k views

ADD COMMENT • link updated 24 months ago by Alban Nabla ▴ 30 • written 2.0 years ago by Rachayita ▴ 10

0

First, do you really mean ‘1001’ genomes or ‘1000’ genomes? Second, what do you mean by ‘implement’? Do you really mean convert? Finally, it’s not clear how introducing indels into a FASTA file makes any sense. Why would having the information in FASTA format help?

ADD REPLY • link 2.0 years ago by 4galaxy77 2.9k

1

Thank you for your response. The 1001 Genomes Project is a catalog of Arabidopsis thaliana genetic variation. I want to incorporate the SNPs from each VCF into the reference genome, so that I end up with a separate FASTA file for each individual genome, one per VCF.

ADD REPLY • link 2.0 years ago by Rachayita ▴ 10

score 2 · Answer 1 · 2022-11-28

2

24 months ago

Jukka Matilainen ▴ 80

Have a look at bcftools consensus:

Create consensus sequence by applying VCF variants to a reference fasta file.

(Not a Python tool, though.)
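If you want to stay in Python, one option is to call bcftools from a script. Below is a minimal sketch, assuming bcftools is installed and on your PATH and that each VCF is bgzip-compressed and indexed. The function names and file names are hypothetical; `-f` (reference FASTA), `-o` (output file) and `-s` (sample name) are the bcftools consensus options used here, so check them against the manual for your installed version.

```python
import subprocess


def consensus_command(reference, vcf_gz, output, sample=None):
    """Build the argument list for a `bcftools consensus` call."""
    cmd = ["bcftools", "consensus", "-f", reference, "-o", output]
    if sample is not None:
        # Only needed when the VCF holds more than one sample.
        cmd += ["-s", sample]
    cmd.append(vcf_gz)
    return cmd


def run_consensus(reference, vcf_gz, output, sample=None):
    """Run bcftools consensus; raises CalledProcessError if bcftools fails."""
    subprocess.run(consensus_command(reference, vcf_gz, output, sample), check=True)


# Hypothetical usage, one call per accession:
# run_consensus("genome.fasta", "accession1.vcf.gz", "accession1.fasta")
```

Looping `run_consensus` over a list of VCF files would give one FASTA per individual.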

ADD COMMENT • link 24 months ago by Jukka Matilainen ▴ 80

0

Could you please give me an idea on how to do that? Thank you!

ADD REPLY • link 24 months ago by Rachayita ▴ 10

0

There's an examples section in the link.

ADD REPLY • link 24 months ago by 4galaxy77 2.9k

score 1 · Answer 2 · 2022-11-29

1

24 months ago

Alban Nabla ▴ 30

I think vcf-consensus-builder may be what you are after.

ADD COMMENT • link 24 months ago by Alban Nabla ▴ 30

0

Thank you for your response. I looked into vcf-consensus-builder, but it looks like it needs a depth file to work. I have just the VCF files and the reference genome FASTA. Any other alternatives would be highly appreciated.

ADD REPLY • link 24 months ago by Rachayita ▴ 10

0

I am afraid that if you don't have access to the .bam files or a depth file, it will be hard to build a consensus.

I drafted a (clumsy) script to integrate the variants from a VCF into one sequence/chromosome:

from copy import deepcopy

from Bio import SeqIO
from Bio.Seq import MutableSeq, Seq
from cyvcf2 import VCF

# parse the files and extract one chromosome
# (assumes the first FASTA record is chromosome '1')
genome = SeqIO.parse("genome.fasta", "fasta")
chrom = next(genome)
# independent copies, so that setting .seq on one does not overwrite the other
chrom1A = deepcopy(chrom)
chrom1B = deepcopy(chrom)
# note: as far as I know, region queries such as vcf('1') need a
# bgzip-compressed, tabix-indexed VCF
vcf = VCF('variations.vcf')
chromvars = vcf('1')

def integrate(vcf, chrom):
    # Caveat: genotypes are ignored. A single ALT allele goes into both copies;
    # with two ALT alleles, the first goes into copy A and the second into copy B.
    chromA = MutableSeq(chrom.seq)
    chromB = MutableSeq(chrom.seq)
    shiftA = 0  # net length change from indels applied so far to copy A
    shiftB = 0  # same for copy B
    for var in vcf:
        if len(var.ALT) not in (1, 2):
            continue
        altA, altB = var.ALT[0], var.ALT[-1]
        startA = var.POS + shiftA - 1  # 0-based start in the edited copy A
        startB = var.POS + shiftB - 1
        if var.var_type == 'snp':
            chromA[startA] = altA
            chromB[startB] = altB
        elif var.var_type == 'indel':
            reflen = len(var.REF)
            # replace the REF span with ALT; the tail resumes right after REF
            chromA = chromA[:startA] + altA + chromA[startA + reflen:]
            chromB = chromB[:startB] + altB + chromB[startB + reflen:]
            shiftA += len(altA) - reflen
            shiftB += len(altB) - reflen
    return Seq(chromA), Seq(chromB)

# integrate using the above function
chromA, chromB = integrate(chromvars, chrom)
chrom1A.seq = chromA
chrom1B.seq = chromB

# output to fasta
SeqIO.write(chrom1A, 'modchrom1A.fsa', 'fasta')
SeqIO.write(chrom1B, 'modchrom1B.fsa', 'fasta')

This is for one chromosome, but you can iterate through a full genome and automate the whole process. I am sure there are better ways but I hope this helps.
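The coordinate bookkeeping is the part that is easiest to get wrong, so here is a small library-free sketch of the same idea on plain strings. It is only an illustration under stated assumptions: `apply_variants` is a hypothetical helper, variants are given as (1-based position, REF, ALT) tuples for a single haplotype, and overlapping variants are simply skipped, which is an arbitrary choice.

```python
def apply_variants(reference, variants):
    """Apply (pos, ref, alt) variants (1-based pos) to a reference string."""
    pieces = []
    cursor = 0  # 0-based index of the next unconsumed reference base
    for pos, ref, alt in sorted(variants):
        start = pos - 1
        if start < cursor:
            # Overlaps a variant that was already applied; skip it.
            continue
        if reference[start:start + len(ref)] != ref:
            raise ValueError(f"REF mismatch at position {pos}")
        pieces.append(reference[cursor:start])  # untouched bases before the variant
        pieces.append(alt)                      # ALT replaces the REF span
        cursor = start + len(ref)               # resume right after the REF span
    pieces.append(reference[cursor:])
    return "".join(pieces)


# One SNP, one 1-bp deletion and one 2-bp insertion on a toy reference.
example = apply_variants(
    "ACGTACGT",
    [(2, "C", "T"), (4, "TA", "T"), (7, "G", "GAA")],
)
print(example)  # ATGTCGAAT
```

Walking the reference once and collecting pieces avoids tracking a running shift, and the REF check catches a VCF that does not match the FASTA it is applied to.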

ADD REPLY • link 24 months ago by Alban Nabla ▴ 30