Create a pseudo-haploid fasta file based genotype calls for an individual (vcf file)
1
2
Entering edit mode
6.1 years ago
Jautis ▴ 580

Hi, I have a reference fasta file and a vcf file of SNP variant calls. For each individual in the vcf file, I want to create a new, "pseudo-haploid" fasta file where every base is randomly sampled from one of the individual's alleles.

This seems to be a different problem than GATK's FastaAlternateReferenceMaker which inserts the alternate allele (not caring about individual genotypes or ever retaining the reference allele).

Can anybody offer a tool or some advice? Thanks!

Short example:
    Sequence: ATAAATTCCC (10 bp long)
    VCF:
        POS    REF     ALT      Ind1    Ind2
        2      T       C        0/1     1/1
        5      A       G        0/0     0/1
   Output:
        Ind1: A **T** AAATTCCC     or  A **C** AAATTCCC
        Ind2: A*C*AA **A** TTCCC     or  A*C*AA **G** TTCCC
vcf sequence assembly SNP • 2.8k views
ADD COMMENT
0
Entering edit mode
6.1 years ago
Len Trigg ★ 1.6k

Fairly close is rtg samplereplay from RTG Tools, which generates a "pseudo-genome" for an individual based on the genotypes in the VCF, e.g.:

# Make test data
$ cat <<EOF >ref.fa
>seq
ATAAATTCCC
EOF
# RTG likes to store sequence data in it's "SDF" format:
$ rtg format -o ref.sdf ref.fa
Formatting FASTA data
Processing "ref.fa"

Input Data
Files              : ref.fa
Format             : FASTA
Type               : DNA
Number of sequences: 1
Total residues     : 10
Minimum length     : 10
Mean length        : 10
Maximum length     : 10

Output Data
SDF-ID             : 9b0da935-33c8-421d-bf83-54fd8a13ce5f
Number of sequences: 1
Total residues     : 10
Minimum length     : 10
Mean length        : 10
Maximum length     : 10

# Make test VCF (contains embedded tabs)
$ cat <<EOF >ind.vcf
##fileformat=VCFv4.1
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Ind1    Ind2
seq 2   .   T   C   .   .   .   GT  0/1 1/1
seq 5   .   A   G   .   .   .   GT  0/0 0/1
EOF
# Index the variants VCF
$ rtg bgzip ind.vcf && rtg index ind.vcf.gz
Creating index for: ind.vcf.gz (ind.vcf.gz.tbi)

# Generate the pseudo-genome for Ind1 and show us the FASTA
$ rtg samplereplay -t ref.sdf -i ind.vcf.gz -o Ind1.sdf --sample Ind1 && rtg sdf2fasta -i Ind1.sdf -o -
>seq_0
ATAAATTCCC
>seq_1
ACAAATTCCC

# Generate the pseudo-genome for Ind2 and show us the FASTA
$ rtg samplereplay -t ref.sdf -i ind.vcf.gz -o Ind2.sdf --sample Ind2 && rtg sdf2fasta -i Ind2.sdf -o -
>seq_0
ACAAATTCCC
>seq_1
ACAAGTTCCC

The rtg samplereplay command doesn't randomly sample the alleles, it treats the input VCF as though it were phased and produces the two simulated chromosomes (it's usually used with rtg samplesim which produces a VCF of phased simulated genotypes, and rtg readsim to simulate NGS sequencing of the sample genome). Hope this helps with your use case.

ADD COMMENT

Login before adding your answer.

Traffic: 1930 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6