Question

Constructing nucleotide sequence after incorporating SNPs from SIFT file

6.2 years ago

pixie@bioinfo ★ 1.5k

Hello, I have a file that gives the position of each altered nucleotide or small insertion or deletion, along with the reference allele and the altered allele. Is there a tool that can apply these changes to the reference nucleotide sequence, i.e. incorporate the ALT_ALLELE changes?

CHROM   POS REF_ALLELE  ALT_ALLELE  TRANSCRIPT_ID   GENE_ID GENE_NAME   REGION  VARIANT_TYPE    REF_AMINO   ALT_AMINO   AMINO_POS   SIFT_SCORE  SIFT_MEDIAN NUM_SEQS    dbSNP   SIFT_PREDICTION
SL3.0ch00   723860  A   C   mRNA.Solyc00g005060.1.1 gene.Solyc00g005060.1   Solyc00g005060.1    CDS NONSYNONYMOUS   W   G   52  0   4.32    1   novel   DELETERIOUS (*WARNING! Low confidence)
SL3.0ch00   723867  A   C   mRNA.Solyc00g005060.1.1 gene.Solyc00g005060.1   Solyc00g005060.1    CDS SYNONYMOUS  G   G   49  1   4.32    1   novel   TOLERATED
SL3.0ch00   723903  T   C   mRNA.Solyc00g005060.1.1 gene.Solyc00g005060.1   Solyc00g005060.1    CDS SYNONYMOUS  G   G   37  1   4.32    1   novel   TOLERATED

genome • 1.2k views

updated 6.2 years ago by finswimmer 16k • written 6.2 years ago by pixie@bioinfo ★ 1.5k

score 3 · Accepted Answer · 2018-10-10


6.2 years ago

jean.elbers ★ 1.7k

Give bcftools consensus a try

https://samtools.github.io/bcftools/bcftools.html#consensus

or GATK's FastaAlternateReferenceMaker (slower than bcftools consensus)

https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_fasta_FastaAlternateReferenceMaker.php


score 2 · Accepted Answer · 2018-10-10

Hello,

As jean.elbers said above, bcftools consensus is the way to go. To use it, you first need to convert your file to a valid VCF file.

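# Write the VCF header. printf is used because bash's echo does not expand \t by default.
$ printf '##fileformat=VCFv4.2\n' > input.vcf
$ printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> input.vcf
# Skip the SIFT header line and map CHROM, POS, REF_ALLELE, ALT_ALLELE to the VCF columns.
$ tail -n +2 variants.txt | awk -v OFS="\t" '{print $1,$2,".",$3,$4,".",".","."}' >> input.vcf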
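To see what the conversion produces, here is a small sketch on toy data. It assumes the SIFT file is whitespace- or tab-delimited as in the question; the file names toy_variants.txt and toy.vcf are made up for illustration, and the rows are trimmed to the first six columns.

```shell
#!/bin/sh
# Toy SIFT-style input (hypothetical file name): header line plus two variant rows.
printf 'CHROM\tPOS\tREF_ALLELE\tALT_ALLELE\tTRANSCRIPT_ID\tGENE_ID\n' > toy_variants.txt
printf 'SL3.0ch00\t723860\tA\tC\tmRNA.Solyc00g005060.1.1\tgene.Solyc00g005060.1\n' >> toy_variants.txt
printf 'SL3.0ch00\t723867\tA\tC\tmRNA.Solyc00g005060.1.1\tgene.Solyc00g005060.1\n' >> toy_variants.txt

# Build the VCF: two header lines, then one record per variant row.
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  tail -n +2 toy_variants.txt |
    awk -v OFS='\t' '{print $1, $2, ".", $3, $4, ".", ".", "."}'
} > toy.vcf

cat toy.vcf
```

Each record should end up with exactly eight tab-separated fields, with ID, QUAL, FILTER and INFO set to the VCF missing value ".".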

Then compress it with bgzip and index it with tabix.

$ bgzip -c input.vcf > input.vcf.gz
$ tabix -p vcf input.vcf.gz

Now you are ready for bcftools consensus.

$ bcftools consensus -f genome.fa -o output.fa input.vcf.gz

fin swimmer

bcftools consensus may warn you about sequences not found in the VCF. You can safely ignore this. The warning appears when the reference contains sequence IDs that do not occur in your VCF. If you want to get rid of these messages, each contig of the reference needs to be defined in the VCF header. This can be done like this:

1. Index the reference

$ samtools faidx genome.fa

2. Create headers for the vcf

$ printf '##fileformat=VCFv4.2\n' > input.vcf
$ awk '{print "##contig=<ID="$1",length="$2">"}' genome.fa.fai >> input.vcf
$ printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> input.vcf

3. Add variant information

$ tail -n +2 variants.txt | awk -v OFS="\t" '{print $1,$2,".",$3,$4,".",".","."}' >> input.vcf
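As a quick illustration of what the header from step 2 looks like, here is a sketch that uses a hand-written index instead of a real genome. The file names toy_genome.fa.fai and toy_header.vcf and all values in the index are made up; only the first two columns (sequence name and length) matter for the header.

```shell
#!/bin/sh
# Hand-written toy .fai (hypothetical values): name, length, offset, linebases, linewidth.
printf 'chrT\t10\t6\t6\t7\n'    >  toy_genome.fa.fai
printf 'chrU\t25\t24\t25\t26\n' >> toy_genome.fa.fai

# Assemble the header: format line, one ##contig line per sequence, then the column names.
{
  printf '##fileformat=VCFv4.2\n'
  awk '{print "##contig=<ID=" $1 ",length=" $2 ">"}' toy_genome.fa.fai
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
} > toy_header.vcf

cat toy_header.vcf
```

The ##contig lines sit between the ##fileformat line and the #CHROM line, which is the order the original commands produce as well.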

Continue with bgzip, tabix and bcftools consensus as above.
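One thing that can still go wrong in this workflow is a REF allele that does not match the base in the reference FASTA, for example when the SIFT file was produced against a different genome build. As far as I know bcftools consensus stops with an error in that case, so a check beforehand may save some time. The following is only a sketch on toy data: the file names are made up, and it reads the whole FASTA into memory, which is fine for a small test but not something I would run on a full genome.

```shell
#!/bin/sh
# Toy reference (hypothetical): one sequence, ACGTACGTTA, wrapped over two lines.
printf '>chrT\nACGTAC\nGTTA\n' > toy_ref.fa

# Toy VCF: the first record matches the reference, the second does not.
printf '##fileformat=VCFv4.2\n' > toy_check.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> toy_check.vcf
printf 'chrT\t2\t.\tC\tT\t.\t.\t.\n' >> toy_check.vcf
printf 'chrT\t8\t.\tA\tG\t.\t.\t.\n' >> toy_check.vcf

# First file: load each FASTA sequence into seq[name].
# Second file: compare REF with the reference bases starting at POS (1-based).
awk -F'\t' '
  NR == FNR {
    if ($0 ~ /^>/) { name = substr($0, 2); sub(/[ \t].*/, "", name) }
    else           { seq[name] = seq[name] $0 }
    next
  }
  /^#/ { next }
  {
    found  = toupper(substr(seq[$1], $2, length($4)))
    status = (found == toupper($4)) ? "OK" : "MISMATCH"
    print $1, $2, $4, found, status
  }
' toy_ref.fa toy_check.vcf > ref_check.txt

cat ref_check.txt
```

Each output line lists the contig, position, REF from the VCF, the bases found in the reference, and a status. Any MISMATCH line points at a record worth inspecting before running bcftools consensus.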