Constructing nucleotide sequence after incorporating SNPs from SIFT file
2
1
Entering edit mode
6.2 years ago
pixie@bioinfo ★ 1.5k

Hello, I have a file which gives me the position of the altered nucleotide or small insertion or deletion (reference allele and altered allele). Is there any tool which can be used to perform these changes on the reference nucleotide sequence and incorporate the ALT_ALLELE changes ?

CHROM   POS REF_ALLELE  ALT_ALLELE  TRANSCRIPT_ID   GENE_ID GENE_NAME   REGION  VARIANT_TYPE    REF_AMINO   ALT_AMINO   AMINO_POS   SIFT_SCORE  SIFT_MEDIAN NUM_SEQS    dbSNP   SIFT_PREDICTION
SL3.0ch00   723860  A   C   mRNA.Solyc00g005060.1.1 gene.Solyc00g005060.1   Solyc00g005060.1    CDS NONSYNONYMOUS   W   G   52  0   4.32    1   novel   DELETERIOUS (*WARNING! Low confidence)
SL3.0ch00   723867  A   C   mRNA.Solyc00g005060.1.1 gene.Solyc00g005060.1   Solyc00g005060.1    CDS SYNONYMOUS  G   G   49  1   4.32    1   novel   TOLERATED
SL3.0ch00   723903  T   C   mRNA.Solyc00g005060.1.1 gene.Solyc00g005060.1   Solyc00g005060.1    CDS SYNONYMOUS  G   G   37  1   4.32    1   novel   TOLERATED
genome • 1.2k views
ADD COMMENT
3
Entering edit mode
ADD COMMENT
2
Entering edit mode
6.2 years ago

Hello,

as jean.elbers says before bcftools consensus is the way you have to go. Therefor you need to convert your file to a valid vcf file.

$ echo "##fileformat=VCFv4.2" > input.vcf
$ echo "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO" >> input.vcf
$ tail -n+2 variants.txt|awk -v OFS="\t" '{print $1,$2,".",$3,$4,".",".","."}'  >> input.vcf

Then compress it with bgzip and index by tabix.

$ bgzip -c input.vcf > input.vcf.gz
$ tabix input.vcf.gz

Now you are ready for bcftools consensus.

$ bcftools consensus -f genome.fa input.vcf.gz -o output.fa

fin swimmer


bcftools consensus may warn you about sequences not found in the vcf. You can safely ignore this. The warning happens if it finds sequence id's in the reference that are not in your vcf. If you want to get rid of those messages each contig of the reference needs to be defined in the vcf header. This can be done like this:

1. Index the reference

$ samtools faidx genome.fa

2. Create headers for the vcf

$ echo "##fileformat=VCFv4.2" > input.vcf
$ awk '{print "##contig=<ID="$1",length="$2">"}' genome.fa.fai >> input.vcf 
$ echo "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO" >> input.vcf

3. Add variant information

$ tail -n+2 variants.txt|awk -v OFS="\t" '{print $1,$2,".",$3,$4,".",".","."}'  >> input.vcf

Continue with bgzip, tabix and bcftools consensus as above.

ADD COMMENT
0
Entering edit mode

Dear @finswimmer, Thanks so much for the pipeline. It worked smoothly. My ultimate goal is to display the reference and altered sequences gene-wise (based on user's gene query) on a browser. So in the output.fa, how will I keep track of the genes and changed coordinates (due to small INDELs) ?

The other option is to display only the SNPs and leave out the INDELs to make life simple. Do you have any suggestions ?

ADD REPLY

Login before adding your answer.

Traffic: 1800 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6