Extracting regions around sites in VCF file and writing them as fasta sequences/files
1
0
Entering edit mode
15 months ago
Aiswarya ▴ 20

Hai,

I am trying to extract 50 base pair sequence before and 50 bp after the coordinates present in my VCF files and write them to a file as fasta sequences. I have both the BAM files and the reference genome.

Can someone help me with the same?

Thanks in advance

VCF fasta • 1.8k views
ADD COMMENT
0
Entering edit mode

See the answers here: https://www.biostars.org/p/46331/

My mistake. Will leave this here since the comment below is nested.

ADD REPLY
0
Entering edit mode

Is she not telling here that her coordinates are in VCF file and sequence in bam or reference genome? If so, the it can be done like this using bedtools.

#Create fasta from bam file.
samtools bam2fq my.bam | seqtk seq -A > my.fasta

cat my.fasta
>gene1
ATGCGCTCGCTGATCGATCATCGATCATCATCGATCGATCGA

#prepare bed file using site from your vcf file
cat my.bed 
gene1   10  15  gene1_subset

getfasta -fi my.fasta -bed my.bed -name
#index file my.fasta.fai not found, generating...
>gene1_subset::gene1:10-15
TGATC
ADD REPLY
0
Entering edit mode

It is unclear if OP wants to get the region from the fastq/alignments or from original/new consensus reference.

ADD REPLY
1
Entering edit mode
15 months ago
bcftools query -f '%CHROM\t%POS0\t%END\n' rotavirus_rf.vcf.gz |\
bedtools slop -b 50 -g rotavirus_rf.fa.fai  |\
bedtools getfasta -fi rotavirus_rf.fa -bed -
ADD COMMENT
0
Entering edit mode

@Pierre does bedtools slop take genome.fa.fai file instead of just genome.fa?

ADD REPLY
1
Entering edit mode

the fai file. It wants a file chrom(tab)length to get the max-size of the chromosomes.

ADD REPLY
0
Entering edit mode

That worked perfectly!

Thank you very much.

ADD REPLY
0
Entering edit mode

Please accept the answer so the question is marked solved on the website. To do that, click on the green check mark on the left side of the answer.

ADD REPLY
0
Entering edit mode

Hello,

I was trying to use the command cited in this post. It workes for me.

bcftools query -f '%CHROM\t%POS0\t%END\n' variation_sorted.vcf |\ bedtools slop -b 150 -g Equus_caballus.EquCab3.0.dna.toplevel.fa.fai |\ bedtools getfasta -fi Equus_caballus.EquCab3.0.dna.toplevel.fa -bed upflank.bed -tab -fo test.txt -name

However, I need the polymorphism. Is there any way to get the final sequence with the polymorphism with IUPAC code?

E.g: ATGGTGATGGGAGGGCACGTGGACCGACGGGTGAACAGCTCTGTGACCGTGGGGCCAACGCTCTCGGGTGAGGCCCTGCCAAGGGGGCGAAACRCTGCCCGSACWGTSCGGGCAGTGGTGGTGAGSCCYCAGGCTGAGGGCTCACCCAGC[C/T] GCAGTCAGGCCCTGGAGCTGCTAAGTAGCCTGGTGCCTGCTGAGCGTAGCCCACCTACYGGCCRGCTTCCTAGGCCCATGGCTGTTGTGCCAAGGAGTCCAGGTCTGGGTCGCTCAGTARGTGAAGCCCTGGGGCAGCTACCTGAGACAG

Help please

ADD REPLY
0
Entering edit mode

a pipe '|' is missing before the first '\'

ADD REPLY
0
Entering edit mode

Hi. You are right. I made it work, but I still do not have what I need: my flanked variants with the polymorphism in IUPAC code. Ideas?

ADD REPLY

Login before adding your answer.

Traffic: 2554 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6