Question

Modify reference fasta using bed file

0

Entering edit mode

5.4 years ago

Rubal ▴ 350

Hello Everyone,

I would like to modify a reference genome fasta file using a list of position in bed file format. The bed file contains positions like this:

chr1    17716   C       G
chr1    17925   A       G
chr1    18115   C       T

So for example I would like to change the position at chr1 17716 from C to G in the fasta file. I have seen some python packages and a GATK tool for modifying a fasta file using a list of VCF positions but was wondering if there was any tool that would do this using a bed file as input rather than a VCF. Or would it be better to convert the bed file to VCF format by adding a VCF header (I'm not sure if this would work?).

Thanks in advance for your help.

fasta bed genome • 1.9k views

ADD COMMENT • link updated 5.4 years ago by AK ★ 2.2k • written 5.4 years ago by Rubal ▴ 350

score 3 · Accepted Answer · 2019-07-05

3

Entering edit mode

5.4 years ago

AK ★ 2.2k

If SNPs only:

$ cat example.fa
>chr1
CCCCCC
>chr2
AAAAAA
$ cat example.bed
chr1    2   C   G
chr1    4   C   G
chr1    6   C   G
chr2    1   A   T
chr2    3   A   T
chr2    5   A   T

$ seqkit fx2tab example.fa > example.tab
$ awk 'FILENAME=="example.tab" {fa[$1]=$2; next} {fa[$1]=substr(fa[$1], 1, $2-1) $4 substr(fa[$1], $2+1, length(fa[$1])-$2)} END {for (id in fa){print ">" id "\n" fa[id]}}' example.tab example.bed
>chr1
CGCGCG
>chr2
TATATA

ADD COMMENT • link 5.4 years ago by AK ★ 2.2k

0

Entering edit mode

this works well thank you! I am trying to run it changing over a million positions in a 2gb fasta file. Do you recommend any change in approach for this? I'm a little concerned as it produces no intermediate output that I can't tell how long it will take. For example would sorting the example.bed speed it up or does it search through the file for each position without sorting making a difference. The only improvement I can think of would be to sort and split the fasta and bed files into smaller pieces, run independently and then merge at the end.

ADD REPLY • link 5.1 years ago by Rubal ▴ 350