Question

fasta to bed format with more information that just chr locations

1

Entering edit mode

6.6 years ago

varsha619 ▴ 90

Hello, I have a fasta file with lines of format -

>FBti0019256 type=transposable_element; loc=2L:22300300..22304444; name=invader2{}555; dbxref=FlyBase_Annotation_IDs:TE19256,FlyBase:FBti0019256; MD5=d9259a0e33aad699215e64916bd47a5b; length=4145; release=r6.19; species=Dmel;

I would like to convert these lines into a bed file of format -

chr2L /t 22300300 /t 22304444 /t invader2

Is there a program that can directly perform this conversion or an awk command that can do this easier? Please let me know, thank you for your help.

fasta to bed • 2.4k views

ADD COMMENT • link updated 6.6 years ago by Daniel ★ 4.0k • written 6.6 years ago by varsha619 ▴ 90

0

Entering edit mode

varsha619 : If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one answer if they all work.
Upvote|Bookmark|Accept

ADD REPLY • link 6.6 years ago by GenoMax 148k

score 4 · Accepted Answer · 2018-05-02

use an associative array in awk to store each component.

I'm just too lazy to extract the chrom/start/end, you get the idea.

awk -F ' ' '/^>/ {delete map;for(i=1;i<=NF;i++) {eq=index($i,"=");if(eq==0) continue;key=substr($i,1,eq-1);val=substr($i,eq+1); gsub(/;$/,"",val); map[key]=val;} OFS="\t"; print map["loc"],map["type"],map["name"],map["dbxref"],map["MD5"];}'

score 4 · Accepted Answer · 2018-05-02

4

Entering edit mode

6.6 years ago

Daniel ★ 4.0k

My approach is lazier... Just split on the punctuation, pull out the columns you want, and separate by "\t".

awk -F'[ *:.]' -v OFS="\t" '{print "chr"$5,$6,$8,$13}' tmp.txt

# Output
chr2L   22300300    22304444    invader2

ADD COMMENT • link 6.6 years ago by Daniel ★ 4.0k

0

Entering edit mode

OP wants the output file to be in BED format with columns separated by tabs \t.

ADD REPLY • link 6.6 years ago by GenoMax 148k

0

Entering edit mode

@Daniel, would you mind explaining how the columns $5 and $10 are defined, when I run the script it does not output the right values. And yes I would like the output values to be tab separated. Thanks!

ADD REPLY • link 6.6 years ago by varsha619 ▴ 90

1

Entering edit mode

The $n parameters are the columns when you divide the file up by what is inside the -F'[ ]' box, so if you split the line every time you saw a space, asterix, colon, or period then you'd want the 5th, 6th, 8th and 13th columns to be printed.

ADD REPLY • link 6.6 years ago by Daniel ★ 4.0k

0

Entering edit mode

Huh, I was certain that bed format was colon and hyphen but the internet doesn't support that... Maybe I'm going mad. Ok, updating.

Edit: I was thinking of UCSC browser formatting apparently.

ADD REPLY • link 6.6 years ago by Daniel ★ 4.0k

score 2 · Accepted Answer · 2018-05-02

2

Entering edit mode

6.6 years ago

cpad0112 21k

Assumption is that chromosomes are numbered, no x and y chromosomes. For fasta headers only, user can use grep

$ sed 's/.*loc.\(\b.*\b\):\(\b.*\b\)\.\.\(\b.*\b\);.*=\(\b.*\b\){.*/chr\1\t\2\t\3\t\4/g' test.txt 
chr2L   22300300    22304444    invader2

$ cut -f2,3 -d";" test.txt| cut -d= -f2,3 | awk -v OFS="\t" -F':|=|;|\\..|{' '{print "chr"$1,$2,$3,$5}'
chr2L   22300300    22304444    invader2

$ cat test.txt 
>FBti0019256 type=transposable_element; loc=2L:22300300..22304444; name=invader2{}555; dbxref=FlyBase_Annotation_IDs:TE19256,FlyBase:FBti0019256; MD5=d9259a0e33aad699215e64916bd47a5b; length=4145; release=r6.19; species=Dmel;

1000 bash cuts:

$ cut -f2,3 -d";" test.txt| cut -d= -f2,3 | cut -f1,2 -d: --output-delimiter=$'\t'| cut -f1,2,3 -d'.' --output-delimiter=$'\t' | cut -f1,2 -d';' --output-delimiter=$'\t'| cut -f1,2 -d'=' --output-delimiter=$'\t' |  cut -f1,2 -d'{' --output-delimiter=$'\t'| cut -f1,2,3,4,6

2L  22300300        22304444    invader2

ADD COMMENT • link 6.6 years ago by cpad0112 21k

0

Entering edit mode

Thank you! I liked the 2nd method the best and seemed to be the easiest to use - cut -f2,3 -d";" test.txt| cut -d= -f2,3 | awk -v OFS="\t" -F':|=|;|\..|{' '{print "chr"$1,$2,$3,$5}'

ADD REPLY • link 6.6 years ago by varsha619 ▴ 90

1

Entering edit mode

Actually, I think @ Daniel's one is best awk solution for OP requirement. In line with @Daniel's solution below, following awk solution is better than one using two cuts above :).

$ awk -v OFS="\t"  -F';|=|:|\\..|{' '/^>/ {print "chr"$4,$5,$6,$8}' test.txt
chr2L   22300300    22304444    invader2

input:

>FBti0019256 type=transposable_element; loc=2L:22300300..22304444; name=invader2{}555; dbxref=FlyBase_Annotation_IDs:TE19256,FlyBase:FBti0019256; MD5=d9259a0e33aad699215e64916bd47a5b; length=4145; release=r6.19; species=Dmel;

ADD REPLY • link 6.6 years ago by cpad0112 21k