How do I convert FA files to BED format?
You can't.
A FA/FASTA file describes a DNA or protein sequence:
>Name
ATAGCTACGATTACGACGTACG
ATCGATCGATGCATCAGCTACT
AACTAGTCGATGATGCATACG...
A BED file describes features mapped onto a genome/sequence:
chr1 786 9879 gene1
chr2 486 979 gene2
The only thing you can do is say that the BED file contains a single feature, the sequence itself:
Name 0 1098 Name
This is a good, though slightly misguided, question. If you want to make a BED file from a FASTA sequence, you might do something like this:
Align the sequences to a reference with bowtie or bwa, produce BAM output, then use bedtools bamToBed to generate a BED file.
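To make the coordinate conversion in that last step concrete, here is a minimal Python sketch of how one alignment record maps to a BED6 line. It is not the bedtools implementation: it works on SAM text lines rather than binary BAM, the function name sam_line_to_bed is hypothetical, and the column choice (read name, MAPQ as score, strand from the 0x10 flag) is an assumption about what you want in the output.

```python
import re

# CIGAR operations that consume reference bases (per the SAM specification).
REF_CONSUMING = set("MDN=X")
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def sam_line_to_bed(line):
    """Convert one SAM alignment line to a BED6 line, or None if unmapped."""
    fields = line.rstrip("\n").split("\t")
    qname, flag, rname, pos, mapq, cigar = fields[:6]
    flag = int(flag)
    if flag & 0x4 or rname == "*" or cigar == "*":
        return None  # unmapped read: nothing to place on the reference
    # Reference span = total length of the reference-consuming operations.
    span = sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in REF_CONSUMING)
    start = int(pos) - 1  # SAM POS is 1-based, BED start is 0-based
    end = start + span    # BED end is exclusive
    strand = "-" if flag & 0x10 else "+"
    return "\t".join([rname, str(start), str(end), qname, mapq, strand])


if __name__ == "__main__":
    example = "read1\t16\tchr1\t787\t60\t10M2I5M3D20M\t*\t0\t0\t*\t*"
    # Fields printed (tab-separated): chr1 786 824 read1 60 -
    print(sam_line_to_bed(example))
```

The insertion (2I) does not extend the interval on the reference, while the deletion (3D) does, which is why the feature spans 38 bases here.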
Or use blat and this perl script to convert to bed... https://github.com/mmarchin/utilities/blob/master/parseBlat.pl
@malachig - I don't think BLAT outputs BED. The -out=type is one of:
psl - Default. Tab separated format, no sequence
pslx - Tab separated format with sequence
axt - blastz-associated axt format
maf - multiz-associated maf format
sim4 - similar to sim4 format
wublast - similar to wublast format
blast - similar to NCBI blast format
blast8 - NCBI blast tabular format
blast9 - NCBI blast tabul
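Since BLAT's default output is PSL rather than BED, a conversion step is needed. The following Python sketch shows the idea; it is not the logic of the linked parseBlat.pl script. It assumes the standard 21-column PSL layout from a nucleotide-vs-nucleotide search (so the strand field is a single + or -), and the function name psl_line_to_bed and the placeholder score of 0 are illustrative choices.

```python
def psl_line_to_bed(line):
    """Convert one PSL alignment line to a BED6 line, or None for non-data lines."""
    fields = line.rstrip("\n").split("\t")
    # Header lines ("psLayout version 3", column titles, dashes) and blank
    # lines do not start with a numeric match count, so they are skipped.
    if len(fields) < 21 or not fields[0].isdigit():
        return None
    strand = fields[8]   # assumed to be a single '+' or '-'
    qname = fields[9]    # query (your FASTA sequence) name
    tname = fields[13]   # target (genome) sequence name
    tstart = fields[15]  # PSL target start is already 0-based
    tend = fields[16]    # PSL target end is exclusive, like BED
    return "\t".join([tname, tstart, tend, qname, "0", strand])


if __name__ == "__main__":
    psl = ("68\t0\t0\t0\t0\t0\t1\t425\t+\tquery1\t68\t0\t68\t"
           "chr2\t1000000\t486\t979\t2\t48,20,\t0,48,\t486,959,")
    # Fields printed (tab-separated): chr2 486 979 query1 0 +
    print(psl_line_to_bed(psl))
```

Because PSL and BED both use 0-based, half-open coordinates, the target start and end can be copied over without adjustment.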
awk '/^>/ {name = substr($0, 2); printf "%s\t0\t", name} !/^>/ {printf "%d\t%s\n", length($0), name}' "$fastafile"
If each FASTA sequence spans several lines, use this version instead (BED starts are 0-based, so the start column is 0):
awk 'BEGIN {totallen = -1} /^>/ {if (totallen != -1) print totallen "\t" name; name = substr($0, 2); printf "%s\t0\t", name; totallen = 0} !/^>/ {totallen += length($0)} END {if (totallen != -1) print totallen "\t" name}' "$fastafile"
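For readers who prefer Python to awk, the multi-line logic can be sketched as below. This is only an illustration under a few assumptions: the function name fasta_to_bed is made up, the sequence name is taken to be the first word of the header line (the awk version keeps the whole header), and the start coordinate is 0 because BED is 0-based. Line endings are stripped before counting, and the final record is flushed after the loop, so a missing trailing newline does not drop the last sequence.

```python
import io


def fasta_to_bed(handle):
    """Yield one BED4 line per FASTA record: name, 0, length, name."""
    name = None
    length = 0
    for line in handle:
        line = line.rstrip("\r\n")  # do not count line endings as bases
        if line.startswith(">"):
            if name is not None:
                yield f"{name}\t0\t{length}\t{name}"
            # Assumption: use the first word of the header as the name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            length = 0
        elif name is not None:
            length += len(line.strip())
    if name is not None:  # flush the final record
        yield f"{name}\t0\t{length}\t{name}"


if __name__ == "__main__":
    fasta = io.StringIO(">Name\nATAG\nATC\n>seq2 description\nAC")
    for bed_line in fasta_to_bed(fasta):
        print(bed_line)
    # Prints two tab-separated lines:
    #   Name  0  7  Name
    #   seq2  0  2  seq2
```

To run it on a real file, pass an open file handle, for example `with open("genome.fa") as fh: ...`, where genome.fa is a placeholder path.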
Old question, but in the spirit of good science I will post a script that takes any genome fasta file and creates a genome BED file. Very niche but I don't think there is a good converter out there quite yet.
https://github.com/noahaus/Micellaneous-Tools/blob/master/genome2bed.py
I'd read the comments in the script before beginning.
Many thanks for this. Strange as it may seem, I've been looking for a simple way to do this for a while. There were a couple of minor issues with the script (the annotation length includes line feeds, and the last line fails to be included unless there's an extra line feed at the end - these may be OS specific) but it does the job perfectly.
You can accomplish this using
faidx -i bed genome.fa > out.bed
For more details you can check out the documentation: https://github.com/mdshw5/pyfaidx#cli-script-faidx
Sorry, but this script is totally wrong. It calculates sequence lengths incorrectly and also doesn't write the last sequence. I've created an issue on GitHub: https://github.com/noahaus/Micellaneous-Tools/issues/1
@michael My guess would be that it is due to a confusion of formats and their purposes. Happens to the best!
Noahaus, your script gives different results than faidx does. I believe faidx, as it is a community tool that's been around a while; you may want to double-check your script (or maybe file a bug with faidx).