How can I add sequences to my BED file?
0
0
Entering edit mode
4.2 years ago

I have a text file formatted like BED and I want to add the sequences as a column on to it. Bedtools getfasta should be doing that with -bedOut option but as explained in this post (https://github.com/arq5x/bedtools2/issues/571), it doesn't work as it is supposed to. I also tried to download the sequences using UCSC's REST API with requests in Python but it is too slow for the number of lines in my file. My last resort will be to download the whole genome as a FASTA file and try to parse that but it tedious to do and introduces extra steps, files and scripts to my pipeline.

Does anybody know an alternative to bedtools getfasta -bedOut or any suggestions to speed up the downloads from UCSC's REST API?

Thanks in advance.

bedtools BED • 1.5k views
ADD COMMENT
0
Entering edit mode

Maybe if you give us a sample of your input we could try to reproduce the phenomena.

ADD REPLY
0
Entering edit mode

The input looks like this:

chrI       11140   11240   1       NM_058259.4     +
chrI       11240   11340   2       NM_058259.4     +
chrI       11340   11440   3       NM_058259.4     +
chrI       11440   11540   4       NM_058259.4     +
chrI       11540   11640   5       NM_058259.4     +

And I am using this code:

bedtools getfasta -fi chrI.fa -bed genes_divided.bed -fo genes_divided_seqs.bed -bedOut
ADD REPLY
0
Entering edit mode

Perhaps modify this answer for your assembly:

https://bioinformatics.stackexchange.com/questions/14015/how-to-retrieve-the-dna-sequence-from-a-particular-species-and-region-through-co/14030#14030

You'll need to pass in a region file to samtools, where coordinates are in chr:start-stop format. That's easy to do with awk.

The output from samtools can be written to a column of your BED file with paste.

ADD REPLY
0
Entering edit mode

I tried that but I get this error:

[W::fai_get_val] Reference I 11140 11240 not found in FASTA file, returning empty sequence
[faidx] Failed to fetch sequence in I 11140 11240

P.S. Chromosome names in my file look like "I" instead of "chrI". I fixed the fasta files for that.

ADD REPLY
0
Entering edit mode

The chromosome naming scheme you choose will need to be consistent when doing queries. If your BED file and FASTA use different schemes, queries won't work.

ADD REPLY

Login before adding your answer.

Traffic: 1640 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6