I have a text file formatted like BED and I want to add the sequences as a column on to it. Bedtools getfasta should be doing that with -bedOut option but as explained in this post (https://github.com/arq5x/bedtools2/issues/571), it doesn't work as it is supposed to. I also tried to download the sequences using UCSC's REST API with requests in Python but it is too slow for the number of lines in my file. My last resort will be to download the whole genome as a FASTA file and try to parse that but it tedious to do and introduces extra steps, files and scripts to my pipeline.
Does anybody know an alternative to bedtools getfasta -bedOut or any suggestions to speed up the downloads from UCSC's REST API?
Thanks in advance.
Maybe if you give us a sample of your input we could try to reproduce the phenomena.
The input looks like this:
And I am using this code:
Perhaps modify this answer for your assembly:
https://bioinformatics.stackexchange.com/questions/14015/how-to-retrieve-the-dna-sequence-from-a-particular-species-and-region-through-co/14030#14030
You'll need to pass in a region file to samtools, where coordinates are in chr:start-stop format. That's easy to do with
awk
.The output from samtools can be written to a column of your BED file with
paste
.I tried that but I get this error:
P.S. Chromosome names in my file look like "I" instead of "chrI". I fixed the fasta files for that.
The chromosome naming scheme you choose will need to be consistent when doing queries. If your BED file and FASTA use different schemes, queries won't work.