I am trying to extract sequences from a .fasta file based on a bed file using bedtools getfasta and I am getting the following error.
The command run was the following:
bedtools getfasta -fi genomic.fasta -bed bedfile.bed -fo output.fasta
WARNING. chromosome (chr1) was not found in the FASTA file. Skipping
This occurs for each sequence contained within my bed file when bedtools attempts to create the index file. I know a lot of people have had this issue, and I have tried to use their responses to fix it myself, but I am unable to figure it out. I know that, for the most part, the issue is a mismatch between the chromosome names in the bed file and the fasta file. As far as I can determine, my identifiers are identical, and I cannot for the life of me figure out the issue. It is almost certainly something very simple.
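One way to test the name-mismatch theory without relying on the eye is to compare the two sets of names mechanically. The following is only a sketch: the function name `bed_names_missing_from_fasta` is made up for illustration, and it assumes the bed file is tab-delimited and that a sequence name is everything after `>` up to the first space or tab. Carriage returns are deliberately left in place, so a header such as `chr1` followed by a hidden `\r` is reported as a mismatch rather than silently matched.

```shell
# Hypothetical helper: list names in column 1 of a BED file that have no
# exactly matching FASTA header. Assumes a tab-delimited BED file.
bed_names_missing_from_fasta() {
    bed=$1
    fasta=$2
    tmpdir=$(mktemp -d) || return 1
    # Unique names from column 1 of the BED file. If a line contains no tab,
    # cut prints the whole line, which then shows up as an unmatched name.
    cut -f1 "$bed" | sort -u > "$tmpdir/bed_names"
    # Unique names from the FASTA headers: drop '>' and anything after the
    # first space or tab. A trailing carriage return is kept on purpose.
    awk '/^>/ { sub(/^>/, ""); sub(/[ \t].*/, ""); print }' "$fasta" \
        | sort -u > "$tmpdir/fasta_names"
    # comm -23 prints lines that appear only in the first file.
    comm -23 "$tmpdir/bed_names" "$tmpdir/fasta_names"
    rm -rf "$tmpdir"
}

# Usage (file names taken from the command in the question):
# bed_names_missing_from_fasta bedfile.bed genomic.fasta
```

Empty output means every name in the bed file has an exact counterpart among the fasta headers; any line printed is a name that has none.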
My bed file head:
chr1 4309600 4309825
chr1 4310021 4310350
chr1 4310471 4310646
chr1 4310766 4311096
chr1 4311250 4311471
chr1 4311750 4312141
chr1 4312150 4312471
chr1 4312496 4312841
chr1 4312846 4313421
chr1 4313566 4314216
Chromosome identifiers from the fasta file, obtained by running:
grep -o -E "^>\w+" "my_genomic.fna" | tr -d ">"
chr1
chr1A
chr2
chr3
chr4
chr4A
chr5
chr6
...
chrZ
chrW
I have opened up both files with simple text editors to make sure I had not added additional spaces or miscellaneous characters and they are the same.
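Text editors often render tabs, trailing spaces, and carriage returns identically, so a visual check can miss them. A small sketch of a more literal view follows; the function name `show_invisible` is hypothetical, and it relies only on the standard `l` command of `sed`, which prints a tab as `\t`, a carriage return as `\r`, and marks the real end of each line with `$`.

```shell
# Hypothetical helper: print the first N lines (default 5) of a file with
# invisible characters made visible.
#   \t  = tab
#   \r  = carriage return (Windows line ending)
#   $   = true end of line (a space before it is a trailing space)
show_invisible() {
    head -n "${2:-5}" "$1" | sed -n 'l'
}

# Usage (file names taken from the command in the question):
# show_invisible bedfile.bed
# show_invisible genomic.fasta 2
```

In a tab-delimited bed file, each line should appear as `chr1\t4309600\t4309825$`. Note that `sed` folds long lines when printing this way, so long sequence lines in the fasta file may appear wrapped with a trailing backslash.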
Hoping this is an easy fix. Thanks in advance for any help.
Please provide the command you used.
Also, maybe your fasta file contains some weird newline characters? Try running
dos2unix your.fasta
and then retry bedtools getfasta. Just guessing...
I edited the initial post to include the command that was run. Will try dos2unix on my fasta. It does not appear that there are any newline characters.
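Before converting anything, it may help to count how many lines actually end in a carriage return across the whole file, since looking at the first few lines can miss a problem further down. This is a minimal sketch with a hypothetical function name, `count_cr_lines`; it only reads the file and does not modify it.

```shell
# Hypothetical helper: count lines that end with a carriage return (\r),
# i.e. lines with Windows-style CRLF endings. Prints 0 if there are none.
count_cr_lines() {
    awk '/\r$/ { n++ } END { print n + 0 }' "$1"
}

# Usage (file names taken from the command in the question):
# count_cr_lines genomic.fasta
# count_cr_lines bedfile.bed
```

A result of 0 for both files would suggest that line endings are not the cause and that dos2unix is unlikely to change anything.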
can you please run
and then
samtools faidx genomic.fasta successfully creates an index file that can be used to obtain the appropriate sequence.
yields (which is accurate):
I am now just trying to find out how to use the faidx command on the bed file to return all fasta sequences. I had done this previously and then went on to use this index with bedtools getfasta... found out these are not compatible indexes, and getfasta will now function, but the fasta sequences returned are not correct.
How can I use faidx on the entire bed file? Or is there a different command to utilize to convert the entire bed? I am trying to find this now as well.
Thank you for the assistance!
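One possible approach, sketched here under a few assumptions, is to turn each bed interval into a samtools region string and pass the whole list to `samtools faidx` through its region-file option (`-r`). Bed coordinates are 0-based with an exclusive end, while samtools regions are 1-based and inclusive, so the start is shifted by one. The function name `bed_to_regions` and the file name `regions.txt` are made up for illustration, the bed file is assumed to be tab-delimited, and the installed samtools is assumed to be recent enough to support `-r`.

```shell
# Hypothetical helper: convert BED intervals (0-based start, exclusive end)
# to samtools region strings (1-based, inclusive): chrom:start+1-end.
# Lines with fewer than three tab-separated fields are skipped.
bed_to_regions() {
    awk -F '\t' 'NF >= 3 { printf "%s:%d-%d\n", $1, $2 + 1, $3 }' "$1"
}

# Usage (assumes samtools is installed and supports -r / --region-file):
# bed_to_regions bedfile.bed > regions.txt
# samtools faidx genomic.fasta -r regions.txt > output.fasta
```

The sequences should cover the same bases as the bed intervals, but the headers will carry the 1-based start (for example `chr1:4309601-4309825`), so they will not be textually identical to headers written by bedtools getfasta.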
I'd also like to see the output of:
The result of running that command is the following
it looks ok.
Are you aware of a way to use samtools faidx, or an alternative samtools command, to extract sequences in the same way that bedtools getfasta does (i.e., bedfile.bed as input and fastafile.fasta as output)? I am still at a loss for what the issue is with bedtools index generation, and bedtools cannot use the samtools-generated index file appropriately.
Can you upload a sample from your files? I mean, don't just copy-paste, actually upload a file that I can download and try to reproduce your problem.
Here is a link to a Google folder that contains both the fastafile.fasta file and a bedfile.bed that should be able to reproduce the issue:
Google Folder