Question

Trouble with bedtools getfasta

1

Entering edit mode

2.8 years ago

gkunz ▴ 30

I am trying to extract sequences from a .fasta file based on a bed file using bedtools getfasta and I am getting the following error.

The command run was the following:

bedtools getfasta -fi genomic.fasta -bed bedfile.bed -fo output.fasta
WARNING. chromosome (chr1) was not found in the FASTA file. Skipping

This occurs for each sequence contained within my bed file when bedtools attempts to create the index file. I know a lot of people have had this issue and I have tried to use the response to fix the issue myself, but am unable to figure it out. I know for the most part the issue is a mismatch between the chromosome names in the bed file and the fasta file. As far as I can determine, my identifiers are identical and I cannot for the life of me figure out the issue. It is almost certainly something very simple.

My bed file head:

chr1    4309600 4309825
chr1    4310021 4310350
chr1    4310471 4310646
chr1    4310766 4311096
chr1    4311250 4311471
chr1    4311750 4312141
chr1    4312150 4312471
chr1    4312496 4312841
chr1    4312846 4313421
chr1    4313566 4314216

Chromosome identifiers from the fasta file by running:

grep -o -E "^>\w+" "my_genomic.fna" | tr -d ">"

chr1   
chr1A   
chr2   
chr3   
chr4   
chr4A   
chr5   
chr6   
...   
chrZ   
chrW

I have opened up both files with simple text editors to make sure I had not added additional spaces or miscellaneous characters and they are the same.

Hoping this is and easy fix. Thanks in advance for any help.

bedtools getfasta • 5.6k views

ADD COMMENT • link written 2.8 years ago by gkunz ▴ 30

0

Entering edit mode

Please provide the command you used.

Also, maybe your fasta file contains some weird newline characters? Try running dos2unix your.fasta and then retry bedtools getfasta. Just guessing...

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 2.8 years ago by liorglic ★ 1.4k

0

Entering edit mode

I edited the the initial post to include the command that was run. Will try dos2unix on my fasta. It does not appear that there are an newline characters.

ADD REPLY • link 2.8 years ago by gkunz ▴ 30

0

Entering edit mode

can you please run

samtools faidx genomic.fasta

and then

samtools faidx genomic.fasta "chr1:4309600-4309825"

ADD REPLY • link 2.8 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

samtools faidx genomic.fasta successfully creates an index file that can be used to obtain the appropriate sequence.

samtools faidx genomic.fasta "chr1:4309600-4309825"

yeilds (which is accurate):

chr1:4309600-4309825
TGCACGGATGGATGTGGGAATGCTGCATCTCCCTCCCCATCCCACTGATTTCTGTCCCCC CAGGGGACTTAAATGTCCCCCAAGCCCTCTCTGGCTTCATCCCCAGAGCTTTGCCATGAT GGTCCCAAACAGAGCCAGGCACTGAGACCCCTGCATGGCTAATCCACCAGAAAACATCTA

I am not just trying to find how to use the faidx command on a the bed file to return all fasta sequences. I had done this previously and then went on to use this index with the bedtools getfasta...found out these are not compatible indexes and getfasta will now function, but the fasta sequences returned are not correct.

How can I use faidx on the entire bed file? or is there a different commend to now utilize to convert the entire bed? I am trying to find this now as well.

Thank you for the assistance!

ADD REPLY • link 2.8 years ago by gkunz ▴ 30

0

Entering edit mode

I'd also like to see the output of:

head bedfile.bed | cat -vET

ADD REPLY • link 2.8 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

The result of running that command is the following

chr1^I4309600^I4309825$
chr1^I4310021^I4310350$
chr1^I4310471^I4310646$
chr1^I4310766^I4311096$
chr1^I4311250^I4311471$
chr1^I4311750^I4312141$
chr1^I4312150^I4312471$
chr1^I4312496^I4312841$
chr1^I4312846^I4313421$
chr1^I4313566^I4314216$

ADD REPLY • link 2.8 years ago by gkunz ▴ 30

0

Entering edit mode

it looks ok.

ADD REPLY • link 2.8 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Are you aware of a way to use samtools faidx or an alternative samtools command to extract sequences in the same way that bedtools getfasta does? (i.e bedfile.bed as input and fastafile.fasta as output), because I am still at a loss for what the issue is with bedtools index generation and bedtools cannot use the samtools generated index file appropriately.

ADD REPLY • link 2.8 years ago by gkunz ▴ 30

0

Entering edit mode

Can you upload a sample from your files? I mean, don't just copy-paste, actually upload a file that I can download and try to reproduce your problem.

ADD REPLY • link 2.8 years ago by liorglic ★ 1.4k

0

Entering edit mode

Here is a link to a google folder that contains both the fastafile.fasta file and a bedfile.bed that should be able to reproduce the issue

Google Folder

ADD REPLY • link 2.8 years ago by gkunz ▴ 30

score 3 · Accepted Answer · 2022-02-11

3

Entering edit mode

2.8 years ago

Pierre Lindenbaum 164k

your fasta file is an ASCII text with CRLF lines

file genomic.fna
genomic.fna: ASCII text, with CRLF line terminators

that's why you had those problems. Removing the \r will fix the problems.

ADD COMMENT • link 2.8 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

This was in fact the issue. Thank you.

ADD REPLY • link 2.8 years ago by gkunz ▴ 30

0

Entering edit mode

Glad to see the problem was solved. It's still a bit weird that dos2unix didn't work for you - it should remove \r characters...

ADD REPLY • link 2.8 years ago by liorglic ★ 1.4k