I have a problem with the output from bedtools getfasta, it used to work before. My bed-file looks like this -
C1_64090263 131 164 C1_64090263-1:33 V1_87574936:0 - 33 0
And I am using the command -
bedtools getfasta -fi in.fa -bed file.bed -s -name -fo out.fasta
This reference sequence is 200 bases long. But after extraction I get the sequences as follows
>C1_64090263-1:33
TCAAATTCATATGTTGAAATCTTAA>1_
Some invalid characters are formed at the end of sequence, and at the end of the file too many sequences have invalid fasta sequences such as -
>C1_70260404-1:51
\88\F4\BFH\F5\BF?\F6\BF\C8\F6\BF\88\F7\BFH\F8\BF?\F9\BF
There is no problem with the reference fasta as seqtk works on it, but the reverse-strand is also crucial for my sequences. This happens to most of the sequences on the reverse-strand,some on the plus strand too.
I tried without the name and strandedness options, and also replaced the 5th column with 0, but still it produces no proper fasta files.
Is there a solution to this problem?
How did you generate your FASTA file? It seems like you might have Unicode characters and/or missing line breaks between sequences and deflines.
The fasta file are contigs from a genome assembly. I extracted them according to their IDs with grep and processed to remove unwanted lines
What output do you get when you run
samtools faidx [genome].fa C1_64090263-1:33
?samtools faidx
orpyfaidx
will be more sensitive to formatting issues but it's a good suggestion. In fact, I've built some good error messages into pyfaidx that might help you diagnose the issue.I get just the sequence header, no fasta sequence
When I run with just the sequence name, without the cordinates, I can extract the sequence.
But if it's a problem with the genome.fasta reference, why does Seqtk work?