I have a FASTA file with 100+ viral genomes. I am only interested in the sequence of one particular gene, for which I have the coordinates. I tried bedtools getfasta as follows:
head UL48.bed
JQ673480.1 103538 105080
head ref.fa
>KT425109.1
ATAAACCAACGAAAAGCGCGGGAACGGGG....
bedtools getfasta -fi ref.fa -bed UL48.bed
I realize the problem is that the chromosome name in the BED file does not match the sequence names in ref.fa. Each genome has its own unique identifier, so a single BED line will never match all of them, and writing a BED file by hand with every identifier would be extremely time-consuming. I don't want to edit ref.fa to give the genomes uniform identifiers, because I need the identifiers for downstream processing. Is there a way to do this with grep or a similar command-line tool? In other words, for each entry in ref.fa I need the header line (beginning with >) and roughly characters 103538 to 105080 of its sequence.
You can easily grep out the FASTA headers and then create a BED file from them, with one line per genome that reuses the same coordinates.
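A minimal sketch of that idea, assuming the gene sits at roughly the same coordinates in every genome (as the question suggests) and that the sequence ID is the header text up to the first whitespace. The helper name make_bed and the output file names UL48_all.bed and UL48_all.fa are made up for illustration. Keep in mind that BED starts are 0-based, so adjust the start by one if your coordinates are 1-based.

```shell
# make_bed FASTA START END
# Print one tab-separated BED line per record in FASTA, using the record ID
# (header text after '>' up to the first whitespace) and the same START/END
# for every record. make_bed is a hypothetical helper, not part of bedtools.
make_bed() {
    grep '^>' "$1" | awk -v s="$2" -v e="$3" 'BEGIN { OFS = "\t" } { print substr($1, 2), s, e }'
}

# Example run with the file names from the question; skipped when ref.fa or
# bedtools is not available. Output file names are placeholders.
if [ -f ref.fa ] && command -v bedtools >/dev/null 2>&1; then
    make_bed ref.fa 103538 105080 > UL48_all.bed
    bedtools getfasta -fi ref.fa -bed UL48_all.bed -fo UL48_all.fa
fi
```

ref.fa itself is not modified, so the original identifiers stay available for downstream steps. Genomes shorter than the requested end coordinate will not yield a full-length sequence, so it is worth checking the output count against the number of headers.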
Once you have a multi-FASTA file with the extracted sequences, you can split it into one file per sequence (if you need to) as described in: How to split fasta by '>' into a file each containing one sequence, and have the name of that file be the ID?
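For reference, the splitting step can be sketched with awk. This is not necessarily the method from the linked post. The helper name split_fasta is hypothetical, and the sketch assumes the IDs are unique and contain no "/" characters. Headers written by bedtools getfasta look like >KT425109.1:103538-105080, so the resulting file names will contain a colon.

```shell
# split_fasta FASTA OUTDIR
# Write each record of FASTA to OUTDIR/<ID>.fa, where <ID> is the header text
# after '>' up to the first whitespace. split_fasta is a hypothetical helper.
split_fasta() {
    mkdir -p "$2" || return 1
    awk -v dir="$2" '
        /^>/ {
            if (out != "") close(out)   # close the previous file to limit open handles
            out = dir "/" substr($1, 2) ".fa"
        }
        out != "" { print > out }       # header and sequence lines go to the current file
    ' "$1"
}
```

For example, split_fasta UL48_all.fa UL48_split would create one .fa file per genome in a directory named UL48_split (both names are placeholders).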