I followed the instructions in "Extracting Multiple Fasta Sequences At A Time From A File Containing Many Sequences" to retrieve a list of sequences by ID, but it only works correctly for the last ID in the list file.
Please note that BioPerl is not an option (the HPCC management team has been unable to install it correctly).
Given a FASTA file Fec2_001.fa:
>M02471:1:000000000-A6VM6:1:1101:18323:1737 1:N:0:6
GTCACCAAGACCGCGCAGACCGGCGAAATGTATTATTCCTTGCCGCAGCGCATGATTCTGCCGGGCTACAACCCGACCACCAAGGCGCACGGTCGCGTGCT
>M02471:1:000000000-A6VM6:1:1101:14489:1760 1:N:0:6
CACATGCCCTTGTATCAGGAGCTGCGTCGTCGCATGGATGTCGGCGAGTTCGGCCGCATGAATCTCGCTCAGCTCAACTTTGGCAGCTATAAGGAGTACGG
>M02471:1:000000000-A6VM6:1:1101:14668:1789 1:N:0:6
AGCTCTTGAGCTCGCTCGTATCGCCCTTGTCCTTCTTCGTCTTCCAGTAATGACCCGTCACGCTTCCGTCTTGAACCGTATAGGTCAGATCTTCGACGCG
And a list of IDs that need to be extracted, in BLASTids.txt:
M02471:1:000000000-A6VM6:1:1101:18323:1737 1:N:0:6
M02471:1:000000000-A6VM6:1:1101:14668:1789 1:N:0:6
I've tried the following, but each time the only sequence returned is the one matching the last ID. I assume the newline characters are the cause, but my lack of Perl programming experience isn't leading me to a solution.
First:
perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' BLAST.txt Fec2_001.fa
Second (from biostar link above):
use warnings;
use strict;
my $lookup = shift @ARGV; # ID to extract
local $/ = "\n>"; # read by FASTA record
while (my $seq = <>) {
    chomp $seq;
    my ($id) = $seq =~ /^>*(\S+)/; # parse ID as first word in FASTA header
    if ($id eq $lookup) {
        print "$seq\n";
        last;
    }
}
These two are among several solutions that others have shown to work. Any guidance is greatly appreciated!
Wonderful. Thank you much!
Note that I had to remove the characters after the space in my ID file; otherwise only the first base was printed.
In the FASTA sequence format, the entry identifier is the first token (non-whitespace "word") on the header line. Any information on the header line after that is assumed to be a description and is not part of the identifier.
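As a rough illustration of that rule, here is a minimal core-Perl sketch (no BioPerl) that extracts several records at once by comparing only the first token of each header with the first token of each line in the ID list. The script name extract_by_id.pl and the sub names first_token, load_ids and extract_records are made up for this example. It assumes one ID per line, and it ignores anything after the first whitespace on either side, so the trailing "1:N:0:6" and Windows line endings in the list file should not matter.

```perl
#!/usr/bin/env perl
# extract_by_id.pl (hypothetical name): print FASTA records whose ID is listed.
use strict;
use warnings;

# Identifier = first whitespace-delimited word, after any leading '>'.
# Returns undef for blank lines.
sub first_token {
    my ($line) = @_;
    $line =~ s/^>//;
    my ($id) = $line =~ /^\s*(\S+)/;
    return $id;
}

# Read the ID list (one ID per line) into a hash for fast lookup.
sub load_ids {
    my ($path) = @_;
    my %want;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while (my $line = <$fh>) {
        my $id = first_token($line);
        $want{$id} = 1 if defined $id;
    }
    close $fh;
    return \%want;
}

# Return the text of every record whose identifier is in the lookup hash.
sub extract_records {
    my ($want, $fasta_path) = @_;
    my ($keep, $out) = (0, '');
    open my $fh, '<', $fasta_path or die "Cannot open $fasta_path: $!";
    while (my $line = <$fh>) {
        if ($line =~ /^>/) {
            # A new record starts: decide whether to keep it.
            my $id = first_token($line);
            $keep = defined $id && exists $want->{$id};
        }
        $out .= $line if $keep;
    }
    close $fh;
    return $out;
}

# Runs only when exactly two file names are given on the command line.
if (@ARGV == 2) {
    print extract_records(load_ids($ARGV[0]), $ARGV[1]);
}
```

With the files from the question, it would be run as: perl extract_by_id.pl BLASTids.txt Fec2_001.fa > subset.fa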
Thus, to use standard sequence manipulation tools (e.g. EMBOSS, NCBI BLAST+, etc.) with identifiers that contain whitespace, the whitespace needs to be replaced (typically with '_' or '-').
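If the text after the space really has to stay part of the identifier, one possible way to do the replacement is sketched below. The script name join_headers.pl and the sub name join_header are hypothetical. It only touches header lines, collapsing each run of whitespace into a single '_'. The same substitution would have to be applied to the ID list so the two files still agree.

```perl
#!/usr/bin/env perl
# join_headers.pl (hypothetical name): make whole header lines usable as IDs.
use strict;
use warnings;

# Replace each run of whitespace in a header line with one '_'.
# Sequence lines are returned unchanged.
sub join_header {
    my ($line) = @_;
    return $line unless $line =~ /^>/;
    $line =~ s/\s+\z//;    # drop trailing whitespace, including CR/LF
    $line =~ s/\s+/_/g;    # join the remaining words
    return "$line\n";
}

# Runs only when at least one file name is given on the command line.
if (@ARGV) {
    while (my $line = <>) {
        print join_header($line);
    }
}
```

For example: perl join_headers.pl Fec2_001.fa > Fec2_001.joined.fa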