Question

Extract sequence number "X" from multifasta file

1

Entering edit mode

10.0 years ago

user230613 ▴ 380

Hi there,
I know there are different ways to extract some "wanted" sequences from multifasta file. In general, having one or more IDs, you can retrieve the sequence from multifasta file using bioperl script or another kind of software as ones pointed out here (Extracting Multiple Fasta Sequences At A Time From A File Containing Many Sequences). In my case I know that I want the sequence number 1, 5 and 10 from a multifasta file. Any idea how can I achieve this using perl/bioperl preferably?

Thanks in advance.
P.D: I'm a beginner in bioinformatics word, sorry if it is a very basic question.

fasta • 3.9k views

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 10.0 years ago by user230613 ▴ 380

1

Entering edit mode

10.0 years ago

Brian Bushnell 20k

There is a program in BBTools which will do that:

getreads.sh in=data.fasta out=filtered.fasta id=1,5,10

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 10.0 years ago by Brian Bushnell 20k

1

Entering edit mode

10.0 years ago

Pierre Lindenbaum 164k

I want the sequence number 1, 5 and 10 from a multifasta file

awk '/^>/ {N++;} {if(N==1 || N==5 ||N==10) print;}' in.fa

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 10.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

I might be mistaken, but I think this only prints the lines containing the sequence identifiers, correct?

ADD REPLY • link 10.0 years ago by Matt Shirley 10k

0

Entering edit mode

No, there is no 'next' statement in

{N++; }

so awk continues to scan the matching patterns. It would only print the headers with

{N++; next;}

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 10.0 years ago by Pierre Lindenbaum 164k

1

Entering edit mode

Thank you Pierre. I'm still impressed by how concise most awk scripts can be.

ADD REPLY • link 10.0 years ago by Matt Shirley 10k

1

Entering edit mode

10.0 years ago

Matt Shirley 10k

I know you are looking for a Perl solution, but you could use my pyfaidx module for this. In the latest version I've added numeric indexing, so you can do:

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 10.0 years ago by Matt Shirley 10k

Ram · Accepted Answer · 2014-12-14

3

Entering edit mode

10.0 years ago

SES 8.6k

Here is a BioPerl example since you asked:

I put this in a script so it is easier to read, but in practice, I would just do this at the command line with a one-liner (that's how I wrote it):

perl -MBio::SeqIO -M'List::MoreUtils qw(any)' -E '@ids = (1, 5, 10); $num = 0; $seqio = Bio::SeqIO->new(-fh => \*STDIN); while ($seq = $seqio->next_seq) { $num++; say join "\n", ">".$seq->id, $seq->seq if any { $_ == $num } @ids; }' < seqs.fas

That will save you from opening an editor. It's not as readable but it works the same. :)

ADD COMMENT • link 10.0 years ago by SES 8.6k

0

Entering edit mode

Thank you a lot SES. One last question, could you explain me a little bit the meaning of this last two lines?:

say join "\n", ">".$seq->id, $seq->seq
if any { $_ == $num } @ids;

And another thing, using "say" is possible to print to a file like with "print?

Thanks again.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 10.0 years ago by user230613 ▴ 380

1

Entering edit mode

say was added in perl version 5.10, it just adds a newline to your print, so it saves you some typing. If you have something as old as 5.10, you can enable it with use feature 'say'; or you can enable all new features at the command line with -E.

join should be self-explanatory, it just adds a newline after the sequence ID and the sequence itself. In Perl, statements are terminated by a semi-colon, so that last statement is one line. That style is called post-fix, with the conditional at the end, and this makes the code much more readable (esp. when put on separate lines) by removing unnecessary braces. That statement should be easy to read, it says, "print the record if any IDs match those in the list." And that sentence is almost working code! That is a good reason to use the functions from List::Util or List::MoreUtils, they are very expressive and they make the code easy to read (and they are fast, as they are all written in C).

Note that I imported the any function from the package List::MoreUtils. All of the functions in that package are now part of List::Utils, which is shipped with Perl (the lastest version includes these functions). I used the List::MoreUtils package so the code will work with any version of Perl, not just the latest. I think it will simplify things going forward now that all the functions are in one package and in core Perl.

ADD REPLY • link 10.0 years ago by SES 8.6k