Question

Extracting multiple sequences by ID from a FASTA file

1

Entering edit mode

10.6 years ago

Allan ▴ 10

Following the instructions as shown in Extracting Multiple Fasta Sequences At A Time From A File Containing Many Sequences for retrieving a list of sequences by ID, the only instance this works correctly is for the last ID in the list file.

Please note BioPerl is not an option (HPCC management teams inability to install correctly).

Given a FASTA file Fec2_001.fa:

>M02471:1:000000000-A6VM6:1:1101:18323:1737 1:N:0:6
GTCACCAAGACCGCGCAGACCGGCGAAATGTATTATTCCTTGCCGCAGCGCATGATTCTGCCGGGCTACAACCCGACCACCAAGGCGCACGGTCGCGTGCT
>M02471:1:000000000-A6VM6:1:1101:14489:1760 1:N:0:6
CACATGCCCTTGTATCAGGAGCTGCGTCGTCGCATGGATGTCGGCGAGTTCGGCCGCATGAATCTCGCTCAGCTCAACTTTGGCAGCTATAAGGAGTACGG
>M02471:1:000000000-A6VM6:1:1101:14668:1789 1:N:0:6
AGCTCTTGAGCTCGCTCGTATCGCCCTTGTCCTTCTTCGTCTTCCAGTAATGACCCGTCACGCTTCCGTCTTGAACCGTATAGGTCAGATCTTCGACGCG

And list of ID's that need to be extracted in BLASTids.txt:

M02471:1:000000000-A6VM6:1:1101:18323:1737 1:N:0:6
M02471:1:000000000-A6VM6:1:1101:14668:1789 1:N:0:6

I've tried the following, but each time the only sequence returned is that matching the last ID. I'm assuming the newline characters are cause of this, but my lack perl programming isn't leading me to any solution.

First:

perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' BLAST.txt Fec2_001.fa

Second (from biostar link above):

use warnings;
use strict;

my $lookup = shift @ARGV;  # ID to extract

local $/ = "\n>";  # read by FASTA record

while (my $seq = <>) {
    chomp $seq;
    my ($id) = $seq =~ /^>*(\S+)/;  # parse ID as first word in FASTA header
    if ($id eq $lookup) {
        print "$seq\n";
        last;
    }
}

These two are among several other solutions other's have shown to work. Any guidance is greatly appreciated!

FASTA • 9.7k views

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.6 years ago by Allan ▴ 10

Ram · Answer 1 · 2014-04-15

5

Entering edit mode

10.6 years ago

cts ★ 1.7k

Use seqtk which has a subseq command for extracting multiple reads using their headers

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by cts ★ 1.7k

0

Entering edit mode

Wonderful. Thank you much!

Note that the characters in my id file after the space were removed, otherwise resulting in only the first base to be printed.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by Allan ▴ 10

0

Entering edit mode

In the fasta sequence format the entry identifier is the first token (non-whitespace "word") on the header line. Any information on the header line after that is assumed to be a description, and is not part of the identifier.

Thus to use standard sequence manipulation tools (e.g. EMBOSS, NCBI BLAST+, etc.) with identifiers that contain whitespace, the whitespace will need to be replaced (typically with '_' or '-').

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by hpmcwill ★ 1.2k

Ram · Answer 2 · 2014-04-15

I tried the same in perl.

open FH,"idlist.txt";

  while(<FH>)
    {
    $_ =~ s/^\s+|\s+$//; #remove leading and trailing space
    $Uniqheader{$_}='0'; #Hash
    }

open HF,"fasta.fa";

  while(<HF>)
    {
    $id=$_;
    $seq=<HF>;
    if($id=~/^\>(.+)\s*$/ && (exists($Uniqheader{$1})))
        {
        print "$id$seq";
        }
    }
    close(FH);
    close(HF);

Ram · Answer 3 · 2014-04-16

0

Entering edit mode

10.6 years ago

andrew.j.skelton73 6.6k

Have you considered gffread from the Tuxedo guys? http://cufflinks.cbcb.umd.edu/gff.html

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by andrew.j.skelton73 6.6k

0

Entering edit mode

It looks like that requires a gff file, as the name implies, but I could be wrong. I'm not sure that will be helpful in this case since the input is just a list of IDs.

ADD REPLY • link 10.6 years ago by SES 8.6k

score 0 · Answer 4 · 2014-04-16

0

Entering edit mode

10.6 years ago

Matt Shirley 10k

In addition to using seqtk, see this post: Pyfaidx: Efficient, "Pythonic" Random Access To Fasta Files Using Samtools-Compatible Indexing. You can use pyfaidx either by importing the `Fasta` class and using it like a dictionary in a Python script, or using the CLI script in the same way as seqtk.

ADD COMMENT • link 10.6 years ago by Matt Shirley 10k

score 0 · Answer 5 · 2017-02-10

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-f", "--fasta", help="Input file (.fasta format)",  required=True)
parser.add_argument("-i", "--id", help="Seq id",  required=True)
args = parser.parse_args()

for record in SeqIO.parse(args.fasta,'fasta'):
    if record.id == args.id:
        print record.seq
        break

python extract_id_fasta.py -f fastafile.fa -i seq_id