Hi,
How to extract fasta sequence from an huge 3gb fasta file by giving sequence id as input using perl, Thanks in advance.
Hi,
How to extract fasta sequence from an huge 3gb fasta file by giving sequence id as input using perl, Thanks in advance.
If I read your question correctly, you have many sequences in a large file and you want to retrieve certain sequences by some ID.
One way to do this using Perl is first to index the file. If you install Bioperl and its accessory scripts, you can do this using bp_index.pl:
This example assumes that your fasta sequences are in myfile.fa in the current directory and you want to create the index file, myIndex, also in the current directory:
bp_index.pl -dir . -fmt fasta myIndex myfile.fa
You can then retrieve by ID using bp_fetch.pl. Assuming that you are in the same directory and you want the sequence with ID myID, something like:
bp_fetch.pl -dir . -fmt fasta myIndex:myID
It's been some time since I used these tools, so you should check the syntax and read up on them at the Bioperl website. In particular, it's possible to set up environment variables which point to a location with sequence indices, which can make life easier.
Hi Neilfws,
I used below cmd to extract RNA seq from fasta file using ID number from text file:
bp_index.pl -dir . -fmt fasta myIndex ./nr_latest.fasta
time bp_fetch.pl -dir . -fmt fasta myIndex:sequence.gi.txt
### Error: Sequence sequence.gi.txt in Database myIndex is not present.
Can you please give me any suggestion.
I would really appreciate your help.
Thanka in advance!
Naresh
What about calling fastacmd through a system command?
First build the database using
formatdb -i pathtodb -p F -o T
the last option is important.
Next you can use
fastacmd -d pathtodatabase -s seqID1,seqID2,...
option -L start,stop
allows to retrieve subsequence
integrating in a system()
command makes it useable in perl code.
This Also works if you have BioPerl. You need to have the IDs that you want to extract them from fasta file in a separate file. You can use grep to generate that file.
#!/usr/bin/perl
#######################################################
# Author: Hamid Ashrafi #
# email: ashrafi@ucdavis.edu
# Pupose: It reads the fasta file and another file with
# the IDs that you want to extract from the FASTA file
# then print the IDs and seqencing that are matched in
# the FASTA file.
#
#Requirement. Bio Perl
######################################################
use strict;
use Bio::DB::Fasta;
my $database;
my $fasta_library;
my %records;
open IDFILE, "NBS_LRR_IDs.txt" or die $!;
open OUTPUT, ">NC_003070_NBS-LRR.faa" or die $!;
# name of the library file - (here it is hardcoded)
$fasta_library = 'NC_003070_all_chr.faa';
# creates the database of the library, based on the file
$database = Bio::DB::Fasta->new("$fasta_library") or die "Failed to creat Fasta DP object on fasta library\n";
# now, it parses the file with the fasta headers you want to get
while (<IDFILE>) {
my ($id) = (/^>*(\S+)/); # capture the id string (without the initial ">")
my $header = $database->header($id);
#print "$header\n";
print ">$header\n", $database->seq( $id ), "\n";
print OUTPUT ">$header\n", $database->seq( $id ), "\n";
}
exit;
Actually, I use this perl script to extract some genes from genome, but I don't know it would be working in your case.
#!/usr/bin/perl
use Bio::Perl;
use Bio::SeqIO;
use IO::String;
use Bio::SearchIO;
#you can input you genes' name in the array or read it from other files.
my @genes_name=qw(FGSG_01562 FGSG_02646 FGSG_05604 FGSG_05896 FGSG_06871 FGSG_07409 FGSG_07952 FGSG_08962 FGSG_09852 FGSG_09945 FGSG_10286 FGSG_11792);
my $filename='fusarium_graminearum_3_proteins.fasta';
my $gb = Bio::SeqIO->new(-file => "<$filename",
-format => "fasta");
my $fa = Bio::SeqIO->new(-file => ">gball.fa",
-format => "fasta",
-flush => 0); # go as fast as we can!
while($seq = $gb->next_seq) {
#Sorry! Here would be with problem, if we use this "if (grep {$_=$seq->id} @genes_name;"
$fa->write_seq($seq) if (grep {$_ eq $seq->id} @genes_name);
}
If you can't install BioPerl, this vanilla Perl should work quite fast if you expect only one exact ID match:
#!/usr/bin/perl
# usage: extractSeqByID.pl SEQ123 < huge.fsa > my.fsa
use warnings;
use strict;
my $lookup = shift @ARGV; # ID to extract
local $/ = "\n>"; # read by FASTA record
while (my $seq = <>) {
chomp $seq;
my ($id) = $seq =~ /^>*(\S+)/; # parse ID as first word in FASTA header
if ($id eq $lookup) {
$seq =~ s/^>*.+\n//; # remove FASTA header
$seq =~ s/\n//g; # remove endlines
print "$seq\n";
last;
}
}
Simplest and I'd wager the fastest method:
Use Jim Kent's utils.
Specifically, faSomeRecords
:
---> faSomeRecords
faSomeRecords - Extract multiple fa records
usage:
faSomeRecords in.fa listFile out.fa
options:
-exclude - output sequences not in the list file.
Or, if it really is only one sequence you are after, use faOneRecord
---> faOneRecord
faOneRecord - Extract a single record from a .FA file
usage:
faOneRecord in.fa recordName
I needed to do the same thing, but I want to use a file with all the IDs in it, that is taken as an argument of the program (at the end I want to create automatically IDfiles and to a pipeline)
I am a perl beginner, so it might be not clean, but I modified the script above and in my hands it is working.
And, also, thank you for this script, H. Ashrafi!
#!/usr/bin/perl
#######################################################
# Author: Hamid Ashrafi
# email: ashrafi@ucdavis.edu
# Pupose: It reads the fasta file and another file with the IDs that you want to extract from the FASTA file
# then print the IDs and seqencing that are matched in the FASTA file.
#
# Requirement. Bio Perl
#
# Modified by Aurelie K
# -> put the names of files in arguments (and also use the STDOUT for the output file)
# -> suppr index file generated during the script
######################################################
#usage = path_to/fasta_getseq.pl path_to/fasta_library path_to/IDFILE > log
if ($#ARGV != 1) {
print "usage: fasta_getseq.pl IDFILE fasta_library > outputfile\n";
exit;
}
use strict;
use Bio::DB::Fasta;
my $database;
my $fasta_library = $ARGV[1]; #path to fastalibrary in the second argument
my %records;
open IDFILE, "<$ARGV[0]" or die $!; #first argument is the path of the file containing all the IDs you need to extract
open OUTPUT, <STDOUT>;
# creates the database of the library, based on the file
$database = Bio::DB::Fasta->new("$fasta_library") or die "Failed to creat Fasta DP object on fasta library\n";
# now, it parses the file with the fasta headers you want to get
while (<IDFILE>) {
my ($id) = (/^>*(\S+)/); # capture the id string (without the initial ">")
my $header = $database->header($id);
#print "$header\n";
print ">$header\n", $database->seq( $id ), "\n";
print OUTPUT ">$header\n", $database->seq( $id ), "\n";
}
#remove the index file that is useless for user
unlink "$fasta_library.index";
#close the filehandles
close IDFILE;
close OUTPUT;
exit;
I use the following codes to extract fastas from a database and print each of my targets into different fastas.
#!/usr/bin/perl
use Bio::Perl;
use Bio::SeqIO;
use IO::String;
use Bio::SearchIO;
open TF,'File_contain_target_fasta_IDs.txt'
or die "Couldn't open this file: $!";
chomp (@genes_name=<TF>);
my $filename='Fasta_database.fasta';
my $gb = Bio::SeqIO->new(-file => "<$filename",
-format => "fasta");
while($seq = $gb->next_seq) {
my @temp_array=grep {$seq->id eq $_}@genes_name;
#this step is import for correct output
foreach (@temp_array){
#Only one element is given to $string, which is the key point for writing single sequence into one fasta file
my $string=$_;
#print $string,"\n";
my $stringio = IO::String->new($string);
#If the foreach function is not used here, the defaut $_, actually, here would be many $_ been transfered into $string, and this is resulted in wrong sequences written into one fasta file.
my $out = Bio::SeqIO->new(-fh => $stringio,
-format => 'fasta');
# output goes into $string
$out->write_seq($seq);
# modify $string
#$string =~ s|(>)(\w+)|$1$2|g;
# print into STDOUT
#print $string;
#subrountin for writing sequences into separated fasta file
write_into_fasta($seq->id,$string);
}
}
sub write_into_fasta {
my $fh=$_[0];
open $fh,">$fh.fasta" or die "Couldn't open this file: $!";
print $fh $_[1],"n";
close $fh;
}
in terms of performance, from all open-source options that I've tried, the best so far has been cdbfasta:
http://sourceforge.net/projects/cdbfasta/
You first create an index, and then put all your sequence ids in a text file, and use cdbyank to fetch them like this:
/path/to/cdbyank big_file.fasta.cidx < list_of_ids.txt -o my_ids.fasta
but if i have multiple sequences to be extracted then what should i do? if i put all sequence id in one text file and give it as input then the above program is not working
I've written a tool (called xcerpt) which takes a list of read ID's, and then scans a fasta file to extract them. No indexing or other work necessary, but of course, every time you call it, it will scan the whole file until all requested seqences are found (3GB ~ half a minute). http://hackage.haskell.org/package/clustertools
I have changed a few of cheng zhongshan's script.just added to read from a file where i expect the ids should by separated by new line. here it goes:
#!/usr/bin/perl -w
use Bio::Perl;
use Bio::SeqIO;
use IO::String;
use Bio::SearchIO;
my $input = shift;
my $id_file = shift;
my $output = shift;
my @genes_name;
open(FILE,$id_file) or die "Can't open $id_file";
while (<FILE>) {
my $id =$_;
chomp($id);
push @genes_name, $id;
}
my $gb = Bio::SeqIO->new(-file => "<$input",
-format => "fasta");
my $fa = Bio::SeqIO->new(-file => ">$output",
-format => "fasta",
-flush => 0); # go as fast as we can!
while($seq = $gb->next_seq) {
#Sorry! Here would be with problem, if we use this "if (grep {$_=$seq->id} @genes_name;"
$fa->write_seq($seq) if (grep {$_ eq $seq->id} @genes_name);
}
It's not Perl, but pyfasta
is really easy to use for this purpose. From the help text:
Usage: extract some sequences from a fasta file. e.g.:
pyfasta extract --fasta some.fasta --header at2g26540 at3g45640
The easiest is probably to use 'agrep':
agrep -d '^>' regex
will extract all sequences matching the regular expression regex. If you have a rough idea of sequence length, and you don't have agrep
installed, you can use -A
option to grep to look for the header line and extract a fixed number of lines after it. Or you can use awk
:
awk 'BEGIN { RS=">" } /regex/ { print }'
which is equivalent to the agrep command.
This code, although interesting, is risky to use. First, it removes the '>' in front of the sequence name. Second, it matches anything between the record markers, not only the name. What if my sequence name is ATT and ATT is found in many of the sequences or, worst, one of the sequences is ATT? Third, unless you specify a beginning of line and end of line in your regex, it will also recover partial matches (like getting sequence222 and sequence22 when searching sequence2).
Hi all, so I know this post is from a long time ago, but I,ve been searching for an easier way to do this, and in the original question he says he wants to use perl... but there's the, not that new, easel miniapps that comes with HMMER, first you need to index your fasta with
esl-sfetch --Index seq_file.fasta
and then you simply do this command
esl-sfetch seq_file.fasta "headeryouwanttoretrieve"
or from a list file with the headers of the sequences you want (with no ">")
esl-sfetch -f seq_file.fasta myhits.list > myhits.fasta
you can see more of this in here:
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
I've modified your original question, as it was not very clear. You should put an example of your input file and an example of your output. Is your input file a fasta file?