how to subset sequences based on IDs, but order shall remain preserved?
3
1
Entering edit mode
6.8 years ago
majeedaasim ▴ 60

I have a fasta file and an ID list file. I try to subset fasta sequences from the fasta file based on ID list using

seqtk subseq <fastafile> ID_list > out_file

However in the output file the order of the sequences is not preserved. I need the sequences in the outfile in the same order as in the ID list . Thanks

seqtk • 3.5k views
ADD COMMENT
0
Entering edit mode
6.8 years ago
Prakash ★ 2.2k

I hope this perl script will solve your purpose Usage : perl script.pl Id_file file.fasta Output.fasta

#!/usr/bin/perl

@file1 = `cat $ARGV[0]`;
@file2 = `cat $ARGV[1]`;
%hash;
$id ;
foreach $fasta(@file2){
        chomp($fasta);
        if($fasta =~ '>'){
                $id = $fasta;}
        if($fasta !~ '>'){
                push @{$hash{$id}},$fasta;      }
}
open(OUT,">$ARGV[2]") or die;
foreach $id(@file1){
        chomp($id);
        $id =~s/^/>/;
        #print "$id\n";
        @seq;
        print  OUT "$id\n";
        foreach $key(keys %hash){
                if($id eq $key){
                        @seq = @{$hash{$key}};}}
                for($i =0; $i<scalar(@seq);$i++){
                        print OUT "$seq[$i]\n";}
}
ADD COMMENT
0
Entering edit mode

Thanks prakash but it produces an unusual result It doesn't subset properly

ADD REPLY
0
Entering edit mode

If I am not wrong you need fasta sequence for selected Id ? right?

ADD REPLY
0
Entering edit mode

yes I have fasta sequence file and an ID file. THe sequences need to be extracted from the sequence file according to the entries in the ID file but the order of sequences in the outfile shall be same as ID file entries.

ADD REPLY
0
Entering edit mode

can you please provide few example of Id ?

ADD REPLY
0
Entering edit mode

sequence file

>abc
atgcggatacagataca
>def
atgcggccaaaatt
>ghi
atggggaattcacaac
>jkl
aggatattatatccg

Now filter according to Ids list like

abc
jkl
ADD REPLY
0
Entering edit mode

I have edited the script, I hope its working fine now

ADD REPLY
0
Entering edit mode

I tried it but it produces the same contents as ID file. NO sequences are extracted.

ADD REPLY
0
Entering edit mode

I think your Id file contains sequence id as >abc , >jkl. So accordingly i have edited the script. If that is the case , then the script should give intended result.

ADD REPLY
0
Entering edit mode

My ID file entries are like

Tw_Wn_DN127031_c0_g1_i5
Tw_Wn_DN118882_c0_g1_i2
Tw_Wn_DN121684_c0_g1_i2
Tw_Wn_DN127037_c0_g1_i1
Tw_Wn_DN134530_c2_g4_i7
Tw_Wn_DN117797_c0_g1_i1

the script says Died at script.pl line 14.

ADD REPLY
0
Entering edit mode

The above script should work for input file provided below:

Fasta_file

>EG01
AATTCAGAAAAGTGCCATAAAACAGGCAAATATAGCACCTATCTTCGATTCCCGATTGAGAAAGCTCGAATCATTAAGAA
GATGTAGGTTGGGTTGAGGAACGAGACCCAGCAATTCGATTCCATGTTGGATTAAACAATATCAACCAAGAGTTTGATGT
>EG02
CAGGATAACTCCAATTTTTCTTTTAGAGTCAAGCAGCCATTACAAGGAAACACAATTAGATTGGGCTAGTCCTTACCTTA
AAAAGAATTGTCACAAGTTCCTTGGCTGTAAAAGGTTTTAATAGGAATGCTGCAACTCCACCACTTTTTTCTACTTCTGC

Sequence_id_file

EG01

Output_filename

>EG01
AATTCAGAAAAGTGCCATAAAACAGGCAAATATAGCACCTATCTTCGATTCCCGATTGAGAAAGCTCGAATCATTAAGAA
GATGTAGGTTGGGTTGAGGAACGAGACCCAGCAATTCGATTCCATGTTGGATTAAACAATATCAACCAAGAGTTTGATGT

Follow below usage to avoid any error

perl script usage : perl script.pl Sequence_id_file  Fasta_file Output_filename
ADD REPLY
0
Entering edit mode
6.8 years ago
Hugo ▴ 380

Hello,

we are currently developing SEDA http://www.sing-group.org/seda, a multi-platform application for processing and filtering FASTA files. I guess that what you need can be achieved with the "Pattern filtering" option, section 3.2 of the manual: http://www.sing-group.org/seda/downloads/manuals/seda-user-manual-1.0.0.pdf. In this option, you should select "Header" and then load your ID list using the "Import patterns" button. This way only sequences that contains a ID in your list will be kept.

Regards,

Hugo.

ADD COMMENT
0
Entering edit mode

Thanks, but link is not working

ADD REPLY
0
Entering edit mode

They should be working now.

ADD REPLY
0
Entering edit mode
6.8 years ago
GenoMax 147k

I am certain faSomeRecords utility from Jim Kent (UCSC) will do what you need.

ADD COMMENT

Login before adding your answer.

Traffic: 1832 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6