!/usr/bin/perl -w

Question

Extract Multiple Fasta Sequences From A File Containing Many Sequences In Lines And Rows

2

Entering edit mode

14.0 years ago

Chris ▴ 30

Hi, I have this problem. I have a fasta file containing the whole genome of a arabidopsis plant. I also have a list of genes I need to extract the sequences from the fasta file. Each line contains 2 or 3 or 4 sequences. I need to create a fasta file for each of the sequences in the each line. Any help? I visualize it for simplicity.

sequences I need to get fast for:

AT1G63760    AT1G05890    AT2G31510
AT3G30122    AT1G08410    AT2G27200
AT1G56390    AT1G09210    
AT1G18191    
AT3G32445    AT1G18500    AT1G74040
AT2G34760    AT1G20010    AT1G75780    AT1G74040
AT1G60480    AT1G23490    AT1G70490
AT4G10414    AT1G33230    AT4G10430
AT4G06748    AT1G52410    AT3G15950
AT2G06822    AT1G52930    AT3G15460
AT3G43820    AT1G62810    AT4G12290
AT2G11280    AT1G63640    AT5G41310
AT2G15120    AT2G15090    AT4G34250

>AT1G63760
ATGATGGATTCCGATGATGATATGCTCGATGCCCACGATATGGACTCGGTAGATTATGATTTTGACAGCGGCGGCACCGATGATGACAAC

>AT1G05890
GCAAAAGTCCATATATCTTTCTCTCATCTACTCGATTTCTGAATCGCGAACGCAGCGAGC
GATCCGGAATCGAGAGAGAGAGCTACCAATTTTTCCAACTTGTTCGGTTCCTTATAAAGC
TGTTTACTTTCCCTTGCGAATTCTCTCTCTCTCTACTATAAAGCAATACCCTTTTCTTTC
TTTTTTGGTAACCCCCAAACCCTACTTTGTCCAGCGAGAAAGGAAGAGGGGTTTCGTCTG

>AT2G31510
ATGGATTCTGAAGAAGACATGCTCGATGCGCACGATATGGAGTCTGGAGAGGATGATTTC
TACAGCGGTGGAACTGATGATTGTAATGATAGTGATGATGGTGAACCTGATTATGGGTTT
GTTGAGGAAGATGCTGATGATTCTGCTATGATCGCCTCTCATCGCTCTCAGGTGGGTTTT
TGTTTTTGTTTCTTAATTCATTTTGGTGTTGGTTGTTGAAGCTAATTGCTTTTAGAACTC

>AT1G18191
AAAATTGAAGACGAAGAGTTTAAGACTCTATCAGATCAGAAGGTCTCTATTTTCTCGAAACGTCGGCTTTAGAGAACCAAAACGTC

>AT1G56390
GCAATGGTGAAACTAGACTCTAAACTCATCTCTATGATTGTTTTCGGTATCGTGGTAATCGTCTCTGCTG

>AT1G09210
GCTAGCTCCTCTCCTCGCGGTATATATAAGCTCCAGGTCTTGTACATCTTCATCATCTGA
TCTCGGGGAAGCTCCGATCTGAGTTTTTTTTAGCAATGGCGAAAATGATTCCTAGCCTCG
TCTCTCTAATTCTTATCGGTCTTGTTGCGATCGCCTCCGCCGCAGTTATTTTCGAGGAGC
GCTTTGATGGTATCTAATTTCTACATCTCTATCTCTATACTCTATCTTCCTGATGATGCG

>AT2G34760
ATGCGTAAACAAAGTTTTAAGATAGTTATGATTCGTTTTTTGAGAGTCAATAACAAAAATTATCCGATC

>AT1G20010
AACAGAGACAATTGGTTATATTAGCTGTCACTCCCATCTTTCATATTCCTTCACCATCTC
TCTCTCTCTCGATCTTGTGAACCACTACACACACTAACACAATGAGAGAGATCCTTCACA
TTCAAGGTGGTCAATGCGGTAACCAGATTGGTTCCAAGTTCTGGGAAGTCATCTGCGACG
AGCACGGCATCGATTCCACCGGACGTTACAGTGGAGACACTGCAGATCTCCAGCTTGAAC
GTATCAATGTCTATTACAATGAAGCTTCAGGTGGAAGATACGTTCCTCGTGCTGTTCT

>AT1G75780
ATCTCCAGATCCCAAAATCTTCATCGATCATCATCATCATGAGAGAAATCCTCCACGTCC
AAGGCGGCCAATGCGGTAACCAAATCGGTTCCAAATTCTGGGAAGTTATCTGCGACGAAC
ACGGCGTTGATCCCACCGGACGTTACAACGGTGATTCCGCCGATCTTCAGCTCGAACGTA
TCAATGTTTATTACAATGAAGCTTCTGGTGGTCGTTACGTTCCTCGTGCTGTTCTCATGG

>AT1G74040
AAAGTAGTAACCAGAGACACTGTGCCGTCGCCCGTCGCCGCCGCCGCCACACTATCATCT
CTCTCAGGTTTTTGATTTTCCACGGCAATGGAGTCTTCGATTCTCAAAAGCCCTAATCTC
TCTTCACCATCGTTCGGTGTACCTTCAATTCCCGCCTTATCCTCCTCCTCCACCTCACCA
TTTTCATCTCTTCATCTCCGATCACAGAACCACCGTACCATCTCTCTTACCACCGCCGGA
AAATTCCGTGTCTCGTATTCTCTCTCCGCTTCTTCACCTCTACCACCTCATGCTCCTCGC
CGTCGTCCCAATTACATCCCTAACCGTATATCCGATCCCAATTACGTCAGAATCTT

sequence retrieval fasta • 9.7k views

ADD COMMENT • link updated 14.0 years ago by DG 7.3k • written 14.0 years ago by Chris ▴ 30

1

Entering edit mode

What have you tried so far? Have you looked at the related questions on the right?

ADD REPLY • link 14.0 years ago by Michael Kuhn 5.0k

0

Entering edit mode

Hi have script pick up a sequence per line. Not sure how to modify it so that I can have many sequences per line picked up and then saved in many files each file for 1 line of sequences.

ADD REPLY • link 14.0 years ago by Chris ▴ 30

0

Entering edit mode

!/usr/bin/perl -w

use strict;

my $idsfile = "gene1.txt"; my $seqfile = "TAIR9seq.fasta"; my %ids = ();

open (OUT,">selectionseqs.fasta");

open FILE, $idsfile; while([?]) { chomp; $ids{$_} += 1; } close FILE;

local $/ = "n>";

open FASTA, $seqfile; while ([?]) { chomp; my $seq = $_; my ($id) = $seq =~ /^>(S+)/; if (exists($ids{$id})) { $seq =~ s/^>.+n//; $seq =~ s/n//g; print "$seqn"; print "$seqn"; print OUT ">$idn$seqn"; } } close FASTA;

ADD REPLY • link 14.0 years ago by Chris ▴ 30

score 2 · Answer 1 · 2011-07-13

If I understand the question correctly, you want to create a separate fasta file for each line in your list of IDs and the file will contain the number of sequences listed on the line.

You have 2 choices when trying to match an ID list to a fasta file of sequences.

Get each ID, loop through the sequences, match ID to header and retrieve sequence
Index the sequence file first, then loop through the IDs and retrieve each sequence

The first approach becomes very slow for many sequences, since you have to loop through all sequences for each ID. So it's generally better to index the sequence file.

Install Bioperl and index the fasta file, seqs.fa, using this command:

bp_index -dir . seqs.idx seqs.fa

Now you can open the file of IDs, ids.txt, split each line on space and use bp_fetch to pull each ID in turn from the index and write to a new fasta file. Something like this should work:

#!/usr/bin/perl -w

use strict;

open INFILE, "ids.txt";
while(<INFILE>) {
  chomp;
  my @ids = split(/\s+/);
  my $outfile = join("_", @ids).".fa";
  open OUTFILE, ">$outfile";
  foreach my $id(@ids) {
    print OUTFILE `bp_fetch -dir . seqs.idx:$id`;
  }
  close OUTFILE;
}
close INFILE

In this example, each new fasta file will be named according to the IDs on each line, separated by underscores; e.g. AT1G56390_AT1G09210.fa.

score 1 · Answer 2 · 2011-07-12

1

Entering edit mode

14.0 years ago

Ido Tamir 5.2k

notseq from EMBOSS can read in a multi sequence file and a list of identifiers (wildcards) and write the non matching sequences to one file the matching ones to a different file (-junkout). You want the junkout file. You can specify a file with the identifiers to filter e.g. '@names.dat'.

this is the command that you need:

ruby -nae 'system("notseq -sequence fasta.txt -sformat1 fasta -osformat fasta -osname ignore.fasta -junkoutseq "+ $..to_s()+".matches.fasta -auto -exclude "+$F.join(",")) ' identifiers.txt

ADD COMMENT • link 14.0 years ago by Ido Tamir 5.2k

0

Entering edit mode

I am not sure if this is what I want to do. I need to have for every line of IDs one fasta file so that I can make a multiple alignment later.

ADD REPLY • link 14.0 years ago by Chris ▴ 30

0

Entering edit mode

I added a ruby command that does this in a loop. Your genome goes into fasta.txt your identifiers as they are into identifiers.txt.

ADD REPLY • link 14.0 years ago by Ido Tamir 5.2k

score 1 · Answer 3 · 2011-07-13

If you use Bioperl and SeqIO you can have your fasta file stored as a proper data structure (alternatively you can write your own fasta parser and store the sequences in a hash of your own).

Parsing the IDs is simple, process each line, on each line split on whitespace so you have an array of IDs. Then it is a simple matter of grabbing the sequences from the hash using the ID key and printing them to a file.

I prefer keeping my sequences in a hash data structure myself over Bio::SeqIO in most cases as I can easily pull out sequences by their ID given a list of IDs.

score 0 · Answer 4 · 2011-07-12

0

Entering edit mode

14.0 years ago

Pierre Lindenbaum 166k

di you try the Sequence Bulk Download and Analysis for Arabidopsis ?

http://www.arabidopsis.org/tools/bulk/sequences/index.jsp ?

ADD COMMENT • link 14.0 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

this will only give me a big fasta file with the ids I have in my list. So again will not do what I want.

ADD REPLY • link 14.0 years ago by Chris ▴ 30