Question

Perl How To Isolate Fasta Sequences With A Range Of Values

4

14.8 years ago

Raghul ▴ 200

Hi all, I have a FASTA file with lots of sequences whose description lines do not follow the NCBI style. However, each description line has a keyword giving the length of the sequence. I want to extract only sequences of, say, up to 100 bp in length. How can I do this with Perl?

The format is as follows:

>gene1  group5  length=84  
AaaaaatGTCTGGATGAGTTCATCCTGTAAAAaTTGCTGTCTGATACAAATACTTTGCTT
AGTCCAGTTAAATCTTCACTACTTTTGTGCACTGAAAGGCTAGCTTTtCTTCCAAAGCGG
TTTTCAATAATTCCTCTGACGCCTCCTTTTTtAGAGTATTTATTGTGTCTTCTATTTCCT

Thanks, Raghul

Yes, the length of the sequence is not as stated in the header. I did that for brevity (sorry!). I have many sequences of varying length. I am trying to write a program that counts the characters in each sequence, but I also felt it would be easy to use the length keyword in the FASTA description line to extract sequences within a range of values. Thanks once again. Raghul

An NCBI FASTA description line looks like the following. It has the GI number followed by the RefSeq ID, organism, etc., which is the NCBI way of describing a sequence. Mine does not follow that pattern, as you can see.

>gi|159476307|ref|XM_001696201.1| Chlamydomonas reinhardtii strain CC-503 cw92 mt+
CACAGTACCTTTCTGGTCAGCTGCACTGCATTGCTTTGTGACTAGTGAAGCTTCGACAGCTCACTGCGGA
CATTCCAAAATTGCTGTAACTCGACATTGATTTAACTACAGTATGCTGTTATATCCATAGCGCAAGAGAG
CTTGCGGCTTGCCTCCCCTCCATGCTCTTGTAGTCTGAGCCTATCCAGCTGCCTCGTCGCCGTTTGCAAA
GTTTTATTACTGAGACACAAGTAGCAGGGGCCGAGCAGGCAGCTGCCTGCGAGGCCGGTGAACCACGCGG

sequence parsing fasta perl • 8.3k views

ADD COMMENT • link updated 14.8 years ago by Palu ▴ 290 • written 14.8 years ago by Raghul ▴ 200

1

This example is somewhat confusing, since the length clearly is not 84. Also, what is meant by "the description not like NCBI"? Do you mean the format is not like the NCBI description of FASTA format?

ADD REPLY • link 14.8 years ago by Neilfws 49k

0

OK, so the sequence is valid FASTA format; that's what I needed to know. Filtering on the length in the description will work (as in Cass' answer, below), provided that it is correct; otherwise it is better to use the calculated length of the sequence.

ADD REPLY • link 14.8 years ago by Neilfws 49k
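One way to check whether the length keyword can be trusted is to compare it against the calculated length of each sequence. The following is only a minimal sketch in plain Python (no Bio* library); the function names `parse_fasta`, `declared_length` and `mismatched` are made up for illustration, and the example records are synthetic.

```python
import re

def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)

def declared_length(header):
    """Return the integer after 'length=' in the header, or None if absent."""
    m = re.search(r"length=(\d+)", header)
    return int(m.group(1)) if m else None

def mismatched(lines):
    """List (name, declared, actual) for records whose header length is wrong."""
    out = []
    for header, seq in parse_fasta(lines):
        declared = declared_length(header)
        if declared is not None and declared != len(seq):
            name = header.split()[0] if header.strip() else ""
            out.append((name, declared, len(seq)))
    return out

# Synthetic records: gene1 declares 84 but holds 20 bases, gene2 is consistent.
example = [
    ">gene1  group5  length=84",
    "ACGTACGTAC",
    "ACGTACGTAC",
    ">gene2  group5  length=8",
    "ACGTACGT",
]
print(mismatched(example))
```

If the list comes back empty, filtering on the header keyword and filtering on the calculated length should give the same result.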

0

Thanks, it worked for a newbie like me.

ADD REPLY • link 14.8 years ago by Raghul ▴ 200

Ram · Answer 1 · 2010-10-06

10

14.8 years ago

Neilfws 49k

Most programming languages have libraries to parse Fasta (and many other sequence formats). Here's a quick example using the Bio::SeqIO library from BioPerl. It will create a new file of sequences 100 bp or less in length.

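#!/usr/bin/perl -w

use strict;
use Bio::SeqIO;

# Read FASTA records from myfile.fa; kept records are written to myfile_100.fa
my $seqin  = Bio::SeqIO->new(-file => "myfile.fa",      -format => "fasta");
my $seqout = Bio::SeqIO->new(-file => ">myfile_100.fa", -format => "fasta");
while(my $seq = $seqin->next_seq) {
  # length is calculated from the sequence itself, not read from the header
  if($seq->length <= 100) {
    $seqout->write_seq($seq);
  }
}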

Similar solutions exist in BioPython, BioRuby, BioJava. There are pure "non-library" solutions, but I always recommend learning how to use a Bio* library. You can also search this site for "fasta parsing" to see more advice.

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 14.8 years ago by Neilfws 49k

0

I want to find sequences with length greater than 100. Will this program work if I change the comparison operator, i.e. if($seq->length >= 100) {

Thank you. Raghul

ADD REPLY • link updated 5.9 years ago by Ram 45k • written 14.0 years ago by Raghul ▴ 200

0

Try it and see :-) Yes, it will.

ADD REPLY • link 14.0 years ago by Neilfws 49k
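A small aside on the boundary: `>=` keeps sequences of exactly 100 bp as well, while `>` keeps only those strictly longer than 100. A tiny Python sketch of the difference, using made-up lengths and hypothetical function names:

```python
def keep_longer_or_equal(lengths, threshold):
    """Keep lengths that are at least the threshold (the >= test)."""
    return [n for n in lengths if n >= threshold]

def keep_strictly_longer(lengths, threshold):
    """Keep lengths that exceed the threshold (the > test)."""
    return [n for n in lengths if n > threshold]

lengths = [99, 100, 101]
print(keep_longer_or_equal(lengths, 100))   # [100, 101]
print(keep_strictly_longer(lengths, 100))   # [101]
```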

Ram · Answer 2 · 2010-10-06

Since neilfws mentioned Biopython, here is a solution using it, although it does not satisfy the OP's requirements (it is not Perl, and it keeps sequences longer than a minimum length rather than up to a maximum):

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
from Bio import SeqIO

fasta_file = sys.argv[1]  # Input fasta file
min_length = int(sys.argv[2]) # Minimum length of sequence
result_file = sys.argv[3] # Output fasta file

# Write only the records that are strictly longer than min_length
with open(result_file, "w") as f:
    for seq in SeqIO.parse(fasta_file, "fasta"):
        if len(seq.seq) > min_length:
            SeqIO.write([seq], f, "fasta")

It should be used as follows (after saving it as script.py and making the script executable):

./script.py input.fasta 100 output.fasta

Cheers

Ram · Answer 3 · 2010-10-06

3

14.8 years ago

Cassj ★ 1.3k

perl -ne '/length=(\d+)/; print if $1<=100' file.fa

Seems to work. Neil's BioPerl method is far more reliable though ;)

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 14.8 years ago by Cassj ★ 1.3k
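For readers who want the header-keyword approach spelled out, here is a rough line-by-line sketch in Python. As far as I can tell, the Perl one-liner depends on the captured length staying in effect for the sequence lines that follow each header; the sketch makes that explicit with a `keep` flag. The function name `filter_by_declared_length` is hypothetical, and dropping records that lack a `length=` keyword is a choice made for this sketch, not something the thread specifies.

```python
import re

LENGTH_RE = re.compile(r"length=(\d+)")

def filter_by_declared_length(lines, max_len):
    """Yield the lines of records whose header declares length <= max_len.

    Records whose header has no length= keyword are dropped
    (an assumption of this sketch).
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            # Decide once per header; the decision then applies to the
            # sequence lines that follow until the next header.
            m = LENGTH_RE.search(line)
            keep = m is not None and int(m.group(1)) <= max_len
        if keep:
            yield line

records = [
    ">a length=84", "ACGT", "ACGT",
    ">b length=180", "TTTT",
    ">c", "GGGG",
    ">d length=100", "CC",
]
for line in filter_by_declared_length(records, 100):
    print(line)
```

Like the one-liner, this trusts the header and never looks at the real sequence length.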

Ram · Answer 4 · 2010-10-06

3

14.8 years ago

Rm 8.3k

Newer BLAST versions have blastdbcmd, which is the successor to fastacmd. You can use its -range option, which selects the range of a sequence to extract.

blastdbcmd -range <String>

Range of sequence to extract (Format: start-stop)

In your case: 1-100

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 14.8 years ago by Rm 8.3k
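Note that extracting a range and filtering by length are different operations. Going by the help text quoted above, a range of 1-100 would return positions 1 to 100 of a sequence, whereas the question asks for whole sequences that are at most 100 bp long. The Python sketch below only illustrates that distinction on made-up data; it does not call blastdbcmd, and the function names are hypothetical.

```python
def first_n_bases(records, n):
    """Truncate every sequence to its first n bases (range 1-n, 1-based)."""
    return [(name, seq[:n]) for name, seq in records]

def at_most_n_bases(records, n):
    """Keep only the records whose full sequence is n bases or shorter."""
    return [(name, seq) for name, seq in records if len(seq) <= n]

# Synthetic records: one of 80 bp and one of 200 bp.
records = [("short", "ACGT" * 20), ("long", "ACGT" * 50)]

print([(name, len(seq)) for name, seq in first_n_bases(records, 100)])
print([(name, len(seq)) for name, seq in at_most_n_bases(records, 100)])
```

The first call still returns both records (the long one cut down to 100 bases); the second returns only the short one.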

0

Hi, thanks for the suggestion. Mine is a local file with a lot of sequences. blastdbcmd only works for BLAST-formatted databases. My sequences do not follow NCBI FASTA descriptions (you can find the difference explained in the question), so I have problems formatting the file and then using it. If you have a solution for this, please let me know.

Thanks once again, Raghul

ADD REPLY • link 14.8 years ago by Raghul ▴ 200

0

With description lines like ">gene1 group5 length=84" you can still format the file as a BLAST database and use it. Give it a try.

ADD REPLY • link 14.8 years ago by Rm 8.3k

Ram · Answer 5 · 2011-08-31

0

13.9 years ago

Yannick Wurm ★ 2.5k

Ruby one-liner for the first option (with length in the title):

ruby -ne 'BEGIN{$/="\n>"}; length = $_.match(/length=(\d+)/)[1].to_i; print $_ if length <= 100' < in.fasta > out.fa

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 13.9 years ago by Yannick Wurm ★ 2.5k
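The Ruby one-liner sets the input record separator to "\n>" so that each read returns a whole FASTA record rather than a single line. The same record-at-a-time idea can be sketched in Python by splitting the text on "\n>". This is only an illustration under a few assumptions: the whole file fits in memory, the names `split_records` and `filter_records` are made up, and records without a `length=` keyword are skipped.

```python
import re

def split_records(text):
    """Split FASTA text into records, each starting with '>' and ending in a newline."""
    body = text.strip()
    if not body:
        return []
    if body.startswith(">"):
        body = body[1:]
    # Splitting on "\n>" removes the '>' markers, so put one back per record.
    return [">" + rec.rstrip("\n") + "\n" for rec in body.split("\n>")]

def filter_records(text, max_len):
    """Return the records whose header declares length <= max_len, joined as text."""
    kept = []
    for rec in split_records(text):
        header = rec.split("\n", 1)[0]
        m = re.search(r"length=(\d+)", header)
        if m and int(m.group(1)) <= max_len:
            kept.append(rec)
    return "".join(kept)

text = ">a length=120\nAAAA\n>b length=84\nACGT\nACGT\n>c length=100\nGG\n"
print(filter_records(text, 100), end="")
```

Restoring the ">" on each record individually keeps the output valid FASTA even when the first record in the file is one of those filtered out.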

Ram · Answer 6 · 2011-08-31

0

13.9 years ago

Palu ▴ 290

I have suggested one option in this thread.

I have tested it myself. It works like a gem.

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 13.9 years ago by Palu ▴ 290