Perl How To Isolate Fasta Sequences With A Range Of Values
6
4
Entering edit mode
14.2 years ago
Raghul ▴ 200

Hi to all, I have a fasta file with lots of sequences with the description not like NCBI.But the file has keyword mentioning the length of the sequence. I want to extract sequences only of, say upto 100 bp in length.How can I do this with perl?

the format is as follows:

>gene1  group5  length=84  
AaaaaatGTCTGGATGAGTTCATCCTGTAAAAaTTGCTGTCTGATACAAATACTTTGCTT
AGTCCAGTTAAATCTTCACTACTTTTGTGCACTGAAAGGCTAGCTTTtCTTCCAAAGCGG
TTTTCAATAATTCCTCTGACGCCTCCTTTTTtAGAGTATTTATTGTGTCTTCTATTTCCT

thanx raghul

Yes the length of the sequence is is not as mentioned. I did that for brevity(sorry!). I have many sequences of varying length. I am trying to write a program counting the strings. But I also felt it will be easy if I use the keyword length in FASTA description line to extract sequences within a range of values. thanx once again.raghul

NCBI fasta description is as follows,It has gi no.followed by reseq ID & organism etc which is NCBI way of describing the sequence but Mine is not so when u observe.

>gi|159476307|ref|XM_001696201.1| Chlamydomonas reinhardtii strain CC-503 cw92 mt+
CACAGTACCTTTCTGGTCAGCTGCACTGCATTGCTTTGTGACTAGTGAAGCTTCGACAGCTCACTGCGGA
CATTCCAAAATTGCTGTAACTCGACATTGATTTAACTACAGTATGCTGTTATATCCATAGCGCAAGAGAG
CTTGCGGCTTGCCTCCCCTCCATGCTCTTGTAGTCTGAGCCTATCCAGCTGCCTCGTCGCCGTTTGCAAA
GTTTTATTACTGAGACACAAGTAGCAGGGGCCGAGCAGGCAGCTGCCTGCGAGGCCGGTGAACCACGCGG
sequence parsing fasta perl • 7.3k views
ADD COMMENT
1
Entering edit mode

This example is somewhat confusing, since the length clearly is not 84. Also, what is meant by "the description not like NCBI" ? Do you mean the format is not like the NCBI description of FASTA format ?

ADD REPLY
0
Entering edit mode

This example is somewhat confusing, since the length clearly is not 84.

ADD REPLY
0
Entering edit mode

OK, so the sequence is valid fasta format; that's what I needed to know. The length description will work (as in Cass' answer, below), provided that it is correct, otherwise better to use the calculated length of the sequence.

ADD REPLY
0
Entering edit mode

thanx it worked for a newbie like me

ADD REPLY
10
Entering edit mode
14.2 years ago
Neilfws 49k

Most programming languages have libraries to parse Fasta (and many other sequence formats). Here's a quick example using the Bio::SeqIO library from BioPerl. It will create a new file of sequences 100 bp or less in length.

#!/usr/bin/perl -w

use strict;
use Bio::SeqIO;

my $seqin  = Bio::SeqIO->new(-file => "myfile.fa",      -format => "fasta");
my $seqout = Bio::SeqIO->new(-file => ">myfile_100.fa", -format => "fasta");
while(my $seq = $seqin->next_seq) {
  if($seq->length <= 100) {
    $seqout->write_seq($seq);
  }
}

Similar solutions exist in BioPython, BioRuby, BioJava. There are pure "non-library" solutions, but I always recommend learning how to use a Bio* library. You can also search this site for "fasta parsing" to see more advice.

ADD COMMENT
0
Entering edit mode

I want to search sequences with length greater than 100. Will this program work if I change the sign > ie. if($seq->length >= 100) {

thank you.raghul

ADD REPLY
0
Entering edit mode

Try it and see :-) Yes, it will.

ADD REPLY
4
Entering edit mode
14.2 years ago

Since neilfws mentioned Biopython, here is a solution using it, although this does not satisfy the OP requirements:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
from Bio import SeqIO

fasta_file = sys.argv[1]  # Input fasta file
min_length = int(sys.argv[2]) # Minimum length of sequence
result_file = sys.argv[3] # Output fasta file

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
end = False
with open(result_file, "w") as f:
    for seq in fasta_sequences:
        if len(seq.seq.tostring()) > min_length:
            SeqIO.write([seq], f, "fasta")

Should be used as follows (after saving to script.py and making the script executable):

script.py input.fasta 100 output.fasta

Cheers

ADD COMMENT
0
Entering edit mode

Thanks for the answer. I use perl & currently learning to use Bioperl. I dont know python. So i am yet to try it. Thank u very much.raghul

ADD REPLY
3
Entering edit mode
14.2 years ago
Cassj ★ 1.3k
perl -ne '/length=(\d+)/; print if $1>=100' file.fa

Seems to work. Neil's BioPerl method is far more reliable though ;)

ADD COMMENT
3
Entering edit mode
14.2 years ago
Rm 8.3k

Newer BLAST version(s) have blastdbcmd which is the successor to fastacmd. You can use range option in it. It selects the range of a sequence to extract.

blastdbcmd -range "< String >"

Range of sequence to extract (Format: start-stop)

In your case 1-100

ADD COMMENT
0
Entering edit mode

Hi Thanx for the suggestion. Mine is local file with lot of sequences.blastdbcmd only works for blast formatted databases. My sequences do not follow NCBI fasta descriptions(u can find the differnce as explained in the question), so I have problems with formatting & then using it. If you have a solution for this,please let me know.

thanks once again raghul

ADD REPLY
0
Entering edit mode

">gene1 group5 length=84 " with above descriptions you can still format it in blast and use it give it a try

ADD REPLY
0
Entering edit mode
13.3 years ago
Yannick Wurm ★ 2.5k

Ruby one-liner for the first option (with length in the title):

ruby -ne 'BEGIN{$/="\n>"}; length = $_.match(/length=(\d+)/)[1].to_i; print $_ if length <= 100' < in.fasta > out.fa
ADD COMMENT
0
Entering edit mode
13.3 years ago
Palu ▴ 290

I have suggested one option in this thread

I have tested it by my own. It worls like gem

ADD COMMENT

Login before adding your answer.

Traffic: 1551 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6