perl programming for finding sequenc
2
1
Entering edit mode
8.5 years ago
bulbul ▴ 10

I have a de novo assembly file "denovo.fasta", and another file with 25,000 genes id "gene_id.txt". how should i retrive those sequences from the "denovo.fasta" which are only present in "gene_id.txt" file?

RNA-Seq Assembly • 1.5k views
ADD COMMENT
1
Entering edit mode

Can you please post an example of "denovo.fasta" and "gene_id.txt". Can awk be used or is perl desired? Thanks.

ADD REPLY
1
Entering edit mode
8.5 years ago
Daniel ★ 4.0k

If your fasta file is in one line (no linebreaks within the DNA) this will do it.

grep -w -f gene_id.txt -A 1 denovo.fasta >present_only.fasta

Searches with each line in the gene_id.txt file, only looks for whole words (incase some of your IDs are shorter versions of others) and when it finds it, prints that line and the one after it.

If you need to convert a multi-line fasta into a single-line fasta before starting, I use this simple script:

#!/usr/bin/perl
chomp(@lines = <>);

foreach $line (@lines){
        if($line =~ /^>/){
                $line = "\n$line\n";
        }
        if($line != /^$/){
                print "$line";
        }
}
ADD COMMENT
1
Entering edit mode
8.5 years ago
JC 13k

Considering your fasta head line is only one word (the id) then hashes are your friend:

#!/usr/bin/perl

use strict;
use warnings;
use autodie;

$ARGV[1] or die "use getSeqs.pl ID_LIST FASTA_IN > FASTA_OUT\n";
my $list = shift @ARGV;
my $fasta = shift @ARGV;
my %wanted =();
open (my $lh, "<", $list);
while (<$lh>) {
    chomp;
    $wanted{$_}++;
}
close $lh;

$/ = "\n>";
open (my $fh, "<", $fasta);
while (<$fh>) {
    s/>//g;
    my ($id) = split (/\n/, $_);
    print ">$_" if (defined $wanted{$id});
}
close $fh;
ADD COMMENT

Login before adding your answer.

Traffic: 1698 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6