Question

perl programming for finding sequenc

1

Entering edit mode

8.9 years ago

bulbul ▴ 10

I have a de novo assembly file "denovo.fasta", and another file with 25,000 genes id "gene_id.txt". how should i retrive those sequences from the "denovo.fasta" which are only present in "gene_id.txt" file?

RNA-Seq Assembly • 1.7k views

ADD COMMENT • link updated 8.9 years ago by JC 13k • written 8.9 years ago by bulbul ▴ 10

1

Entering edit mode

Can you please post an example of "denovo.fasta" and "gene_id.txt". Can awk be used or is perl desired? Thanks.

ADD REPLY • link 8.9 years ago by bioguy24 ▴ 230

score 1 · Answer 1 · 2016-05-17

If your fasta file is in one line (no linebreaks within the DNA) this will do it.

grep -w -f gene_id.txt -A 1 denovo.fasta >present_only.fasta

Searches with each line in the gene_id.txt file, only looks for whole words (incase some of your IDs are shorter versions of others) and when it finds it, prints that line and the one after it.

If you need to convert a multi-line fasta into a single-line fasta before starting, I use this simple script:

#!/usr/bin/perl
chomp(@lines = <>);

foreach $line (@lines){
        if($line =~ /^>/){
                $line = "\n$line\n";
        }
        if($line != /^$/){
                print "$line";
        }
}

score 1 · Answer 2 · 2016-05-17

Considering your fasta head line is only one word (the id) then hashes are your friend:

#!/usr/bin/perl

use strict;
use warnings;
use autodie;

$ARGV[1] or die "use getSeqs.pl ID_LIST FASTA_IN > FASTA_OUT\n";
my $list = shift @ARGV;
my $fasta = shift @ARGV;
my %wanted =();
open (my $lh, "<", $list);
while (<$lh>) {
    chomp;
    $wanted{$_}++;
}
close $lh;

$/ = "\n>";
open (my $fh, "<", $fasta);
while (<$fh>) {
    s/>//g;
    my ($id) = split (/\n/, $_);
    print ">$_" if (defined $wanted{$id});
}
close $fh;