I have a de novo assembly file "denovo.fasta", and another file with 25,000 genes id "gene_id.txt". how should i retrive those sequences from the "denovo.fasta" which are only present in "gene_id.txt" file?
I have a de novo assembly file "denovo.fasta", and another file with 25,000 genes id "gene_id.txt". how should i retrive those sequences from the "denovo.fasta" which are only present in "gene_id.txt" file?
If your fasta file is in one line (no linebreaks within the DNA) this will do it.
grep -w -f gene_id.txt -A 1 denovo.fasta >present_only.fasta
Searches with each line in the gene_id.txt file, only looks for whole words (incase some of your IDs are shorter versions of others) and when it finds it, prints that line and the one after it.
If you need to convert a multi-line fasta into a single-line fasta before starting, I use this simple script:
#!/usr/bin/perl
chomp(@lines = <>);
foreach $line (@lines){
if($line =~ /^>/){
$line = "\n$line\n";
}
if($line != /^$/){
print "$line";
}
}
Considering your fasta head line is only one word (the id) then hashes are your friend:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
$ARGV[1] or die "use getSeqs.pl ID_LIST FASTA_IN > FASTA_OUT\n";
my $list = shift @ARGV;
my $fasta = shift @ARGV;
my %wanted =();
open (my $lh, "<", $list);
while (<$lh>) {
chomp;
$wanted{$_}++;
}
close $lh;
$/ = "\n>";
open (my $fh, "<", $fasta);
while (<$fh>) {
s/>//g;
my ($id) = split (/\n/, $_);
print ">$_" if (defined $wanted{$id});
}
close $fh;
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Can you please post an example of "denovo.fasta" and "gene_id.txt". Can awk be used or is perl desired? Thanks.