Hi guys.
I have two files:
- The first is a tab-delimited file. Its first column holds a list of locations (LOC IDs) and its second column holds the description of the corresponding fasta sequence.
- The second is a multi-fasta file. Some sequences have a normal header and others have only the location as the header. I'd like to compare the two files and, whenever a "LOC" header in the multi-fasta matches a location in the tab file, replace that header with the location plus the corresponding description from the tab file.
The tab file looks like this:
LOC105031928 regulator of telomere elongation helicase 1
LOC105031929 pathogenesis-related protein 1B-like
In my multi-fasta reference I may have some sequences like:
>LOC105031928
GCTTGCCAGGGTTCCTCGACACCTTGTGCCGAGTCTTCACTATCTCCTTCCACAAGAAGC
TTTCTAGGGTTTCCCAAGAACCCTCATACCTGTCCTCCATCCCATTCGTCGAAAAAATTT
CTAGGGTGTCCTCAAGAATCCCCGTGCCCTCTTCCGAACGAACGGTGCGAAGGTCGAGGG
AAATGCCGATCTACAAGATTAGGGGGATCGATGTGGATTTCCCCTTCGAAGCCTACGATT
For example, I would like to modify this sequence and get the following output:
>LOC105031928 | regulator of telomere elongation helicase 1
GCTTGCCAGGGTTCCTCGACACCTTGTGCCGAGTCTTCACTATCTCCTTCCACAAGAAGC
TTTCTAGGGTTTCCCAAGAACCCTCATACCTGTCCTCCATCCCATTCGTCGAAAAAATTT
CTAGGGTGTCCTCAAGAATCCCCGTGCCCTCTTCCGAACGAACGGTGCGAAGGTCGAGGG
AAATGCCGATCTACAAGATTAGGGGGATCGATGTGGATTTCCCCTTCGAAGCCTACGATT
I tried to do it with this code sample:
#!/usr/local/perl-5.24.0/bin/perl
use strict;
use warnings;
use Data::Dumper;

my $tabfile   = $ARGV[0];
my $Reference = $ARGV[1];
open REF, "<", $Reference or die "Cannot open $Reference: $!";
open TAB, "<", $tabfile or die "Cannot open $tabfile: $!";
open RES, ">", "RES.fasta" or die "Cannot open RES.fasta: $!";

# Collect every ">LOC..." header of the reference, keyed by the text after ">"
my %hashref;
while (my $ligne = <REF>) {
    if ($ligne =~ /^>LOC/) {
        my $reflines = (split(m/>/, $ligne))[1];
        $hashref{$reflines} = $ligne;
    }
}

my ($LOC, $DES, $header);
my @lines = <TAB>;
foreach my $line (@lines) {
    while (my ($key, $value) = each %hashref) {
        $LOC    = (split(m/\t/, $line))[0];
        $DES    = (split(m/\t/, $line))[1];
        $header = $LOC . "|" . $DES;
        if ($LOC !~ $key) {
            ## Here is the part where I am stuck. I tried printing only $key into RES
            ## to debug the program, but it prints all possible combinations
            ## of $line and $key.
        }
    }
}

close(TAB);
close(RES);
close(REF);
exit;
Any ideas on how to solve this? I tried several methods but still can't figure out how to handle both files and edit the headers in my multi-fasta file. Thanks.
This has almost certainly been answered before. @Pierre generally has a list of all answered fasta-related threads handy.
Here is one: modifying fasta header
Thank you for the links. I'll take a look.
There have been multiple questions/answers similar to this on biostars.
What have you tried? Please show in your question.
I just posted part of my script. Thank you for the Python code, but unfortunately the program I'd like to write is part of a large script written in Perl, so I'm afraid I can't combine the two.
That should be added to the original post.
First, you need to know how to parse a FASTA file in Perl. I wrote a function for this, or there is this optimized but hard-to-read one.
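As a rough illustration of what parsing FASTA in Perl can look like, here is a minimal sketch. It is not the function linked above; the name read_fasta is hypothetical, and it assumes that every record starts with a line beginning with ">" and that the whole file fits in memory.

```perl
use strict;
use warnings;

# Hypothetical helper: read a FASTA file into a list of [header, sequence] pairs.
# Assumption: every record starts with a line beginning with '>'.
sub read_fasta {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my @records;
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;            # strip a Unix or Windows line ending
        next if $line eq '';             # skip blank lines
        if ($line =~ /^>(.*)/) {
            push @records, [$1, ''];     # new record: header without the '>'
        }
        elsif (@records) {
            $records[-1][1] .= $line;    # append sequence line to current record
        }
    }
    close $fh;
    return \@records;
}
```

Each element of the returned array reference is a two-element array: the header text and the concatenated sequence.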
No need to iterate over the hash; just check whether a key exists in a hash using exists.
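A minimal sketch of that idea, under a few assumptions: the tab file has two tab-separated columns (ID, then description), a header's ID is the first whitespace-free token after ">", and the output uses the "ID | description" layout from the question. The sub name annotate_headers and the %desc_for hash are illustrative names, not taken from the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $fasta_in to $fasta_out, rewriting every header
# whose ID appears in $tab_in as ">ID | description".
sub annotate_headers {
    my ($tab_in, $fasta_in, $fasta_out) = @_;

    # Load the tab file once into a hash: ID => description.
    my %desc_for;
    open my $tab, '<', $tab_in or die "Cannot open $tab_in: $!";
    while (my $line = <$tab>) {
        $line =~ s/\r?\n\z//;                    # strip the line ending
        my ($id, $desc) = split /\t/, $line, 2;  # assumed: exactly two columns
        next unless defined $id && defined $desc;
        $desc_for{$id} = $desc;
    }
    close $tab;

    # Stream the FASTA file; only matching header lines are modified.
    open my $in,  '<', $fasta_in  or die "Cannot open $fasta_in: $!";
    open my $out, '>', $fasta_out or die "Cannot open $fasta_out: $!";
    while (my $line = <$in>) {
        # A single hash lookup with exists replaces the nested loop.
        if ($line =~ /^>(\S+)/ && exists $desc_for{$1}) {
            $line = ">$1 | $desc_for{$1}\n";
        }
        print {$out} $line;
    }
    close $in;
    close $out or die "Cannot close $fasta_out: $!";
    return;
}
```

Calling annotate_headers($ARGV[0], $ARGV[1], 'RES.fasta') would mirror the argument order of the script in the question. Because the tab file is read once into a hash and the FASTA file is streamed line by line, each header costs one lookup instead of a pass over every tab line.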