Entering edit mode
8.4 years ago
empyrean999
▴
180
Hello;
I need to process fasta header by matching fasta description (not fasta id) with a first column in a another file with two columns and print second column in file on to fasta header. Here are examples and what i have till now.
file1.txt (list file)
group_1 gene 1
group_2 gene 2
group_3 gene 3
group_4 gene 4
file1.fasta
>Desc_0001.p01 group_1
TTGGAAAACATCTCTGATTTATGGAACAGCGCCTTAAAAGAACTCGAAAAAAAGGTCAGT
>Desc_0002.p01 group_2
ATGCGTTTTTCAATTCAAAAAGACTATCTTGTAAGAAGTGTACAAGATGTAATGAAGGCT
>Desc_0003.p01 group_3
GTGAGCGATTTTATGAAACGTATTAAAATTTCAACAGAGTATATTACACTAGGGCAATTT
>Desc_0004.p01 group_4
TTGTTTATTTCAGAAATACAATTAAAAAACTATCGCAATTATGAAAAATTAGAGCTTTCC
>Desc_0005.p01 group_11
ATGGAACAAAAGCAAATGCAAGAAAATTCATATGATGAAAGTCAAATACAGGTACTTGAA
>Desc_0006.p01 group_23
ATGTCAGACAATCAACAACAAGCACGAATTCGAGAAATTAATATTAGCCATGAAATGCGT
>Desc_0008.p01 group_622
ATGCATCAAACACATTGTACTTGGCAGCAATTAAGTTTCTTCTTTTCATCTCAAAATGTA
desired output :
>Desc_0001.p01 group_1 gene 1
TTGGAAAACATCTCTGATTTATGGAACAGCGCCTTAAAAGAACTCGAAAAAAAGGTCAGT
>Desc_0002.p01 group_2 gene 2
ATGCGTTTTTCAATTCAAAAAGACTATCTTGTAAGAAGTGTACAAGATGTAATGAAGGCT
>Desc_0003.p01 group_3 gene 3
GTGAGCGATTTTATGAAACGTATTAAAATTTCAACAGAGTATATTACACTAGGGCAATTT
>Desc_0004.p01 group_4 gene 4
TTGTTTATTTCAGAAATACAATTAAAAAACTATCGCAATTATGAAAAATTAGAGCTTTCC
Code i have till now:
use strict;
use warnings;
my $file = 'file1.txt';
open my $infile, '<', $file or die "Can't read from $file: $!";
my @genes;
my @desc;
while (<$infile>){
chomp $infile;
my @split = split('\t');
push @genes, $split[0];
push @desc, $split[1];
# print "$split[1]";
}
my $input;
close $infile;
{
local $/ = undef;
open my $fasta, '<file1.fasta';
$input = <$fasta>;
close $fasta;
}
my @lines = split(/>/,$input);
foreach my $l (@lines)
{
foreach my $reg (@genes) {
print ">$l" if $l =~ /\s$reg\s+/i
}
}
my output from code :
>Desc_0001.p01 group_1
TTGGAAAACATCTCTGATTTATGGAACAGCGCCTTAAAAGAACTCGAAAAAAAGGTCAGT
>Desc_0002.p01 group_2
ATGCGTTTTTCAATTCAAAAAGACTATCTTGTAAGAAGTGTACAAGATGTAATGAAGGCT
>Desc_0003.p01 group_3
GTGAGCGATTTTATGAAACGTATTAAAATTTCAACAGAGTATATTACACTAGGGCAATTT
>Desc_0004.p01 group_4
TTGTTTATTTCAGAAATACAATTAAAAAACTATCGCAATTATGAAAAATTAGAGCTTTCC
How can i add second column to description. i am able to get matching fasta descriptor out.
And if it's more complicated than that a Biopython script could do that in a few lines. But I have the feeling I already wrote those scripts a few times here on biostars :-D A simple search might work...
can you point me to some of your old suggestions?
This one comes probably pretty close, requiring some modifications: Please help with connecting the part of the header in a fasta file with another text file
Do you have (bio)python experience? It's probably most rewarding that you get this working on yourself, but if you come across any issues I'd be happy to help.
gene is not a static number and not in incremental order. The real file will have random descriptions of typical gene names .