modifying fasta header
1
1
Entering edit mode
8.4 years ago
empyrean999 ▴ 180

Hello;

I need to process fasta header by matching fasta description (not fasta id) with a first column in a another file with two columns and print second column in file on to fasta header. Here are examples and what i have till now.

file1.txt (list file)

group_1 gene 1
group_2 gene 2
group_3 gene 3
group_4 gene 4

file1.fasta

>Desc_0001.p01 group_1
TTGGAAAACATCTCTGATTTATGGAACAGCGCCTTAAAAGAACTCGAAAAAAAGGTCAGT
>Desc_0002.p01 group_2
ATGCGTTTTTCAATTCAAAAAGACTATCTTGTAAGAAGTGTACAAGATGTAATGAAGGCT
>Desc_0003.p01 group_3
GTGAGCGATTTTATGAAACGTATTAAAATTTCAACAGAGTATATTACACTAGGGCAATTT
>Desc_0004.p01 group_4
TTGTTTATTTCAGAAATACAATTAAAAAACTATCGCAATTATGAAAAATTAGAGCTTTCC
>Desc_0005.p01 group_11
ATGGAACAAAAGCAAATGCAAGAAAATTCATATGATGAAAGTCAAATACAGGTACTTGAA
>Desc_0006.p01 group_23
ATGTCAGACAATCAACAACAAGCACGAATTCGAGAAATTAATATTAGCCATGAAATGCGT
>Desc_0008.p01 group_622
ATGCATCAAACACATTGTACTTGGCAGCAATTAAGTTTCTTCTTTTCATCTCAAAATGTA

desired output :

>Desc_0001.p01 group_1 gene 1
TTGGAAAACATCTCTGATTTATGGAACAGCGCCTTAAAAGAACTCGAAAAAAAGGTCAGT
>Desc_0002.p01 group_2 gene 2
ATGCGTTTTTCAATTCAAAAAGACTATCTTGTAAGAAGTGTACAAGATGTAATGAAGGCT
>Desc_0003.p01 group_3 gene 3
GTGAGCGATTTTATGAAACGTATTAAAATTTCAACAGAGTATATTACACTAGGGCAATTT
>Desc_0004.p01 group_4 gene 4
TTGTTTATTTCAGAAATACAATTAAAAAACTATCGCAATTATGAAAAATTAGAGCTTTCC

Code i have till now:

use strict;
use warnings;

my $file = 'file1.txt'; 
open my $infile, '<', $file or die "Can't read from $file: $!";

my @genes;
my @desc;
while (<$infile>){
 chomp $infile;
 my @split = split('\t');
    push @genes, $split[0];
    push @desc, $split[1];
#    print "$split[1]";
}

my $input;
close $infile;

{
    local $/ = undef;
   open my $fasta, '<file1.fasta';
    $input = <$fasta>;
   close $fasta;
}

my @lines = split(/>/,$input);

foreach my $l (@lines)
{
   foreach my $reg (@genes) {
        print ">$l" if $l =~ /\s$reg\s+/i
    }

}

my output from code :

>Desc_0001.p01 group_1
TTGGAAAACATCTCTGATTTATGGAACAGCGCCTTAAAAGAACTCGAAAAAAAGGTCAGT
>Desc_0002.p01 group_2
ATGCGTTTTTCAATTCAAAAAGACTATCTTGTAAGAAGTGTACAAGATGTAATGAAGGCT
>Desc_0003.p01 group_3
GTGAGCGATTTTATGAAACGTATTAAAATTTCAACAGAGTATATTACACTAGGGCAATTT
>Desc_0004.p01 group_4
TTGTTTATTTCAGAAATACAATTAAAAAACTATCGCAATTATGAAAAATTAGAGCTTTCC

How can i add second column to description. i am able to get matching fasta descriptor out.

perl unix fasta • 1.8k views
ADD COMMENT
1
Entering edit mode
8.4 years ago
5heikki 11k

If it's always "group_N gene N": then

awk 'BEGIN{OFS=FS=" "}{if(/^>/){print $1,$2,"gene",substr($2,7)}else{print $0}}' file.fasta
ADD COMMENT
0
Entering edit mode

And if it's more complicated than that a Biopython script could do that in a few lines. But I have the feeling I already wrote those scripts a few times here on biostars :-D A simple search might work...

ADD REPLY
0
Entering edit mode

can you point me to some of your old suggestions?

ADD REPLY
0
Entering edit mode

This one comes probably pretty close, requiring some modifications: Please help with connecting the part of the header in a fasta file with another text file

Do you have (bio)python experience? It's probably most rewarding that you get this working on yourself, but if you come across any issues I'd be happy to help.

ADD REPLY
0
Entering edit mode

gene is not a static number and not in incremental order. The real file will have random descriptions of typical gene names .

ADD REPLY

Login before adding your answer.

Traffic: 1651 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6