Question

Append TPM value to sequence header in FASTA file

4.9 years ago

molly77 ▴ 10

Hi,

I have TPM values for every transcript in my transcriptome. I want to append the TPM value of each transcript to its corresponding sequence header in the transcriptome FASTA file. For example, I have a file where column one lists each transcript sequence ID and then column two lists the TPM value for that transcript. I also have the transcriptome FASTA file. How can I do this on the command line? UNIX preferred.

Thank you!!!!

RNA-Seq FASTA TPM TRANSCRIPTS LINUX • 1.3k views

updated 4.9 years ago by JC 13k • written 4.9 years ago by molly77 ▴ 10

Please add example data for input and desired output.

4.9 years ago by ATpoint 86k

Yes, examples would make it easier to judge how complex or easy this is.

4.9 years ago by JC 13k

The formatting keeps getting messed up when I try to add an example.

But I have one file with two columns: transcript IDs in column 1 and TPM values in column 2. The IDs in my transcriptome FASTA file look like ">TRINITY_1" (followed by the sequence on the next line), but I want the TPM value appended, giving ">TRINITY_1_1000". So in my first file, for example, column one would contain "TRINITY_1" and column two would contain "1000".

The caveat is that the order of the transcript IDs in file one differs from their order in the transcriptome FASTA file. Thanks!

4.9 years ago by molly77 ▴ 10

You can use the code option (the 101010 button) to properly format code and data.

4.9 years ago by ATpoint 86k

score 1 · Answer 1 · 2020-02-21

I can't be sure this will work without proper examples, but here is a Perl script that should do the job:

#!/usr/bin/perl

use strict;
use warnings;

my $tpm_file = "your_tpm.txt";
my $fasta_file = "your_Fasta.fa";
my $new_fasta = "output_fasta.fa";

# load the ID -> TPM table; the file order does not matter
my %tpm = ();
open (my $th, "<", $tpm_file) or die "cannot read $tpm_file: $!\n";
while (<$th>) {
    chomp;
    next unless /\S/; # skip blank lines
    my ($id, $val) = split (' ', $_);
    $tpm{$id} = $val;
}
close $th;

open (my $fh, "<", $fasta_file) or die "cannot read $fasta_file: $!\n";
open (my $oh, ">", $new_fasta) or die "cannot write $new_fasta: $!\n";
while (<$fh>) {
    chomp;
    if (/^>(\S+)(.*)/) {
        # the ID is the first word of the header; any description after it is kept
        my ($id, $desc) = ($1, $2);
        if (defined $tpm{$id}) {
            print $oh ">" . $id . "_" . $tpm{$id} . $desc . "\n";
        }
        else {
            print $oh ">" . $id . "_NA" . $desc . "\n"; # in case the id is not in the tpm list
        }
        next;
    }
    print $oh "$_\n"; # sequence lines are copied unchanged
}
close $fh;
close $oh;
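
Since the question asked for a command-line approach, the same lookup can also be sketched in awk. This is only a sketch under assumptions: the TPM table is whitespace-separated with the ID in column 1, the FASTA ID is the first word of the header, and the file names and sample records below are made up around the ">TRINITY_1" / "1000" example from the question.

```shell
# Hypothetical sample inputs; note the TPM table is in a different order than the FASTA
workdir=$(mktemp -d)
printf 'TRINITY_2\t5.5\nTRINITY_1\t1000\n' > "$workdir/tpm.txt"
printf '>TRINITY_1\nACGT\n>TRINITY_2\nGGCC\n>TRINITY_3\nTTAA\n' > "$workdir/transcripts.fa"

awk '
    # First file (TPM table): remember ID -> TPM, then move to the next line
    NR == FNR { tpm[$1] = $2; next }
    # Second file (FASTA): rewrite header lines only
    /^>/ {
        id  = substr($1, 2)                  # first word of the header, without ">"
        val = (id in tpm) ? tpm[id] : "NA"   # NA when the ID has no TPM entry
        $1  = ">" id "_" val                 # rebuilding $1 collapses header whitespace to single spaces
    }
    { print }
' "$workdir/tpm.txt" "$workdir/transcripts.fa" > "$workdir/with_tpm.fa"

cat "$workdir/with_tpm.fa"
```

Because the table is loaded into an array before the FASTA is read, the two files do not need to be in the same order. One thing to watch: the `NR == FNR` test assumes the TPM file is not empty.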

score 0 · Answer 2 · 2020-02-20

There are many posts about handling and renaming FASTA headers here at BioStars and on other forums. Did you search for them and try adapting their answers?

However, I think it is a bad idea to modify the headers of a Trinity assembly: they carry a lot of information (e.g. putative transcript-to-gene relationships), and several downstream Trinity programs (Trinity helper scripts, Trinotate, TransDecoder) rely on them.
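
If the original IDs need to survive, one possible compromise is to leave the ID untouched and add the TPM after a space, as part of the header description. A minimal sketch, assuming the downstream tools read only the first word of the header as the ID (worth checking for each tool); the `TPM=` tag, the file names, and the sample record are hypothetical.

```shell
# Hypothetical one-record example with a Trinity-style header
workdir=$(mktemp -d)
printf 'TRINITY_DN1_c0_g1_i1\t12.3\n' > "$workdir/tpm.txt"
printf '>TRINITY_DN1_c0_g1_i1 len=8\nACGTACGT\n' > "$workdir/Trinity.fasta"

awk '
    # First file (TPM table): remember ID -> TPM
    NR == FNR { tpm[$1] = $2; next }
    # FASTA headers: keep the line as is and append the TPM as an extra field
    /^>/ {
        id  = substr($1, 2)
        val = (id in tpm) ? tpm[id] : "NA"
        print $0 " TPM=" val
        next
    }
    { print }   # sequence lines are copied unchanged
' "$workdir/tpm.txt" "$workdir/Trinity.fasta" > "$workdir/Trinity.tpm.fasta"

cat "$workdir/Trinity.tpm.fasta"
```

Writing to a new file also keeps the original assembly around for any tool that expects the unmodified headers.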