Append TPM value to sequence header in FASTA file
2
0
Entering edit mode
4.8 years ago
molly77 ▴ 10

Hi,

I have TPM values for every transcript in my transcriptome. I want to append the TPM value of each transcript to its corresponding sequence header in the transcriptome FASTA file. For example, I have a file where column one lists each transcript sequence ID and then column two lists the TPM value for that transcript. I also have the transcriptome FASTA file. How can I do this on the command line? UNIX preferred.

Thank you!!!!

RNA-Seq FASTA TPM TRANSCRIPTS LINUX • 1.3k views
ADD COMMENT
0
Entering edit mode

Please add example data for input and desired output.

ADD REPLY
0
Entering edit mode

Yes, examples could be better to understand how complex/easy it is.

ADD REPLY
0
Entering edit mode

The format keeps getting messed up when trying to add example.

But I have one file with two columns. Transcript IDs in column 1, TPM value in column 2. My transcriptome FASTA file IDs for example ">TRINITY_1" (followed by the sequence on the next line) but I want the TPM value appended so ">TRINITY_1_1000". So in my first file for example, column one will contain "TRINITY_1" and column two will contain "1000".

Caveat is that the order of transcript IDs in file one is different from order in the transcriptome FASTA file. Thanks!

ADD REPLY
0
Entering edit mode

You can use the code option 101010 to properly format code and data.

ADD REPLY
1
Entering edit mode
4.8 years ago
JC 13k

I'm not sure if this could work unless you properly provide examples, but here is a Perl script to do the job:

#!/usr/bin/perl

use strict;
use warnings;

my $tpm_file = "your_tpm.txt";
my $fasta_file = "your_Fasta.fa";
my $new_fasta = "output_fasta.fa";

my %tpm =();
open (my $th, "<", $tpm_file) or die "cannot read $tpm_file\n";
while (<$th>) {
    chomp;
    my ($id, $val) = split (/\s+/, $_);
    $tpm{$id} = $val;
}
close $th;

open (my $fh, "<", $fasta_file) or die "cannot read $fasta_file\n";
open (my $oh, ">", $new_fasta) or die "cannot write $new_fasta\n";
while (<$fh>) {
    chomp;
    if (/>(.+)/) {
        my $id = $1;
        if (defined $tpm{$id}) {
             print $oh ">" . $id . "_" . $tpm{$id} . "\n";
         }
         else {
              print $oh ">" . $id . "_NA\n"; # in case the id is not in the tpm list
         }
    next;
    }
    print $oh "$_\n";
}
close $fh;
close $oh;
ADD COMMENT
0
Entering edit mode
4.8 years ago
h.mon 35k

There are many posts regarding fasta headers handling and renaming here at BioStars and at other forums, did you search and try to get a solution by adapting the answers from these posts?

However, I think it is a bad idea to modify the headers from a Trinity assembly, as they have a lot of information (e.g. putative transcript to gene relationships), and as these headers are used in several downstream Trinity programs (Trinity helper scripts, Trinotate, TransDecoder).

ADD COMMENT

Login before adding your answer.

Traffic: 1974 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6