I have TPM values for every transcript in my transcriptome. I want to append the TPM value of each transcript to its corresponding sequence header in the transcriptome FASTA file. For example, I have a file where column one lists each transcript sequence ID and then column two lists the TPM value for that transcript. I also have the transcriptome FASTA file. How can I do this on the command line? UNIX preferred.
The formatting keeps getting messed up when I try to add an example.
I have one file with two columns: transcript IDs in column 1 and TPM values in column 2.
In my transcriptome FASTA file, a header looks like ">TRINITY_1" (followed by the sequence on the next line), and I want the TPM value appended so that it becomes ">TRINITY_1_1000". In this example, column one of my first file contains "TRINITY_1" and column two contains "1000".
One caveat: the order of transcript IDs in the first file differs from their order in the transcriptome FASTA file. Thanks!
I can't be sure this will work without proper example data, but here is a Perl script that should do the job:
#!/usr/bin/perl
use strict;
use warnings;

my $tpm_file   = "your_tpm.txt";
my $fasta_file = "your_Fasta.fa";
my $new_fasta  = "output_fasta.fa";

# Load the ID -> TPM table (column 1 = transcript ID, column 2 = TPM).
my %tpm = ();
open (my $th, "<", $tpm_file) or die "cannot read $tpm_file: $!\n";
while (<$th>) {
    chomp;
    my ($id, $val) = split ' ', $_;   # ' ' also ignores leading whitespace
    next unless defined $val;          # skip blank or incomplete lines
    $tpm{$id} = $val;
}
close $th;

open (my $fh, "<", $fasta_file) or die "cannot read $fasta_file: $!\n";
open (my $oh, ">", $new_fasta) or die "cannot write $new_fasta: $!\n";
while (<$fh>) {
    chomp;
    # Header line: the ID is the first word after ">"; any description is kept.
    if (/^>(\S+)(.*)/) {
        my ($id, $desc) = ($1, $2);
        if (defined $tpm{$id}) {
            print $oh ">" . $id . "_" . $tpm{$id} . $desc . "\n";
        }
        else {
            print $oh ">" . $id . "_NA" . $desc . "\n"; # in case the id is not in the tpm list
        }
        next;
    }
    print $oh "$_\n";   # sequence lines are copied unchanged
}
close $fh;
close $oh;
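If you would rather stay on the command line, the same lookup can be sketched as an awk one-liner. This is only a sketch under assumptions: the file names (tpm.txt, transcripts.fa, transcripts_tpm.fa) and the tiny demo inputs are made up for illustration, the TPM file is assumed to be whitespace-delimited and non-empty, and the ID is assumed to be the first word of each header.

```shell
# Hypothetical demo inputs; replace these with your own files.
printf 'TRINITY_2\t5.5\nTRINITY_1\t1000\n' > tpm.txt
printf '>TRINITY_1\nACGT\n>TRINITY_2\nGGCC\n>TRINITY_3\nTTAA\n' > transcripts.fa

# First file (NR == FNR): load ID -> TPM. Second file: rewrite header lines.
awk 'NR == FNR { tpm[$1] = $2; next }
     /^>/ {
         id   = substr($1, 2)                  # first word without the ">"
         val  = (id in tpm) ? tpm[id] : "NA"   # NA when the ID has no TPM
         rest = substr($0, length($1) + 1)     # keep any description text
         print ">" id "_" val rest
         next
     }
     { print }' tpm.txt transcripts.fa > transcripts_tpm.fa

cat transcripts_tpm.fa
```

Because the TPM table is held in an array, the order of IDs in the two files does not matter. Headers without a matching ID get "_NA", as in the Perl script.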
There are many posts about handling and renaming FASTA headers here at BioStars and on other forums. Did you search them and try adapting their answers?
However, I think it is a bad idea to modify the headers from a Trinity assembly, as they have a lot of information (e.g. putative transcript to gene relationships), and as these headers are used in several downstream Trinity programs (Trinity helper scripts, Trinotate, TransDecoder).
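If that concern applies to you, one possible compromise is to leave the ID untouched and add the TPM to the description part of the header instead. The awk sketch below is only an illustration: the file names, the demo inputs, and the "TPM=" tag are all hypothetical, and whether every downstream tool tolerates extra description text is an assumption you should check.

```shell
# Hypothetical demo inputs; replace these with your own files.
printf 'TRINITY_2\t5.5\nTRINITY_1\t1000\n' > tpm.txt
printf '>TRINITY_1 len=4\nACGT\n>TRINITY_2 len=4\nGGCC\n' > transcripts.fa

# Append " TPM=<value>" to each header; the ID (first word) is not changed.
awk 'NR == FNR { tpm[$1] = $2; next }
     /^>/ {
         id  = substr($1, 2)                  # first word without the ">"
         val = (id in tpm) ? tpm[id] : "NA"   # NA when the ID has no TPM
         print $0 " TPM=" val
         next
     }
     { print }' tpm.txt transcripts.fa > transcripts_annotated.fa

cat transcripts_annotated.fa
```

Tools that read only the first word of the header would still see the original ID.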
Please add example data for input and desired output.
Yes, examples would make it easier to judge how complex or easy this is.
You can use the code option (the 101010 button) to properly format code and data.