How to get the genes that contain the highest number of isoforms from a fasta file generated by trinity de novo?
2
0
Entering edit mode
3.4 years ago
FadyNabil ▴ 20

I have a trinity file in a fasta format and I want to get the gene(s) with the highest no. of isoforms.

assembly fasta trinity • 4.8k views
ADD COMMENT
1
Entering edit mode

I'm assuming you're referring to the contigs FASTA file generated by trinity de novo. If that's a correct assumption, please state this overtly in your question.

A FASTA file contains sequences. Genes and isoforms are functional annotations that go on top. Without the latter annotation process, what you're asking for is not possible.

EDIT: I was mistaken - Trinity seems to add gene and isoform annotations to its output (though I still think these annotations are unreliable without a GTF especially when they result from pure computational analyses)

ADD REPLY
0
Entering edit mode

ok, I have a FASTA file generated by trinity de novo, how then I can get the gene(s) with the highest no. of isoforms?

ADD REPLY
0
Entering edit mode

You do not have a "trinity file" - you ave the assembled output FASTA format from trinityrnaseq. What organism is this for? Does it have reference gene/transcript annotations?

ADD REPLY
0
Entering edit mode

does the longest sequence is the sequence that contains the highest number of isoforms?

ADD REPLY
1
Entering edit mode

The longest sequence is the sequence with the most bases. Please read up on de novo RNAseq processing methods and try something on your own before asking here.

ADD REPLY
1
Entering edit mode
3.4 years ago

Hey,

this could work.

You should have a file called: Trinity.fasta.gene_trans_map link

Upload the file in R:

library(readr)
Trinity_fasta_gene_trans_map <- read_delim("D:/Download/Trinity.fasta.gene_trans_map.txt", 
    "\t", escape_double = FALSE, col_names = FALSE, 
    trim_ws = TRUE)

head(Trinity_fasta_gene_trans_map)

# A tibble: 6 x 2
  #X1                 X2                   
  #<chr>              <chr>                
#1 TRINITY_DN1_c0_g1  TRINITY_DN1_c0_g1_i1 
#2 TRINITY_DN1_c0_g2  TRINITY_DN1_c0_g2_i1 
#3 TRINITY_DN13_c0_g1 TRINITY_DN13_c0_g1_i1
#4 TRINITY_DN13_c0_g1 TRINITY_DN13_c0_g1_i2
#5 TRINITY_DN14_c0_g1 TRINITY_DN14_c0_g1_i1
#6 TRINITY_DN15_c0_g1 TRINITY_DN15_c0_g1_i1

out<-as.matrix(table(Trinity_fasta_gene_trans_map$X1)) #get the number of isoform per gene
head(out)
                    [,1]
TRINITY_DN1_c0_g1      1
TRINITY_DN1_c0_g2      1
TRINITY_DN101_c0_g1    1
TRINITY_DN101_c0_g2    1
TRINITY_DN102_c0_g1    1
TRINITY_DN102_c0_g2    1
ADD COMMENT
0
Entering edit mode

I am using bash not R

ADD REPLY
1
Entering edit mode

Bash solution:

awk '{h[$1]++}; END { for(k in h) print k, h[k] }' Trinity.fasta.gene_trans_map.txt > out_trinity

source

ps: since several trinity wrappers run on R, I would strongly suggest you to learn some

ADD REPLY
0
Entering edit mode

I do not have Trinity.fasta.gene_trans_map.txt I have Trinity.fasta !

ADD REPLY
1
Entering edit mode

I used this file as input

$ awk 'sub(/^>/, "")' Trinity.fasta > header.trinity # get the fasta headers

$ head header.trinity

TRINITY_DN10_c0_g1_i1 len=334 path=[312:0-333] [-1, 312, -2]
TRINITY_DN11_c0_g1_i1 len=319 path=[297:0-318] [-1, 297, -2]
TRINITY_DN12_c0_g1_i1 len=244 path=[222:0-243] [-1, 222, -2]
TRINITY_DN17_c0_g1_i1 len=229 path=[207:0-228] [-1, 207, -2]
TRINITY_DN18_c0_g1_i1 len=633 path=[611:0-632] [-1, 611, -2]
TRINITY_DN18_c1_g1_i1 len=289 path=[267:0-288] [-1, 267, -2]
TRINITY_DN19_c0_g1_i1 len=283 path=[261:0-282] [-1, 261, -2]
TRINITY_DN21_c0_g1_i1 len=242 path=[220:0-241] [-1, 220, -2]


$ cat header.trinity | sed 's/_i.*//' > gene.header.trinity # keep only the portion of the header representing the gene_id (example `TRINITY_DN21_c0_g1`).

$ head gene.header.trinity # check the new headers
TRINITY_DN10_c0_g1
TRINITY_DN11_c0_g1
TRINITY_DN12_c0_g1
TRINITY_DN17_c0_g1
TRINITY_DN18_c0_g1
TRINITY_DN18_c1_g1
TRINITY_DN19_c0_g1
TRINITY_DN21_c0_g1
TRINITY_DN22_c0_g1
TRINITY_DN43_c0_g1

$ awk '{h[$1]++}; END { for(k in h) print k, h[k] }' gene.header.trinity > final_output # count how many time a string appears

$ head final_output

TRINITY_DN554_c0_g1 1
TRINITY_DN502_c0_g1 1
TRINITY_DN454_c0_g1 1
TRINITY_DN402_c0_g1 1
TRINITY_DN402_c4_g1 1
TRINITY_DN402_c0_g2 1
TRINITY_DN302_c0_g1 1
TRINITY_DN354_c0_g1 1
TRINITY_DN354_c4_g1 1
TRINITY_DN110_c0_g1 1
ADD REPLY
0
Entering edit mode

I think you'll need to run this script to get that file: https://github.com/trinityrnaseq/trinityrnaseq/blob/18bb3182005a61d236525185d8dda65867ca4cc0/util/align_and_estimate_abundance.pl

Try:

perl align_and_estimate_abundance.pl --transcripts Trinity.fasta --est_method RSEM --aln_method bowtie --trinity_mode --prep_reference
ADD REPLY
0
Entering edit mode

it says 'Can't open perl script "align_and_estimate_abundance.pl": No such file or directory'

ADD REPLY
2
Entering edit mode

My bad - don't type the command exactly as I gave it to you. Understand it and make necessary changes so it makes sense in your context.

ADD REPLY
0
Entering edit mode

Are you opposed to using R?

ADD REPLY
0
Entering edit mode

no, but I need to solve it by bash now if it possible

ADD REPLY
0
Entering edit mode
3.4 years ago
h.mon 35k

The output of a Trinity fasta assembly has headers following a convention, see the documentation for details:

>TRINITY_DN17755_c0_g1_i1 len=221 path=[0:0-220]
>TRINITY_DN17876_c0_g1_i1 len=254 path=[0:0-253]
>TRINITY_DN17736_c0_g1_i1 len=413 path=[0:0-412]

The c, g and i stand for clusters, genes and isoforms - note those are somewhat loosely determined, as short reads transcriptome assemblies are very noisy in general.

To get the single "gene" with the most "isoforms", you can use standard linux command line utilities:

grep ">" Trinity.fasta \
  | cut -d " " -f1 \
  | sort -t"_" -n -r -k5.2 \
  | head -n1

To get a list of the "N" genes with the most isoforms would be a bit more complicated, and probably involve some scripting, parsing, sorting and filtering the fields from the fasta headers.

ADD COMMENT
0
Entering edit mode

ok, when I ran this command on the Trinity fasta that you sent I got this 'TRINITY_DN17876_c0_g1_i1' could you explain to me why I get especially this gene? also what is the meaning of -k5.2 in this command

ADD REPLY
0
Entering edit mode

what is the meaning of -k5.2 in this command

EDIT: Looks like 5.2 is not a difference-in-decimal-separator version of 5,2 but a whole different thing. Please see h.mon's comment below

It's probably because some countries use `,` to represent decimals and `.` to represent the comma separator. The manual will help you understand `-k5,2`, and that should translate to the `-k5.2` used here.

ADD REPLY
0
Entering edit mode

ok is there is a way to get all the genes not just the one with the highest isoforms?

ADD REPLY
0
Entering edit mode

Read the manual pages of each shell command used, and you'll understand what to do so you see all entries, not just the first.

ADD REPLY
0
Entering edit mode

I have no idea why you get a gene with one isoform as a result. I get >TRINITY_DN1104_c0_g1_i372. Unless all the transcripts from your assembly have only one isoform, you shoud get something with a high number of isoforms.

Regarding -k, reading man sort shows:

KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a field number and C a character position in the field; both are origin 1, and the stop position defaults to the line's end.

So -k5.2 means sort by the 5th field, beginning from the second character from that field. As I stated "_" as separator, the 5th field is the isoform number, and the sorting ignores the i prefix.

ADD REPLY
1
Entering edit mode

Side note: -k5Vr would work as well - version sort is great for alphanumeric content where we want the alphabet portion sorted alphabetically and the numeric portion sorted numerically.

ADD REPLY
0
Entering edit mode

in this example what should I do to know how many assembly clusters do we have in the assembled transcriptome?

ADD REPLY

Login before adding your answer.

Traffic: 1064 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6