I have a trinity file in a fasta format and I want to get the gene(s) with the highest no. of isoforms.
I have a trinity file in a fasta format and I want to get the gene(s) with the highest no. of isoforms.
Hey,
this could work.
You should have a file called: Trinity.fasta.gene_trans_map
link
Upload the file in R:
library(readr)
Trinity_fasta_gene_trans_map <- read_delim("D:/Download/Trinity.fasta.gene_trans_map.txt",
"\t", escape_double = FALSE, col_names = FALSE,
trim_ws = TRUE)
head(Trinity_fasta_gene_trans_map)
# A tibble: 6 x 2
#X1 X2
#<chr> <chr>
#1 TRINITY_DN1_c0_g1 TRINITY_DN1_c0_g1_i1
#2 TRINITY_DN1_c0_g2 TRINITY_DN1_c0_g2_i1
#3 TRINITY_DN13_c0_g1 TRINITY_DN13_c0_g1_i1
#4 TRINITY_DN13_c0_g1 TRINITY_DN13_c0_g1_i2
#5 TRINITY_DN14_c0_g1 TRINITY_DN14_c0_g1_i1
#6 TRINITY_DN15_c0_g1 TRINITY_DN15_c0_g1_i1
out<-as.matrix(table(Trinity_fasta_gene_trans_map$X1)) #get the number of isoform per gene
head(out)
[,1]
TRINITY_DN1_c0_g1 1
TRINITY_DN1_c0_g2 1
TRINITY_DN101_c0_g1 1
TRINITY_DN101_c0_g2 1
TRINITY_DN102_c0_g1 1
TRINITY_DN102_c0_g2 1
I used this file as input
$ awk 'sub(/^>/, "")' Trinity.fasta > header.trinity # get the fasta headers
$ head header.trinity
TRINITY_DN10_c0_g1_i1 len=334 path=[312:0-333] [-1, 312, -2]
TRINITY_DN11_c0_g1_i1 len=319 path=[297:0-318] [-1, 297, -2]
TRINITY_DN12_c0_g1_i1 len=244 path=[222:0-243] [-1, 222, -2]
TRINITY_DN17_c0_g1_i1 len=229 path=[207:0-228] [-1, 207, -2]
TRINITY_DN18_c0_g1_i1 len=633 path=[611:0-632] [-1, 611, -2]
TRINITY_DN18_c1_g1_i1 len=289 path=[267:0-288] [-1, 267, -2]
TRINITY_DN19_c0_g1_i1 len=283 path=[261:0-282] [-1, 261, -2]
TRINITY_DN21_c0_g1_i1 len=242 path=[220:0-241] [-1, 220, -2]
$ cat header.trinity | sed 's/_i.*//' > gene.header.trinity # keep only the portion of the header representing the gene_id (example `TRINITY_DN21_c0_g1`).
$ head gene.header.trinity # check the new headers
TRINITY_DN10_c0_g1
TRINITY_DN11_c0_g1
TRINITY_DN12_c0_g1
TRINITY_DN17_c0_g1
TRINITY_DN18_c0_g1
TRINITY_DN18_c1_g1
TRINITY_DN19_c0_g1
TRINITY_DN21_c0_g1
TRINITY_DN22_c0_g1
TRINITY_DN43_c0_g1
$ awk '{h[$1]++}; END { for(k in h) print k, h[k] }' gene.header.trinity > final_output # count how many time a string appears
$ head final_output
TRINITY_DN554_c0_g1 1
TRINITY_DN502_c0_g1 1
TRINITY_DN454_c0_g1 1
TRINITY_DN402_c0_g1 1
TRINITY_DN402_c4_g1 1
TRINITY_DN402_c0_g2 1
TRINITY_DN302_c0_g1 1
TRINITY_DN354_c0_g1 1
TRINITY_DN354_c4_g1 1
TRINITY_DN110_c0_g1 1
I think you'll need to run this script to get that file: https://github.com/trinityrnaseq/trinityrnaseq/blob/18bb3182005a61d236525185d8dda65867ca4cc0/util/align_and_estimate_abundance.pl
Try:
perl align_and_estimate_abundance.pl --transcripts Trinity.fasta --est_method RSEM --aln_method bowtie --trinity_mode --prep_reference
The output of a Trinity fasta assembly has headers following a convention, see the documentation for details:
>TRINITY_DN17755_c0_g1_i1 len=221 path=[0:0-220]
>TRINITY_DN17876_c0_g1_i1 len=254 path=[0:0-253]
>TRINITY_DN17736_c0_g1_i1 len=413 path=[0:0-412]
The c, g and i stand for clusters, genes and isoforms - note those are somewhat loosely determined, as short reads transcriptome assemblies are very noisy in general.
To get the single "gene" with the most "isoforms", you can use standard linux command line utilities:
grep ">" Trinity.fasta \
| cut -d " " -f1 \
| sort -t"_" -n -r -k5.2 \
| head -n1
To get a list of the "N" genes with the most isoforms would be a bit more complicated, and probably involve some scripting, parsing, sorting and filtering the fields from the fasta headers.
what is the meaning of -k5.2 in this command
EDIT: Looks like 5.2
is not a difference-in-decimal-separator version of 5,2
but a whole different thing. Please see h.mon's comment below
It's probably because some countries use `,` to represent decimals and `.` to represent the comma separator.
The manual will help you understand `-k5,2`, and that should translate to the `-k5.2` used here.
I have no idea why you get a gene with one isoform as a result. I get >TRINITY_DN1104_c0_g1_i372. Unless all the transcripts from your assembly have only one isoform, you shoud get something with a high number of isoforms.
Regarding -k
, reading man sort
shows:
KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a field number and C a character position in the field; both are origin 1, and the stop position defaults to the line's end.
So -k5.2
means sort by the 5th field, beginning from the second character from that field. As I stated "_" as separator, the 5th field is the isoform number, and the sorting ignores the i
prefix.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
I'm assuming you're referring to the contigs FASTA file generated by trinity de novo. If that's a correct assumption, please state this overtly in your question.
A FASTA file contains sequences. Genes and isoforms are functional annotations that go on top. Without the latter annotation process, what you're asking for is not possible.
EDIT: I was mistaken - Trinity seems to add gene and isoform annotations to its output (though I still think these annotations are unreliable without a GTF especially when they result from pure computational analyses)
ok, I have a FASTA file generated by trinity de novo, how then I can get the gene(s) with the highest no. of isoforms?
You do not have a "trinity file" - you ave the assembled output FASTA format from trinityrnaseq. What organism is this for? Does it have reference gene/transcript annotations?
does the longest sequence is the sequence that contains the highest number of isoforms?
The longest sequence is the sequence with the most bases. Please read up on de novo RNAseq processing methods and try something on your own before asking here.