Question

How can I find out which gene each Trinity ID corresponds to?

2

9.6 years ago

tiago211287 ★ 1.5k

I performed a de novo assembly with Trinity using mouse heart RNA-seq reads. Then I mapped the assembled transcriptome back to the reference genome with BLAT. I also used Kallisto to quantify transcript abundance in each sample. Now I want to know which Trinity IDs correspond to genes already in the annotation (and what their names are), and which ones are not annotated. How can I do that?

Blat Trinity Denovo assembly rnaseq • 6.3k views

ADD COMMENT • link updated 9.6 years ago by cyril-cros ▴ 950 • written 9.6 years ago by tiago211287 ★ 1.5k

0

I have never done this myself, but I guess bedtools / bedops can provide overlaps between the transcriptome mapping and the mouse annotation.

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 9.6 years ago by h.mon 35k

0

9.6 years ago

cyril-cros ▴ 950

Disclaimer: I forgot that Trinity outputs a FASTA file and not a GTF or BED file. This is a bad answer to the question, but it might be useful to someone.

I had the same issue with another tool. Your best bet is to use bedtools/bedops. My scripts are not really portable (I am working on it; who knows, it could become a small methods article), but the steps are:

I use bedtools merge to collapse the transcripts of each de novo gene into a single maximum-length transcript with no introns (minimum start position, maximum end position).
I do the same with the official annotation.
I use bedtools intersect to get what is hopefully a one-to-one correspondence.

Caveats:

You need to use the -s (strand-specific) flag.
Check whether you have a true one-to-one correspondence: are there unassigned transcripts and, more importantly, do several genes overlap the same transcript? If you are unlucky and have very similar sequences close together, you may get fused transcripts where the alignment software misplaces one read of a pair. The assembly software then outputs a single, really long gene with lots of introns instead of separate genes. The alignment software should have a maximum intron size option you can adjust (conversely, if it is set too short, a gene with a large intron gets split into two genes).
Each gene has several transcripts because of alternative splicing, polyadenylation and alternative transcription start sites (TSS). Merging the transcripts resolves this issue for me.
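
To make the merge-then-intersect idea concrete, here is a rough Python sketch of the logic only. It is not my actual scripts and it does not call bedtools. Assumptions on my side: the de novo transcripts are already in BED-like tuples (0-based, half-open), transcript names follow the Trinity convention of ending in an isoform suffix such as _i1, each de novo gene maps to a single locus, and the annotation gene names (GeneA, GeneB, GeneC) are made up for illustration.

```python
import re


def trinity_gene_id(transcript_id):
    """Drop the trailing isoform suffix (_i<N>) to get the Trinity gene ID."""
    return re.sub(r"_i\d+$", "", transcript_id)


def collapse_to_gene_spans(records):
    """Collapse (chrom, start, end, name, strand) records into one span per gene.

    The span runs from the minimum start to the maximum end of all isoforms.
    Raises ValueError if isoforms of one gene sit on different chromosomes or
    strands, because the single-locus assumption is then violated.
    """
    spans = {}
    for chrom, start, end, name, strand in records:
        gene = trinity_gene_id(name)
        if gene not in spans:
            spans[gene] = (chrom, start, end, strand)
            continue
        g_chrom, g_start, g_end, g_strand = spans[gene]
        if (g_chrom, g_strand) != (chrom, strand):
            raise ValueError(f"{gene} maps to more than one locus")
        spans[gene] = (chrom, min(g_start, start), max(g_end, end), strand)
    return spans


def overlaps(a, b, same_strand=True):
    """True if two (chrom, start, end, strand) half-open intervals overlap."""
    if a[0] != b[0]:
        return False
    if same_strand and a[3] != b[3]:
        return False
    return a[1] < b[2] and b[1] < a[2]


def classify(denovo_spans, annotation_spans, same_strand=True):
    """Label each de novo gene as unique, ambiguous or unassigned."""
    result = {}
    for gene, span in denovo_spans.items():
        hits = sorted(
            name
            for name, ann in annotation_spans.items()
            if overlaps(span, ann, same_strand)
        )
        if not hits:
            status = "unassigned"   # candidate novel locus
        elif len(hits) == 1:
            status = "unique"
        else:
            status = "ambiguous"    # possibly a fused transcript
        result[gene] = (status, hits)
    return result


# Toy data: coordinates and annotation names are invented for illustration.
denovo = collapse_to_gene_spans([
    ("chr1", 100, 500, "TRINITY_DN1_c0_g1_i1", "+"),
    ("chr1", 300, 900, "TRINITY_DN1_c0_g1_i2", "+"),
    ("chr2", 50, 200, "TRINITY_DN2_c0_g1_i1", "-"),
    ("chr3", 1000, 5000, "TRINITY_DN3_c0_g1_i1", "+"),
])
annotation = {
    "GeneA": ("chr1", 150, 950, "+"),
    "GeneB": ("chr3", 900, 2000, "+"),
    "GeneC": ("chr3", 3000, 6000, "+"),
}
for gene, (status, hits) in classify(denovo, annotation).items():
    print(gene, status, ",".join(hits) or "-")
```

Passing same_strand=False ignores strand, which is the equivalent of running the intersect without -s. A de novo gene that overlaps two annotated genes gets flagged as ambiguous, which is where I would go and look in IGV.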

I would first take a look at what Cufflinks does, since it is pretty good at assembling transcripts with a reference. Its 3' UTRs are often unreliable, though. In any case, I often use IGV to look at my reads, and I have good depth to start with after pooling several biological replicates.

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 9.6 years ago by cyril-cros ▴ 950

0

Questions:
My data is not strand-specific. Must I still use the -s flag? Will this be a problem?

After the BLAT alignment I got PSL files. Is there a tool to convert them to BED?

ADD REPLY • link 9.6 years ago by tiago211287 ★ 1.5k

0

Same with mine. Transcripts generated by Trinity are strand-specific. However, you are including reads that may be the product of antisense transcription.

EDIT: I made a mistake: Trinity outputs a FASTA file, not a GTF or BED file.

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 9.6 years ago by cyril-cros ▴ 950

Ram · Accepted Answer · 2016-01-04

2

9.6 years ago

cyril-cros ▴ 950

Now, for a correct answer.

You are working with mouse, which is a really well-annotated organism. Trinity will be imprecise. You would be much better off using Cufflinks with the reference genome and annotation; if you are looking for novel isoforms or things like that, it will find them for you. Are you trying to achieve something in particular?

Cufflinks will also give you the correspondence between its gene names and the official ones.

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 9.6 years ago by cyril-cros ▴ 950

0

I already made a pipeline using a reference, with the STAR aligner.

To find new things and study alternative splicing, I am performing de novo assembly because it is independent of the reference. I found a tool in bedops that converts PSL files to BED (psl2bed), so I thought I could follow your previous ideas with it.
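
For anyone curious what such a conversion involves, below is a minimal Python sketch that turns standard 21-column PSL lines into BED6-style tuples. It is not psl2bed and may not match its output. Assumptions on my side: the input is nucleotide BLAT output (single-character strand), header lines may or may not be present, and the score column (fraction of the query that matched, scaled to 0-1000) is my own illustrative choice. PSL target coordinates are already 0-based and half-open, so they carry over to BED unchanged.

```python
def psl_to_bed6(psl_lines):
    """Yield (chrom, start, end, name, score, strand) for each PSL alignment."""
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the optional PSL header and anything that is not a data row.
        if len(fields) != 21 or not fields[0].isdigit():
            continue
        matches = int(fields[0])
        strand = fields[8]
        q_name = fields[9]           # Trinity transcript ID
        q_size = int(fields[10])
        t_name = fields[13]          # chromosome / contig
        t_start = int(fields[15])
        t_end = int(fields[16])
        # Illustrative score: share of the query matched, capped at 1000.
        score = min(1000, round(1000 * matches / q_size)) if q_size else 0
        yield (t_name, t_start, t_end, q_name, score, strand)


# Invented example row; the values are for illustration only.
example = (
    "900\t10\t0\t0\t0\t0\t2\t3000\t+\tTRINITY_DN1_c0_g1_i1\t1000\t0\t910\t"
    "chr1\t195471971\t1000\t4910\t3\t300,300,310,\t0,300,600,\t1000,2500,4600,"
)
for bed in psl_to_bed6([example]):
    print("\t".join(str(x) for x in bed))
```

This only keeps the outer span of each alignment. The block structure (the last three PSL columns) would be needed for a BED12 file with exons.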

ADD REPLY • link 9.6 years ago by tiago211287 ★ 1.5k

1

Just be careful: BLAT shows you similar segments. Paralogous genes (similar copies elsewhere in the same genome) may be a problem here...

ADD REPLY • link 9.6 years ago by cyril-cros ▴ 950

0

I also used a tool called pslReps, which filters BLAT output down to the best hit for each query.

link: https://github.com/ENCODE-DCC/kentUtils/tree/master/src/hg/pslReps
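
The general idea of keeping one best alignment per query can be sketched in Python as below. This is a simplification and not how pslReps itself scores or filters alignments. Assumptions on my side: the score is simply matches + repMatches - misMatches, and ties are resolved by keeping the first alignment seen.

```python
def best_hit_per_query(psl_lines):
    """Return {query_name: best PSL line} using a simple match-based score."""
    best = {}
    for line in psl_lines:
        row = line.rstrip("\n")
        fields = row.split("\t")
        # Skip the optional PSL header and anything that is not a data row.
        if len(fields) != 21 or not fields[0].isdigit():
            continue
        matches, mismatches, rep_matches = (int(x) for x in fields[:3])
        score = matches + rep_matches - mismatches
        q_name = fields[9]
        # Strictly greater: on a tie the earlier alignment is kept.
        if q_name not in best or score > best[q_name][0]:
            best[q_name] = (score, row)
    return {q_name: row for q_name, (score, row) in best.items()}
```

With paralogs in mind, it is probably worth also recording how close the second-best score is, since a near-tie means the placement of that Trinity transcript is uncertain.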

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 9.6 years ago by tiago211287 ★ 1.5k