Question

Mapping a de novo built transcriptome to a reference transcriptome

1

Entering edit mode

9.9 years ago

cyril-cros ▴ 950

Hi,

I am interested in building an updated annotation of the mouse olfactory receptor genes. They have a tissue specific expression obviously, with various 3' isoforms, so the current annotation is not very good. Hand annotation using IGV to visualize my data is not an option with this gene family ;)

I have mapped RNA Seq paired ends reads from olfactory epithelia to the mm10 genome with STAR, and then I used a new tool called IsoSCM to obtain a more precise annotation (as a GTF file). It is a de novo approach.

My problem is that I can't easily (it seems) merge this new annotation with the 'official' one (mm10.gtf), which has less precise 5' and 3' (as seen in IGV) but includes Ensembl Gene IDs which I would very much like to have. I am including examples below of both syntaxes.

I would like to know if there is a tool that allows to add the Ensembl Gene IDs to a de novo transcriptome according to the features' position in the genome. I think someone already had to answer this type of question....

Otherwise, I can code simple stuff such as sorting features by their position in the genome and finding reasonable rules for merging (a feature from the new genome which has the same start or end position in mm10.gtf could inherit the mm10.gtf gene ID). Determining the feature type (5', CDS, exon, 3') is a bit more complex but possible.

Here is the mm10.gtf annotation:

[cyril@synapse ~]$ head -20 mm10.gtf
chr1    mm10_ensGene    stop_codon    134202951    134202953    0.000000    -    .    gene_id "ENSMUST00000086465"; transcript_id "ENSMUST00000086465"; 
chr1    mm10_ensGene    CDS    134202954    134203590    0.000000    -    1    gene_id "ENSMUST00000086465"; transcript_id "ENSMUST00000086465"; 
chr1    mm10_ensGene    exon    134199223    134203590    0.000000    -    .    gene_id "ENSMUST00000086465"; transcript_id "ENSMUST00000086465"; 
chr1    mm10_ensGene    CDS    134234015    134234355    0.000000    -    0    gene_id "ENSMUST00000086465"; transcript_id "ENSMUST00000086465"; 
chr1    mm10_ensGene    start_codon    134234353    134234355    0.000000    -    .    gene_id "ENSMUST00000086465"; transcript_id "ENSMUST00000086465"; 
chr1    mm10_ensGene    exon    134234015    134234412    0.000000    -    .    gene_id "ENSMUST00000086465"; transcript_id "ENSMUST00000086465"; 
chr1    mm10_ensGene    exon    134235228    134235431    0.000000    -    .    gene_id "ENSMUST00000086465"; transcript_id "ENSMUST00000086465";

And the one I obtain with IsoSCM:

[cyril@synapse ~]$ head -20 isoSCM.gtf 
chr4_JH584293_random    sol    exon    10822    11047    0.0    +    .    locus_id "locus.00000000";type "5p_exon"
chr4_JH584293_random    sol    exon    11191    11251    0.0    +    .    locus_id "locus.00000000";type "internal_exon"
chr4_JH584293_random    sol    exon    12077    12246    0.0    +    .    locus_id "locus.00000000";type "internal_exon"
chr4_JH584293_random    sol    exon    12375    12489    0.0    +    .    locus_id "locus.00000000";type "internal_exon"
chr4_JH584293_random    sol    exon    12643    12675    0.0    +    .    locus_id "locus.00000000";type "internal_exon"
chr4_JH584293_random    sol    exon    12754    12920    0.0    +    .    locus_id "locus.00000000";type "internal_exon"
chr4_JH584293_random    sol    exon    13477    13640    0.0    +    .    locus_id "locus.00000000";type "internal_exon"
chr4_JH584293_random    sol    exon    14817    14989    0.0    +    .    locus_id "locus.00000000";type "3p_exon"
chr4_JH584293_random    sol    exon    15884    15933    0.0    +    .    locus_id "locus.00000001";type "5p_exon"

Thanks for your help!

transcript-assembly de-novo-transcriptome RNA-Seq • 2.7k views

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by cyril-cros ▴ 950

Ram · Accepted Answer · 2015-03-24

2

Entering edit mode

9.9 years ago

poisonAlien ★ 3.2k

Use cuffcompare to compare your de novo gtf with mm10 gtf. Then look for class code j, which I think stands for novel isoform. You can play around.

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by poisonAlien ★ 3.2k

0

Entering edit mode

Thanks a lot for the idea. Cuffcompare runs quickly and successfully, even if the input GTF is not a cufflinks output. The class code is also a nice idea, I can filter all unusual isoforms.

I will run comparisons using IGV and update this thread when I am done.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by cyril-cros ▴ 950

0

Entering edit mode

After a few trials: I have roughly what I want, but some 3'UTR are truncated by cufflinks.

It seems bedtools might offer a more hands-on solution, I will look into that.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by cyril-cros ▴ 950