I am trying to do differential gene expression analysis, and analysis of 3'UTR usage, in a knock-down experiment in Zebrafish (RNA-seq, PE). Since I am interested in 3'UTR processing, I am using the GTF with improved 3'UTR annotation provided in Junker et al. Here is what it looks like (it also contains "regular" chr seqnames):
track name="tomo-seq extended gene models"
Zv9_NA250 Ensembl exon 3 727 . - . gene_id "ENSDARG00000091021"; transcript_id "ENSDART00000124805"; exon_number "2"; gene_biotype "protein_coding"; gene_name "CABZ01001466.1"; transcript_name "CABZ01001466.1-201"
Zv9_NA250 Ensembl exon 5774 5858 . - . gene_id "ENSDARG00000091021"; transcript_id "ENSDART00000124805"; exon_number "1"; gene_biotype "protein_coding"; gene_name "CABZ01001466.1"; transcript_name "CABZ01001466.1-201"
Zv9_NA250 Ensembl transcript 3 5858 . - . gene_id "ENSDARG00000091021"; transcript_id "ENSDART00000124805"; gene_biotype "protein_coding"; gene_name "CABZ01001466.1"; transcript_name "CABZ01001466.1-201"
Zv9_NA619 Ensembl exon 2 678 . - . gene_id "ENSDARG00000073818"; transcript_id "ENSDART00000111264"; exon_number "2"; gene_biotype "protein_coding"; gene_name "CABZ01031998.1"; transcript_name "CABZ01031998.1-201"
Zv9_NA619 Ensembl exon 4863 4947 . - . gene_id "ENSDARG00000073818"; transcript_id "ENSDART00000111264"; exon_number "1"; gene_biotype "protein_coding"; gene_name "CABZ01031998.1"; transcript_name "CABZ01031998.1-201"
Zv9_NA619 Ensembl transcript 2 4947 . - . gene_id "ENSDARG00000073818"; transcript_id "ENSDART00000111264"; gene_biotype "protein_coding"; gene_name "CABZ01031998.1"; transcript_name "CABZ01031998.1-201"
Zv9_NA979 Ensembl exon 353 499 . - . gene_id "ENSDARG00000089935"; transcript_id "ENSDART00000123414"; exon_number "2"; gene_biotype "protein_coding"; gene_name ""; transcript_name ""
Zv9_NA979 Ensembl exon 720 854 . - . gene_id "ENSDARG00000089935"; transcript_id "ENSDART00000123414"; exon_number "1"; gene_biotype "protein_coding"; gene_name ""; transcript_name ""
Zv9_NA979 Ensembl transcript 21 854 . - . gene_id "ENSDARG00000089935"; transcript_id "ENSDART00000123414"; gene_biotype "protein_coding"; gene_name ""; transcript_name ""
for the sake of consistency, the Tuxedo pipeline was used:
- mapping to Zv9 with Tophat;
- followed by RABT cufflinks2 on each sample to find new transcripts;
- cuffmerge to generated a new reference from all the transcripts.gtf and the Junker.gtf
- cuddfiff2 for differential expression.
The problem is that the results of differential CDS use are missing form the cuffdiff output (empty file). Tracked down the issue, and it stems from the lack of "p_id" in the merged.gtf from cuffmerge:
chr1 Cufflinks exon 1 635 . + . gene_id "XLOC_001204"; transcript_id "TCONS_00001298"; exon_number "1"; oId "CUFF.38.1"; class_code "u"; tss_id "TSS1231";
chr1 Cufflinks exon 3762 3930 . + . gene_id "XLOC_001205"; transcript_id "TCONS_00001300"; exon_number "1"; gene_name "F7 (3 of 3)"; oId "CUFF.40.2"; nearest_ref "ENSDART00000109479"; class_code "j"; tss_id "TSS1232";
chr1 Cufflinks exon 4646 4809 . + . gene_id "XLOC_001205"; transcript_id "TCONS_00001300"; exon_number "2"; gene_name "F7 (3 of 3)"; oId "CUFF.40.2"; nearest_ref "ENSDART00000109479"; class_code "j"; tss_id "TSS1232";
chr1 Cufflinks exon 5178 5202 . + . gene_id "XLOC_001205"; transcript_id "TCONS_00001300"; exon_number "3"; gene_name "F7 (3 of 3)"; oId "CUFF.40.2"; nearest_ref "ENSDART00000109479"; class_code "j"; tss_id "TSS1232";
chr1 Cufflinks exon 5533 5646 . + . gene_id "XLOC_001205"; transcript_id "TCONS_00001300"; exon_number "4"; gene_name "F7 (3 of 3)"; oId "CUFF.40.2"; nearest_ref "ENSDART00000109479"; class_code "j"; tss_id "TSS1232";
chr1 Cufflinks exon 6247 6393 . + . gene_id "XLOC_001205"; transcript_id "TCONS_00001300"; exon_number "5"; gene_name "F7 (3 of 3)"; oId "CUFF.40.2"; nearest_ref "ENSDART00000109479"; class_code "j"; tss_id "TSS1232";
chr1 Cufflinks exon 11050 11141 . + . gene_id "XLOC_001205"; transcript_id "TCONS_00001300"; exon_number "6"; gene_name "F7 (3 of 3)"; oId "CUFF.40.2"; nearest_ref "ENSDART00000109479"; class_code "j"; tss_id "TSS1232";
chr1 Cufflinks exon 11240 11360 . + . gene_id "XLOC_001205"; transcript_id "TCONS_00001300"; exon_number "7"; gene_name "F7 (3 of 3)"; oId "CUFF.40.2"; nearest_ref "ENSDART00000109479"; class_code "j"; tss_id "TSS1232";
chr1 Cufflinks exon 13073 15178 . + . gene_id "XLOC_001205"; transcript_id "TCONS_00001300"; exon_number "8"; gene_name "F7 (3 of 3)"; oId "CUFF.40.2"; nearest_ref "ENSDART00000109479"; class_code "j"; tss_id "TSS1232";
chr1 Cufflinks exon 3762 3930 . + . gene_id "XLOC_001205"; transcript_id "TCONS_00001299"; exon_number "1"; gene_name "F7 (3 of 3)"; oId "CUFF.40.1"; nearest_ref "ENSDART00000109479"; class_code "j"; tss_id "TSS1232";
This, despite the fact that cuffmerge was ran with the option -s fasta
. Also tried these cuffdiff's instructions:
Note: If an arbitrary GTF/GFF3 file is used as input (instead of the .combined.gtf file produced by Cuffcompare), these attributes will not be present, but Cuffcompare can still be used to obtain these attributes with a command like this:
cuffcompare -s /path/to/genome_seqs.fa -CG -r annotation.gtf annotation.gtf
But p_id
is still missing. This is an issue because the CDS information is quite critical for my analysis.
Does anyone have an idea of how to solve this? Is this because I am using a non UCSC/Ensembl annotation?
If you don't have annotated coding sequences in the annotation file then it's impossible to assign protein IDs (p_id). You need to add those entries first and then you'll be able to proceed. I'm not familiar with any tools to do that, though they presumably exist (maybe in GMOD?). If nothing else, you could always write something (use biopython/bioperl and find the longest ORF in each transcript), though that wouldn't be a trivial thing to write.
N.B., posting as a comment to not dissuade someone that knows of a simple solution from answering.
That is what I feared. Just occurred to me that maybe an Ensembl GTF could be used during the merge: it contains the p_id (or cds info); and Junker et al used one to base their gene models on. Cheers.
Just make sure to change the chromosome names, since Ensembl doesn't use the "chr" prefix.