Hi everyone,
I am working with mm10 data and using the GRCm38 build 75 GTF from Ensembl. cuffdiff needs the tss_id and p_id attributes to be present for differential isoform expression analysis when you use any GTF other than the merged.gtf produced by Cufflinks. I am using the following command to add tss_id and p_id to my Ensembl GTF:
cuffcompare \
-o cuffcmp \
-C -G \
-r Mus_musculus.GRCm38.75.protein_linc.gtf \
-s mm10.fa \
Mus_musculus.GRCm38.75.protein_linc.gtf
To check whether I was doing it correctly, I checked the entries for a particular gene in both the input and output gtfs.
The 'gene' entry for Xkr4 in the original GTF looks like this:
chr1 protein_coding gene 3205901 3671498 . - . gene_id "ENSMUSG00000051951"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
These are the entries corresponding to the above coordinates in the output GTF, cuffcmp.combined.gtf:
chr1 processed_transcript exon 3205901 3207317 . - . gene_id "XLOC_000653"; transcript_id "TCONS_00002060"; exon_number "1"; gene_name "Xkr4"; oId "ENSMUST00000162897"; nearest_ref "ENSMUST00000162897"; class_code "="; tss_id "TSS1356";
chr1 processed_transcript exon 3213609 3216344 . - . gene_id "XLOC_000653"; transcript_id "TCONS_00002060"; exon_number "2"; gene_name "Xkr4"; oId "ENSMUST00000162897"; nearest_ref "ENSMUST00000162897"; class_code "="; tss_id "TSS1356";
chr1 processed_transcript exon 3206523 3207317 . - . gene_id "XLOC_000653"; transcript_id "TCONS_00002061"; exon_number "1"; gene_name "Xkr4"; oId "ENSMUST00000159265"; nearest_ref "ENSMUST00000159265"; class_code "="; tss_id "TSS1357";
chr1 processed_transcript exon 3213439 3215632 . - . gene_id "XLOC_000653"; transcript_id "TCONS_00002061"; exon_number "2"; gene_name "Xkr4"; oId "ENSMUST00000159265"; nearest_ref "ENSMUST00000159265"; class_code "="; tss_id "TSS1357";
chr1 protein_coding exon 3214482 3216968 . - . gene_id "XLOC_000653"; transcript_id "TCONS_00002062"; exon_number "1"; gene_name "Xkr4"; oId "ENSMUST00000070533"; nearest_ref "ENSMUST00000070533"; class_code "="; tss_id "TSS1358"; p_id "P1235";
chr1 protein_coding exon 3421702 3421901 . - . gene_id "XLOC_000653"; transcript_id "TCONS_00002062"; exon_number "2"; gene_name "Xkr4"; oId "ENSMUST00000070533"; nearest_ref "ENSMUST00000070533"; class_code "="; tss_id "TSS1358"; p_id "P1235";
chr1 protein_coding exon 3670552 3671498 . - . gene_id "XLOC_000653"; transcript_id "TCONS_00002062"; exon_number "3"; gene_name "Xkr4"; oId "ENSMUST00000070533"; nearest_ref "ENSMUST00000070533"; class_code "="; tss_id "TSS1358"; p_id "P1235";
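For anyone who wants to repeat this check, here is a minimal Python sketch of how the entries for one gene could be pulled out of either GTF. It assumes the files are standard tab-separated GTF with attributes written as `key "value";`. The function names `parse_attributes` and `records_for_gene` are my own for illustration, not part of cufflinks.

```python
import re

# Matches attribute pairs written as: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(attr_field):
    """Turn the 9th GTF column into a dict of attribute name -> value."""
    return dict(ATTR_RE.findall(attr_field))


def records_for_gene(gtf_lines, gene_name):
    """Yield (feature, start, end, attributes) for lines whose gene_name matches."""
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        attrs = parse_attributes(cols[8])
        if attrs.get("gene_name") == gene_name:
            yield cols[2], int(cols[3]), int(cols[4]), attrs
```

Calling `records_for_gene(open("cuffcmp.combined.gtf"), "Xkr4")` and printing `attrs.get("tss_id")` and `attrs.get("p_id")` for each record shows which transcripts received the new attributes.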
In the output, the gene_id field contains XLOC IDs instead of Ensembl IDs. Can I fix this so that it keeps the Ensembl IDs? Is there a better way to add tss_id and p_id to an Ensembl GTF?
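One workaround I can imagine, as a rough sketch only: since oId in the combined GTF holds the original Ensembl transcript ID, the Ensembl gene_id could be looked up from the reference GTF and written back in place of the XLOC ID. The Python below assumes tab-separated GTF and that every reference transcript line carries both gene_id and transcript_id. The function names are hypothetical, and I have not verified how cuffdiff behaves with the rewritten file.

```python
import re

# Matches attribute pairs written as: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def transcript_to_gene(reference_lines):
    """Map each transcript_id in the reference GTF to its gene_id."""
    mapping = {}
    for line in reference_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        if "transcript_id" in attrs and "gene_id" in attrs:
            mapping[attrs["transcript_id"]] = attrs["gene_id"]
    return mapping


def restore_gene_ids(combined_lines, mapping):
    """Yield combined-GTF lines with gene_id replaced via the oId lookup.

    Lines whose oId is missing from the mapping are passed through unchanged.
    """
    for line in combined_lines:
        stripped = line.rstrip("\n")
        cols = stripped.split("\t")
        if stripped.startswith("#") or len(cols) < 9:
            yield stripped
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        ensembl_gene = mapping.get(attrs.get("oId"))
        if ensembl_gene is not None:
            # Replace only the first gene_id attribute; leave tss_id/p_id as they are.
            cols[8] = re.sub(
                r'gene_id "[^"]*";',
                lambda m: 'gene_id "%s";' % ensembl_gene,
                cols[8],
                count=1,
            )
        yield "\t".join(cols)
```

One caveat with this idea: cuffcompare may place transcripts from more than one Ensembl gene under the same XLOC, so restoring the Ensembl gene_id changes the gene-level grouping that the tss_id and p_id values were assigned under.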
Thanks for the script, works great!