Warning in CuffDiff when using novel transcripts from CuffMerge as input
0
1
Entering edit mode
10.5 years ago
komal.rathi ★ 4.1k

NOTE: cross posting from seqanswers as I received no answers there and couldn't delete my post there because I don't know how to.

Hi everyone,

I am doing a differential gene expression test on RNASeq data from 64 Human Heart samples as follows:

cufflinks --library-type fr-firststrand -p 2 -o cufflinks_out <bam_file>

This gave me 64 transcripts.gtf files, one per sample. I created a list of the file names and saved it to assembly_list.txt

I then downloaded three GTF files from Gencode v19 : 1. Protein Coding Genes 2. Long Non-coding RNAs and 3. Pseudogenes. I merged the three gtf files, extracted out just the entries matching exon and CDS in the third column and used it as input for cuffmerge as follows.

cuffmerge -o cuffmerge_out -g gencode.v19.protein.ncRNA.pseudos_exonCDS.gtf -s hg19.fa -p 25 assembly_list.txt

and then I extracted the novel transcripts matching class_code="u":

chr1    Cufflinks       exon    16971271        16972173        .       +       .       gene_id "XLOC_000205"; transcript_id "TCONS_00001942"; exon_number "1"; oId "CUFF.386.1"; class_code "u"; tss_id "TSS739";
chr1    Cufflinks       exon    16972864        16972878        .       +       .       gene_id "XLOC_000205"; transcript_id "TCONS_00001942"; exon_number "2"; oId "CUFF.386.1"; class_code "u"; tss_id "TSS739";
chr1    Cufflinks       exon    20609243        20609496        .       +       .       gene_id "XLOC_000241"; transcript_id "TCONS_00002229"; exon_number "1"; oId "CUFF.393.1"; class_code "u"; tss_id "TSS855";
chr1    Cufflinks       exon    20611093        20611502        .       +       .       gene_id "XLOC_000241"; transcript_id "TCONS_00002229"; exon_number "2"; oId "CUFF.393.1"; class_code "u"; tss_id "TSS855";
chr1    Cufflinks       exon    33925677        33925957        .       +       .       gene_id "XLOC_000396"; transcript_id "TCONS_00003679"; exon_number "1"; oId "CUFF.651.1"; class_code "u"; tss_id "TSS1384";
chr1    Cufflinks       exon    33926256        33926832        .       +       .       gene_id "XLOC_000396"; transcript_id "TCONS_00003679"; exon_number "2"; oId "CUFF.651.1"; class_code "u"; tss_id "TSS1384";
chr1    Cufflinks       exon    56905467        56905577        .       +       .       gene_id "XLOC_000616"; transcript_id "TCONS_00005969"; exon_number "1"; oId "CUFF.1131.1"; class_code "u"; tss_id "TSS2162";
chr1    Cufflinks       exon    56910015        56910124        .       +       .       gene_id "XLOC_000616"; transcript_id "TCONS_00005969"; exon_number "2"; oId "CUFF.1131.1"; class_code "u"; tss_id "TSS2162";

and used it in cuffdiff as follows:

cuffdiff -o cuffdiff_out --library-type fr-firststrand -L Control,HF --multi-read-correct --upper-quartile-norm -b hg19.fa --num-threads=20 novels.gtf <control1_bam,control2_bam,...> <hf1_bam,hf2_bam,...>

This is giving me a warning repeatedly and is not moving ahead:

Warning: Your version of Cufflinks is not up-to-date. It is recommended that you upgrade to Cufflinks v2.2.1 to benefit from the most recent features and bug fixes ([url]http://cufflinks.cbcb.umd.edu[/url]).
[09:32:46] Loading reference annotation and sequence.
[09:33:06] Inspecting maps and determining fragment length distributions.
Warning: Using default Gaussian distribution due to insufficient paired-end reads in open ranges.  It is recommended that correct parameters (--frag-len-mean and --frag-len-std-dev) be provided.
Warning: Using default Gaussian distribution due to insufficient paired-end reads in open ranges.  It is recommended that correct parameters (--frag-len-mean and --frag-len-std-dev) be provided.
Warning: Using default Gaussian distribution due to insufficient paired-end reads in open ranges.  It is recommended that correct parameters (--frag-len-mean and --frag-len-std-dev) be provided.
Warning: Using default Gaussian distribution due to insufficient paired-end reads in open ranges.  It is recommended that correct parameters (--frag-len-mean and --frag-len-std-dev) be provided.

I have never come across this kind of warning before. Any suggestions/help would be appreciated.

UPDATE 1:

I changed my GTF file to include some coding transcripts (from gencode v19) plus some novels (from cuffmerge) and it is still showing me the same warnings. This is how my new sample GTF looks like:

chr1    HAVANA  UTR     70006   70008   .       +       .       gene_id "ENSG00000186092.4"; transcript_id "ENST00000335137.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "OR4F5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "OR4F5-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS30547.1"; havana_gene "OTTHUMG00000001094.1"; havana_transcript "OTTHUMT00000003223.1";
chr1    ENSEMBL gene    134901  139379  .       -       .       gene_id "ENSG00000237683.5"; transcript_id "ENSG00000237683.5"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "AL627309.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "AL627309.1"; level 3;
chr1    ENSEMBL transcript      134901  139379  .       -       .       gene_id "ENSG00000237683.5"; transcript_id "ENST00000423372.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "AL627309.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "AL627309.1-201"; level 3; tag "basic"; tag "appris_principal";
chr1    ENSEMBL exon    137621  139379  .       -       .       gene_id "ENSG00000237683.5"; transcript_id "ENST00000423372.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "AL627309.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "AL627309.1-201"; exon_number 1;  exon_id "ENSE00002221580.1";  level 3; tag "basic"; tag "appris_principal";
chr1    Cufflinks       exon    16971271        16972173        .       +       .       gene_id "XLOC_000205"; transcript_id "TCONS_00001942"; exon_number "1"; oId "CUFF.386.1"; class_code "u"; tss_id "TSS739";
chr1    Cufflinks       exon    16972864        16972878        .       +       .       gene_id "XLOC_000205"; transcript_id "TCONS_00001942"; exon_number "2"; oId "CUFF.386.1"; class_code "u"; tss_id "TSS739";

UPDATE 2:

And so, I removed novel transcripts and used just a bunch of coding transcripts from chr1:

chr1    HAVANA  UTR     70006   70008   .       +       .       gene_id "ENSG00000186092.4"; transcript_id "ENST00000335137.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "OR4F5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "OR4F5-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS30547.1"; havana_gene "OTTHUMG00000001094.1"; havana_transcript "OTTHUMT00000003223.1";
chr1    ENSEMBL gene    134901  139379  .       -       .       gene_id "ENSG00000237683.5"; transcript_id "ENSG00000237683.5"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "AL627309.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "AL627309.1"; level 3;
chr1    ENSEMBL transcript      134901  139379  .       -       .       gene_id "ENSG00000237683.5"; transcript_id "ENST00000423372.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "AL627309.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "AL627309.1-201"; level 3; tag "basic"; tag "appris_principal";
chr1    ENSEMBL exon    137621  139379  .       -       .       gene_id "ENSG00000237683.5"; transcript_id "ENST00000423372.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "AL627309.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "AL627309.1-201"; exon_number 1;  exon_id "ENSE00002221580.1";  level 3; tag "basic"; tag "appris_principal";

Still giving me the same warnings. So at least I know that nothing is specifically wrong with the novel transcripts.

UPDATE 3:

I started to think that there must be some problem with chromosome 1, so I just extracted chromosome 22 entries for coding and novel transcripts from the GTF file and used it in cuffdiff but it is still showing me the same warning.

UPDATE 4:

I also used coding-transcripts from all chromosomes and novel-transcripts from all chromosomes and ran cuffdiff. This time it went past estimating the abundances stage, and onto the tests for differential expression and regulation but it is somehow stuck on chr1 since a couple of days now.

I am so confused. Cuffdiff shows me the same warnings when:

  1. I use all novel transcripts from all chromosomes
  2. I use a bunch of chr1 coding and novel transcripts
  3. I used a bunch of chr1 coding transcripts alone
  4. I use all of chr22 coding and novel transcripts

But does not show me any warnings when I use novel and coding transcripts from all chromosomes but instead gets stuck at the differential gene expression stage.

Thanks!

cuffdiff novel-transcripts • 4.8k views
ADD COMMENT
0
Entering edit mode

I actually meant to reply on seqanswers but must have forgotten :P

Do you have paired-end or single-end reads? How many BAM files are you using? That warning message (or at least one really similar) will happen if you have single-end reads, but should be pretty rare if you actually do have paired-end reads.

In any case, the warning message shouldn't be the underlying problem. Is cuffdiff still using a bunch of cores or does it seem to have stopped?

ADD REPLY
0
Entering edit mode

Hi, I have 64 paired-end strand-specific BAM files (32 controls and 32 cases). I agree, the warning should only pop up if I have single-end reads, but that's not the case here. I killed the program after waiting for a couple of days because it was still using the cores but not proceeding further.

ADD REPLY
0
Entering edit mode

Regarding the warning message, you might have a look at the alignments in IGV. Cuffdiff is looking for fragments that aren't expected to cross splice junctions, such that it can estimate the fragment length distribution, which is needed to accurately assign fractional fragment counts to multiple transcripts (since this distribution can act as a prior). You should be able to find a large chunk of pairs that align near each other with no apparent splicing going on. If not, then something weird is probably going on and that's what's borking cuffdiff.

ADD REPLY
0
Entering edit mode

Did you try running cufflinks assembly using reference Gencode V19? The command line you provided does not seems to have a GTF file:

cufflinks --library-type fr-firststrand -p 2 -o cufflinks_out <bam_file>

I know it is optional. But if you have already decided to use Gencode annotation why don't you try using RABT assembly (cufflinks guided mode, -g/--GTF-guide) to avoid complexions in later steps.

My pipeline: The following steps worked for me using Gencode v11, v17, v19 and v20

Note: Gencode v20 had some errors. You can correct it by using this simple command: sed -i 's/\"\ tag\ \"PAR\"\;/\"\;\ tag\ \"PAR\"\;/g' gencode.v20.annotation.gtf

  1. Merge all BAM files, if possible (I have tried till 120 samples) and do Cufflinks RABT assembly using reference GTF (In my case Gencode) on merged BAM.
  2. Do cuffdiff with Obtained GTF.

OR if you have thousands of BAMs

  1. Cufflinks RABT assembly using reference GTF (In my case Gencode) for individual BAMs.
  2. Cuffmerge
  3. Cuffdiff

Besk of luck.

ADD REPLY
0
Entering edit mode

Hi,I have used the same method which is mentioned by Santhilal Subhash but still I am getting the same warning message. Anybody can help me with it?

Thanks

ADD REPLY
0
Entering edit mode

Just use Ensembl instead of Gencode.

ADD REPLY

Login before adding your answer.

Traffic: 1663 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6