NOTE: cross posting from seqanswers as I received no answers there and couldn't delete my post there because I don't know how to.
Hi everyone,
I am doing a differential gene expression test on RNASeq data from 64 Human Heart samples as follows:
cufflinks --library-type fr-firststrand -p 2 -o cufflinks_out <bam_file>
This gave me 64 transcripts.gtf
files, one per sample. I created a list of the file names and saved it to assembly_list.txt
I then downloaded three GTF files from Gencode v19 : 1. Protein Coding Genes 2. Long Non-coding RNAs and 3. Pseudogenes. I merged the three gtf files, extracted out just the entries matching exon and CDS in the third column and used it as input for cuffmerge as follows.
cuffmerge -o cuffmerge_out -g gencode.v19.protein.ncRNA.pseudos_exonCDS.gtf -s hg19.fa -p 25 assembly_list.txt
and then I extracted the novel transcripts matching class_code="u"
:
chr1 Cufflinks exon 16971271 16972173 . + . gene_id "XLOC_000205"; transcript_id "TCONS_00001942"; exon_number "1"; oId "CUFF.386.1"; class_code "u"; tss_id "TSS739";
chr1 Cufflinks exon 16972864 16972878 . + . gene_id "XLOC_000205"; transcript_id "TCONS_00001942"; exon_number "2"; oId "CUFF.386.1"; class_code "u"; tss_id "TSS739";
chr1 Cufflinks exon 20609243 20609496 . + . gene_id "XLOC_000241"; transcript_id "TCONS_00002229"; exon_number "1"; oId "CUFF.393.1"; class_code "u"; tss_id "TSS855";
chr1 Cufflinks exon 20611093 20611502 . + . gene_id "XLOC_000241"; transcript_id "TCONS_00002229"; exon_number "2"; oId "CUFF.393.1"; class_code "u"; tss_id "TSS855";
chr1 Cufflinks exon 33925677 33925957 . + . gene_id "XLOC_000396"; transcript_id "TCONS_00003679"; exon_number "1"; oId "CUFF.651.1"; class_code "u"; tss_id "TSS1384";
chr1 Cufflinks exon 33926256 33926832 . + . gene_id "XLOC_000396"; transcript_id "TCONS_00003679"; exon_number "2"; oId "CUFF.651.1"; class_code "u"; tss_id "TSS1384";
chr1 Cufflinks exon 56905467 56905577 . + . gene_id "XLOC_000616"; transcript_id "TCONS_00005969"; exon_number "1"; oId "CUFF.1131.1"; class_code "u"; tss_id "TSS2162";
chr1 Cufflinks exon 56910015 56910124 . + . gene_id "XLOC_000616"; transcript_id "TCONS_00005969"; exon_number "2"; oId "CUFF.1131.1"; class_code "u"; tss_id "TSS2162";
and used it in cuffdiff as follows:
cuffdiff -o cuffdiff_out --library-type fr-firststrand -L Control,HF --multi-read-correct --upper-quartile-norm -b hg19.fa --num-threads=20 novels.gtf <control1_bam,control2_bam,...> <hf1_bam,hf2_bam,...>
This is giving me a warning repeatedly and is not moving ahead:
Warning: Your version of Cufflinks is not up-to-date. It is recommended that you upgrade to Cufflinks v2.2.1 to benefit from the most recent features and bug fixes ([url]http://cufflinks.cbcb.umd.edu[/url]).
[09:32:46] Loading reference annotation and sequence.
[09:33:06] Inspecting maps and determining fragment length distributions.
Warning: Using default Gaussian distribution due to insufficient paired-end reads in open ranges. It is recommended that correct parameters (--frag-len-mean and --frag-len-std-dev) be provided.
Warning: Using default Gaussian distribution due to insufficient paired-end reads in open ranges. It is recommended that correct parameters (--frag-len-mean and --frag-len-std-dev) be provided.
Warning: Using default Gaussian distribution due to insufficient paired-end reads in open ranges. It is recommended that correct parameters (--frag-len-mean and --frag-len-std-dev) be provided.
Warning: Using default Gaussian distribution due to insufficient paired-end reads in open ranges. It is recommended that correct parameters (--frag-len-mean and --frag-len-std-dev) be provided.
I have never come across this kind of warning before. Any suggestions/help would be appreciated.
UPDATE 1:
I changed my GTF file to include some coding transcripts (from gencode v19) plus some novels (from cuffmerge) and it is still showing me the same warnings. This is how my new sample GTF looks like:
chr1 HAVANA UTR 70006 70008 . + . gene_id "ENSG00000186092.4"; transcript_id "ENST00000335137.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "OR4F5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "OR4F5-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS30547.1"; havana_gene "OTTHUMG00000001094.1"; havana_transcript "OTTHUMT00000003223.1";
chr1 ENSEMBL gene 134901 139379 . - . gene_id "ENSG00000237683.5"; transcript_id "ENSG00000237683.5"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "AL627309.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "AL627309.1"; level 3;
chr1 ENSEMBL transcript 134901 139379 . - . gene_id "ENSG00000237683.5"; transcript_id "ENST00000423372.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "AL627309.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "AL627309.1-201"; level 3; tag "basic"; tag "appris_principal";
chr1 ENSEMBL exon 137621 139379 . - . gene_id "ENSG00000237683.5"; transcript_id "ENST00000423372.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "AL627309.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "AL627309.1-201"; exon_number 1; exon_id "ENSE00002221580.1"; level 3; tag "basic"; tag "appris_principal";
chr1 Cufflinks exon 16971271 16972173 . + . gene_id "XLOC_000205"; transcript_id "TCONS_00001942"; exon_number "1"; oId "CUFF.386.1"; class_code "u"; tss_id "TSS739";
chr1 Cufflinks exon 16972864 16972878 . + . gene_id "XLOC_000205"; transcript_id "TCONS_00001942"; exon_number "2"; oId "CUFF.386.1"; class_code "u"; tss_id "TSS739";
UPDATE 2:
And so, I removed novel transcripts and used just a bunch of coding transcripts from chr1:
chr1 HAVANA UTR 70006 70008 . + . gene_id "ENSG00000186092.4"; transcript_id "ENST00000335137.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "OR4F5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "OR4F5-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS30547.1"; havana_gene "OTTHUMG00000001094.1"; havana_transcript "OTTHUMT00000003223.1";
chr1 ENSEMBL gene 134901 139379 . - . gene_id "ENSG00000237683.5"; transcript_id "ENSG00000237683.5"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "AL627309.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "AL627309.1"; level 3;
chr1 ENSEMBL transcript 134901 139379 . - . gene_id "ENSG00000237683.5"; transcript_id "ENST00000423372.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "AL627309.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "AL627309.1-201"; level 3; tag "basic"; tag "appris_principal";
chr1 ENSEMBL exon 137621 139379 . - . gene_id "ENSG00000237683.5"; transcript_id "ENST00000423372.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "AL627309.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "AL627309.1-201"; exon_number 1; exon_id "ENSE00002221580.1"; level 3; tag "basic"; tag "appris_principal";
Still giving me the same warnings. So at least I know that nothing is specifically wrong with the novel transcripts.
UPDATE 3:
I started to think that there must be some problem with chromosome 1, so I just extracted chromosome 22 entries for coding and novel transcripts from the GTF file and used it in cuffdiff but it is still showing me the same warning.
UPDATE 4:
I also used coding-transcripts from all chromosomes and novel-transcripts from all chromosomes and ran cuffdiff. This time it went past estimating the abundances stage, and onto the tests for differential expression and regulation but it is somehow stuck on chr1 since a couple of days now.
I am so confused. Cuffdiff shows me the same warnings when:
- I use all novel transcripts from all chromosomes
- I use a bunch of chr1 coding and novel transcripts
- I used a bunch of chr1 coding transcripts alone
- I use all of chr22 coding and novel transcripts
But does not show me any warnings when I use novel and coding transcripts from all chromosomes but instead gets stuck at the differential gene expression stage.
Thanks!
I actually meant to reply on seqanswers but must have forgotten :P
Do you have paired-end or single-end reads? How many BAM files are you using? That warning message (or at least one really similar) will happen if you have single-end reads, but should be pretty rare if you actually do have paired-end reads.
In any case, the warning message shouldn't be the underlying problem. Is cuffdiff still using a bunch of cores or does it seem to have stopped?
Hi, I have 64 paired-end strand-specific BAM files (32 controls and 32 cases). I agree, the warning should only pop up if I have single-end reads, but that's not the case here. I killed the program after waiting for a couple of days because it was still using the cores but not proceeding further.
Regarding the warning message, you might have a look at the alignments in IGV. Cuffdiff is looking for fragments that aren't expected to cross splice junctions, such that it can estimate the fragment length distribution, which is needed to accurately assign fractional fragment counts to multiple transcripts (since this distribution can act as a prior). You should be able to find a large chunk of pairs that align near each other with no apparent splicing going on. If not, then something weird is probably going on and that's what's borking cuffdiff.
Did you try running cufflinks assembly using reference Gencode V19? The command line you provided does not seems to have a GTF file:
I know it is optional. But if you have already decided to use Gencode annotation why don't you try using RABT assembly (cufflinks guided mode,
-g
/--GTF-guide
) to avoid complexions in later steps.My pipeline: The following steps worked for me using Gencode v11, v17, v19 and v20
Note: Gencode v20 had some errors. You can correct it by using this simple command:
sed -i 's/\"\ tag\ \"PAR\"\;/\"\;\ tag\ \"PAR\"\;/g' gencode.v20.annotation.gtf
OR if you have thousands of BAMs
Besk of luck.
Hi,I have used the same method which is mentioned by Santhilal Subhash but still I am getting the same warning message. Anybody can help me with it?
Thanks
Just use Ensembl instead of Gencode.