Hi all,
I'm having some difficulties in understanding how cufflinks and cuffcompare count the assembled transcripts. When I extract the transcript sequences into a FASTA file, in case of my data I get 15,127 sequences. The headers in FASTA file look like this, e.g:
>rna_NCLIV_068720-1 gene=NCLIV_068720 >rna_NCLIV_068730-1 gene=NCLIV_068730 >rna_NCLIV_068740-1 gene=NCLIV_068740 >rna_NCLIV_068750-1 gene=NCLIV_068750 >rna_NCLIV_068770-1 gene=NCLIV_068770 >CUFF.1.1 gene=CUFF.1 >rna_NCLIV_068790-1 gene=CUFF.1 >rna_NCLIV_068800-1 gene=NCLIV_068800 >rna_NCLIV_068810-1 gene=NCLIV_068810 >CUFF.2.1 gene=CUFF.2
After simple calculations I can say that these 15,127 transcripts are product of 8,087 genes.
Now when I run cuffcompare against the reference genome annotation, this is what I get:
#= Summary for dataset: ../cufflinks/cufflinks_out/transcripts.gtf : # Query mRNAs : 12904 in 7986 loci (9484 multi-exon transcripts) # (3050 multi-transcript loci, ~1.6 transcripts per locus) # Reference mRNAs : 7265 in 7265 loci (5738 multi-exon) # Super-loci w/ reference transcripts: 6832 #--------------------| Sn | Sp | fSn | fSp Base level: 99.7 52.2 - - Exon level: 90.7 70.2 91.1 70.5 Intron level: 99.9 86.4 100.0 86.5 Intron chain level: 90.4 54.7 100.0 65.8 Transcript level: 70.4 39.6 70.4 39.6 Locus level: 92.4 81.5 99.9 85.8 Matching intron chains: 5188 Matching loci: 6714 Missed exons: 53/43090 ( 0.1%) Novel exons: 4543/55662 ( 8.2%) Missed introns: 45/35825 ( 0.1%) Novel introns: 2455/41434 ( 5.9%) Missed loci: 8/7265 ( 0.1%) Novel loci: 1128/7986 ( 14.1%) Total union super-loci across all input datasets: 7986
The number of query mRNAs is estimated to be 12,904. How does this relate to my previous count of transcripts (15,127)?
The summary of cufflinks "class codes" also gives the same number - 12,904.
u 752 i 93 j 3737 x 135 o 218 c 2 p 307 = 7295 e 365
Any clue how these numbers (12,904 vs 15,127) should be interpreted?
Doesn't the GTF file you used to produce the fasta file contain the class codes? If so, just look and see if you have genes lacking codes and then have a look at those genes. I would guess that these are mRNAs with no supporting reads.