I'm using data from e coli, which is a pretty much well studied organism and a nice annotation file(NC000913.3) can be downloaded from NCBI.
After I used cufflinks with the very basic command line(see below) and got the genes.fpkm_tracking.txt. I found 2576 features(or lines in cufflinks' output file )' fpkm, while the original annotation file consisted of ~4500 genes. In another dataset(different experiment) I used, cufflinks gave me 3502 features.
I found that part of the reason was that cufflinks grouped some genes together(assume they belonged to a single transcript ??!!). And I think another explanation is that the 0 count genes were not reported.
Do you think it's a reasonable analysis result?
Command line cufflinks -o <out_dir >="" -g="" <annotation.gff=""> <aligned.bam>
Example lines of cufflinks output: tracking_id class_code nearest_ref_id gene_id gene_short_name tss_id locus length coverage FPKM FPKM_conf_lo FPKM_conf_hi FPKM_status
CUFF.8 - - CUFF.8 fixC,fixX - NC_000913.3:44179-45750 - - 10.5497 1.36756 19.7577 OK
gene44 - - gene44 yaaU - NC_000913.3:45806-47138 - - 1.98785 0.960989 3.00309 OK
CUFF.9 - - CUFF.9 kefC,kefF - NC_000913.3:47245-49631 - - 49.5173 34.9478 49.0769 OK
CUFF.10 - - CUFF.10 folA - NC_000913.3:49822-50302 - - 0 0 1.00003 OK
CUFF.11 - - CUFF.11 - - NC_000913.3:50567-57253 - - 116.716 112.037 122.431 OK
gene48 - - gene48 apaH - NC_000913.3:50379-51222 - - 0 0 0.379608 OK
gene49 - - gene49 apaG - NC_000913.3:51228-51606 - - 0 0 0.846585 OK
CUFF.12 - - CUFF.12 pdxA,rsmA,surA - NC_000913.3:51608-54702 - - 1039.98 872.984 933.547 OK
Actually the aligner is not so much a problem because there is few splice reads reported by tophat(as one should expect for a bacteria data). I agree with you about the fact that some genes were bundled together due to their close locations and maybe relatively short length. I think the
--overlap-radius
–min-intron-length
default values are the reason for that merge.I've been used HISAT and Rockhopper recently. Did not know much about StringTie but I'll check their protocol for details.
Thanks a lot for the advice.
Hi panda,
Yes, and you can slightly improve the analysis by reducing the values of the overlap radius and min intron length, but you'll still see some genes merged.
I have not actually used StringTie unfortunately, so, cannot comment further. These programs have obviously mostly been designed with the human genome/transcriptome in mind.
Rockhopper was quite useful to me. Once you get your significant sequences, you can BLASTx them in order to infer functionality.