Hello all,
So, Im running a test dataset available here
https://github.com/Arabinet/Introduction-to-Bioinformatics/wiki/Arabidopsis-RNA-Seq
through an RNAseq pipeline I hope to use for experimental data later. I have recently been working with HISAT2 and STRINGTIE and have generated 18 gtf files. Now I would like to get a sense of how many transcripts are being predicted in my merged-gtf produced in STRINGTIE and the one produced by GFFCOMPARE. The odd part is that gffcompare, using the -r option does not give me any accuracy/precision estimates in its .stats file when I provide it with my list of 18 gtf files. Here is the output it gives me:
# gffcompare v0.10.6 | Command line was:
#gffcompare -r Arabidopsis_thaliana.TAIR10.42.gff3 -i list_of_stringtie_gtfs.txt
#
Total union super-loci across all input datasets: 1901
(64 multi-transcript, ~1.2 transcripts per locus)
2206 out of 2206 consensus transcripts written in gffcmp.combined.gtf (0 discarded as redundant)
Note that I am here using the gff3 file found here ftp://ftp.ensemblgenomes.org/pub/release-42/plants/gff3/arabidopsis_thaliana rather than the gtf file in that first links Arabidopsis test data under the assumption that maybe my gtf file was the problem, but it gives me the same output. Have I missed an option or something? Can anyone guess why I wouldnt get output including something like this in my stats file:
/assembled/11-B5690_S3_mRNA_ARGU1_NGSLib3b.gtf
Query mRNAs : 136272 in 81266 loci (111377 multi-exon transcripts)
(20690 multi-transcript loci, ~1.7 transcripts per locus)
Reference mRNAs : 136039 in 81267 loci (111148 multi-exon)
Super-loci w/ reference transcripts: 80037
-----------------| Sensitivity | Precision |
Base level: 100.0 | 100.0 |
Exon level: 100.0 | 100.0 |
Intron level: 99.5 | 99.5 |
Intron chain level: 100.0 | 99.8 |
Transcript level: 100.0 | 99.8 |
Locus level: 100.0 | 100.0 |
Thanks!
UHG. I just figured out what I was doing wrong and feel like a fool. Please ignore the above question. If I could delete it I would. What an a** I am...
Hi, I face the same problem with you. Have you find the solution for it? Thank you.
Hey Cy,
I know that I did but to be honest I've forgotten what I did to correct the problem, I wound up having my whole attention switched to an totally absorbing project for the last 4 weeks and looking back this seems like years ago. I do have my scratch notes on this step of the pipeline I was building:
################################################# Stringtie & gffcompare Notes:
The above line will take the .bam from hisat2, and the reference genome in gtf format and produce a variety of output files. When all samples have been done, a sperate process is used to combine all the input gtfs being used here into one merged-gtf. gffcompare can also produce a merged gtf
This command line will take a list of gtf files generated by hisat2 and merge them into one. it will also give file specific outputs used in assesments
The output of the above merger whether done in stringtie or gffcompare should undergo quality assessment in gffcompare by
To see how many transcripts exist in our merged gtf file simply use the following unix command line:
Stringtie should be rerun using the -e -B options and the merged .gtf file as a reference.
This will produce the 5 input files that ballgown needs to conduct differential expression analysis. The will have the file extension .ctab If for some eason you are using a program other than stringtie and have only .gtf and .bam files, you will need to run the bioconductor program 'tablemaker' on them to create thise five ballgown inputs:
<h6>############################ End notes</h6>I believe the solution is in here but can't remember what it was. Hopefully this provides you with what you need. My apologies for not being able to give you the specific fix. Side Note: I had it pointed out to me that ballgown is now considered obsolete so that part of my notes should obviously be ignored. Let me know if this helps, and good luck.