Dear community,
I performed a polyA+ stranded RNA-seq experiment, which I processed through TopHat-Cufflinks and TopHat-Scripture to build a genome-guided transcriptome assembly, as published in 2011 by Cabili et al. I would like to know if there is any tool or program to compute summary statistics of my assemblies and to compute transcript overlaps, in order to determine the degree of correlation between the two methods. As additional information, the output I have from Cufflinks/Scripture is in BED12 or GTF format.
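To make the request concrete, the kind of summary and overlap comparison meant here could be sketched roughly as below. This is only an illustration under stated assumptions: the GTF lines are invented toy records, the function names (`read_gtf_transcripts`, `summarize`, `fraction_overlapping`) are hypothetical, and "overlap" is taken to mean that the genomic spans of two transcripts intersect on the same chromosome and strand, which is much looser than an intron-chain comparison.

```python
import re
from collections import defaultdict


def read_gtf_transcripts(lines):
    """Group GTF exon records by transcript_id.

    Returns {transcript_id: (chrom, strand, [(start, end), ...])}
    using 1-based inclusive coordinates, as in GTF.
    """
    transcripts = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        entry = transcripts.setdefault(match.group(1), (fields[0], fields[6], []))
        entry[2].append((int(fields[3]), int(fields[4])))
    return transcripts


def summarize(transcripts):
    """Basic per-assembly statistics."""
    exon_counts = [len(exons) for _, _, exons in transcripts.values()]
    lengths = [sum(e - s + 1 for s, e in exons) for _, _, exons in transcripts.values()]
    n = len(transcripts)
    return {
        "transcripts": n,
        "multi_exonic": sum(c > 1 for c in exon_counts),
        "mean_exons": sum(exon_counts) / n,
        "mean_length": sum(lengths) / n,
    }


def span(entry):
    """Genomic span (chrom, strand, start, end) of one transcript."""
    chrom, strand, exons = entry
    return chrom, strand, min(s for s, _ in exons), max(e for _, e in exons)


def fraction_overlapping(a, b):
    """Fraction of transcripts in `a` whose span overlaps any transcript in `b`
    on the same chromosome and strand (a coarse, span-level definition)."""
    by_locus = defaultdict(list)
    for entry in b.values():
        chrom, strand, start, end = span(entry)
        by_locus[(chrom, strand)].append((start, end))
    hits = 0
    for entry in a.values():
        chrom, strand, start, end = span(entry)
        if any(start <= e and s <= end for s, e in by_locus[(chrom, strand)]):
            hits += 1
    return hits / len(a)


# Invented toy records standing in for the two assemblies.
cufflinks_lines = [
    'chr1\tCufflinks\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tCufflinks\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tCufflinks\texon\t1000\t1100\t.\t-\t.\tgene_id "G2"; transcript_id "T2";',
]
scripture_lines = [
    'chr1\tScripture\texon\t150\t350\t.\t+\t.\tgene_id "S1"; transcript_id "S1.1";',
]

cufflinks = read_gtf_transcripts(cufflinks_lines)
scripture = read_gtf_transcripts(scripture_lines)
print(summarize(cufflinks))
print(fraction_overlapping(cufflinks, scripture))
```

With real data the lists of lines would be replaced by open file handles. For large assemblies the all-against-all scan within each chromosome and strand would be slow, and an interval index would be a better fit.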
I would also like to ask whether anybody could clarify how the paper calculates the minimal read coverage threshold (3 reads/base), which is used to filter out minimally expressed transcripts. In the paper they say:
(2) Minimal read coverage threshold. We ran Cufflinks with its transcript abundance calculation mode to estimate the read coverage of each transcript across the 24 tissues and cell types. We eliminated transcripts with a maximal coverage below 3 reads per base. This coverage threshold was set by optimizing the sensitivity and specificity of identifying full length vs. partial length transcripts of protein coding genes annotated in RefSeq or non-coding genes annotated in UCSC. To this end, we calculated the number of full length and partial length transcripts identified at each coverage threshold (considering the maximal coverage threshold in which a transcript was identified across all tissues). We used area under the curve (AUC) calculations to determine the optimal threshold for the coding and non-coding sets and took their average as the final threshold.
I marked in bold the part that I do not understand. Some sort of step-by-step algorithm for calculating this threshold would be very helpful, because I find it very difficult to translate this description into Python/R code.
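One possible reading of the quoted procedure, written as a minimal Python sketch, is given below. It rests on several assumptions that the paper text does not confirm: each assembled transcript already carries its maximal coverage across tissues and a full-length/partial label against the reference annotation (which might be derived, for example, from Cuffcompare class codes such as `=` for a complete intron-chain match and `c` for a contained transcript); "sensitivity" is the fraction of full-length transcripts kept at a threshold and "specificity" the fraction of partial transcripts removed; and the "optimal" point is chosen with Youden's J (sensitivity + specificity - 1), which is a substitution of mine because the text only says "AUC calculations". All function names and the toy numbers are hypothetical.

```python
def roc_points(max_cov, is_full, thresholds):
    """For each threshold t, keep transcripts with maximal coverage >= t.

    Returns a list of (t, sensitivity, specificity), where
    sensitivity = full-length transcripts kept / all full-length transcripts
    specificity = partial transcripts removed / all partial transcripts
    """
    n_full = sum(is_full)
    n_partial = len(is_full) - n_full
    if n_full == 0 or n_partial == 0:
        raise ValueError("need at least one full-length and one partial transcript")
    points = []
    for t in thresholds:
        kept_full = sum(1 for c, f in zip(max_cov, is_full) if f and c >= t)
        removed_partial = sum(1 for c, f in zip(max_cov, is_full) if not f and c < t)
        points.append((t, kept_full / n_full, removed_partial / n_partial))
    return points


def auc(points):
    """Trapezoidal area under the ROC curve (x = 1 - specificity, y = sensitivity)."""
    xy = [(0.0, 0.0)] + sorted((1 - spec, sens) for _, sens, spec in points) + [(1.0, 1.0)]
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(xy, xy[1:]))


def best_threshold(points):
    """Threshold maximising Youden's J (an assumed criterion, not from the paper).
    Ties are resolved in favour of the first, i.e. lowest, threshold."""
    return max(points, key=lambda p: p[1] + p[2] - 1)[0]


# Invented toy data: maximal coverage per transcript and full-length labels.
thresholds = [1, 2, 3, 4, 5, 6]
coding_cov = [0.5, 1.0, 2.0, 4.0, 5.0, 8.0]
coding_full = [False, False, False, True, True, True]
noncoding_cov = [0.5, 1.5, 2.5, 3.5, 6.0]
noncoding_full = [False, False, True, True, True]

coding_points = roc_points(coding_cov, coding_full, thresholds)
noncoding_points = roc_points(noncoding_cov, noncoding_full, thresholds)

coding_t = best_threshold(coding_points)
noncoding_t = best_threshold(noncoding_points)

# Average of the two per-set thresholds, as the quoted text describes.
final_threshold = (coding_t + noncoding_t) / 2
print(auc(coding_points), coding_t, noncoding_t, final_threshold)
```

On the toy data this picks 3 for the coding set and 2 for the non-coding set, giving 2.5 as the average. These numbers only show the mechanics and say nothing about the value reported in the paper.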
Thank you very much for your help!