Hi All,
I am working on identifying novel antisense transcripts. So far I have obtained read counts for sense and antisense RNA, as shown below (a very small portion of the data):
gene id          strand  antisense count  sense count
ENSG00000160087  -1      42               3018
ENSG00000177954   1      33               9590
ENSG00000237283  -1      1                1
ENSG00000153207  -1      65               9185
ENSG00000223764  -1      9                29
ENSG00000236423   1      243              498
ENSG00000175137  -1      29               3727
ENSG00000224235  -1      1                14
ENSG00000044012   1      1                1
ENSG00000230812   1      53               79
ENSG00000134183  -1      484              9
In some cases the antisense count is higher than the sense count, in some they are equal, but in most cases the sense count is higher than the antisense count. Is there any technique to identify novel antisense transcripts (i.e. genes where the antisense strand is more highly expressed than the sense strand)?
Dear Michael,
I computed my read counts using Python and R scripts, and here are my library details:
I never used HTSeq-count before.
According to the NEXTflex website, the kit you used has approx. 99% strand specificity, meaning that out of 100 correctly oriented reads, about one is falsely detected as an antisense read. In your example, ENSG00000160087 might be such a case (42 antisense vs. 3018 sense reads, about 1.4%), while ENSG00000236423 has a much higher ratio of antisense reads and may be a good candidate for antisense transcript detection.
Anyway, I would visually inspect your most promising findings with IGV or UCSC.
If you also want to try Madelaine's approach (see below), you need to flag your reads properly during alignment (for instance, with
--library-type
in tophat). Please be aware that you need sufficient coverage for cufflinks to find transcript hypotheses.
Thanks again Michael, I used the library type as follows:
tophat --library-type fr-firststrand
What is the best way to visualize the most promising findings? Is there any criterion for choosing a count threshold (I mean, how can I decide from the counts whether a gene is a promising antisense candidate or not)?
I would just take your list, treat all genes with more than 1% antisense reads as candidates, and sort those by read number and/or percentage. You may think about a Poisson model for low-abundance genes. You should use the annotation to filter out genes with known antisense transcripts.
For visualisation, I would use the IGV browser. You can split the alignment according to the genomic strand (+/-) to get a picture that is easier to inspect.
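The 1% filter and sorting suggested above could be sketched roughly as follows. This is only a minimal illustration using the counts from the example table. The function names are hypothetical, and the antisense fraction is assumed here to be antisense / (antisense + sense), which differs slightly from the antisense / sense ratio mentioned elsewhere in the thread.

```python
# Counts copied from the example table: gene id -> (antisense, sense)
counts = {
    "ENSG00000160087": (42, 3018),
    "ENSG00000177954": (33, 9590),
    "ENSG00000237283": (1, 1),
    "ENSG00000153207": (65, 9185),
    "ENSG00000223764": (9, 29),
    "ENSG00000236423": (243, 498),
    "ENSG00000175137": (29, 3727),
    "ENSG00000224235": (1, 14),
    "ENSG00000044012": (1, 1),
    "ENSG00000230812": (53, 79),
    "ENSG00000134183": (484, 9),
}

LEAKAGE = 0.01  # assumed strand leakage for a kit with ~99% strand specificity


def antisense_fraction(antisense, sense):
    """Fraction of all reads at a locus that map antisense (assumed definition)."""
    total = antisense + sense
    return antisense / total if total else 0.0


def rank_candidates(counts, threshold=LEAKAGE):
    """Keep genes above the leakage threshold, sorted by fraction, then by count."""
    rows = [(gene, a, s, antisense_fraction(a, s)) for gene, (a, s) in counts.items()]
    kept = [row for row in rows if row[3] > threshold]
    return sorted(kept, key=lambda row: (row[3], row[1]), reverse=True)


ranked = rank_candidates(counts)
for gene, a, s, frac in ranked:
    print(f"{gene}\tantisense={a}\tsense={s}\tfraction={frac:.3f}")
```

Note that genes with only one or two reads in total pass this filter easily, which is why a count-aware model is suggested for low-abundance genes.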
Hi Michael, I have already excluded all positive pairs (known antisense, using the annotation file), and I sorted the antisense read counts from smallest to largest. I also computed the proportion by dividing the antisense count by the sense count, but I am still confused about which percentage or count a gene must exceed to be considered novel (I mean, what criterion can be used to call a novel transcript). BTW, you mentioned a Poisson model for low-abundance genes; do you have any advice on how to use this model?
In addition, how can I split the alignment according to the genomic strand using IGV?
If you would like to take a closer look at one of my complete files with all read counts after removing known antisense transcripts, I can send you one as an Excel file by email.
Thanks again for helping me.
You may treat your antisense reads as rare events. The probability of a false antisense read is 1%, so for each locus you can compute the probability of seeing the observed number of antisense reads by chance. If the observed count is well explained by a Poisson distribution with that rate, the locus is not a good candidate for a novel antisense transcript.
You can use samtools view for splitting your file:
I would suggest that you first check your results yourself using the IGV browser (for ENSEMBL annotation you need to add the ENSEMBL DAS). You may also check, for instance, the coverage profiles of your antisense candidates.
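For the strand split mentioned above, the relevant information is in the SAM FLAG field. The following is a minimal sketch, not the command from this thread, of how a FLAG value could be mapped to the strand of the originating transcript. It assumes mapped reads from an fr-firststrand library, where mate 1 (or a single-end read) is the reverse complement of the transcript. The function name is hypothetical.

```python
# Bit values defined in the SAM format specification
READ_PAIRED = 0x1
READ_REVERSE = 0x10
SECOND_IN_PAIR = 0x80


def transcript_strand(flag):
    """Return '+' or '-' for the transcript strand, assuming fr-firststrand."""
    reverse = bool(flag & READ_REVERSE)
    if flag & READ_PAIRED and flag & SECOND_IN_PAIR:
        # Mate 2 has the same orientation as the transcript.
        return "-" if reverse else "+"
    # Mate 1 and single-end reads are the reverse complement of the transcript.
    return "+" if reverse else "-"


# Common FLAG values for properly paired reads, plus single-end forward/reverse
for flag in (99, 147, 83, 163, 0, 16):
    print(flag, transcript_strand(flag))
```

The same bit tests can be applied on the command line with the -f (require) and -F (exclude) options of samtools view, writing one file per strand for loading into IGV.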
Hi Michael, could you please explain how to use IGV to inspect novel antisense transcripts?
I think he meant that you can eyeball the coverage profiles of the two transcripts and check for similarity in the pattern. Often, the same transcript shows a similar 'degradation pattern' (or whatever causes this), even across different sequencing runs. (Don't ask me for a citation for this, I am not sure there is one.) As leaked antisense reads come from the same transcript as the sense reads, they should have a similar coverage pattern, depending on what causes the pattern, while a real antisense transcript would theoretically have a different degradation pattern. This is, however, more difficult to quantify.
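One rough way to put a number on the similarity described above is to correlate the sense and antisense coverage along the gene. The sketch below uses a plain Pearson correlation on binned coverage. The coverage vectors are made-up illustrative values, not data from this thread, and a correlation is only one possible similarity measure.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long coverage vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # flat profile: correlation is undefined, report 0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical binned coverage along one gene (illustrative values only)
sense = [120, 150, 200, 180, 90, 60]
leak_like = [1, 2, 2, 2, 1, 1]          # follows the sense profile
independent = [0, 0, 5, 30, 45, 40]     # different shape

print("leak-like  :", round(pearson(sense, leak_like), 3))
print("independent:", round(pearson(sense, independent), 3))
```

A high correlation with the sense profile would be consistent with strand leakage, while a low or negative one hints at an independent transcript. With few antisense reads the estimate is noisy, so visual inspection remains useful.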
Hi Michael,
Could you please explain how to use the Poisson model with my previous example?
I guess he meant to model the read counts as coming from a Poisson distribution (see
?ppois
). That way you can directly calculate p-values for the alternative hypothesis that there are significantly more antisense reads than expected at 1% strand leakage.
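As a minimal sketch of that test, the upper tail probability P(X >= k) can be computed for each gene, where k is the observed antisense count. It is assumed here that under the null hypothesis all reads at a locus come from the sense transcript, so the expected number of leaked reads is lambda = 0.01 * (antisense + sense). That choice of lambda and the function names are assumptions, not something stated in the thread. In R the same tail is ppois(k - 1, lambda, lower.tail = FALSE).

```python
import math

LEAKAGE = 0.01  # assumed strand leakage rate


def poisson_sf(k, lam):
    """Upper tail P(X >= k) for X ~ Poisson(lam), summed term by term."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    total = 0.0
    i = k
    while True:
        # pmf(i) computed in log space to avoid overflow in lam**i and i!
        term = math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        total += term
        # Past the mode the terms shrink; stop once they no longer matter.
        if i > lam and term <= total * 1e-16:
            break
        i += 1
    return min(total, 1.0)


def leakage_pvalue(antisense, sense, leakage=LEAKAGE):
    """P-value for seeing at least `antisense` reads from leakage alone."""
    lam = leakage * (antisense + sense)
    return poisson_sf(antisense, lam)


# Three genes from the example table: (antisense, sense)
for gene, (a, s) in {
    "ENSG00000160087": (42, 3018),
    "ENSG00000177954": (33, 9590),
    "ENSG00000236423": (243, 498),
}.items():
    print(gene, leakage_pvalue(a, s))
```

Small p-values mark loci where leakage alone is an unlikely explanation. For extreme cases the value can underflow to 0.0 in floating point.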