How to identify novel antisense?
10.1 years ago
M K ▴ 660

Hi All,

I am working to identify novel antisense transcripts. So far I have obtained read counts for sense and antisense RNA, as shown below (a very small portion of the data):

gene id             strand     antisense count     sense count
ENSG00000160087     -1         42                  3018
ENSG00000177954     1          33                  9590
ENSG00000237283     -1         1                   1
ENSG00000153207     -1         65                  9185
ENSG00000223764     -1         9                   29
ENSG00000236423     1          243                 498
ENSG00000175137     -1         29                  3727
ENSG00000224235     -1         1                   14
ENSG00000044012     1          1                   1
ENSG00000230812     1          53                  79
ENSG00000134183     -1         484                 9

In some cases the antisense counts are higher than the sense counts and in some they are equal, but in most cases the sense counts are higher than the antisense counts. Is there any technique to identify novel antisense transcripts (that is, loci where the antisense is expressed more highly than the sense)?

RNA-Seq • 4.1k views
10.1 years ago
michael.ante ★ 3.9k

Hi M K

How did you compute the read counts? And what is the strandedness of your library prep?

For instance, for ENSG00000134183 it seems that there is an overlapping antisense transcript (just have a look in the Ensembl browser).

I would use htseq-count (from HTSeq) with the correct --stranded setting and the --samout option. From the new SAM file I would extract all ambiguous and no_feature reads and analyse those more carefully. Additionally, you might run htseq-count with the "wrong" strand option and compare it to your regular result.
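A minimal sketch of the extraction step, in Python since M K already works with Python scripts. It assumes the SAM file written by htseq-count carries each read's assignment in an `XF:Z:` tag, with special categories written as `__no_feature` and `__ambiguous[...]` (older HTSeq releases may omit the leading underscores, so they are stripped here). The function name, read names and gene labels are hypothetical.

```python
import io

def special_reads(sam_lines, wanted=("no_feature", "ambiguous")):
    """Yield (read_name, category) for alignments whose XF tag is in `wanted`."""
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        for tag in fields[11:]:           # optional tags follow the 11 mandatory columns
            if tag.startswith("XF:Z:"):
                value = tag[5:].lstrip("_")          # "__no_feature" -> "no_feature"
                category = value.split("[", 1)[0]    # "ambiguous[geneA+geneB]" -> "ambiguous"
                if category in wanted:
                    yield fields[0], category
                break

# Hypothetical three-read example (not real data)
example = "\n".join([
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t16\tchr1\t100\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1\tXF:Z:ENSG00000160087",
    "r2\t0\tchr1\t200\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1\tXF:Z:__no_feature",
    "r3\t0\tchr1\t300\t50\t4M\t*\t0\t0\tACGT\tIIII\tXF:Z:__ambiguous[geneA+geneB]",
])

hits = list(special_reads(io.StringIO(example)))
for name, category in hits:
    print(name, category)
```

Reads collected this way are the ones worth inspecting by eye, since they fall outside or between annotated features on the expected strand.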

Cheers,

Michael


Dear Michael,

I computed my read counts using Python and R scripts. Here are my library details:

  1. The strand-specific RNAseq libraries were prepared with the NEXTflex™ Directional RNA-Seq Kit, dUTP method.
  2. rRNA was removed with Ribozero Human/Mouse from Epicentre.
  3. Each library was quantitated by qPCR, sequenced on one lane for 101 cycles on a HiSeq2000 using a TruSeq SBS sequencing kit version 3, and analyzed with Casava 1.8.2.
  4. Reads are 100nt in length.

I never used HTSeq-count before.


According to the NEXTflex website, the kit you used has approx. 99% strand specificity, meaning that for every 100 correctly oriented reads, about one is falsely detected as an antisense read. In your example, ENSG00000160087 might be such a case (42 antisense vs. 3018 sense reads, about 1.4%), while ENSG00000236423 has a much higher ratio of antisense reads and may be a good candidate for antisense transcript detection.

Anyway, I would visually inspect your most promising findings with IGV or UCSC.

If you also want to try Madelaine's approach (see below), you need to flag your reads properly during alignment (for instance, with --library-type in TopHat). Please be aware that Cufflinks needs sufficient coverage to produce transcript hypotheses.


Thanks again Michael. I used the library type as follows: tophat --library-type fr-firststrand. What is the best way to visualize the most promising findings? And is there any criterion for choosing a count threshold (I mean, how can I decide from the counts whether these are promising antisense candidates or not)?


I would just take your list, keep as candidates all genes with more than 1% antisense reads, and sort those by read number and/or percentage. You may think about a Poisson model for low-abundance genes. You should use the annotation to filter out genes with known antisense transcripts.
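As a rough sketch of that filter in Python, using the counts posted in the question: the fraction is taken here as antisense / (antisense + sense), which is one possible reading of "1% antisense reads", and the 1% cut-off comes from the kit's quoted strand specificity. The variable names are made up for illustration, and no filtering against known antisense annotation is done.

```python
# (gene id, antisense count, sense count) as posted in the question
counts = [
    ("ENSG00000160087", 42, 3018),
    ("ENSG00000177954", 33, 9590),
    ("ENSG00000237283", 1, 1),
    ("ENSG00000153207", 65, 9185),
    ("ENSG00000223764", 9, 29),
    ("ENSG00000236423", 243, 498),
    ("ENSG00000175137", 29, 3727),
    ("ENSG00000224235", 1, 14),
    ("ENSG00000044012", 1, 1),
    ("ENSG00000230812", 53, 79),
    ("ENSG00000134183", 484, 9),
]

LEAK_RATE = 0.01  # assumption: ~99% strand specificity of the library kit

candidates = []
for gene_id, antisense, sense in counts:
    fraction = antisense / (antisense + sense)
    if fraction > LEAK_RATE:
        candidates.append((gene_id, antisense, sense, fraction))

# Sort by antisense read number, largest first
candidates.sort(key=lambda row: row[1], reverse=True)

for gene_id, antisense, sense, fraction in candidates:
    print(f"{gene_id}\t{antisense}\t{sense}\t{fraction:.3f}")
```

Genes with only one or two reads pass the percentage filter easily, which is why a count-aware model is worth considering for low-abundance genes.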

For visualisation, I would use the IGV browser. You can split the alignment according to the genomic strand (+/-) to get a picture that is easier to inspect.


Hi Michael, I have already excluded all positive pairs (known antisense transcripts, using the annotation file), and I sorted the antisense read counts from smallest to largest. I also calculated the proportion by dividing the antisense count by the sense count, but I am still confused about which percentage or which count a locus should exceed to be considered novel (I mean, what criterion can be used to call a novel transcript). BTW, you mentioned a Poisson model for low-abundance genes; could you explain how to use this model?

In addition, how can I split the alignment according to the genomic strand for use in IGV?

If you would like to take a closer look at one of my complete files with all read counts after removing known antisense transcripts, I can send you one as an Excel file by email.

Thanks again for helping me.


You can treat your antisense reads as rare events. Your probability of getting a false antisense read is 1%, so for each locus you compute the probability of seeing the observed number of antisense reads by chance. If the observed count is well explained by a Poisson distribution with that leakage rate, the locus is not a good candidate for a novel antisense transcript.

You can use samtools view for splitting your file:

samtools view -f 0x010 -bh accepted_hits.bam > reverse.bam
samtools view -F 0x010 -bh accepted_hits.bam > forward.bam
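One caveat, with a small Python sketch: flag 0x10 records the alignment strand of each read, not the strand of the transcript it came from. Assuming the fr-firststrand (dUTP) layout mentioned earlier in the thread, read 1 (or a single-end read) is antisense to the transcript and read 2 is sense to it, so for paired-end data the two mates of one fragment would land in different files with the commands above. Whether this library is paired-end is not stated in the thread; the function name below is hypothetical and only the standard SAM flag bits are used.

```python
def transcript_strand(flag):
    """Infer the transcript strand from a SAM flag, assuming an fr-firststrand library."""
    is_reverse = bool(flag & 0x10)      # read aligned to the reverse strand
    is_second_mate = bool(flag & 0x80)  # read 2 of a pair
    if is_second_mate:
        # read 2 has the same orientation as the transcript
        return "-" if is_reverse else "+"
    # read 1 (or a single-end read) is the reverse complement of the transcript
    return "+" if is_reverse else "-"

# Common flag values: single-end (0, 16) and properly paired mates (99/147, 83/163)
for flag in (0, 16, 99, 147, 83, 163):
    print(flag, transcript_strand(flag))
```

Under this assumption, mates with flags 83 and 163 belong to the same plus-strand fragment, and 99 and 147 to the same minus-strand fragment.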

I would suggest that you first check your results yourself using the IGV browser (for ENSEMBL annotation you need to add the ENSEMBL DAS). You may also check, for instance, the coverage profiles of your antisense candidates.


Hi Michael, Could you please explain how to use IGV to inspect novel antisense?


I think he meant that you can eyeball the coverage profiles of the sense and antisense signal and check for similarity in the pattern. Often, the same transcript shows a similar 'degradation pattern' (or whatever causes this), even in different sequencing runs. (Don't ask me for a citation for this, I am not sure there is one.) Because leaked antisense reads come from the same transcript as the sense reads, they should have a similar coverage pattern, depending on what causes the pattern, whereas a real antisense transcript would in theory have a different one. This is, however, more difficult to quantify.


Hi Michael,

Could you please explain how to use Poisson model with my previous example?


I guess he meant modelling the read counts as coming from a Poisson distribution (see ?ppois in R). That way you can directly calculate p-values for the alternative hypothesis that there are significantly more antisense reads than expected at 1% strand leakage.
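To make that concrete with counts from the question, here is a minimal Python sketch of the upper-tail Poisson probability (the quantity R's ppois gives with lower.tail = FALSE at k - 1). It assumes the expected number of leaked antisense reads at a locus is 0.01 times its sense count, which is only one way to set the Poisson mean; the function name is hypothetical and no multiple-testing correction is applied.

```python
import math

def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    if k <= lam:
        # Tail is large here, so 1 - P(X <= k-1) loses no meaningful precision
        lower = sum(math.exp(i * math.log(lam) - lam - math.lgamma(i + 1))
                    for i in range(k))
        return max(0.0, 1.0 - lower)
    # k > lam: sum the shrinking upper-tail terms directly
    term = math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))
    total, i = 0.0, k
    while True:
        total += term
        i += 1
        term *= lam / i
        if term <= total * 1e-17:
            break
    return min(1.0, total)

LEAK_RATE = 0.01  # assumption: 1% strand leakage

# (gene id, antisense count, sense count) taken from the question
examples = [
    ("ENSG00000160087", 42, 3018),
    ("ENSG00000177954", 33, 9590),
    ("ENSG00000236423", 243, 498),
]

p_values = {}
for gene_id, antisense, sense in examples:
    expected = LEAK_RATE * sense          # Poisson mean under pure leakage
    p_values[gene_id] = poisson_sf(antisense, expected)
    print(gene_id, expected, p_values[gene_id])
```

A small p-value means the antisense count is hard to explain by leakage alone; with many loci tested, the p-values would still need adjusting before any threshold is applied.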

10.1 years ago

You could run cufflinks with no annotation and it will call transcripts for you, then see which match to known genes and which are antisense to known genes.
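The comparison step could look roughly like the Python sketch below: an assembled transcript counts as antisense to a known gene when the two overlap on the same chromosome but lie on opposite strands. The coordinates and identifiers are invented for illustration, and parsing of the actual Cufflinks GTF output is left out.

```python
def antisense_hits(transcripts, genes):
    """Return (transcript_id, gene_id) pairs that overlap on opposite strands.

    Each record is (id, chrom, start, end, strand) with inclusive coordinates.
    """
    hits = []
    for t_id, t_chrom, t_start, t_end, t_strand in transcripts:
        for g_id, g_chrom, g_start, g_end, g_strand in genes:
            same_chrom = t_chrom == g_chrom
            overlaps = t_start <= g_end and g_start <= t_end
            # Transcripts with unknown strand (".") never count as antisense
            opposite = {t_strand, g_strand} == {"+", "-"}
            if same_chrom and overlaps and opposite:
                hits.append((t_id, g_id))
    return hits

# Hypothetical records (not real annotation)
known_genes = [
    ("geneA", "chr1", 1000, 5000, "+"),
    ("geneB", "chr1", 8000, 9000, "-"),
]
assembled = [
    ("TCONS_1", "chr1", 1200, 4800, "+"),   # matches geneA on the same strand
    ("TCONS_2", "chr1", 4500, 6000, "-"),   # overlaps geneA on the opposite strand
    ("TCONS_3", "chr1", 8200, 8700, "."),   # strand undetermined
    ("TCONS_4", "chr2", 1200, 4800, "-"),   # different chromosome
]

pairs = antisense_hits(assembled, known_genes)
print(pairs)
```

The nested loop is fine for a handful of candidates; for a whole genome, an interval-based tool such as bedtools intersect with its opposite-strand option (-S) would be the more practical choice.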
