Question

Determining Transcription Start Sites from multiple RNAseq bioreps


9.3 years ago

Daniel ★ 4.0k

Hi. I am trying to conceptualise how I could do this and whether it makes sense, and I am looking for advice.

I have 4 RNAseq bioreps from a batch of samples, which correspond to some DNA work that I am doing. I intend to use the RNAseq to improve my TSS positions, by using the correct splice variants and getting generally better accuracy than the publicly downloadable GTF file provides.

This is fine for the independent bioreps (I have done this using cufflinks for each sample and have 4 gtf files), but I am wondering whether I can use the four together to build a consensus for greater accuracy.

Question:

Would it be better to just merge the 4 BAMs and call TSSs from the whole dataset, or is there a way to interpret them together that would give better accuracy in the final GTF annotation? Is there a standard practice for this step?

What I can't understand is how differences between the bioreps would be dealt with. I imagine minor differences could end up giving 4 models for each gene, which wouldn't help.

Thanks

TSS-prediction RNA-Seq • 3.4k views


Ram · Accepted Answer · 2015-08-06

Are these 5'RNA-Seq datasets? I would do it both ways:

1. Generate four TSS lists, one from each replicate.
2. Merge the replicates and generate a single list.

Also, try ranking each TSS locus with a parameter (e.g. how much enrichment you find and in which replicates), then generate high, medium and low confidence lists of TSSs and cross-compare them. You can also use additional scores such as the presence of conserved TATA boxes, CpG islands and GC strength.
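As a rough illustration of the ranking idea, here is a minimal Python sketch that pools per-replicate TSS calls, clusters nearby positions, and grades each cluster by how many replicates support it. The function name `consensus_tss`, the 10 nt clustering window and the tier cut-offs are all assumptions chosen for the example, not a standard; tune them to your data.

```python
from collections import defaultdict

def consensus_tss(replicates, window=10):
    """Cluster TSS calls from several replicates and grade each cluster.

    replicates: one list per replicate of (chrom, strand, position) tuples.
    window:     a call within this many nt of the previous call on the same
                chrom/strand joins the same cluster (assumed value).
    Returns (chrom, strand, representative_position, n_supporting_reps, tier).
    """
    # Pool all calls, remembering which replicate each one came from.
    pooled = defaultdict(list)
    for rep_idx, calls in enumerate(replicates):
        for chrom, strand, pos in calls:
            pooled[(chrom, strand)].append((pos, rep_idx))

    n_reps = len(replicates)
    results = []
    for (chrom, strand), calls in sorted(pooled.items()):
        calls.sort()
        cluster = [calls[0]]
        # The trailing None forces the last cluster to be flushed.
        for call in calls[1:] + [None]:
            if call is not None and call[0] - cluster[-1][0] <= window:
                cluster.append(call)
                continue
            positions = sorted(p for p, _ in cluster)
            support = len({r for _, r in cluster})
            # Hypothetical tiers: all reps = high, at least half = medium.
            if support == n_reps:
                tier = "high"
            elif support * 2 >= n_reps:
                tier = "medium"
            else:
                tier = "low"
            representative = positions[len(positions) // 2]  # middle call
            results.append((chrom, strand, representative, support, tier))
            if call is not None:
                cluster = [call]
    return results

# Toy input: four replicates, made-up coordinates.
reps = [
    [("chr1", "+", 100), ("chr1", "+", 500)],
    [("chr1", "+", 102)],
    [("chr1", "+", 98), ("chr1", "+", 505)],
    [("chr1", "+", 101)],
]
for row in consensus_tss(reps):
    print(row)
```

The per-replicate lists and the merged list can then be cross-compared: loci that appear only in the merged call set but have "low" support here are the ones to treat with most caution.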

Copying an excerpt from the HOMER suite documentation:

http://homer.salk.edu/homer/ngs/tss/index.html

Introduction to Transcriptional Initiation at Metazoan Promoters

To understand the analysis of 5'RNA data, it is worth taking a moment to highlight that there are multiple 'types' of promoters in living organisms. First of all, there are different RNA polymerases including RNA polymerase I (rRNA), II (mRNA, lncRNA, miRNA), III (tRNA), IV (plant specific), viral polymerases, etc., and each polymerase has different mechanisms of transcriptional initiation that may vary between distantly related organisms. Also be aware that different RNA polymerases may generate RNAs with different covalent modifications, which may or may not be present in your 5' RNA sequencing, depending on how the experiment was performed. By and large most researchers are interested in RNA polymerase II transcripts (mRNA), and as a result most 5'RNA methods focus on the identification of RNAs containing a 7-methylguanosine cap protecting their 5' end.

With respect to RNA polymerase II initiation sites, there are two generally recognized 'types' of TSS. Sharp (or focused) TSS initiate transcription from a single nucleotide (or +/- 2 nt) and resemble the promoters found in molecular biology textbooks. They often contain well-defined core promoter elements such as the TATA box and usually initiate transcription from a purine preceded by a pyrimidine (PyPu, e.g. CA, with the A being the initiating nucleotide).

The other, more common type is the broad (or dispersed) TSS. These promoters initiate transcription from several different sites within a large area (often 50-100 nt in size). They usually lack core promoter elements (no TATA box), but each individual initiation site DOES normally still initiate on a purine preceded by a pyrimidine (PyPu).
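(Not part of the HOMER excerpt.) The sharp/broad distinction and the PyPu rule can be sketched in a few lines of Python. This is only an illustration: the helper names `tss_shape` and `is_pypu`, and the rule "at least 80% of 5' ends within +/- 2 nt of the dominant site means sharp", are assumptions for the example rather than HOMER's actual criteria.

```python
def tss_shape(counts, flank=2, min_fraction=0.8):
    """Label a TSS cluster 'sharp' or 'broad'.

    counts: dict mapping genomic position -> number of 5' read ends.
    A cluster is called sharp when at least `min_fraction` of its reads
    fall within `flank` nt of the most used position (assumed rule).
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cluster has no reads")
    dominant = max(counts, key=counts.get)
    near = sum(n for pos, n in counts.items() if abs(pos - dominant) <= flank)
    return "sharp" if near / total >= min_fraction else "broad"

def is_pypu(dinucleotide):
    """True if the -1/+1 dinucleotide, read 5'->3' on the transcribed
    strand, is a pyrimidine (C/T) followed by a purine (A/G)."""
    if len(dinucleotide) != 2:
        raise ValueError("expected exactly two bases")
    d = dinucleotide.upper()
    return d[0] in "CT" and d[1] in "AG"

# Toy clusters with made-up counts.
print(tss_shape({100: 50, 101: 3}))                  # reads concentrated at one site
print(tss_shape({100: 5, 120: 5, 140: 5, 160: 5}))   # reads spread over ~60 nt
print(is_pypu("CA"), is_pypu("GA"))
```

A PyPu check like this could serve as one of the extra scores when ranking candidate TSSs, alongside replicate support.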

False TSS - be careful of artifacts

A quick note about artifacts in 5'RNA-Seq data: most 5' RNA-Seq methodologies work by enriching for 5' cap-protected RNA, which means that most of the sequence data describes 5' RNA ends, but a fraction of it may be noise from random RNA-Seq fragments (again, a lot like ChIP-Seq). In particular, highly expressed RNAs may yield "5'RNA-Seq" reads along the whole body of the gene, giving the appearance of alternative TSS which are likely false positives. Because of this, I would highly recommend using traditional RNA-Seq as a "background" when analyzing 5' RNA-Seq data. This approach (described below) may remove several real TSS from the results, but it is also likely to remove a large number of false positives and clean up your analysis.
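(Not part of the HOMER excerpt.) To make the "background" idea concrete, here is a minimal Python sketch that keeps a candidate TSS only if its 5' read signal is enriched over conventional RNA-Seq coverage at the same locus, after scaling both libraries to reads per million. The function name `filter_by_background`, the 2-fold threshold and the pseudocount are assumptions for illustration; this is not how HOMER itself implements the comparison.

```python
def filter_by_background(candidates, tss_total, bg_total,
                         min_fold=2.0, pseudocount=1.0):
    """Keep candidate TSSs enriched over an RNA-Seq background.

    candidates: list of (tss_id, tss_reads, background_reads), where both
                counts were taken over the same window around the site.
    tss_total:  total mapped reads in the 5' RNA-Seq library.
    bg_total:   total mapped reads in the conventional RNA-Seq library.
    Returns a list of (tss_id, fold_enrichment) for sites that pass.
    """
    kept = []
    for tss_id, tss_reads, bg_reads in candidates:
        # Normalise each library to reads per million; the pseudocount
        # (assumed value) avoids division by zero at silent loci.
        tss_rpm = (tss_reads + pseudocount) / tss_total * 1e6
        bg_rpm = (bg_reads + pseudocount) / bg_total * 1e6
        fold = tss_rpm / bg_rpm
        if fold >= min_fold:
            kept.append((tss_id, fold))
    return kept

# Toy example with made-up counts: a promoter peak and a gene-body "peak".
candidates = [("promoter", 99, 4), ("gene_body", 9, 19)]
kept = filter_by_background(candidates, tss_total=1_000_000, bg_total=1_000_000)
print([tss_id for tss_id, _ in kept])  # only the promoter site passes
```

As the text notes, a filter like this trades sensitivity for specificity: real TSSs inside highly expressed genes can be lost along with the false positives.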

Trans-splicing of transcripts (where the 5' end of one transcript is added to the front of another) and recapping (where a transcript is cleaved and a new cap placed on the truncated product) are two phenomena you may want to think carefully about when analysing 5' RNA-Seq data. Trans-splicing will create false negatives and recapping will create false positives. In certain organisms, such as C. elegans, trans-splicing is very common, making 5'GRO-Seq a much better assay for identifying TSS than 5'RNA-Seq (i.e. measuring the 5' RNA ends before they have a chance to trans-splice). In other organisms (e.g. mouse, human, fly, etc.) it appears to be rare. The degree to which transcripts are 'recapped' is a matter of debate, because recapped products can be hard to distinguish from true alternative TSS or from noise in the 5' RNA-Seq assay.