Question

Can someone explain the difference between clustering and collapsing of reads in Iso-Seq analyses?


4.4 years ago

yvanpapa • 0

Hi, I have been working with Iso-Seq reads produced with PacBio Sequel II.

I have been following the Iso-Seq v3 pipeline, based on the manual and on the GitHub instructions: https://www.pacb.com/wp-content/uploads/SMRT_Tools_Reference_Guide_v90.pdf and https://github.com/PacificBiosciences/IsoSeq/blob/master/isoseq-clustering.md

After the final clustering step, I was able to map the transcripts to a reference genome with pbmm2 and collapse them with the "collapse" command.

Everything is probably fine, but looking at my data in detail, I see that the clustering step produced 127,857 HQ transcripts out of ~1 million FL reads. However, after collapsing this HQ set based on its genome mapping, the total number of isoforms drops to 93,943 (contained in ~16,000 genes).

My question is, what happened to these ~30,000 transcripts after collapsing? I thought that clustering would create a set of unique (i.e. non-redundant) transcripts. But collapsing seems to have further merged some of the reads into one isoform (according to the "group.txt" file).

There is probably something I am misunderstanding here. What is the difference between "clustering" and "collapsing", and what happens to the number of transcripts/isoforms retained during these two steps?

Thank you in advance for your help

Iso-Seq PacBio rna-seq SMRT-Tools transcripts • 2.8k views

updated 3.1 years ago by HARMEET SINGH • 0 • written 4.4 years ago by yvanpapa • 0

score 0 · Answer 1 · 2022-06-03

I found this in a publication (doi: 10.1101/gr.274282.120): "TSSs and TTSs may still have some error in their exact location as the clustering algorithm used by Iso-Seq3 allows for 100 bp of variability at the 5′ end and 30 bp of variability at the 3′ end of the transcript. Transcripts with start or end positions within this range are collapsed into a single isoform, creating a small window of possible TSS and TTS locations."
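To picture the tolerance described in that quote, here is a minimal Python sketch. It is not the actual Iso-Seq or "collapse" implementation. It assumes plus-strand transcripts already mapped to a genome, and it only merges transcripts that share the same splice junctions and whose ends differ by at most 100 bp (5′) and 30 bp (3′). The `Transcript` type, the `collapse_by_ends` function, and the coordinates are all hypothetical, made up for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Transcript(NamedTuple):
    """Hypothetical mapped transcript (plus strand assumed for simplicity)."""
    name: str
    start: int        # genomic position of the 5' end
    end: int          # genomic position of the 3' end
    junctions: tuple  # splice junctions as (donor, acceptor) pairs


# Tolerances taken from the quoted passage; the real tools may differ.
MAX_5P_DIFF = 100
MAX_3P_DIFF = 30


def collapse_by_ends(transcripts):
    """Group transcripts with identical junctions and near-identical ends."""
    by_chain = defaultdict(list)
    for t in transcripts:
        by_chain[t.junctions].append(t)

    isoforms = []
    for members in by_chain.values():
        groups = []
        for t in sorted(members, key=lambda t: (t.start, t.end)):
            for group in groups:
                rep = group[0]  # compare against the group's first member
                if (abs(t.start - rep.start) <= MAX_5P_DIFF
                        and abs(t.end - rep.end) <= MAX_3P_DIFF):
                    group.append(t)
                    break
            else:
                groups.append([t])  # no compatible group found: new isoform
        isoforms.extend(groups)
    return isoforms


# Made-up example: four "HQ transcripts" on one locus.
hq = [
    Transcript("transcript/0", 1000, 5000, ((1200, 3000),)),
    Transcript("transcript/1", 1060, 5020, ((1200, 3000),)),  # ends within tolerance
    Transcript("transcript/2", 1000, 5200, ((1200, 3000),)),  # 3' end too far away
    Transcript("transcript/3", 1000, 5000, ((1200, 3400),)),  # different junction
]

isoforms = collapse_by_ends(hq)
for i, group in enumerate(isoforms, start=1):
    print(f"isoform {i}: {', '.join(t.name for t in group)}")
```

In this toy input, four transcripts become three isoforms, because transcript/0 and transcript/1 differ only slightly at their ends. That is the same kind of reduction as going from 127,857 HQ transcripts to 93,943 isoforms: transcripts that stayed separate after sequence-based clustering can turn out to be the same isoform once their genomic coordinates are compared.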