Hi, I have been working with Iso-Seq reads produced with PacBio Sequel II.
I have been following the IsoSeq v3 pipeline, based on the SMRT Tools Reference Guide (https://www.pacb.com/wp-content/uploads/SMRT_Tools_Reference_Guide_v90.pdf) and on the GitHub instructions (https://github.com/PacificBiosciences/IsoSeq/blob/master/isoseq-clustering.md).
After the final clustering step, I was able to map the transcripts to a reference genome with pbmm2 and collapse them with the "collapse" command.
Everything is probably fine, but looking at my data in detail, I see that the clustering step produced 127,857 HQ transcripts out of ~1 million FL reads. However, after collapsing this HQ set based on its genome mapping, the total number of isoforms drops to 93,943 (contained in ~16,000 genes).
My question is: what happened to the ~34,000 transcripts (127,857 - 93,943 = 33,914) that are gone after collapsing? I thought that clustering would create a set of unique (i.e. non-redundant) transcripts. But collapsing seems to have further merged some of the HQ transcripts into a single isoform (according to the "group.txt" file).
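In case it helps anyone checking the same thing, here is a minimal sketch of how the merging could be tallied from the group file. It assumes each line of "group.txt" is tab-separated, with the collapsed isoform ID in the first column and a comma-separated list of the merged HQ transcript IDs in the second; that layout, the example IDs, and the function names `parse_group_lines` and `summarize` are assumptions for illustration, not something taken from the documentation.

```python
from collections import Counter


def parse_group_lines(lines):
    """Map each collapsed isoform ID to the list of input transcript IDs merged into it.

    Assumed line layout: <isoform_id> TAB <transcript_id>,<transcript_id>,...
    """
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        isoform_id, members = line.split("\t", 1)
        groups[isoform_id] = [m for m in members.split(",") if m]
    return groups


def summarize(groups):
    """Return (number of isoforms, number of input transcripts, histogram of group sizes)."""
    n_isoforms = len(groups)
    n_inputs = sum(len(members) for members in groups.values())
    size_hist = Counter(len(members) for members in groups.values())
    return n_isoforms, n_inputs, size_hist


# Made-up example lines standing in for a real group.txt file.
example = [
    "PB.1.1\ttranscript/0,transcript/7",
    "PB.1.2\ttranscript/3",
    "PB.2.1\ttranscript/4,transcript/5,transcript/9",
]

groups = parse_group_lines(example)
n_isoforms, n_inputs, size_hist = summarize(groups)
print("isoforms:", n_isoforms)
print("input transcripts accounted for:", n_inputs)
for size in sorted(size_hist):
    print(f"isoforms built from {size} transcript(s): {size_hist[size]}")
```

For a real file, the lines could be read with `open("group.txt")` and passed to `parse_group_lines`. If the number of input transcripts accounted for is lower than the number of HQ transcripts, the remainder presumably never made it into any isoform (for example because they were unmapped or filtered), rather than being merged.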
There is probably something I am misunderstanding here. What is the difference between "clustering" and "collapsing", and what happens to the number of transcripts/isoforms retained during these two steps?
Thank you in advance for your help.