Question

How does PacBio Iso-Seq annotation compare to other pipelines?

0

4.0 years ago

ilante ▴ 30

I have no experience annotating RNA-seq data (or assembling genomes), but I've spent the last few weeks ploughing through papers and manuals.

Why does it seem so common that the proprietary pipelines of sequencer manufacturers are not used?

E.g. the de novo genome assembly pipeline devised by PacBio seems to be cited less than hifiasm or other open-source programs. Is this mostly because they charge a higher price, so people move to other options, or is it because the field moves so fast that newly written and benchmarked programs make the proprietary pipelines obsolete?

1) Is PacBio Iso-Seq annotation a good choice, or has it been surpassed by some recently published/improved tool that I haven't read about yet?

2) Using full-length RNA sequencing to annotate genomes: what are the best-performing tools (in terms of speed and accuracy) in 2021 for this type of genome annotation?

genome-annotation • 2.2k views

updated 3.8 years ago by tjduncan ▴ 280 • written 4.0 years ago by ilante ▴ 30

score 2 · Answer 1 · 2021-09-10

4.0 years ago

lieven.sterck 15k

1) PacBio Iso-Seq is an excellent choice for annotation, not because of the tools as such but because of the higher information content of long reads (identifying isoforms is less error-prone, for instance).

2) Why are you interested in speed? The first objective in science should be quality rather than speed (OK, there are limits of course :) ). I would think that all the commonly used tools are comparable speed-wise; runtime usually depends much more on the data you want to use than on the technology of the tool or pipeline.

written 4.0 years ago by lieven.sterck 15k

0

2) Time is money: computing costs can vary significantly, and people have to budget their projects according to funding. I assume that the fastest tool will be used most, and thus cited most? E.g. hifiasm replacing Falcon... Also, I assume that anyone publishing a new pipeline will only publish if it is at least as accurate as, if not better than, pipelines already on the market or open source. I guess that boils down to people not publishing before their tool is better than other tools.

Question 2 still stands :)

written 3.9 years ago by ilante ▴ 30

1

Time is indeed a factor (though in my institute we're quite fortunate not to have to care too much about it). Nonetheless, quality still prevails and should be the first criterion. If there are tools that perform similarly, then yes, you can go for the fastest one. On the other hand, would you choose a fast tool over a slower one if the former performs much worse than the latter?

Ah, there your assumption is wrong: there are likely plenty of reasons why people publish something (getting funding, the project is dying anyway, ...). I would not assume that because a tool was published more recently it will work better or faster. Of course, in that paper their tool will perform better and faster, but that is not always true when you look at the general picture.

For the reason I have put forward already, speed is rarely a criterion in genome annotation. Moreover, the time it takes to perform an annotation depends so much on other factors that it is rarely the tool itself that determines the runtime. An annotation done with only intrinsic information (not even trained for a particular species) will run lightning fast. The one I used to use a lot (EuGene, INRA) will annotate a Gb genome in minutes in this mode. However, that does not take into account the weeks or months of optimizing it for a new species. If I use that same tool but throw in some more data (proteins, RNA-seq, ...), it will do the annotation in, let's say, a few hours, but that does not take into account the time needed for protein alignments, RNA-seq mapping, ... (if I add that, it will likely be days to weeks). I hope you understand that the tool itself is much less of a factor, and that many other factors determine the (total) runtime.

written 3.9 years ago by lieven.sterck 15k

score 0 · Answer 2 · 2021-11-19

For the most part, PacBio is working to incorporate the best community-developed tools into their own recommended pipelines, or recommends that researchers use community-developed pipelines for their work. They don't currently charge for use of their pipelines, and you can get them via bioconda or by downloading SMRT Link. I would recommend bioconda, as it is easier to use from the command line.

One of the authors on the hifiasm preprint works at PacBio. That said, HiFi data analysis is constantly being improved as developers spend more time designing programs that make the most of highly accurate, long-read HiFi datasets. There will also be different assemblers/pipelines for each particular application's use case. They are developing rapidly.

Example: see here for different HiFi-based assemblers by genome/size/type.

Specifically for full length RNA sequencing Iso-Seq, take a look at this paper: PacBio Iso-Seq Improves the Rainbow Trout Genome Annotation and Identifies Alternative Splicing Associated With Economically Important Phenotypes.

This shows how powerful (and easy) genome annotation can be with Iso-Seq, and it explains the most recently established workflow well. Basically, you would use PacBio's Iso-Seq3 package for secondary analysis and then several community-developed tools (Cupcake, SQANTI3, Cogent, etc.) for the tertiary annotation steps.
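
To make the secondary-analysis part a bit more concrete, here is a rough dry-run sketch in Python of how the Iso-Seq3 steps are usually chained. It only builds and prints the command lines, it does not run anything. All file names and the `isoseq_commands` helper are made up for illustration, and the subcommands and flags are written as I recall them from the Iso-Seq3 and pbmm2 documentation, so please check each tool's `--help` for your installed version before relying on them.

```python
import shlex


def isoseq_commands(ccs_bam, primers, reference, prefix="sample"):
    """Return the Iso-Seq3 steps as argument lists. Nothing is executed.

    All file names are hypothetical placeholders. Flags are assumptions
    based on the Iso-Seq3 / pbmm2 docs and may differ between versions.
    """
    fl = f"{prefix}.fl.bam"
    # lima names its real output after the primer pair it detects,
    # so this demultiplexed file name is only a placeholder.
    fl_demux = f"{prefix}.fl.primer_5p--primer_3p.bam"
    flnc = f"{prefix}.flnc.bam"
    clustered = f"{prefix}.clustered.bam"
    hq = f"{prefix}.clustered.hq.bam"
    mapped = f"{prefix}.mapped.bam"
    collapsed = f"{prefix}.collapsed.gff"

    return [
        # 1. Remove cDNA primers and orient the reads.
        ["lima", ccs_bam, primers, fl, "--isoseq", "--peek-guess"],
        # 2. Trim poly(A) tails and remove concatemers (full-length non-chimeric reads).
        ["isoseq3", "refine", fl_demux, primers, flnc, "--require-polya"],
        # 3. Cluster reads into transcript consensus sequences.
        ["isoseq3", "cluster", flnc, clustered, "--verbose", "--use-qvs"],
        # 4. Align the high-quality transcripts to the genome.
        ["pbmm2", "align", "--preset", "ISOSEQ", "--sort", reference, hq, mapped],
        # 5. Collapse redundant alignments into unique isoforms.
        ["isoseq3", "collapse", mapped, collapsed],
    ]


commands = isoseq_commands("sample.ccs.bam", "primers.fasta", "genome.fasta")
for cmd in commands:
    # shlex.join (Python 3.8+) quotes arguments so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

The collapsed GFF from the last step is what you would then hand to the community tools (e.g. SQANTI3) for classification and quality control against an existing annotation.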