How to perform quality checks and assembly with PacBio Sequel data?
6.4 years ago

Hi

I am working with PacBio Sequel data for a few bacterial strains. I received three files from the sequencing facility:

  • sample.bam

  • sample.bam.pbi

  • sample.subreadset.xml

Question#1

How can I assess the quality of the data? Since this is Sequel data, the Phred scores are all set to a placeholder exclamation mark (Phred score = 0), so the usual quality-score-based QC is not informative. There should still be some way to do QC. PacBio suggests using SMRT Link for this, and I can see from the user guide (page 28) that the .subreadset.xml file contains information about the Sequel sequence data. From page 110, I believe I have all the files SMRT Link requires, but I am not sure how to import them into SMRT Link and assess the quality. I have already installed it on a Windows machine and I am able to log in.

Page 109 of the same user guide says that another file, .sts.xml, contains summary statistics about the collection/cell and its post-processing. I haven't received that file. Is it required for QC?

  • Do I have the files required for QC using SMRTlink?

  • Any alternative way to perform QC?

Update: the Sequel machine is not at our site. Can I still perform QC with just the three files mentioned above?

Question#2

I am trying to understand the PacBio sequencing chemistry. From the image below, it is clear that what I have are subreads (sequenced inserts without the green adapters) and not CCS (circular consensus sequence) reads. I am trying to assemble the bacterial genome with the Canu assembler.

[Image: SMRT Technology SMRTbell template. Image source: PacBio]

What should I use?

  1. The FASTQ file converted from the subreads.bam file? Most of the blog posts I have seen suggest that.

OR

  2. First generate CCS reads (with SMRT Link?) and then use those for de novo assembly? My reasoning is that the PacBio error rate is high and CCS would help compensate for that (?). As I understand it, CCS reads are the result of consensus base calling over subreads that all come from the same template.

What is the expected insert size in your libraries? One reason Sequel is popular is that you can get much longer reads (tens of kb), so with the right kind of library CCS may not come into play, since the polymerase may only get through a long insert once or twice. At least that is what I would think.
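
One rough way to check whether CCS is even an option is to count how many subreads each ZMW produced. The sketch below assumes the usual PacBio subread naming, movie/zmw/start_end; the read names in it are made up for illustration, and the threshold of 3 subreads is an arbitrary choice, not a PacBio recommendation. The subread count is only a proxy for the number of passes, since the first and last subreads are often partial.

```python
from collections import Counter


def subreads_per_zmw(read_names):
    """Count subreads per ZMW from names shaped like 'movie/zmw/start_end'."""
    counts = Counter()
    for name in read_names:
        parts = name.split("/")
        if len(parts) < 3:
            continue  # not a subread-style name, skip it
        movie, zmw = parts[0], parts[1]
        counts[(movie, zmw)] += 1
    return counts


def fraction_with_min_subreads(counts, min_subreads=3):
    """Fraction of ZMWs that have at least min_subreads subreads."""
    if not counts:
        return 0.0
    enough = sum(1 for n in counts.values() if n >= min_subreads)
    return enough / len(counts)


# Hypothetical read names, for illustration only.
names = [
    "m54000_180101_000000/4194370/0_5000",
    "m54000_180101_000000/4194370/5045_10020",
    "m54000_180101_000000/4194370/10065_15010",
    "m54000_180101_000000/4194371/0_22000",
]
counts = subreads_per_zmw(names)
print(fraction_with_min_subreads(counts))
```

If most ZMWs show a single long subread, the library is behaving like a long-insert library and the subreads themselves are what you would feed to the assembler.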


Thankfully I had bookmarked this tutorial, which you may find useful: Polish PacBio assembly with latest PacBio tools : an affordable solution for everyone


While good, that tutorial is for the RS II, which used different data formats than the Sequel. It may only be partially useful.


Just wanted to confirm that you are correct: that tutorial is not helping much in my case.

6.4 years ago
harish ▴ 470

If you are planning to use Canu, you'll need to extract the subreads from the BAM file into FASTQ/FASTA. For this you can use bam2fastq or bam2fasta from SMRT Link.

After this you can let Canu correct and trim the reads and then assemble them; refer to Canu's documentation for the details. Alternatively, you can correct the reads with LoRDEC first and then go straight to Canu assembly.

If you are planning to use HGAP4 or similar from the SMRT Link portal, refer to the SMRT Link documentation; the section you need is "Data Management".


Hi Harish

> If you are planning to use Canu, you'll need to extract the subreads from the BAM file into FASTQ/FASTA. For this you can use bam2fastq or bam2fasta from SMRT Link.

For that purpose, I think any general-purpose BAM-to-FASTQ converter will work. I used the latest version of samtools and that worked fine. It was interesting to find that Canu can produce different assemblies depending on whether it is given the FASTQ file or a FASTA file converted from that FASTQ.
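
For the FASTQ-to-FASTA step, a minimal sketch is below. It assumes plain 4-line FASTQ records (no wrapped sequence lines) and simply drops the quality line, which carries no information for Sequel subreads anyway. The function name and the demo records are made up for illustration.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Convert 4-line FASTQ records to FASTA; return the number of records written."""
    written = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError("expected a FASTQ header, got: " + header.strip())
        seq = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality line, ignored
        fasta_handle.write(">" + header[1:].rstrip("\n") + "\n" + seq + "\n")
        written += 1
    return written


# Made-up records; the '!' qualities mimic the placeholder Phred 0 in Sequel data.
demo = io.StringIO("@m1/1/0_4\nACGT\n+\n!!!!\n@m1/2/0_3\nGGA\n+\n!!!\n")
out = io.StringIO()
written = fastq_to_fasta(demo, out)
print(written)
print(out.getvalue(), end="")
```

With real files you would pass open file handles instead of the StringIO objects.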

> After this you can let Canu correct and trim the reads and then assemble them; refer to Canu's documentation for the details. Alternatively, you can correct the reads with LoRDEC first and then go straight to Canu assembly.

By default, Canu performs all three steps, i.e. error correction, trimming and assembly. Thanks for suggesting LoRDEC; I'll have a look.

> If you are planning to use HGAP4 or similar from the SMRT Link portal, refer to the SMRT Link documentation; the section you need is "Data Management".

My primary concern is QC. As asked in the original post above: are the files I have sufficient for use in SMRT Link? If yes, how do I perform the QC?

Thanks

Vijay


Most of the QC information is provided in the *sts.xml in that case. You can convert that XML to CSV or JSON. For that I think you can use this: https://github.com/jfalkner/metrics-sts

Later, to visualize or parse the CSV, you can use https://github.com/PacificBiosciences/stsPlots

You can also check Sequana: http://sequana.readthedocs.io/en/master/installation.html

I mostly use metrics-sts, as our provider generally provides the CSV files as well.

And IIRC, you can load the BAM directly into SMRT Link, or you might need the *subreadset.xml file as well.
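
If you just want a quick look at what is inside an XML file like this without installing anything, a generic flattening with the Python standard library may be enough. The sketch below makes no assumption about the actual sts.xml schema: it walks any XML and writes every leaf element with text as a path/value row. The element names in the demo string are made up and are not real sts.xml fields.

```python
import csv
import io
import xml.etree.ElementTree as ET


def flatten_xml(xml_text):
    """Return (path, value) pairs for every leaf element that has text."""
    rows = []

    def walk(elem, path):
        tag = elem.tag.split("}")[-1]  # drop the '{namespace}' prefix if present
        here = path + "/" + tag if path else tag
        children = list(elem)
        text = (elem.text or "").strip()
        if not children and text:
            rows.append((here, text))
        for child in children:
            walk(child, here)

    walk(ET.fromstring(xml_text), "")
    return rows


def rows_to_csv(rows):
    """Serialize (path, value) pairs as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["metric", "value"])
    writer.writerows(rows)
    return buf.getvalue()


# Made-up XML for illustration; these are NOT real sts.xml element names.
demo_xml = (
    '<Stats xmlns="urn:example">'
    "<NumReads>10</NumReads>"
    "<Length><Mean>8000</Mean></Length>"
    "</Stats>"
)
rows = flatten_xml(demo_xml)
print(rows_to_csv(rows), end="")
```

This only dumps values; it does not interpret them, so for plots the dedicated tools above are still the better route.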

Can you post a snapshot of the files received from your provider?

Canu can also simply correct the reads, which is why I had suggested it initially. You just have to use Canu's -correct switch.
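
As a sketch of what the two invocations look like, the helper below only builds the command line and prints it. It assumes a Canu 1.x style interface with -pacbio-raw for uncorrected reads (newer releases use -pacbio instead, so check canu's own help for your version). The file names, the prefix and the 5m genome size are placeholders, not values from this thread.

```python
import shlex


def canu_command(reads, prefix, out_dir, genome_size, correct_only=False):
    """Build a Canu command line as a list of arguments (nothing is executed)."""
    cmd = ["canu"]
    if correct_only:
        cmd.append("-correct")  # stop after the read-correction stage
    cmd += [
        "-p", prefix,                  # prefix for output file names
        "-d", out_dir,                 # output directory
        "genomeSize=" + genome_size,   # rough estimate, e.g. a few Mb for a bacterium
        "-pacbio-raw", reads,          # assumption: Canu 1.x flag for raw PacBio reads
    ]
    return cmd


# Placeholder inputs for illustration.
full_cmd = canu_command("sample.subreads.fastq", "sample", "sample-canu", "5m")
correct_cmd = canu_command(
    "sample.subreads.fastq", "sample", "sample-canu", "5m", correct_only=True
)
print(" ".join(shlex.quote(arg) for arg in full_cmd))
print(" ".join(shlex.quote(arg) for arg in correct_cmd))
# To actually run it: subprocess.run(full_cmd, check=True)
```

Without -correct, Canu runs correction, trimming and assembly in one go, as noted above.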


Hi Harish

Thank you for the response. I should definitely try that once.


Hi, Did you ever figure out how to do the QC? I have the exact same files and am hoping to do this... interested to know what you did! Thanks, Annabelle
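
One QC that needs only the reads, and not SMRT Link or the sts.xml, is a read-length summary, since the base qualities in Sequel subreads are placeholders. Below is a minimal sketch, assuming the BAM has already been converted to plain 4-line FASTQ (for example with samtools, as mentioned above). The function names and the demo records are made up for illustration.

```python
import io


def fastq_lengths(handle):
    """Yield the sequence length of each record in a 4-line FASTQ stream."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of each record is the sequence
            yield len(line.rstrip("\n"))


def n50(lengths):
    """Length L such that reads of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def length_summary(lengths):
    """Basic read-length statistics for a list of lengths."""
    lengths = list(lengths)
    if not lengths:
        return {"reads": 0, "bases": 0, "mean": 0.0, "longest": 0, "n50": 0}
    return {
        "reads": len(lengths),
        "bases": sum(lengths),
        "mean": sum(lengths) / len(lengths),
        "longest": max(lengths),
        "n50": n50(lengths),
    }


# Made-up records for illustration.
demo = io.StringIO("@r1\nACGT\n+\n!!!!\n@r2\nACGTAC\n+\n!!!!!!\n")
summary = length_summary(fastq_lengths(demo))
print(summary)
```

Comparing total bases against the expected genome size gives a rough coverage estimate, which is usually the first thing to check before running an assembler.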
