Question

Pacbio: extract fastq from h5 file based on quality filtering

1

10.3 years ago

merodev ▴ 150

Hi, I am new to PacBio and have 2 sets of .h5 files as output from a PacBio run. I am planning to use Celera Assembler, and for that I need FASTQ files generated from the .h5 files.

1) Is there any way to convert .h5 to FASTQ?

2) Is there a specific method to filter PacBio reads based on quality?

3) Should both sets of data be combined before assembly?

Thanks!

pacbio Assembly long reads celera hgap • 14k views

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by merodev ▴ 150

0

The quality values are sufficiently low that reads may be artificially trimmed by Celera. I've found it's best to just fake a FASTQ from the FASTA, with a quality value high enough that the reads are retained. The assembly quality then needs to be improved later using Quiver.
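A minimal sketch of the "fake FASTQ" idea, assuming the reads are already available as FASTA and that the downstream tool expects Phred+33 qualities. The function name `fasta_to_fake_fastq` and the `QCHAR` variable are made up for illustration; `I` corresponds to Q40 under Phred+33 and is only a placeholder value.

```shell
# fasta_to_fake_fastq: read FASTA on stdin, write FASTQ on stdout.
# Every base gets the same placeholder quality character (QCHAR, default "I").
# Wrapped sequence lines are joined into a single line per record.
fasta_to_fake_fastq() {
  awk -v q="${QCHAR:-I}" '
    function flush_record(   qual, i) {
      if (!started) return
      qual = ""
      for (i = 0; i < length(seq); i++) qual = qual q
      print "@" name
      print seq
      print "+"
      print qual
    }
    /^>/ { flush_record(); name = substr($0, 2); seq = ""; started = 1; next }
    { seq = seq $0 }
    END { flush_record() }
  '
}
```

It could be used as `fasta_to_fake_fastq < subreads.fasta > subreads.fastq`, where both file names are hypothetical.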

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by mchaisso ▴ 160

0

Could you please tell me if bash5tools.py also removes adapter sequences?

I have just used it and got subreads from the raw data, but I am not sure whether the subreads still contain adapter sequences.

ADD REPLY • link 8.5 years ago by mehmetgoktay1989 • 0

0

You need to post this as a separate question. Also, please refer to the manual.

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 8.5 years ago by Rohit ★ 1.5k

Ram · Answer 1 · 2015-02-11

5

10.3 years ago

Biomonika (Noolean) 3.2k

1) and 2) Use bash5tools.py

bash5tools.py --minLength 500 --readType subreads --minReadScore 0.8 --outType fastq input.bas.h5

3) It depends on your dataset, but if you just sequenced 2 SMRT cells to get more coverage, then you can merge them prior to assembly.

https://github.com/PacificBiosciences/pbh5tools/blob/master/doc/index.rst
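If the two SMRT cells have each been converted to their own FASTQ file, merging them before assembly can be as simple as concatenation. The sketch below assumes plain four-line FASTQ records (no wrapped sequence lines); the function name `merge_fastq` and the file names in the usage note are hypothetical.

```shell
# merge_fastq OUT IN1 [IN2 ...]: concatenate per-cell FASTQ files into OUT
# and run a basic sanity check on the result.
merge_fastq() {
  out=$1
  shift
  cat "$@" > "$out" || return 1
  # Normalize the wc output to a plain integer (some wc versions pad with spaces).
  lines=$(($(wc -l < "$out")))
  # With four-line records, the total line count must be a multiple of 4.
  if [ $((lines % 4)) -ne 0 ]; then
    echo "warning: $out has $lines lines, not a multiple of 4" >&2
    return 1
  fi
  echo "$((lines / 4)) reads written to $out"
}
```

For example, `merge_fastq merged.fastq cell1.fastq cell2.fastq` would write both cells to `merged.fastq` and report the read count.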

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Biomonika (Noolean) 3.2k

Ram · Answer 2 · 2015-02-11

1

10.3 years ago

thackl ★ 3.0k

Have a look at dextract. It's very quick and lets you set a score cutoff. However, I think it only generates FASTA.

https://dazzlerblog.wordpress.com/2014/03/22/the-dextractor-module-save-disk-space-for-your-pacbio-projects/

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by thackl ★ 3.0k

1

dextract can generate FASTQ if you add the -q parameter. To keep only reads with a minimum Read Quality of 0.80, use -s800 (default: 750).

dextract -q *.bax.h5 -s800 > raw_reads_RQ0.80.fastq

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.1 years ago by rtliu ★ 2.2k

0

How can I combine it with the find command? For example: find All_RawData/Each_Cell_Raw/ -name "*.bax.h5" | xargs -I {} dextract -q {} > How do I get the file name for the output?

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 7.9 years ago by Ric ▴ 440

0

https://www.everythingcli.org/find-exec-vs-find-xargs/
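One possible way to get a per-input output name is to loop over the `find` results and derive the name with `basename`. This is only a sketch: it assumes dextract writes FASTQ to standard output, as in the redirect shown earlier in this thread, which may not hold for every dextract version. The function name `extract_each` and the `EXTRACT` override variable are hypothetical.

```shell
# extract_each INDIR OUTDIR: run an extractor on every *.bax.h5 under INDIR
# and write one FASTQ per input into OUTDIR, named after the input file.
# EXTRACT overrides the extractor command (default: "dextract -q -s800").
extract_each() {
  indir=$1
  outdir=$2
  mkdir -p "$outdir" || return 1
  find "$indir" -type f -name '*.bax.h5' | while IFS= read -r h5; do
    # Strip the directory and the .bax.h5 suffix to get the output stem.
    name=$(basename "$h5" .bax.h5)
    # $EXTRACT is left unquoted on purpose so that its options are split into words.
    ${EXTRACT:-dextract -q -s800} "$h5" > "$outdir/$name.fastq"
  done
}
```

A call such as `extract_each All_RawData/Each_Cell_Raw fastq_out` would then produce one `.fastq` per `.bax.h5` file. Inputs with the same base name in different directories would overwrite each other, and file names containing newlines are not handled.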

ADD REPLY • link 7.8 years ago by h.mon 35k

Ram · Answer 3 · 2015-02-11

0

10.3 years ago

Jean-Karim Heriche 27k

I've written a Perl wrapper for the HDF5 library that you might find useful. It hasn't been tested on PacBio files, though I have no reason to think it wouldn't read them.

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Jean-Karim Heriche 27k