PacBio: extract FASTQ from .h5 files based on quality filtering
3
1
9.8 years ago
merodev ▴ 150

Hi, I am new to PacBio and have 2 sets of .h5 files as PacBio output. I am planning to use Celera Assembler, and for that I need FASTQ files generated from the .h5 files.

1) Is there any way to convert .h5 to FASTQ?

2) Is there any specific method to filter PacBio reads based on quality?

3) Should both sets of data be combined before assembly?

Thanks!

pacbio Assembly long reads celera hgap • 14k views
0

The quality values are low enough that Celera Assembler may trim reads artificially. I've found it's best to just fake a FASTQ from the FASTA, with a quality value high enough that the reads are retained. The assembly quality then needs to be improved later using Quiver.
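As a rough illustration of the "fake FASTQ" idea, the sketch below assigns the same quality character to every base of each FASTA record. The function name fasta_to_fake_fastq and the default character 'I' (Phred 40 in the Phred+33 encoding) are assumptions made for this example and are not part of any PacBio or Celera tool.

```shell
#!/bin/sh
# Sketch only: convert FASTA (stdin) to FASTQ (stdout) with a constant quality.
# fasta_to_fake_fastq is a hypothetical helper name.
# $1: quality character to repeat for every base (default: I, i.e. Phred 40).
fasta_to_fake_fastq() {
    awk -v q="${1:-I}" '
        function flush_record(   qual, i) {
            if (!have) return
            qual = ""
            for (i = 0; i < length(seq); i++) qual = qual q
            printf "@%s\n%s\n+\n%s\n", name, seq, qual
        }
        # A header line starts a new record; emit the previous one first.
        /^>/ { flush_record(); name = substr($0, 2); seq = ""; have = 1; next }
        # Sequence lines may be wrapped, so join them into one string.
        { seq = seq $0 }
        END { flush_record() }
    '
}
```

It could be used as fasta_to_fake_fastq I < reads.fasta > reads.fastq, where both file names are placeholders. Because the qualities are made up, they carry no information, which is why a polishing step is still needed after assembly.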

0

Could you please tell me whether bash5tools.py also removes adapter sequences?

I have just used it to get subreads from the raw data, but I am not sure whether the subreads still contain adapter sequences.

0

You need to post this as a separate question. Also, please refer to the manual.

5
9.8 years ago

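1) and 2) Use bash5tools.py, which converts to FASTQ and filters by read length and read score in one step (input.bas.h5 is a placeholder for your own file):

bash5tools.py input.bas.h5 --minLength 500 --readType subreads --minReadScore 0.8 --outType fastq

3) It depends on your dataset, but if you simply sequenced 2 SMRT cells to get more coverage, you can merge them prior to assembly.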

https://github.com/PacificBiosciences/pbh5tools/blob/master/doc/index.rst
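Merging the reads from two SMRT cells prior to assembly can be as simple as concatenating the per-cell FASTQ files. The sketch below does that and adds a basic sanity check; merge_fastq is a hypothetical helper name, the file names are placeholders, and the check assumes four-line FASTQ records with a trailing newline.

```shell
#!/bin/sh
# Sketch only: concatenate per-cell FASTQ files into one file.
# merge_fastq is a hypothetical helper, not part of pbh5tools.
# usage: merge_fastq merged.fastq cell1.fastq cell2.fastq ...
merge_fastq() {
    out=$1
    shift
    cat "$@" > "$out" || return 1
    # Assumes one record = four lines; any other total suggests truncated input.
    lines=$(wc -l < "$out")
    if [ $((lines % 4)) -ne 0 ]; then
        echo "warning: $out has $lines lines, not a multiple of 4" >&2
        return 1
    fi
    echo "$((lines / 4)) reads written to $out"
}
```

For example, merge_fastq merged.fastq cell1.fastq cell2.fastq would write both cells into merged.fastq and report the read count.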

1
9.8 years ago
thackl ★ 3.0k

Have a look at dextract. It's very quick and lets you set a score cutoff. However, I think it only generates FASTA.

https://dazzlerblog.wordpress.com/2014/03/22/the-dextractor-module-save-disk-space-for-your-pacbio-projects/

1

dextract can generate FASTQ if you add the -q parameter. To keep only reads with a minimum read quality of 0.80, use -s800 (default: 750):

dextract -q *.bax.h5 -s800 > raw_reads_RQ0.80.fastq
0

How can I combine it with the find command? For example, in

find All_RawData/Each_Cell_Raw/ -name "*.bax.h5" | xargs -I {} dextract -q {} >

how do I get the input file name to use for the output file after the redirect?
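One possible approach, sketched below, is to replace xargs with a read loop. A redirect placed after xargs applies to the whole pipeline, so every file would end up in a single output, whereas a loop can build one output name per input. The helpers fastq_name_for and extract_all are hypothetical names, and the sketch assumes dextract is on the PATH and accepts the -q and -s800 options as quoted in the comment above.

```shell
#!/bin/sh
# Sketch only: run dextract once per .bax.h5 file, one FASTQ per input.

# Hypothetical helper: path/to/movie.1.bax.h5 -> movie.1.fastq
fastq_name_for() {
    base=$(basename "$1")
    printf '%s\n' "${base%.bax.h5}.fastq"
}

# Hypothetical helper. $1: directory to search, $2: existing output directory.
extract_all() {
    find "$1" -name '*.bax.h5' | while IFS= read -r h5; do
        # The redirect sits inside the loop, so each input gets its own file.
        dextract -q -s800 "$h5" > "$2/$(fastq_name_for "$h5")"
    done
}
```

A call such as extract_all All_RawData/Each_Cell_Raw/ fastq_out would then write one FASTQ per .bax.h5 file into fastq_out, a placeholder directory that must already exist.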

0
9.8 years ago

I've written a Perl wrapper for the HDF5 library that you might find useful. It hasn't been tested on PacBio files, though I have no reason to think it wouldn't read them.
