I have PacBio CCS.h5 files and the corresponding fasta and fastq files, and I would like to demultiplex them. Does anyone know how this can be done in the absence of bas.h5 files?
Thanks for your help!
Mandy
You can use the HMMER package to identify barcodes. Start and finish barcode HMMs can be probabilistically pinned (independently) to the start_pos and end_pos of the reads, where the barcodes are supposed to occur.
The two ends can then be considered together by adding the log-likelihood scores of the start_pos and end_pos HMM hits for each of the barcode combinations that were used for multiplexing (your hypotheses, i.e. the barcode combinations that were actually used).
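As a rough sketch of that last step only (not of running HMMER itself), the snippet below assumes you have already parsed, for one read, a log-likelihood score per barcode at the start and at the end. The barcode names, sample names, scores, and the helper `best_sample` are all hypothetical and for illustration only.

```python
# Hypothetical per-read scores: log-likelihood of each barcode HMM hit
# near the start and near the end of one read (values are made up).
start_scores = {"bc1": -2.1, "bc2": -9.7}
end_scores = {"bc1": -8.4, "bc2": -1.3}

# Barcode combinations assumed to have been used for multiplexing:
# sample name -> (start barcode, end barcode).
used_pairs = {"sampleA": ("bc1", "bc2"), "sampleB": ("bc2", "bc1")}


def best_sample(start_scores, end_scores, used_pairs):
    """Sum start and end log-likelihoods for each used barcode pair
    and return (best sample or None, all totals)."""
    totals = {}
    for sample, (start_bc, end_bc) in used_pairs.items():
        # Skip pairs for which one end has no HMM hit at all.
        if start_bc in start_scores and end_bc in end_scores:
            totals[sample] = start_scores[start_bc] + end_scores[end_bc]
    if not totals:
        return None, totals
    return max(totals, key=totals.get), totals


sample, totals = best_sample(start_scores, end_scores, used_pairs)
print(sample, totals)
```

In practice you would probably also want a minimum score, or a minimum margin between the best and second-best pair, before assigning a read to a sample.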
If your reads are in BAM files, you can easily extract the reads that contain the barcode sequences with the commands below. Note that this only works for exact barcode matches; it is not suitable when there are base errors in the barcode sequences.
example:
forward barcode = "CAAGCTCACT"
sequence between barcodes = ".*"
reverse complementary barcode = "GCACGACTTG"
or = "|"
reverse barcode = "CAAGTCGTGC"
sequence between barcodes = ".*"
forward complementary barcode = "AGTGAGCTTG"
samtools view -H pacbio_reads.ccs.bam > pacbio_reads.ccs-header.sam
samtools view pacbio_reads.ccs.bam | grep 'CAAGCTCACT.*GCACGACTTG\|CAAGTCGTGC.*AGTGAGCTTG' | cat pacbio_reads.ccs-header.sam - | samtools view -Sb - > pacbio_reads.ccs.demultiplex.bam
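If you need to tolerate a few base errors, the same barcode-pair test can be sketched in plain Python. This is only an illustration: it uses the example barcodes from above, the function names and the `max_mismatches` parameter are made up here, and it only allows substitutions. Insertions and deletions, which are common in PacBio reads, would need a proper aligner or the HMM approach.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_barcode(read, barcode, max_mismatches=1):
    """Return the first offset where barcode matches read with at most
    max_mismatches substitutions, or -1 if there is no such offset."""
    k = len(barcode)
    for i in range(len(read) - k + 1):
        mismatches = sum(a != b for a, b in zip(read[i:i + k], barcode))
        if mismatches <= max_mismatches:
            return i
    return -1


def has_barcode_pair(read, fwd, rev, max_mismatches=1):
    """True if the read contains fwd ... revcomp(rev) or rev ... revcomp(fwd),
    the same two orientations as the grep pattern."""
    for left, right in ((fwd, revcomp(rev)), (rev, revcomp(fwd))):
        start = find_barcode(read, left, max_mismatches)
        if start == -1:
            continue
        # Look for the second barcode only downstream of the first one.
        rest = read[start + len(left):]
        if find_barcode(rest, right, max_mismatches) != -1:
            return True
    return False


fwd, rev = "CAAGCTCACT", "CAAGTCGTGC"
read = "CAAGCTCACA" + "ACGTACGT" + "GCACGACTTG"  # one error in the first barcode
print(has_barcode_pair(read, fwd, rev, max_mismatches=1))
```

You could feed this the sequences from your fasta or fastq files directly and write the matching records out per barcode pair.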