Converting large (compressed and unsorted) BAM files to fastq
2
2
Entering edit mode
8.2 years ago
alesssia ▴ 580

Dear all,

we got a few hundred large compressed BAM files (70GB <= size <= 300GB). They are not sorted and we would like to convert them back to fastq, in order to align them with a different algorithm.

We have paired-end reads and were planning to first sort the BAM by read name using sambamba (http://lomereiter.github.io/sambamba/docs/sambamba-sort.html) and then to convert the sorted BAM into fastq using bedtools (http://bedtools.readthedocs.io/en/latest/content/tools/bamtofastq.html). However, while the sorting is relatively fast (about 4h for each file), the conversion is very slow.

Is anyone aware of any other procedure that will make the conversion faster? I've seen that there are alternatives to bedtools such as picard (https://broadinstitute.github.io/picard/command-line-overview.html#FastqToSam) and biobambam2 (https://github.com/gt1/biobambam2). Does anyone know the performances of these tools, if a benchmarking has already been performed and/or if there are better tools?

Thank you very much in advance :)

BAM fatsq alignment bedtools picard • 12k views
ADD COMMENT
0
Entering edit mode

Have you tried bam2fastq: https://gsl.hudsonalpha.org/information/software/bam2fastq

It says its no longer supported but still works

ADD REPLY
0
Entering edit mode

Hi Tonor. Do you know if bam2fastq is faster than Picard's SamToFastq and why it has been discontinued for Picard's SamToFastq (as in the top pf their web page)?

ADD REPLY
0
Entering edit mode

I've never done benchmarking so not sure which one if faster, I think as Picard tools accomplishes the same task they stopped actively developing it.

I don't think the bam file has to be sorted though for bam2fastq to work

ADD REPLY
0
Entering edit mode

BAM needs to be name sorted for PE data for bedtools bamtofastq.

ADD REPLY
0
Entering edit mode

Yes, it does need to be sorted if the reads are paired-end (as specified by both picard and bedtools).

ADD REPLY
0
Entering edit mode

I am interested in the fastest way to complete the task, but I will add this tool to the list of programs to benchmark --if no one has done this before.

ADD REPLY
1
Entering edit mode

I recently used bam2fastq for single end reads. The problem I found was that reads that had more than one alignment in the bam file ended up being present more than once in the fastq file.

ADD REPLY
1
Entering edit mode

Yes, that's why for the reformat command I added the "primaryonly" flag; without that it has similar behavior.

ADD REPLY
3
Entering edit mode
8.2 years ago

The BBMap package's reformat tool will do bam -> fastq conversion. I just tested it, and it ran at 50 Mbps/s, limited by Samtools conversion (it runs samtools in a subprocess for bam -> sam conversion). Sam.gz -> fastq is slightly faster at 70 Mbp/s. Converting a 2.5 GB sorted, mapped bam file took 88 seconds and produced a 10.5 GB fastq, so that's 120 MB/s for the raw fastq output. Gzipped fastq takes the same amount of time if you have pigz installed.

Usage:

reformat.sh in=reads.bam out=reads.fq primaryonly

ADD COMMENT
1
Entering edit mode

Hi Brian, so I tried this:

reformat.sh in=test.bam out=stdout.fq  | reformat.sh in=stdin.fq out1=r.1.fq.gz out2=r.2.fq.gz interleaved addcolon

however, I've been getting this error message:

Input is being processed as unpaired
java.lang.AssertionError: TODO: Encountered a read with 'M' in cigar string but no MD tag and no ScafMap loaded.
This can normally be resolved by adding the flag ref=file, where file is the fasta file to which the reads were mapped.

is there any way to solve this? I'm using ver: BBMap_38.08 thanks.

ADD REPLY
0
Entering edit mode

Hi Brian, thanks for suggesting BBmap. I am trying to use it but I have a weird problem and perhaps you can help me (hope this comment is the right place to ask).

We have paired-end reads, that have already been sorted by read name using sambamba, and I would like to create two fastq, one for each end. When I run the following command ./reformat.sh in=test.sorted.bam out1=test1.fq out2=test2.fq ow allowidenticalnames addslash primaryonly BBmap is telling me that "Input is being processed as paired" but a java.lang.AssertionError is thrown after the first pair of reads is processed (see below). When I ask for a single fastq with the following command ./reformat.sh in=test.sorted.bam out=test.fq ow allowidenticalnames addslash primaryonly BBmap is confirming that "Input is being processed as unpaired", the fastq is created and each read name is followed by " /1".

Any idea?


The thrown error is:

Input is being processed as paired
Exception in thread "Thread-2" java.lang.AssertionError: H2FF2ALXX:1:1:2:0  2   -1  +   32763188    32763337    1000000000000000010 1   0   1   TCTTTAAAACATTTTAAGGCATTTTACCTCTTAATGATAAGAATTTAAGAAAAACGCTAAAATATCATTATTCTGTCTATAAGATTAAAAGAGATTAATTAATCTAACATGTAGTACATATTAAATATAAATGAAATAACAAAAAAAAGC  A---FJJJJAJJJJJJ<-F-AAA-JJ7---AFJJF<J7JJF---JFFJ7JFA7<---<<--JJAJ-AFJ<J7-----7-<-<F-<-JF-F-7-7------7-----J-<-<-F7<--<---7F-F-------<---------AF7A----  .   23  .   .
H2FF2ALXX:1:1:2:0   163 chr16   32763189    1   115M35S =   32763344    304 TCTTTAAAACATTTTAAGGCATTTTACCTCTTAATGATAAGAATTTAAGAAAAACGCTAAAATATCATTATTCTGTCTATAAGATTAAAAGAGATTAATTAATCTAACATGTAGTACATATTAAATATAAATGAAATAACAAAAAAAAGC  A---FJJJJAJJJJJJ<-F-AAA-JJ7---AFJJF<J7JJF---JFFJ7JFA7<---<<--JJAJ-AFJ<J7-----7-<-<F-<-JF-F-7-7------7-----J-<-<-F7<--<---7F-F-------<---------AF7A----  SM:i:1  AS:i:1  RG:Z:0  NM:i:8  BC:Z:none
at stream.SamReadInputStream.toReadList(SamReadInputStream.java:124)
at stream.SamReadInputStream.fillBuffer(SamReadInputStream.java:90)
at stream.SamReadInputStream.hasMore(SamReadInputStream.java:54)
at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:643)
at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)
ADD REPLY
2
Entering edit mode

Reformat produces a file with interleaved reads by default. You can easily separate the two reads into two files by subsequently doing reformat.sh in=test.fq out1=test_R1.fq.gz out2=test_R2.fq.gz

ADD REPLY
0
Entering edit mode

Thanks for clarifying!

ADD REPLY
1
Entering edit mode

Yeah, Reformat will never interpret a sam/bam file as interleaved no matter what flags you give it, so it will just keep them in the input order and treat them as single-ended. If you want them in two files with the names changed, you can do it in a 2-stage pipe with renaming in the second phase, like this:

reformat.sh in=test.sorted.bam out=stdout.fq primaryonly | reformat.sh in=stdin.fq out1=r1.fq.gz out2=r2.fq.gz interleaved addcolon

"addcolon" is probably better than "addslash" as the " 1:" and " 2:" suffix are produced by more recent versions of Illumina software.

ADD REPLY
1
Entering edit mode

@Brian: addcolon is not working in v.36.59.

Exception in thread "main" java.lang.AssertionError: Unknown parameter addcolon
    at jgi.ReformatReads.<init>(ReformatReads.java:162)
    at jgi.ReformatReads.main(ReformatReads.java:42)
ADD REPLY
0
Entering edit mode

Hmm, yeah, looks like I added it in 36.60. But 36.62 is up now.

ADD REPLY
0
Entering edit mode

I must admit the conversion in BBmap is pretty fast! Does it support multi-threading? Will it be even faster if I increase the memory? Thanks!

ADD REPLY
0
Entering edit mode

Reformat is multithreaded. Unfortunately, it's already going as fast as it can go, limited by the bam -> sam conversion of samtools. So, no, increasing memory won't help.

I wrote another program yesterday exclusively for sam/bam -> fastq conversion which is even faster (I clocked it at 109 Mbp/s, so around 50% faster than Reformat sam.gz -> fastq and 120% faster than bam -> fastq). But that's the speed for sam.gz -> fastq, so since you are starting with bam files, it wouldn't be any faster.

ADD REPLY
0
Entering edit mode

Too bad it does not work for bam->fastq :(

Anyway, starting from an (unsorted compressed) BAM file of 102G it took about 4.5h to sort the BAM according to read name with sambamba (using 60GB memory and 20 threads) and about 6h to convert the (uncompressed sorted) BAM of 310G in a single fastq and then to split it into two files (forward/reverse) with BBmap reformat. We have a few hundred of them, corresponding to few thousand hours of computation. Does anyone know any trick to speed it up? @Brian, do the BBmap performance decrease if the BAM is not sorted?

Thanks again!

ADD REPLY
1
Entering edit mode

Hmmm... if you are converting to fastq.gz, Reformat will be faster if you have pigz installed. And since Reformat is low-memory and only uses a few threads, you can run multiple instances at once on a 20-core node (probably 5-6 instances).

The performance of Reformat's bam->fastq is limited by samtools. For compressed bam, position-sorted will be substantially faster than unsorted because the decompression should be faster. For uncompressed bam, I'm not really sure; it may not make a difference. You can run top and look at the CPU utilization; if samtools is at 100%, then Reformat can't go any faster no matter what you do.

ADD REPLY
0
Entering edit mode

Hi @Brian! I've run the reformat on the unsorted compressed BAM with pigz installed and the processing time halved (from 2 to 1h, processing 264.59k reads/sec vs the previous 100.94k reads/sec). On the top of that, using the unsorted BAM saved me about 4.5h :) My current bottleneck is the second step of the BAM to fastq conversion, where the interleaved reads are separate into two files, that takes about 2.5h (122.25k reads/sec). Any way to speed this up? I know that I am being a bit insatiable here, but you never know :) Thanks!

ADD REPLY
0
Entering edit mode

UPDATE: when used as input for BBmap reformat, the unsorted compressed BAM generated a non-interleaved fastq that is equivalent (in number of reads and size) to the one created by using the sorted uncompressed BAM. However, according to the script usage: "If input is paired and there is only one output file, it will be written interleaved.". What am I doing wrong?

Also, I suspect that the following command requires the input fastq to be interleaved to work properly, isn't it?

reformat.sh in=unsorted.test.fq out1=unsorted.test.end1.fq out2=unsorted.test.end2.fq

Thanks!

ADD REPLY
0
Entering edit mode

That command does indeed require interleaved fastq input. As for bam -> fastq, Reformat will simply retain the original ordering and original reads (aside from secondary/supplimentary alignments, which are discarded with the "primaryonly" flag). It does not understand the ordering of the bam file, which is why you need to name-sort the bam file first, so that it will end up interleaved. In other words, the process is like this:

Step 1: Name-sort the bam.

Step 2: Run Reformat on the name-sorted bam, like this:

reformat.sh in=namesorted.bam out=stdout.fq primaryonly | reformat.sh in=stdin.fq out1=r1.fq.gz out2=r2.fq.gz interleaved addcolon

Did you do that, and encounter odd results?

ADD REPLY
0
Entering edit mode

@Brian: Plain out=stdout and in=stdin seems to work fine. Is .fq extension needed?

ADD REPLY
0
Entering edit mode

It's not strictly needed, because fastq is the default for Reformat. But I always like to include it because some other programs have other defaults. For example, "out=stdout" for BBMap will output sam format, and Dedupe will output fasta, I think. Whereas "out=stdout.fq" will always output fastq.

ADD REPLY
0
Entering edit mode

Thanks Brian, I hoped I could avoid the time to name-sort the bam file first :)

ADD REPLY
0
Entering edit mode

I have to revive that thread....

I have a hg19-mapped BAM and want to remap to GRCh38. Here is the flagstat-output from the hg19-BAM:

69351786 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
3301071 + 0 supplementary
6209841 + 0 duplicates
69307585 + 0 mapped (99.94% : N/A)
66050715 + 0 paired in sequencing
33033978 + 0 read1
33016737 + 0 read2
64904898 + 0 properly paired (98.27% : N/A)
65962313 + 0 with itself and mate mapped
44201 + 0 singletons (0.07% : N/A)
880261 + 0 with mate mapped to a different chr
513887 + 0 with mate mapped to a different chr (mapQ>=5)

When I use reformat.sh like here:

reformat.sh in=chr1.hg19.bam  out=stdout.fq pairedonly | reformat.sh in=stdin.fq out1=r1.fq.gz out2=r2.fq.gz interleaved addcolon

and remap the resulting fastq with hisat2, then I get the following output from samtools flagstat:

70280428 + 0 in total (QC-passed reads + QC-failed reads)
3385442 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
66854526 + 0 mapped (95.13% : N/A)
66894986 + 0 paired in sequencing
33447493 + 0 read1
33447493 + 0 read2
15052276 + 0 properly paired (22.50% : N/A)
61023694 + 0 with itself and mate mapped
2445390 + 0 singletons (3.66% : N/A)
495106 + 0 with mate mapped to a different chr
199169 + 0 with mate mapped to a different chr (mapQ>=5)

Why do I lose so many paired reads ? How can I avoid it ? Thanks for any help.

ADD REPLY
0
Entering edit mode

I don't see any reads being lost. You actually have some more reads in second alignment because you do not appear to have removed reads that were secondary alignments when you did the conversion. I guess you are referring to less reads being properly paired in GRCh38 alignment. I suggest that you check to make sure that your fastq files are in sync across R1/R2 reads.

BTW you may want to do the following if you are going to repeat the process.

reformat.sh in=chr1.hg19.bam out1=r1.fq.gz out2=r2.fq.gz pairedonly=t addcolon primaryonly=t
ADD REPLY
0
Entering edit mode

Incidentally, the reason I used the pipe in the command reformat.sh in=namesorted.bam out=stdout.fq primaryonly | reformat.sh in=stdin.fq out1=r1.fq.gz out2=r2.fq.gz interleaved addcolon) is to eliminate the need for two reformat steps.

ADD REPLY
0
Entering edit mode

Does anyone know any trick to speed it up?

Following may sound like a flippant solution but it works. We have done jobs that would have taken a couple of months on local clusters in a couple of days on Google compute.

If this is time sensitive do the compute on Google compute/Amazon AWS. You can start all jobs at the same time and be done in the time it takes to complete one using multiple VM's.

Otherwise get started with the conversion without wasting any additional time testing other tools :)

ADD REPLY
0
Entering edit mode

:) I had already started the conversion before posting the initial question, but, you know, following BAMs can be converted faster ;)

Unfortunately, due to some legal/ethical constraints, using external machines is out of the question, so we must make the most of what we have at hand!

ADD REPLY
0
Entering edit mode

Totally understand.

FYI: Google will sign business associate agreements (perhaps Amazon does too). So it is possible to create a business case (that can fit local security policies) for such use. Google compute is also HIPAA/EU Data protection directive compliant that covers many constraints.

Why do you need to convert the BAM's to fastq, out of curiosity?

ADD REPLY
0
Entering edit mode
8.2 years ago
Ron ★ 1.2k

You can use HYDRA. Its pretty fast.

https://code.google.com/archive/p/hydra-sv/downloads

I have used this version Hydra.v0.5.3.tar.gz

Something like this should work:

Hydra-Version-0.5.3/bin/bamToFastq -bam  input.bam -fq1 out.R1.fq  -fq2 out.R2.fq
ADD COMMENT
0
Entering edit mode

If you say it is pretty fast... I will give it a try! Thanks!

ADD REPLY
1
Entering edit mode

Hi again Ron! Looking at the number of bytes of the fastq file written in the same time interval (that is a quite rough way to look into performances) by both bedtools and HYDRA I would say that their speed is quite comparable, the latter being slightly but consistently slower.

ADD REPLY

Login before adding your answer.

Traffic: 3290 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6