Question

From Compressed Bam To Fastq

2

Entering edit mode

14.0 years ago

Zhidkov ▴ 600

Hi all, We were using the samtools calmd –e command and changed all the basses that are identical to the reference to '='. Anyone knows some software that converts such compressed BAM back to FASTQ? One-liners won't help in this case. Thanks

samtools • 7.4k views

ADD COMMENT • link updated 14.0 years ago by Eitan Rubin ▴ 30 • written 14.0 years ago by Zhidkov ▴ 600

0

Entering edit mode

Thank you all for answers, The approach that was eventually taken is using pileup file to generate uncompressed BAM and then it is trivial BAM2FASTQ script.

Ilia

ADD REPLY • link 13.9 years ago by Zhidkov ▴ 600

score 8 · Answer 1 · 2011-06-02

8

Entering edit mode

14.0 years ago

lh3 33k

If your BAM is coordinate sorted, I would not recommend to use Picard directly, as this will generate a coordiante sorted file, which may cause mapping problems (certainly for bwa) and other potential biases in analyses.

The recommended way is to sort your BAM by name and then convert to fastq with Picard or your own scripts -- for name sorted BAM, it very easy to generate fastq.

ADD COMMENT • link 14.0 years ago by lh3 33k

0

Entering edit mode

Heng, this is useful information that I think is underappreciated. Can you expand on the nature of mapping problems that this will cause?

ADD REPLY • link 12.2 years ago by nlomioni ▴ 30

2

Entering edit mode

Reads from the same genomic regions share similar properties. For bwa, chromosomal ordered reads will interfere with insert size estimate. For all mappers, such ordered reads will make each alignment jobs run at different speeds, faster if reads are from a unique region, but slower if from a repetitive region. Now htslib comes with bamshuf, which is much faster than sorting. Running bamshuf before bam2fq is the preferred solution.

ADD REPLY • link 12.2 years ago by lh3 33k

0

Entering edit mode

@nlomioni It is certainly underappreciated. Thanks for the info @lh3, I still don't understand the issues in the IS calculations and the variability in running times. The purpose of the IS calculations is to find the real IS of the library, how reads from the same chrm regions will interfere here? As for the running time, I understand repeat regions will be more computationally intensive to map but the overal running time of the whole mapping process should be the same for the different methods?

ADD REPLY • link 12.1 years ago by Drio ▴ 920

2

Entering edit mode

Speed matters when you split alignment jobs on different computing nodes. For paired-end, insert size distribution will be different in some regions especially in chr1 centromere, where there are many unpaired/mispaired reads. This will lead to potential artifacts. Bamshuf is much faster and more lightweight than picard without having these concerns. It is the best solution.

ADD REPLY • link 12.1 years ago by lh3 33k

Ram · Answer 2 · 2011-06-02

Take a look to this thread. Basically there is a picard command for that.

Based on other comments I think these are valid methods:

INPUT_BAM=input.bam
PICARD="/path/to/picard/jars"
JAVA_OPTS="-Xmx4G"
PICARD_OPTS="MAX_RECORDS_IN_RAM=5000000 TMP_DIR=/path/to/local/tmp"
htscmd bamshuf -O -u $INPUT_BAM _tmp | \
    java $JAVA_OPTS -jar $PICARD/SamToFastq.jar I=/dev/stdin F=e1.fq F2=e2.fq $PICARD_OPTS

Alternatively we could use picard for the sorting (I haven't tried this method):

I_BAM=input.bam
PICARD="/path/to/picard/jars"
JAVA_OPTS="-Xmx4G"
PI_OPTS="MAX_RECORDS_IN_RAM=5000000 TMP_DIR=/path/to/local/tmp" 
java $JAVA_OPTS -jar $PICARD/SortSam.jar I=$I_BAM O=/dev/stdout SORT_ORDER=queryname $PI_OPTS |\
    java $JAVA_OPTS -jar $PICARD/SamToFastq.jar I=/dev/stdin F=e1.fq F2=e2.fq $PI_OPTS

score 1 · Answer 3 · 2011-06-04

1

Entering edit mode

14.0 years ago

Eitan Rubin ▴ 30

I checked out the recommendation to use picard or Hydra to convert BAM to Fastq. Works great UNLESS YOU USED THE -calmd -e OPTION OF SAMTOOLS. In this case, picard complaints that it sees an illegitimate base '=', and hydra happily creates fastq sequences made almost entirely of '=' charaters :(

ADD COMMENT • link 14.0 years ago by Eitan Rubin ▴ 30