If your BAM is coordinate sorted, I would not recommend to use Picard directly, as this will generate a coordiante sorted file, which may cause mapping problems (certainly for bwa) and other potential biases in analyses.
The recommended way is to sort your BAM by name and then convert to fastq with Picard or your own scripts -- for name sorted BAM, it very easy to generate fastq.
Reads from the same genomic regions share similar properties. For bwa, chromosomal ordered reads will interfere with insert size estimate. For all mappers, such ordered reads will make each alignment jobs run at different speeds, faster if reads are from a unique region, but slower if from a repetitive region. Now htslib comes with bamshuf, which is much faster than sorting. Running bamshuf before bam2fq is the preferred solution.
@nlomioni It is certainly underappreciated. Thanks for the info @lh3, I still don't understand the issues in the IS calculations and the variability in running times. The purpose of the IS calculations is to find the real IS of the library, how reads from the same chrm regions will interfere here? As for the running time, I understand repeat regions will be more computationally intensive to map but the overal running time of the whole mapping process should be the same for the different methods?
Speed matters when you split alignment jobs on different computing nodes. For paired-end, insert size distribution will be different in some regions especially in chr1 centromere, where there are many unpaired/mispaired reads. This will lead to potential artifacts. Bamshuf is much faster and more lightweight than picard without having these concerns. It is the best solution.
Take a look to this thread. Basically there is a picard command for that.
Based on other comments I think these are valid methods:
INPUT_BAM=input.bam
PICARD="/path/to/picard/jars"
JAVA_OPTS="-Xmx4G"
PICARD_OPTS="MAX_RECORDS_IN_RAM=5000000 TMP_DIR=/path/to/local/tmp"
htscmd bamshuf -O -u $INPUT_BAM _tmp | \
java $JAVA_OPTS -jar $PICARD/SamToFastq.jar I=/dev/stdin F=e1.fq F2=e2.fq $PICARD_OPTS
Alternatively we could use picard for the sorting (I haven't tried this method):
I_BAM=input.bam
PICARD="/path/to/picard/jars"
JAVA_OPTS="-Xmx4G"
PI_OPTS="MAX_RECORDS_IN_RAM=5000000 TMP_DIR=/path/to/local/tmp"
java $JAVA_OPTS -jar $PICARD/SortSam.jar I=$I_BAM O=/dev/stdout SORT_ORDER=queryname $PI_OPTS |\
java $JAVA_OPTS -jar $PICARD/SamToFastq.jar I=/dev/stdin F=e1.fq F2=e2.fq $PI_OPTS
I checked out the recommendation to use picard or Hydra to convert BAM to Fastq. Works great UNLESS YOU USED THE -calmd -e OPTION OF SAMTOOLS. In this case, picard complaints that it sees an illegitimate base '=', and hydra happily creates fastq sequences made almost entirely of '=' charaters :(
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Thank you all for answers, The approach that was eventually taken is using pileup file to generate uncompressed BAM and then it is trivial BAM2FASTQ script.
Ilia