I have a bam file that I want to split into paired end fastq files per lane. Can somone please suggest me how it can be done?
I have a bam file that I want to split into paired end fastq files per lane. Can somone please suggest me how it can be done?
Using the example data you have above it can be demultiplexed into lane specific files using demuxbyname.sh
from BBMap suite as follows. (use .gz
extension to keep files zipped). Input here is interleaved (as in your example). If someone reaches this solution in future and have their inputs in two separate files then specify them with in1=R1
and in2=R2
.
$ demuxbyname.sh -Xmx10g in=file.fq delimiter=: column=2 out=%_#.fq
Example output
$ more 5_1.fq
@HS2000-1015_160:5:2306:10070:71746/1
AGAAGACGAGCACAGTGAGGAAGGAGATGAGCAGGCAGGGGATGATGAGGTTGATGGTGTAGAACAAGGGCAGGCGCCGGATGTACAGCGAGTATGTGAT
+
CCCFFFFFHHHHHIICGGHHHAFGEFGIJJJIIJIIGIIGG;FGHEHHGI=CHCEHGHE?DFFDFECEDDDDDDDDDBDDBD@ACDAACDBD5<CDC>@>
@HS2000-1015_160:5:2113:11446:94436/1
AATTGCTTGAACCCGGGAGGCAGAGGTTGCAGTGAGCCAAGATCGCACCACTGCAGTCCAGCCTGGGAAACCAGAGCAAAACTTGGTCTCAAAAAAAAAA
+
CCCFFFFFHHHHGHGFGFHJIBF1DC<@;)9D*099?F;8=(BFF5(;'.=CEHFBDC7.>;?C@BBB35>(<?<<A>AC?B>4+(+>A:(:@(2)5@B#
$ more 5_2.fq
@HS2000-1015_160:5:2306:10070:71746/2
GAACCTCAAGGACTATTGGGAGAGCGGCGAGTGGGCCATCATCAAAGCCCCAGGCTACAAACACGACATCAAGTACAACTGCTGCGAGGAGATCTACCCC
+
@CCFFFFDHGHFHIJJJJJJGGGIIJJIGHI@FHGIIGHHEFGHHFFFFFBCDDDDDCDDDDDDDD;@BDCCDACDD@>ACCDDDDBDB<BA?C@CC@BD
@HS2000-1015_160:5:2113:11446:94436/2
CGTCAGGGCCAACCCCGCCCCACCCTGACCCTACCTGGCACCCCTCACCTGTGGCCTGCCAGCACAGCCTCGCCCCTGCTGGCCAATGTGTCCCCCGTCA
+
?@@DA@DDFHH?DHI)<@@FHDBGGCHCBDH;DFA<)6.=7D;@CBCHD)).7@=>;?==AABC95<(5(5309@D########################
I ended up using this demuxbyname.sh
. I found it very fast- takes only 15 minutes to split 200G fastq. But before doing this, I did: samtools collate -uO ${INBAM} | samtools fastq - -@ ${THREADS} -N -1 ${FQ_OUT1} -2 ${FQ_OUT2} -0 /dev/null -s /dev/null -n
. It would be nice if they had an option to keep both column1 and column 2 as base name of splitted fastq. Anyway, thank you for your help.
This could also be done starting with BAM directly as follows (not tested, MAPK if you can test would be great). Make sure BAM is name sorted (or could be sorted on fly before reformat.sh
step).
reformat.sh -Xmx20g in=your.BAM out=stdout.fq primaryonly=t | demuxbyname.sh -Xmx10g in=stdin.fq delimiter=: column=2 out=%_#.fq
If someone wants awk solution, I also have written this. This also creates RG files.
#!/bin/bash
SM=SAMPLE
DNA=BARCODE
PR=PROJECT
FULLSM="SM^DNA^PR"
### splitfq1.sh
set -x
echo "Splitting lanes from FASTQ: ${FQ_OUT1}"
awk -v sm="$SM" -v dna="$DNA" -v pr="$PR" -v fullsm="$FULLSM" -v out_dir="${OUT_DIR}" -v tab="\\\\t" '
BEGIN{FS = ":"}
/^@HS2000/{
arr=$1
sub(/^@+/,"",arr)
lane=$2
close(outputFile)
outputFile=out_dir"/"fullsm"."arr"."lane".r1.fastq"
outputFile2=out_dir"/"fullsm"."arr"."lane".rgfile"
print "@RG" tab "ID:"arr"."lane tab "DS:"fullsm tab "LB:"dna tab "PL:illumina" tab "PU:"arr"."lane tab "SM:"sm > (outputFile2)
close(outputFile2)
}
{
print >> (outputFile)
}' ${FQ_OUT1}
ls $OUT_DIR/*.rgfile | sort -V | xargs cat > "$OUT_DIR/${FULLSM}.RGFILES.txt"
## splitfq2.sh
!/bin/bash
set -x
# FQ2
echo "Splitting lanes from FASTQ: ${FQ_OUT2}"
awk -v sm="$SM" -v dna="$DNA" -v pr="$PR" -v fullsm="$FULLSM" -v out_dir="${OUT_DIR}" -v tab="\\\\t" '
BEGIN{FS = ":"}
/^@HS2000/{
arr=$1
sub(/^@+/,"",arr)
lane=$2
close(outputFile)
outputFile=out_dir"/"fullsm"."arr"."lane".r2.fastq"
}
{
print >> (outputFile)
}' ${FQ_OUT2}
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
You are going to need to write your own code to do this. You could either convert BAM into fastq and then sort or sort them as you convert.
The lane-splitting is the custom part, for the rest just run
samtools collate
piped intosamtools fastq
. I would probably runsamtools fastq
to get one fastq (not splitted by R1/R2, so an interleaved file) and then scan the headers for the lane information, writing the first entry of each pair (so the one that comes first) tolane_1.fastq.gz
and the second per pair tolane_2.fastq.gz
.@ATpoint Do I need to sort the reads first ? Should I just do
samtools collate input.bam -O | samtools fastq -0
?Yes. You should collate or name-sort reads before conversion to fastq.
You should group by read name, it does not matter whether you sort or collate, the effective result is the same. Reason is that some aligners expect randomly-ordered fastq files due to the way they infer the most likely TLEN (afaik).
@ATpoint Not sure why I am getting this:
samtools collate -uO LP6005117-DNA_G04.bam | samtools fastq -
@ATpoint Now that I have
interleaved.fastq
file, how do I split them into individual lane fastqs?You will need to read these files two records at a time, look at the read header for record 1
Decide what lane file to put it in based on the number in second field ( when separated by
:
).Someone can provide an
awk
based solution but let me see if I can give you abbmap
based one.