How can I convert multiple bam files in a directory into fastq files using samtools bam2fq? I use bash on linux.
How can I convert multiple bam files in a directory into fastq files using samtools bam2fq? I use bash on linux.
You could perhaps run a simple for
loop over the BAM files in your directory. Here is example shell (e.g., Bash) code:
# Change to the directory of interest
cd path/to/directory/of/interest || echo "cd'ing failed"
# Loop over all BAM files in the directory
for bam in *.bam; do
# Collate and convert each BAM file to a FASTQ file using samtools
samtools collate "${bam}" -o "${bam%.bam}.collated.bam"
samtools bam2fq "${bam%.bam}.collated.bam" > "${bam/.bam/.fastq}"
done
cd path/to/directory/of/interest
. If the cd
command fails, then the code after || is run, printing the message "cd'ing failed".for
loop iterates over each BAM file in the directory (for bam in *.bam
). The use of *
here is an example of a Bash glob.${bam}
), the script uses samtools bam2fq
to convert it to a FASTQ file..bam
(%.bam
within the pointy brackets; see parameter expansion for more details) and then appending the new extension .collated.bam
..bam
extension with .fastq
("${bam/.bam/.fastq}"
). (See parameter expansion for more details.)Also, looping over each file one-by-one to call samtools collate
and then samtools bam2fq
is going to be pretty slow. You can speed things up by distributing the commands across multiple cores using GNU Parallel:
parallel -j 4 '
samtools collate {} -o {.}.collated.bam &&
samtools bam2fq {.}.collated.bam > {.}.fastq
' ::: *.bam
-j 4
specifies the number of jobs to run in parallel. Adjust this number based on the number of CPU cores available on your machine.{}
is a placeholder for the input file (each BAM file).{.}
is a placeholder for the input file without the extension (e.g., file.bam
becomes file
).&&
means run the following code (samtools bam2fq {.}.collated.bam > {.}.fastq
) if the preceding code (samtools collate {} -o {.}.collated.bam
) ran successfully.::: *.bam
: This part tells parallel
to use all BAM files in the current directory as input.kalavattam sorry, but the code is wrong, the bam should be first processed with 'samtools collate'.
Thank you, yes. One should run samtools collate
or samtools sort -n
prior to samtools bam2fq
. I've edited the answer to reflect this. (samtools bam2fq
will work on a coordinate-sorted BAM file, but the read alignments are out of order, and paired read alignments are generally not with each other.)
Are any error messages reported? If so, it may be helpful to post them here for troubleshooting. If you decide to share the error messages, please format them (along with any code) using the icon above the text box that looks like this:
101
010
Alternatively, you can enclose the text and/or code with three backticks (```), like this:
```
Error message and/or code
```
which will become
Error message and/or code
This kind of formatting will make it much easier for people to read the messages and code, and potentially help you troubleshoot the issue.
No errors, it just completes the collated bit but doesn't go on to create the fastq. Was this all meant happen in the same job? I used the first code you wrote
for bam in *.bam; do
samtools collate "${bam}" -o "${bam%.bam}.collated.bam"
samtools bam2fq "${bam%.bam}.collated.bam" > "${bam/.bam/.fastq}"
done
I don't think it is finding the collated bam file, which needs to be distinguished from the original bam file within the same directory.
If you're running the code as a one-liner, which is how you pasted it,
for bam in *.bam; do samtools collate "${bam}" -o "${bam%.bam}.collated.bam" samtools bam2fq "${bam%.bam}.collated.bam" > "${bam/.bam/.fastq}" done
then the two commands are being interpreted as one, and the second one (samtools bam2fq
) is not being run.
If you're going to run the code as a one-liner, you need to separate the two commands. One option is to use a semicolon (;
); another is to use &&
(explained above). I'd use &&
. Also, you need to place a semicolon before the done
operator. The syntax changes if you run the code on one line.
For example,
for bam in *.bam; do samtools collate "${bam}" -o "${bam%.bam}.collated.bam" && samtools bam2fq "${bam%.bam}.collated.bam" > "${bam/.bam/.fastq}"; done
The message is telling you that there is "No space left on device
". That means there's too little space on your hard drive.
In that case, you may want to delete some other files to free up space. If you don't want to delete files, then you could consider moving them from the drive you're working on to another one.
It is the uni storage via their HPC server, so there should not be a shortage of space. I have managed far bigger files than the tmp files that won't save now. When I converted the BAM files to Fastq it was fine. The issue arises if I collate first, due to the tmp files.
It is the uni storage via their HPC server, so there should not be a shortage of space.
Space on servers is not a single, monolithic thing.
Looking at the error message,
[E::bgzf_flush] File write failed (wrong size)
samtools collate: Couldn't write to intermediate file "/tmp/collate46003.0059.bam": No space left on device
we see that this operation is taking place in a temporary directory (tmp
) within the root (/
) directory of your server. It's likely that this specific directory, /tmp
, has insufficient space. The people who maintain these servers tend to allocate only so much space for these kinds of shared temporary locations. So, it may be at or near capacity.
By default, many of the utilities in the samtools
software (such as samtools sort
and samtools collate
) use /tmp
for writing temporary intermediate files. However, you can change the location for temporary output.
First, let's look at the help documentation for samtools collate
:
Usage: samtools collate [options...] <in.bam> [<prefix>]
Options:
-O Output to stdout
-o Output file name (use prefix if not set)
-u Uncompressed BAM output
-f Fast (only primary alignments)
-r Working reads stored (with -f) [10000]
-l INT Compression level [1]
-n INT Number of temporary files [64]
-T PREFIX
Write temporary files to PREFIX.nnnn.bam
--no-PG do not add a PG line
--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
--output-fmt FORMAT[,OPT[=VAL]]...
Specify output format (SAM, BAM, CRAM)
--output-fmt-option OPT[=VAL]
Specify a single output file format option in the form
of OPTION or OPTION=VALUE
--reference FILE
Reference sequence FASTA FILE [null]
-@, --threads INT
Number of additional threads to use [0]
--verbosity INT
Set level of verbosity
<prefix> is required unless the -o or -O options are used.
Note the -T
argument, which you can use to change where the temporary files are being written, e.g.,
cd path/to/directory/of/interest || echo "cd'ing failed"
for bam in *.bam; do
samtools collate \
-o "${bam%.bam}.collated.bam" \
-T "$(pwd)/temporary" \
"${bam}"
samtools bam2fq "${bam%.bam}.collated.bam" > "${bam/.bam/.fastq}"
done
In the example above, I use -T "$(pwd)/temporary"
to write the temporary intermediate files to the current working directory (the one containing your BAM files of interest). However, this is not really good practice because HPC servers are a shared resource. Depending on how things are set up on the server, how many BAM files you need to process, how large the BAM files are, and so on, continually reading from and writing to this kind of working directory may slow down the work of other people using the HPC.
Does your lab have its own temporary directory in a non-/tmp
location? This is often the case for university research computing environments, and these directories tend to be set up to minimize the impact on other people using the server. If so, then it would be better to use that, e.g.,
cd path/to/directory/of/interest || echo "cd'ing failed"
for bam in *.bam; do
samtools collate \
-o "${bam%.bam}.collated.bam" \
-T "path/to/temporary/directory/for/your/lab/prefix" \
"${bam}"
samtools bam2fq "${bam%.bam}.collated.bam" > "${bam/.bam/.fastq}"
done
Also, note that when using -T
, you need to end with a prefix.
More explanation: The \
symbols in the above Bash code are used as line continuation characters. They allow you to split a long command across multiple lines for better readability. When Bash encounters a \
at the end of a line, it knows to continue reading the next line as part of the same command; if it does not encounter this, then it assumes the command is completed. Above, the \
symbols allow the samtools collate
command to be split over three lines, improving readability. Using these symbols is useful for relatively complex commands with multiple options or arguments, making the code easier to read.
If you want to run the code as a (long) one-liner, then see the following:
cd path/to/directory/of/interest || echo "cd'ing failed"; for bam in *.bam; do samtools collate -o "${bam%.bam}.collated.bam" -T "path/to/temporary/directory/for/your/lab/prefix" "${bam}" && samtools bam2fq "${bam%.bam}.collated.bam" > "${bam/.bam/.fastq}"; done
To troubleshoot this, please paste the exact whole command you ran on the command line (enclosed by triple backticks as explained above).
Also, please paste the output from running samtools collate
without any options on the command line, e.g., like this:
samtools collate
Also, please paste what is returned by running samtools --version
, e.g., like this:
samtools --version
As I note here, samtools bam2fq
will work with coordinate-sorted (or otherwise unsorted) BAM files. So, it becomes a matter of what analyses you want to do with these FASTQ files of (presumably) read alignments.
Collating the BAM files before converting them to FASTQ files has advantages and may indeed be necessary depending on your downstream analyses. Collation reorders BAM files so that all reads with the same name are grouped together. This is particularly relevant if your BAM files are made up of paired-end read alignments. Some downstream programs and pipelines expect collated FASTQ files, especially those that work with paired-end reads or, e.g., perform duplicate marking.
For people to help you, you may want to explain things in more depth. For example, what programs are you going to use and how? Do these programs have expectations for the FASTQ files supplied to them? (You may be able to find that info in the documentation.) What is in the BAM files? Are they made up of paired-end sequenced alignments? Are there unmapped reads in there too? Etc.
The best and recommended practice is to collate the BAM files prior to the conversion. However, you may want to try your work with FASTQ files converted from non-collated BAM files. If it's an issue, you'll find out soon enough.
https://nf-co.re/bamtofastq/2.1.0/
nf-core/bamtofastq is a bioinformatics best-practice analysis pipeline that converts (un)mapped .bam or .cram files into fq.gz files. Initially, it auto-detects, whether the input file contains single-end or paired-end reads. Following this step, the reads are sorted using samtools collate and extracted with samtools fastq. Furthermore, for mapped bam/cram files it is possible to only convert reads mapping to a specific region or chromosome. The obtained FastQ files can then be used to further process with other pipelines.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
cross posted: https://bioinformatics.stackexchange.com/questions/22697/