How can I convert multiple bam files in a directory into fastq files using samtools bam2fq?
2
0
Entering edit mode
6 months ago
Arthur ▴ 10

How can I convert multiple bam files in a directory into fastq files using samtools bam2fq? I use bash on linux.

samtools bam fastq • 2.2k views
ADD COMMENT
1
Entering edit mode

cross posted: https://bioinformatics.stackexchange.com/questions/22697/

Do not post on multiple forums - that is just bad etiquette. What's worse, you did not link between the two so no one knows that you're asking two sets of online volunteers to spend their time on your problem without telling them that you're also asking the other group.

ADD REPLY
3
Entering edit mode
6 months ago
kalavattam ▴ 280

You could perhaps run a simple for loop over the BAM files in your directory. Here is example shell (e.g., Bash) code:

#  Change to the directory of interest
cd path/to/directory/of/interest || echo "cd'ing failed"

#  Loop over all BAM files in the directory
for bam in *.bam; do
    #  Collate and convert each BAM file to a FASTQ file using samtools
    samtools collate "${bam}" -o "${bam%.bam}.collated.bam"
    samtools bam2fq "${bam%.bam}.collated.bam" > "${bam/.bam/.fastq}"
done
  • The code starts by changing to the directory containing your BAM files with cd path/to/directory/of/interest. If the cd command fails, then the code after || is run, printing the message "cd'ing failed".
  • The for loop iterates over each BAM file in the directory (for bam in *.bam). The use of * here is an example of a Bash glob.
  • For each BAM file (${bam}), the script uses samtools bam2fq to convert it to a FASTQ file.
  • The name of the collated BAM file is generated by stripping the extension .bam (%.bam within the pointy brackets; see parameter expansion for more details) and then appending the new extension .collated.bam.
  • The name of the FASTQ outfile is generated by replacing the .bam extension with .fastq ("${bam/.bam/.fastq}"). (See parameter expansion for more details.)

Also, looping over each file one-by-one to call samtools collate and then samtools bam2fq is going to be pretty slow. You can speed things up by distributing the commands across multiple cores using GNU Parallel:

parallel -j 4 '
    samtools collate {} -o {.}.collated.bam &&
    samtools bam2fq {.}.collated.bam > {.}.fastq
' ::: *.bam
  • -j 4 specifies the number of jobs to run in parallel. Adjust this number based on the number of CPU cores available on your machine.
  • {} is a placeholder for the input file (each BAM file).
  • {.} is a placeholder for the input file without the extension (e.g., file.bam becomes file).
  • && means run the following code (samtools bam2fq {.}.collated.bam > {.}.fastq) if the preceding code (samtools collate {} -o {.}.collated.bam) ran successfully.
  • ::: *.bam: This part tells parallel to use all BAM files in the current directory as input.
ADD COMMENT
0
Entering edit mode

Brilliant, thanks kalavattam. I couldn't find this simple code anywhere online for this example.

ADD REPLY
0
Entering edit mode

kalavattam sorry, but the code is wrong, the bam should be first processed with 'samtools collate'.

https://x.com/lh3lh3/status/1132756684789768202

ADD REPLY
0
Entering edit mode

Thank you, yes. One should run samtools collate or samtools sort -n prior to samtools bam2fq. I've edited the answer to reflect this. (samtools bam2fq will work on a coordinate-sorted BAM file, but the read alignments are out of order, and paired read alignments are generally not with each other.)

ADD REPLY
0
Entering edit mode

One issue is that the collated Sam file doesn't then go on to become a fastq file, and the job stops at the collated Sam file.

ADD REPLY
0
Entering edit mode

Are any error messages reported? If so, it may be helpful to post them here for troubleshooting. If you decide to share the error messages, please format them (along with any code) using the icon above the text box that looks like this:

101
010

Alternatively, you can enclose the text and/or code with three backticks (```), like this:
```
Error message and/or code
```

which will become

Error message and/or code

This kind of formatting will make it much easier for people to read the messages and code, and potentially help you troubleshoot the issue.

ADD REPLY
0
Entering edit mode

No errors, it just completes the collated bit but doesn't go on to create the fastq. Was this all meant happen in the same job? I used the first code you wrote

for bam in *.bam; do
    samtools collate "${bam}" -o "${bam%.bam}.collated.bam"
    samtools bam2fq "${bam%.bam}.collated.bam" > "${bam/.bam/.fastq}"
done

I don't think it is finding the collated bam file, which needs to be distinguished from the original bam file within the same directory.

ADD REPLY
0
Entering edit mode

If you're running the code as a one-liner, which is how you pasted it,

for bam in *.bam; do samtools collate "${bam}" -o "${bam%.bam}.collated.bam" samtools bam2fq "${bam%.bam}.collated.bam" > "${bam/.bam/.fastq}" done

then the two commands are being interpreted as one, and the second one (samtools bam2fq) is not being run.

If you're going to run the code as a one-liner, you need to separate the two commands. One option is to use a semicolon (;); another is to use && (explained above). I'd use &&. Also, you need to place a semicolon before the done operator. The syntax changes if you run the code on one line.

For example,

for bam in *.bam; do samtools collate "${bam}" -o "${bam%.bam}.collated.bam" && samtools bam2fq "${bam%.bam}.collated.bam" > "${bam/.bam/.fastq}"; done
ADD REPLY
0
Entering edit mode

I am now getting an error message:-

[E::bgzf_flush] File write failed (wrong size)
samtools collate: Couldn't write to intermediate file "/tmp/collate46003.0059.bam": No space left on device
ADD REPLY
0
Entering edit mode

The message is telling you that there is "No space left on device". That means there's too little space on your hard drive.

In that case, you may want to delete some other files to free up space. If you don't want to delete files, then you could consider moving them from the drive you're working on to another one.

ADD REPLY
0
Entering edit mode

It is the uni storage via their HPC server, so there should not be a shortage of space. I have managed far bigger files than the tmp files that won't save now. When I converted the BAM files to Fastq it was fine. The issue arises if I collate first, due to the tmp files.

ADD REPLY
0
Entering edit mode

It is the uni storage via their HPC server, so there should not be a shortage of space.

Space on servers is not a single, monolithic thing.

Looking at the error message,

[E::bgzf_flush] File write failed (wrong size)
samtools collate: Couldn't write to intermediate file "/tmp/collate46003.0059.bam": No space left on device

we see that this operation is taking place in a temporary directory (tmp) within the root (/) directory of your server. It's likely that this specific directory, /tmp, has insufficient space. The people who maintain these servers tend to allocate only so much space for these kinds of shared temporary locations. So, it may be at or near capacity.

By default, many of the utilities in the samtools software (such as samtools sort and samtools collate) use /tmp for writing temporary intermediate files. However, you can change the location for temporary output.

First, let's look at the help documentation for samtools collate:

Usage: samtools collate [options...] <in.bam> [<prefix>]

Options:
      -O       Output to stdout
      -o       Output file name (use prefix if not set)
      -u       Uncompressed BAM output
      -f       Fast (only primary alignments)
      -r       Working reads stored (with -f) [10000]
      -l INT   Compression level [1]
      -n INT   Number of temporary files [64]
      -T PREFIX
               Write temporary files to PREFIX.nnnn.bam
      --no-PG  do not add a PG line
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]
  -@, --threads INT
               Number of additional threads to use [0]
      --verbosity INT
               Set level of verbosity
  <prefix> is required unless the -o or -O options are used.

Note the -T argument, which you can use to change where the temporary files are being written, e.g.,

cd path/to/directory/of/interest || echo "cd'ing failed"

for bam in *.bam; do
    samtools collate \
        -o "${bam%.bam}.collated.bam" \
        -T "$(pwd)/temporary" \
        "${bam}"

    samtools bam2fq "${bam%.bam}.collated.bam" > "${bam/.bam/.fastq}"
done

In the example above, I use -T "$(pwd)/temporary" to write the temporary intermediate files to the current working directory (the one containing your BAM files of interest). However, this is not really good practice because HPC servers are a shared resource. Depending on how things are set up on the server, how many BAM files you need to process, how large the BAM files are, and so on, continually reading from and writing to this kind of working directory may slow down the work of other people using the HPC.

Does your lab have its own temporary directory in a non-/tmp location? This is often the case for university research computing environments, and these directories tend to be set up to minimize the impact on other people using the server. If so, then it would be better to use that, e.g.,

cd path/to/directory/of/interest || echo "cd'ing failed"

for bam in *.bam; do
    samtools collate \
        -o "${bam%.bam}.collated.bam" \
        -T "path/to/temporary/directory/for/your/lab/prefix" \
        "${bam}"

    samtools bam2fq "${bam%.bam}.collated.bam" > "${bam/.bam/.fastq}"
done

Also, note that when using -T, you need to end with a prefix.

More explanation: The \ symbols in the above Bash code are used as line continuation characters. They allow you to split a long command across multiple lines for better readability. When Bash encounters a \ at the end of a line, it knows to continue reading the next line as part of the same command; if it does not encounter this, then it assumes the command is completed. Above, the \ symbols allow the samtools collate command to be split over three lines, improving readability. Using these symbols is useful for relatively complex commands with multiple options or arguments, making the code easier to read.

If you want to run the code as a (long) one-liner, then see the following:

cd path/to/directory/of/interest || echo "cd'ing failed"; for bam in *.bam; do samtools collate -o "${bam%.bam}.collated.bam" -T "path/to/temporary/directory/for/your/lab/prefix" "${bam}" && samtools bam2fq "${bam%.bam}.collated.bam" > "${bam/.bam/.fastq}"; done
ADD REPLY
0
Entering edit mode

I now get the below error message:-

collate: invalid option -- 'T'
Usage: samtools collate [-Ou] [-o <name>] [-n nFiles] [-l cLevel] <in.bam> [<prefix>]

I don't know how to change the above colour to black text.

ADD REPLY
0
Entering edit mode

What version of samtools are you using? Provide output of samtools --version. -T is a valid option in latest versions.

ADD REPLY
0
Entering edit mode

To troubleshoot this, please paste the exact whole command you ran on the command line (enclosed by triple backticks as explained above).

Also, please paste the output from running samtools collate without any options on the command line, e.g., like this:

samtools collate

Also, please paste what is returned by running samtools --version, e.g., like this:

samtools --version
ADD REPLY
0
Entering edit mode

Before I go any further, may I ask if it is necessary to collate the bam file first, or can that be done at a later stage after converting the bam file to fastq?

ADD REPLY
0
Entering edit mode

As I note here, samtools bam2fq will work with coordinate-sorted (or otherwise unsorted) BAM files. So, it becomes a matter of what analyses you want to do with these FASTQ files of (presumably) read alignments.

Collating the BAM files before converting them to FASTQ files has advantages and may indeed be necessary depending on your downstream analyses. Collation reorders BAM files so that all reads with the same name are grouped together. This is particularly relevant if your BAM files are made up of paired-end read alignments. Some downstream programs and pipelines expect collated FASTQ files, especially those that work with paired-end reads or, e.g., perform duplicate marking.

ADD REPLY
0
Entering edit mode

I want to find variants in the exome data.

ADD REPLY
0
Entering edit mode

For people to help you, you may want to explain things in more depth. For example, what programs are you going to use and how? Do these programs have expectations for the FASTQ files supplied to them? (You may be able to find that info in the documentation.) What is in the BAM files? Are they made up of paired-end sequenced alignments? Are there unmapped reads in there too? Etc.

The best and recommended practice is to collate the BAM files prior to the conversion. However, you may want to try your work with FASTQ files converted from non-collated BAM files. If it's an issue, you'll find out soon enough.

ADD REPLY
0
Entering edit mode

I finally got it to work! Thanks so such!!!

ADD REPLY
0
Entering edit mode

Thank you to everyone (Kalavattam especially) who helped with my query. I have solved it now.

ADD REPLY
0
Entering edit mode

Use three backticks, not just one. That's how you get a code block.

ADD REPLY
0
Entering edit mode

kalavattam told you how to format code and you did not follow it. Please at least try to follow the advice people give you - it makes life easy for everyone.

ADD REPLY
0
Entering edit mode
6 months ago

https://nf-co.re/bamtofastq/2.1.0/

nf-core/bamtofastq is a bioinformatics best-practice analysis pipeline that converts (un)mapped .bam or .cram files into fq.gz files. Initially, it auto-detects, whether the input file contains single-end or paired-end reads. Following this step, the reads are sorted using samtools collate and extracted with samtools fastq. Furthermore, for mapped bam/cram files it is possible to only convert reads mapping to a specific region or chromosome. The obtained FastQ files can then be used to further process with other pipelines.

ADD COMMENT

Login before adding your answer.

Traffic: 3371 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6