Question

Samtools merge/ "The File BAM exists" warning

0

Entering edit mode

2.4 years ago

Qboy ▴ 10

Dear community,

I am trying to convert BAM to FASTQ because the archived dataset is aligned to the older version of the reference genome. I need to realign to the new version. SRA is not available (I know, devastating experience).

I tried two different approaches and both of them show errors: Approach 1. Merge all BAM files into one (each BAM file represents alignment per chromosome). The error shows that some BAM files already exist and asks to forcefully overwrite.

Approach 2. Convert each BAM file separately and merge them together as FASTQ. By doing the second approach, I tried to skip the error in approach 1.

When I convert BAM to FASTQ in both approaches, I get millions of these warnings:

The error is

WARNING: Query [...] is marked as paired, but its mate does not occur next to it in your BAM file. Skipping.

Here is my code:

#Appoach 1:   
samtools merge -n -b all_sorted_merged.bam *.bam   
bedtools bamtofastq -i all_sorted_merged.bam -fq $1_1.fq -fq2 $1_2.fq   
gzip -c $1_*.fq > $1_\*.fastq.gz

#Approach 2:   
samtools sort -n -o $1.sorted.bam $1.bam    
bedtools bamtofastq -i $1.sorted.bam -fq $1_1.fq -fq2 $1_2.fq   
gzip -c $1_\*.fq > $1_\*.fastq.gz

I am learning bioinformatics myself so I would be glad for any of your comments. Thank you!

samtools • 2.1k views

ADD COMMENT • link updated 17 months ago by Ram 44k • written 2.4 years ago by Qboy ▴ 10

1

Entering edit mode

The error shows that some BAM files already exist and asks to forcefully overwrite.

That is strange. Unless you messed up some of the BAM files you should not get any errors with a simple merge.

Why are you not using samtools fastq to do the conversion to fastq? Take a look at in line help for samtools fastq.

ADD REPLY • link 2.4 years ago by GenoMax 147k

0

Entering edit mode

I found why is the warning in bedtools. The error shows up because the BAM files include Single Reads and Paired-End reads. So the warning is for single reads.

Is there any way to use extract-only Paired-End reads? I need only these data for further analysis.

ADD REPLY • link 2.4 years ago by Qboy ▴ 10

0

Entering edit mode

Thank you, I will look for the tool! I just did not know about it!

ADD REPLY • link 2.4 years ago by Qboy ▴ 10

score 1 · Answer 1 · 2022-07-14

1

Entering edit mode

2.4 years ago

jkbonfield ★ 1.3k

This is inefficient for several reasons.

Don't do pipelines this way with one step reading a file and writing a file, and the next step then reading a file and writing a file. It's very bad for I/O. It's also unnecessary CPU due to compression / decompression. Instead use UNIX pipes and output uncompressed BAM. As a bonus there are no intermediate files to clean up. (If you need them then "tee" can be used.)
Samtools has its own fastq tool which can output gz direct, so if your method of writing to fastq and then gzip externally is because bedtools cannot do this then I recommend switching. The compression can be parallelised too, and if you used libdeflate when building htslib (which you really should) it'll be significantly quicker even when single threaded.
Depending on your aligner and whether it makes assumption on the randomness of the first few reads vs everything else, then you may be able to use "samtools collate" instead of "samtools sort" to get your read-pairs together. Your input data is already in name order though so that decision was apparently made earlier, and maybe you already took this approach.

So something like this would probably do the trick in much less time:

samtools merge -n -o - -u *.name.bam | samtools fastq -@4 -1 1.fastq.gz -2 2.fastq.gz -s /dev/null

The -s option is there so the singletons get discarded and don't end up in one of the "1" or "2" files.

ADD COMMENT • link 2.4 years ago by jkbonfield ★ 1.3k

0

Entering edit mode

Dear jkbonfield,

Your script and recommendations worked perfectly. Thank you! I have one more question regarding the alignment. I am using a BWA-MEM aligner and my supervisor says it can use an aligned CRAM file and re-align to a new reference genome. I could not find any information about it and I see people generate FASTQ files first, and then do the alignment separately. I just wanted to ask if you know how to do it if you had such an experience of re-aligning with BWA.

Thank you,

ADD REPLY • link 2.4 years ago by Qboy ▴ 10

0

Entering edit mode

I would like to ask one more question. There are two FASTQ files generated. But they have a 4MB difference in size. I thought there should not be any difference in size. Do you have any recommendation or thought about this?

Thank you,

ADD REPLY • link 2.4 years ago by Qboy ▴ 10

1

Entering edit mode

Don't use file size as a QC metric. Depending on sequence read files can compress to different extent.

ADD REPLY • link 2.4 years ago by GenoMax 147k