Merging paired files by names using samtools
1
0
Entering edit mode
2.2 years ago
AST • 0

Hi,

I have multiple bam files in a folder, for example, there are two pairs (as marked in bold) with each file of pair being R1 and R2:

GC121284_ATCACGTT.220830.R1_trimmed.bam

GC121284_ATCACGTT.220830.R2_trimmed.bam

GC121284_TAGCTTGT.220830.R1_trimmed.bam

GC121284_TAGCTTGT.220830.R2_trimmed.bam

I am trying to write a bash script to identify the paired files (according to their barcode in bold) and merge them using samtools.

I would be grateful if someone could please suggest, how to write the script.

bam shell samtools • 1.6k views
ADD COMMENT
1
Entering edit mode

Before investing any effort, please confirm that separate alignment of paired-end data is indeed what you want and need.

ADD REPLY
0
Entering edit mode

These files are aligned files, I only need to merge these files like:

samtools merge -@ 8 -o GC121284_ATCACGTT_trimmed_merged.bam -n GC121284_ATCACGTT.220830.R1_trimmed.bam GC121284_ATCACGTT.220830.R2_trimmed.bam

But without writing the same script for every pair. Rather I am trying to write a script for samtools to identify the paired files (according to their barcode in bold) and merge them as separate file per barcode.

ADD REPLY
2
Entering edit mode
2.2 years ago
ATpoint 85k

As said in my comment, it is on you to ensure that separate alignment of R1 and R2 is meaningful. Anyway:

# GNU parallel
ls *.R1_trimmed.bam \
| awk -F ".R1_trimmed.bam" '{print $1}' \
| parallel -j <njobs> "samtools merge -o {}_merged.bam {}.R1_trimmed.bam {}.R2_trimmed.bam"

# Loop
ls *.R1_trimmed.bam \
| awk -F ".R1_trimmed.bam" '{print $1}' \
| while read p; do samtools merge ${p}_merged.bam ${p}.R1_trimmed.bam ${p}.R2_trimmed.bam; done < /dev/stdin
ADD COMMENT
1
Entering edit mode

You can condense it a bit if you are using GNU parallel.

parallel --dry-run --link -j <njobs> samtools merge -o {=1 s/.R1_trimmed.bam/_merged.bam/ =} {1} {2} ::: *.R1_* ::: *.R2_*
ADD REPLY
0
Entering edit mode

Thanks for the code. However, to me, it seems that the awk command is not functioning properly:

$ ls *.R1_trimmed.bam

GC121284_ATCACGTT.220330.R1_trimmed.bam GC121284_TAGCTTGT.220330.R1_trimmed.bam GC121284_TCCGTCTT.220330.R1_trimmed.bam

$ ls *.R1_trimmed.bam | awk -F ".R1_trimmed.bam" '{print $2}' >

$ ls *.R1_trimmed.bam | awk -F ".R1_trimmed.bam" '{print $2}' | parallel --dryrun -j 2 "samtools merge -o {}_merged.bam {}.R1_trimmed.bam {}.R2_trimmed.bam"

samtools merge -o ''_merged.bam ''.R1_trimmed.bam ''.R2_trimmed.bam

samtools merge -o ''_merged.bam ''.R1_trimmed.bam ''.R2_trimmed.bam

samtools merge -o ''_merged.bam ''.R1_trimmed.bam ''.R2_trimmed.bam

ADD REPLY
0
Entering edit mode

Ah sorry, I printed $2 but must be $1, fixed. So essentially the awk strips the suffix to get the unique file name and this then goes into the parallel/merge.

ADD REPLY
0
Entering edit mode

Thank you so much for helping out.

ADD REPLY
0
Entering edit mode

This too is giving a weird output file name which is causing an error:

$ parallel --dry-run --link -j 2 samtools merge -o {1= s/.R1_trimmed.bam/_merged.bam/ =} {1} {2} ::: .R1_ ::: .R2_

samtools merge -o {1= s/.R1_trimmed.bam/_merged.bam/ =} GC121284_ATCACGTT.220330.R1_trimmed.bam GC121284_ATCACGTT.220330.R2_trimmed.bam

samtools merge -o {1= s/.R1_trimmed.bam/_merged.bam/ =} GC121284_TAGCTTGT.220330.R1_trimmed.bam GC121284_TAGCTTGT.220330.R2_trimmed.bam

samtools merge -o {1= s/.R1_trimmed.bam/_merged.bam/ =} GC121284_TCCGTCTT.220330.R1_trimmed.bam GC121284_TCCGTCTT.220330.R2_trimmed.bam

ADD REPLY
0
Entering edit mode

I edited the code to fix it and confirmed it now works.

ADD REPLY
0
Entering edit mode

The output is still weird i.e. there are two extra (unwanted) outputs per merged file (in bold):

$ parallel --dry-run --link -j 2 samtools merge -o {=1 s/.R1_trimmed.bam/_merged.bam/ =} {1} {2} ::: .R1_ ::: .R2_

samtools merge -o GC121284_ATCACGTT.220330_merged.bam GC121284_ATCACGTT.220330.R1_trimmed.bam GC121284_ATCACGTT.220330.R2_trimmed.bam

samtools merge -o GC121284_ATCACGTT.220330_merged.bam_merged.bam GC121284_ATCACGTT.220330.R1_trimmed.bam_merged.bam GC121284_TAGCTTGT.220330.R2_trimmed.bam

samtools merge -o GC121284_ATCACGTT.220330_merged.bam_sorted.bam GC121284_ATCACGTT.220330.R1_trimmed.bam_sorted.bam GC121284_TCCGTCTT.220330.R2_trimmed.bam

samtools merge -o GC121284_TAGCTTGT.220330_merged.bam GC121284_TAGCTTGT.220330.R1_trimmed.bam GC121284_ATCACGTT.220330.R2_trimmed.bam

samtools merge -o GC121284_TAGCTTGT.220330_merged.bam_merged.bam GC121284_TAGCTTGT.220330.R1_trimmed.bam_merged.bam GC121284_TAGCTTGT.220330.R2_trimmed.bam

samtools merge -o GC121284_TAGCTTGT.220330_merged.bam_sorted.bam GC121284_TAGCTTGT.220330.R1_trimmed.bam_sorted.bam GC121284_TCCGTCTT.220330.R2_trimmed.bam

samtools merge -o GC121284_TCCGTCTT.220330_merged.bam GC121284_TCCGTCTT.220330.R1_trimmed.bam GC121284_ATCACGTT.220330.R2_trimmed.bam

samtools merge -o GC121284_TCCGTCTT.220330_merged.bam_merged.bam GC121284_TCCGTCTT.220330.R1_trimmed.bam_merged.bam GC121284_TAGCTTGT.220330.R2_trimmed.bam

samtools merge -o GC121284_TCCGTCTT.220330_merged.bam_sorted.bam GC121284_TCCGTCTT.220330.R1_trimmed.bam_sorted.bam GC121284_TCCGTCTT.220330.R2_trimmed.bam

ADD REPLY

Login before adding your answer.

Traffic: 2369 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6