Question

Change ReadGroup name with BWA mem in a group of paired-end files

0

Entering edit mode

2.2 years ago

LDT ▴ 340

Dear all,

I have a group of 1000 files, and I want to change the Read Group (RG) name when I map to the reference with BWA. One thing that I do not get is.

The Read Group name refers to the BAM file.

Shall I indicate the names of the R1 and R2 files in the Read Group name, or would a neutral name be enough?

Here is an example: Here are the files

5126AD_L004_R1_001.fastq
5126AD_L004_R2_001.fastq
...
6100AD_L004_R1_001.fastq
6100AD_L004_R2_001.fastq

here is the code

for each in *R1_001.fastq
do 
bwa mem -t 20 -R "@RG\tID:${each%_R1_001.fastq}_id\tSM:${each%_R1_001.fastq}_id\tPL::ILLUMINA"  ${reference}  ${each}  ${each%_R1_001.fastq}_R2_001.fastq \ |
samtools sort -@20 -O BAM | tee ${each%_R1_001.fastq}.bam |\
samtools index - ${each%_R1_001.fastq}.bam.bai

Do I write RG names in correct way?

bam bwa bwa-mem • 656 views

ADD COMMENT • link updated 2.2 years ago by Istvan Albert 102k • written 2.2 years ago by LDT ▴ 340

score 3 · Accepted Answer · 2022-09-11

The most important thing is to stop ls or bash pattern matching in loops. Those are error-prone constructs and in addition, you need to constantly work backwards to generate simpler names.

Instead, collect the so-called root names of the study into a file. For example your sample.txt could contain:

sample1
sample2
sample3

Use this samples.txt file to build and run commands via GNU Parallel

Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them

Now, all of a sudden, instead of bending backward to generate the right names, you can simply use the root to construct the new name like so:

cat samples.txt | parallel "bwa mem ... genome.fa \
                            {}_R1_001.fq {}_R2_001.fq | samtools sort > {}.bam"

see how much simpler it is? Now to see what these commands generate but without running the commands put an echo in front and escape the bash redirects:

parallel echo "bwa mem -R '@RG\tID:{}\tSM:{}\tPL:ILLUMINA'  \
               genome.fa {}_R1_001.fq {}_R2_001.fq \| samtools sort \> {}.bam"

now you can see your commands that are about to execute

bwa mem -R @RG\tID:sample1\tSM:sample1\tPL:ILLUMINA genome.fa sample1_R1_001.fq sample1_R2_001.fq | samtools sort > sample1.bam
bwa mem -R @RG\tID:sample2\tSM:sample2\tPL:ILLUMINA genome.fa sample2_R1_001.fq sample2_R2_001.fq | samtools sort > sample2.bam
bwa mem -R @RG\tID:sample3\tSM:sample3\tPL:ILLUMINA genome.fa sample3_R1_001.fq sample3_R2_001.fq | samtools sort > sample3.bam

you can run the first command alone to see if it does what you think it should be doing. If it does the right thing you can remove the echo and the escapes and have run in parallel.

Plus GNU parallel can use multiple CPUs and actually run things in parallel.