Question

prefix extraction and preparation for mapping and variant calling

0

Entering edit mode

21 months ago

Human • 0

hello humans,

I am struggling with a bash script that should actually work as far as I can see. I need to extract prefixes of 262 files in a directory that contains reads. I will map them for later variance calling etc. However, whilst trouble shooting because the mapping always stuck, I found that actually not all prefixes got extracted properly. When I print the output to a .txt file I see that just like 200 sth are properly extracted and some with mistakes as well. And I don't know why... I think the problem might be this prefix extraction step, but maybe also my mapping line is not correct. maybe someone here knows how to fix this.

So my input files look like this. 262 files, with the naming convention:

AH-00001_S117_L001_R1_P_trimmed.001.fastq 
AH-00001_S117_L001_R2_P_trimmed.001.fastq
KewS01_S470_L001_R1_P_trimmed.001.fastq 
KewS01_S470_L001_R2_P_trimmed.001.fastq

(...)

My script looks like this :

#!/bin/bash
(..)
#SBATCH --array=1-262

filenameFWD=$(ls -1 *_L001_R1_P_trimmed.001.fastq | tail -n +${SLURM_ARRAY_TASK_ID} | head -1) 
filenameREV=$(ls -1 *_L001_R2_P_trimmed.001.fastq | tail -n +${SLURM_ARRAY_TASK_ID} | head -1) 

prefixFWD=$(echo $filenameFWD | sed 's/_L001_R1_P_trimmed\.001\.fastq$//')
prefixREV=$(echo $filenameREV | sed 's/_L001_R2_P_trimmed\.001\.fastq$//')

module load NextGenMap
module load SAMtools/1.3.1

ngm -r Reference.fasta -1 ${prefixFWD}L001_R1_P_trimmed.001.fastq -2 ${prefixREV}L001_R1_P_trimmed.001.fastq -t 12 -o Test1.sam

samtools view -bS Test1.sam | samtools sort -o Test1.bam
samtools index Test1.bam

why doesn't it work properly? And also, would you recommend to use set -o errexit set -o nounset in the beginning of the script?

Thanks in advance!

bash • 1.7k views

ADD COMMENT • link updated 21 months ago by Joe 21k • written 21 months ago by Human • 0

0

Entering edit mode

tail -n +${SLURM_ARRAY_TASK_ID}

whoa

ADD REPLY • link 21 months ago by Pierre Lindenbaum 164k

1

Entering edit mode

you should use a workflow manager like snakemake or nextflow. Your futur self will thank you

ADD REPLY • link 21 months ago by Pierre Lindenbaum 164k

0

Entering edit mode

what's wrong about that huh? :D

ADD REPLY • link 21 months ago by Human • 0

0

Entering edit mode

well, how do you manage things when 10 fastqs out of the 262 fastqs failed ?

ADD REPLY • link 21 months ago by Pierre Lindenbaum 164k

0

Entering edit mode

what would you suggest ?

ADD REPLY • link 21 months ago by Human • 0

1

Entering edit mode

you should use a workflow manager like snakemake or nextflow. Your futur self will thank you

ADD REPLY • link 21 months ago by Pierre Lindenbaum 164k

GenoMax · Answer 1 · 2023-03-17

1

Entering edit mode

21 months ago

Joe 21k

Don't rely on parsing the output of ls.

You can essentially just do:

for file in *.fastq ; do
    name="${file%_*_*_*_*}"
   #do stuff with the name here
done

Note this specific iteration will only work if there are always the same number of underscores, but you can still use the same approach with sed or other tools.

You don't need to capture R1 and R2 if you're then specifying the suffix in your command (which is probably why it isn't working). Either capture both, or capture just the common stem and be explicit with the suffixes when you use them downstream,.

ADD COMMENT • link 21 months ago by Joe 21k

0

Entering edit mode

hey, thanks. But could you explain what you mean? bash and sequence analysis workflows are rather new for me. how would you write the prefix suffix part to achieve what I tried to achieve? Thanks:)

ADD REPLY • link 21 months ago by Human • 0

0

Entering edit mode

You already tried it here:

ngm -r Reference.fasta -1 ${prefixFWD}L001_R1_P_trimmed.001.fastq -2 ${prefixREV}L001_R1_P_trimmed.001.fastq -t 12 -o Test1.sam

You are trying to smash the variable $prefixFWD to a string (suffix). Your syntax is a bit wrong, but the general idea is reasonably sound.

The steps are:

Take KewS01_S470_L001_R1_P_trimmed.fastq and strip part of the name away.
The bash operation I used above ("${file%_*_*_*_*}") applied to these strings will yield: KewS01_S470.
You can stitch strings together afterward: "$prefixFWd"_suffix.ext

This uses bash parameter expansion. I would suggest spending some time getting familiar with it if you plan to use bash for this.

https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html

ADD REPLY • link 21 months ago by Joe 21k

0

Entering edit mode

hey , thank you but how would I write the whole code then ? Do I have to define a suffix for FWD and REV or is it automatically stored in the suffix for each file respectively ? so can the code look like this?

#!/bin/bash
(...)

#SBATCH --array=1-262

for file in *.fastq ; do

prefix="${file%_*_*_*_*}"

done

module load NextGenMap

module load SAMtools/1.3.1

ngm -r TEASHIRT.fasta -1 "$prefix"_suffix.ext -t 12 -o Test1.sam

samtools view -bS Test1.sam | samtools sort -o Test1.bam

samtools index Test1.bam

ADD REPLY • link updated 21 months ago by GenoMax 148k • written 21 months ago by Human • 0

0

Entering edit mode

Or what do you think about this:

#!/bin/bash

(...)
#SBATCH --array=1-262

# Load required modules
module load NextGenMap
module load SAMtools/1.3.1

# Use the SLURM_ARRAY_TASK_ID to get the current file
file=$(ls *fastq | sed -n ${SLURM_ARRAY_TASK_ID}p)

# Extract the prefix of the filename
prefix=$(basename ${file} .fastq)

# Map reads to the reference genome using NextGenMap
ngm -r TEASHIRT.fasta -1 ${prefix}_suffix_1.fastq -2 ${prefix}_suffix_2.fastq -t 12 -o Test1.sam

# Convert the SAM file to BAM format and sort it
samtools view -bS Test1.sam | samtools sort -o Test1.bam

# Index the sorted BAM file
samtools index Test1.bam

ADD REPLY • link updated 21 months ago by GenoMax 148k • written 21 months ago by Human • 0

0

Entering edit mode

You are explicitly specifying the suffixes in your commands, so it's entirely up to you if you stick to this, or you wish to define them earlier and re-call them.

For example here:

ngm -r TEASHIRT.fasta -1 ${prefix}_suffix_1.fastq ...

You are explicitly defining the first suffix ("_suffix_1.fastq") in the command call.

ADD REPLY • link 21 months ago by Joe 21k