How do I use a bash loop to map with Salmon on files in multiple folders?
4.5 years ago
n.tear ▴ 80

Hi,

I want to do a bash loop to get salmon to map files in a number of folders. I know how to do this for all the files in the current working directory:

for f in `ls *.fq.gz | sed 's/_[12].fq.gz//g' | sort -u`;
do
echo "Processing sample $f"
salmon quant -i gencode.v33.transcripts.index -l A -1 ${f}_1.fq.gz -2 ${f}_2.fq.gz -p 6 --gcBias --validateMappings -o quants/${f}_quant
done

But how do I do this so that it uses the FASTQ files in the following folders?

Y:/N/E/R/B/F/ and Z:/N/R/X/r/ (for example)

Many thanks

RNA-Seq mapping bash loop • 4.4k views

try gnu-parallel or bash arrays.
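
A minimal sketch of the bash-array route, assuming paired files named `<sample>_1.fq.gz` and `<sample>_2.fq.gz` as in the question. The folder names `run_a` and `run_b` and the sample names are hypothetical placeholders created in a temporary directory, and the loop only records and prints the salmon commands (a dry run) instead of executing them.

```shell
#!/usr/bin/env bash
# Demo setup: a throwaway tree with empty placeholder files (hypothetical names).
demo=$(mktemp -d)
mkdir -p "$demo/run_a" "$demo/run_b"
touch "$demo/run_a/s1_1.fq.gz" "$demo/run_a/s1_2.fq.gz" \
      "$demo/run_b/s2_1.fq.gz" "$demo/run_b/s2_2.fq.gz"

# The folders to search go into an array, one element per folder.
dirs=("$demo/run_a" "$demo/run_b")

cmds=()
for d in "${dirs[@]}"; do
  for r1 in "$d"/*_1.fq.gz; do
    [ -e "$r1" ] || continue            # skip folders with no matching files
    r2="${r1%_1.fq.gz}_2.fq.gz"         # the mate file sits next to R1
    sample=$(basename "$r1" _1.fq.gz)   # e.g. s1
    # Dry run: record the command instead of executing it.
    cmds+=("salmon quant -i gencode.v33.transcripts.index -l A -1 $r1 -2 $r2 -p 6 --gcBias --validateMappings -o quants/${sample}_quant")
  done
done

printf '%s\n' "${cmds[@]}"
```

Once the printed commands look right, the `cmds+=(...)` line can be swapped for the real `salmon quant` call.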


Using plain bash? You'd be better off learning how to work with a workflow manager like nextflow / snakemake...

Otherwise, use find dir1 dir2 -type f -name "*.fq.gz" instead of ls.
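
A sketch of how the find suggestion could replace ls in the original loop, assuming the `_1.fq.gz` / `_2.fq.gz` naming from the question. `dir1`, `dir2` and the sample names are hypothetical placeholders created under a temporary directory, and the salmon command is only echoed, not run.

```shell
# Demo setup: empty placeholder files in two folders (hypothetical names).
demo=$(mktemp -d)
mkdir -p "$demo/dir1" "$demo/dir2/nested"
touch "$demo/dir1/s1_1.fq.gz" "$demo/dir1/s1_2.fq.gz" \
      "$demo/dir2/nested/s2_1.fq.gz" "$demo/dir2/nested/s2_2.fq.gz"

# find searches both folders recursively; keeping only the R1 files and
# stripping the suffix gives one path prefix per sample.
prefixes=$(find "$demo/dir1" "$demo/dir2" -type f -name "*_1.fq.gz" \
             | sed 's/_1\.fq\.gz$//' | sort -u)

# Unquoted expansion splits on whitespace, so this assumes no spaces in paths.
for f in $prefixes; do
  echo "Processing sample $f"
  echo "salmon quant -i gencode.v33.transcripts.index -l A -1 ${f}_1.fq.gz -2 ${f}_2.fq.gz -p 6 --gcBias --validateMappings -o quants/$(basename "$f")_quant"
done
```

Because each prefix carries its folder, the same loop works no matter how many folders are passed to find.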


Thanks, I'll look into snakemake.

4.5 years ago

Don't feel that you have to use a workflow manager for something simple like this.

Using find or ls (or something else), get your input files into ordered lists, like this:

cat R1.list
/scratch/ngs_collab/shared/targeted_seq/fastq/57c_S1_L001_R1_001.fastq.gz
/scratch/ngs_collab/shared/targeted_seq/fastq/58c_S2_L001_R1_001.fastq.gz
/scratch/ngs_collab/shared/targeted_seq/fastq/59c_S3_L001_R1_001.fastq.gz

cat R2.list
/scratch/ngs_collab/shared/targeted_seq/fastq/57c_S1_L001_R2_001.fastq.gz
/scratch/ngs_collab/shared/targeted_seq/fastq/58c_S2_L001_R2_001.fastq.gz
/scratch/ngs_collab/shared/targeted_seq/fastq/59c_S3_L001_R2_001.fastq.gz

Then run a loop:

  mkdir -p out/ ;
  mkdir -p out/salmon/ ;

  paste R1.list R2.list | while read R1 R2 ;
  do
    echo -e "\n" ;
    echo -e "--Input file(s) is/are:\t""${R1}"",""${R2}" ;
    outdir=$(echo "${R2}" | cut -f7 -d"/" | cut -f1 -d"_") ;
    echo -e "--Output directory is:\tout/salmon/""${outdir}" ;
    mkdir -p "out/salmon/""${outdir}" ;

    salmon-latest_linux_x86_64/bin/salmon quant \
      --index=library/targeted_seq \
      --threads=2 \
      --seqBias \
      --gcBias \
      --libType=A \
      -1 "${R1}" \
      -2 "${R2}" \
      --reduceGCMemory \
      --validateMappings \
      --output="out/salmon/""${outdir}" ;

    echo "--Done." ;
  done

You'll have to modify the outdir line to suit your own data.
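
To illustrate why: `cut -f7 -d"/"` picks the seventh `/`-separated field, which is only the file name when the path has exactly this depth. The sketch below compares it with a depth-independent alternative based on `basename`; the variable names `by_field` and `by_basename` are made up for the example, and the path is one from the lists above.

```shell
R2="/scratch/ngs_collab/shared/targeted_seq/fastq/57c_S1_L001_R2_001.fastq.gz"

# Approach from the loop: 7th '/'-separated field, then the text before the first '_'.
by_field=$(echo "${R2}" | cut -f7 -d"/" | cut -f1 -d"_")

# Alternative that does not depend on how deep the file sits:
name=$(basename "${R2}")      # 57c_S1_L001_R2_001.fastq.gz
by_basename="${name%%_*}"     # drop everything from the first '_' onwards

echo "cut: ${by_field}  basename: ${by_basename}"
```

Both give the same sample name for this path; only the `basename` version keeps working if the files move to a different directory depth.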

Kevin


How do you do the first part?


For example, from the current working directory (and from where the script will be run):

find . -name "*.fastq.gz" | grep -e "_R1" | sort > R1.list ;
find . -name "*.fastq.gz" | grep -e "_R2" | sort > R2.list ;

This will look recursively through all directories under the current working one.
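
Since the loop pairs the two lists line by line, it may be worth checking that they line up before running salmon. A small sketch, assuming the `_R1_` / `_R2_` naming shown above; the list contents are hypothetical and written to a temporary directory, and `pairs.tsv` and `mismatches` are names invented for the example.

```shell
# Demo setup: two small lists in a temporary directory (hypothetical contents).
demo=$(mktemp -d)
printf '%s\n' ./fastq/57c_S1_L001_R1_001.fastq.gz \
              ./fastq/58c_S2_L001_R1_001.fastq.gz > "$demo/R1.list"
printf '%s\n' ./fastq/57c_S1_L001_R2_001.fastq.gz \
              ./fastq/58c_S2_L001_R2_001.fastq.gz > "$demo/R2.list"

# Put the lists side by side, then check that each R2 path is the
# R1 path with _R1_ swapped for _R2_.
paste "$demo/R1.list" "$demo/R2.list" > "$demo/pairs.tsv"

mismatches=0
while read -r R1 R2; do
  expected=$(echo "$R1" | sed 's/_R1_/_R2_/')
  if [ "$expected" != "$R2" ]; then
    echo "Mismatched pair: $R1 / $R2" >&2
    mismatches=$((mismatches + 1))
  fi
done < "$demo/pairs.tsv"

echo "Mismatched pairs: $mismatches"
```

The `sed` swap replaces the first `_R1_` on the line, so this assumes no directory name contains that string.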

4.5 years ago

Hi,

You can do a nested loop: the first loop checks each folder and the second does the salmon alignment.

Your folder names are a bit unusual, with : and /. Therefore, I will use three example folder names: folder_1, folder_2, folder_3.

# quote the list so all three names end up in the variable
folder_list="folder_1 folder_2 folder_3"
for folder in $folder_list;
    do
    # enter the current folder of the outer loop
    cd "$folder";
    for f in `ls *.fq.gz | sed 's/_[12].fq.gz//g' | sort -u`;
        do
        echo "Processing sample $f"
        salmon quant -i gencode.v33.transcripts.index -l A -1 ${f}_1.fq.gz -2 ${f}_2.fq.gz -p 6 --gcBias --validateMappings -o quants/${f}_quant
    done;
    cd ..;
done;

Since I didn't want to change your for loop, I wrapped it in an outer loop that goes through folder_1, folder_2 and folder_3. At each step it enters the current folder, runs your loop, then goes back up and enters the next folder, and so on. This assumes that you're running your script in the directory that contains folder_1, folder_2 and folder_3. Note that relative paths in the salmon command (the index and quants/) are resolved inside each folder, so the index path may need to be absolute.
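
A variation on the same idea, sketched under the same naming assumptions: running the inner loop in a subshell `( ... )` means the `cd` is undone automatically, so a failed `cd` cannot leave the script in the wrong directory. The folders and files here are empty placeholders in a temporary directory, `plan.txt` is a name invented for the example, and the salmon call is replaced by an echo.

```shell
# Demo setup: two folders with empty placeholder files (hypothetical names).
demo=$(mktemp -d)
mkdir -p "$demo/folder_1" "$demo/folder_2"
touch "$demo/folder_1/a_1.fq.gz" "$demo/folder_1/a_2.fq.gz" \
      "$demo/folder_2/b_1.fq.gz" "$demo/folder_2/b_2.fq.gz"
cd "$demo" || exit 1
start_dir=$(pwd)

for folder in folder_1 folder_2; do
  (
    # The cd only affects this subshell; no "cd .." is needed afterwards.
    cd "$folder" || exit 1
    for r1 in *_1.fq.gz; do
      [ -e "$r1" ] || continue
      f="${r1%_1.fq.gz}"
      echo "would run salmon on ${folder}/${f}_1.fq.gz and ${folder}/${f}_2.fq.gz"
    done
  )
done > "$demo/plan.txt"

cat "$demo/plan.txt"
end_dir=$(pwd)
```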

I think this is one way to do what you want. That said, as mentioned by @Pierre Lindenbaum, a workflow manager such as nextflow or snakemake is the way to go (though I have never used one myself).

I hope this helps.

António

4.5 years ago
ole.tange ★ 4.5k

Make a function that takes a dir and the base filename as arguments:

doit() {
  dir="$1"
  f="$2"
  cd "$dir"
  echo "Processing sample $f"
  salmon quant -i gencode.v33.transcripts.index -l A -1 ${f}_1.fq.gz -2 ${f}_2.fq.gz -p 6 --gcBias --validateMappings -o quants/${f}_quant
}

Test that the function works:

doit path/to/a/dir base-filename

When it does, check that this gives the correct names to be used as input:

parallel echo {//} '{= s:.*/::;s:_\d+.(fastq|fq).gz::; =}' ::: path/to/folder1/*_1.fq.gz path/to/folder2/*_1.fq.gz

When that gives the right input, make a dry run:

export -f doit
parallel --dry-run doit {//} '{= s:.*/::;s:_\d+.(fastq|fq).gz::; =}' ::: path/to/folder1/*_1.fq.gz path/to/folder2/*_1.fq.gz

When that shows the correct commands, execute them:

parallel doit {//} '{= s:.*/::;s:_\d+.(fastq|fq).gz::; =}' ::: path/to/folder1/*_1.fq.gz path/to/folder2/*_1.fq.gz
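
For readers unfamiliar with the replacement strings: `{//}` is the directory part of each input, and the `{= ... =}` Perl expression strips the directory and the `_1.fq.gz` style suffix. A plain-shell sketch of the same two values for a single, hypothetical file name (the `sed` pattern escapes the dots, which is slightly stricter than the Perl expression above):

```shell
path="path/to/folder1/sampleA_1.fq.gz"   # hypothetical input file

dir=$(dirname "$path")                   # what {//} expands to
# Strip the directory, then the _<digits>.fq.gz or _<digits>.fastq.gz suffix.
base=$(basename "$path" | sed -E 's/_[0-9]+\.(fastq|fq)\.gz$//')

echo "doit $dir $base"
```

These are the two arguments `doit` receives for that file.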