Developing script that loops through multiple reference genomes...
1
0
Entering edit mode
4.4 years ago
Ada ▴ 10

I am in need developing a script that loops through multiple reference genomes, runs the alignment software for each genome (one by one), extracts and saves the results (only alignment %) from the sam file, and deletes the sam file after execution.

My input would be paired data.

alignment genome gene • 1.6k views
ADD COMMENT
0
Entering edit mode

bbsplit is meant to be used with multiple genomes at the same time.

bbsplit.sh in1=reads_R1.fq in2=reads_R2.fq  ref=ecoli.fa,salmonella.fa basename=out_%_#.fq outu=clean.fq

You could put a loop around that to deal with multiple samples:

  for i in *R1*.fq.gz
  do
    name = $(basename ${i} .fq.gz)
    bbsplit.sh in1=${name}_R1.fq.gz in2=${name}_R2.fq.gz  \
    ref=ecoli.fa,salmonella.fa,something.fa \ 
    basename=${name}_%_#.fq outu=${name}_clean.fq 
  done
ADD REPLY
0
Entering edit mode

Do you mean I can do more than 1 paired-read at a time?

ADD REPLY
0
Entering edit mode

Not in one run. One sample at one time against multiple genomes. Doing more than 1 sample would not make logical sense. If you have access to a compute cluster you could start many such jobs in parallel.

ADD REPLY
0
Entering edit mode

Interesting. But I would like to extract alignment stats from bbtools output. How do I go about doing so without producing .sam file (is that possible)?

ADD REPLY
1
Entering edit mode

By default bbsplit.sh will produce fastq files (it can do SAM/BAM files, but that is not recommended if you need to visualize the data). If you do not provide out* directive then you will only get the stats.

scafstats=<file>     Write statistics on how many reads mapped to which scaffold to this file.
refstats=<file>      Write statistics on how many reads were assigned to which reference to this file.
                     Unmapped reads whose mate mapped to a reference are considered assigned and will be counted.
ADD REPLY
0
Entering edit mode

If I do one sample at a time, then is it really necessary to run a loop around multiple samples?

ADD REPLY
0
Entering edit mode

Does this loop run through the reference genomes one at a time or the samples 1 at a time? It looks like the latter.

ADD REPLY
1
Entering edit mode

The looping is using the samples. bbsplit.sh will align each read to each of to the reference genomes and will output reads to a file for the reference they best match. So, not exactly what you wanted, but likely still useful.

ADD REPLY
0
Entering edit mode

Ada : Since you changed the content of the original post significantly, above set of comments now appear to be off-topic.

ADD REPLY
0
Entering edit mode

in bash, using parallel:

$ tree
.
├── file1.fa
├── file2.fa
├── file3.fa
├── test1_R1.fastq
├── test1_R2.fastq
├── test2_R1.fastq
├── test2_R2.fastq
├── test3_R1.fastq
└── test3_R2.fastq

0 directories, 9 files

$ Reference_genomes=($(find . -type f -name "*.fa"))

parallel echo {} {=s/R1/R2/=} $(IFS=, ; echo "${Reference_genomes[*]}")  ::: *R1.fastq

test1_R1.fastq test1_R2.fastq ./file1.fa,./file2.fa,./file3.fa
test2_R1.fastq test2_R2.fastq ./file1.fa,./file2.fa,./file3.fa
test3_R1.fastq test3_R2.fastq ./file1.fa,./file2.fa,./file3.fa
ADD REPLY
1
Entering edit mode
4.4 years ago
GenoMax 147k

If you need to go through a set of reference genome one at one time you can use plain bbmap.sh from BBMap suite. Something like this would work (where your reference genomes are *.fa):

for y in *.fa 
  do
for i in *R1*.fq.gz
  do
    name = $(basename ${i} .fq.gz)
    bbmap.sh in1=${name}_R1.fq.gz in2=${name}_R2.fq.gz  \
    ref=${y} \ 
    other options for bbmap.sh
  done
done

If you do not provide out= directive then you will only get the stats. bbmap.sh writes stats to STDERR stream so capture that to get the info you need.

ADD COMMENT

Login before adding your answer.

Traffic: 1627 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6