Using Make For Parallel Processing Multiple Files At Once
11.1 years ago
gammyknee ▴ 210

Hi guys, this might be a ridiculously simple question on makefiles, so bear with me :)

I'm looking at implementing makefiles for my next-gen sequencing processing instead of tying everything up in bash scripts. The Makefile below takes fasta files and runs them through the BLASTX alignment program (PAUDA). I'm able to run the Makefile on individual files, but when I run it on multiple files (using the % pattern syntax), I just end up making one file called %.blastx. I've had problems with this for a while, so I'm really keen to find a solution asap.

all: %.blastx

clean:
        rm -f *.sam *.pna

BLASTSH='/opt/local/pauda/lib'
DB='database_file.db'

%.pna: %.fasta
        sh $(PDA_PATH)/z3_dna2pna.sh $< $@

%.sam: %.pna
        sh $(PDA_PATH)/z4_bowtie-on-pna.sh $< $@ $(DB)

%.blastx: %.fasta %.sam
        $(PDA_PATH)/z5_sam2blastx.sh $^ $@ $(DB)

Cheers


Btw, is this question borderline off-topic?

11.1 years ago
Hamish ★ 3.3k

OK, you have a set of pattern rules (see section 10.5.2, Pattern Rule Examples, in the GNU make manual):

# Rule for .fasta to .pna
%.pna: %.fasta
        sh $(PDA_PATH)/z3_dna2pna.sh $< $@

# Rule for .pna to .sam
%.sam: %.pna
        sh $(PDA_PATH)/z4_bowtie-on-pna.sh $< $@ $(DB)

# Rule for .fasta + .sam to .blastx
%.blastx: %.fasta %.sam
        $(PDA_PATH)/z5_sam2blastx.sh $^ $@ $(DB)

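However, you do not have any concrete targets to trigger the rule chain. A bare % in a prerequisite list is not expanded, so "all: %.blastx" asks for a file literally named %.blastx. To get real file names you need wildcard processing and a little pattern substitution: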

# Set of fasta files to process.
fasta_files := $(wildcard *.fasta)

# For each .fasta generate a .blastx
blastx : $(fasta_files:.fasta=.blastx)

Then all you need is the primary target:

# Collation target
all : blastx

Then test with a small set in single process mode. If that works then up the number of processes with '-j' (assuming GNU make).
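Putting the pieces above together, a minimal sketch of the whole Makefile might look like the following. It assumes GNU make and that recipe lines are indented with a real tab character. The PDA_PATH default is an assumption: the question defines that path under the name BLASTSH but the recipes reference PDA_PATH, so adjust it to wherever the PAUDA scripts actually live. The .PHONY line is also an addition, not part of the original rules.

```makefile
# Assumption: PDA_PATH is the directory holding the PAUDA helper scripts.
# The question sets this path as BLASTSH while the recipes use PDA_PATH.
PDA_PATH ?= /opt/local/pauda/lib
DB := database_file.db

# Set of fasta files to process.
fasta_files := $(wildcard *.fasta)

# These targets are labels, not files on disk.
.PHONY: all blastx clean

# Collation target, listed first so a bare "make" builds it.
all: blastx

# For each .fasta generate a .blastx
blastx: $(fasta_files:.fasta=.blastx)

# Rule for .fasta to .pna
%.pna: %.fasta
	sh $(PDA_PATH)/z3_dna2pna.sh $< $@

# Rule for .pna to .sam
%.sam: %.pna
	sh $(PDA_PATH)/z4_bowtie-on-pna.sh $< $@ $(DB)

# Rule for .fasta + .sam to .blastx
%.blastx: %.fasta %.sam
	$(PDA_PATH)/z5_sam2blastx.sh $^ $@ $(DB)

clean:
	rm -f *.sam *.pna
```

Running "make -n" first prints the commands without executing them, which is a cheap way to check that every .fasta file picks up the full chain. After that, "make" runs one job at a time and "make -j 4" runs up to four jobs at once.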


Awesome thanks for that explanation. What happens with multiple wildcards? For example, say I have Illumina paired end sequencing files (sample1_R1.fastq.gz and sample1_R2.fastq.gz), can you easily run both paired files using the same pattern substitution rules as you showed above?

# Set of fastq files to process.
fastqR1_files := $(wildcard *R1.fastq.gz)
fastqR2_files := $(wildcard *R2.fastq.gz)

# For each paired fastq.gz generate a sam file
fastqpaired : $(fasta_files:.fastq.gz=.sam)

Assuming you have a rule something like:

%.sam : %_R1.fastq.gz %_R2.fastq.gz
	genSam.sh $^

Then you could use something like:

# Set of fastq files to process (since they occur in pairs use one to avoid duplicates)
fastqR1_files := $(wildcard *_R1.fastq.gz)

# For each pair of fastq files generate a sam file.
fastqpaired : $(fastqR1_files:_R1.fastq.gz=.sam)

That expands to a target like:

fastqpaired : sample1.sam sample2.sam sample3.sam

And thus invokes the appropriate rule chain.
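To check the expansion without running any alignment, a small sketch like the one below can help. The sample file names are hypothetical stand-ins for what $(wildcard *_R1.fastq.gz) would return, and the $(patsubst ...) line is simply the longer spelling of the same substitution reference, assuming GNU make.

```makefile
# Hypothetical file names standing in for $(wildcard *_R1.fastq.gz)
fastqR1_files := sample1_R1.fastq.gz sample2_R1.fastq.gz

# Substitution reference: swap the suffix at the end of each word
sam_files := $(fastqR1_files:_R1.fastq.gz=.sam)

# Equivalent explicit form using patsubst
sam_files_alt := $(patsubst %_R1.fastq.gz,%.sam,$(fastqR1_files))

# Both lines print: sample1.sam sample2.sam
$(info $(sam_files))
$(info $(sam_files_alt))

# Placeholder target with an empty recipe so make has something to do
.PHONY: all
all: ;
```

Printing the list with $(info ...) before wiring it into a target is a quick way to catch a suffix that does not match, in which case the words come through unchanged.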

For more details of how to write makefiles for GNU make see the manual and the book "Managing Projects with GNU Make".


excellent thanks mate, that was super helpful. I've been reading the GNU make manual (back to front) but couldn't get my head around that

11.1 years ago
Michael 55k

I assume it is GNU make? See the fine manual: https://www.gnu.org/software/make/manual/html_node/Wildcards.html#Wildcards and https://www.gnu.org/software/make/manual/html_node/Multiple-Targets.html#Multiple-Targets

In short, % is not a wildcard, but * is. Could you try something like:

*.blastx: *.fasta *.sam
        $(PDA_PATH)/z5_sam2blastx.sh $^ $@ $(DB) # assuming this is correct

If blastx is the "final" target, then I think you can just write it as:

blastx: *.fasta *.sam

While the use of wildcard target names works for simple cases, this is not one of those cases. This fails due to the way the wildcards are resolved:

  • The file pattern (i.e. *) is resolved when the target dependencies are resolved. Since the '*.blastx' and '*.sam' files do not exist at this point the target fails to resolve correctly.
  • The 'z5_sam2blastx.sh' step requires the input files to be specified in pairs (i.e. blah.fasta with blah.sam), the '$^' macro gives all the dependencies, and the way they are specified here will give 'seq1.fasta seq2.fasta ... seqN.fasta seq1.sam seq2.sam ... seqN.sam' which is not what the script expects.
  • While '*.blastx' will work for updating existing files, since it cannot include new files till they are created, it cannot create the required dependency chain to generate the '.blastx' files from the '.fasta' files.
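The first and last points can be seen directly with a small sketch, assuming GNU make and a directory that contains some .fasta files but no .blastx files yet. The variable names existing and wanted are made up for illustration.

```makefile
# Files that already exist on disk: empty on a fresh run
existing := $(wildcard *.blastx)

# Files we want, derived from the inputs that do exist
wanted := $(patsubst %.fasta,%.blastx,$(wildcard *.fasta))

$(info existing: $(existing))
$(info wanted:   $(wanted))

# Placeholder target with an empty recipe so make has something to do
.PHONY: all
all: ;
```

On a fresh directory the first list is empty while the second names one .blastx per .fasta, which is why the target list has to be derived from the input files rather than globbed from outputs.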