I'm using the Tuxedo protocol for RNA-seq data on a Z840 workstation with 128 GB RAM and 56 cores.
I read about GNU parallel, which can run jobs in parallel and thereby reduce the run time. I have already run TopHat to align the sequences; now I have to run Cufflinks. I have 4 samples: one wild type with a replicate, and the other is a test condition with a replicate. After running TopHat I have 4 accepted_hits.bam files altogether, and I want to run Cufflinks on each of them. My folder structure is as follows:
/home/k/WT1/accepted_hits.bam
/home/k/WT2/accepted_hits.bam
/home/k/VD1/accepted_hits.bam
/home/k/VD2/accepted_hits.bam
How can I run Cufflinks using GNU parallel? I read through the manual but I'm not sure how to proceed.
I tried this:
cufflinks -g gencode.v21.annotation.gtf -o WTL1 /home/k/WT1/accepted_hits.bam | parallel
I'm not sure if that's the right way or not. I want to run Cufflinks on all the samples using GNU parallel.
Any help or suggestions would be highly appreciated.
Thanks
cufflinks has a -p option for utilising multiple threads, is that not what you're looking for? If you only have 4 samples, I would just run each one with a fourth of your total cores, e.g.:
for file in ~/k/*/*.bam ; do cufflinks -p 13 -g gencode.v21.annotation.gtf -o WTL1 $file ; done
Or something to that effect; I'm not sure what your exact cufflinks command and output dir setup is. NB, I only specified 13 cores so that your machine doesn't totally grind to a halt, as you'll still have 4 cores free for other tasks.
Yes, I know about the -p option, but right now I have only 4 samples; I have 20 samples in total. For that reason I would like to know how I can run the jobs using GNU parallel.
Why do you need to use parallel specifically? Just send them to the background with & at the end of the command. Much easier in my opinion if you are not familiar with parallel. http://hacktux.com/bash/ampersand
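To make the & suggestion concrete, a minimal sketch assuming the four directories from the question and a hypothetical per-sample cufflinks_out output directory:

```shell
# Kick off one cufflinks run per sample in the background with a trailing &,
# then wait for all of them to finish. 4 jobs x 13 threads = 52 of 56 cores.
for dir in /home/k/WT1 /home/k/WT2 /home/k/VD1 /home/k/VD2; do
    cufflinks -p 13 -g gencode.v21.annotation.gtf \
        -o "$dir/cufflinks_out" "$dir/accepted_hits.bam" &
done
wait   # blocks until every backgrounded job has exited
```

The downside versus parallel is that all four start at once; with 20 samples you'd want some way to throttle how many run concurrently.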
It's a bit confusing as to whether you're trying to simply process several files concurrently, or to (also?) call cufflinks to run in a multithreaded manner too. How big are your 4 (20?) files? Do you even really need to multithread them that badly? You could still invoke cufflinks 20 times, with 2 cores each, without needing to invoke parallel at all...
Okay, can you tell me how I do that ("You could still invoke cufflinks 20 times")? I ask because it seems to be taking 8-9 hours to process one bam file.
I have no idea what sort of timeframe it will take for each of your BAM files as you've not said how big they are. Are they still running after 8-9 hours or is that the finish time?
Here's a little snippet of some code I used to sequentially ssh in to 11 VMs and start 30 protein structure simulations on them concurrently (assuming you're currently in the k directory):
This loops over each of your directories, picks up the bam file, and executes cufflinks via nohup so that the program will continue to run even after you disconnect. It should output files into the directory the bam it processed was in, and will also spit out a nohup logfile. This business, 2>&1 &, redirects STDERR to STDOUT so that errors appear in your logfile along with the STDOUT output. Basically, every invocation of nohup (and therefore cufflinks) is sent to the background as it's called in the loop.
NB, I haven't tested this as written, so make sure you've backed up the BAMs. It should be putting the output files in a sensible place, but since they've all got the same names I'd be cautious of overwriting etc. You may need to fiddle with it for your needs.
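The snippet itself didn't come through above, but from the description (a loop over each sample directory, a nohup'd cufflinks writing into that directory, 2>&1 into a logfile, and a trailing & to background each invocation) it might have looked roughly like this sketch; the -p 2 and the directory names are assumptions taken from the earlier comments:

```shell
# Run from /home/k. Each sample directory gets its own nohup'd cufflinks,
# backgrounded with a trailing &. "2>&1" merges STDERR into the same logfile
# so errors show up alongside the normal output, and nohup keeps the jobs
# running after you log out.
for d in WT1 WT2 VD1 VD2; do
    nohup cufflinks -p 2 -g gencode.v21.annotation.gtf \
        -o "$d" "$d/accepted_hits.bam" > "$d/nohup.out" 2>&1 &
done
```

Each directory gets its own nohup.out, which sidesteps the "same names" overwriting worry for the logs at least.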
Okay, I will try your solution and let you know if it worked.