I'm using the Tuxedo protocol for RNA-seq data on a Z840 workstation with 128 GB RAM and 56 cores.
I read about GNU parallel, which can run jobs in parallel and thereby reduce the run time. I have already run TopHat to align the sequences; now I have to run Cufflinks. I have 4 samples: one wild type with a replicate, and the other is a test condition with a replicate. After running TopHat I have 4 accepted_hits.bam files altogether, and I want to run Cufflinks on each of them. My folder structure is as follows:
/home/k/WT1/accepted_hits.bam
/home/k/WT2/accepted_hits.bam
/home/k/VD1/accepted_hits.bam
/home/k/VD2/accepted_hits.bam
How can I run Cufflinks using GNU parallel? I read through the manual but I'm not sure how to proceed.
I tried this:
cufflinks -g gencode.v21.annotation.gtf -o WTL1 /home/k/WT1/accepted_hits.bam | parallel
I'm not sure if that's the right way or not. I want to run Cufflinks on all the samples using GNU parallel.
Any help or suggestions would be highly appreciated.
Thanks
cufflinks has a -p option for utilising multiple threads, is that not what you're looking for? If you only have 4 samples, I would just run each one with a fourth of your total cores, e.g.:
for file in ~/k/*/*.bam ; do cufflinks -p 13 -g gencode.v21.annotation.gtf -o WTL1 $file ; done
Or something to that effect; I'm not sure what your exact cufflinks command and output dir setup is. NB, I only specified 13 cores so that your machine doesn't totally grind to a halt, as you'll still have 4 cores free for other tasks.
Yes, I know about the -p option, but right now I have only 4 samples; I have 20 samples in total. For that reason I would like to know how I can run the jobs using GNU parallel.
Why do you need to use parallel specifically? Just send them to the background with & at the end of the command. Much easier in my opinion if you are not familiar with parallel. http://hacktux.com/bash/ampersand
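To make the & suggestion concrete, a minimal sketch assuming the four directories from the question and a hypothetical per-sample cufflinks_out output directory:

```shell
# Kick off one cufflinks run per sample in the background with a trailing &,
# then wait for all of them to finish. 4 jobs x 13 threads = 52 of 56 cores.
for dir in /home/k/WT1 /home/k/WT2 /home/k/VD1 /home/k/VD2; do
    cufflinks -p 13 -g gencode.v21.annotation.gtf \
        -o "$dir/cufflinks_out" "$dir/accepted_hits.bam" &
done
wait   # blocks until every backgrounded job has exited
```

The downside versus parallel is that all four start at once; with 20 samples you'd want some way to throttle how many run concurrently.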
It's a bit confusing as to whether you're trying to simply process several files concurrently, or to (also?) call cufflinks to run in a multithreaded manner too. How big are your 4 (20?) files? Do you even really need to multithread them that badly? You could still invoke cufflinks 20 times, with 2 cores each, without needing to invoke parallel at all...
Okay, can you tell me how I do that ("You could still invoke cufflinks 20 times")? I ask because it seems to be taking 8-9 hours to process one bam file.
I have no idea what sort of timeframe it will take for each of your BAM files as you've not said how big they are. Are they still running after 8-9 hours or is that the finish time?
Here's a little snippet of some code I used to sequentially ssh in to 11 VMs and start 30 protein structure simulations on them concurrently (assuming you're currently in the k directory):
This loops over each of your directories, picks up the bam file, and executes cufflinks via nohup so that the program will continue to run even after you disconnect. It should output files into the directory the bam it processed was in, and will also spit out a nohup logfile. This business, 2>&1 &, redirects STDERR to STDOUT so that errors appear in your logfile along with the STDOUT output. Basically, every invocation of nohup (and therefore cufflinks) is sent to the background as it's called in the loop.
NB, I haven't tested this as written, so make sure you've backed up the BAMs. It should be putting the output files in a sensible place, but since they've all got the same names I'd be cautious of overwriting etc. You may need to fiddle with it for your needs.
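The snippet itself didn't come through above, but from the description (a loop over each sample directory, a nohup'd cufflinks writing into that directory, 2>&1 into a logfile, and a trailing & to background each invocation) it might have looked roughly like this sketch; the -p 2 and the directory names are assumptions taken from the earlier comments:

```shell
# Run from /home/k. Each sample directory gets its own nohup'd cufflinks,
# backgrounded with a trailing &. "2>&1" merges STDERR into the same logfile
# so errors show up alongside the normal output, and nohup keeps the jobs
# running after you log out.
for d in WT1 WT2 VD1 VD2; do
    nohup cufflinks -p 2 -g gencode.v21.annotation.gtf \
        -o "$d" "$d/accepted_hits.bam" > "$d/nohup.out" 2>&1 &
done
```

Each directory gets its own nohup.out, which sidesteps the "same names" overwriting worry for the logs at least.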
Okay, I will try your solution and let you know if it worked.