Aligning multiple fastq files with genome in one script/one line with STAR
2
0
Entering edit mode
3.1 years ago
j_eag ▴ 10

Hi there! This is probably a VERY basic question but I don't have the best terminal skills so I'm struggling a little.

I want to apply what I wrote below for all my fastq scripts without doing a for loop or manually writing the code for each (ideally they all run in parallel, or at least in the same script):

STAR \
  --runThreadN 12 \
  --genomeDir /scratch/eg5/trial2/align/sequence/STARindex \
  --readFilesCommand gunzip -c \
  --readFilesIn /scratch/eg51/trial2/trimmed_files/SRR12045658_1_trimmed.fq.gz \
  --sjdbGTFfile /scratch/eg51/trial2/align/annotation/genes/GRCm39.ens.gtf \
  --outFileNamePrefix /scratch/eg51/trial2/align/results/STARresults \
  --limitGenomeGenerateRAM 32000000000 \
  --outSAMtype BAM SortedByCoordinate

I tried to use *.fq.gz, but that did not work and I couldn't specify the outFileNamePrefix as /*

STAR \
  --runThreadN 12 \
  --genomeDir /scratch/eag519/trial2/align/sequence/STARindex \
  --readFilesCommand gunzip -c \
  --readFilesIn /scratch/eag519/trial2/trimmed_files/*.fq.gz \
  --sjdbGTFfile /scratch/eag519/trial2/align/annotation/genes/GRCm39.ens.gtf \
  --outFileNamePrefix /scratch/eag519/trial2/align/results/STARresults/* \
  --limitGenomeGenerateRAM 32000000000 \
  --outSAMtype BAM SortedByCoordinate

I also tried putting all the files in the same line but that resulted in the following error:

SOLUTION:specify only one or two files in the --readFilesIn option. If file names contain spaces, use quotes: "file name"
STAR \
  --runThreadN 12 \
  --genomeDir /scratch/eag519/trial2/align/sequence/STARindex \
  --readFilesCommand gunzip -c \
  --readFilesIn /scratch/eag519/trial2/trimmed_files/SRR12045658_1_trimmed.fq.gz, /scratch/eag519/trial2/trimmed_files/SRR12045661_1_trimmed.fq.gz, /scratch/eag519/trial2/trimmed_files/SRR12045664_1_trimmed.fq.gz, /scratch/eag519/trial2/trimmed_files/SRR12045667_1_trimmed.fq.gz \
  --sjdbGTFfile /scratch/eag519/trial2/align/annotation/genes/GRCm39.ens.gtf \
  --outFileNamePrefix /scratch/eag519/trial2/align/alignmentresults/STARresults \
  --limitGenomeGenerateRAM 32000000000 \
  --outBAMsortingThreadN 4 \
  --outSAMstrandField intronMotif \
  --outFilterIntronMotifs RemoveNoncanonicalUnannotated \
  --outSAMattributes All \
  --outSAMtype BAM SortedByCoordinate

Anyone have any ideas? i.e., how to enter all the files in , and then have them output with their own file names. Sorry I'm sure this is a very basic question.

RNA-seq STAR SAM • 2.0k views
ADD COMMENT
2
Entering edit mode
3.1 years ago

with parallel

$ parallel --dry-run  STAR --runThreadN 12  --genomeDir /scratch/eg5/trial2/align/sequence/STARindex --readFilesCommand gunzip -c --readFilesIn /scratch/eg51/trial2/trimmed_files/{} --sjdbGTFfile /scratch/eg51/trial2/align/annotation/genes/GRCm39.ens.gtf  --outFileNamePrefix /scratch/eg51/trial2/align/results/STARresults --limitGenomeGenerateRAM 32000000000 \ --outSAMtype BAM SortedByCoordinate ::: *trimmed.fq.gz

This is a dry-run and common string in files is trimmed.fq.gz. Change the common string as per your requirements.

with find:

$ find * -type f -name "*trimmed.fq.gz" -exec echo STAR --runThreadN 12  --genomeDir /scratch/eg5/trial2/align/sequence/STARindex --readFilesCommand gunzip -c --readFilesIn /scratch/eg51/trial2/trimmed_files/{} --sjdbGTFfile /scratch/eg51/trial2/align/annotation/genes/GRCm39.ens.gtf  --outFileNamePrefix /scratch/eg51/trial2/align/results/STARresults --limitGenomeGenerateRAM 32000000000 \ --outSAMtype BAM SortedByCoordinate \; 

Check your output file name prefixes before you execute the parallel command and mind the threads.

ADD COMMENT
0
Entering edit mode

Thanks a lot! It didn't work for me for some reason but I'm sure I'll figure it out.

ADD REPLY
0
Entering edit mode

It's a dry run for both parallel and find. Remove dry-run if you are using parallel and remove echo if you are using find command. You need to post what you meant by not working for eg. error logs etc.

ADD REPLY
1
Entering edit mode
3.1 years ago

First and foremost simplify your command with environment variables:

export IDX=/scratch/eg5/trial2/align/sequence/STARindex 
export GTF=/scratch/eg51/trial2/align/annotation/genes/GRCm39.ens.gtf
export INP=/scratch/eg51/trial2/trimmed_files/
export RES=/scratch/eg51/trial2/align/results/STARresults

now the command looks more manageable and reusable:

STAR --runThreadN 12  --genomeDir $IDX --readFilesCommand gunzip -c --readFilesIn $INP/SRR12045658_1_trimmed.fq.gz --sjdbGTFfile\
    $GTF  --outFileNamePrefix $RES --limitGenomeGenerateRAM 32000000000 \ 
    --outSAMtype BAM SortedByCoordinate

Identify what changes in the command, suppose you want to replace $INP with SRR numbers, put those numbers into a file. Suppose the file contains:

SRR12045658
SRR12045659

now write a parallel script that PRINTS to the screen

 cat ids | parallel echo STAR --runThreadN 12  --genomeDir $IDX --readFilesCommand gunzip -c --readFilesIn $INP/{}_1_trimmed.fq.gz\
     --sjdbGTFfile $GTF  --outFileNamePrefix $RES --limitGenomeGenerateRAM 32000000000 \
     --outSAMtype BAM SortedByCoordinate

verify that the command it prints are valid. Now you can either remove the echo and directly run the commands, or even better, keep the echo have it write you a script:

 cat ids | parallel echo STAR --runThreadN 12  --genomeDir $IDX --readFilesCommand gunzip -c --readFilesIn $INP/{}_1_trimmed.fq.gz\
 --sjdbGTFfile $GTF  --outFileNamePrefix $RES --limitGenomeGenerateRAM 32000000000 \ 
 --outSAMtype BAM SortedByCoordinate > run.sh

you can now inspect the contents of run.sh and verify that each command is what you wanted, followed by running the script:

bash run.sh 

finally add your variable to the start of run.sh so that the whole script contains all the information that is necessary for it to work.

ADD COMMENT
0
Entering edit mode

Wow, thanks for the detailed explanation. I'll try this out. Sorry for the silly question but, what do you mean by add your variable to the start of run.sh ? Do you mean the ids?

ADD REPLY
0
Entering edit mode

I meant the section dealing with environment variables:

export IDX=/scratch/eg5/trial2/align/sequence/STARindex 
export GTF=/scratch/eg51/trial2/align/annotation/genes/GRCm39.ens.gtf
export INP=/scratch/eg51/trial2/trimmed_files/
export RES=/scratch/eg51/trial2/align/results/STARresults

should be at the start of the script

ADD REPLY
0
Entering edit mode

Got it, yes I did that, thanks!

Would you by any chance know how to resolve this: I immediately receive an error for gunzip ( " align.sh: line 7: -c: command not found ") but everything continues to run (" ..... started STAR run"), and then after about 30 mins it says finished successfully but I don't actually have anything in the STARresults folder.

I checked the log file and it looks like everything should have worked ( though what do I know) yet all I have it is : STARresults STARresultsAligned.sortedByCoord.out.bam STARresultsLog.out STARresultsLog.progress.out STARresults_STARgenome STARresults_STARtmp

ADD REPLY
0
Entering edit mode

also add

set -uex

at the top of the script, it will stop the run at the first error, plus it will print the commands as they get executed.

To troubleshoot run just one command (the first the fails) study it closely fix it up until it works. The devil is in the details. This

ADD REPLY

Login before adding your answer.

Traffic: 2560 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6