HISAT2 for loop shell scripting
2
0
Entering edit mode
6.6 years ago
e.amiri79 ▴ 10

Dear all would you please help me to modify my loop in bash to align my samples of FASTQ files? I have paired end RNA-seq files as:

EGG12-Clean_ACTGAT_S33_L004_R1_001.fastq
EGG12-Clean_ACTGAT_S33_L004_R2_001.fastq
EGG14-Clean_GAGTGG_S34_L004_R1_001.fastq
EGG14-Clean_GAGTGG_S34_L004_R2_001.fastq

... I have tried:

#!/bin/bash
export RNA_HOME=~/workspace/rnaseq
cd $RNA_HOME
export RNA_DATA_DIR=$RNA_HOME/data
cd $RNA_DATA_DIR
export RNA_REF_INDEX=$RNA_REFS_DIR/amel_OGSv3.2

for i in $(ls *.fastq | rev | cut -c 22- | rev | uniq)
do
    hisat2 -p 8 \
        --rg-id=${i} \
        --rg SM:${i}\
        --rg PL:ILLUMINA \
        -x $RNA_REF_INDEX \
        --dta --rna-strandness RF \
        -1 $RNA_DATA_DIR/${i}_*_R1_001.fastq \
        -2 $RNA_DATA_DIR/${i}_*_R2_001.fastq \
        -S ./${i}.sam
done;

but unfortunately it can not find the paired end files and therefore will not be executed. Any help would be appreciated

RNA-Seq • 6.7k views
ADD COMMENT
1
Entering edit mode

Looks like you are entering twice in $RNA_DATA_DIR. You do:

...
cd $RNA_DATA_DIR
...
-1 $RNA_DATA_DIR/${i}_*_R1_001.fastq \
-2 $RNA_DATA_DIR/${i}_*_R2_001.fastq \
ADD REPLY
1
Entering edit mode

In addition $RNA_REFS_DIR is not defined.

$RNA_DATA_DIR/${i}_*_R1_001.fastq is also weird. Why do you use a wildcard here ?

I think your code is a bit messed up. You should probably test your loop first with something like:

for i in $(ls *.fastq | rev | cut -c 22- | rev | uniq)
do
  echo $i
done;

If this doesn't work, then you know you have issues with your loop.

ADD REPLY
0
Entering edit mode

Thanks but it works fine and it is exactly what I need

ADD REPLY
0
Entering edit mode

No it was only to let you know about the direction.

ADD REPLY
0
Entering edit mode

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLY
0
Entering edit mode

One issue you have is that your files are .fastq files but your code is trying to search for files that end in .fa. Change your for loop to this:

for i in $(ls *.fastq | rev | cut -c 22- | rev | uniq)

I don't think this is the only issue but try running this and seeing what happens.

ADD REPLY
0
Entering edit mode

Thanks, I did run it but still it is not working

ADD REPLY
0
Entering edit mode

can you update your original post with the modified code you are using, and any error message you are getting? Also can you give the ls or tree output of the directory you are running the script in?

ADD REPLY
0
Entering edit mode

ls: cannot access '*.fastq': No such file or directory

this is what I get.

ADD REPLY
0
Entering edit mode

Please use ADD COMMENT or ADD REPLY to answer to previous reactions, as such this thread remains logically structured and easy to follow. I have now moved your reaction but as you can see it's not optimal. Adding an answer should only be used for providing a solution to the question asked.

DO YOU UNDERSTAND?

ADD REPLY
0
Entering edit mode

Do not use quotes with shell globs (aka *); the command needs to be: ls *.fastq

ADD REPLY
0
Entering edit mode

@OP: Your $i is EGG12-Clean_ACTGAT_ and when you reconstruct with $RNA_DATA_DIR/${i}_*_R1_001.fastq, your sample is EGG12-Clean_ACTGAT__*_R1_001.fastq (with an extra _) instead of EGG12-Clean_ACTGAT_*_R1_001.fastq. Remove _ after } (some thing like: ${i}*_R1). Try to print $i (echo $i for both reads) after renaming. Change fa to fastq as mentioned above.

Also try to use bash string manipulation:

$ for i in $(ls *.fastq); do echo ${i%_*_*_*_*.fastq};done
EGG12-Clean_ACTGAT
EGG12-Clean_ACTGAT
EGG12-Clean_ACTGAT
EGG12-Clean_ACTGAT

thanks @ram

ADD REPLY
2
Entering edit mode
6.6 years ago
h.mon 35k

note: Carlos Yague already pointed out your mistakes, but I will repeat then here because I wrote the post before reading the deeply branched comments.

There are a couple of errors. For example, $RNA_REFS_DIR hasn't been defined, but it is used here:

export RNA_REF_INDEX=$RNA_REFS_DIR/amel_OGSv3.2

Another one is that you want to loop over fastq files, but uses ls *.fa.

Also, if you cd $RNA_DATA_DIR, then on the loop you need:

-1 ${i}_*_R1_001.fastq \
-2 ${i}_*_R2_001.fastq \

Finally, you can't use * ([glob][1]) here, as you need specific file names as input.

It is never a good idea to parse ls, and almost always it is not needed. Instead of you convoluted:

for i in $(ls *.fastq | rev | cut -c 22- | rev | uniq)

You can just glob for filenames, then manipulate $i inside the loop:

for i in *_R1_001.fastq

This is just a skeleton to get you started:

for i in *_R1_001.fastq
do
    SAMPLE=$(echo ${i%_S*})
    R1=$(echo ${i#*_S})
    R2=$(echo ${i#*_S} | sed "s/_R1_/_R2_/")
    echo "${SAMPLE}_S${R1}"
    echo "${SAMPLE}_S${R2}"
done
ADD COMMENT
0
Entering edit mode

Thank you! I need to run it in my computer and if it worked then I can adjust it for my data

ADD REPLY
1
Entering edit mode
6.6 years ago
jkim ▴ 190

I'd like to introduce snakemake for a task like this.

https://snakemake.readthedocs.io/en/stable/

It's easy to use and well-documented. Here is a tutorial from Mr. Slowikowski's website.

https://slowkow.com/notes/snakemake-tutorial/

And this is my mRNAseq pipeline made by snakemake. (I use STAR and RSEM.)

https://github.com/heyyyjude/ngs-data-pipeline/blob/master/mrna-seq.fq.snakemake

ADD COMMENT

Login before adding your answer.

Traffic: 1692 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6