bash script for RNA seq analysis
2
0
Entering edit mode
7.1 years ago
1769mkc ★ 1.2k

Normally i do it kind layman or novice , I put all the command in a script lets say i have 5 files then 5 command for each, make directory for each sample its tedious, but it is difficult when i have to handle large number of files

RD1_R1.fastq
RD1_R2.fastq
RD2_R1.fastq
RD2_R2.fastq
RD3_R1.fastq
RD3_R2.fastq
RD4_R1.fastq
RD4_R2.fastq
RD5_R1.fastq
RD5_R2.fastq
RD6_R1.fastq
RD6_R2.fastq
RD7_R1.fastq
RD7_R2.fastq
RD8_R1.fastq
RD8_R2.fastq
RD9_R1.fastq
RD9_R2.fastq
RD10_R1.fastq
RD10_R2.fastq

So i have a list of these paired end files ,i would like to make directory put paired end for in a single directory , then run fastqc , then would like to run tophat over the paired end files which would be in a single folder , lets say for files RD1_R1 & RD1_R2 which are paired end files , they are in RD1 folder then i would run fastqc , then tophat over the files and the result in the same folder. Like this i would like to do over all the paired end files..

I know basics of shell scripting like making folder for respective files, single file operation but this is bit complex i mean putting two files together and doing operation .

Any help or suggestion would be highly appreciated .

RNA-Seq • 7.2k views
ADD COMMENT
2
Entering edit mode

IMHO, Much of the work flows in NGS are parallel.

1) For creating 10 directories for 20 samples, following is the code:

$ mkdir RD{1..10}

You can achieve the same by:

$ parallel echo {.} ::: *.fastq | cut -f1 -d"_" | uniq | parallel mkdir {}

2) Now try to move each pair of files into their respective directories.

 $parallel echo {.} ::: *.fastq | cut -f1 -d"_" | uniq | parallel mv {}_R{1..2}.fastq {}

3) Like wise, you can parallelize fastqc as well.

ADD REPLY
0
Entering edit mode

thanks let me do this , these are normal things yet i have to break my head break my head because i dont know how to put things together

ADD REPLY
1
Entering edit mode

Just take the parallel command and start with checking the output of:

$parallel echo {.} ::: *.fastq
$parallel echo {.} ::: *.fastq | cut -f1 -d"_" 
$parallel echo {.} ::: *.fastq | cut -f1 -d"_" | uniq
$parallel echo {.} ::: *.fastq | cut -f1 -d"_" | uniq | parallel mv {}_R{1..2}.fastq {}

Step by step, try to figure out how each component works. With this understanding, it will get easier for you to modify this for different scenarios later on. A very helpful resource is explainshell, in which you can copy paste shell commands which are then, well, explained to you.

ADD REPLY
0
Entering edit mode

will try your solution

ADD REPLY
1
Entering edit mode

if you are not sure, what parallel is doing, append --dry-run to parallel command. This would display what it is going to do, but would not execute the command. Once you are sure, then remove --dry-run.

for eg.

$ parallel echo  {.} ::: *.fastq | cut -f1 -d"_" | uniq | parallel --dry-run  mkdir {}
mkdir RD10
mkdir RD1
mkdir RD2
mkdir RD3
mkdir RD4
mkdir RD5
mkdir RD6
mkdir RD7
mkdir RD8
mkdir RD9

PS: Btw, you mentioned tophat in your OP. I guess hisat2 is in, tophat is passe (as per the tophat devs) unless you have a valid requirement. You may also refer to https://digibio.blogspot.in/2015/08/rna-seq-data-analysis-tophat-htseq-and.html for bash solution for RNA seq analysis. It involves bash, loops and tophat-cufflinks-cummerbund analysis in addition to tophat-htseq-deseq2 workflow. It is written 2 yr ago, almost.

ADD REPLY
4
Entering edit mode
7.1 years ago
Satyajeet Khare ★ 1.6k

You will have to use for loop. List all the files, get basename using xarg, use sed to break "_R1" and "_R2" from the basename and finally sort unique.

for i in $(ls your_folder/*.fastq | xargs -n 1 basename | sed 's/\(.*\)_.*/\1/' | sort -u)

Now write commands inside for loop.

Best,

ADD COMMENT
0
Entering edit mode

so tell me how do i pipe your command output to make folder for each paired end file meanwhile for this "_R1" i think this is given to paired end read one and "R2" for paired end read two, why do you want to break part ,could to explain me?

ADD REPLY
0
Entering edit mode

So you get the list of unique samples and not paired end files. Once you have unique sample list you can again provide R1 and R2 in your alignment command by appending "_R1" and "_R2" to the $i.

You don't have to make separate folder for each sample.

ADD REPLY
0
Entering edit mode

yes understood , i since i have 20 samples i have to make ten folder as they are paired end isn't it? so i can you for loop to pass it to make 10 folder and then pass that to run tophat?

ADD REPLY
1
Entering edit mode

Here is a solution, considering you only have all the paired end fastq files in your directory

# Loop over all the files, jumping by 2, i.e. 0th and 1st file will be R1 and R2 of sample1, 
# 2nd and 3rd file will be R1 and R2 of sample2 and so on..
for ((i=0; i<$total_files; i+=2))

{
# variable "out" contains only the name of the sample
out=`echo ${arr[$i]} | awk -F "_R1" '{print $1}'`

# make a new folder with the variable out
mkdir out

# move paired fastq files to that folder
mv ${arr[$i]} ${arr[$i+1]} out

# go to the folder
cd out 

# write your commands here
do whatever you wish with these files

# move out once finished and loop over next pair of files
cd ..

}
ADD REPLY
0
Entering edit mode

For the popularly used RNA-seq pipeline HISAT2, StringTie and BallGown, the authors have provided a beautiful set of wrapper shell scripts here

ADD REPLY
0
Entering edit mode

Why would you need a subdirectory per sample?

ADD REPLY
0
Entering edit mode

Personal preference?

ADD REPLY
2
Entering edit mode

Some of the workflows require one directory per sample. for eg.StringTie and ballgown workflow.

ADD REPLY
2
Entering edit mode

StringTie and Ballgown does not need one directory per sample at fastq stage. You can do that to output bam and gtf files from Hisat2 and StringTie. Ballgown will use StringTie directory structure.

I agree with WouterDeCoster.

ADD REPLY
0
Entering edit mode

Fair enough. Just meant that it might not be necessary, but according to cpad0112 it is for some applications.

ADD REPLY
0
Entering edit mode

so i 'm doing something this

for i in $(ls *.fastq | rev | cut -c 9- | rev | uniq); do tophat2 -o $(%i.fastq)tout -G /run/media/punit/data2/gencode.v21.annotation.gtf -p 30 /run/media/punit/data2/BowtieIndex/hg38 ${i}_1.fastq ${i}_2.fastq ;done

But it seems that the output is going to same folder, i want the paired end aligned files to go into different folder ,it seems that i messed up somewhere with the tout ..

Any help or suggestion would be highly appreciated

ADD REPLY
0
Entering edit mode

You may have to create output folders beforehand. Also, assuming forward and reverse read files end with R1 and R2 and not 1 and 2, I changed the end of the command.

for i in $(ls *.fastq | rev | cut -c 9- | rev | uniq); 
do 
tophat2 -o ${i}tout -G /run/media/punit/data2/gencode.v21.annotation.gtf -p 30 /run/media/punit/data2/BowtieIndex/hg38 ${i}R1.fastq ${i}R2.fastq;
done
ADD REPLY
0
Entering edit mode

try *Note: Run this code only on two files RD1_R1.fastq and RD1_R2.fastq first in a test directory. Take a back up of your raw files, before you run the script.*

My assumption (from your earlier post) is that your fastq files are named like: RD1_R1.fastq and RD1_R2.fastq. Based on this assumption, try following code:

$ for i in $(ls *.fastq | rev | cut -c 8- | rev | uniq) ; do tophat2 -o ${i%}_tout -G /run/media/punit/data2/gencode.v21.annotation.gtf -p 30 /run/media/punit/data2/BowtieIndex/hg38 ${i%}_1.fastq ${i%}_2.fastq ;done
ADD REPLY
0
Entering edit mode

thanks i figured out it later.

ADD REPLY
3
Entering edit mode
7.1 years ago
Fatima ▴ 1000
 total=10  
 for (( i=1; i<${total}+1; i++ ));
    do
    script.bash  RD${i}_R1.fasta  RD${i}_R2.fasta   # if you want to run a command with paired input
    mkdir RD${i}
    mv RD${i}_R1.fastq RD${i}/
    mv RD${i}_R2.fastq RD${i}/
    done
ADD COMMENT

Login before adding your answer.

Traffic: 1752 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6