bash loop for alignment RNA-seq data
1
6
Entering edit mode
10.6 years ago
Paul ★ 1.5k

Dear all,

Could you help he please to modify my loop in bash to multiple alignment my FASTQ files?

I have a many fastq files R1 and R2 and I would like to align all those read in bash loop - lets say I have X1_R1_001.fastq + X1_R2_001-fastq and Y1_R1_001.fastq + Y1_R2_001.fastq. Sample X1 is pair-end read R1 + R2 and so on.

#!/bin/bash
for i in *fastq;
do tophat2 -o ${i%.fastq}tpout -G path/to/my/reference.gtf -p 8 path/to/my/bowtie_index ${i}R1_001.fastq ${i}R2_001.fastq --rg-id X1 --rg-sample X1 --rg-library rna-seq --rg-platform Illumina
done;

And I would like to also change my --rg-id tag with name of my fastq files (in this example X1 in first loop and X2 in second loop). Output folder should have the name of the sample too.

Please do you have any idea how to modify my bash loop?

Thank you so much for any ideas!

Paul

RNA-Seq next-gen alignment • 26k views
ADD COMMENT
1
Entering edit mode

Hi,

Quick question: Do your FASTQ files follow a consistent naming convention?

ADD REPLY
0
Entering edit mode

Hi Ram,

Thank you for reply, yeah there is always number 1_S1_L001_R1_001.fastq + 1_S1_L001_R2_001.fastq, 2_S2_L001_R1_001.fastq + 2_S2_L001_R2_001.fastq and so on...

This is typical output form MiSeq.

Paul

ADD REPLY
0
Entering edit mode

OK,so the _R[12]_001.fastq is the constant part of the names, anything before that is prefix. Cool, I'll get back to you in some time.

ADD REPLY
0
Entering edit mode

Thank you so much... I try to figure out some solution also :-)

ADD REPLY
0
Entering edit mode
for i in $(ls *.fa.gz | rev | cut -c 4- | rev | uniq)
do
    mixcr align -p rna-seq -OallowPartialAlignments=true -OrelativeMinVFR3CDR3Score=0.7 -OallowChimeras=true -OminSumScore=30 ${i}.fa.gz ${i}.fa.gz  ../Alignments/${fq1%.fa.gz}.alignments.vdjca; 
done

Hi Ram, I used code to align my pair end RNA seq. we have performed single cells TCR seq and in one sample, I mixed several samples by adding unique adaptor to each samples. So after demultiplexing I got around 10 files of R1(as I used 10 barcode) and 10 files of R2. however name samples are same except before .fa.gz, it comes like _1.fa.gz and for next sample _2.fa.gz etc. Now when I used your script by editing specific thing and run the command. Scripting is not working. I get msg that there are no such file call .fa.gz

Here Ii am pasting name of the samples for your consideration. Could you please help me solve the issue.

9_S9_L001_R1_001_1.fa.gz
9_S9_L001_R1_001_2.fa.gz
9_S9_L001_R1_001_3.fa.gz
9_S9_L001_R1_001_4.fa.gz
9_S9_L001_R1_001_5.fa.gz
9_S9_L001_R1_001_6.fa.gz
ADD REPLY
0
Entering edit mode

Please use ADD REPLY/ADD COMMENT when responding to existing posts to keep threads logically organized.

ADD REPLY
0
Entering edit mode

code for i in $(ls *.fa.gz | rev | cut -c 4- | rev | uniq) prints sample_name.fa instead of sample_name. ${i}.fa.gz in your code appends extra fa (for example 9_S9_L001_R1_001_1.fa.fa.gz) to the sample name before extension .fa.gz. This might be one problem. In addition to this, you are using same file (${i}.fa.gz ${i}.fa.gz) twice. Is there a reason for this?

I tried to echo the input with three of your sample files (for R1 and R2):

$ for i in $(ls *.fa.gz | rev | cut -c 4- | rev | uniq); do echo $i; echo ${i}.fa.gz;done
9_S9_L001_R1_001_1.fa
9_S9_L001_R1_001_1.fa.fa.gz
9_S9_L001_R1_001_2.fa
9_S9_L001_R1_001_2.fa.fa.gz
9_S9_L001_R1_001_3.fa
9_S9_L001_R1_001_3.fa.fa.gz
9_S9_L001_R2_001_1.fa
9_S9_L001_R2_001_1.fa.fa.gz
9_S9_L001_R2_001_2.fa
9_S9_L001_R2_001_2.fa.fa.gz
9_S9_L001_R2_001_3.fa
9_S9_L001_R2_001_3.fa.fa.gz

$ ls *.gz
9_S9_L001_R1_001_1.fa.gz  9_S9_L001_R1_001_2.fa.gz  9_S9_L001_R1_001_3.fa.gz  9_S9_L001_R2_001_1.fa.gz  9_S9_L001_R2_001_2.fa.gz  9_S9_L001_R2_001_3.fa.gz
ADD REPLY
0
Entering edit mode

If I will use the sample name then I have to write sample name for all. It means I have total 1-49 samples and then for each samples 10 barcode so in total I have 490 R1 file and 490 R2 file. No there are no reason to use that but I am getting how to specify R1 and R2, as my sample unique name is 9_S9..._1.fa.gz. so I to specify that I am not getting.

ADD REPLY
0
Entering edit mode

What I meant was that code (provided by you) is appending an extra fa to your input files. In stead of passing 9_S9_L001_R2_001_3.fa.gz, it is passing 9_S9_L001_R2_001_3.fa.fa.gz for execution. That might be one reason why it is not finding files.

I printed the input files that are being sent to program in above command.

ADD REPLY
0
Entering edit mode

that true, how could modify this script so it can work?

ADD REPLY
0
Entering edit mode

See my response below. Unless you want the fa as part of the prefix, use a different index to cut with, or better yet, use sed.

ADD REPLY
0
Entering edit mode

I've updated my answer below with some tips.

ADD REPLY
0
Entering edit mode

These are shell questions, my friend - it takes a little trial and error to get them right. for examples, why a cut -c 4- when the common ending (.fa.gz) itself is 6 characters? Only a rev | cut -c 7- will give you a list of prefixes. In fact, you can just use ls *.fa.gz | sed 's/.fa.gz//'

ADD REPLY
0
Entering edit mode
for i in $(ls *.gz | sed s/.fa.gz//)
do
    mixcr align -p rna-seq -OallowPartialAlignments=true -OrelativeMinVFR3CDR3Score=0.7 -OallowChimeras=true -OminSumScore=30 ${i} ../Alignments/${i%.fa.gz}.alignments.vdjca; 
done

Hi Now I have changed things as you suggested. But still it says there are no such file or directory. Here I do not understand one point if I am not providing R1 and R2 separately then how will recognize the partner. I am sorry I am bit new to system that why my questions are bit basic level.

ADD REPLY
0
Entering edit mode

About file name based assumptions, I'm guessing that's on the tool, not on bash. If your tool (mixcr align) substitutes an R1. with a R2. automatically to look for a corresponding file, it can (in your words) recognize the partner. If not, like most other tools, it will have two input parameters, one for each file.

Also, I'm guessing your changing-suffix-after-common-prefix as you describe below is the core challenge.

In your command above, you've sed-substituted away the .fa.gz and still are using ${i%.fa.gz}. Why is that?

ADD REPLY
0
Entering edit mode

It would help if you could post the example code for accepted and intended parameters for this tool (mixcr) with one paired end sample. Otherwise, visitors would not know how input files are passed to the tool and in what order.

ADD REPLY
23
Entering edit mode
10.6 years ago
Ram 44k

So, this is what I came up with, based on our discussion:

for i in $(ls *.fastq | rev | cut -c 13- | rev | uniq)
do
tophat2 -o ${i%.fastq}tpout  -G path/to/my/reference.gtf -p 8 path/to/my/bowtie_index ${i}_R1_001.fastq ${i}_R2_001.fastq --rg-id ${i} --rg-sample ${i} --rg-library rna-seq --rg-platform Illumina
done

Hope this helps!


OK, this answer is gotten a little popular, and my shell knowledge has improved so I'm going to enhance this with some fail-safes to address commonly observed challenges.

First off, the rev | cut -c 13- is just asking for trouble. Let's substitute that with a sed so the command makes a little more sense. Here, I'm trying to remove all _R[12]_001.fastq.

ls *.fastq | sed -r 's/_R[12]_001[.]fastq//' | uniq looks better. I've also edited the variable name from i so it makes a little more sense. And I've added some formatting changes plus quotes to avoid shell problems.

Here's the updated code:

for prefix in $(ls *R1*.fastq | sed -r 's/_R1_001[.]fastq//' | uniq)
do
tophat2 -o "${prefix%.fastq}tpout" \
    -G path/to/my/reference.gtf -p 8 \
    path/to/my/bowtie_index \
    "${prefix}_R1_001.fastq" "${prefix}_R2_001.fastq" \
    --rg-id "${prefix}" --rg-sample "${prefix}" --rg-library rna-seq --rg-platform Illumina
done

Note (21-Sep-2022): This will only work for one-pair-per-sample FASTQs. Multi-lane FASTQs will need to be concatenated so each sample has just one pair. The command can be modified easily for compressed FASTQ files.

ADD COMMENT
1
Entering edit mode

Wau that looks good - pleae could you explain the first line -

for i in $(ls *.fastq | rev | cut -c 13- | rev | uniq)

Thank you very much,

Paul.

ADD REPLY
2
Entering edit mode

Thank you. Sure, the first line extracts unique prefixes. The pipe does something like this:

  1. Pick all .fastq files
  2. Reverse their names
  3. Pick from char 13 to end
  4. reverse again (as the last 13 characters are constant, we use that to pick first n characters)
  5. take a uniq on this prefix set.

Use this uniq set to run the script :)

Edit: $(command) is an alternative, clearer way of writing `command`

ADD REPLY
0
Entering edit mode

Paul,

If this helped, I'd appreciate it if you accepted the answer please. Thank you!

ADD REPLY
1
Entering edit mode

Yeah of course Ram, thank you so much for your time and help!!!

Paul.

ADD REPLY
1
Entering edit mode

Thank you, Paul. It is always fun working on bash! Have a great weekend!

ADD REPLY
0
Entering edit mode

+1 for rev

ADD REPLY
0
Entering edit mode

It's a wonderful piece of our toolkit, isn't it? It is for these reasons that I love Unix!

ADD REPLY
0
Entering edit mode

do I have to run this thing from the same directory i have my fastq files ? or I can run it from anywhere ?

ADD REPLY
1
Entering edit mode

This script is customized to OP's situation. You may want to modify it to fit your use case. Given the script involves a ls *.fastq, you'll need to either run a valid version of it in the dir with the fastq files, or change the ls part of the script suitably.

ADD REPLY
0
Entering edit mode

so in place of ls i just defined the path that contains my fastq files so that would work?

ADD REPLY
0
Entering edit mode

as i defined the whole path which contains my fastq files its showing that its a directory ..the command is not being executed

ADD REPLY
1
Entering edit mode

How well did you understand/customize the script? The cut operates on exact character numbering, and might not be suitable to your case. If you cut the first 13 characters of a long path name, you might end up with a character string that doesn't mean much.

Have fun with bash! :)

ADD REPLY
0
Entering edit mode

Hi,

Just a general question about the arguments used for the loop designed here. What is the difference between ${i%.fastq} used to indicate the output or simply using ${i}?

Thanks!

ADD REPLY
7
Entering edit mode

I've found this page (and the website in general) very useful for advanced bash programming: http://wiki.bash-hackers.org/syntax/pe

Let's say i="HELLO.txt.txt"

echo $i #HELLO.txt.txt
echo $iX #No output - $iX is not a variable with any value here
echo ${i}X #HELLO.txt.txtX - ${} isolates the string inside the {} as the variable name
echo ${i%.txt} #HELLO.txt - removes shortest match for '.txt' from the end of the string

So, ${i%.fastq} removes the smallest .fastq from the end of the filename (so, its extension). Hope this helps.

EDIT: With context, I'm using the (kinda) basename of the file - the name of the file stripped of its fastq extension to name the output. I'm sure there are other ways to achieve this, maybe with basename or seds, but this fit best here.

ADD REPLY
0
Entering edit mode

Hi Ram, thank you so much for the quick reply. Very helpful information for scripting loops!

ADD REPLY
0
Entering edit mode

I think its getting over written in the tpout folder ...can you suggest whats wrong with my loop , im using this have a look

for i in $(ls *.fastq | rev | cut -c 9- | rev | uniq); do tophat2 -o $(%i.fastq)tout -G /run/media/punit/data2/gencode.v21.annotation.gtf -p 30 /run/media/punit/data2/BowtieIndex/hg38 ${i}_1.fastq ${i}_2.fastq ;done
ADD REPLY
1
Entering edit mode

$() executes the statement within the (). ${} is used to delimit variable names. Are you sure you're doing it right with the $(%i.fastq)tout part?

ADD REPLY
0
Entering edit mode

Immediate issue is $(%i} in your code. You should use ${i%}, not ${%i}, to my knowledge, in your code. However your variable depends on your sample names. could you paste your sample names? one for each type (forward and reverse)?

ADD REPLY
0
Entering edit mode

Yes, it's always ${} for parameter expansion/ string substitution

ADD REPLY
0
Entering edit mode

okay I'm doing as you mentioned but for this I would like to know tophat2 -o ${i%.fastq}tpout is it going to put my output in two different folders , as I have two pairs of paired end read

ADD REPLY
0
Entering edit mode

I don't know how tophat works, I was helping with bash here :-)

ADD REPLY
0
Entering edit mode

Thanks Ram! one of the issue with my samples are that they are coming from barcoding demultiplex that means all the name would be same except number just before .fa.gz such as _1.fa.gz. So how could I give my input sample name such as, example you have provided is common in all the samples so that will work well but in my sample only common part is .fa.gz

  ${prefix}_R1_001.fastq" "${prefix}_R2_001.fastq"
    9_S9_L001_R1_001_1.fa.gz
    9_S9_L001_R1_001_2.fa.gz
    9_S9_L001_R1_001_3.fa.gz
ADD REPLY
1
Entering edit mode

I'd recommend breaking down your problem into at least 2 steps. For the first step, collect all filenames in a separate file. Once you do that, you can create dry-run commands that just echo stuff so you can test your cutting and sed-ing.

For example, if a file named filelist.list had the 3 file names you gave me, I could do:

sed 's/.fa.gz//' filelist.list | sed -r 's/(_[0-9])$/\t\1/' | tee prefixes_and_suffixes.txt

Output:

9_S9_L001_R1_001    _1
9_S9_L001_R1_001    _2
9_S9_L001_R1_001    _3

You could then use this new tab separated file (the arbitrarily named prefixes_and_suffixes.txt) to do what you will - awk and cut as well as uniq will be really useful here. The only way you'll learn bash is by this sort of trial and error. This will be a good exercise. Good luck!

ADD REPLY
0
Entering edit mode

Life would be so much simpler if you had chosen to include that _1, _2 before _L001_R1_001.

ADD REPLY
0
Entering edit mode

You can use nested loop as following:

$ for i in $(ls *.fa.gz | cut -f1-3 -d"_" | uniq); do for j in $(ls *.fa.gz | cut --complement -f1-5 -d"_" | sort| uniq); do echo $i"_R1"_$j;done;done.

Replace R2 for another read pair in place R1.

My assumption is that, for 2 samples with paired end reads (R1 and R2), your files are :

$ ls *.gz
9_S9_L001_R1_001_1.fa.gz  
9_S9_L001_R2_001_2.fa.gz
9_S9_L001_R2_001_1.fa.gz 
9_S9_L001_R1_001_2.fa.gz

output for R1:

  $ for i in $(ls *.fa.gz | cut -f1-3 -d"_" | uniq); do for j in $(ls *.fa.gz | cut --complement -f1-5 -d"_" | sort| uniq); do echo $i"_R1"_$j;done;done
9_S9_L001_R1_1.fa.gz
9_S9_L001_R1_2.fa.gz

output for R2:

$ for i in $(ls *.fa.gz | cut -f1-3 -d"_" | uniq); do for j in $(ls *.fa.gz | cut --complement -f1-5 -d"_" | sort| uniq); do echo $i"_R2"_$j;done;done
9_S9_L001_R2_1.fa.gz
9_S9_L001_R2_2.fa.gz

Let us say your program takes following commands and you have several samples with R1 and R2:

$ program -p <R1>.fa <R2>.fa

Above script would be:

$ for i in $(ls *.fa.gz | cut -f1-3 -d"_" | uniq); do for j in $(ls *.fa.gz | cut --complement -f1-5 -d"_" | sort| uniq); do program -p $i"_R1"_$j  $i"_R2"_$j;done;done

Try with two samples at first to see if program is working. Let me know if there are any issues.

ADD REPLY
0
Entering edit mode

Hi, thank you for this. Can you explain the sed -r 's/_R[12]_001[.]fastq// change?

ADD REPLY
0
Entering edit mode

Sure, that command uses sed to delete a part of a string from the string. Are you familiar with sed and regular expressions?

ADD REPLY

Login before adding your answer.

Traffic: 1680 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6