Dear all,
Could you please help me modify my bash loop to align multiple FASTQ files?
I have many paired-end FASTQ files (R1 and R2) and would like to align all of the reads in a bash loop. Let's say I have X1_R1_001.fastq + X1_R2_001.fastq and Y1_R1_001.fastq + Y1_R2_001.fastq. Sample X1 is a paired-end read set (R1 + R2), and so on.
#!/bin/bash
for i in *fastq;
do tophat2 -o ${i%.fastq}tpout -G path/to/my/reference.gtf -p 8 path/to/my/bowtie_index ${i}R1_001.fastq ${i}R2_001.fastq --rg-id X1 --rg-sample X1 --rg-library rna-seq --rg-platform Illumina
done;
I would also like to set the --rg-id tag to the name of my FASTQ files (in this example X1 in the first loop and X2 in the second loop). The output folder should have the name of the sample too.
Do you have any idea how to modify my bash loop?
Thank you so much for any ideas!
Paul
Hi,
Quick question: Do your FASTQ files follow a consistent naming convention?
Hi Ram,
Thank you for the reply. Yes, there is always a number: 1_S1_L001_R1_001.fastq + 1_S1_L001_R2_001.fastq, 2_S2_L001_R1_001.fastq + 2_S2_L001_R2_001.fastq, and so on. This is typical output from MiSeq.
Paul
OK, so _R[12]_001.fastq is the constant part of the names, and anything before that is the prefix. Cool, I'll get back to you in some time.
Thank you so much... I will try to figure out a solution as well :-)
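Given that naming convention, one way to write the loop is sketched below. It is a dry run: it only prints one tophat2 command per sample, so the sample name in -o and --rg-id can be checked before anything is executed. The tophat2 flags and the placeholder paths are copied from the original post and are not verified here; build_tophat_cmd is a made-up helper name and the _tpout folder suffix is an assumption.

```bash
#!/bin/bash
# Dry run: print one tophat2 command per R1/R2 pair found in the current directory.

build_tophat_cmd() {
    # $1 is an R1 file name such as 1_S1_L001_R1_001.fastq
    local r1="$1"
    local sample="${r1%_R1_001.fastq}"   # sample prefix, e.g. 1_S1_L001
    local r2="${sample}_R2_001.fastq"    # matching R2 file
    # Flags and placeholder paths come from the original post (assumed correct).
    echo "tophat2 -o ${sample}_tpout -G path/to/my/reference.gtf -p 8" \
        "--rg-id ${sample} --rg-sample ${sample} --rg-library rna-seq --rg-platform Illumina" \
        "path/to/my/bowtie_index ${r1} ${r2}"
}

# Loop over R1 files only, so each pair is handled exactly once.
for r1 in *_R1_001.fastq; do
    [ -e "$r1" ] || continue             # no R1 files matched the pattern
    r2="${r1%_R1_001.fastq}_R2_001.fastq"
    if [ ! -e "$r2" ]; then
        echo "skipping $r1: $r2 not found" >&2
        continue
    fi
    build_tophat_cmd "$r1"
done
```

If the printed commands look right, the output of the script can be piped to bash to actually run them (this assumes the file names contain no spaces).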
Hi Ram, I used your code to align my paired-end RNA-seq data. We performed single-cell TCR-seq, and in one sample I mixed several samples by adding a unique adaptor to each. After demultiplexing I got around 10 R1 files (as I used 10 barcodes) and 10 R2 files. The file names are the same except for the part just before .fa.gz: it is _1.fa.gz for one sample, _2.fa.gz for the next, etc. When I edited your script for my files and ran it, it did not work. I get a message that there is no such file called .fa.gz.
Here I am pasting the names of the samples for your consideration. Could you please help me solve the issue?
Please use ADD REPLY/ADD COMMENT when responding to existing posts to keep threads logically organized.
The code
for i in $(ls *.fa.gz | rev | cut -c 4- | rev | uniq)
prints sample_name.fa instead of sample_name. Using ${i}.fa.gz in your code therefore appends an extra fa (for example 9_S9_L001_R1_001_1.fa.fa.gz) to the sample name before the extension .fa.gz. This might be one problem. In addition to this, you are using the same file (${i}.fa.gz ${i}.fa.gz) twice. Is there a reason for this?
I tried to echo the input with three of your sample files (for R1 and R2):
If I use the sample name, then I have to write the sample name for every file. I have samples 1-49 and 10 barcodes for each sample, so in total I have 490 R1 files and 490 R2 files. No, there is no reason to use the same file twice, but I do not understand how to specify R1 and R2, as my unique sample name is 9_S9..._1.fa.gz. That is the part I am not getting.
What I meant was that the code (provided by you) is appending an extra fa to your input files. Instead of passing 9_S9_L001_R2_001_3.fa.gz, it is passing 9_S9_L001_R2_001_3.fa.fa.gz for execution. That might be one reason why it is not finding the files. I printed the input files that are being sent to the program in the above command.
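The doubled extension can be reproduced on a single file name without running the aligner at all. A small sketch, using one of the file names quoted in this thread:

```bash
# One of the file names mentioned above
name=9_S9_L001_R2_001_3.fa.gz

# rev | cut -c 4- | rev drops only the last 3 characters (".gz")
prefix=$(echo "$name" | rev | cut -c 4- | rev)

echo "$prefix"            # 9_S9_L001_R2_001_3.fa
echo "${prefix}.fa.gz"    # 9_S9_L001_R2_001_3.fa.fa.gz  (no such file)
```

Echoing the arguments like this before calling the real tool is a cheap way to see exactly which file names the loop would pass on.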
That's true. How could I modify this script so that it works?
See my response below. Unless you want the fa as part of the prefix, use a different index to cut with, or better yet, use sed.
I've updated my answer below with some tips.
These are shell questions, my friend - it takes a little trial and error to get them right. For example, why cut -c 4- when the common ending (.fa.gz) itself is 6 characters? Only rev | cut -c 7- will give you a list of prefixes. In fact, you can just use ls *.fa.gz | sed 's/\.fa\.gz$//'
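To make the comparison concrete, here is a small sketch of three ways to strip the 6-character ending from one of the file names in this thread. The third variant, bash parameter expansion, is not mentioned above; it is added here as an alternative that needs no external command.

```bash
name=9_S9_L001_R1_001_1.fa.gz

# 1. Reverse, drop the first 6 characters (".fa.gz" reversed), reverse back
p1=$(echo "$name" | rev | cut -c 7- | rev)

# 2. Delete a literal ".fa.gz" at the end of the name with sed
p2=$(echo "$name" | sed 's/\.fa\.gz$//')

# 3. Shell parameter expansion: remove the shortest matching suffix
p3=${name%.fa.gz}

echo "$p1"   # 9_S9_L001_R1_001_1
echo "$p2"   # 9_S9_L001_R1_001_1
echo "$p3"   # 9_S9_L001_R1_001_1
```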
Hi, I have now changed things as you suggested, but it still says there is no such file or directory. There is one point I do not understand: if I am not providing R1 and R2 separately, how will the tool recognize the partner? I am sorry, I am a bit new to this system, which is why my questions are at a basic level.
About file-name-based assumptions, I'm guessing that's on the tool, not on bash. If your tool (mixcr align) substitutes an R1. with an R2. automatically to look for a corresponding file, it can (in your words) recognize the partner. If not, like most other tools, it will have two input parameters, one for each file.
Also, I'm guessing your changing-suffix-after-common-prefix as you describe below is the core challenge.
In your command above, you've sed-substituted away the .fa.gz and are still using ${i%.fa.gz}. Why is that?
It would help if you could post example code showing the accepted and intended parameters for this tool (mixcr) with one paired-end sample. Otherwise, visitors would not know how input files are passed to the tool and in what order.