Super basic question, probably more related to programming than RNA-seq itself...
I have multiple FASTQs that I want to align (from 3 separate lanes to provide more sequencing depth). According to the STAR manual, I should be able to separate these with commas in between all read 1s, then space, then commas between all read 2s. Like this: L1_R1,L2_R1,L3_R1 L1_R2,L2_R2,L3_R2
I keep getting the error: gzip: /Sample_1/fastq/*L003_001.R2.fastq.gz: No such file or directory
When I bring it out to the terminal to check, this is what I see:
echo $FASTQ1 $FASTQ2
/Sample_1/fastq/Sample_1_TTAGGC_BCCRUJANXX_L003_001.R1.fastq.gz /Sample_1/fastq/Sample_1_TTAGGC_BCCRUJANXX_L003_001.R2.fastq.gz
but
echo $FASTQ1,$FASTQ2
/Sample_1/fastq/*L003_001.R1.fastq.gz,/Sample_1/fastq/*L003_001.R2.fastq.gz
So clearly it recognizes the variable name, but can't deal with the comma in between. Is there a way around this or do I need to provide full variable names in the STAR script (which would be pretty annoying, since I have many files like this)?
Thanks! Here's the code for reference (I've modified it slightly since the whole path names are long):
IN=/Sample_1
genomeDir=/STAR
GTF=/gencode.v19.chr_patch_hapl_scaff.annotation.gtf
FASTQ1=$IN/fastq/*L003_001.R1.fastq.gz
FASTQ2=$IN/fastq/*L003_001.R2.fastq.gz
FASTQ3=$IN/fastq/*L004_001.R1.fastq.gz
FASTQ4=$IN/fastq/*L004_001.R2.fastq.gz
FASTQ5=$IN/fastq/*L005_001.R1.fastq.gz
FASTQ6=$IN/fastq/*L005_001.R2.fastq.gz
STAR --runThreadN 8 --genomeDir $genomeDir --sjdbGTFfile $GTF --sjdbOverhang 149 --bamRemoveDuplicatesType UniqueIdentical --readFilesIn $FASTQ1,$FASTQ3,$FASTQ5 $FASTQ2,$FASTQ4,$FASTQ6 --twopassMode Basic --outSAMtype BAM SortedByCoordinate Unsorted --quantMode TranscriptomeSAM GeneCounts --readFilesCommand zcat
Defining the variable this way makes the OS look for that folder/file at the base of the root directory when the variable is expanded (which is reflected in the error you get: gzip: /Sample_1/fastq/*L003_001.R2.fastq.gz: No such file or directory). Do you actually have a folder Sample_1 just under / ?
Other than that, what is going on with the * in *L003_001.R1.fastq.gz ? What is the * supposed to indicate?
Thanks for your reply! I just cut out the rest of the path name for my question online because it's super long, but I don't think that part of the variable is the issue. I think you are right that the wildcard is causing the problem, but I don't understand why. Every sample is named something like this in each respective fastq folder:
Sample_1_TTAGGC_BCCRUJANXX_L003_001.R2.fastq.gz
so I wanted to make a universal script that could get rid of the sample name and adapter sequence and take only the part:
*L003_001.R2.fastq.gz
Doing this to all my scripts seems to have fixed the problem, after I cd into the proper folder:
But this requires me to fill out that variable for every sample, which seems like a silly way to do it. You're right that the wildcard is the issue, but why?
Try using ${FASTQ1},${FASTQ2} instead of $FASTQ1,$FASTQ2 - I'm not sure if it will work, but it's worth a shot. What would be better is to build the file list with find, or with ls instead of find if all FASTQs are in the same folder.
Thank you! Yeah, the ${FASTQ1},${FASTQ2} doesn't work either; something about having the comma right between the two variables makes it break down when using a wildcard. But I'll give the other options a shot, thanks!
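On the "why": the shell expands wildcards one word at a time, after variables have been substituted. With a space between the variables each pattern is its own word and matches a real file; with a comma the two patterns fuse into a single word that matches nothing, so bash passes it along unchanged. Below is a minimal bash sketch of that behaviour and of one possible workaround (expand the glob into an array, then join with commas). The temporary directory, the S1_ file names, and the variable names A, B, r1 and R1_LIST are made up for the demo.

```bash
#!/usr/bin/env bash
# Demo files in a throwaway directory (names are hypothetical).
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
touch "$tmp/S1_L003_001.R1.fastq.gz" \
      "$tmp/S1_L004_001.R1.fastq.gz" \
      "$tmp/S1_L005_001.R1.fastq.gz"

# An assignment stores the literal pattern; no wildcard expansion happens here.
A=$tmp/*L003_001.R1.fastq.gz
B=$tmp/*L004_001.R1.fastq.gz

# Two separate words: each pattern is matched on its own and becomes a real path.
echo $A $B

# One word "pattern,pattern": no file matches the whole thing, so it stays literal.
echo $A,$B

# Workaround: let the glob expand into an array, then join the entries with commas.
r1=( "$tmp"/*_L00[345]_001.R1.fastq.gz )
R1_LIST=$(IFS=,; echo "${r1[*]}")
echo "$R1_LIST"
```

A string built like R1_LIST (and a matching one for the R2 files) should then be usable as the comma-separated arguments to --readFilesIn. Using ${FASTQ1},${FASTQ2} does not help because the braces only delimit the variable name; the result is still a single word.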
If your file names are like Sample_1_TTAGGC_BCCRUJANXX_L003_001.R1.fastq.gz, then name=$(basename ${i} _001.R2.fastq.gz) would be the best way to extract the part that varies. Then you reconstitute the file names for the actual run using ${name}_001.R1.fastq.gz and ${name}_001.R2.fastq.gz.
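The basename suggestion could be sketched roughly as follows, assuming i loops over the R2 files of one sample and that every R2 file has an R1 partner with the same prefix. The temporary directory and the R1_LIST / R2_LIST variable names are assumptions for illustration, not part of the original script.

```bash
#!/usr/bin/env bash
# Demo files mimicking the naming scheme from the question (throwaway directory).
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
for lane in L003 L004 L005; do
  touch "$tmp/Sample_1_TTAGGC_BCCRUJANXX_${lane}_001.R1.fastq.gz" \
        "$tmp/Sample_1_TTAGGC_BCCRUJANXX_${lane}_001.R2.fastq.gz"
done

R1_LIST=""
R2_LIST=""
for i in "$tmp"/*_001.R2.fastq.gz; do
  # Strip the directory and the fixed suffix, leaving the part that varies,
  # e.g. Sample_1_TTAGGC_BCCRUJANXX_L003
  name=$(basename "$i" _001.R2.fastq.gz)
  # Rebuild both mate file names; ${VAR:+,} adds a comma only after the first entry.
  R1_LIST+="${R1_LIST:+,}$tmp/${name}_001.R1.fastq.gz"
  R2_LIST+="${R2_LIST:+,}$tmp/${name}_001.R2.fastq.gz"
done

echo "$R1_LIST"
echo "$R2_LIST"
```

Because both lists are built in the same loop, the R1 and R2 entries stay in the same lane order, which is what the comma-separated form of --readFilesIn expects. Quoting "$R1_LIST" and "$R2_LIST" when passing them on keeps the shell from trying to re-expand anything.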