Hello,
I am new to RNA-seq analysis, and I am attempting to align my sequence reads to my indexed genome in STAR (I've already indexed the genome). I have 50 samples in total, and each sample has 4 respective fastq.gz files – this was PE with 50 bp reads, and run in two lanes. Thus, per sample, 2 fastq.gz files are from lane 1 and the other 2 are from lane 2. Would the following code work? I'm slightly confused about how to iterate through each file in my directory. The file names look like: samplename_L001_R1 or samplename_L002_R1 or samplename_L001_R2 or samplename_L002_R2. Thank you in advance!!
for i in $(find /data/rawdata); do STAR
--genomeDir <path to genome index> \
--readFilesIn /data/rawdata/${i}_R1.fq.gz,/data/rawdata/${i}_R2.fq.gz \
--runThreadN 20
--outFileNamePrefix aligned/$i. \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts
--readFilesCommand zcat ; done
It won't work as written for a few reasons. Your find command will find the file names (as well as the dir name /data/rawdata) with their extensions in place, so when you specify them later in the command they will have double extensions, and this command will be executed for each file, whereas you would like it to happen for pairs of files. You can confirm this with:
Also, the way to hand STAR paired-end data is via space-delimited R1 and R2 files, not comma delimited as in your example. Lastly, if just just loop through basenames (samplename_L001), you'll have to combine your aligned reads later (samplename_L001.bam and samplename_L002.bam to create samplename.bam). You could work up a solution to give STAR all the lane 1 and 2 reads for a given sample at once, using something like:
(I'm just using a shorthand, sname_L001_R1 = sname_L001_R1.fq.gz). The command above is using process substitution to zcat a set of files and hand that data to STAR. How you implement something like this...depends on how you want to implement it. I'm not sure it's a bash one-liner, but there are a lot of smart people here :)
Thanks so much for the help! Would the ls command be appropriate instead of the find command?
You should just test it and see! But yes, ls will work. And you can throw in the basename command as well, so I suppose you could just grab all the R1 files, strip off the suffix, and then create names for the R1, R2, and output file using a variation of the form: