Hi,
If I understood your problem correctly, what you're doing does not make sense. You're looping over both the forward (R1) and reverse (R2) fastq files inside the folder 1_raw_data. This means that you'll attempt to align each sample twice, once for the forward file and once for the reverse file, because the loop visits every file in the folder and the forward and reverse fastq files have different names.
Since you're looping over the file names inside the folder, the variable $i takes the name of each file and, of course, when you add the constant string _R1_001.fastq.gz or _R2_001.fastq.gz to $i, it just appends these strings to the full file name.
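To make this concrete, here is a minimal sketch of what happens to a single file name. The file name Dros_01_S48_L001_R1_001.fastq.gz is only an example, and the variable names r1 and r2 are mine, not from your script:

```bash
# Example value that $i takes when the loop reaches a forward file
# (hypothetical file name, used only for illustration).
i="Dros_01_S48_L001_R1_001.fastq.gz"

# Appending the constant suffixes to a name that already has one
# produces file names that do not exist in the folder.
r1="${i}_R1_001.fastq.gz"
r2="${i}_R2_001.fastq.gz"

echo "$r1"
echo "$r2"
```

Both printed names carry the suffix twice, which is why the aligner cannot find the input files.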
So, what you want to do is loop over either the forward or the reverse fastq files, once per sample, retrieve only the string that identifies the sample, and align the respective forward and reverse fastq files against the reference once.
To do so you can try the following (back up your files first - never run analyses on the original copy!). I'm using your loop above and I'm assuming that the sample names match between the forward and reverse files and that the only difference between them is the R1 and R2 tags, respectively:
# Loop over the forward (R1) files only, so each sample is processed once.
for f in 1_raw_data/*_R1_*; do
# Keep only the file name; the folder is added back explicitly below.
i=$(basename "$f")
# "echo" only prints the command; remove it once the printed command looks right.
echo STAR --genomeDir /home/pahib/RNA_SEQ_Pipeline/Reference_genome/Drosophila_STAR/ \
--readFilesIn "1_raw_data/${i}" "1_raw_data/${i/_R1_/_R2_}" \
--runThreadN 20 --outFileNamePrefix "3_aligned/${i/_R1_001.fastq.gz/}" \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--sjdbGTFfile /home/pahib/RNA_SEQ_Pipeline/Reference_genome/Drosophila_gtf/Drosophila_melanogaster.BDGP6.32.104.chr.gtf \
--readFilesCommand gunzip -c
done
Now you only loop over the forward files (1_raw_data/*_R1_*), which is the same as looping over each sample once. Then you make use of parameter expansion in bash to change the R1 tag in the forward file name to R2, which gives the reverse file name. You do this by specifying ${i/_R1_/_R2_}, which means: take the value of the variable $i, for example Dros_01_S48_L001_R1_001.fastq.gz, find _R1_ and replace it with _R2_. I'm assuming that you want to use as prefix the sample name without the suffix _R1_001.fastq.gz; that's why I changed the previous code to --outFileNamePrefix 3_aligned/${i/_R1_001.fastq.gz/}.
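If you want to see what the two substitutions produce before running anything, a small sketch like this should do. It assumes bash (the ${var/pattern/replacement} form is not available in plain sh), and the variable names r2 and prefix are only mine, for illustration:

```bash
# Example forward-read file name (hypothetical, for illustration only).
i="Dros_01_S48_L001_R1_001.fastq.gz"

# ${var/pattern/replacement} replaces the first match of pattern in $var.
r2="${i/_R1_/_R2_}"

# With an empty replacement the match is deleted, leaving the sample name.
prefix="${i/_R1_001.fastq.gz/}"

echo "forward: $i"
echo "reverse: $r2"
echo "prefix:  $prefix"
```

The value of $i itself is not modified; each expansion just returns a new string, so you can use the forward name, the reverse name and the prefix side by side in the same command.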
I just want to reinforce the message of Pierre Lindenbaum by strongly recommending that you use a workflow manager. It is the best way to make your analysis more reproducible and scalable, and it will save you tons of time later.
I hope this helps,
António
you should use a workflow manager
what's in 1_raw_data ?
1_raw_data is a folder where all the fastq.gz files are located.
I'm not sure what that is.
https://www.nextflow.io/
https://snakemake.readthedocs.io/en/stable/