Question

STAR alignment in a full directory

0

Entering edit mode

3.0 years ago

Pahi ▴ 30

Hi!

I'm working with STAR and I would like to align multiple file, but separately. I have file paires like these two:

Dros_01_S48_L001_R1_001.fastq.gz
Dros_01_S48_L001_R2_001.fastq.gz

etc. Only the S[number] changes and the R1 and R2 in the names.

I have a code, what I find on other forums, and reworked it a bit, to match my files:

for i in $(ls 1_raw_data | sort -u); do echo STAR --genomeDir /home/pahib/RNA_SEQ_Pipeline/Reference_genome/Drosophila_STAR/ \
--readFilesIn 1_raw_data/${i}_R1_001.fastq.gz 1_raw_data/${i}_R2_001.fastq.gz \
--runThreadN 20 --outFileNamePrefix 3_aligned/$i. \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--sjdbGTFfile /home/pahib/RNA_SEQ_Pipeline/Reference_genome/Drosophila_gtf/Drosophila_melanogaster.BDGP6.32.104.chr.gtf \
--readFilesCommand gunzip -c ; done

My problem, that during the readFilesIn part the files what STAR load in renamed as the following:

Dros_01_S48_L001_R1_001.fastq.gz_R1_001.fastq.gz
Dros_01_S48_L001_R1_001.fastq.gz_R2_001.fastq.gz

So basically it adds more to its name.

What could be the problem?

STAR loop • 2.6k views

ADD COMMENT • link updated 2.1 years ago by DareDevil ★ 4.3k • written 3.0 years ago by Pahi ▴ 30

0

Entering edit mode

for i in $(ls 1_raw_data | sort -u); do

you should use a workflow manager

My problem, that during the readFilesIn

what's in 1_raw_data ?

ADD REPLY • link 3.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

1_raw_data is a folder where all the fastq.gz file are located.

you should use a workflow manager

I'm not sure what is that.

ADD REPLY • link 3.0 years ago by Pahi ▴ 30

0

Entering edit mode

1_raw_data is a folder where all the fastq.gz file are located. of course, can you please give us an example

I'm not sure what is that.

https://www.nextflow.io/

https://snakemake.readthedocs.io/en/stable/

ADD REPLY • link 3.0 years ago by Pierre Lindenbaum 164k

2

Entering edit mode

2.1 years ago

DareDevil ★ 4.3k

You can try following:

for i in *_R1_001.fastq.gz; do name=$(basename ${i} _R1_001.fastq.gz);

  STAR --genomeDir /path/to/Genome \ #Path to the index generated previously
  --runThreadN 12 \ #Number of cores
  --readFilesIn ${name}__R1_001.fastq.gz ${name}_R2_001.fastq.gz \ #Path to the input files (forward and reverse)
  --outFileNamePrefix ${name}_aligned_transcriptome \ #Prefix to the output files
....
....
.....
done

ADD COMMENT • link 2.1 years ago by DareDevil ★ 4.3k

score 3 · Accepted Answer · 2021-11-26

Hi,

If I understood right your problem, what you're doing does not make sense. You're looping over the forward (R1) and reverse (R2) fastq files inside the folder 1_raw_data. This means that for each sample you'll attempt to align it twice, one for the forward file and a second time for the reverse, because you're looping over the files inside the folder, which have different names for the forward and reverse fastq samples.

Since you're looping over the file names that are inside the folder, the variable $i will assume the names of the files, and of course, when you add to that the constant string _R1_001.fastq.gz or _R2_001.fastq.gz to $i, it'll just append these strings to the file names.

So, what you want to do is to loop over the forward or the reverse fastq files once for each sample, retrieve only the string that matches the sample, and align once the respective forward and reverse fastq samples against the reference.

To do so you can attempt the following (first backup your files - never run analyses on the original copy!) - I'm using your loop above and I'm assuming that the sample names match between forward and reverse and the only difference between them is the R1 and R2 tags, respectively:

for i in $(ls 1_raw_data/*_R1_* | sort -u); do echo STAR --genomeDir /home/pahib/RNA_SEQ_Pipeline/Reference_genome/Drosophila_STAR/ \
--readFilesIn 1_raw_data/${i} 1_raw_data/${i/_R1_/_R2_} \
--runThreadN 20 --outFileNamePrefix 3_aligned/${i/_R1_001.fastq.gz/} \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--sjdbGTFfile /home/pahib/RNA_SEQ_Pipeline/Reference_genome/Drosophila_gtf/Drosophila_melanogaster.BDGP6.32.104.chr.gtf \
--readFilesCommand gunzip -c ; done;

Now, you'll only loop over the forward files (s 1_raw_data/*_R1_*) which is the same that looping over each sample once. Then you will make use of parameter expansion in linux to substitute the R1 tag from the forward file name to R2 in the reverse file name. You do this by specifying ${i/_R1_/_R2_} which means to pick up the name of the variable $i which represents Dros_01_S48_L001_R1_001.fastq.gz, find _R1_ and replace by _R2_. I'm assuming that you want to use as prefix the sample name without the suffix _R1_001.fastq.gz that's why I changed the previous code to --outFileNamePrefix 3_aligned/${i/_R1_001.fastq.gz/}.

I just want to reinforce the message of Pierre Lindenbaum by strongly recommend you the use of a workflow manager. It is the best way to make your analysis more reproducible, scalable etc, and it will save you tons of time later.

I hope this helps,

António