Question

How to read fastqs from different sequencing runs rather than merging?

3.3 years ago

Vasu ▴ 800

I have fastqs of samples from the first sequencing and second sequencing runs and they are kept in different directories like below:

First Run:

Data1
    |_____fastq_folder
               |_______ sample1
                           |______ sample1_L1_R1.fastq.gz
                           |______ sample1_L1_R2.fastq.gz
                           |______ sample1_L2_R1.fastq.gz
                           |______ sample1_L2_R2.fastq.gz
               |_______ sample2
                           |______ sample2_L1_R1.fastq.gz
                           |______ sample2_L1_R2.fastq.gz
                           |______ sample2_L2_R1.fastq.gz
                           |______ sample2_L2_R2.fastq.gz

Second Run:

Data2
    |_____fastq_folder
               |_______ sample1
                           |______ sample1_L1_R1.fastq.gz
                           |______ sample1_L1_R2.fastq.gz
               |_______ sample2
                           |______ sample2_L1_R1.fastq.gz
                           |______ sample2_L1_R2.fastq.gz

Usually, when I run Salmon or Kallisto on the First Run files in the Data1 directory, I do it as follows.

Say I'm inside the Data1 directory, where I have a script named kallisto.sh. Inside the script, I read the fastq files like this:

r1=$(ls $fastq_folder/$sample/$sample*_R1.fastq.gz)
r2=$(ls $fastq_folder/$sample/$sample*_R2.fastq.gz)

But now I would also like to use the Second Run files in my script. How should I change r1 and r2 so that they read all the files from both the First Run and the Second Run?

P.S.: I know I could merge the files first and then run the analysis, but that might take a very long time at my workplace.

ngs rnaseq fastq • 1.1k views

ADD COMMENT • link updated 3.3 years ago by dsull ★ 7.6k • written 3.3 years ago by Vasu ▴ 800

In any case, you need to merge the data from Data1: the files were run on different lanes but represent the same biological sample.

So something like cat sample1_L1_R1.fastq.gz sample1_L2_R1.fastq.gz > sample1_R1.fastq.gz (that is, join the data from different lanes into one file per biological sample).
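As a rough sketch of that merge extended to both runs, assuming the Data1/Data2 layout from the question: the function name merge_lanes and the output file name are made up for illustration. Concatenated gzip files are themselves valid gzip, so nothing needs to be decompressed first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: concatenate every lane file of one mate (R1 or R2)
# for one sample, across all given run directories, into a single file.
# Usage: merge_lanes <sample> <R1|R2> <out_file> <run_dir>...
merge_lanes() {
    local sample="$1" mate="$2" out="$3" run f
    shift 3
    : > "$out"                          # start from an empty output file
    for run in "$@"; do
        for f in "$run"/fastq_folder/"$sample"/"$sample"_*_"$mate".fastq.gz; do
            [ -e "$f" ] || continue     # glob matched nothing in this run
            cat "$f" >> "$out"          # gzip members can simply be appended
        done
    done
}

# Example call (assumed directory names from the question):
# merge_lanes sample1 R1 sample1_R1.fastq.gz Data1 Data2
# merge_lanes sample1 R2 sample1_R2.fastq.gz Data1 Data2
```

Call it with the same run directories in the same order for R1 and R2, so that the mates stay in matching order in the two merged files.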

ADD REPLY • link 3.3 years ago by lieven.sterck 15k

Yes, I know this. Please check the last line of my post. Merging might take a very long time at my workplace, so I'm looking for an alternative.

ADD REPLY • link 3.3 years ago by Vasu ▴ 800

But now I would also like to use the Second Run files in my script.

You can use the find command with a certain depth, as in this thread: "How to concatenate multiple fastq files (located in different directories) for each sample". Is the sample1 naming consistent across folders and files?

I know I could merge the files first and then run the analysis, but that might take a very long time at my workplace.

Why would this take a long time? It will take up extra space, though, since you will duplicate the data for a while.
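A minimal sketch of the find approach, assuming the Data1/Data2 layout from the question and consistent sample naming across runs. The function name collect_reads is hypothetical, and -mindepth/-maxdepth are GNU/BSD find options rather than POSIX.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list all files of one mate (R1 or R2) for one sample
# across several run directories, sorted so the R1 and R2 lists line up.
# Usage: collect_reads <sample> <R1|R2> <run_dir>...
collect_reads() {
    local sample="$1" mate="$2"
    shift 2
    # Depth 3 below each run directory is <run_dir>/fastq_folder/<sample>/<file>
    find "$@" -mindepth 3 -maxdepth 3 -type f \
        -path "*/fastq_folder/${sample}/*" \
        -name "${sample}_*_${mate}.fastq.gz" | LC_ALL=C sort
}

# Example, run from the directory that contains Data1 and Data2:
# r1=$(collect_reads "$sample" R1 Data1 Data2)
# r2=$(collect_reads "$sample" R2 Data1 Data2)
```

Because both lists are sorted the same way, the n-th R1 file and the n-th R2 file belong to the same run and lane, provided every R1 file has a matching R2 file.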

ADD REPLY • link 3.3 years ago by GenoMax 152k

Not sure if this is what you're asking, but if the runs represent the same biological sample, you can just put them one right after another in kallisto:

kallisto quant -i index.idx -o output/ run1.r1.fq.gz run1.r2.fq.gz run2.r1.fq.gz run2.r2.fq.gz run3.r1.fq.gz run3.r2.fq.gz
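To build that argument list without merging, something like the sketch below might work. It assumes the Data1/Data2 layout from the question and that every R1 file has an R2 mate with the same prefix. The function name interleaved_pairs is hypothetical, and index.idx and the output directory are placeholder names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the R1/R2 paths of one sample, pair by pair
# (R1 then R2), across all given run directories.
# Usage: interleaved_pairs <sample> <run_dir>...
interleaved_pairs() {
    local sample="$1" run r1 r2
    shift
    for run in "$@"; do
        for r1 in "$run"/fastq_folder/"$sample"/"$sample"_*_R1.fastq.gz; do
            [ -e "$r1" ] || continue            # no R1 files in this run
            r2="${r1%_R1.fastq.gz}_R2.fastq.gz" # derive the mate's path
            [ -e "$r2" ] || { echo "missing mate for $r1" >&2; return 1; }
            printf '%s\n%s\n' "$r1" "$r2"
        done
    done
}

# Example call (needs bash 4+ for mapfile; paths must not contain newlines):
# mapfile -t reads < <(interleaved_pairs "$sample" Data1 Data2)
# kallisto quant -i index.idx -o "output/$sample" "${reads[@]}"
```

Run it from the directory that contains Data1 and Data2, or pass full paths to the run directories.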

ADD REPLY • link 3.3 years ago by dsull ★ 7.6k