Question

RNAseq .fastq files on multiple flowcells, how to proceed

13 months ago

christiantd • 0

Hi all,

DISCLAIMER: Please bear with me, as I am doing a bioinformatics internship at a department that has very little bioinformatics knowledge. I am a bit on my own here and may therefore ask very stupid questions. Please accept my apologies beforehand :)

I have a question regarding an RNA-seq experiment performed on a NovaSeq 6000. I have searched this forum quite thoroughly but still haven't come to a complete understanding.

My supervisor has given me access to the raw data files and I am trying to make sense of them prior to doing any analysis. In total, 24 samples were submitted for sequencing. The folder contains 129 .fastq.gz files and I have a few questions about this:

For each sample, the folder contains read 1 (R1), UMI (R2) and read 2 (R3) files. I should proceed utilizing only the R1 and R3 files, correct?
Three different flowcells were used (all NovaSeq 6000). For some samples, fastq files have been generated for multiple flowcells. Why is this? And how should I account for this in my analysis? Can I simply merge them? Do I need to account for batch effects?

Hope you can help this semi-desperate individual out!

Kind regards,
Theo

rna-seq illumina novaseq fastq • 926 views

score 0 · Answer 1 · 2024-06-24

13 months ago

GenoMax 153k

For some samples, fastq files have been generated for multiple flowcells. Why is this? And how should I account for this in my analysis? Can I simply merge them? Do I need to account for batch effects?

It is possible to run the same sample library (more typically a pool of libraries) on multiple flowcells to gather an adequate amount of data. This is "technical" replication of sequencing. You can keep track of additional information about flowcells/lanes using a concept called "Read Groups" (LINK). Depending on the ultimate aim, it may or may not be necessary to use read groups. If you are doing a simple differential expression analysis, then it may be fine to merge the data at some point in the process (this can be done after alignment, at the BAM stage).
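
As a rough illustration of what a read group records, the flowcell and lane can be read from the first FASTQ header of each file. The Python sketch below assumes the standard Illumina header layout (`@instrument:run:flowcell:lane:tile:x:y ...`). The function names, the sample label, and the choice of `ID`/`PU` values are illustrative assumptions, not something taken from this dataset.

```python
import gzip


def first_header(fastq_gz_path):
    """Return the first header line of a gzipped FASTQ file."""
    with gzip.open(fastq_gz_path, "rt") as handle:
        return handle.readline().strip()


def read_group_from_header(header, sample):
    """Build a SAM @RG header line from an Illumina FASTQ header.

    Assumes the colon-separated layout
    @instrument:run:flowcell:lane:tile:x:y [read:filtered:control:index]
    """
    fields = header.lstrip("@").split(" ")[0].split(":")
    if len(fields) < 7:
        raise ValueError(f"unexpected header layout: {header!r}")
    flowcell, lane = fields[2], fields[3]
    return "\t".join([
        "@RG",
        f"ID:{sample}.{flowcell}.{lane}",  # unique per sample/flowcell/lane
        f"SM:{sample}",                    # sample name (hypothetical label)
        "PL:ILLUMINA",
        f"PU:{flowcell}.{lane}",           # platform unit
    ])
```

Calling `read_group_from_header(first_header(path), sample)` on each file gives one read group per sample/flowcell/lane combination. How the line is passed on depends on the aligner.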

the folder contains read 1 (R1), UMI (R2) and read 2 (R3) files. I should proceed utilizing only the R1 and R3 files, correct?

Are you certain the data you are looking at has UMIs? If you want to make use of UMIs (https://dnatech.genomecenter.ucdavis.edu/faqs/what-are-umis-and-why-are-they-used-in-high-throughput-sequencing/ ) then your analysis protocol may need to become a little more complicated. An example workflow is here: https://broadinstitute.github.io/warp/docs/Pipelines/RNA_with_UMIs_Pipeline/README/
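
One quick sanity check on whether the R2 files hold UMIs is to look at their read lengths: a UMI or index read is usually short and uniform in length, while a biological read is much longer. Here is a minimal Python sketch, assuming gzipped FASTQ files with four lines per record. The function name and the default of 10,000 reads are arbitrary choices for illustration. A short, uniform length is only a hint; the library prep kit documentation or the sequencing facility is the definitive source.

```python
import gzip
from collections import Counter
from itertools import islice


def read_length_counts(fastq_gz_path, max_reads=10000):
    """Count sequence lengths among the first max_reads FASTQ records."""
    counts = Counter()
    with gzip.open(fastq_gz_path, "rt") as handle:
        # Each FASTQ record is 4 lines; the sequence is the second one.
        for i, line in enumerate(islice(handle, max_reads * 4)):
            if i % 4 == 1:
                counts[len(line.strip())] += 1
    return counts
```

Running this on one R1, one R2 and one R3 file from the same sample and comparing the results should show at a glance which file is the odd one out.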

Thanks for your response!

Yes, the ultimate aim of my analysis is just a simple DEG analysis. Is there a reason to merge at a later stage rather than prior to analysis? Is it more beneficial for parallelisation? I'm trying to wrap my head around this and come up with a simple protocol without drowning in all the countless options out there, so I guess I could go without utilising UMIs...

Kind regards,

Theo

13 months ago by christiantd • 0

Is it more beneficial for parallelisation?

Correct. If you have access to a compute cluster then you can align/process the file pieces for each sample in parallel and then merge/sort the BAM before counting.
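
The merge step could look roughly like this when driven from Python. This is a sketch that assumes `samtools` is installed and on the PATH, and that each per-flowcell BAM is already coordinate-sorted. The function names and file names are hypothetical.

```python
import subprocess


def samtools_merge_command(merged_bam, piece_bams, threads=4):
    """Build the samtools command that merges sorted BAM pieces into one file."""
    if not piece_bams:
        raise ValueError("need at least one input BAM")
    # -f overwrites an existing output file, -@ sets additional threads.
    return ["samtools", "merge", "-f", "-@", str(threads), merged_bam, *piece_bams]


def merge_sample(merged_bam, piece_bams, threads=4):
    """Run the merge; raises CalledProcessError if samtools fails."""
    subprocess.run(samtools_merge_command(merged_bam, piece_bams, threads), check=True)
```

For example, `merge_sample("S1.merged.bam", ["S1.flowcellA.bam", "S1.flowcellB.bam"])` would produce one BAM per sample, ready for counting.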

Look into the nf-core RNA-seq pipeline as an option for a proper workflow: https://nf-co.re/rnaseq/3.14.0