Hi all,
DISCLAIMER: Please bear with me. I am doing a bioinformatics internship in a department with very little bioinformatics knowledge, so I am a bit on my own here and may ask some very basic questions. Apologies in advance :)
I have a question regarding an RNA-seq experiment performed on a NovaSeq 6000. I have searched this forum quite extensively but still haven't come to a complete understanding.
My supervisor has given me access to the raw data files and I am trying to make sense of them before doing any analysis. In total, 24 samples were submitted for sequencing. The folder contains 129 .fastq.gz files and I have a few questions about this:
- For each sample, the folder contains read 1 (R1), UMI (R2) and read 2 (R3) files. I should proceed using only the R1 and R3 files, correct?
- Three different flowcells were used (all NovaSeq 6000). For some samples, fastq files have been generated for multiple flowcells. Why is this? And how should I account for this in my analysis? Can I simply merge them? Do I need to account for batch effects?
Hope you can help this semi-desperate individual out!
Kind regards,
Theo
Thanks for your response!
Yes, the ultimate aim of my analysis is just a simple DEG analysis. Is there a reason to merge at a later stage rather than prior to analysis? Is it more beneficial for parallelisation? I'm trying to wrap my head around this and come up with a simple protocol without drowning in all the countless options out there, so I guess I could go without using the UMIs...
Kind regards,
Theo
Correct. If you have access to a compute cluster, you can align/process the file pieces for each sample in parallel and then merge and sort the resulting BAM files per sample before counting.
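To make the "pieces per sample" idea concrete, here is a minimal sketch of how the files could be grouped before alignment. The filename pattern (`<sample>_<flowcell>_<lane>_<read>.fastq.gz`) and the function name `group_fastqs` are assumptions for illustration only; your facility's naming scheme will likely differ, so adjust the regular expression accordingly. It pairs R1 with R3 and leaves the UMI (R2) files out.

```python
import re
from collections import defaultdict

# Hypothetical naming scheme: <sample>_<flowcell>_<lane>_<read>.fastq.gz
PATTERN = re.compile(
    r"^(?P<sample>.+)_(?P<flowcell>[^_]+)_(?P<lane>L\d+)_(?P<read>R[123])\.fastq\.gz$"
)

def group_fastqs(filenames):
    """Return {sample: [(flowcell, lane, r1_file, r3_file), ...]}.

    Each tuple is one piece that can be aligned independently.
    UMI (R2) files are parsed but not included in the output.
    """
    pieces = defaultdict(dict)
    for name in filenames:
        m = PATTERN.match(name)
        if m is None:
            continue  # not a FASTQ file following the assumed scheme
        key = (m["sample"], m["flowcell"], m["lane"])
        pieces[key][m["read"]] = name

    by_sample = defaultdict(list)
    for (sample, flowcell, lane), reads in sorted(pieces.items()):
        if "R1" not in reads or "R3" not in reads:
            raise ValueError(f"Incomplete pair for {sample} {flowcell} {lane}")
        by_sample[sample].append((flowcell, lane, reads["R1"], reads["R3"]))
    return dict(by_sample)
```

A sample sequenced on two flowcells then shows up with two entries in its list. Each entry can be aligned as a separate job, and the per-sample BAM files merged afterwards.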
Look into the nf-core RNA-seq pipeline as an option for a proper workflow: https://nf-co.re/rnaseq/3.14.0
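If you go the nf-core route, the pipeline takes a CSV samplesheet as input. As far as I know, version 3.14.0 expects the columns `sample`, `fastq_1`, `fastq_2` and `strandedness`, and rows sharing the same `sample` value are treated as re-sequencing runs of one sample and combined by the pipeline. Please check the usage docs to confirm. Below is a rough sketch of writing such a sheet; the function name `make_samplesheet` and the example filenames are made up for illustration, and R3 is used as the second read on the assumption that R2 holds the UMIs.

```python
import csv
import io

def make_samplesheet(pieces, strandedness="auto"):
    """Build samplesheet text from (sample, r1_file, r3_file) tuples.

    Pieces of the same sample from different flowcells simply become
    separate rows with the same sample name.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["sample", "fastq_1", "fastq_2", "strandedness"])
    for sample, r1_file, r3_file in pieces:
        writer.writerow([sample, r1_file, r3_file, strandedness])
    return buf.getvalue()

# Example with made-up filenames: one sample sequenced on two flowcells.
sheet = make_samplesheet([
    ("S01", "S01_FCA_R1.fastq.gz", "S01_FCA_R3.fastq.gz"),
    ("S01", "S01_FCB_R1.fastq.gz", "S01_FCB_R3.fastq.gz"),
])
print(sheet)
```

Write the returned text to a file and pass that file as the pipeline's input. Whether and how the UMIs are used is a separate pipeline option, so this sheet on its own ignores them.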
Very helpful, I will take a look. Thanks again!