Question

Paired-end sequencing with inconsistent quality/number of reads


6 months ago

anissa • 0

Hi everyone,

I am currently dealing with a dataset from paired-end sequencing (Illumina NextSeq1k2k, 2 lanes, 101 cycles per read). After running the nf-core/rnaseq pipeline with STAR/salmon to quantify gene counts, I got a MultiQC report showing that the files have very different numbers of reads. I can't see a pattern on either end (see the histogram of STAR alignment scores per sample). I tried searching for this on Google and here, but I couldn't find any similar post.

I suspect there was a problem during the sequencing, but I don't know what exactly (I am not an expert in NGS; I only have a general knowledge of sequencing). Could you explain what the origin of the problem could be? Is it possible to discard the poor-quality end and keep 'single-end-like' data of better quality?

Thanks in advance for your help.

PS: any resource to help me understand all of the steps of sequencing RNA would be useful!

paired-end NGS STAR flowcell alignment • 830 views



This is a question for the people doing the benchwork. You would have to look at the QC to see how they normalized the libraries. There is nothing you can do at your end other than drop the samples with the fewest reads, and maybe downsample the ones with the most (but that might not even be necessary).
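To illustrate the drop/downsample idea, here is a minimal sketch that turns per-sample read counts into a plan. The function name, the sample names, and both thresholds (`min_reads`, `cap_reads`) are hypothetical; sensible cut-offs depend on the experiment.

```python
def plan_depth_balancing(read_counts, min_reads, cap_reads):
    """Decide which samples to drop and how much to downsample the rest.

    read_counts: dict of sample name -> number of read pairs.
    min_reads:   samples below this depth are dropped (assumed cut-off).
    cap_reads:   samples above this depth are downsampled to it (assumed cap).
    Returns (sorted list of dropped samples, dict sample -> fraction to keep).
    """
    dropped = sorted(s for s, n in read_counts.items() if n < min_reads)
    fractions = {}
    for sample, n in read_counts.items():
        if n < min_reads:
            continue  # too shallow to keep
        # Keep everything unless the sample exceeds the cap.
        fractions[sample] = min(1.0, cap_reads / n)
    return dropped, fractions


# Hypothetical read-pair counts, e.g. copied from a MultiQC table.
counts = {"A": 2_000_000, "B": 20_000_000, "C": 60_000_000}
dropped, fractions = plan_depth_balancing(counts, 5_000_000, 30_000_000)
print(dropped)    # samples to exclude
print(fractions)  # fraction of read pairs to keep per remaining sample
```

The fractions can then be passed to a subsampling tool. With paired-end data, use the same random seed for both mate files so that the pairs stay in sync.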

Running the exact same library multiple times does not cause batch artifacts, so you could ask the people who load the instrument to run these again, using the read counts of the fastqs to rebalance them.
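As a rough sketch of that rebalancing, one can scale each library's pooling volume by the ratio of the target read count to the observed one. This assumes reads are roughly proportional to the volume pooled, which is a simplification; the function name and all numbers are hypothetical.

```python
def rebalance_volumes(read_counts, volumes_ul):
    """Suggest new pooling volumes so every library aims at the mean depth.

    read_counts: dict of sample -> reads obtained in the first run.
    volumes_ul:  dict of sample -> volume (in µL) pooled in the first run.
    Assumes reads scale linearly with pooled volume (a simplification).
    """
    if any(n <= 0 for n in read_counts.values()):
        raise ValueError("every library needs a positive read count")
    target = sum(read_counts.values()) / len(read_counts)
    return {
        sample: volumes_ul[sample] * target / read_counts[sample]
        for sample in read_counts
    }


# Hypothetical first-run results: equal volumes, unequal read counts.
first_run_reads = {"A": 10_000_000, "B": 40_000_000}
first_run_volumes = {"A": 5.0, "B": 5.0}
print(rebalance_volumes(first_run_reads, first_run_volumes))
```

Under-represented libraries get a larger volume and over-represented ones a smaller volume. The people preparing the pool would still need to check that the suggested volumes are practical to pipette.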

6 months ago by swbarnes2 14k

score 1 · Answer 1 · 2024-10-24

I suspect there was a problem during the sequencing

No, there was no obvious "problem". This is a result of how the pool was made from the individual libraries. The libraries were of different concentrations (had varying amounts of material), and that is why you ended up with different numbers of reads per sample after demultiplexing, based on what was in the pool.

There are ways to make a balanced pool that gives equivalent read numbers for all samples (by doing qPCR on the libraries and/or by running a MiSeq nano run to get the actual number of reads per library, so the pool can then be adjusted by adding more of certain libraries to balance things out). These additional steps require effort, and sequencing centers generally charge extra for them.
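For the qPCR route, the arithmetic behind a balanced pool is simple: since 1 nM equals 1 fmol/µL, the volume needed to contribute a given amount of each library is the amount divided by its concentration. The sketch below assumes concentrations in nM and a hypothetical target of 20 fmol per library; it is not any sequencing center's actual protocol.

```python
def equimolar_volumes(conc_nm, fmol_per_library):
    """Volume (µL) of each library that contributes the same molar amount.

    conc_nm:          dict of sample -> library concentration in nM (fmol/µL).
    fmol_per_library: amount each library should contribute (assumed target).
    """
    if any(c <= 0 for c in conc_nm.values()):
        raise ValueError("concentrations must be positive")
    return {sample: fmol_per_library / c for sample, c in conc_nm.items()}


# Hypothetical qPCR results for two libraries.
concentrations = {"A": 10.0, "B": 2.5}
print(equimolar_volumes(concentrations, 20.0))
```

The more dilute library needs proportionally more volume. Pooling equal volumes of libraries with different concentrations is what produces unequal read numbers.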

Differential expression analysis programs will try to account for such differences in read numbers, but if the differences are large (an order of magnitude or more), you may need to take that into account when doing the data analysis.
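One common way such programs correct for depth is a per-sample size factor. The sketch below shows the median-of-ratios idea (the approach popularized by DESeq) in plain Python. It is an illustration with made-up counts, not the code any particular package runs.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict of sample -> list of raw gene counts (same gene order).
    Genes with a zero count in any sample are skipped, since their
    geometric mean across samples is zero.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Log geometric mean of each usable gene across all samples.
    log_geo_means = {}
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means[g] = sum(math.log(v) for v in values) / len(values)
    if not log_geo_means:
        raise ValueError("no gene has non-zero counts in every sample")
    # Size factor = median ratio of a sample's counts to the gene-wise means.
    factors = {}
    for s in samples:
        log_ratios = [math.log(counts[s][g]) - m for g, m in log_geo_means.items()]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Made-up counts: sample B was sequenced four times deeper than sample A.
raw = {"A": [10, 20, 30, 0], "B": [40, 80, 120, 5]}
print(size_factors(raw))
```

Dividing each sample's counts by its size factor puts the samples on a comparable scale. A very shallow sample still has noisier counts, though, which is why large depth differences deserve attention beyond normalization.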

There are plenty of resources to understand how RNA is sequenced. Here is a random video on the topic: YouTube LINK