Question

Expected number of raw reads in fq.gz files in PE sequencing

0

19 months ago

Lada ▴ 40

Hi guys,

I have a ridiculously fundamental question so can someone please help me out?

Background: An external company did RNA-seq for me. I ordered 20M read depth, paired-end 150 bp seq.

Question: Am I supposed to expect 20M or 10M reads in both R1 and R2 fq.gz files?

The company sent me a Data Quality Summary with the total raw read count per sample (it does not say whether this is R1, R2, or both; I just have a sample name). I also double-checked the total number of reads per fastq.gz file on my cluster. For some samples, the R1 and R2 counts each equal the raw read count reported by the sequencing company (roughly 20M); for other samples, the number in R1 and R2 is twice the reported count, so I am totally confused.

I am talking about different sequencing projects so maybe the reporting scheme changed. I asked them but no reply so far.

Tnx!

RNA-seq fastq transcriptomics • 1.5k views

ADD COMMENT • link updated 19 months ago by Ram 45k • written 19 months ago by Lada ▴ 40

1

An external company did RNAseq for me

ask the company

ADD REPLY • link 19 months ago by Pierre Lindenbaum 166k

0

yeah I did, still no answer + not very satisfied with their feedback on technical stuff in general so I wanted to check with the community

ADD REPLY • link 19 months ago by Lada ▴ 40

1

Question: Am I supposed to expect 20M or 10M reads in both R1 and R2 fq.gz files?

That is ambiguous and an eternal source of confusion, so you always have to make sure that everyone involved has the same idea of what counts as a read and what does not.

ADD REPLY • link 19 months ago by WouterDeCoster 47k

score 1 · Accepted Answer · 2023-08-25

1

19 months ago

GenoMax 150k

I ordered 20M read depth, paired-end 150 bp seq.

20M is simply the number of reads. There is no depth dimension. Each library fragment can produce two reads, one from each end. Let us say you wanted to get 20M distinct library fragments sequenced. If you only do single-end sequencing, you will get 20M reads from these fragments (each of which forms a cluster). If you do paired-end sequencing, you will end up with 40M total reads. In either case you sequenced 20M library fragments. Single-end sequencing will only result in one R1 file, whereas with paired-end sequencing you will get R1 and R2 files, each containing 20M reads.

Illumina traditionally counts reads from both ends, so you will see 2x the library fragment/cluster number reported in flowcell output specifications.
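To make the arithmetic concrete, here is a minimal sketch for counting records in a gzipped FASTQ file and relating the R1/R2 counts to fragments and total reads. The function names `count_fastq_reads` and `summarize_pair` are made up for illustration, and the code assumes the usual 4-lines-per-record FASTQ layout (no wrapped sequence lines).

```python
import gzip


def count_fastq_reads(path):
    """Count records in a gzipped FASTQ file, assuming 4 lines per record."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        # Assumption: an intact, unwrapped FASTQ has a line count divisible by 4.
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def summarize_pair(r1_reads, r2_reads):
    """Relate per-file read counts to fragments and total reads for a PE run."""
    if r1_reads != r2_reads:
        # Raw paired-end files should hold one read per fragment in each file.
        raise ValueError("R1 and R2 counts differ; the files may be incomplete")
    return {"fragments": r1_reads, "total_reads": r1_reads + r2_reads}


# 20M fragments sequenced paired-end: 20M reads per file, 40M reads in total.
print(summarize_pair(20_000_000, 20_000_000))
```

In practice you would call `count_fastq_reads` once on the R1 file and once on the R2 file (the paths are whatever your provider delivered) and pass both results to `summarize_pair`.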

ADD COMMENT • link 19 months ago by GenoMax 150k

0

Yes, I totally agree, seq depth in terms of RNASeq is not the best expression at all, I just got used to it but sure, we should use the appropriate scientific language as much as possible.

You confirmed my presumptions, thank you so much!

I just think the company sometimes expresses the raw read count per sample as 2x and sometimes as 1x, but I'll try to get the information from them too.

When I look at the raw read count for each fq.gz (R1 or R2) I get the expected numbers so I guess I'm good to go for the downstream analysis.
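For anyone hitting the same 1x vs 2x confusion, a rough sketch of how one might guess which convention a reported number follows, given the read count of a single file. The function name `reporting_convention` and the 2% tolerance are arbitrary choices for illustration, not anything the company or Illumina defines.

```python
def reporting_convention(reported, reads_per_file, rel_tol=0.02):
    """Guess whether a reported count matches one file or R1 + R2 combined."""

    def close(a, b):
        # Relative comparison, since summary tables are often rounded.
        return abs(a - b) <= rel_tol * b

    if close(reported, reads_per_file):
        return "per_file"      # count of fragments / read pairs
    if close(reported, 2 * reads_per_file):
        return "both_ends"     # R1 and R2 reads added together
    return "unclear"


print(reporting_convention(20_000_000, 20_000_000))
print(reporting_convention(40_000_000, 20_000_000))
```

Anything that comes back as "unclear" is worth raising with the provider rather than guessing.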

ADD REPLY • link 19 months ago by Lada ▴ 40