Question

Comparison of samples to find the best parameters for RNA-seq

0

3.7 years ago

lluc.cabus ▴ 20

Hi everyone,

I'm trying to work out which RNA-seq parameters work best for my samples. I have 9 samples that I sequenced with 20M reads, 150bp, paired-end. I wanted to see which parameters generate more accurate results (paired-end vs single-end, 150bp vs 100bp vs 50bp, and 20M vs 10M reads).

To do that, I hard-trimmed the reads (to obtain 100bp and 50bp), took only the first 10M reads instead of the full 20M, and took only the first fastq file (to simulate single-end). I then ran the same programs on each set: trim_galore for the trimming, STAR for the mapping and RSEM for the quantification, all with the same parameters (changing only the ones related to paired-end).

The results are that starting from 50bp reads generates more transcript counts than starting from 150bp reads, and that single-end generates more transcript counts than paired-end. I'm a bit concerned, since I don't know how single-end reads could generate more transcripts than paired-end, and I think I'm analyzing something wrong. Do you know how I could do this type of analysis?

Thank you all very much, Lluc

NGS rna fastq R sequencing • 2.3k views

updated 3.7 years ago by Istvan Albert 102k • written 3.7 years ago by lluc.cabus ▴ 20

1

3.7 years ago

ATpoint 86k

Longer reads are always better than shorter ones, more depth increases statistical power, and paired-end is almost always preferred over single-end because it improves coverage and alignment accuracy. I see little point in trying this out, especially because you would need a ground-truth set for the benchmarking; without one it is rather pointless. My recommendation would be to use the paired-end 150bp data with all available reads, which is pretty much the standard in the field.

3.7 years ago by ATpoint 86k

0

Yes, I was trying this in order to decide on the parameters for our next sequencing experiments, and to see whether we could reduce the depth or use shorter reads to cut costs. Sorry if I didn't explain myself clearly.

3.7 years ago by lluc.cabus ▴ 20

1

Ok, I see. Then I suggest using the PE-150-20M data as the "gold standard" and comparing everything else to it, as this is the best possible combination.
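One way to make that comparison concrete is to correlate each reduced run's counts against the PE-150-20M counts, gene by gene. The snippet below is only a rough sketch: the function name `log_pearson`, the gene names, and all the counts are invented for illustration, and the log2 Pearson correlation is just one reasonable choice of agreement metric, not something prescribed in this thread. In practice the dictionaries would be filled from the RSEM output of each run.

```python
import math


def log_pearson(reference, candidate, pseudocount=1.0):
    """Pearson correlation of log2(count + pseudocount) over shared genes."""
    genes = sorted(set(reference) & set(candidate))
    if len(genes) < 2:
        raise ValueError("need at least two shared genes")
    x = [math.log2(reference[g] + pseudocount) for g in genes]
    y = [math.log2(candidate[g] + pseudocount) for g in genes]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    if denom == 0:
        raise ValueError("counts are constant, correlation is undefined")
    return cov / denom


# Invented toy counts, NOT real data: the "gold standard" run and two reduced runs.
gold = {"geneA": 1200, "geneB": 35, "geneC": 0, "geneD": 410}
runs = {
    "PE-100-20M": {"geneA": 1180, "geneB": 37, "geneC": 0, "geneD": 402},
    "SE-50-10M": {"geneA": 700, "geneB": 60, "geneC": 12, "geneD": 260},
}

for name, counts in runs.items():
    print(name, round(log_pearson(gold, counts), 3))
```

A run that stays close to 1.0 against the gold standard loses little information; a run that drifts away is probably not a safe way to save money.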

3.7 years ago by ATpoint 86k

0

When it comes to any method that "counts fragments", paired-end reads add substantially to the cost.

That is because, at the same coverage, single-end reads will sample twice as many fragments, and the statistical power increases accordingly.

When you use paired-end reads, the same fragment is read twice (from each end), thus you lose the independence of the measurements.
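As a back-of-the-envelope illustration of that point, assuming a fixed budget of sequenced reads in which every paired-end fragment uses up two reads (the function name and the 20M budget here are just for illustration):

```python
def fragments_sampled(total_reads: int, paired_end: bool) -> int:
    """Number of distinct fragments observed for a fixed budget of reads.

    Assumption: a paired-end fragment consumes two reads (one per end),
    a single-end fragment consumes one.
    """
    return total_reads // 2 if paired_end else total_reads


budget = 20_000_000  # hypothetical read budget
print("single-end fragments:", fragments_sampled(budget, paired_end=False))
print("paired-end fragments:", fragments_sampled(budget, paired_end=True))
```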

In general, and considering the realities of funding and resource availability, I would only recommend paired-end reads when identifying novel transcripts is of importance.

For all other cases, generally speaking, the benefits of paired-end reads do not justify their cost.

The quality of the RNA and the library preparation will have a larger effect, and usually you can get excellent results with 100-150bp single-end reads.

3.7 years ago by Istvan Albert 102k

score 4 · Accepted Answer · 2021-03-31

a bit concerned, since I don't know how single-end reads could generate more transcripts

Think of it this way: imagine you have a transcript like:

ATAGCATG

now take a really short read from it, a length of say just one base:

A

how many places does it match? Three. Now make it two base pairs long:

AT

how many places does it match? Two. Now make it three base pairs long:

ATG

how many places does it match? One.

See how longer reads offer more specificity and won't match in incorrect locations. Shorter reads will generate false positives.
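The toy example above can be checked with a few lines of Python. This is only a sketch of exact substring matching on the example transcript; the function name `count_matches` is made up here, and a real aligner such as STAR works very differently (it tolerates mismatches, handles splicing, and so on).

```python
def count_matches(read: str, transcript: str) -> int:
    """Count every start position (overlaps allowed) where read matches exactly."""
    return sum(
        1
        for start in range(len(transcript) - len(read) + 1)
        if transcript[start:start + len(read)] == read
    )


transcript = "ATAGCATG"
for read in ("A", "AT", "ATG"):
    print(read, count_matches(read, transcript))
```

The number of candidate locations drops from three to two to one as the read grows from one base to three.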

The increased counts for shorter reads indicate a higher error rate: those extra reads would not match if they were longer and had to be more similar to the "real" transcript.

Your hard-trimming experiment explored the artifacts that appear when the reads are short.