Question

Normalizing Illumina fastq read-lengths from different GEO datasets

1

7.3 years ago

rbronste ▴ 420

When working with fastq files from different published datasets that have different read lengths (e.g. 50bp or 75bp), is there a good way to normalize them for comparison with your own data, either before or after alignment? Thanks.

GEO fastq RNA-Seq • 2.1k views

score 2 · Answer 1 · 2017-08-04

2

7.3 years ago

GouthamAtla 12k

Do you see any systematic bias associated with read length, for example in a PCA plot? If so, you can apply batch correction methods.
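
As an illustration only (not part of the original answer), here is a minimal Python sketch of the simplest form of batch correction: removing each batch's per-gene offset from a log-expression matrix. The function name `center_by_batch` and the toy values are hypothetical, and read length is treated as the batch label purely as an assumption. Established tools such as ComBat (sva) or limma's removeBatchEffect do considerably more than this.

```python
import numpy as np

def center_by_batch(log_expr, batch):
    """Remove per-batch gene offsets from a samples x genes log-expression matrix.

    Each gene keeps its overall mean; only the batch-specific shift is removed.
    """
    log_expr = np.asarray(log_expr, dtype=float)
    batch = np.asarray(batch)
    corrected = log_expr.copy()
    grand_mean = log_expr.mean(axis=0)
    for b in np.unique(batch):
        rows = batch == b
        # Shift this batch so its per-gene mean matches the overall mean.
        corrected[rows] -= log_expr[rows].mean(axis=0) - grand_mean
    return corrected

# Toy data (made up): 4 samples x 3 genes, two read-length batches.
log_expr = np.array([
    [5.0, 2.0, 7.0],
    [5.2, 2.1, 7.1],
    [6.0, 3.0, 8.0],
    [6.2, 3.1, 8.1],
])
read_length = np.array(["50bp", "50bp", "75bp", "75bp"])

corrected = center_by_batch(log_expr, read_length)
print(corrected)
```

Note that if the batch label is confounded with the biological condition of interest, this kind of centering removes the biology along with the batch effect, so it is only sensible when conditions are represented in each batch.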

0

I have not approached the problem via PCA yet, but that is a good idea. I assumed that when datasets from different sources are repurposed for one's own study (straight from the fastq files in GEO), the differing read lengths off the Illumina machine would confound coverage when compared to our 75bp reads. Maybe I am misled in looking at it that way?

Reply • 7.3 years ago by rbronste ▴ 420

1

I think factors other than read length play a bigger role, such as different library prep methods, different platforms, time, etc. So it's better to look at the PCA plot with all the available sample information to identify the major confounding factors and correct for them.

Reply • 7.3 years ago by GouthamAtla 12k

0

Is there a particular PCA tool you would use in this case?

Reply • 7.3 years ago by rbronste ▴ 420

0

Anything should work. It's just PCA. The only requirement is that you quantify all the samples and build the expression matrix. Maybe you can feed it into DESeq2, which has a PCA function and many tutorials available.
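
As an illustration only (not part of the original reply, and not DESeq2's own procedure), here is a minimal Python sketch of "quantify, build the matrix, run PCA". It assumes a samples x genes matrix of raw counts; the function name `pca_scores`, the log2(CPM + 1) transform, and the toy counts are all assumptions made for the example.

```python
import numpy as np

def pca_scores(counts, n_components=2):
    """PCA of a samples x genes count matrix via SVD on log-transformed counts."""
    counts = np.asarray(counts, dtype=float)
    # Scale to counts per million so sequencing depth does not dominate.
    cpm = counts / counts.sum(axis=1, keepdims=True) * 1e6
    log_cpm = np.log2(cpm + 1.0)
    # Center each gene across samples before the decomposition.
    centered = log_cpm - log_cpm.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]

# Toy counts (made up): 4 samples x 4 genes, with known read lengths.
counts = np.array([
    [100, 200, 50, 10],
    [110, 190, 55, 12],
    [300,  80, 50, 10],
    [290,  90, 45, 11],
])
read_length = ["50bp", "50bp", "75bp", "75bp"]

scores, explained = pca_scores(counts)
for label, (pc1, pc2) in zip(read_length, scores):
    print(f"{label}\tPC1={pc1:.2f}\tPC2={pc2:.2f}")
print("variance explained:", explained)
```

If the samples separate along PC1 or PC2 by read length, library prep, or platform rather than by biological condition, that covariate is a candidate confounder to correct for.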

Reply • 7.3 years ago by GouthamAtla 12k