Question

sequencing coverage information: Trying to understand the data format


4.8 years ago

mito ▴ 10

Hi,

I want to determine how well different regions of the genome are sequenced, based on the gnomAD data. I have downloaded the following genome-wide tabular data; the first few rows are shown here:

chrom  pos    mean        median  over_1      over_5      over_10      over_(and so on...)
1      12141  2.9005e-02  0       2.1939e-02  1.0547e-04  0.0000e+00
1      12142  2.9216e-02  0       2.1622e-02  1.0547e-04  0.0000e+00
1      12143  2.7951e-02  0       2.1200e-02  1.0547e-04  0.0000e+00
1      12144  2.9111e-02  0       2.1728e-02  1.0547e-04  0.0000e+00
1      12145  2.9216e-02  0       2.1833e-02  1.0547e-04  0.0000e+00
1      12146  2.6790e-02  0       2.0251e-02  1.0547e-04  0.0000e+00
1      12147  3.2802e-02  0       2.4048e-02  1.0547e-04  0.0000e+00
1      12148  3.3330e-02  0       2.4470e-02  1.0547e-04  0.0000e+00
1      12149  3.4279e-02  0       2.4786e-02  0.0000e+00  0.0000e+00

However, I have not been able to find out:

What the coverage numbers actually mean. I am aware of this question. Coverage appears to have several possible meanings beyond just the number of reads overlapping a region. Is there any way to determine which one applies here?
What over_1, over_5, over_10 (and so on) mean. Do these refer to the mean or median over n neighboring positions? And if so, is it n positions on each side, or a centered window of n positions in total?

Is this a standard TSV-based data format, or is it gnomAD-specific?

coverage gnomad • 861 views


score 2 · Accepted Answer · 2020-01-28

I figured it out. I had not noticed the corresponding FAQ entry before. After a bit of thinking, I came to the following conclusions:

They used multiple samples, each with its own coverage.
The coverage of a sample at a position is the number of reads overlapping that position (an integer).
The mean and median columns are computed from the coverage across samples at that position.
The over_n columns give the fraction of samples that have a coverage of at least n at that position.
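To make these conclusions concrete, here is a minimal sketch of how one row of such a table could be derived from per-sample depths at a single position. The function name `summarize_position` and the ten depth values are made up for illustration; this is not gnomAD's actual pipeline, only the interpretation described above.

```python
from statistics import mean, median


def summarize_position(depths, thresholds=(1, 5, 10)):
    """Summarize per-sample read depths at one genomic position.

    depths: one integer per sample (reads overlapping the position).
    Returns the mean, the median, and for each threshold n the
    fraction of samples with depth >= n (the "over_n" value).
    """
    n_samples = len(depths)
    summary = {"mean": mean(depths), "median": median(depths)}
    for t in thresholds:
        summary[f"over_{t}"] = sum(d >= t for d in depths) / n_samples
    return summary


# Hypothetical depths for 10 samples at one position (illustrative only).
depths = [0, 0, 0, 0, 0, 0, 0, 1, 2, 12]
row = summarize_position(depths)
```

With most samples at zero depth, the median is 0 while the mean stays small but nonzero, which is the same pattern as in the rows shown in the question.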

And it seems that there are a lot of samples where the coverage is close to zero.
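As a rough check of that observation, the over_1 column can be turned into the fraction of samples with no read at all at a position (1 - over_1). The sketch below assumes a tab-separated file with the column names shown in the question; it uses two of the rows from the question, truncated to the columns visible there, and the helper name `fraction_uncovered` is hypothetical.

```python
import csv
import io

# Two rows copied from the question, limited to the columns shown there.
sample = """chrom\tpos\tmean\tmedian\tover_1\tover_5\tover_10
1\t12141\t2.9005e-02\t0\t2.1939e-02\t1.0547e-04\t0.0000e+00
1\t12149\t3.4279e-02\t0\t2.4786e-02\t0.0000e+00\t0.0000e+00
"""


def fraction_uncovered(tsv_text):
    """Map (chrom, pos) to the fraction of samples with zero reads there."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        (r["chrom"], int(r["pos"])): 1.0 - float(r["over_1"])
        for r in reader
    }


uncovered = fraction_uncovered(sample)
```

For position 12141 this gives roughly 0.978, meaning about 98% of samples have no read there, which is consistent with the median of 0.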