Question

UK Biobank l2r data holds segment means but no segment locations?

6.0 years ago

tohc • 0

I've been looking at UK Biobank data, and it seems to hold the segment mean l2r (log2 ratio) values for copy number variation but not the segment start and end positions. Each file covers one chromosome and contains all 500,000 participants. Does anyone know where we might find the actual chromosomal locations the values correspond to?

genome biobank cnv • 1.9k views

updated 5.8 years ago by Eric T. ★ 2.8k • written 6.0 years ago by tohc • 0

biobank, cnv seem to be tags relevant to your post, whereas genome doesn't really say much. Please add relevant tags so people following those tags would get to see your posts.

6.0 years ago by Ram 45k

Added the tags you mentioned

6.0 years ago by tohc • 0

Well, in addition to Ram's comments, which data are you looking at, exactly? You have provided no links. If you let me know where you obtained your data, I can probably just contact the relevant person directly.

6.0 years ago by Kevin Blighe 89k

Hey Kevin,

I can't give you a link to the data itself, since UK Biobank has a strict policy on how data is distributed, but this is the project website: UK Biobank. A lot of the documentation seems to be about raw sequencing reads; however, from my understanding, the inferred l2r CNV data is derived from those files.

This is the link to the actual instructions for data download: Resource 664. The data we are using is the CNV log2r data; however, as you can read there, the downloaded files are per chromosome. The issue is that the files essentially hold only the log2r values and give no indication of which portion of the chromosome they come from, which is not very helpful.

Hope that clarifies things.

Thanks in advance!

Edit: I should also mention that by "segment means" I mean the log2r values; I've been using the two terms interchangeably.

6.0 years ago by tohc • 0

score 3 · Accepted Answer · 2019-07-28

My understanding is that the UK Biobank l2r files contain the log2 copy ratio estimates at each probe on the SNP array. They have not been segmented in the released dataset, so there are no segment breakpoints.
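If the values are per probe, the positions have to come from the array's marker annotation rather than from the l2r file itself. Here is a minimal sketch of that join. It assumes (please check this against Resource 664 for your download) that each l2r file has one row per marker and one whitespace-separated column per participant, and that the rows are in the same order as a PLINK-style .bim marker file for the same chromosome. The function names and the toy data are made up for illustration.

```python
import io


def read_bim_positions(handle):
    """Return (chrom, marker_id, bp_position) for each line of a PLINK .bim file."""
    markers = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        # PLINK .bim columns: chrom, variant id, genetic distance, bp position, allele 1, allele 2
        markers.append((fields[0], fields[1], int(fields[3])))
    return markers


def parse_value(token):
    """Parse one l2r token; non-numeric tokens are treated as missing (assumption)."""
    try:
        return float(token)
    except ValueError:
        return None


def attach_positions(bim_handle, l2r_handle):
    """Pair each l2r row with its marker, assuming both files share the same row order."""
    markers = read_bim_positions(bim_handle)
    rows = [line.split() for line in l2r_handle if line.strip()]
    if len(markers) != len(rows):
        raise ValueError(
            f"{len(markers)} markers but {len(rows)} l2r rows; row order cannot be trusted"
        )
    return [
        (chrom, marker_id, pos, [parse_value(tok) for tok in row])
        for (chrom, marker_id, pos), row in zip(markers, rows)
    ]


# Toy stand-ins for a marker file and an l2r file (3 probes, 2 participants).
bim_text = "1 rs1 0 1000 A G\n1 rs2 0 2500 C T\n1 rs3 0 4000 G A\n"
l2r_text = "0.01 -0.50\n0.03 -0.45\n-0.02 -0.55\n"

annotated = attach_positions(io.StringIO(bim_text), io.StringIO(l2r_text))
for chrom, marker_id, pos, values in annotated:
    print(chrom, marker_id, pos, values)
```

The row-count check is deliberate: since the join is purely positional, a mismatch in length is the only cheap signal that the two files do not belong together.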

There are a couple of papers that survey CNVs; they used PennCNV on these input files to smooth the CNV signal and detect breakpoints. I'd retrieve the processed calls from those studies, rather than UKB; reprocessing would be incredibly expensive, and the original studies were done well.
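To make "smooth the signal and detect breakpoints" concrete, here is a toy sketch for a single sample. It is not PennCNV's method (PennCNV uses a hidden Markov model and additional inputs); it just median-smooths the per-probe l2r values, labels each probe with arbitrary thresholds, and merges runs of identical labels into segments with a start, end, and mean. The window size, the thresholds of ±0.3, and all names are assumptions chosen for illustration only.

```python
from statistics import mean, median


def smooth(values, window=3):
    """Running median over a centred window, truncated at the ends."""
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1]) for i in range(len(values))]


def call_state(value, loss=-0.3, gain=0.3):
    """Label one smoothed l2r value using illustrative thresholds."""
    if value <= loss:
        return "loss"
    if value >= gain:
        return "gain"
    return "neutral"


def segment(positions, l2r, window=3, loss=-0.3, gain=0.3):
    """Merge consecutive probes with the same state into segments."""
    states = [call_state(v, loss, gain) for v in smooth(l2r, window)]
    segments = []
    start = 0
    for i in range(1, len(states) + 1):
        # Close the current run at the end of the data or when the state changes.
        if i == len(states) or states[i] != states[start]:
            segments.append({
                "start": positions[start],
                "end": positions[i - 1],
                "state": states[start],
                "n_probes": i - start,
                "mean_l2r": mean(l2r[start:i]),  # mean of the raw, unsmoothed values
            })
            start = i
    return segments


# Toy data: eight probes on one chromosome for one sample.
positions = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000]
l2r = [0.02, -0.01, 0.05, -0.6, -0.5, -0.55, 0.01, 0.03]

for seg in segment(positions, l2r):
    print(seg)
```

On the toy data this yields a neutral segment, a three-probe loss from 4000 to 6000, and a second neutral segment. Real array data is far noisier, which is one reason to prefer the published calls over a home-grown pass like this.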

I'm aware of efforts to call CNVs from the recently available whole-exome sequencing datasets as well. These aren't available for the full 500k cohort yet, but it's worth keeping an eye on these efforts, as the SNP arrays may not have used probes at some potentially important / likely CNV locations.