Why do BED files not contain information about the reference?
6.3 years ago
Marvin ▴ 220

Let's say I have a transcript NM_12345 whose first exon starts at position 1000 on chromosome 15 in hg19.

What if in hg38 they had discovered the following: "oh, the ten repeats of length 5 at the beginning of chromosome 15 are actually twenty repeats (in most people). We should insert those nucleotides at the beginning of chromosome 15."

This means that my first exon doesn't start at position 1000 anymore but instead at position 1000 + 5*10 = 1050.

This means that a bed file which was created based on hg19 should not be used for hg38 based work, right?

Thus bed files should actually have a header line which makes it clear which reference genome the features refer to ... why isn't there such a line?
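The coordinate shift described in the question can be sketched in a few lines of Python. This is a toy illustration only: the insertion of ten extra 5-nt repeats is the hypothetical scenario from the question, not a real difference between hg19 and hg38, and `shift_position` is a made-up helper name.

```python
# Toy illustration: the numbers are the hypothetical ones from the question,
# not real differences between hg19 and hg38.
REPEAT_LENGTH = 5
EXTRA_REPEATS = 20 - 10  # twenty repeats instead of ten


def shift_position(pos, insertion_site, inserted_length):
    """Return the coordinate of `pos` after `inserted_length` bases are
    inserted at `insertion_site`. Positions before the site are unchanged."""
    if pos >= insertion_site:
        return pos + inserted_length
    return pos


inserted = REPEAT_LENGTH * EXTRA_REPEATS  # 50 extra nucleotides
old_exon_start = 1000
new_exon_start = shift_position(old_exon_start, insertion_site=0,
                                inserted_length=inserted)
print(new_exon_start)  # 1050
```

Every feature downstream of the insertion shifts by the same amount, which is why a file of bare coordinates silently becomes wrong when it is paired with a different assembly.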


You can add comments (using # at the beginning of the line) to a bed file:

Comments and all lines that do not match the format described above (starting with "chr" and containing at least two integers with genomic positions) are skipped.

So you can add metadata to a bed file. It will be ignored by bed parsers, but may be useful for humans dealing with the files.
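As a rough sketch of how such a comment could be used, the following Python reads a BED file, skips `#` lines, and collects any `# key=value` comments as metadata. The `genome=hg19` comment is a made-up convention for illustration, not part of the BED format, and `read_bed` is a hypothetical helper. Other parsers will simply ignore the comment.

```python
import io


def read_bed(handle):
    """Parse BED lines from a file-like object.

    Returns (metadata, records). Lines starting with '#' are comments;
    comments of the form '# key=value' are collected as metadata.
    The 'genome' key used below is an invented convention.
    """
    metadata = {}
    records = []
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#"):
            key, sep, value = line.lstrip("#").strip().partition("=")
            if sep:
                metadata[key.strip()] = value.strip()
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        records.append((chrom, start, end, fields[3:]))
    return metadata, records


# BED starts are 0-based, so 1-based position 1000 is written as 999.
example = io.StringIO(
    "# genome=hg19\n"
    "chr15\t999\t1200\tNM_12345_exon1\n"
)
metadata, records = read_bed(example)
print(metadata.get("genome", "unknown"))
```

A script could refuse to run when `metadata.get("genome")` does not match the assembly it expects, which at least turns a silent mismatch into a visible error.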

6.3 years ago
ATpoint 85k

Because the BED format is a pure coordinate-based format without any metadata. It is your responsibility to make sure that you use the correct genome version. By the way, hg19 and hg38 coordinates are not 1-to-1 interchangeable. If you want to convert coordinates from one assembly to the other, have a look at liftOver.


BED files can contain track definition lines at the top of the file, but in general they are used without them.
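For illustration, a track line can be split into its attributes with a few lines of Python. This is a minimal sketch: `parse_track_line` is a hypothetical helper, the example line is invented, and the use of a `db` attribute to name the assembly is an assumption that should be checked against the UCSC custom track documentation.

```python
import shlex


def parse_track_line(line):
    """Split a 'track key=value ...' line into a dict.

    Returns an empty dict for lines that are not track lines.
    shlex handles quoted values such as name="my exons".
    """
    tokens = shlex.split(line)
    if not tokens or tokens[0] != "track":
        return {}
    attrs = {}
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if sep:
            attrs[key] = value
    return attrs


# Invented example line; the 'db' attribute is assumed to name the assembly.
track = parse_track_line(
    'track name="my exons" description="Exons of NM_12345" db=hg19'
)
print(track.get("db"))
```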


Fair point. The problem is still that not all tools allow this. Try to load a BED file into IGV with anything other than the coordinates and it will complain (at least in my limited experience).


Additionally, if you say "hg19" or "hg38", that is actually not entirely accurate either. This post illustrates that problem very clearly: http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use

Unless you are embedding the entire reference FASTA, you really can't be entirely sure what you are working with.
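A cheap sanity check is still possible. The sketch below compares BED records against a table of chromosome names and lengths for the reference in use. It cannot prove that the assembly matches; it only catches obvious mismatches such as different chromosome naming or intervals that run past a chromosome end. The function name and the toy sizes are invented for illustration.

```python
def check_against_chrom_sizes(records, chrom_sizes):
    """Return a list of human-readable problems for (chrom, start, end)
    records that do not fit the given {chrom: length} table."""
    problems = []
    for chrom, start, end in records:
        if chrom not in chrom_sizes:
            problems.append(f"{chrom}: not in reference")
        elif end > chrom_sizes[chrom]:
            problems.append(
                f"{chrom}:{start}-{end}: past chromosome end "
                f"({chrom_sizes[chrom]})"
            )
    return problems


# Toy values only; real chromosome sizes would come from the reference index.
toy_sizes = {"chr15": 5000}
toy_records = [
    ("chr15", 999, 1200),   # fits
    ("15", 999, 1200),      # different naming scheme
    ("chr15", 4900, 5100),  # runs past the end
]
problems = check_against_chrom_sizes(toy_records, toy_sizes)
for problem in problems:
    print(problem)
```

An empty result does not mean the assembly is right, since a shifted interval usually still fits inside the chromosome.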


Thanks, everyone, for bringing all the details together :)

6.3 years ago

Because of the eternal fight between simplicity and full documentation.

BED files are super simple and basic. Yes, you can add a comment at the beginning (or you could just make a note of the genome version in the file name). Or you can use the infinitely more complex R objects for storing genomic ranges to dump all the metadata anybody could ever want.
