Question

Why does UCSC gap definitions disagree with the Ns in reference genome?

1

Entering edit mode

9.0 years ago

Duck ▴ 40

For example, in region chr2:33,142,137-33,142,249 in hg19, you can only see Ns. However, this region is not defined as a gap based on "Gap" track.

Similar thing is seen in hg38, where there are 752 N gaps on chromosomes chr1~chr22, chrX,chrY, but only 603 gaps defined in UCSC gap file.

How to explain the difference? Thanks.

UCSC-genome-browser Gap • 2.4k views

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 9.0 years ago by Duck ▴ 40

Ram · Answer 1 · 2015-12-14

When you say that there are 752 N gaps in hg38, how to you measure them? What is your definition?

Let's play with it a bit:

> library(AnnotationHub)
> library(Homo.sapiens)
> library(BSgenome.Hsapiens.UCSC.hg19)

> ahub = AnnotationHub()
> AnnotationHub::query(ahub, c('hg19', 'gap'))
> human.gaps = ahub[['AH6444']]
> human.nongaps = BiocGenerics::setdiff(GRanges(seqinfo(human.gaps)), human.gaps)
> alphabetFrequency(Views(BSgenome.Hsapiens.UCSC.hg19, human.gaps)) %>% head
     A C G T M R W S Y K V H D B        N - + .
[1,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18000000 0 0 0
[2,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0  3000000 0 0 0
[3,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0   150000 0 0 0
[4,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0   100000 0 0 0
[5,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0    50000 0 0 0
[6,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0   150000 0 0 0

This shows that all the regions in the gaps track contains only N and no other nucleotides. So far so good, this is what we expected.

> summary(width(human.gaps))
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      47    40440    50000   524800   100000 30000000

The distribution of gaps sizes shows that on average a gap is 524,800 bases long, with a median of 50,000.

What is the distribution of lengths in your 752-603=149 missing gaps?

The example on chr2 you posted may be simply too short. As you can see from the distribution, most gaps are a few orders of magnitude larger. For example there is only one gap <500bp, and it is on a hap chromosome:

> subset(human.gaps, width(human.gaps) < 500)
UCSC track 'gap'
UCSCData object with 1 range and 0 metadata columns:
           seqnames             ranges strand
              <Rle>          <IRanges>  <Rle>
  [1] chr6_mcf_hap5 [4149580, 4149626]      *
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome