When you say that there are 752 N gaps in hg38, how to you measure them? What is your definition?
Let's play with it a bit:
> library(AnnotationHub)
> library(Homo.sapiens)
> library(BSgenome.Hsapiens.UCSC.hg19)
> ahub = AnnotationHub()
> AnnotationHub::query(ahub, c('hg19', 'gap'))
> human.gaps = ahub[['AH6444']]
> human.nongaps = BiocGenerics::setdiff(GRanges(seqinfo(human.gaps)), human.gaps)
> alphabetFrequency(Views(BSgenome.Hsapiens.UCSC.hg19, human.gaps)) %>% head
A C G T M R W S Y K V H D B N - + .
[1,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18000000 0 0 0
[2,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3000000 0 0 0
[3,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 150000 0 0 0
[4,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 100000 0 0 0
[5,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 50000 0 0 0
[6,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 150000 0 0 0
This shows that all the regions in the gaps track contains only N and no other nucleotides. So far so good, this is what we expected.
> summary(width(human.gaps))
Min. 1st Qu. Median Mean 3rd Qu. Max.
47 40440 50000 524800 100000 30000000
The distribution of gaps sizes shows that on average a gap is 524,800 bases long, with a median of 50,000.
What is the distribution of lengths in your 752-603=149 missing gaps?
The example on chr2 you posted may be simply too short. As you can see from the distribution, most gaps are a few orders of magnitude larger. For example there is only one gap <500bp, and it is on a hap chromosome:
> subset(human.gaps, width(human.gaps) < 500)
UCSC track 'gap'
UCSCData object with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr6_mcf_hap5 [4149580, 4149626] *
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
Hi,
Thank you for the example. I would like to know, if I can't those gaps as a bed file. I have couple bed files and my aim is to find overlaps across all the samples. But during this process, my overlapping regions come over to NNNNNNNNNs when I check them on IGV. Therefore, I would like to mask them out prior to visualisation.
Thank you for your help, (sorry for asking with a comment)
Tunc.
So what is the meaning of these 149 short missing gaps? Do they represent areas where the sequencing got interrupted?