Identifying long stretches of Ns
4
0
Entering edit mode
7.0 years ago

Is there a R package to identify long stretches of gaps (Ns) in human genome (hg19)?

genome R • 4.2k views
ADD COMMENT
0
Entering edit mode

Are you looking for hard-masked repeat regions or just N's at ends of chromosomes?

ADD REPLY
0
Entering edit mode

I am looking for regions were it is obvious to find gaps (N's) in hg19, I am not looking into hard masked repeat regions.

ADD REPLY
4
Entering edit mode
7.0 years ago
ATpoint 85k

Update 2023 -- Use Biostrings::vmatchPattern().

library(BSgenome.Hsapiens.UCSC.hg38)
library(Biostrings)

# Will take a few moments
vmp <- Biostrings::vmatchPattern(pattern="NNNNNNNNNNNNNNNNNNNN", subject=BSgenome.Hsapiens.UCSC.hg38)
vmp

GRanges object with 323184822 ranges and 0 metadata columns:
                          seqnames        ranges strand
                             <Rle>     <IRanges>  <Rle>
          [1]                 chr1          1-20      +
          [2]                 chr1          2-21      +
          [3]                 chr1          3-22      +
          [4]                 chr1          4-23      +
          [5]                 chr1          5-24      +
          ...                  ...           ...    ...
  [323184818] chr19_KV575249v1_alt 210330-210349      -
  [323184819] chr19_KV575249v1_alt 210331-210350      -
  [323184820] chr19_KV575249v1_alt 210332-210351      -
  [323184821] chr19_KV575249v1_alt 210333-210352      -
  [323184822] chr19_KV575249v1_alt 210334-210353      -
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genome

This returns every distinct interval as a separate entry. Use GenomicRanges::reduce() on the output to collapse adjacent and overlapping intervals into one.

ADD COMMENT
0
Entering edit mode

with all these solutions it requires now a benchmarking :-p

ADD REPLY
3
Entering edit mode
7.0 years ago

You could use the stringy package to detect all N position in the genome and then apply some function to extract the number of consectuive Ns and then order them by occurence

chromosome <- "ATGTAGATATGAATGCCCNNNNNACGACGACGAGNNNNNNNNNNNNNNNNNNACGACGACGAGAGGGANNAAAAN"
# find all position in one chromosome
n.pos <- stri_locate_all_fixed(chromosome,"N")[[1]][,1]
# group consecutive Ns in elements of a list
n.pos.list <- split(n.pos, cumsum(c(1, diff(n.pos) != 1)))
# extract for each group of Ns the length (thus the length of the gap)
n.length <-sapply(n.pos.list,length)
# construct results
n.pos.res <- cbind(do.call(rbind,lapply(n.pos.list,function(x){return(c(min(x),max(x)))})),length=n.length)
# order by length
n.pos.res <- n.pos.res[order(n.pos.res[,3],decreasing = T),]
# name column
colnames(n.pos.res)<-c("start","end","length")

Results :

  start end length
2    35  52     18
1    19  23      5
3    69  70      2
4    75  75      1

You could also perform some thresold (here minimum gaps of 1000bp are keep.

n.pos.res.filtered <- n.pos.res[n.pos.res[,3]>1000,]
ADD COMMENT
3
Entering edit mode
7.0 years ago

not R

my answer to

ADD COMMENT
3
Entering edit mode
ADD REPLY
0
Entering edit mode
11 months ago
André • 0

The GC selector (https://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/gc5Base.txt.gz) has data for basically the entire genome and doesn't include the N positions. You can just subtract that from the entire hg19. Something like:

cat /references/hg19.fa.fai | awk -F "\t" '{print $1 "\t" 1 "\t" $2'} > hg19.bed
cut -f 2-4 gc5Base.txt | bedtools subtract -a hg19.bed -b - | awk -F "\t" '{print $1 "\t" $2+1 "\t" $3}' > hg19_Ns.bed
ADD COMMENT

Login before adding your answer.

Traffic: 2723 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6