Is there a R package to identify long stretches of gaps (Ns) in human genome (hg19)?
Is there a R package to identify long stretches of gaps (Ns) in human genome (hg19)?
Update 2023 -- Use Biostrings::vmatchPattern()
.
library(BSgenome.Hsapiens.UCSC.hg38)
library(Biostrings)
# Will take a few moments
vmp <- Biostrings::vmatchPattern(pattern="NNNNNNNNNNNNNNNNNNNN", subject=BSgenome.Hsapiens.UCSC.hg38)
vmp
GRanges object with 323184822 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 1-20 +
[2] chr1 2-21 +
[3] chr1 3-22 +
[4] chr1 4-23 +
[5] chr1 5-24 +
... ... ... ...
[323184818] chr19_KV575249v1_alt 210330-210349 -
[323184819] chr19_KV575249v1_alt 210331-210350 -
[323184820] chr19_KV575249v1_alt 210332-210351 -
[323184821] chr19_KV575249v1_alt 210333-210352 -
[323184822] chr19_KV575249v1_alt 210334-210353 -
-------
seqinfo: 711 sequences (1 circular) from hg38 genome
This returns every distinct interval as a separate entry. Use GenomicRanges::reduce()
on the output to collapse adjacent and overlapping intervals into one.
You could use the stringy package to detect all N position in the genome and then apply some function to extract the number of consectuive Ns and then order them by occurence
chromosome <- "ATGTAGATATGAATGCCCNNNNNACGACGACGAGNNNNNNNNNNNNNNNNNNACGACGACGAGAGGGANNAAAAN"
# find all position in one chromosome
n.pos <- stri_locate_all_fixed(chromosome,"N")[[1]][,1]
# group consecutive Ns in elements of a list
n.pos.list <- split(n.pos, cumsum(c(1, diff(n.pos) != 1)))
# extract for each group of Ns the length (thus the length of the gap)
n.length <-sapply(n.pos.list,length)
# construct results
n.pos.res <- cbind(do.call(rbind,lapply(n.pos.list,function(x){return(c(min(x),max(x)))})),length=n.length)
# order by length
n.pos.res <- n.pos.res[order(n.pos.res[,3],decreasing = T),]
# name column
colnames(n.pos.res)<-c("start","end","length")
Results :
start end length
2 35 52 18
1 19 23 5
3 69 70 2
4 75 75 1
You could also perform some thresold (here minimum gaps of 1000bp are keep.
n.pos.res.filtered <- n.pos.res[n.pos.res[,3]>1000,]
not R
my answer to
Who's got a command for me to output coordinates of FASTA scaffolds gaps as a BED file? Looking for a one-liner. Saw https://t.co/7Znt8tPLZQ
— Shaun Jackman (@sjackman) June 29, 2016
5.4 years later: use ScatterIntervalsByNs https://gatk.broadinstitute.org/hc/en-us/articles/360041416072-ScatterIntervalsByNs-Picard-
The GC selector (https://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/gc5Base.txt.gz) has data for basically the entire genome and doesn't include the N positions. You can just subtract that from the entire hg19. Something like:
cat /references/hg19.fa.fai | awk -F "\t" '{print $1 "\t" 1 "\t" $2'} > hg19.bed
cut -f 2-4 gc5Base.txt | bedtools subtract -a hg19.bed -b - | awk -F "\t" '{print $1 "\t" $2+1 "\t" $3}' > hg19_Ns.bed
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Are you looking for hard-masked repeat regions or just N's at ends of chromosomes?
I am looking for regions were it is obvious to find gaps (N's) in hg19, I am not looking into hard masked repeat regions.