Identifying long stretches of Ns
4
0
Entering edit mode
7.2 years ago

Is there a R package to identify long stretches of gaps (Ns) in human genome (hg19)?

genome R • 4.3k views
ADD COMMENT
0
Entering edit mode

Are you looking for hard-masked repeat regions or just N's at ends of chromosomes?

ADD REPLY
0
Entering edit mode

I am looking for regions were it is obvious to find gaps (N's) in hg19, I am not looking into hard masked repeat regions.

ADD REPLY
4
Entering edit mode
7.2 years ago
ATpoint 86k

Update 2023 -- Use Biostrings::vmatchPattern().

library(BSgenome.Hsapiens.UCSC.hg38)
library(Biostrings)

# Will take a few moments
vmp <- Biostrings::vmatchPattern(pattern="NNNNNNNNNNNNNNNNNNNN", subject=BSgenome.Hsapiens.UCSC.hg38)
vmp

GRanges object with 323184822 ranges and 0 metadata columns:
                          seqnames        ranges strand
                             <Rle>     <IRanges>  <Rle>
          [1]                 chr1          1-20      +
          [2]                 chr1          2-21      +
          [3]                 chr1          3-22      +
          [4]                 chr1          4-23      +
          [5]                 chr1          5-24      +
          ...                  ...           ...    ...
  [323184818] chr19_KV575249v1_alt 210330-210349      -
  [323184819] chr19_KV575249v1_alt 210331-210350      -
  [323184820] chr19_KV575249v1_alt 210332-210351      -
  [323184821] chr19_KV575249v1_alt 210333-210352      -
  [323184822] chr19_KV575249v1_alt 210334-210353      -
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genome

This returns every distinct interval as a separate entry. Use GenomicRanges::reduce() on the output to collapse adjacent and overlapping intervals into one.

ADD COMMENT
0
Entering edit mode

with all these solutions it requires now a benchmarking :-p

ADD REPLY
3
Entering edit mode
7.2 years ago

You could use the stringy package to detect all N position in the genome and then apply some function to extract the number of consectuive Ns and then order them by occurence

chromosome <- "ATGTAGATATGAATGCCCNNNNNACGACGACGAGNNNNNNNNNNNNNNNNNNACGACGACGAGAGGGANNAAAAN"
# find all position in one chromosome
n.pos <- stri_locate_all_fixed(chromosome,"N")[[1]][,1]
# group consecutive Ns in elements of a list
n.pos.list <- split(n.pos, cumsum(c(1, diff(n.pos) != 1)))
# extract for each group of Ns the length (thus the length of the gap)
n.length <-sapply(n.pos.list,length)
# construct results
n.pos.res <- cbind(do.call(rbind,lapply(n.pos.list,function(x){return(c(min(x),max(x)))})),length=n.length)
# order by length
n.pos.res <- n.pos.res[order(n.pos.res[,3],decreasing = T),]
# name column
colnames(n.pos.res)<-c("start","end","length")

Results :

  start end length
2    35  52     18
1    19  23      5
3    69  70      2
4    75  75      1

You could also perform some thresold (here minimum gaps of 1000bp are keep.

n.pos.res.filtered <- n.pos.res[n.pos.res[,3]>1000,]
ADD COMMENT
3
Entering edit mode
7.2 years ago

not R

my answer to

$ flex input.l && gcc -O3 lex.yy.c && curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr22.fa.gz" | gunzip -c | ./a.out 
chr22	0	16050000
chr22	16697850	16847850
chr22	19178139	19178140
chr22	19178159	19178160
chr22	19178161	19178164
chr22	19178165	19178167
chr22	19178168	19178169
chr22	19178263	19178264
chr22	19178267	19178268
chr22	19178273	19178274
chr22	19178283	19178284
chr22	19178310	19178311
chr22	19178314	19178315
chr22	19179518	19179519
chr22	19193925	19193926
chr22	19200634	19200635
chr22	19200640	19200641
chr22	19205396	19205397
chr22	19205400	19205401
chr22	19205448	19205449
chr22	20509431	20609431
chr22	50364777	50414777
view raw README.md hosted with ❤ by GitHub
%{
/**
flex input.l && gcc lex.yy.c && curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr22.fa.gz" | gunzip -c | ./a.out
*/
#include <stdio.h>
#include <string.h>
static char* name=NULL;
static long pos=0L,beg=0L;
#define BED if(name!=NULL && beg<pos) printf("%s\t%ld\t%ld\n",name,beg,pos)
%}
%option noyywrap
%%
>.*\n {
BED;
pos = beg = 0L;
free(name);
name = strndup(&yytext[1],yyleng-2);
}
[nN]+ pos+=yyleng;
[ \n] ;
[^nN> \n]+ {
BED;
pos += yyleng;
beg = pos;
}
%%
int main()
{
yylex();
return 0;
}
view raw input.l hosted with ❤ by GitHub

ADD COMMENT
3
Entering edit mode
ADD REPLY
0
Entering edit mode
13 months ago
André • 0

The GC selector (https://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/gc5Base.txt.gz) has data for basically the entire genome and doesn't include the N positions. You can just subtract that from the entire hg19. Something like:

cat /references/hg19.fa.fai | awk -F "\t" '{print $1 "\t" 1 "\t" $2'} > hg19.bed
cut -f 2-4 gc5Base.txt | bedtools subtract -a hg19.bed -b - | awk -F "\t" '{print $1 "\t" $2+1 "\t" $3}' > hg19_Ns.bed
ADD COMMENT

Login before adding your answer.

Traffic: 3892 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6