Identifying long stretches of Ns

0

Entering edit mode

7.7 years ago

anshupa.vssut ▴ 50

Is there a R package to identify long stretches of gaps (Ns) in human genome (hg19)?

genome R • 4.8k views

ADD COMMENT • link updated 18 months ago by André • 0 • written 7.7 years ago by anshupa.vssut ▴ 50

0

Entering edit mode

Are you looking for hard-masked repeat regions or just N's at ends of chromosomes?

ADD REPLY • link 7.7 years ago by GenoMax 152k

0

Entering edit mode

I am looking for regions were it is obvious to find gaps (N's) in hg19, I am not looking into hard masked repeat regions.

ADD REPLY • link 7.7 years ago by anshupa.vssut ▴ 50

4

Entering edit mode

7.7 years ago

ATpoint 88k

Update 2023 -- Use Biostrings::vmatchPattern().

library(BSgenome.Hsapiens.UCSC.hg38)
library(Biostrings)

# Will take a few moments
vmp <- Biostrings::vmatchPattern(pattern="NNNNNNNNNNNNNNNNNNNN", subject=BSgenome.Hsapiens.UCSC.hg38)
vmp

GRanges object with 323184822 ranges and 0 metadata columns:
                          seqnames        ranges strand
                             <Rle>     <IRanges>  <Rle>
          [1]                 chr1          1-20      +
          [2]                 chr1          2-21      +
          [3]                 chr1          3-22      +
          [4]                 chr1          4-23      +
          [5]                 chr1          5-24      +
          ...                  ...           ...    ...
  [323184818] chr19_KV575249v1_alt 210330-210349      -
  [323184819] chr19_KV575249v1_alt 210331-210350      -
  [323184820] chr19_KV575249v1_alt 210332-210351      -
  [323184821] chr19_KV575249v1_alt 210333-210352      -
  [323184822] chr19_KV575249v1_alt 210334-210353      -
  -------
  seqinfo: 711 sequences (1 circular) from hg38 genome

This returns every distinct interval as a separate entry. Use GenomicRanges::reduce() on the output to collapse adjacent and overlapping intervals into one.

ADD COMMENT • link 2.2 years ago by ATpoint 88k

0

Entering edit mode

with all these solutions it requires now a benchmarking :-p

ADD REPLY • link 2.2 years ago by Nicolas Rosewick 11k

3

Entering edit mode

7.7 years ago

Nicolas Rosewick 11k

You could use the stringy package to detect all N position in the genome and then apply some function to extract the number of consectuive Ns and then order them by occurence

chromosome <- "ATGTAGATATGAATGCCCNNNNNACGACGACGAGNNNNNNNNNNNNNNNNNNACGACGACGAGAGGGANNAAAAN"
# find all position in one chromosome
n.pos <- stri_locate_all_fixed(chromosome,"N")[[1]][,1]
# group consecutive Ns in elements of a list
n.pos.list <- split(n.pos, cumsum(c(1, diff(n.pos) != 1)))
# extract for each group of Ns the length (thus the length of the gap)
n.length <-sapply(n.pos.list,length)
# construct results
n.pos.res <- cbind(do.call(rbind,lapply(n.pos.list,function(x){return(c(min(x),max(x)))})),length=n.length)
# order by length
n.pos.res <- n.pos.res[order(n.pos.res[,3],decreasing = T),]
# name column
colnames(n.pos.res)<-c("start","end","length")

Results :

  start end length
2    35  52     18
1    19  23      5
3    69  70      2
4    75  75      1

You could also perform some thresold (here minimum gaps of 1000bp are keep.

n.pos.res.filtered <- n.pos.res[n.pos.res[,3]>1000,]

ADD COMMENT • link 7.7 years ago by Nicolas Rosewick 11k

3

Entering edit mode

7.7 years ago

Pierre Lindenbaum 166k

not R

my answer to

Who's got a command for me to output coordinates of FASTA scaffolds gaps as a BED file? Looking for a one-liner. Saw https://t.co/7Znt8tPLZQ
— Shaun Jackman (@sjackman) June 29, 2016

    $ flex input.l && gcc -O3 lex.yy.c && curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr22.fa.gz" | gunzip -c | ./a.out 
chr22	0	16050000
chr22	16697850	16847850
chr22	19178139	19178140
chr22	19178159	19178160
chr22	19178161	19178164
chr22	19178165	19178167
chr22	19178168	19178169
chr22	19178263	19178264
chr22	19178267	19178268
chr22	19178273	19178274
chr22	19178283	19178284
chr22	19178310	19178311
chr22	19178314	19178315
chr22	19179518	19179519
chr22	19193925	19193926
chr22	19200634	19200635
chr22	19200640	19200641
chr22	19205396	19205397
chr22	19205400	19205401
chr22	19205448	19205449
chr22	20509431	20609431
chr22	50364777	50414777

  

view raw README.md hosted with ❤ by GitHub


	%{
	/**
	flex input.l && gcc lex.yy.c && curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr22.fa.gz" \| gunzip -c \| ./a.out
	*/

	#include <stdio.h>
	#include <string.h>
	static char* name=NULL;
	static long pos=0L,beg=0L;
	#define BED if(name!=NULL && beg<pos) printf("%s\t%ld\t%ld\n",name,beg,pos)
	%}

	%option noyywrap

	%%

	>.*\n {
	BED;
	pos = beg = 0L;
	free(name);
	name = strndup(&yytext[1],yyleng-2);
	}

	[nN]+ pos+=yyleng;

	[ \n] ;

	[^nN> \n]+ {
	BED;
	pos += yyleng;
	beg = pos;
	}


	%%

	int main()
	{
	yylex();
	return 0;
	}

view raw input.l hosted with ❤ by GitHub

ADD COMMENT • link 7.7 years ago by Pierre Lindenbaum 166k

3

Entering edit mode

5.4 years later: use ScatterIntervalsByNs https://gatk.broadinstitute.org/hc/en-us/articles/360041416072-ScatterIntervalsByNs-Picard-

ADD REPLY • link 2.2 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

18 months ago

André • 0

The GC selector (https://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/gc5Base.txt.gz) has data for basically the entire genome and doesn't include the N positions. You can just subtract that from the entire hg19. Something like:

cat /references/hg19.fa.fai | awk -F "\t" '{print $1 "\t" 1 "\t" $2'} > hg19.bed
cut -f 2-4 gc5Base.txt | bedtools subtract -a hg19.bed -b - | awk -F "\t" '{print $1 "\t" $2+1 "\t" $3}' > hg19_Ns.bed

ADD COMMENT • link 18 months ago by André • 0

Login before adding your answer.