Question

Position of N nucleotides in reference genome

0

Entering edit mode

9.7 years ago

QVINTVS_FABIVS_MAXIMVS ★ 2.6k

Title says it all.

Is there a resource that has the positions for every "N" nucleotide in the reference genome (hg38)?

If not, is there a method to quickly generate these positions.

I was thinking using bedtools nuc with a bed file that windows the genome, but that seems too much work.

Thanks!

reference-genome nucleotide • 3.0k views

ADD COMMENT • link updated 2.9 years ago by Ram 45k • written 9.7 years ago by QVINTVS_FABIVS_MAXIMVS ★ 2.6k

Ram · Accepted Answer · 2015-11-16

On Linux, you could do the following to get sequences, extract them to fasta, and write output to BED format:

$ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit
$ wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa
$ for i in `seq 1 22` X Y; do \
    ./twoBitToFa -seq=chr$i hg38.2bit hg38.2bit.chr$i.fa; \
  done
$ for i in `seq 1 22` X Y; do \
    awk -vchr="chr$i" ' \
        BEGIN { \
            p = 0; \
        } \
        { \
            if ($0 !~ /^>/) { \
                n = split($0, chars, ""); \
                for (i = 1; i <= n; i++) { \
                    if (chars[i] == "N") { \
                        printf("%s\t%d\t%d\n", chr, p, p + 1); \
                    } \
                    p++; \
                } \
            } \
        }' hg38.2bit.chr$i.fa \
        > hg38.2bit.chr$i.N.bed; \
  done

The work in the for loops can be parallelized, if you need a speed-up, by splitting out work to per-chromosome tasks.

If you need contiguous regions of Ns in the output BED, you can use BEDOPS bedops --merge:

$ for i in `seq 1 22` X Y; do \
    bedops --merge hg38.2bit.chr$i.N.bed > hg38.2bit.chr$i.N.contiguous.bed; \
  done

Be prepared for large BED files, unless you do a bedops --merge at the end.

Since these are three-column BED files, using the Starch format to archive BED files would be pretty useful here - there's a lot of redundancy to be removed. You could pipe the output of awk to starch to make an archive:

$ for i in `seq 1 22` X Y; do \
    awk -vchr="chr$i" ' \
        BEGIN { \
            p = 0; \
        } \
        { \
            if ($0 !~ /^>/) { \
                n = split($0, chars, ""); \
                for (i = 1; i <= n; i++) { \
                    if (chars[i] == "N") { \
                        printf("%s\t%d\t%d\n", chr, p, p + 1); \
                    } \
                    p++; \
                } \
            } \
        }' hg38.2bit.chr$i.fa \
        | starch - \
        > hg38.2bit.chr$i.N.starch; \
  done