On Linux, you could do the following to get sequences, extract them to fasta, and write output to BED format:
$ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit
$ wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa
$ for i in `seq 1 22` X Y; do \
./twoBitToFa -seq=chr$i hg38.2bit hg38.2bit.chr$i.fa; \
done
$ for i in `seq 1 22` X Y; do \
awk -vchr="chr$i" ' \
BEGIN { \
p = 0; \
} \
{ \
if ($0 !~ /^>/) { \
n = split($0, chars, ""); \
for (i = 1; i <= n; i++) { \
if (chars[i] == "N") { \
printf("%s\t%d\t%d\n", chr, p, p + 1); \
} \
p++; \
} \
} \
}' hg38.2bit.chr$i.fa \
> hg38.2bit.chr$i.N.bed; \
done
The work in the for
loops can be parallelized, if you need a speed-up, by splitting out work to per-chromosome tasks.
If you need contiguous regions of Ns in the output BED, you can use BEDOPS bedops --merge:
$ for i in `seq 1 22` X Y; do \
bedops --merge hg38.2bit.chr$i.N.bed > hg38.2bit.chr$i.N.contiguous.bed; \
done
Be prepared for large BED files, unless you do a bedops --merge at the end.
Since these are three-column BED files, using the Starch format to archive BED files would be pretty useful here - there's a lot of redundancy to be removed. You could pipe the output of awk to starch to make an archive:
$ for i in `seq 1 22` X Y; do \
awk -vchr="chr$i" ' \
BEGIN { \
p = 0; \
} \
{ \
if ($0 !~ /^>/) { \
n = split($0, chars, ""); \
for (i = 1; i <= n; i++) { \
if (chars[i] == "N") { \
printf("%s\t%d\t%d\n", chr, p, p + 1); \
} \
p++; \
} \
} \
}' hg38.2bit.chr$i.fa \
| starch - \
> hg38.2bit.chr$i.N.starch; \
done
I would vote you up and say it's the solution but I keep getting a "Unable to submit vote" error
Don't worry, I get that a lot.