This would be for a reference genome, but the underlying principles are the same:
$ fetchChromSizes hg38 \
| awk -vOFS="\t" '{ print $1, "0", $2; } \
| grep -vE '*_*' \
| sort-bed - \
| bedops --chop 2 - \
> dinucleotides.bed
The fetchChromSizes
binary comes from the UCSC Kent utilities kit. The sort-bed
and bedops
binaries come from the BEDOPS kit.
Adding -x
with --chop
would ensure grabbing the last interval, regardless of size. But if you want strict dinucleotide intervals, you probably want to leave out the last interval, if it is one base long.
If you're not working with a reference genome, then you just need to feed sort-bed
a BED file of intervals defining the bounds of your scaffolds, i.e.:
$ sort-bed scaffolds.bed | bedops --chop 2 - > dinucleotides.bed
The resulting BED file can be converted to FASTA with samtools faidx
and a helper script like bed2fastaidx.pl
, available on Github Gist: https://bit.ly/2nCvej2
$ /path/to/bed2fastaidx.pl --fastaIsUncompressed --fastaDir=/path/to/genome/fasta < dinucleotides.bed > dinucleotides.fa
Great, this worked perfectly! For future reference, to create a bedtools genome file, all it takes is a fasta index (can be generated by samtools faidx) and to run the following on it: awk -v OFS='\t' {'print $1,$2'} genome.fa.fai > genomeFile.txt