Question

Is it possible to generate a BED/BEDGRAPH containing every dinucleotide in the genome?

0

Entering edit mode

6.3 years ago

dansapozhnikov ▴ 10

Hi, I am interested, if possible in creating a file that will contain the position every possible dinucleotide in the genome FASTA, like so:

chr1    1    2
chr1    3    4
chr1    5    6
...etc...

I would do this manually but I have a non-reference genome with hundreds of scaffolds instead of chromosomes.

Thanks!

genome bed • 2.0k views

ADD COMMENT • link updated 6.3 years ago by dr_bantz ▴ 110 • written 6.3 years ago by dansapozhnikov ▴ 10

0

Entering edit mode

6.3 years ago

Alex Reynolds 36k

This would be for a reference genome, but the underlying principles are the same:

$ fetchChromSizes hg38 \
    | awk -vOFS="\t" '{ print $1, "0", $2; } \
    | grep -vE '*_*' \
    | sort-bed - \
    | bedops --chop 2 - \
    > dinucleotides.bed

The fetchChromSizes binary comes from the UCSC Kent utilities kit. The sort-bed and bedops binaries come from the BEDOPS kit.

Adding -x with --chop would ensure grabbing the last interval, regardless of size. But if you want strict dinucleotide intervals, you probably want to leave out the last interval, if it is one base long.

If you're not working with a reference genome, then you just need to feed sort-bed a BED file of intervals defining the bounds of your scaffolds, i.e.:

$ sort-bed scaffolds.bed | bedops --chop 2 - > dinucleotides.bed

The resulting BED file can be converted to FASTA with samtools faidx and a helper script like bed2fastaidx.pl, available on Github Gist: https://bit.ly/2nCvej2

$ /path/to/bed2fastaidx.pl --fastaIsUncompressed --fastaDir=/path/to/genome/fasta < dinucleotides.bed > dinucleotides.fa

ADD COMMENT • link 6.3 years ago by Alex Reynolds 36k

0

Entering edit mode

Hi, thank you for your help, but this seems to only work for UCSC-curated genomes, not FASTA files. The other answer worked in my situation. Thank you for your help regardless

ADD REPLY • link 6.3 years ago by dansapozhnikov ▴ 10

score 1 · Accepted Answer · 2018-08-22

This python script will do it. The genome_file variable is a bedtools genome file, which is just a tab-delimitted text file with chromosome/scaffold name in the first column and chromosome/scaffold size in the second column. Note that this will produce a 0-index bed file (which is what bed files are supposed to be).

genome_file = "bedtools_genome_file.txt"

output_file = "dinucleotides.bed"

stepsize = 2

with open(genome_file) as f, open(output_file, "w") as fo:
    for line in f:
        chrom = line.strip().split()[0]
        chrom_size = int(line.strip().split()[1])
        start = 0
        while start <= (chrom_size-stepsize):
            end = start + stepsize
            fo.write("\t".join([chrom, str(start), str(end)])+"\n")
            start = end