Is it possible to generate a BED/BEDGRAPH containing every dinucleotide in the genome?
2
0
Entering edit mode
6.3 years ago

Hi, I am interested, if possible in creating a file that will contain the position every possible dinucleotide in the genome FASTA, like so:

chr1    1    2
chr1    3    4
chr1    5    6
...etc...

I would do this manually but I have a non-reference genome with hundreds of scaffolds instead of chromosomes.

Thanks!

genome bed • 2.0k views
ADD COMMENT
1
Entering edit mode
6.3 years ago
dr_bantz ▴ 110

This python script will do it. The genome_file variable is a bedtools genome file, which is just a tab-delimitted text file with chromosome/scaffold name in the first column and chromosome/scaffold size in the second column. Note that this will produce a 0-index bed file (which is what bed files are supposed to be).

genome_file = "bedtools_genome_file.txt"

output_file = "dinucleotides.bed"

stepsize = 2

with open(genome_file) as f, open(output_file, "w") as fo:
    for line in f:
        chrom = line.strip().split()[0]
        chrom_size = int(line.strip().split()[1])
        start = 0
        while start <= (chrom_size-stepsize):
            end = start + stepsize
            fo.write("\t".join([chrom, str(start), str(end)])+"\n")
            start = end
ADD COMMENT
1
Entering edit mode

Great, this worked perfectly! For future reference, to create a bedtools genome file, all it takes is a fasta index (can be generated by samtools faidx) and to run the following on it: awk -v OFS='\t' {'print $1,$2'} genome.fa.fai > genomeFile.txt

ADD REPLY
0
Entering edit mode
6.3 years ago

This would be for a reference genome, but the underlying principles are the same:

$ fetchChromSizes hg38 \
    | awk -vOFS="\t" '{ print $1, "0", $2; } \
    | grep -vE '*_*' \
    | sort-bed - \
    | bedops --chop 2 - \
    > dinucleotides.bed

The fetchChromSizes binary comes from the UCSC Kent utilities kit. The sort-bed and bedops binaries come from the BEDOPS kit.

Adding -x with --chop would ensure grabbing the last interval, regardless of size. But if you want strict dinucleotide intervals, you probably want to leave out the last interval, if it is one base long.

If you're not working with a reference genome, then you just need to feed sort-bed a BED file of intervals defining the bounds of your scaffolds, i.e.:

$ sort-bed scaffolds.bed | bedops --chop 2 - > dinucleotides.bed

The resulting BED file can be converted to FASTA with samtools faidx and a helper script like bed2fastaidx.pl, available on Github Gist: https://bit.ly/2nCvej2

$ /path/to/bed2fastaidx.pl --fastaIsUncompressed --fastaDir=/path/to/genome/fasta < dinucleotides.bed > dinucleotides.fa
ADD COMMENT
0
Entering edit mode

Hi, thank you for your help, but this seems to only work for UCSC-curated genomes, not FASTA files. The other answer worked in my situation. Thank you for your help regardless

ADD REPLY

Login before adding your answer.

Traffic: 1584 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6