Randomly sample motifs from reference sequence
2
2
Entering edit mode
5.9 years ago
Gene_MMP8 ▴ 240

I want to randomly sample 'n' number of sequence motifs of length 'k' from the reference sequence hg38.fa. Can someone help me out?

genome Assembly • 2.3k views
ADD COMMENT
2
Entering edit mode

I am reasonably certain that what you are looking for can be found in @Pierre and @Kevin's answers here: A: Finding 16 mer not present in GRCh38 Kevin's answer will work, if k < 17.

ADD REPLY
4
Entering edit mode
5.9 years ago
Joe 21k

This code does the job. It's not terrible at scale either.

#!/usr/bin/env python
import re, random

n = 20
k = 5
kmers = re.compile("(?=(\w{%s}))" % k)
selected = random.sample(kmers.findall(seq), n)    # Where 'seq' is the genome sequence as a string.

Change k (motif/kmer length) and n (random sample size) to suit.


Since posting this answer I've done some benchmarking, and this enumerated (n = ) 10 random (k = ) 9-mers from each entry in GCA_000001405.15_GRCh38_genomic.fna. Note that this should scale reasonably well for sampling larger n, but larger k will eventually become problematic

TL;DR:

The total code execution time was: 00:22:03.37 (slightly over 22 minutes). See the gist below for the the results for each individual fasta in the multi .fna.

Link to gist (since it makes the post too long)

ADD COMMENT
0
Entering edit mode

Thanks a lot for your reply. Your code randomly samples it from each chromosome. Is it possible to do this sampling from the entire genome, instead of individual chromosomes?

ADD REPLY
1
Entering edit mode

Well, it's essentially the same thing, since you get kmers from all sequences as a 'pool'. It might not be 100% fully statistically random, but someone with more intuition on that would have to weigh in. Certain kmers could be over or underrepresented in this approach I guess.

If you concatenated all of the fastas together and run it, it should still work, but there are 2 issues:

  1. You're reading an enormous file in to memory in one hit by that method.
  2. You'd create 'artificial' kmers, which correspond to the joins between chromosomes. (e.g the last 4 bases of chr1, and the first five of chr2 etc.). This might not be an issue if those kmers have already arisen normally, but thats not guaranteed.

I suppose the way to do it might be to read each chromosome, get its kmers, and add all of them to a 'master' pool of all kmers, and then randomly select from that. Storing those kmers will require a fair amount of memory though.

The concatenated approach would look something like:

$ cat GRCh38.fasta | sed -e '1!{/^>.*/d;}' | sed ':a;N;$!ba;s/\n//2g' | sed '1!s/.\{80\}/&\n/g' > GRCh38_concatenated.fasta

Then run the code as before on the newly created file.

Alternatively, iterate each chromosome and populate the pool along the lines of the following (I'm not going to run/test it as it will likely take too long):

#!/usr/bin/env python
import re, random
from Bio import SeqIO

n = 20
k = 5
kmers = re.compile("(?=(\w{%s}))" % k)
all_kmers = []
for record in SeqIO.parse('GCRh38.fasta', 'fasta'):
    all_kmers.append(kmers.findall(str(record.seq)))

selected = random.sample(all_kmers, n)
ADD REPLY
0
Entering edit mode

I tried your code in the last comment and it's taking an unusually long time. You were right! So I am sticking to the chromosome-wise sampling. Just a thought: What if I generate multiple random ranges of length 10 (say) and then use these to extract sequences. So I will generate something like 189-200, 7767655-7767666, 8987870-8987881 etc.

ADD REPLY
0
Entering edit mode

I’m not sure I fully understand what you mean but I don’t think you can just randomly choose ranges. There would be no guarantee of you coming anywhere close to sampling the full kmer space.

There is a reason people don’t tend to do these sorts of brute force kmer calculations (exactly this reason). You could probably parallelise the code and work on each sequence separately, which would speed it up enormously however.

ADD REPLY
0
Entering edit mode

Say there are 2899452029 base pairs in the reference genome. Then I can choose the starting position of the kmer uniformly from 1 to 2899452029 at random. This will continue until I get the required number of kmers.

ADD REPLY
1
Entering edit mode

Hmm yeah perhaps that would work. You'll have to decide what you do if it returns the same kmer in that case (and if you use a concatenated sequence for that, there is still the issue of 'artificial kmers').

ADD REPLY
1
Entering edit mode
5.9 years ago
bernatgel ★ 3.4k

If I understand your question correctly, you want to randomly extract n sequences of length k from the hg38 genome. If this is the case, in R you can use regioneR package in Bioconductor. You can use the function createRandomRegions to randomly select a number of positions in the genome and then use getSeq from BSgenome to extract the sequences.

For example

library(BSgenome.Hsapiens.UCSC.hg38)
library(regioneR)

regs <- createRandomRegions(nregions = 100, length.mean = 10, length.sd = 0, genome = "hg38", mask=NA)
seqs <- getSeq(BSgenome.Hsapiens.UCSC.hg38, regs)

Will extract 100 region 10bp each from the genome. By default createRandomRegions will avoid placing the regions in certain parts of the genome (highly repetitive, low complexity, stretches of N's...) using the default mask in the BSgenome object. Setting mask=NA will disable any mask and allow it to place the regions anywhere in the genome with uniform probability.

Is this what you were looking for?

ADD COMMENT
0
Entering edit mode

Hi, This is a great tool. thanks a lot for suggesting. Can you also specify how not to get Ns in my results? How do I mask that? Removing mask=NA from the above code isn't helping.

ADD REPLY
0
Entering edit mode

Yes, true. To use the mask you need to load the masked version of the BSgenome package

library(BSgenome.Hsapiens.UCSC.hg38.masked)
library(regioneR)

regs <- createRandomRegions(nregions = 100, length.mean = 10, length.sd = 0, genome = "hg38")
seqs <- getSeq(BSgenome.Hsapiens.UCSC.hg38, regs)

With that you should not get the N's. However, take into account that the default mask do not have only the N regions but many regions identified by repeat masker, and others (refer to the BSgenome documentation for a detailed account of the masked regions). This will affect your results, since some motifs are heavily enriched in repeats and other masked elements. Also take into account that this problem will be present also with any other approach.

Maybe a better option would be to get more seqs than needed, remove the ones with Ns and then randomly sample the reduced list.

ADD REPLY

Login before adding your answer.

Traffic: 2506 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6