I want to randomly sample 'n' number of sequence motifs of length 'k' from the reference sequence hg38.fa. Can someone help me out?
I want to randomly sample 'n' number of sequence motifs of length 'k' from the reference sequence hg38.fa. Can someone help me out?
This code does the job. It's not terrible at scale either.
#!/usr/bin/env python
import re, random
n = 20
k = 5
kmers = re.compile("(?=(\w{%s}))" % k)
selected = random.sample(kmers.findall(seq), n) # Where 'seq' is the genome sequence as a string.
Change k
(motif/kmer length) and n
(random sample size) to suit.
Since posting this answer I've done some benchmarking, and this enumerated (n
= ) 10 random (k
= ) 9-mers from each entry in GCA_000001405.15_GRCh38_genomic.fna
. Note that this should scale reasonably well for sampling larger n
, but larger k
will eventually become problematic
The total code execution time was: 00:22:03.37
(slightly over 22 minutes). See the gist below for the the results for each individual fasta in the multi .fna
.
Well, it's essentially the same thing, since you get kmers from all sequences as a 'pool'. It might not be 100% fully statistically random, but someone with more intuition on that would have to weigh in. Certain kmers could be over or underrepresented in this approach I guess.
If you concatenated all of the fastas together and run it, it should still work, but there are 2 issues:
I suppose the way to do it might be to read each chromosome, get its kmers, and add all of them to a 'master' pool of all kmers, and then randomly select from that. Storing those kmers will require a fair amount of memory though.
The concatenated approach would look something like:
$ cat GRCh38.fasta | sed -e '1!{/^>.*/d;}' | sed ':a;N;$!ba;s/\n//2g' | sed '1!s/.\{80\}/&\n/g' > GRCh38_concatenated.fasta
Then run the code as before on the newly created file.
Alternatively, iterate each chromosome and populate the pool along the lines of the following (I'm not going to run/test it as it will likely take too long):
#!/usr/bin/env python
import re, random
from Bio import SeqIO
n = 20
k = 5
kmers = re.compile("(?=(\w{%s}))" % k)
all_kmers = []
for record in SeqIO.parse('GCRh38.fasta', 'fasta'):
all_kmers.append(kmers.findall(str(record.seq)))
selected = random.sample(all_kmers, n)
I tried your code in the last comment and it's taking an unusually long time. You were right! So I am sticking to the chromosome-wise sampling. Just a thought: What if I generate multiple random ranges of length 10 (say) and then use these to extract sequences. So I will generate something like 189-200, 7767655-7767666, 8987870-8987881 etc.
I’m not sure I fully understand what you mean but I don’t think you can just randomly choose ranges. There would be no guarantee of you coming anywhere close to sampling the full kmer space.
There is a reason people don’t tend to do these sorts of brute force kmer calculations (exactly this reason). You could probably parallelise the code and work on each sequence separately, which would speed it up enormously however.
If I understand your question correctly, you want to randomly extract n sequences of length k from the hg38 genome. If this is the case, in R you can use regioneR package in Bioconductor. You can use the function createRandomRegions
to randomly select a number of positions in the genome and then use getSeq
from BSgenome to extract the sequences.
For example
library(BSgenome.Hsapiens.UCSC.hg38)
library(regioneR)
regs <- createRandomRegions(nregions = 100, length.mean = 10, length.sd = 0, genome = "hg38", mask=NA)
seqs <- getSeq(BSgenome.Hsapiens.UCSC.hg38, regs)
Will extract 100 region 10bp each from the genome. By default createRandomRegions
will avoid placing the regions in certain parts of the genome (highly repetitive, low complexity, stretches of N's...) using the default mask in the BSgenome object. Setting mask=NA
will disable any mask and allow it to place the regions anywhere in the genome with uniform probability.
Is this what you were looking for?
Yes, true. To use the mask you need to load the masked version of the BSgenome package
library(BSgenome.Hsapiens.UCSC.hg38.masked)
library(regioneR)
regs <- createRandomRegions(nregions = 100, length.mean = 10, length.sd = 0, genome = "hg38")
seqs <- getSeq(BSgenome.Hsapiens.UCSC.hg38, regs)
With that you should not get the N's. However, take into account that the default mask do not have only the N regions but many regions identified by repeat masker, and others (refer to the BSgenome documentation for a detailed account of the masked regions). This will affect your results, since some motifs are heavily enriched in repeats and other masked elements. Also take into account that this problem will be present also with any other approach.
Maybe a better option would be to get more seqs than needed, remove the ones with Ns and then randomly sample the reduced list.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
I am reasonably certain that what you are looking for can be found in @Pierre and @Kevin's answers here: A: Finding 16 mer not present in GRCh38 Kevin's answer will work, if k < 17.