Question

How to generate a short sequence that does not align to the hg38?

4

Entering edit mode

5.8 years ago

r.tor ▴ 50

I would like to find a short sequence (25 bp) that does not match anywhere within Human Reference Genome (hg38/hg19) through R scripts. I am not sure if there is a quick solution to do what I am looking for. So, I guess it would be solve, if I can create a list of random 25 bp fragments and then do an alignment. Ultimately, the lowest score output could be the supposed answer! If it might be a correct approach to do, could you please guide me how it can be done by R scripting. Thank you.

R alignment • 1.7k views

ADD COMMENT • link 5.8 years ago by r.tor ▴ 50

2

Entering edit mode

Just thinking aloud, but you could screen for low abundance kmers in the Human genome, and then create a random sequence from any low abundance/absent kmers.

That should reduce the likelihood of any matching, rather than a 'guess and check' approach.

ADD REPLY • link 5.8 years ago by Joe 21k

2

Entering edit mode

Keep in mind also that to ask for a sequence that does not align you need to decide what makes an alignment valid (e.g, maximum number of mismatches or minimum percent identity, E-value, etc). If you don't define any constraint, then any sequence can be aligned to any other sequence by allowing enough edits. (Presumably you leave this decision to your aligner by executing it with default settings but maybe worth to think about it...)

ADD REPLY • link 5.8 years ago by dariober 15k

1

Entering edit mode

I’ll leave it to OP to edit, as I don’t like changing peoples wording if I can avoid it, but i suggest RefSeq in the thread title be changed to hg38 or whatever. It sounds like trying to create a sequence which doesn’t align anywhere in RefSeq at the moment which is probably nigh on impossible.

ADD REPLY • link 5.8 years ago by Joe 21k

0

Entering edit mode

Use this for some inspiration: https://bioinformatics.stackexchange.com/questions/2853/least-present-k-mers-in-the-human-genome

ADD REPLY • link 5.8 years ago by GenoMax 146k

score 2 · Answer 1 · 2019-01-20

What's it for? 25 bp sequence not found in human genome?

A previous answer in another post suggested a good method by mapping k-mers to genome and filtering that not mapped.

Here's a brute force method by myself: counting smaller k-mers (k=14) in genome and filtering them out from all theoretical k-mers, smaller K uses less RAM and runs faster but it becomes less specific.

Splitting human genome into files with single sequence using seqkit for speedup in later steps, rush is optional used for parallelization.

# splitting
seqkit split -s 1 Homo_sapiens.GRCh38.dna.primary_assembly.fa -O GRCh38/

# renaming
ls GRCh38/*.fa | rush 'mv {} {/}/$(seqkit head -n 1 {} | seqkit seq -n -i).fa'

Dumping all k-mers for k = 14 using unikmer:

# 268435455 == 1<<28 - 1
time seq 0 268435455 | unikmer decode -k 14 | unikmer dump -K -o all.k14

# sorting to reduce file size
time unikmer sort all.k14.unik -o all.k14.sorted

Counting k-mers in chromesomes: (Threads number (-j) depends on your RAM, mine is 48GB).
```
ls GRCh38/*.fa  | rush -j 5 'unikmer count -k 14 -s -K -c -o {}.k14 {} --verbose'
```

Finding k-mers not in GRCh38: (Threads number (-j) depends on your RAM, mine is 48GB).

time unikmer diff all.k14.sorted.unik GRCh38/*.k14.unik -o all.k14.sorted.uniq@GRCh38 -s -j 4 --verbose

Results: 20,423,600 / 268,435,456 == 7.61% 14-mers are not found in human genome:

unikmer stats all.k14.sorted.uniq@GRCh38.unik -a
file                              k  gzipped  compact  canonical  sorted      number
all.k14.sorted.uniq@GRCh38.unik  14  ✓        ✕        ✓          ✓       20,423,600

Checking:

# sampling 20 k-mers
unikmer view all.k14.sorted.uniq@GRCh38.unik -a \
    | seqkit sample -p 0.000001 | seqkit head -n 20 > t.fa

seqkit seq -s t.fa
AAGCGTAAGCGGCA
ACGCACCGAGAGTG
ACGGATCCCGGTCG
ACGTAGAACGGACC
AGCAGGTGACGTCG
AGCGCGCGTTCAGA
ATAAGTCGCGCCCA
ATATCGACTCCGTG
ATGCACTTGATCGG
ATTGACGCAGCGAA
CACGCGGCAGCGAG
CGCGTTGGACATAG
CGCTCGCCAGTTGC
CGCTTTGTCGAACA
GACGGCCGACGATA
GAGTCTCGCACGAA
GATCGCATATGTTA

seqkit locate -i -f t.fa Homo_sapiens.GRCh38.dna.primary_assembly.fa
seqID   patternName     pattern strand  start   end     matched

# no hit

At last, you can freely extend the 14-kmers to 25bp.