Question

How to get the sequence differences between multiple bacterial genomes

4.0 years ago

Rose ▴ 20

I am working on some closely related bacterial species (complete genomes from NCBI) and would like to extract the sequence differences between them. More specifically, I want to find unique sequences (50-100 nt) in each of the six bacterial species in my study. I found many related posts suggesting tools such as Mauve, MUMmer, and Artemis ACT, and I tried all of them. Mauve gave fairly good results, but when I extracted the regions it reported as different from the other genomes in the comparison and ran BLAST on them, I got multiple hits to other bacterial species.

Thanks in advance.

genome alignment • 4.2k views

ADD COMMENT • link updated 3.2 years ago by ahmed_elsherbini91 ▴ 10 • written 4.0 years ago by Rose ▴ 20

If you want the sequence to be inclusive/exclusive to your specific genome, then you need to include a representative set of all available bacterial genomes in your exclusion set.

ADD REPLY • link 4.0 years ago by 5heikki 11k

Yes, excluding similar sequences in the human genome would be better.

ADD REPLY • link 4.0 years ago by shenwei356 8.7k

score 4 · Answer 1 · 2020-11-25

4.0 years ago

shenwei356 8.7k

Here's one solution with unikmer, for a limited number of input files. The for loops can be replaced with parallel to save time.

Generating k-mers.

```
for f in *.fa.gz; do
    unikmer count -k 31 -K -s -o "$f.k31" "$f"
done
```
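
As a rough sketch of the parallel variant mentioned above, the same `unikmer count` call can be fanned out over several files at once. This version swaps in `xargs -P` instead of GNU parallel. The function name `count_all` and the `JOBS` and `UNIKMER` variables are made up for this illustration, and `xargs -P` is assumed to be available (GNU and BSD xargs have it, but it is not part of POSIX).

```shell
# Hypothetical helper: run `unikmer count` on several files concurrently.
# JOBS    = number of concurrent processes (assumed default: 4)
# UNIKMER = program to run (defaults to unikmer; override for a dry run)
count_all() {
    printf '%s\n' "$@" |
        xargs -P "${JOBS:-4}" -I{} "${UNIKMER:-unikmer}" count -k 31 -K -s -o {}.k31 {}
}
```

Calling `count_all *.fa.gz` should write the same `.k31.unik` files as the loop, while `UNIKMER=echo count_all *.fa.gz` only prints the commands that would be run.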

Computing k-mers shared by two or more strains. (updated here)

```
unikmer common -n 2 *.k31.unik -o shared

# method 2, faster
unikmer sort *.k31.unik --repeated -o shared
```

Removing k-mers shared by two or more strains; what remains are the unique k-mers.

```
for f in *.k31.unik; do
    unikmer diff "$f" shared.unik -s -o "$f.uniq"
done
```

Retrieving unique sequences from each genome using its unique k-mers. The output format can be BED (default) or FASTA (-a), and the minimum length (-m) is configurable.
```
for f in *.fa.gz; do
    unikmer uniqs -g $f $f.k31.unik.uniq.unik -M -m 50 -a -o $f.uniq.fa;
done
```

Stats of result.

```
unikmer stats *.uniq.unik -a
file                                            k  canonical  hashed  scaled  include-taxid  global-taxid  sorted  compact  gzipped  version  number
1773-GCA_000706665.1.fa.gz.k31.unik.uniq.unik  31          ✓       ✕       ✕              ✕                     ✓        ✕        ✓     v4.1  16,217
1773-GCA_000738445.1.fa.gz.k31.unik.uniq.unik  31          ✓       ✕       ✕              ✕                     ✓        ✕        ✓     v4.1  11,939
1773-GCA_000738475.1.fa.gz.k31.unik.uniq.unik  31          ✓       ✕       ✕              ✕                     ✓        ✕        ✓     v4.1   7,074
1773-GCA_000756525.1.fa.gz.k31.unik.uniq.unik  31          ✓       ✕       ✕              ✕                     ✓        ✕        ✓     v4.1   9,679
1773-GCA_000756545.1.fa.gz.k31.unik.uniq.unik  31          ✓       ✕       ✕              ✕                     ✓        ✕        ✓     v4.1  72,867
1773-GCA_000934585.1.fa.gz.k31.unik.uniq.unik  31          ✓       ✕       ✕              ✕                     ✓        ✕        ✓     v4.1  20,135
```

```
seqkit stats *.uniq.fa
file                                format  type  num_seqs  sum_len  min_len  avg_len  max_len
1773-GCA_000706665.1.fa.gz.uniq.fa  FASTA   DNA        358   28,030       50     78.3    1,005
1773-GCA_000738445.1.fa.gz.uniq.fa  FASTA   DNA        379   23,088       50     60.9       89
1773-GCA_000738475.1.fa.gz.uniq.fa  FASTA   DNA        220   13,486       50     61.3       99
1773-GCA_000756525.1.fa.gz.uniq.fa  FASTA   DNA        239   16,306       50     68.2      354
1773-GCA_000756545.1.fa.gz.uniq.fa  FASTA   DNA      1,888  129,254       50     68.5    6,865
1773-GCA_000934585.1.fa.gz.uniq.fa  FASTA   DNA        479  211,054       50    440.6    9,400
```
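
To make the set logic behind these steps concrete, here is a toy re-implementation with standard shell tools. It is only a sketch for small, uncompressed FASTA files and not a replacement for unikmer: it uses forward-strand k-mers only (the `-K` option above works on canonical k-mers), and the function names `kmers` and `unique_kmers` are invented for this example.

```shell
# kmers FILE K: print the distinct k-mers of a plain FASTA file, sorted.
kmers() {
    awk -v k="$2" '
        # print every window of length k in sequence s
        function emit(s,   i) {
            for (i = 1; i + k - 1 <= length(s); i++) print substr(s, i, k)
        }
        /^>/ { emit(seq); seq = ""; next }   # header: flush the previous record
        { seq = seq toupper($0) }            # join wrapped sequence lines
        END { emit(seq) }
    ' "$1" | LC_ALL=C sort -u
}

# unique_kmers K TARGET ALL_FILES...: k-mers of TARGET found in no other file.
# ALL_FILES must list every genome, including TARGET itself.
unique_kmers() {
    uk_k=$1; uk_target=$2; shift 2
    {
        kmers "$uk_target" "$uk_k"
        # k-mers that occur in exactly one genome (any genome)
        for uk_f in "$@"; do kmers "$uk_f" "$uk_k"; done | LC_ALL=C sort | uniq -u
    } | LC_ALL=C sort | uniq -d   # keep those that also belong to TARGET
}
```

For example, `unique_kmers 31 genomeA.fa *.fa` (with a hypothetical `genomeA.fa`) lists the 31-mers found only in that genome, which corresponds roughly to the result of the `unikmer diff` step.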

ADD COMMENT • link 4.0 years ago by shenwei356 8.7k

Perfect!! Thanks very much @shenwei356. This is exactly what I was looking for.

ADD REPLY • link 4.0 years ago by Rose ▴ 20

Sorry, I just noticed that step 2 used the wrong command; I have edited the answer.

We should remove k-mers shared by >= 2 genomes (unikmer common), not just those shared by all genomes (unikmer inter).

ADD REPLY • link 4.0 years ago by shenwei356 8.7k

I used this method and got unique sequences from my genome. I sorted them by length using Jalview and ran them through NCBI BLAST. One question: is this method the best way to design PCR primers that are unique to my genome of interest?

ADD REPLY • link 3.2 years ago by ahmed_elsherbini91 ▴ 10
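
Sorting the unique fragments by length does not need a GUI. The following sketch uses only awk and sort; the function name `fasta_by_length` is made up for this example, and it assumes a plain (uncompressed) FASTA file such as the `*.uniq.fa` output of the answer. It ranks by length only and says nothing about how divergent a fragment is.

```shell
# fasta_by_length FILE: print "length<TAB>header<TAB>sequence", longest first.
fasta_by_length() {
    awk '
        # print the record collected so far, if any
        function emit() { if (name != "") print length(seq) "\t" name "\t" seq }
        /^>/ { emit(); name = substr($0, 2); seq = ""; next }
        { seq = seq $0 }              # join wrapped sequence lines
        END { emit() }
    ' "$1" | sort -k1,1nr
}
```

Piping the result through `head` gives the longest candidates to check with BLAST first.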

Once you have the unique sequences, you can design PCR primers with Primer-BLAST, which designs primers and checks their specificity.

Another tool: Fur: Find unique genomic regions for diagnostic PCR.

ADD REPLY • link 3.2 years ago by shenwei356 8.7k

Thank you, sir, for your answer; much appreciated.

I tried the Fur tool via Docker, but it did not give me any unique sequences, maybe because the sequences are very closely related and the bacterium is quite conserved. Your tool, however, gave me around 37 sequences with an average length of 60 bp. Their uniqueness stems from SNPs (I am not sure, but I checked the longest ones). Is there a way to rank the 37 sequences by which one has the highest variability, instead of manually blasting them one by one?

ADD REPLY • link 3.2 years ago by ahmed_elsherbini91 ▴ 10

Sorry, I did not go that far. Please Blast and check them for now.

ADD REPLY • link 3.2 years ago by shenwei356 8.7k

I blasted them, and indeed I found what I was looking for: one non-coding fragment in an intergenic region that represents the best marker for my genome within its own lineage. So unikmer has the upper hand over Fur if you are comparing very close genomes with each other (e.g., within uropathogenic E. coli). Fur is better if you are comparing uropathogenic E. coli with intestinal pathogenic E. coli, or with more distant genomes. Anyway, my judgment is based on my own experience. Thanks for your help.

ADD REPLY • link 3.2 years ago by ahmed_elsherbini91 ▴ 10

Use cd-hit with a high identity cut-off, and tune the cut-off until the results capture the differences. cd-hit clusters nucleotide sequences based on a user-supplied identity cut-off.