Question

strainEST database construction, resources

0

Entering edit mode

5.4 years ago

malteherold ▴ 60

Hello,

In trying to do typing of bacterial strains from metagenomic samples I recently came across this tool, strainEST:

https://github.com/compmetagen/strainest

https://www.nature.com/articles/s41467-017-02209-5

It uses mapping against an alignment of representative strains and SNV profiles to predict relative abundances of individual strains in metagenomic samples. As the species I am interested in does not have a pre-built database, the first step was to construct this. After reducing around 3500 reference strains to 650 clusters or representatives, which is quite a lot compared to what is used in the paper, I am now stuck in generating the alignments and am now also wondering if with this number the following steps will be feasible at all.

Does anyone have experience with this tool and knows what kind of resources are required for database construction and running the tool with a larger amount of representative strains? Also are there any comparable tools that you can recommend for the same purpose (identifying and tracking a large number of bacterial strains in few metagenomic samples)?

My testing:

Stuck at this point (at 92% for hours, so close...):


root@cccebff7163a:/strainest/representative_strains# strainest mapgenomes *fasta  SR.fa MR.fasta
  [#################################---]   92%  0d 00:10:04

Edit:After some more hours the step completed, but in the following step no SNV matrix is created (the step finishes but the resulting snp.dgrp is empty but for the header):


root@cccebff7163a:/strainest/representative_strains#  strainest map2snp SR.fa MR.fasta snp.dgrp
Find the core...
Find SNPs  [####################################]  100%             
Build the SNP matrix  [####################################]  100%             
Write the snp matrix...

The empty SNP matrix is probably due to the fact that it is not possible to determine core regions that are shared across all the genomes. This is a requirement and nothing is reported in case no core region is found. Checking the script api/_map2snp.py I noticed that the coverage is rather low for a few of the genomes, which I now removed and reran generation of MR.fasta (also making sure MR.fasta is not included as a genome here, because it will also have coverage 0).

genome strains metagenome • 1.9k views

ADD COMMENT • link 5.4 years ago by malteherold ▴ 60

0

Entering edit mode

Your prompt says:

root@cccebff7163a
^^^^

You should not perform bioinformatics analyses as root, you should create and use a regular user for that.

ADD REPLY • link 5.4 years ago by h.mon 35k

0

Entering edit mode

this is inside docker

ADD REPLY • link 5.4 years ago by malteherold ▴ 60

score 0 · Accepted Answer · 2020-04-14

To answer my own question if someone else runs into similar problems:

Even with a larger amount of reference genomes (initially 3500, 400 references selected after clustering of mash distances) it is possible to run this without a great amount of resources (4 cores and 32G ram allocated to docker on desktop computer). Database creation takes some time, most by the mapgenomes step (4-5 hours), and building the bowtie2 index took 5-6 hours.

Mapping with bowtie2 and running strainest est is then the most time consuming step afterwards. Which for larger samples (>3G of reads) can take up to 10 hours for me.

The problem with SNP matrix being empty was a result of a few genomes having very low coverage in the initial alignment, which can be checked by running the following part of the script (https://github.com/compmetagen/strainest/blob/master/strainest/api/_map2snp.py) separately:


    seqn = 0
    coverage = np.zeros(reference_seq_len, dtype=np.int)
    click.echo("Find the core...")
    for record in SimpleFastaParser(mapped_handle):
        mapped_seq_arr = np.asarray(list(record[1]))
        coverage += (mapped_seq_arr != '-')
        seqn += 1
    core_idx = np.where(coverage == seqn)[0]

After removing these genomes that lead to a non-existent or substantially reduced core genome, creating the SNP matrix worked.

As described in the tutorial the following steps are necessary after clustering the mash distances of the genomes and selecting references:


strainest mapgenomes genomes/*fasta  genomes/SR.fa MR.fa

strainest map2snp genomes/SR.fa MR.fa snp.dgrp

strainest snpdist snp.dgrp snp_dist.txt hist.png

strainest snpclust snp.dgrp snp_dist.txt snp_clust.dgrp clusters.txt

#see edit below
strainest mapgenomes MA_genomes/*fasta  genomes/SR.fa MA.fa

bowtie2-build MA.fa MA

...

edit: I made a mistake, mapping should not be done against the alignment of representatives to create a SNP matrix but to a separate (smaller!) set of references. This set of references (referred to as MA in the paper) can be obtained by clustering the mash distances of the genomes with a fixed number of clusters, e.g. 10.