Question

How to determine which NCBI sequence to map against (multiple sequences for single taxID)

2

Entering edit mode

5.2 years ago

LRStar ▴ 200

I am aiming to determine the species/subspecies identity of a (possibly bacteria) sample. I applied the following sendsketch command:

sendsketch.sh in=contigs.fasta,

where the input file (contigs.fasta) was previously generated from the raw paired-end read file using SPAdes. The results of the above command can be viewed here: here. The sendsketch website states that WKID is "the column that tells you the actual sequence similarity (disregarding size)." As a result, I believe I should focus on H. macacae since it had the highest WKID value (along with KID = 0.05%; ANI = 96.6%; Contam = 2.5%).

1) Are the output values here sound or a cause for alarm? Is the WKID (38.9%) or KID (0.05%) value too low? Or are other values questionable?

2) If the values are sound, I hope to check the mapping rate (using BWA) of the sample to the H. macacae genome (so that I can use it as a reference genome for further steps). However, I am a bit stuck on figuring out the appropriate sequence to download. The taxID number output from the sendsketch was 398626. It seems there are 46 sequences available on NCBI (here). How should I go about determining the optimal H. macacae sequence to map my samples to?

Side note: I have a similar problem with H. mastomyrinus, the second-highest WKID value from sendsketch. There are 17 sequences available on NCBI here. In this case, some are 16S and some are 23S. How do I know which sequence to map against in this scenario?

ncbi sendsketch bwa • 1.9k views

ADD COMMENT • link 5.2 years ago by LRStar ▴ 200

0

Entering edit mode

@5heikki. Thank you for your suggestion. I am probably only asking this because I am lacking very basic knowledge, but why is the one you posted necessarily the best compared to those listed on NCBI? I noticed on NCBI that 25 of the 46 taxID= 398626 results are of the strain you listed (MIT 99-5501). Thank you for the link as well. On your link, I am guessing I should download the one with extension genomic.fna.gz to use as reference genome for DNA sequence? Finally, how did you obtain a "local RefSeq"? Thank you again for your support.

ADD REPLY • link 5.2 years ago by LRStar ▴ 200

1

Entering edit mode

but why is the one you posted necessarily the best compared to those listed on NCBI?

NCBI's RefSeq project aims to provide

A comprehensive, integrated, non-redundant, well-annotated set of reference sequences including genomic, transcript, and protein.

So the genome linked by @5heikki can be considered a good representative of H. macacae for the analysis you are looking to do.

You can get the genomic.fna.gz file which will have the sequence in fasta format.

local RefSeq is likely referring to a local copy of the RefSeq database. Some institutions may download a local copy for ease of access. You can find a copy of RefSeq sequences here. Refseq blast preformatted BLAST databases are also available here.

Do you have a pure sample of the bacteria that was sequenced? i.e. you are sure there should be only species present.

ADD REPLY • link 5.2 years ago by GenoMax 148k

0

Entering edit mode

Thanks for your help, @genomax and sorry for my earlier thread-organization blunder. Thanks for your question. I am not sure if there is only one species present. I only just began analyzing this type of data and received this dataset from my advisor (who received it from a wet lab researcher). I believe the wet lab researchers think the sample is one strain (or at least one species), but I am not sure about it. I have been attacking this dataset with numerous software suggested to me kindly by biostars users these past weeks, including sourmash (which resulted at k-mer=11 to about 43.5% similarity to various H. pylori strains and some other Helicobacter strains), Kraken on Galaxy (which had 86% bacteria unmapping and only resulted in 5.66% mapping to H.pylori and H. cetorum and other Helicobacter strains), and most recently sendsketch with WKID=38.9% mapping to H. macacae. If you are familiar with this type of analysis, what do you think of these results? I am having a very hard time interpreting it. The wet lab researchers would like me to replicate something like Table 2 from this paper. I believe to do so, I have to first determine what "genome" this sample belongs to. I wonder, if you have experience with this type of analysis, what would you do next given my results so far? I would be grateful to hear any ideas if you have any!

ADD REPLY • link 5.2 years ago by LRStar ▴ 200

0

Entering edit mode

@genomax. Also, a quick update that I downloaded the H. macacae fasta file and used BWA to map the paired-end raw read sample to it. Then, I investigated the results with Qualimap (this pipeline was kindly suggested to me by a biostars user a few weeks ago). I may be doing something wrong, but am disappointed to see that only 1.71% of reads mapped. To me (with little experience), the mapping rate seems surprisingly low given the WKID=38.9% mapping rate to H. macacae in sendsketch (but perhaps that is also considered low anyway)? Sorry for another message here. I am just having difficulty figuring out what to do :oD

ADD REPLY • link 5.2 years ago by LRStar ▴ 200

1

Entering edit mode

LRStar : Can you take a small selection of reads (~20-25, convert them to fasta format using following code for use with BLAST)

 zcat file.fastq.gz  | grep -A 1 "^@" --no-group-separator | sed 's/^@/>/'g | head -40

and blast them using NCBI's web blast page. Check what comes back on those reads.

If this is a pure culture you should get some clear indication of that (and what organism it could be) from this search. If results are inconclusive then let us know.

ADD REPLY • link 5.2 years ago by GenoMax 148k

0

Entering edit mode

@genomax. Thank you for your support. I followed your suggestion running BLAST (using a Database of "Nucleotide collection (nr/nt)". I ran it using "Optimize for" values of megablast and blastN as listed here. I did this for both the first 25 bases and the last 25 bases in the file. BlastN on the first 25 bases had Helicobacter cetorum MIT 00-7128 with lowest e-value (3e-39) (Screenshot here), megablast on the first 25 bases had "no significant similarity found message" (Screenshot here), BlastN on the last 25 bases had Helicobacter felis ATCC 49179 with lowest e-value (1e-25) (Screenshot here), and megablast on the last 25 bases had Helicobacter felis ATCC 49179 with lowest e-value (3e-19) (Screenshot here). I am not sure if there are certain parameters I should use (such as blastN versus megablast) and whether theses results seem consistent - especially because the highest WKID value from sendsketch was a different Helicobacter species.

ADD REPLY • link 5.2 years ago by LRStar ▴ 200

0

Entering edit mode

@genomax. I also checked with my advisor and he believes these samples are "isolated". He also said they could not be bacteria (could be archaea or even other organisms); his contacts just have a suspicion this is bacteria. So, with that information, I am not sure if I should be doing yet even more exhaustive searching.

ADD REPLY • link 5.2 years ago by LRStar ▴ 200

0

Entering edit mode

This will tell you how many percent of the reference genome your reads cover (or vice versa if you map ref reads on your assembly). Here reads are unpaired, so adapt to your needs where necessary..

bowtie2 \
    -x "$GENOME" \
    -U "$READS" \
    --end-to-end \
    --very-sensitive \
    --all \
    | awk 'BEGIN{OFS=FS="\t"}{if($1 ~ /^@/ || $12~/AS:i:0$/){print $0}}' \
    | samtools sort - \
    | samtools depth -a - \
    | awk 'BEGIN{FS="\t";COV=TOT=0}{if($3>0) TOT++;COV++}END{print (TOT/COV)*100}'

ADD REPLY • link 5.2 years ago by 5heikki 11k

0

Entering edit mode

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to original question.

ADD REPLY • link 5.2 years ago by GenoMax 148k

0

Entering edit mode

What is "listed on NCBI" (your link) includes any DNA sequence from txid398626, not just genome assemblies. You can check "the best available" bacterial genome assemblies from ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt (I recommend that you don't open the link in browser, it's a large file). Supposedly RefSeq is curated, but from experience I can say that it still includes numerous falsely annotated (taxonomy) genomes, like 1-5% from total (not really relevant to your current question)..

ADD REPLY • link 5.2 years ago by 5heikki 11k

score 2 · Accepted Answer · 2019-09-26

There's only one H. macacae genome in my slightly dated local RefSeq Bacteria copy:

GCF_000507845.1 PRJNA224116 SAMN00690723    AZJI00000000.1  representative genome   1357400 398626  Helicobacter macacae MIT 99-5501    strain=MIT 99-5501      latest  Scaffold    Major   Full    2013/12/13  Heli_maca_MIT_99-5501_V1    Broad Institute GCA_000507845.1 identical   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/507/845/GCF_000507845.1_Heli_maca_MIT_99-5501_V1     assembly from type material