Some sequences deposited in the NCBI Virus database appear to be mislabeled with the wrong taxid. For example, when I use taxid:694009 (SARS-CoV) to search for sequences, I see results reported for SARS-CoV-2 as well (e.g., NC_045512). SARS-CoV-2 has a taxid of 2697049. I wonder whether this is more widespread and not restricted to SARS-related viruses.
If it is widespread, is there a way to get around this problem? My project depends on downloading sequences for many viral species and aligning them to their respective reference sequences. However, if I cannot trust taxid-based downloads, I will need additional filtering of the data. What I can think of is to apply some cut-off after alignment to remove noisy downloads. What would be a systematic way to determine this cut-off?
EDIT: Based on comments below, I realize the above is a rookie error, and the taxid I used is not for a single virus, but rather a collection of viruses related to SARS. The question below is still of interest to me. I have received the following suggestions: minimap2, LASTZ, and Nextstrain. Thanks!
Also, what is a good way to align long sequences? I am currently using the striped Smith-Waterman aligner provided in skbio (which is adapted from https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library). However, the skbio version seems to have a sequence length cut-off of 16384. Is there another tool that can help me align longer sequences?
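On the cut-off question, one possible approach (a minimal sketch, not something proposed in this thread) is to compute a per-sequence identity to the reference after alignment and flag outliers with a robust rule such as median minus k times the scaled median absolute deviation (MAD). The function names, the default `k=3.0`, and the example identities are assumptions for illustration; the approach also assumes that most downloaded sequences really do belong to the target virus.

```python
from statistics import median


def identity_cutoff(identities, k=3.0):
    """Return a lower identity cut-off: median - k * scaled MAD."""
    med = median(identities)
    mad = median(abs(x - med) for x in identities)
    # 1.4826 scales the MAD so it is comparable to a standard deviation
    # when the data are roughly normally distributed.
    return med - k * 1.4826 * mad


def split_by_cutoff(identity_by_id, k=3.0):
    """Split sequence IDs into (cutoff, kept, flagged) by identity."""
    cutoff = identity_cutoff(list(identity_by_id.values()), k)
    kept = {name for name, value in identity_by_id.items() if value >= cutoff}
    flagged = set(identity_by_id) - kept
    return cutoff, kept, flagged


# Hypothetical identities (fraction of matching positions) for five downloads.
example = {"seq_a": 0.99, "seq_b": 0.98, "seq_c": 0.995, "seq_d": 0.985, "seq_e": 0.80}
cutoff, kept, flagged = split_by_cutoff(example)
```

The value of `k` is a judgment call; plotting the identity distribution per virus before fixing it is probably worthwhile, since highly diverse viruses may need a looser threshold.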
That's not a mislabeling; SARS-CoV-2 is under that taxid:
https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=694009
NCBI taxids are hierarchical, e.g. humans are under many different taxids: eukaryotes, animals, chordata, mammals, etc.
I see. Thanks for this information! So I will essentially need to get the taxid of the leaf nodes in this tree to ensure it is purely from one species.
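As a rough illustration of collecting the leaf taxids under a parent, here is a minimal sketch that walks a parent/child table in the style of `nodes.dmp` from the NCBI taxonomy dump (fields separated by `|`, with the taxid first and the parent taxid second). The helper names are hypothetical, and the toy tree below is not the real NCBI structure: only 694009 and 2697049 come from this thread, and the 9000001/9000002 IDs are made up.

```python
from collections import defaultdict


def parse_nodes(lines):
    """Build a parent -> children map from nodes.dmp-style lines."""
    children = defaultdict(list)
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 2 or not fields[0]:
            continue
        taxid, parent = int(fields[0]), int(fields[1])
        if taxid != parent:  # the root node lists itself as its parent
            children[parent].append(taxid)
    return children


def leaf_taxids(children, root):
    """Return the sorted taxids at or below `root` that have no children."""
    leaves, stack = [], [root]
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        if kids:
            stack.extend(kids)
        else:
            leaves.append(node)
    return sorted(leaves)


# Toy tree for illustration only (9000001 and 9000002 are made-up taxids).
toy_lines = [
    "1\t|\t1\t|\tno rank\t|",
    "694009\t|\t1\t|\tspecies\t|",
    "2697049\t|\t694009\t|\tno rank\t|",
    "9000001\t|\t694009\t|\tno rank\t|",
    "9000002\t|\t9000001\t|\tno rank\t|",
]
leaves = leaf_taxids(parse_nodes(toy_lines), 694009)
```

Note that leaves in the real taxonomy may be strains or isolates rather than species, so it may be necessary to check the rank column as well.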
I don't think this is a mistake. 694009 refers to the broad class of Severe acute respiratory syndrome-related coronaviruses, of which SARS-CoV-2 is a member. You can go to the NCBI taxonomy browser and search with this taxID to see that. SARS-CoV-2 was recognized as a separate species sometime in February 2020 (if I recall right) so it was given an independent taxID, probably after that point in time.
minimap2 is a perfect aligner for long reads. There are others like LASTZ which can do chromosomal alignments. You could also use the tools used by Nextstrain projects to do these alignments.
Seems I made a rookie error! Thanks for the correction.
I will look into both LASTZ and Nextstrain.
I am not sure minimap2 is suitable, since it is designed for sequencing reads with high error rates (especially indel error rates) rather than finished assemblies. I believe that for it to work well, I would need to change the parameters and increase the mismatch/deletion penalties? Please correct me if I am wrong.
Depends on what you are aligning to these genomes. You have not told us that. If you have long reads then minimap2 may be a valid option. If you tell us what your source data is and what kind then you can get more specific recommendations.
Right. I am trying to align a finished sequence for a virus to the virus' reference sequence. So most mismatches/indels are just from viral mutations. I am not looking to do a multiple sequence alignment at this point.
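For the finished-sequence-to-reference case described here, minimap2 also ships assembly-to-reference presets (`asm5`, `asm10`, `asm20`) that use stricter scoring than the noisy-read presets, so that may be worth trying before hand-tuning penalties. Below is a minimal sketch, assuming `minimap2` is installed and on the PATH; the function names and file arguments are hypothetical, and `asm5` is only a starting guess that should be checked against the expected divergence of each virus. Identity is taken from PAF columns 10 and 11 (matching residues and alignment block length).

```python
import subprocess


def minimap2_cmd(reference_fasta, query_fasta, preset="asm5"):
    """Build a minimap2 command that writes PAF with CIGAR strings."""
    # -x selects a preset; -c asks for base-level alignment (CIGAR) in PAF.
    return ["minimap2", "-c", "-x", preset, reference_fasta, query_fasta]


def paf_identity(paf_line):
    """Return (query name, target name, matches / alignment block length)."""
    cols = paf_line.rstrip("\n").split("\t")
    matches, block_len = int(cols[9]), int(cols[10])
    return cols[0], cols[5], matches / block_len


def align_to_reference(reference_fasta, query_fasta, preset="asm5"):
    """Run minimap2 and return one identity tuple per PAF record."""
    result = subprocess.run(
        minimap2_cmd(reference_fasta, query_fasta, preset),
        check=True,
        capture_output=True,
        text=True,
    )
    return [paf_identity(line) for line in result.stdout.splitlines() if line]
```

Because gaps count toward the block length, this identity is a gap-inclusive estimate; a sequence that yields several short PAF records rather than one long one may itself be a sign that it does not belong to the expected virus.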