Question

(EDITED) Aligning complete viral RNAs

0

4.4 years ago

oddjobs ▴ 10

It seems that some sequences deposited in the NCBI Virus database may be labeled with the wrong taxid. For example, when I search with taxid:694009 (SARS-CoV), the results also include SARS-CoV-2 (e.g., NC_045512), which has its own taxid, 2697049. I wonder whether this is more widespread and not restricted to SARS-related viruses.

If it is widespread, is there a way to get around this problem? My project depends on downloading sequences for many viral species and aligning them to their respective reference sequences. If I cannot trust taxid-based downloads, I will need to filter the data further. One option I can think of is to apply a cut-off after alignment to remove noisy downloads. What would be a systematic way to determine this cut-off?

EDIT: Based on comments below, I realize the above is a rookie error, and the taxid I used is not for a single virus, but rather a collection of viruses related to SARS. The question below is still of interest to me. I have received the following suggestions: minimap2, LASTZ, and Nextstrain. Thanks!

Also, what is a good way to align long sequences? I am currently using the Striped Smith-Waterman aligner provided in skbio (which is adapted from https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library). However, the skbio version seems to have a sequence length cut-off of 16384. Is there another tool that can align longer sequences?

RefSeq skbio StripedSmithWaterman Alignment • 1.1k views

ADD COMMENT • link 4.4 years ago by oddjobs ▴ 10

1

That's not a mislabeling; SARS-CoV-2 is under that taxid:

https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=694009

NCBI taxids are hierarchical. For example, humans fall under many different taxids: eukaryotes, animals, Chordata, mammals, etc.

ADD REPLY • link 4.4 years ago by 5heikki 11k

0

I see. Thanks for this information! So I will essentially need to use the taxids of the leaf nodes in this tree to ensure that the sequences come from a single species.

ADD REPLY • link 4.4 years ago by oddjobs ▴ 10
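
To make the hierarchy and leaf-node idea concrete, here is a minimal sketch in Python. It assumes a child-to-parent map in the spirit of the `nodes.dmp` file from the NCBI taxonomy dump. Only 694009 and 2697049 come from this thread; the other taxids (9000001 to 9000003) are made-up placeholders, and the tree shape is a simplification for illustration, not the real NCBI tree.

```python
from collections import defaultdict

# Toy child -> parent map (hypothetical, NOT real NCBI data except where noted).
PARENT = {
    694009: 9000001,   # SARS-related coronavirus group (from the thread) -> placeholder parent
    2697049: 694009,   # SARS-CoV-2 (from the thread), placed directly under 694009 for simplicity
    9000002: 694009,   # placeholder internal node
    9000003: 9000002,  # placeholder leaf
    9000001: 1,        # placeholder parent attached to the root
}


def is_descendant(taxid, ancestor, parent):
    """Return True if `ancestor` appears on the path from `taxid` up to the root.

    A taxid counts as a descendant of itself.
    """
    seen = set()
    while taxid in parent and taxid not in seen:
        if taxid == ancestor:
            return True
        seen.add(taxid)
        taxid = parent[taxid]
    return taxid == ancestor


def leaves_under(taxid, parent):
    """Return the sorted taxids below `taxid` that have no children of their own."""
    children = defaultdict(list)
    for child, par in parent.items():
        if child != par:
            children[par].append(child)

    stack, leaves = [taxid], []
    while stack:
        node = stack.pop()
        if children.get(node):
            stack.extend(children[node])
        else:
            leaves.append(node)
    return sorted(leaves)


if __name__ == "__main__":
    print(is_descendant(2697049, 694009, PARENT))
    print(leaves_under(694009, PARENT))
```

Searching with an internal taxid such as 694009 returns records for everything beneath it, which is why SARS-CoV-2 shows up; restricting downloads to the taxids returned by something like `leaves_under` is one way to avoid that.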

1

I don't think this is a mistake. 694009 refers to the broad class of Severe acute respiratory syndrome-related coronaviruses, of which SARS-CoV-2 is a member. You can go to the NCBI taxonomy browser and search with this taxID to see that.

SARS-CoV-2 was recognized as a separate species sometime in February 2020 (if I recall correctly), so it was probably given an independent taxID after that point.

minimap2 is a perfect aligner for long reads. There are others, like LASTZ, which can do chromosomal alignments. You could also use the tools used by Nextstrain projects to do these alignments.

ADD REPLY • link 4.4 years ago by GenoMax 148k

0

Seems I made a rookie error! Thanks for the correction.

I will look into both LASTZ and Nextstrain.

I am not sure minimap2 is suitable, since it is designed for sequencing reads with high error rates (especially indel error rates) rather than finished assemblies. I believe that for it to work well, I would need to change the parameters and increase the mismatch/deletion penalties. Please correct me if I am wrong.

ADD REPLY • link 4.4 years ago by oddjobs ▴ 10

0

It depends on what you are aligning to these genomes, which you have not told us. If you have long reads, then minimap2 may be a valid option. If you tell us what your source data are, you can get more specific recommendations.

ADD REPLY • link 4.4 years ago by GenoMax 148k

0

Right. I am trying to align a finished sequence for a virus to the virus' reference sequence. So most mismatches/indels are just from viral mutations. I am not looking to do a multiple sequence alignment at this point.

ADD REPLY • link 4.4 years ago by oddjobs ▴ 10
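
For the finished-sequence-versus-reference case described above, minimap2 also ships assembly-to-reference presets (`-x asm5`, `asm10`, `asm20`), so hand-tuning the penalties may not be needed. The sketch below is one possible way to use its PAF output for the cut-off question. It only builds the command line (it does not run minimap2) and parses the standard PAF columns. The file names, the PAF lines, and the 0.9 thresholds are illustrative assumptions, not recommendations or real results.

```python
from dataclasses import dataclass


def minimap2_command(reference_fasta, query_fasta, preset="asm5"):
    """Build (but do not run) a minimap2 command that writes PAF with CIGAR (-c)."""
    return ["minimap2", "-c", "-x", preset, reference_fasta, query_fasta]


@dataclass
class PafHit:
    query: str
    query_len: int
    query_start: int
    query_end: int
    strand: str
    target: str
    matches: int    # PAF column 10: number of matching bases
    block_len: int  # PAF column 11: alignment block length, including gaps

    @property
    def identity(self):
        return self.matches / self.block_len

    @property
    def query_coverage(self):
        return (self.query_end - self.query_start) / self.query_len


def parse_paf_line(line):
    """Parse the mandatory PAF columns that are needed for filtering."""
    f = line.rstrip("\n").split("\t")
    return PafHit(f[0], int(f[1]), int(f[2]), int(f[3]), f[4], f[5],
                  int(f[9]), int(f[10]))


def passes(hit, min_identity=0.9, min_coverage=0.9):
    """Keep a hit only if it clears both thresholds (placeholder values)."""
    return hit.identity >= min_identity and hit.query_coverage >= min_coverage


if __name__ == "__main__":
    # Synthetic PAF lines, written by hand for illustration only.
    good = "query1\t29903\t0\t29903\t+\tNC_045512\t29903\t0\t29903\t29873\t29903\t60"
    poor = "query2\t29000\t0\t29000\t+\tNC_045512\t29903\t0\t29000\t23000\t29000\t60"
    print(minimap2_command("reference.fasta", "query.fasta"))
    for line in (good, poor):
        hit = parse_paf_line(line)
        print(hit.query, round(hit.identity, 4), passes(hit))
```

Rather than fixing the thresholds in advance, one option is to plot the identity and coverage values across all downloads for a species and look for a clear gap between the bulk of the sequences and the outliers.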