Question

How to compare the SARS-CoV-2 RefSeq to the Influenza A RefSeq?

2.6 years ago

bwillkie • 0

I'm trying to learn bioinformatics. As an exercise, I'd like to compare the RefSeq for SARS-CoV-2 to that of Influenza A (H1N1) using Biopython, to get a score of how similar or dissimilar the two viruses are. So something like:

from Bio import pairwise2
alignments = pairwise2.align.globalxx(sars_cov_2, influenza_a)

If I go to the NCBI Virus database, I can find the RefSeq for SARS-CoV-2 (GCF_009858895.2): https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Genome&VirusLineage_ss=Severe%20acute%20respiratory%20syndrome%20coronavirus%202%20(SARS-CoV-2),%20taxid:2697049

If I click on GCF_009858895.2 for details, a drawer slides open from the right of the screen and documents one Nucleotide Accession Segment: NC_045512.2

If I download the file and view the contents, I see one long segment.

I can also find multiple RefSeq entries for Influenza A. I picked one, GCF_001343785.1: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Genome&VirusLineage_ss=Influenza%20A%20virus,%20taxid:711320

If I click on GCF_001343785.1 for details, a drawer slides open from the right of the screen and documents eight Nucleotide Accession Segments: NC_026438.1 NC_026435.1 NC_026437.1 NC_026433.1 NC_026436.1 NC_026434.1 NC_026431.1 NC_026432.1

If I download the file and view the contents, I see eight short segments (not in numerical order: segment 4 is followed by segment 7, etc.).

The data structures for the two viruses are very different. I can pass the contents of the SARS-CoV-2 file (GCF_009858895.2) to pairwise2.align.globalxx without a problem. For Influenza A (GCF_001343785.1), I don't get a single sequence that I can pass into the function.

I've read the Wikipedia page on the FASTA file format, the documentation for the FASTA software, and various posts on this forum. I still don't understand how I can compare these files.

This leaves me with many questions. If I read the NCBI documentation correctly, both RefSeq entries are "complete". What does "complete" mean when the data is broken up into eight segments? Can I expect that the noncoding RNA is included? What transformations can I apply to Influenza A? Can I simply append the eight Influenza A segments together? If so, in what order: the order of the segment numbers, the order in which they appear in the file, or some other order? Is there documentation somewhere that explains why SARS-CoV-2 is stored as one segment while Influenza A is broken up into eight? Is a globalxx comparison between these two files possible and, if so, how?

Thanks! -Brian

NCBI fasta Biopython • 1.4k views

ADD COMMENT • link 2.5 years ago by bwillkie • 0

Once you have the sequences, it's a bit like comparing Moby Dick with A Tale of Two Cities: there are different levels of comparison and different questions you can ask. Focus on the biological question first, and then you can find the tools to help you answer it.

ADD REPLY • link 2.6 years ago by Asaf 10k

Thanks Asaf. I've tried to clarify what I'm trying to do. For me, it's really more of an exercise to learn how to use the technology and learn more about the domain than to answer a specific question.

ADD REPLY • link 2.6 years ago by bwillkie • 0

I think a more instructive exercise would be to compare, e.g., SARS-CoV-2 to other coronaviruses. Influenza and SARS-CoV-2 are so phylogenetically distant that the alignments etc. are going to be pretty rubbish.

You will shoot yourself in the foot, because anything you do downstream with bad alignments (e.g. building a tree or looking at protein differences) will be affected by the bad data going in.
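To illustrate why a raw score from such an alignment says little: globalxx awards 1 per matching position and applies no penalty for mismatches or gaps, so its score equals the length of the longest common subsequence of the two inputs. The pure-Python sketch below is not part of Biopython; the function name `globalxx_score` is hypothetical, and the inputs are random DNA strings rather than viral genomes. It suggests that even completely unrelated sequences receive a sizable score under this scheme.

```python
import random


def globalxx_score(a: str, b: str) -> int:
    """Global alignment score with match=1, mismatch=0, gap=0.

    Under this scoring the optimum is the longest common subsequence length.
    Hypothetical helper for illustration; quadratic time, so short inputs only.
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)  # extend a matching pair
            else:
                curr.append(max(prev[j], curr[j - 1]))  # free gap in either sequence
        prev = curr
    return prev[-1]


# Two unrelated random DNA sequences (fixed seed for repeatability).
rng = random.Random(0)
seq_a = "".join(rng.choices("ACGT", k=200))
seq_b = "".join(rng.choices("ACGT", k=200))

score = globalxx_score(seq_a, seq_b)
print(f"score {score} out of a maximum of {len(seq_a)}")
```

Because random sequences already score this well, a globalxx score between two distant genomes is hard to interpret without a baseline.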

ADD REPLY • link 2.6 years ago by Joe 21k

score 0 · Answer 1 · 2022-05-29

2.6 years ago

patrickdm ▴ 240

"I'm trying to learn bioinformatics. As an exercise..."

Because you mentioned FASTA and Biopython, I'd suggest starting by learning how to read and parse the single-sequence .fasta of SARS-CoV-2 and the multi-sequence .fasta of Influenza A with Bio.SeqIO.
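A minimal sketch of that first step, assuming Biopython is installed. The FASTA text here is made up (toy headers and sequences, not the NCBI records) and is read from memory via StringIO so the example is self-contained. `SeqIO.read` expects exactly one record, while `SeqIO.parse` iterates over any number of records, which is the practical difference between the two downloads.

```python
from io import StringIO

from Bio import SeqIO

# Toy stand-ins for the two downloads; headers and sequences are hypothetical.
single_fasta = ">toy_single hypothetical one-segment genome\nACGTACGTAC\n"
multi_fasta = (
    ">toy_seg4 hypothetical segment 4\nACGTAC\n"
    ">toy_seg7 hypothetical segment 7\nGGTTA\n"
    ">toy_seg1 hypothetical segment 1\nTTGACCA\n"
)

# One record in the file: SeqIO.read returns a single SeqRecord.
single_record = SeqIO.read(StringIO(single_fasta), "fasta")
print(single_record.id, len(single_record.seq))

# Several records in the file: SeqIO.parse yields one SeqRecord per segment,
# in the order they appear in the file.
segments = list(SeqIO.parse(StringIO(multi_fasta), "fasta"))
for record in segments:
    print(record.id, len(record.seq))

# SeqIO.read refuses a multi-record file instead of silently picking one.
try:
    SeqIO.read(StringIO(multi_fasta), "fasta")
    read_failed = False
except ValueError:
    read_failed = True
print("SeqIO.read failed on the multi-record input:", read_failed)
```

With real downloads, a file path can be passed in place of the StringIO handle, e.g. `SeqIO.parse("influenza_a.fasta", "fasta")`, where the file name is only a placeholder.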

"Where can I find documentation to help me understand why the data for these two viruses are stored so differently, and what rules I need to follow in order to compare them?"

To answer this question, you'll want some basic theoretical background in virology, genetics, and molecular evolution, plus a grasp of next-generation sequencing, sequence alignment, genome assembly, and gene prediction. You will not write most of the software needed for your analysis; more likely, you will have to choose and learn to use the many tools available for the different tasks at hand. But you will end up writing your own scripts, mostly in bash and Python, to stitch those tools together into pipelines.

Hth.

ADD COMMENT • link 2.6 years ago by patrickdm ▴ 240

Thanks Patrick. I've tried to clarify what I'm trying to do. I agree that basic theoretical knowledge will help. That's the point of this exercise. Of the topics you suggest, sequence alignment seems most promising for this particular issue. Do you have any recommendations?

Thanks!

ADD REPLY • link 2.6 years ago by bwillkie • 0

I'd look for a more reasonable task than trying to align two distant viral genomes in Biopython. Although both are RNA viruses, Influenza A is a negative-strand RNA virus (−ssRNA) while SARS-CoV-2 is a positive-strand RNA virus (+ssRNA). To practice with Biopython and pairwise alignments instead, you could, e.g., focus on their RNA-dependent RNA polymerase (RdRp) genes, which are highly conserved among RNA viruses.
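A rough sketch of what such a protein-level pairwise alignment could look like with Bio.Align.PairwiseAligner, which recent Biopython releases recommend over pairwise2. The two sequences are short made-up peptides, not real RdRp sequences, and the scoring parameters (BLOSUM62, gap open -10, gap extend -0.5) are common illustrative choices that I am assuming here, not values recommended for this particular comparison.

```python
from Bio.Align import PairwiseAligner, substitution_matrices


def make_protein_aligner() -> PairwiseAligner:
    """Build a global protein aligner (hypothetical helper, illustrative settings)."""
    aligner = PairwiseAligner()
    aligner.mode = "global"
    aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
    aligner.open_gap_score = -10    # assumed penalty for opening a gap
    aligner.extend_gap_score = -0.5  # assumed penalty for extending a gap
    return aligner


aligner = make_protein_aligner()

# Made-up peptides standing in for two homologous protein sequences.
toy_a = "MKTAYIAKQRQISFVKSHFSRQ"
toy_b = "MKTAYIGKQRQVSFVKAHFSR"

self_score = aligner.score(toy_a, toy_a)    # upper bound: sequence against itself
cross_score = aligner.score(toy_a, toy_b)   # score of the best global alignment

print("self score: ", self_score)
print("cross score:", cross_score)
print("ratio:      ", cross_score / self_score)
```

Comparing the cross score against the self score gives a crude sense of scale; with real sequences, `aligner.align(toy_a, toy_b)` returns the alignments themselves for inspection.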

ADD REPLY • link 2.6 years ago by patrickdm ▴ 240