I'm trying to learn bioinformatics. As an exercise, I'd like to use Biopython to compare the RefSeq for SARS-CoV-2 to that of Influenza A (H1N1) and get a score for how similar or dissimilar the two viruses are. So something like:
from Bio import pairwise2
alignments = pairwise2.align.globalxx(sars_cov_2, influenza_a)
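For orientation: `globalxx` scores every identical aligned position as 1 and charges nothing for mismatches or gaps, so the best score it can report is the length of the longest common subsequence of the two inputs. The dependency-free sketch below computes that number for short strings. The function name `globalxx_score` is made up for illustration and is not part of Biopython, and the quadratic loop is meant for toy inputs rather than whole genomes.

```python
def globalxx_score(a: str, b: str) -> int:
    """Return the longest-common-subsequence length of a and b.

    With match = 1 and no mismatch or gap penalties, this is the
    best score a global alignment of a and b can reach.
    """
    # prev[j] holds the best score for the processed prefix of a vs b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best alignment of the two shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Skip a character in one sequence or the other (free gap).
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(globalxx_score("ACCGT", "ACG"))  # 3
```

Because gaps are free, two unrelated sequences of the same alphabet still get a fairly high score, which is worth keeping in mind when interpreting the result.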
If I go to the NCBI Virus database, I can find the RefSeq for Sars-Cov-2 (GCF_009858895.2) https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Genome&VirusLineage_ss=Severe%20acute%20respiratory%20syndrome%20coronavirus%202%20(SARS-CoV-2),%20taxid:2697049
If I click on GCF_009858895.2 for details, a drawer slides open from the right of the screen and documents one Nucleotide Accession Segment: NC_045512.2
If I download the file and view the contents, I see one long segment.
I can also find multiple RefSeqs for Influenza A. I pick one, GCF_001343785.1 https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Genome&VirusLineage_ss=Influenza%20A%20virus,%20taxid:711320
If I click on GCF_001343785.1 for details, a drawer slides open from the right of the screen and documents eight Nucleotide Accession Segments: NC_026438.1 NC_026435.1 NC_026437.1 NC_026433.1 NC_026436.1 NC_026434.1 NC_026431.1 NC_026432.1
If I download the file and view the contents, I see eight short segments (not in numerical order, segment 4 followed by 7, etc).
The data structures for the two viruses are very different. I can pass the contents of the SARS-CoV-2 file (GCF_009858895.2) to pairwise2.align.globalxx with no problem. The Influenza A file (GCF_001343785.1) contains eight sequences, so there is no single sequence I can pass into the function.
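The difference described above comes down to how many `>` header lines the FASTA file contains: one for the SARS-CoV-2 download, eight for the Influenza A download. As a minimal sketch, the hand-written reader below (the name `read_fasta` and the toy data are invented for illustration) splits FASTA text into one (header, sequence) pair per record. In Biopython, `SeqIO.parse(path, "fasta")` plays the same role and yields one record per header, whereas `SeqIO.read` expects a file with exactly one record.

```python
from typing import Iterator, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) for each record in FASTA-formatted text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are wrapped; join them back together.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy two-record file with placeholder sequences (not real viral data).
toy = """>seg_b toy segment 4
ACGT
ACG
>seg_a toy segment 7
TTGA
"""

records = list(read_fasta(toy))
print(len(records))  # 2
```

Each record can then be handled on its own, for example by aligning one segment at a time instead of the whole file at once.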
I've read the Wikipedia page on fasta file format and the documentation for the Fasta software, and various posts on this forum. I still don't understand how I can compare these files.
This leaves me with many questions, such as:
If I read the NCBI documentation correctly, both RefSeqs are "complete". What does "complete" mean when the data is broken up into eight segments?
Can I expect that the noncoding RNA is included?
What transformations can I apply to Influenza A? Can I simply append the eight Influenza A segments together? If so, in what order: by segment number, in the order in which they appear in the file, or in some other order?
Is there documentation somewhere that explains why SARS-CoV-2 is stored as one segment and Influenza A is broken up into eight?
Is a globalxx comparison between these two files possible and, if so, how?
Thanks! -Brian
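On the ordering question: the Influenza A genome consists of eight separate RNA molecules, which is why the file holds eight records, while the SARS-CoV-2 genome is a single RNA molecule. If the segments are concatenated anyway for the exercise, sorting by segment number at least makes the order reproducible. The sketch below assumes each FASTA header contains the text "segment N", which should be checked against the downloaded file. The function names and toy records are hypothetical, and the code says nothing about whether the concatenation is biologically meaningful.

```python
import re
from typing import Dict, Iterable, Tuple


def segments_by_number(records: Iterable[Tuple[str, str]]) -> Dict[int, str]:
    """Map segment number to sequence, reading the number from each header.

    Assumes every header contains 'segment N'; raises ValueError otherwise.
    """
    segments = {}
    for header, seq in records:
        match = re.search(r"segment (\d+)", header)
        if match is None:
            raise ValueError(f"no segment number in header: {header!r}")
        segments[int(match.group(1))] = seq
    return segments


def concatenate_in_segment_order(records: Iterable[Tuple[str, str]]) -> str:
    """Join the sequences in ascending segment-number order."""
    segments = segments_by_number(records)
    return "".join(segments[n] for n in sorted(segments))


# Toy records in file order 4, 7, 1 with placeholder sequences.
toy_records = [
    ("toy_x segment 4 example", "GGG"),
    ("toy_y segment 7 example", "TT"),
    ("toy_z segment 1 example", "AC"),
]

print(concatenate_in_segment_order(toy_records))  # ACGGGTT
```

Keeping the dictionary from `segments_by_number` around also leaves the option of comparing segments one at a time rather than as one joined string.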
Once you have the sequences, it's a bit like comparing Moby Dick with A Tale of Two Cities: there are different levels of comparison and different questions you can ask. Focus first on the biological question, and then you can find the tools to help you answer it.
Thanks Asaf. I've tried to clarify what I'm trying to do. For me, it's really more of an exercise to learn how to use the technology and learn more about the domain than to answer a specific question.
I think a more instructive exercise would be to compare, e.g., SARS-CoV-2 to other coronaviruses. Influenza and SARS-CoV-2 are so phylogenetically distant that the alignments are going to be pretty rubbish.
You will shoot yourself in the foot, because anything you do downstream with bad alignments (e.g. building a tree or looking at protein differences) will be affected by the bad data going in.