I'm trying to build a multiple sequence alignment (MSA) for a protein so I can find conserved regions and later test any surface-exposed ones for immunogenicity, but the protein in question has over 19 thousand entries in UniProt. I can't find a tool that can align this many sequences and also show me the conserved regions, although I'm very new to bioinformatics, so it's quite possible I'm just missing something. Is there a tool I can use for this? ClustalW on Galaxy gave up after about two days of processing, and it was the only thing I've found so far that even accepted the FASTA file. Alternatively, is there a way to heavily trim down the number of sequences I'm aligning without losing any important information? Thanks in advance.
Assuming I'm looking at the right place ( https://www.ebi.ac.uk/Tools/msa/clustalo/ ), it unfortunately can't accept more than 4000 sequences or files over 4MB. The FASTA file I'm trying to submit exceeds both limits by a fair bit, so it won't run. I'll give CD-HIT a look and see how that works for me, though, thank you!
You will need to download Clustal Omega and run it from the command line.
It took some help from a friend who is much more tech-savvy than I am (I had trouble even finding the button to make a post here), but I got it working and now have an alignment! However, I...don't actually know how to get the conserved regions from it. I tried the MSA Viewer at NCBI, but it choked and died on the file. I also tried a program called Gblocks that seemed promising, but it choked and died even faster. I'm starting to feel like I've bitten off more than I can chew on this.
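One option that does not depend on a viewer is to score each alignment column directly. The following is a minimal sketch, not an established tool: it assumes the alignment was saved as aligned FASTA, scores each column with a Shannon-entropy-based measure (1 means every sequence has the same residue, values near 0 mean highly variable), and reports runs of consecutive high-scoring columns. The function names, the 0.8 threshold, the minimum run length of 8, and the 50% gap cutoff are all illustrative assumptions that would need tuning.

```python
import math
from collections import Counter


def read_aligned_fasta(path):
    """Read an aligned FASTA file into a list of equal-length strings."""
    seqs, chunks = [], []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if chunks:
                    seqs.append("".join(chunks))
                chunks = []
            elif line:
                chunks.append(line.upper())
    if chunks:
        seqs.append("".join(chunks))
    return seqs


def column_conservation(seqs, max_gap_fraction=0.5):
    """Score each column: 1.0 is invariant, values near 0 are variable."""
    scores = []
    n = len(seqs)
    for col in zip(*seqs):
        counts = Counter(c for c in col if c not in "-.")
        total = sum(counts.values())
        if total == 0 or (n - total) / n > max_gap_fraction:
            scores.append(0.0)  # mostly gaps: treat as not conserved
            continue
        entropy = -sum((k / total) * math.log2(k / total) for k in counts.values())
        # Normalise by the maximum entropy for 20 amino acids; clamp at 0
        # in case ambiguity codes push the symbol count above 20.
        scores.append(max(0.0, 1.0 - entropy / math.log2(20)))
    return scores


def conserved_regions(scores, threshold=0.8, min_length=8):
    """Return (start, end) 1-based inclusive column ranges of conserved runs."""
    regions, start = [], None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                regions.append((start + 1, i))
            start = None
    if start is not None and len(scores) - start >= min_length:
        regions.append((start + 1, len(scores)))
    return regions


# Hypothetical usage (the file name is a placeholder):
# scores = column_conservation(read_aligned_fasta("aligned.fasta"))
# print(conserved_regions(scores))
```

The reported ranges are alignment columns, not residue numbers in any particular protein, so they still need to be mapped back onto a reference sequence (by counting its non-gap characters up to each column) before checking surface exposure.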