Hello, I have a bunch of DNA sequences (group A) that I'm going to turn into a Stockholm MSA. I want to use it to search another collection of DNA sequences with nhmmer (from HMMER) and pick out anything that might be related to group A.
Problem 1: I was reading through the user guide and it's not really clear what format the 'database' of DNA sequences needs to be in. Can I just run my Stockholm MSA as the query against a FASTA file of the target sequences?
Problem 2: Group A varies widely in length and content. The sequences are binding sites, and I'm not even sure they have directly similar sequences at all. When I try to input them into ClustalO to get a multiple sequence alignment, I get an error:
ERROR: Cannot open input file. No alignment! Exit code: 255
Is there some way I can just generate the best alignment possible without having to worry about some threshold that will cause my input to be rejected outright? Are there alignments so bad that HMMER will flat out reject them?
I'm not sure what your problem is, but making a multiple sequence alignment to then generate an HMM only makes sense for sequences that are somehow related. If you have too many divergent sequences, you'll never get a good model, so the first step is to remove sequences that don't align to the others. Also, it may be worth stepping back and telling us what you're trying to achieve. There might be another way than trying to build an HMM from unrelated sequences.
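To make "sequences that don't align to the others" a bit more concrete, here is a minimal Python sketch that scores each row of an existing alignment by its mean pairwise identity to the other rows and flags the low scorers. It assumes you already have an alignment in memory as a dict of name to gapped sequence. The function names, the toy sequences, and the 0.5 cutoff are all made up for illustration, not taken from HMMER or ClustalO. It only flags rows for inspection, it does not delete anything.

```python
from itertools import combinations


def pairwise_identity(a, b):
    """Fraction of identical residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def mean_identity(aligned):
    """Mean identity of each aligned sequence to every other sequence."""
    totals = {name: 0.0 for name in aligned}
    for (n1, s1), (n2, s2) in combinations(aligned.items(), 2):
        ident = pairwise_identity(s1, s2)
        totals[n1] += ident
        totals[n2] += ident
    others = max(len(aligned) - 1, 1)  # avoid dividing by zero for a single row
    return {name: total / others for name, total in totals.items()}


def flag_outliers(aligned, min_identity=0.5):
    """Names whose mean identity falls below min_identity (an arbitrary cutoff)."""
    scores = mean_identity(aligned)
    return [name for name, score in scores.items() if score < min_identity]


# Toy alignment, invented for this example.
example = {
    "site1": "ACGTACGT",
    "site2": "ACGTACGA",
    "site3": "ACGAACGT",
    "site4": "TTTTTTTT",
}

if __name__ == "__main__":
    for name, score in mean_identity(example).items():
        print(f"{name}\t{score:.2f}")
    print("flagged:", flag_outliers(example))
```

Percent identity is a crude measure for short binding sites, so treat the flagged rows as candidates to look at by eye rather than as an automatic discard list.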
My goal is to find sequences related to group A. I know the group A sequences are related through binding assays, although it might not be apparent in the sequence. I can't just discard sequences until I get a good MSA, because that would lose information and introduce bias.
"Related" is usually meant in a phylogenetic sense. You could always make a multiple sequence alignment of dissimilar sequences, but that may not be useful. If you're trying to find other proteins that have the same binding properties as those in group A, you should try to figure out which protein domain confers the property you're interested in, then search for proteins with the same domain, i.e. narrow down the sequence search to a specific region which has a higher chance of being conserved.
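As a rough illustration of narrowing the search to one region, the Python sketch below trims an alignment to a chosen column range, writes it out as Stockholm text, and builds the two command lines for the search. The same idea applies to alignment columns of DNA binding sites. The helper names, file names, and column range are hypothetical. The commands use `hmmbuild --dna` and `nhmmer --tblout` with a FASTA target file as I understand them from the HMMER documentation, so check them against your installed version before relying on them. Nothing is executed here, the commands are only printed.

```python
def trim_columns(aligned, start, end):
    """Keep alignment columns start..end (0-based, end exclusive) in every row."""
    return {name: seq[start:end] for name, seq in aligned.items()}


def to_stockholm(aligned):
    """Render a dict of name -> gapped sequence as minimal Stockholm text."""
    if not aligned:
        raise ValueError("alignment is empty")
    if len({len(seq) for seq in aligned.values()}) != 1:
        raise ValueError("all rows must have the same aligned length")
    width = max(len(name) for name in aligned)
    lines = ["# STOCKHOLM 1.0"]
    for name, seq in aligned.items():
        if any(ch.isspace() for ch in name):
            raise ValueError(f"sequence name contains whitespace: {name!r}")
        lines.append(f"{name.ljust(width)}  {seq}")
    lines.append("//")  # end-of-alignment marker
    return "\n".join(lines) + "\n"


def hmmer_commands(sto_path, hmm_path, fasta_path, tbl_path):
    """Argument lists for building a DNA profile and searching a FASTA file.

    Assumed usage (verify against your HMMER version):
      hmmbuild --dna <hmm out> <alignment in>
      nhmmer --tblout <table out> <query hmm> <target sequences>
    """
    return [
        ["hmmbuild", "--dna", hmm_path, sto_path],
        ["nhmmer", "--tblout", tbl_path, hmm_path, fasta_path],
    ]


if __name__ == "__main__":
    # Toy alignment and column range, invented for this example.
    full = {"siteA": "ACGT-ACG", "siteB": "ACGTTACG"}
    region = trim_columns(full, 2, 6)
    print(to_stockholm(region), end="")
    for cmd in hmmer_commands("groupA.sto", "groupA.hmm", "targets.fa", "hits.tbl"):
        print(" ".join(cmd))
```

Which columns to keep is the hard part and has to come from what you know about the binding site. The code only handles the bookkeeping once you have picked a region.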