Dear all,
I'm dealing with a large set of protein sequences from the same transcription factor family, drawn from distantly related organism families. I'd like to know which good practices you commonly follow in this kind of pipeline, as I've seen that papers are often vague when they describe it in their methods.
My analysis so far:
- HMM search to identify putative family members in my target species.
- Filtering each gene down to its longest variant.
- A search against the Conserved Domain Database (CDD), which I use to keep only sequences with complete domains and a significant hit for specific domain types. (By the way, have you used it? With concise or full output?)
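To make the longest-variant step concrete, here is a minimal Python sketch of that kind of filter. It is only an illustration under stated assumptions: it assumes sequence IDs have the form `GENE.VARIANT` (for example `geneA.2`), which will not hold for every annotation, and the function names `parse_fasta`, `gene_of` and `longest_variant_per_gene` are hypothetical helpers, not part of any library.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_id, sequence) pairs from FASTA-formatted lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Keep only the first whitespace-delimited token of the header.
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def gene_of(seq_id: str) -> str:
    """ASSUMPTION: IDs look like 'GENE.VARIANT'; drop the last '.suffix'."""
    return seq_id.rsplit(".", 1)[0]


def longest_variant_per_gene(
    records: Iterable[Tuple[str, str]]
) -> Dict[str, Tuple[str, str]]:
    """Map each gene to its longest (sequence_id, sequence) variant."""
    best: Dict[str, Tuple[str, str]] = {}
    for seq_id, seq in records:
        gene = gene_of(seq_id)
        # A strictly longer variant replaces the current one;
        # on ties the first variant seen is kept.
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (seq_id, seq)
    return best


# Toy input, invented purely for illustration.
example = """>geneA.1
MKT
>geneA.2 longer isoform
MKTAYIAK
>geneB.1
MSS
"""
kept = longest_variant_per_gene(parse_fasta(example.splitlines()))
```

In this toy input, `kept` holds one entry per gene, with `geneA.2` retained for `geneA`. If your annotation links variants to genes through a GFF file or a lookup table instead, `gene_of` is the only part that would need to change.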
Now I have a few questions:
- Alignment
Which algorithm and software do you recommend for the alignment?
- Post-alignment processing
After the alignment, do you trim it to focus only on the TF binding domain to make the tree construction easier?
- Phylogeny
Which program do you recommend for tree construction with >500 sequences?
Which algorithms do you recommend for the tree construction?
And how many bootstrap replicates?
Do you suggest collapsing branches based on bootstrap support, e.g. with a cutoff of 70?
- Tree annotation and publication-ready figures
Which program do you use for annotating the tree?
If you can answer one or a few of these questions, that would already help a lot.
Thanks in advance