Hi there, I am designing a new bioinformatics research project. I am not sure how significant it is, so I am posting it here in the hope of learning from your experience.
Title: Standardised classification of indel variants
Objective: Single nucleotide variants can be described by combining the substitution with its 5' and 3' sequence context, and this approach has become routine and widely used. To the best of my understanding, there is no comparable standardised way to describe indels. Most of the time, indels are categorised by length. Sometimes they are also classified by the region of the gene in which they occur. An efficient, general approach to describing indels appears to be lacking, and this study aims to fill that gap.
I look forward to hearing your thoughts on the following:
- How biologically meaningful is this project?
- Are there similar existing studies?
- If you were me, what would you do?
Thanks for your attention.
How would this differ from the HGVS notation?
Hi WouterDeCoster, thanks. As I understand it, HGVS provides guidelines for writing and recording a variant (e.g. a deletion of ACGT at chr8:1234-1237). What I am trying to do is define a standard for classifying indels, which should help people compute statistics on indels in the genome (that is what I think it can do at the moment).
By what criteria would you like to classify them? Can you give an example? It's not really clear to me what you are aiming for.
For single base substitutions, the six classes C > A, C > G, C > T, T > A, T > C and T > G can be used to describe the mutational pattern of substitutions in the genome. Indels, in contrast, lack such a universal classification system, so it is not yet possible to describe indel patterns in the same way. To be honest, I don't know what would be suitable for indels. I tried using sequence patterns, for example repeat expansions (ACGACGACG) and palindromes (CCTCC), but only a small proportion of indels have such sequences, leaving most uncategorised. I am currently thinking of using Sequence Ontology terms, and I am still looking for other approaches.
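To make the idea a bit more concrete, here is a minimal Python sketch of the kind of classification I have in mind: each indel gets a label built from its type, a length bin, and whether the inserted or deleted sequence continues as a repeat in the 3' flank. Everything here is an assumption for illustration only: the function names, the length bins (1bp, 2-5bp, >5bp), the category names, and the input format (indel sequence plus 3' flanking sequence) are hypothetical, not an established standard.

```python
from collections import Counter

def repeat_unit(seq):
    """Return the shortest unit that, repeated, spells seq (e.g. ACGACG -> ACG)."""
    n = len(seq)
    for size in range(1, n + 1):
        if n % size == 0 and seq[:size] * (n // size) == seq:
            return seq[:size]
    return seq

def count_flanking_repeats(unit, flank):
    """Count consecutive copies of unit at the start of the 3' flank."""
    count = 0
    while unit and flank.startswith(unit, count * len(unit)):
        count += 1
    return count

def classify_indel(kind, seq, flank3):
    """Label an indel by type, length bin and repeat context.

    kind   -- "ins" or "del"
    seq    -- inserted or deleted bases
    flank3 -- reference sequence immediately 3' of the indel
    All bins and labels are illustrative assumptions.
    """
    seq, flank3 = seq.upper(), flank3.upper()
    unit = repeat_unit(seq)
    copies = count_flanking_repeats(unit, flank3)

    if len(seq) == 1:
        length_bin = "1bp"
    elif len(seq) <= 5:
        length_bin = "2-5bp"
    else:
        length_bin = ">5bp"

    if copies == 0:
        context = "non-repeat"
    elif len(unit) == 1:
        context = "homopolymer"
    else:
        context = "tandem-repeat"
    return (kind, length_bin, context)

# Made-up example indels: (kind, sequence, 3' flank)
indels = [
    ("del", "A", "AAAAC"),
    ("ins", "ACG", "ACGACGTT"),
    ("del", "ACGT", "GGCA"),
]
spectrum = Counter(classify_indel(*indel) for indel in indels)
```

The `spectrum` counter then plays the same role for indels that the six substitution classes play for single base changes: a fixed set of labels whose counts can be compared between samples. Indels that fall into the "non-repeat" bucket would still need a further rule, which is the part I have not worked out.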