Feature extraction from DNA sequence
6.1 years ago
bioinfo456 ▴ 150

In the context of developing a classification model to determine whether a given variant affects gene expression for a certain disease, I've obtained 1 kb of sequence upstream and downstream of each variant location. What features could I extract from these sequences for this specific task? Also, are biological features more relevant than statistical ones for this purpose? Any help would be much appreciated.

snp machine learning deep learning • 4.5k views

You can try creating a VCF file from your data set and predicting the variant effects to see whether the mutations are deleterious or tolerated. See A: Allele frequency visualization


I'm sorry, but I don't think you read my description correctly. I'm trying to be disease-specific in my context. However, I did use Ensembl VEP to obtain the positions of the rs IDs of interest in the hg38 assembly. Thanks for your thoughts.

6.1 years ago

Do you want to check the association between variants and the expression profile of the adjacent gene, or are you going for expression prediction based on genotype?

https://www.um.edu.mt/__data/assets/pdf_file/0005/289427/eQTL_intro.pdf


Expression prediction based on genotype. Kindly share your thoughts. Thanks.


The list of things at which to look is endless:

  • Transcription start sites (TSS)
  • Transcription factor binding sites (TFBS)
  • Promoter regions (e.g., via H3K4me3)
  • Enhancer regions (via H3K27ac)
  • Other histone marks (many types to look at, e.g., H3K9ac, H3K27ac, etc.)
  • Conservation
  • DNase hypersensitivity
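
As a rough sketch of how region-based annotations like these could become model features, the snippet below flags whether a variant position falls inside any interval of each track. The track names, the nested-dict layout, and the coordinates are made up for illustration; real tracks would come from BED-like files (0-based, half-open intervals), and a sorted index or a dedicated interval tool would be preferable for large data.

```python
def overlaps_any(pos, intervals):
    """Return 1 if 0-based `pos` lies in any half-open (start, end) interval, else 0."""
    return int(any(start <= pos < end for start, end in intervals))


def annotate_variant(chrom, pos, tracks):
    """Build a {track_name: 0/1} feature dict for one variant.

    `tracks` is a hypothetical layout: {track_name: {chrom: [(start, end), ...]}}.
    """
    return {
        name: overlaps_any(pos, by_chrom.get(chrom, []))
        for name, by_chrom in tracks.items()
    }


# Toy intervals, invented for this example only.
tracks = {
    "H3K27ac": {"chr1": [(1000, 1500), (9000, 9400)]},
    "DNase": {"chr1": [(1200, 1300)]},
}

features = annotate_variant("chr1", 1250, tracks)
print(features)
```

The linear scan keeps the idea visible; it is not meant to be efficient.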

Thank you. What platform would you suggest I use to compute such features?


Mostly shell scripting / Bash, so Linux or macOS is preferable. Take a look here, where some of this data is available: http://genome.ucsc.edu/encode/downloads.html

One can also annotate some of these with ANNOVAR: http://annovar.openbioinformatics.org/en/latest/


Alright, but my context is as follows: given an rs ID, the model should analyse certain features for both the wild type and the mutant type, and accordingly predict whether or not the variant is going to affect gene expression for a certain disease. I have the case and control samples from the GWAS, and also 1 kb of sequence upstream and downstream of the rs IDs obtained from the GWAS. Would you still suggest that I compute the same features? Your advice is much appreciated. Thanks.
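
To make the wild-type versus mutant comparison concrete, here is a minimal sketch, assuming a simple SNP and that the flanking sequences are already in hand. It builds the two windows and reports how a sequence-derived feature changes between them. The helper names and the toy sequences are hypothetical, and GC fraction is used only as a stand-in for whatever feature function is of interest.

```python
def ref_alt_windows(upstream, ref, alt, downstream):
    """Return (reference_window, alternate_window) for a SNP and its flanks."""
    return upstream + ref + downstream, upstream + alt + downstream


def feature_delta(feature_fn, ref_seq, alt_seq):
    """Difference in a scalar sequence feature: alternate minus reference."""
    return feature_fn(alt_seq) - feature_fn(ref_seq)


def gc_fraction(seq):
    """Fraction of G/C bases in `seq` (stand-in feature for the example)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


# Toy flanks; in practice these would be the 1 kb flanks of each rs ID.
ref_seq, alt_seq = ref_alt_windows("AC", "G", "T", "CA")
delta = feature_delta(gc_fraction, ref_seq, alt_seq)
print(ref_seq, alt_seq, delta)
```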


With just the SNP genotype and gene expression, you can do an eQTL study or build your own regression models whereby genotype is predicting expression of nearby genes, with the covariates in these models being some of the features that I mentioned above. This may sound easy, but it is not, particularly the set-up of such a study.
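
For orientation only, the simplest form of such a model regresses expression on allele dosage (0, 1, or 2 copies of the alternate allele). The sketch below fits that single-predictor line by ordinary least squares; it deliberately leaves out covariates, normalisation, and multiple-testing correction, all of which a real eQTL analysis needs. The function name and the toy values are invented for the example.

```python
def fit_additive_model(dosages, expression):
    """Least-squares fit of expression = intercept + slope * dosage.

    Returns (slope, intercept). Raises ValueError when every sample has the
    same dosage, because the slope is then undefined.
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("no genotype variation in the samples")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy data: six samples with made-up expression values.
dosages = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
slope, intercept = fit_additive_model(dosages, expression)
print(slope, intercept)
```

The slope is the estimated change in expression per additional alternate allele; adding covariates turns this into a multiple regression.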

If you have the DNA sequence and are interested in that, then take a look at the manuscript mentioned by arup, i.e., "Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk".


Alright, thanks. I do not have the gene expression, just the variants taken from the GWAS.


If you download the library annotation information for the microarray that you are using, then there may already be much metadata in that library file. Just search on the manufacturer's home-page. If it is Affymetrix, then remember that ThermoFisher purchased Affymetrix.


Which specific information would you suggest I look for from the library annotation?


Please see my list, above. Also, I would discourage anybody from becoming too dependent on this website.


In addition, I would encourage everyone doing machine learning on biological data to get familiar with what data they're working on.


Thank you so much for this. I used the above-mentioned features (histone marks, TFBS, DNase) and also tried various dinucleotide features, but none of the dinucleotide features seems to contribute to the classification. How can I compute conservation, and are there any other features that could contribute in this context? Please help.
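
For reference, dinucleotide features of the kind mentioned here are often computed as relative frequencies over overlapping 2-mers. This is a minimal sketch of one such definition, not necessarily the exact one used above; pairs containing ambiguous bases such as N are skipped.

```python
from itertools import product

# All 16 dinucleotides in a fixed order, so feature vectors line up.
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]


def dinucleotide_frequencies(seq):
    """Relative frequency of each overlapping dinucleotide in `seq`."""
    seq = seq.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # ignores pairs with N or other symbols
            counts[pair] += 1
            total += 1
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


freqs = dinucleotide_frequencies("ACGT")
print(freqs["AC"], freqs["CG"], freqs["GT"])
```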


Is this to me?


Yes sir. I'm typing the rest just to fulfill the minimum character criteria.


Conservation scores should be there. Look for phyloP scores, as an example.


Got it. Any other feature that would contribute in this context?


Take a look at the manuscripts for CADD (in silico predictor) - this will give you an idea. Conservation score is the single best predictor of functionality / pathogenicity, though.


Where can I find TSS information? Ensembl?


Search in your search engine of choice?


Both phyloP and phastCons scores contribute effectively to this classification. GC content, which reflects the stability of the sequence around the variant, is also a good predictor, along with CpG score. I looked for a TSS database for hg38 and found DBTSS, but it is not in the format I need; I prefer BED format. Is there any alternative? Also, do you have any other feature selection suggestions? Thank you so much for your help so far.
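
For anyone following along, the two sequence-composition features can be sketched as below. "CpG score" is not defined in this thread, so the observed/expected CpG ratio used here is an assumption about what is meant; the function names are illustrative.

```python
def gc_content(seq):
    """Fraction of G/C among unambiguous (A, C, G, T) bases in `seq`."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G).

    Returns 0.0 when the window has no C or no G, where the ratio is undefined.
    """
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


window = "ACGT"  # toy window; in practice, the sequence around the variant
print(gc_content(window), cpg_obs_exp(window))
```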


Hi Uday, yes, that makes sense (PhyloP, PhastCons, GC content). There are many features at which you could look:

  • TFBS - transcription factor binding sites
  • structural variants / CNV
  • H3K27 acetylation (H3K27ac)
  • H3K27me3
  • et cetera

In the manuscripts of CADD and DANN, you will see many more ideas.

You will not find any single best predictor outside of conservation score, though.


I did go through them. I have a question, though. Is it alright to compute the conservation score over the region (e.g., ±50 bp) surrounding the point of mutation? If positive means highly conserved and negative means otherwise, I'd obviously get negative scores for both of my classes. So I was thinking of computing the conservation score of the sequence surrounding the point of mutation. Do you concur?


Hi Uday. What do you mean by 'classes'? The phyloP scores are measured on a log scale, with positive meaning more highly conserved and negative meaning less conserved (evolving faster than expected), as you appear to have noticed. I believe they already consider the surrounding region when calculating these scores, but cannot confirm.


I have defined two classes. One contains SNPs taken from an eQTL study involving diseased tissues; these particular mutations are believed to affect gene expression in the context of a particular disease, and I've chosen them based on their level of association with the disease in that eQTL study. The neutral class is taken from the GTEx portal, from normal tissues of interest. The preprocessing is taken care of. As far as I understand, the conservation score is computed mostly on the basis of multiple sequence alignment (MSA). I've considered phyloP100way and phastCons100way. I've used the mean conservation score of ±75 bp around the point of mutation, and it seems to work pretty well.
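
In case it helps others, the windowed mean can be sketched as follows, assuming the per-base scores have already been loaded into a position-to-score mapping (for example from a bigWig track). The mapping layout, the function name, and the toy scores are illustrative only; positions without a score are skipped.

```python
def window_mean(scores, pos, flank=75):
    """Mean per-base score over [pos - flank, pos + flank], inclusive.

    `scores` maps 0-based position -> score. Returns None when no position
    in the window has a score.
    """
    values = [scores[p] for p in range(pos - flank, pos + flank + 1) if p in scores]
    return sum(values) / len(values) if values else None


# Toy scores: 1.0 everywhere except a made-up dip at the variant position.
scores = {p: 1.0 for p in range(100, 200)}
scores[150] = -2.0

small_window = window_mean(scores, 150, flank=2)
print(small_window)
```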


Okay, sounds very interesting. Great work!


The following article will give you an idea.

https://www.nature.com/articles/s41588-018-0160-6


Thank you. Will look into it.
