Question

Gene Expression Prediction from DNA sequences

3.0 years ago

Vittorio • 0

Hi everyone! I am a university student working on my Master's thesis. I have been working with a paper called Xpresso, which aims to predict gene expression levels from DNA sequences using deep learning. Now my lecturers have asked me to create my own dataset of sequences and gene expression values. Work on this problem usually finds the TSS location of each gene and cuts out a region spanning k bp upstream and downstream of the TSS, then associates each such DNA sequence with a target value, a real number called the gene expression level. I tried cutting the reference genome FASTA file using the GTF "gene" annotations, but I realized that this is not enough to cut the right regions, because the performance of my predictive model falls dramatically compared with the performance obtained on Xpresso's dataset. So I would like to ask:

How would you create such a dataset?
I see that some people use BED files and BigWig files. Can you explain why they could be useful?

My specific knowledge of the bio domain is very poor, so any advice is valuable to me. Thanks to all!

xpression gene sequences dna • 972 views

updated 3.0 years ago by jared.andrews07 ★ 18k • written 3.0 years ago by Vittorio • 0

score 0 · Answer 1 · 2021-12-06

3.0 years ago

jared.andrews07 ★ 18k

How would you create such a dataset?

I'd probably follow the docs. It seems a bit annoying, but not all that hard. They provide FASTA files for all human and mouse promoters, so you could also subset one of those lists if you're using one of those organisms.

I can't really tell if you're trying to train your own model or not, but they have a Jupyter notebook showing how to do that as well (though it's a bit of a mess it seems).

Thank you for the answer! I've worked on their Jupyter notebook for a long time, and I have devised methods that outperform the one proposed by Xpresso by far. As a computer scientist, I have the skills to create good predictive models, but I'm not so confident with biological data and tools. The problem is that at this point in my thesis I have to create my own dataset, with my own sequences and labels, in order to validate my models again on another dataset. Moreover, we want a new dataset so that we can manipulate the DNA, for example by removing introns or using only some of the sequences. So far we have downloaded the labels (gene expression levels) from the GTEx portal, and the sequences have been extracted from the reference genome FASTA file using the coordinates of the records annotated as "gene" in the GTF file. Looking around online, I found that people make use of BigWig files, but for now I really don't understand what they are used for. Finally, if I follow your link to the docs, I can see that I need a BED file with the coordinates of where to trim the DNA. But the problem remains: I don't know how to find the optimal coordinates. Do you know a better way to locate the real TSS of the genes? Using the coordinates from the GTF seems to be ineffective for this task. Thank you!
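For reference, the BED step described above can be sketched roughly as follows. This is a minimal illustration, not Xpresso's actual pipeline: the function names (`tss_window`, `bed_line`, `window_sequence`), the window sizes, and the toy chromosome string are all hypothetical. It assumes the TSS is given as a 1-based coordinate (as in a GTF) and converts it to the 0-based, half-open interval that BED uses.

```python
# Complement table for reverse-complementing minus-strand sequences.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def tss_window(tss, strand, upstream, downstream):
    """Return a 0-based, half-open (start, end) window around a 1-based TSS.

    The window holds `upstream` bases before the TSS and `downstream` bases
    starting at the TSS itself, measured in the gene's own orientation.
    """
    tss0 = tss - 1  # 1-based GTF coordinate -> 0-based index
    if strand == "+":
        start, end = tss0 - upstream, tss0 + downstream
    else:
        # On the minus strand, "upstream" means higher genomic coordinates.
        start, end = tss0 - downstream + 1, tss0 + upstream + 1
    return max(start, 0), end  # clamp at the chromosome start


def bed_line(chrom, tss, strand, name, upstream, downstream):
    """Format the window as a six-column BED record (score fixed at 0)."""
    start, end = tss_window(tss, strand, upstream, downstream)
    return f"{chrom}\t{start}\t{end}\t{name}\t0\t{strand}"


def window_sequence(chrom_seq, tss, strand, upstream, downstream):
    """Slice the window out of an in-memory chromosome string.

    Minus-strand windows are reverse-complemented so that every sequence
    reads 5' to 3' relative to its gene.
    """
    start, end = tss_window(tss, strand, upstream, downstream)
    seq = chrom_seq[start:end]
    if strand == "-":
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq


# Toy chromosome (hypothetical data), 16 bp long.
toy_chrom = "AAAACCCCGGGGTTTT"
plus_seq = window_sequence(toy_chrom, 9, "+", upstream=2, downstream=3)
minus_seq = window_sequence(toy_chrom, 8, "-", upstream=2, downstream=3)
plus_bed = bed_line("chr1", 9, "+", "G1", upstream=2, downstream=3)
```

On a real genome, a BED file written this way could be passed to a tool such as `bedtools getfasta` with the `-s` option, so that minus-strand regions are reverse-complemented during extraction rather than in Python.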

3.0 years ago by Vittorio • 0

Using a GTF to snag TSS positions is pretty standard. There are some programmatic ways to do this via biomaRt or AnnotationHub. Pay particular attention to strand information to ensure you're actually grabbing the TSS rather than the end of a gene on the anti-sense strand: for a gene on the minus strand, the TSS is the annotated end coordinate, not the start.
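As a rough illustration of the strand issue, here is a minimal Python sketch that pulls one TSS per "gene" record from GTF lines. The function name `gene_tss_from_gtf` and the toy records are hypothetical, and it assumes a standard nine-column, tab-separated GTF with 1-based inclusive coordinates. Note that a gene record only spans the outermost transcripts, so individual transcript records may have different TSSs.

```python
def gene_tss_from_gtf(gtf_lines):
    """Yield (chrom, tss, strand, gene_id) for every 'gene' record.

    GTF coordinates are 1-based and inclusive, so the TSS returned here
    is 1-based as well.
    """
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        chrom, start, end, strand, attrs = (
            fields[0], int(fields[3]), int(fields[4]), fields[6], fields[8]
        )
        # Plus strand: transcription starts at the lower coordinate.
        # Minus strand: it starts at the higher coordinate.
        tss = start if strand == "+" else end
        gene_id = ""
        for attr in attrs.split(";"):
            attr = attr.strip()
            if attr.startswith("gene_id "):
                gene_id = attr.split(" ", 1)[1].strip('"')
        yield chrom, tss, strand, gene_id


# Toy GTF records (hypothetical data, not taken from a real annotation).
toy_gtf = [
    "#!genome-build toy",
    'chr1\ttoy\tgene\t1000\t2000\t.\t+\t.\tgene_id "G1";',
    'chr1\ttoy\tgene\t5000\t6000\t.\t-\t.\tgene_id "G2";',
    'chr1\ttoy\texon\t1000\t1200\t.\t+\t.\tgene_id "G1";',
]
tss_records = list(gene_tss_from_gtf(toy_gtf))
```

In the toy data, the minus-strand gene G2 gets its TSS from the end column (6000) rather than the start column (5000), which is the mistake the strand check guards against.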

I don't know what you mean by "labels for gene expression levels". If you provide clear examples of what exactly you want, somebody here will probably code-golf it for you. As it stands, it's not clear what your method uses as input.

3.0 years ago by jared.andrews07 ★ 18k