Hi everyone! I am a university student working on my Master's thesis. I have been working with a paper called Xpresso, which aims to predict gene expression levels from DNA sequences using deep learning. Now my lecturers have asked me to create my own dataset of sequences and gene expression values. Usually, works that tackle this problem find the TSS location of each gene and cut out a region spanning k bp upstream and downstream of it, then associate each such DNA sequence with a target value, a real number called the gene expression level. I tried to cut the DNA from the reference genome FASTA file using the "gene" annotations in the GTF, but I realized that this is not enough to get the right regions, because the performance of my predictive model drops dramatically compared with the performance obtained on Xpresso's dataset. So I would like to ask:
- How would you create such a dataset?
- I see that some people use BED files and BigWig files. Can you explain why they could be useful?
My knowledge of the biology domain is very poor, so any advice is valuable to me. Thanks to all!
Thank you for the answer! I have worked on their Jupyter notebook for a long time and have devised methods that outperform the one proposed by Xpresso by far. As a computer scientist, I have the skills to build good predictive models, but I am not so confident with biological data and tools. The problem is that at this point of my thesis I have to create my own dataset, with my own sequences and labels, in order to validate my models again on another dataset. Moreover, we want a new dataset so that we can manipulate the DNA, for example by removing introns or using only some sequences. So far we have downloaded the gene expression labels from the GTEx portal and extracted the sequences from the reference genome FASTA file using the coordinates of the entries annotated as "gene" in the GTF file. Looking around online, I found that people make use of BigWig files, but for now I really don't understand what they are used for. Finally, following your link to the docs, I can see that I need a BED file with the coordinates at which to cut the DNA. But the problem remains: I don't know how to find the right coordinates. Do you know a better way to locate the real TSS of each gene? Using the coordinates from the GTF seems to be ineffective for this task. Thank you!
Using a GTF to snag TSS positions is pretty standard. There are some programmatic ways to do this via biomaRt explained here or AnnotationHub. Pay particular attention to strand information to ensure you're actually grabbing the TSS rather than the end of a gene on the anti-sense strand.
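To make the strand point concrete, here is a minimal Python sketch of the idea, assuming standard 9-column GTF records with feature type "gene" and a `gene_id` attribute. The names `tss_window` and `reverse_complement` and the `upstream`/`downstream` parameters are made up for illustration; they are not part of biomaRt or AnnotationHub. The window is defined here as `upstream` bases before the TSS, the TSS base itself, and `downstream` bases after it, which is one choice among several.

```python
# Hypothetical helpers for illustration only; assumes standard GTF "gene" lines.

def tss_window(gtf_line, upstream=1000, downstream=1000):
    """Return (chrom, bed_start, bed_end, gene_id, strand) for a GTF 'gene'
    line, or None for comments, malformed lines, and other feature types."""
    fields = gtf_line.rstrip("\n").split("\t")
    if gtf_line.startswith("#") or len(fields) < 9 or fields[2] != "gene":
        return None

    chrom, strand = fields[0], fields[6]
    start, end = int(fields[3]), int(fields[4])  # GTF: 1-based, inclusive

    attrs = fields[8]
    gene_id = attrs.split('gene_id "')[1].split('"')[0] if 'gene_id "' in attrs else "."

    if strand == "+":
        tss = start                  # gene starts at the leftmost coordinate
        win_start, win_end = tss - upstream, tss + downstream
    elif strand == "-":
        tss = end                    # gene starts at the rightmost coordinate
        win_start, win_end = tss - downstream, tss + upstream
    else:
        return None                  # unknown strand: TSS is ambiguous

    # Convert to BED coordinates: 0-based start, exclusive end.
    bed_start = max(0, win_start - 1)
    return chrom, bed_start, win_end, gene_id, strand


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Reverse-complement a sequence cut for a minus-strand gene so that it
    reads 5'->3' in the direction of transcription."""
    return seq.translate(_COMPLEMENT)[::-1]
```

The returned tuple can be written out as BED columns (chrom, start, end, name, a placeholder score, strand) and passed to a tool such as `bedtools getfasta` with `-s`, which reverse-complements minus-strand intervals for you. If you slice the FASTA yourself, call `reverse_complement` on minus-strand windows. Note that the end coordinate is not clipped to the chromosome length in this sketch.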
I don't know what you mean by "labels for gene expression levels". If you provide clear examples of what exactly you want, somebody here will probably code-golf it for you. As it stands, it's not clear what your method uses as input.