Question

Forum: How to do data mining in bioinformatics

6

6 weeks ago

Biomed-jeh ▴ 70

Hi Biostars,

I work in a core facility, and I am often asked to help research groups find freely available data for exploratory analyses before they invest in their own sequencing data.

I feel like I'm spending too much time searching without a clear strategy, and I’d really appreciate any advice that could help me become more efficient, creative, and confident in data mining. Additionally, I think data mining is an area of interest for many researchers, but because the results aren’t always immediately accessible, they often don’t invest time into this part of the research process.

Here are two cases that I encountered in the past week as examples:

Case 1:

"Please help us find single-cell data for the retina, including the choroid region. The disease group must be diabetic."

My current approach:

So far, I haven’t found anything relevant. I’ve been using the Gene Expression Omnibus (GEO) https://www.ncbi.nlm.nih.gov/gds/ with various keywords like "retina," "diabetes," "choroid+retina," but I haven’t found studies with data that meet the necessary quality standards.

In addition to this, I’ve spent over 40 hours reading articles, verifying data quality, and checking if there are enough cells in the datasets to make the study meaningful. Is there a faster way to identify high-quality studies and datasets, so I don’t have to invest so many hours into each article that seems only vaguely relevant based on the title or abstract?
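One way to make the GEO keyword search above more systematic is to script it against NCBI's E-utilities instead of the web form. The following is a minimal sketch, not a tested pipeline: it only builds the query string and the `esearch` URL for the GEO DataSets database (`db=gds`) and makes no network call. The helper names, the keyword groups, and the choice to restrict to human series records are illustrative assumptions.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_term(keywords, organism="Homo sapiens", entry_type="gse"):
    """Combine free-text keyword groups with GEO DataSets field tags.

    Hypothetical helper: each keyword group is wrapped in parentheses and
    all groups are joined with AND.
    """
    parts = [f"({kw})" for kw in keywords]
    parts.append(f'"{organism}"[Organism]')     # restrict by organism
    parts.append(f"{entry_type}[Entry Type]")   # gse = series-level records
    return " AND ".join(parts)


def build_esearch_url(term, retmax=100):
    """Return an esearch URL for GEO DataSets (db=gds), JSON output."""
    params = {"db": "gds", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Keyword groups are assumptions based on Case 1; adjust to the real request.
term = build_geo_term(
    ["retina OR choroid", "diabetic OR diabetes", "single cell OR scRNA-seq"]
)
url = build_esearch_url(term)
print(term)
print(url)
```

Fetching `url` returns a list of record IDs that can then be passed to `esummary` to pull titles and sample counts in bulk, which may help triage candidates before reading any paper in full.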

Case 2:

"Please help us perform deconvolution on our RNA-bulk sequencing data using a single-cell study of PBMCs (preferably leukocytes) to create the signature matrix. You need to find the single-cell study and data."

My current approach:

I know that Seurat uses a small PBMC training set in their tutorials, and based on that, I found these datasets: https://www.10xgenomics.com/datasets?query=&page=1&configure%5BhitsPerPage%5D=50&configure%5BmaxValuesPerFacet%5D=1000, but I’m not entirely sure which dataset to choose from the archive.

As a bioinformatician in a core facility, I often encounter biological areas that are new to me, and I frequently collaborate with research groups that are entirely focused on wet lab work, with little to no experience in computational analysis or data handling. This often means that the entire responsibility for data mining and analysis falls on me.

What I’m looking for is any guidance or discussion on best practices for data mining. Are there resources or strategies that could help me streamline this process, especially when I’m working with groups that have limited computational experience?

data-mining • 1.0k views

ADD COMMENT • link updated 6 weeks ago by LauferVA 4.5k • written 6 weeks ago by Biomed-jeh ▴ 70

6

Not what you want to hear, but I would tell them to find the data themselves and then provide it to you. Data selection requires some expertise in the underlying biology to decide whether a dataset is suitable. Plus, as you say, it is time-consuming. Hence, people like to sloppily say "please quickly find me..." and then call it a day. Researchers should know the literature, so the people who make the request should have an idea whether papers exist that created these datasets.

ADD REPLY • link 6 weeks ago by ATpoint 85k

0

Hi ATpoint, thanks. It makes perfect sense that they (the research groups and PIs) should put their time and energy into the mining as well. However, I have repeatedly been given papers by the PI where the papers do not share raw data or processed objects. Do you have any recommendations on where I should guide the PI to look for data, and what they should look for? Finding interesting papers does not seem to be that difficult for the PI, but papers with available, good-quality data seem rare.

I know that some areas of biology have been examined more than others, so I (or they) will probably have an easier job finding the PBMC dataset than the retina dataset.

ADD REPLY • link 6 weeks ago by Biomed-jeh ▴ 70

2

"I have repeatedly been given papers by the PI where the papers do not share raw data or processed objects."

Just tell them that the authors have not shared data. The PIs are welcome to contact the authors and request data from them. There's nothing you can do about lack of data.

ADD REPLY • link 6 weeks ago by Ram 44k

1

At least regarding the first case, the Human Cell Atlas database (https://data.humancellatlas.org/hca-bio-networks/eye/datasets) might help narrow down possible datasets/studies that could be used. The page I linked is specific to eye tissue, so it could be a stepping stone for working out which ones could be fruitful.

But otherwise similar to what ATpoint said, a lot of times other researchers think it is easy to find a dataset for their question, but might be unaware of all the possible hoops that one has to go through to ensure that the data is actually correct, and of good quality.

ADD REPLY • link 6 weeks ago by DGTool ▴ 290

1

Thank you, DGTool, for sharing the atlas. I will point the PI to the atlas and ask if they can help identify datasets of interest to decrease the time spent on the project. I can see that the fastq files are available for the multi-omics atlas of the retina, so it may well include the cell types the PI is interested in. Thanks for the reference!

ADD REPLY • link 6 weeks ago by Biomed-jeh ▴ 70

score 5 · Answer 1 · 2024-10-08

Hi Biomed-jeh ,

I want to make some additional points that take seriously the advice ATpoint and others have given in the comments above, and I am deliberately presenting this as an answer rather than a comment. First, I do think it is reasonable to say, "Why not just provide some very good resources? As a moderator you should know that a good answer to OP's question would be very valuable on Biostars". I agree with the value-add, but there are two key reasons I have elected not to take this approach:

Point 1. First and foremost, I do not think that providing good resources solves the problem; rather, it exacerbates it. Why? Because the appropriate person to select a dataset for a specific, nuanced research question must have two kinds of deep knowledge:

of the research program of the lab and the specific question being asked in the context of the literature
of the biology of a given tissue, disease state, developmental process, etc.

It is not possible in this day and age to "know all biology" (it hasn't been for more than 100 years), so it is the subject matter expert who should frame the question AND select the studies needed to address it. Why? Because you cannot expect to discern first-rate from second-rate studies in every conceivable field in an amount of time short enough to meet the demands of a core facility.

Point 2. I want to be clear that I have chosen this answer rather than providing candidate datasets because of the specific phrasing of this post, which I believe is more likely to "bring out" the psychosocial aspects of this collaborative dynamic than other, very similar posts of this kind.

"You need to find the single-cell study and data"

But, Biomed-jeh, when I really reflect on this, the answer is still definitely not "ok then in this case you should find it". The answer is that this is a complex social phenomenon, and you've done well to provide enough context for others to offer suggestions on how to navigate it. There may be power differences at play here, too, and kudos to you for recognizing that. But even if you are told this, it is still not the right decision to pose as a subject matter expert in a niche area of biology, because you aren't one. The role that leverages the expertise of the analyst isn't finding data in someone else's field; it is providing expert-level quantitative analysis by drawing on mathematical structures, AI, etc. that wet lab collaborators may not understand well enough to select and implement themselves.

Of course it is a good and relevant goal to learn as much as you can about the biology as you go, and I think that you are right: it is good to start to develop a list of best resources available. But, as you have said above, you still won't know them for areas of biology that are brand new to you, and there are bad datasets mixed in with good even in high quality resources.

This is why, Biomed-jeh , with those two reasons as background, I suggest the following approach instead:

Never be rude, and always look at each new collaboration as an awesome and genuine opportunity to learn new biology.
As soon as you receive a communication like the one quoted above, go to your direct supervisor immediately, and ask THEM what role you are to play. In this scenario, the collaborator is acting as your supervisor without knowing the demands on your time. If they tell you it is your job to find the data, then say "OK I understand and I am on board, but please know in advance I can responsibly carry no more than 10 collaborations per year of this kind" (Side note here: Depending on your relationship with your supervisor, it might not even be going too far to put that statement in writing and ask directly if they are on board with that in the same email. I hope this is not necessary!!).
If your supervisor indicates that's not appropriate (they should, because of point 1. at the top of the post), then follow up by asking for a meeting with the collaborator, the corresponding author of the collaborator, your supervisor, and you. Be kind and eloquent and positive, but don't budge on the fundamental issue: point 1. above. Instead, explain kindly what is involved in finding datasets of that kind. Appeal to their expertise by saying, "I don't know your research project as well as you do, and I don't have the command of this exact literature that you do." Explain that a request to the corresponding author of a target paper that has not shared its data, or whose data are under controlled access, has the best chance of being granted if the corresponding author at your institution contacts the corresponding author of the target study directly. This is true for numerous reasons; they may even know each other.
Close by re-iterating that you are looking forward to this collaboration, but that these are the steps that need to be taken to make sure it really shines. Stress that your only reason for saying this is the gap between their expertise on the niche area they study and your own.
Finally, in your down time and over the long term, begin to curate the best datasets in each area based on your collaborations, your reading of the literature, and what you learn about data resources like those others have recommended here.
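The curation step in the last item does not need heavy tooling. As a rough sketch, a small structured catalog like the one below would already let you answer "what do we know about tissue X?" quickly. The record fields and the `find` helper are hypothetical choices, and the two entries are simply resources mentioned in this thread, recorded with what the thread says about them and nothing more.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetRecord:
    accession: str                              # repository accession or URL
    tissue: str
    assay: str
    raw_data_available: Optional[bool] = None   # None means "not checked yet"
    notes: str = ""
    tags: List[str] = field(default_factory=list)


def find(records, tissue=None, needs_raw=False):
    """Return records matching a tissue and, optionally, confirmed raw data."""
    hits = []
    for rec in records:
        if tissue is not None and rec.tissue.lower() != tissue.lower():
            continue
        if needs_raw and rec.raw_data_available is not True:
            continue
        hits.append(rec)
    return hits


# Entries taken from this thread; anything not stated there is left unset.
catalog = [
    DatasetRecord(
        accession="GSE246624",
        tissue="PBMC",
        assay="scRNA-seq",
        notes="Used as a single-cell reference for Case 2; quality not reviewed here.",
        tags=["deconvolution"],
    ),
    DatasetRecord(
        accession="https://data.humancellatlas.org/hca-bio-networks/eye/datasets",
        tissue="eye",
        assay="mixed",
        notes="HCA eye network listing; a starting point, not a single dataset.",
        tags=["atlas"],
    ),
]

for rec in find(catalog, tissue="PBMC"):
    print(rec.accession, "-", rec.notes)
```

Keeping "not checked yet" distinct from "no raw data" matters here, because the time-consuming part is exactly the verification, and the catalog should show which entries have actually been through it.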

There is one way that I think an answer to this question could be provided quantitatively. If there were a vast repository, the size of GEO or ArrayExpress or ICBC or DDBJ or GSA themselves, that had been analyzed according to a standardized, best-practice pipeline, annotated with appropriate metadata, and made available in a full stack web app, THEN I think the bioinformatician would be sufficiently empowered to help guide the wet lab biologist on data selection. But short of this, if I think about the best case scenario for all people involved, it should be the other way around.

score 1 · Answer 2 · 2024-10-08

1

6 weeks ago

jared.andrews07 ★ 18k

For the second case, there are a number of relevant datasets in celldex that may be useful for deconvolution purposes. It's mostly bulk data, but they are sorted populations and thus quite suitable for using as a reference for deconvolution (arguably more so than scRNA, imo). The Monaco set in particular is fairly exhaustive and contains most cell types you'd expect in a typical PBMC prep.

Alternatively, you can check the scRNAseq package for relevant datasets.
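To make the idea of reference-based deconvolution concrete, here is a toy sketch of the core computation: given a signature matrix (genes by cell types) and one bulk profile, estimate non-negative mixing weights and rescale them to fractions. It uses plain non-negative least squares solved with multiplicative updates, chosen only because it fits in a few lines of dependency-free Python. Dedicated deconvolution tools use more robust models, and the signature values and true fractions below are made up for illustration.

```python
def deconvolve(signature, bulk, n_iter=500, eps=1e-12):
    """Estimate cell-type fractions for one bulk sample.

    signature: genes x cell types, as a list of rows (non-negative values).
    bulk:      expression values for the same genes, in the same order.
    """
    n_genes, n_types = len(signature), len(signature[0])
    # Precompute S^T b and S^T S once.
    stb = [sum(signature[g][k] * bulk[g] for g in range(n_genes))
           for k in range(n_types)]
    sts = [[sum(signature[g][i] * signature[g][j] for g in range(n_genes))
            for j in range(n_types)] for i in range(n_types)]
    x = [1.0] * n_types
    for _ in range(n_iter):
        # Multiplicative update: weights start positive and never go negative.
        denom = [sum(sts[k][j] * x[j] for j in range(n_types))
                 for k in range(n_types)]
        x = [x[k] * stb[k] / (denom[k] + eps) for k in range(n_types)]
    total = sum(x)
    return [w / total for w in x]   # rescale weights to fractions


# Toy signature: 4 genes x 3 hypothetical cell types (illustrative values).
signature = [
    [10.0, 1.0, 0.0],
    [0.0, 8.0, 1.0],
    [1.0, 0.0, 9.0],
    [5.0, 5.0, 5.0],
]
true_fractions = [0.6, 0.3, 0.1]
# Simulate a noise-free bulk sample as a mixture of the signature columns.
bulk = [sum(s * f for s, f in zip(row, true_fractions)) for row in signature]

estimated = deconvolve(signature, bulk)
print([round(f, 3) for f in estimated])
```

With real data, the signature columns would come from averaged reference profiles (for example sorted populations such as the Monaco set, or per-cluster means from a single-cell study), restricted to marker genes shared with the bulk data.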

ADD COMMENT • link 6 weeks ago by jared.andrews07 ★ 18k

0

Thank you jared.andrews07

It does make sense to use sorted cell populations instead of a single-cell set to avoid infiltration of pseudo-cell types that could cause biased deconvolution. I actually went ahead yesterday with this single-cell dataset: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE246624, but now I'm quite interested in seeing the difference between using the Monaco set versus GSE246624. Thanks for sharing!

ADD REPLY • link 6 weeks ago by Biomed-jeh ▴ 70