Hi Biomed-jeh,
I want to make some additional points that take seriously the advice ATpoint and others have given in the comments above, and I am deliberately presenting this as an answer rather than a comment. First, I do think it is reasonable to say, "Why not just provide some very good resources - as a moderator you should know that a good answer to OP's question would be very valuable on Biostars". I agree with the value-add, but there are two key reasons I have elected not to take this approach:
Point 1. First and foremost, I do not think that providing good resources solves the problem; rather, it exacerbates it. Why? Because the appropriate person to select a dataset to study a specific, nuanced research question must have two kinds of deep knowledge:
- of the research program of the lab and the specific question being asked in the context of the literature
- of the biology of a given tissue, disease state, developmental process, etc. It is not possible in this day and age to "know all biology", and it hasn't been for >100 years, so it is the subject matter expert who should frame the question AND select the studies necessary to study it. Why? Because you cannot expect to be able to discern first-rate from second-rate studies in every conceivable field in an amount of time short enough to meet the demands of a core facility.
Point 2. I want to be clear that I have chosen this answer rather than providing candidate datasets because of the specific phrasing of this post, which I believe is more likely to "bring out" the psychosocial aspects of this collaborative dynamic than other very similar posts of this kind.
"You need to find the single-cell study and data"
But, Biomed-jeh, when I really reflect on this, the answer is still definitely not "ok then in this case you should find it". The answer is that this is a complex social phenomenon and that you've done well to provide enough context for others to know how to offer suggestions to navigate it. There may be power differences at play here, too, and kudos to you for recognizing that. But, even if you are told this, it is still not the right decision to pose as a subject matter expert in a niche area of biology, because you aren't. The role that leverages the expertise of the analyst isn't finding data in someone else's field; it is providing expert-level quantitative analysis by drawing on mathematical structures, AI, etc. that wet lab collaborators may not understand well enough to select and implement themselves.
Of course it is a good and relevant goal to learn as much as you can about the biology as you go, and I think that you are right: it is good to start to develop a list of best resources available. But, as you have said above, you still won't know them for areas of biology that are brand new to you, and there are bad datasets mixed in with good even in high quality resources.
This is why, Biomed-jeh, with those two reasons as background, I suggest the following approach instead:
- Never be rude, and always look at each new collaboration as an awesome and genuine opportunity to learn new biology.
- As soon as you receive a communication like the one quoted above, go to your direct supervisor immediately, and ask THEM what role you are to play. In this scenario, the collaborator is acting as your supervisor without knowing the demands on your time. If they tell you it is your job to find the data, then say "OK I understand and I am on board, but please know in advance I can responsibly carry no more than 10 collaborations per year of this kind" (Side note here: Depending on your relationship with your supervisor, it might not even be going too far to put that statement in writing and ask directly if they are on board with that in the same email. I hope this is not necessary!!).
- If your supervisor indicates that's not appropriate (they should, because of point 1. at the top of the post), then follow up by asking for a meeting with the collaborator, the collaborator's corresponding author, your supervisor, and you. Be kind and eloquent and positive, but don't budge on the fundamental issue: point 1. above. Instead, explain kindly what is involved in finding datasets of that kind. Appeal to their expertise by saying, "I don't know your research project as well as you do, and I don't have the command of this exact literature that you do." Explain that a request for data that is under controlled access, or that the authors of a target paper have elected not to provide, has the best chance of being granted if the corresponding author at your institution contacts the corresponding author of the target study directly. This is true for numerous reasons; they may even know each other.
- Close by re-iterating that you are looking forward to this collaboration, but that these are the steps that need to be taken to make sure it really shines. Stress that your only reason for saying this is the gap between their expertise on the niche area they study and your own.
- Finally, in your down time / in the long term, begin to curate the best datasets in each area based on your collaborations, reading the literature, and by learning about data resources like those others have recommended here.
There is one way that I think an answer to this question could be provided quantitatively. If there were a vast repository - the size of GEO or ArrayExpress or ICBC or DDBJ or GSA themselves - that had been analyzed according to a standardized, best-practice pipeline, annotated with appropriate metadata, and made available in a full stack web app - THEN I think the bioinformatician would be sufficiently empowered to help guide the wet lab biologist on data selection. But short of this, if I think about the best case scenario for all people involved, it should be the other way around.
Not what you want to hear, but I would tell them to find the data themselves, and then provide you with it. Data selection requires some expertise in the underlying biology to decide whether it is suitable. Plus, as you say, it is time-consuming. Hence, people like to sloppily say "please quickly find me..." and then call it a day. Researchers should be into the literature, so the people who request it should have an idea whether papers exist that created these datasets.
Hi ATpoint, thanks, that makes perfect sense that they (research groups, PI) should put their time and energy into the mining as well. However, I have repeatedly been given papers by the PI where the papers do not share raw data or processed objects. Do you have any recommendations on where I should guide the PI to look for data, and what they should look for? Finding interesting papers does not seem to be that difficult for the PI, but finding papers with available data of good quality seems rare.
I know that some areas of biology have been examined more than others, so I (they) will probably have an easier job finding a PBMC dataset than a retina dataset.
Just tell them that the authors have not shared data. The PIs are welcome to contact the authors and request data from them. There's nothing you can do about lack of data.
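As a quick first check on whether a paper shares data at all, you can scan its data availability statement for public accession numbers. The following is only a minimal sketch: the function name `find_accessions` and the pattern table are hypothetical, the patterns assume the common accession formats (GSE for GEO series, E-XXXX-n for ArrayExpress, PRJNA/PRJEB/PRJDB for BioProject, SRP/ERP/DRP for SRA studies, EGAS for EGA studies) and are not exhaustive, and the accessions in the example text are made up.

```python
import re

# Illustrative, non-exhaustive patterns; the formats are assumptions
# based on how these repositories commonly label their records.
ACCESSION_PATTERNS = {
    "GEO series": r"\bGSE\d+\b",
    "ArrayExpress": r"\bE-[A-Z]{4}-\d+\b",
    "BioProject": r"\bPRJ(?:NA|EB|DB)\d+\b",
    "SRA study": r"\b[SED]RP\d+\b",
    "EGA study": r"\bEGAS\d+\b",
}


def find_accessions(text):
    """Return {repository: sorted unique accessions} found in the text."""
    hits = {}
    for repo, pattern in ACCESSION_PATTERNS.items():
        found = sorted(set(re.findall(pattern, text)))
        if found:
            hits[repo] = found
    return hits


# Made-up statement for illustration only.
statement = "Raw data are available at GEO under GSE12345 (BioProject PRJNA678901)."
print(find_accessions(statement))
```

An empty result (for example from "data available upon reasonable request") is a hint that the PI would need to contact the authors. A hit only shows that an accession is mentioned; it says nothing about data quality or whether the dataset suits the biological question.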
At least regarding the first case, the Human Cell Atlas database (https://data.humancellatlas.org/hca-bio-networks/eye/datasets) might help narrow down possible datasets/studies that could be used. The one I linked is specific to eye tissue, so it could be a stepping stone for looking into which ones could be fruitful.
But otherwise, similar to what ATpoint said, a lot of times other researchers think it is easy to find a dataset for their question, but might be unaware of all the possible hoops that one has to go through to ensure that the data is actually correct and of good quality.
Thank you DGTool for sharing the atlas. I will refer the PI to the atlas and ask if they can help identify datasets of interest to decrease the time spent on the project. I can see that the fastq files are available for the multi-omics atlas of the retina, so there is potential that it includes the cell types that the PI is interested in as well. Thanks for the reference!