I am new to NGS methods and recently gained access to ICGC. For a given dataset, how do I find the control sample data? I haven't fully explored the portal yet, but I was wondering: if there are tumor samples for a given patient, are there also control samples for that patient?
Welcome to the world of ICGC DCC access. You have to understand what kind of data is deposited for a particular study and what meta-information accompanies it. Check this link for your reference. ICGC is not a place to download data; it is an aggregator. From there you are redirected to the actual cloud-based repositories that hold the data, and you have to make additional payments and activate the service to download it.
The open-access tier helps you list which projects have healthy donors or adjacent healthy tissue. You have to search by study and by where the study was carried out. The publication will outline this. Otherwise, the ICGC helpdesk will, but right now it's mostly vacation time there.
Alternatively, you can check whether the data are also deposited in the EBI EGA database. You will then need access to EGA, which lets you avoid the cloud service ICGC suggests, and you can ask EGA to provide a way to download the data to your location.
For most studies, you will see each one listed with the type of experiment, the data, the file ID, and the DONOR_ID. There are metadata .tsv files you can download to find that information, but the donors are mostly cancer patients and not healthy donors.
To find out about healthy donors, refer to the corresponding study that deposited the data in EBI, or contact ICGC, since only there will you find out whether real healthy controls exist or not.
Fun fact: how do I know all this? I got access recently myself and found these things out while trying to download the data. I still do not have the data for my particular study, as my EGA approval is pending.
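As a rough illustration of checking a metadata .tsv for normal specimens, here is a minimal Python sketch. The column names (`icgc_specimen_id`, `icgc_donor_id`, `specimen_type`) and the specimen type labels are assumptions for the example, and the inline table is made up; check the header of the file you actually downloaded and adjust.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt of a specimen metadata file.
# Column names and type labels are assumptions, not taken from a real download.
SPECIMEN_TSV = (
    "icgc_specimen_id\ticgc_donor_id\tspecimen_type\n"
    "SP1\tDO1\tPrimary tumour - solid tissue\n"
    "SP2\tDO1\tNormal - blood derived\n"
    "SP3\tDO2\tPrimary tumour - solid tissue\n"
)

def count_specimen_types(handle, type_column="specimen_type"):
    """Count how many rows carry each specimen type."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[type_column] for row in reader)

# For a real file, replace io.StringIO(...) with open(path, newline="").
counts = count_specimen_types(io.StringIO(SPECIMEN_TSV))
for specimen_type, n in counts.most_common():
    print(f"{n}\t{specimen_type}")
```

If no type containing "Normal" shows up in the tally, the project most likely has no normal specimens listed in that file.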
While all this information is useful it does not quite address the original question. OP seems to want to know if there are control samples for a patient (non-tumor tissue) and not healthy donors.
I have added a link to the ICGC docs that explain the nomenclature for normal samples (it also covers normal references from the same donor). However, the OP would need to get hold of the study that actually deposited the data to really understand whether there are healthy controls or not. Not all studies in ICGC have healthy controls.
Another thing: the DONOR ID will be the same for the healthy tissue and the cancerous tissue of one patient, and there are also healthy DONORS. For somatic analysis, some studies use either normal peripheral blood from the same patient or healthy adjacent tissue as the reference. This information is available first from the publication that deposited the data; otherwise one has to go through the detailed ICGC documentation, which the OP did not do.
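Because tumour and normal specimens from one patient share a donor ID, matched controls can be found by grouping specimens per donor. A minimal Python sketch follows; the records are invented, and the rule that normal specimens have a type starting with "Normal" is an assumption to verify against the ICGC docs.

```python
from collections import defaultdict

# Hypothetical (donor_id, specimen_id, specimen_type) records for illustration.
SPECIMENS = [
    ("DO1", "SP1", "Primary tumour - solid tissue"),
    ("DO1", "SP2", "Normal - blood derived"),
    ("DO2", "SP3", "Primary tumour - solid tissue"),
    ("DO3", "SP4", "Primary tumour - solid tissue"),
    ("DO3", "SP5", "Normal - tissue adjacent to primary"),
]

def is_normal(specimen_type):
    # Assumption: normal specimens are labelled with a leading "Normal".
    return specimen_type.lower().startswith("normal")

def group_by_donor(specimens):
    """Map each donor ID to its tumour and normal specimen IDs."""
    by_donor = defaultdict(lambda: {"tumour": [], "normal": []})
    for donor_id, specimen_id, specimen_type in specimens:
        kind = "normal" if is_normal(specimen_type) else "tumour"
        by_donor[donor_id][kind].append(specimen_id)
    return dict(by_donor)

pairs = group_by_donor(SPECIMENS)
# Donors that have a tumour specimen but no normal from the same patient.
unmatched = sorted(d for d, s in pairs.items() if s["tumour"] and not s["normal"])
print("Donors with tumour but no matched normal:", unmatched)
```

Donors that end up in `unmatched` are the ones for which a same-patient control would have to come from elsewhere, or not at all.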
So it depends upon the project of interest or the way the samples were taken. Then what is the usual course of action if healthy samples are not included in a study?
Each project is organized differently, so you will have to consult their documentation. The idea of ICGC is to hold worldwide data for different cancers and use them for downstream analysis in Pan-Cancer projects. If you are interested in just one of those cancers, you will have to search for the kind of data you need. You will find the data categorized, with a note on whether healthy data is available. You can also refer to the study the data came from to see whether it included healthy controls. Not everything gets deposited on time. The helpdesk can also point you to the correct data if you cannot find it.
Another thing: if you want to do the analysis from scratch, say on WES data, you will need healthy controls from the same patient, although population-wide somatic calling is also done to reduce false positives. That is where the idea of a Panel of Normals comes in (it can serve as a surrogate for healthy individuals and acts as a reference for somatic calls). It all depends on how deeply you want to interrogate the data. However, differences in sequencing kits, machines, and operators become a limiting factor if you run the somatic analysis on only a few individuals. With a larger cohort this matters less, since you run all samples together.
For anything more specific, you will have to start another thread describing the design you intend and the cancer type; otherwise it will be difficult to answer. Please go through the documentation first, ask their helpdesk if you have a specific query, and then post new, specific questions here. I am sure there are more learned people and experts here who can help better. Good luck!
If the moderators think it necessary, please feel free to strike through the lines from the start of the second paragraph. I wanted to make people here aware that ICGC is just a metadata aggregator and that one cannot download any data without taking the proper steps and studying the documentation well enough.
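To make the Panel of Normals idea concrete, here is a toy Python sketch. It is not how any particular variant caller implements it; the variant tuples, the function names, and the `min_samples` threshold are all made up for illustration. The principle is that a candidate call seen in several unrelated normal samples is more likely an artifact or a germline variant than a true somatic mutation.

```python
from collections import Counter

def build_panel_of_normals(normal_callsets, min_samples=2):
    """Return variants seen in at least `min_samples` normal samples."""
    seen = Counter()
    for calls in normal_callsets:
        seen.update(set(calls))  # count each variant once per sample
    return {variant for variant, n in seen.items() if n >= min_samples}

def filter_with_pon(tumour_calls, pon):
    """Keep only tumour calls that are absent from the panel."""
    return [variant for variant in tumour_calls if variant not in pon]

# Hypothetical calls as (chrom, pos, ref, alt) tuples.
normals = [
    {("chr1", 100, "A", "G"), ("chr2", 500, "C", "T")},
    {("chr1", 100, "A", "G")},
    {("chr1", 100, "A", "G"), ("chr3", 42, "G", "A")},
]
tumour = [
    ("chr1", 100, "A", "G"),
    ("chr7", 900, "T", "C"),
    ("chr2", 500, "C", "T"),
]

pon = build_panel_of_normals(normals, min_samples=2)
kept = filter_with_pon(tumour, pon)
print(kept)
```

Here the chr1 call is dropped because it recurs in all three normals, while the chr2 call survives because it appears in only one. A panel built from normals sequenced with the same kit and machine as the tumours is what makes this kind of filter useful.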
Thanks for making it clear! Really appreciated.
Did it myself. Thanks