I am working with TCGA data and want to group the samples by the tumor_tissue_site field. However, some of the groups are very specific, and merging them may be a reasonable way to reduce the number of groups. For example, the following two sets of groups could each be merged: 'Chest - Breast', 'Chest - Chest wall', 'Chest - Lung/pleura', 'Chest - Mediastinum', 'Chest - Other (please specify)'; and 'Head and Neck', 'Head and Neck - Head', 'Head and Neck - Head|Chest - Chest wall', 'Head and Neck - Neck|Head and Neck - Other (please specify)', 'Head and Neck - Other (please specify)'.
I would appreciate it if someone could explain the difference between these sites and whether it would be reasonable to merge them. Specifically, I would like to know whether merging these groups would result in a loss of information, and whether it would be appropriate for my analysis.
Look at the human body plot on the right side of the GDC data portal (https://portal.gdc.cancer.gov/). Is this the grouping you are looking for?
Yes, exactly. However, when grouping the data by the 'tumor_tissue_site' column, I've encountered numerous additional groups that I'm unsure how to merge. Are there any other columns in the TCGA clinical data metadata that could be considered as tumor source sites?
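For the merging itself, here is a minimal sketch of one possible approach, assuming the values follow the pattern seen in the examples above: a top-level region, optionally followed by ' - ' and a sub-site, with multiple sites joined by '|'. The function name `coarse_site` and the 'Multiple regions' and 'Unknown' labels are hypothetical choices for illustration, not part of TCGA or the GDC.

```python
from collections import Counter


def coarse_site(value):
    """Collapse a detailed tumor_tissue_site value to its top-level region.

    Assumes 'Region - Sub-site' entries, with several entries joined by '|'.
    """
    # Keep only the text before ' - ' for each '|'-separated entry.
    regions = {
        part.split(" - ")[0].strip()
        for part in value.split("|")
        if part.strip()
    }
    if not regions:
        return "Unknown"           # hypothetical label for empty values
    if len(regions) == 1:
        return regions.pop()
    return "Multiple regions"      # hypothetical label for mixed-region samples


# Example values taken from the question above.
sites = [
    "Chest - Breast",
    "Chest - Lung/pleura",
    "Head and Neck",
    "Head and Neck - Head|Chest - Chest wall",
    "Head and Neck - Neck|Head and Neck - Other (please specify)",
]

# Count how many samples fall into each merged group.
counts = Counter(coarse_site(s) for s in sites)
print(counts)
```

Merging this way discards the sub-site detail, so it is probably safest to store the result in a new column and keep the original tumor_tissue_site values. Whether samples that span two regions belong in their own group or in one of the two regions depends on the analysis.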