Basically, I have several datasets coming from different sources that I want to merge.
The data contains blood sample values. All of it was collected from a population with a disease, which was given a drug. Based on the datasets, I would like to run some different tests if I can obtain a large enough dataset, and study the clearance rate. The problem with the datasets is that all the clinics used different time intervals to draw blood; apart from that, the methods used to analyze the blood are identical. As the various datasets do not share any identical timestamps, I find it very difficult to merge them. Also, the clearance rate should follow first-order kinetics, but several patients within the datasets have different clearance rates.
Is there a gold-standard method that could be used to correct the datasets? I can't be the only person running into this sort of problem.
I tried to represent it with a drawing: imagine you have datasets A, B, C, D, F, each represented as an individual line, with time stamps depicted as horizontal lines. In my case, I would have to extract two to three values at each time stamp and somehow correlate or correct them across the various datasets.
You don't say what the data is, so anything we can say will be very generic. The simplest approach is to treat time point i in one series as the same as time point i in the other. In your case, you need to decide whether there is a relevant difference between 5 h and 12 h, or 18 h and 24 h, and so on; you can check whether that's the case by looking for differences between matched time points. If you can't or are not willing to match time points, then one option is to record missing data for any time point not represented in a series, and perhaps impute the missing values by interpolation or another appropriate method.
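To make the missing-data-plus-interpolation option concrete, here is a minimal sketch in pandas, assuming each source is just a set of (time in hours, measured value) pairs for one patient; the clinic names, times, and values are invented for illustration.

```python
import pandas as pd

# Two clinics with different sampling schedules (illustrative values only).
clinic_a = pd.Series([12.0, 6.5, 3.1, 2.4], index=[0, 5, 18, 24])  # hours -> value
clinic_b = pd.Series([11.2, 4.8, 2.9], index=[0, 12, 24])

# Reindex both series onto the union of all sampling times; any time a clinic
# did not use becomes NaN, i.e. explicit missing data.
all_times = clinic_a.index.union(clinic_b.index)
merged = pd.DataFrame({"clinic_a": clinic_a.reindex(all_times),
                       "clinic_b": clinic_b.reindex(all_times)})

# Optionally fill the gaps by linear interpolation along the time axis.
imputed = merged.interpolate(method="index")
print(imputed)
```

Whether interpolation is appropriate depends on how fast the values change between samples; with a fast initial phase you may prefer a model-based imputation instead.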
If you're interested in calculating the clearance rate for each patient, you don't need to match the time points at all, or there's something I don't get; each patient's rate can be estimated from their own samples (see the sketch below). What are the different tests you mentioned?
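As a rough illustration of that point, here is a hedged sketch of fitting a first-order elimination rate per patient with a log-linear least-squares fit; it only uses each patient's own (time, concentration) pairs, so the clinics' schedules never need to line up. All times and values are made up.

```python
import numpy as np

def first_order_rate(times_h, concentrations):
    """Fit C(t) = C0 * exp(-k * t) by regressing log(C) on t.

    Returns the elimination rate constant k (per hour).
    """
    slope, _intercept = np.polyfit(np.asarray(times_h, dtype=float),
                                   np.log(np.asarray(concentrations, dtype=float)), 1)
    return -slope

# Each patient can have a completely different sampling schedule.
print(first_order_rate([0, 5, 18, 24], [12.0, 6.5, 1.4, 0.7]))  # clinic A schedule
print(first_order_rate([0, 12, 24], [11.2, 2.7, 0.6]))          # clinic B schedule
```

If the decline is biphasic (fast early phase, then slow), a single first-order fit like this will be a compromise between the two phases, which may matter for your analysis.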
Thanks for taking an interest. The reason I am uncertain about how to deal with it is that the patients in the dataset have very high variance in their clearance rates. Also, the clearance rate does not follow a steady state: in most cases, the observed clearance is substantial in the first couple of hours, followed by a slow rate that in some cases is almost steady-state.
This seems to be one of the obstacles to merging the datasets. As for the tests, the ultimate objective is to determine which patients have inadequate clearance, but this is also defined in different ways depending on the data's country of origin. Right now my focus is on finding a way to merge our data into a single DB.
So you already have the clearance rates. If it's a matter of putting the data in a database, I would preserve as much of the original information as possible, in this case the sample collection times. The different times should then be handled at the analysis stage, since the right approach depends on what data is used and for what purpose.
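One way to do that, sketched below under the assumption that each clinic's export can be read into a table, is to keep everything in a single long-format table keyed by source, patient, and the actual collection time, and leave any alignment or interpolation to the analysis step. The column names, identifiers, and values here are hypothetical.

```python
import pandas as pd

clinic_a = pd.DataFrame({"source": "A",
                         "patient_id": ["A-01", "A-01", "A-01"],
                         "time_h": [0, 5, 18],
                         "value": [12.0, 6.5, 1.4]})
clinic_b = pd.DataFrame({"source": "B",
                         "patient_id": ["B-07", "B-07", "B-07"],
                         "time_h": [0, 12, 24],
                         "value": [11.2, 2.7, 0.6]})

# Stack the sources into one long table, keeping the original sampling times.
merged = pd.concat([clinic_a, clinic_b], ignore_index=True)
print(merged)

# The long table can then be written to a database as-is, e.g. with
# merged.to_sql("blood_samples", engine, if_exists="append") given a
# SQLAlchemy engine for your DB.
```

This keeps the merge lossless: no time points are invented or discarded, and any matching or imputation can be done later per analysis.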