Alignment Problem
12.4 years ago

Hi,

I would like to compute the correlation coefficient between two numerical vectors that represent MALDI-TOF data. The two vectors do not always have the same length, which makes the cor() function fail.

The difference is usually 1 or 2 values (out of 100 or 200). I would like to insert an NA into the shorter vector, but my problem is finding where to insert it. The mass vectors are aligned, but sometimes one mass in vector 1 matches 2 masses in vector 2. By hand, I am able to "find" where the shift occurs, but I don't know how to code this search.

Edited to add an example:

mass1 is the first vector of masses. mass2 is the second vector.

I want to identify the intersection of the two vectors. However, we are dealing with biological data, so the masses are not exactly the same. We have to allow a delta to say that two close masses are equivalent. match1 is a boolean vector that records which masses in vector 1 are found in vector 2. To compute it, I use a "window", so that close masses are treated as equivalent. The window is coded in ppm (parts per million), because the measurement error grows with the mass. So for a peak at mass 2000 I allow a window of +/- 2, but for a peak at mass 20000 I allow a window of +/- 20.

The problem is that in a few circumstances, a mass in vector 1 is found to have a match in vector 2, but because the corresponding mass in vector 2 is lower, its window is smaller, and the value in the match2 vector is FALSE. That's why, in the end, I do not get the same number of matches in match1 and match2 (but I should). I have tried to solve this problem, but solving it exactly takes too much time to compute. That's why I wanted to just remove or add one value so that both vectors have the same length.
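The asymmetry can be reproduced with a small sketch in R. The helper name `match_ppm`, the 1000 ppm default (taken from the +/- 2 at mass 2000 figure) and the masses below are assumptions made up for illustration, not the original data or code:

```r
# Hypothetical helper: TRUE where a mass in `query` has at least one mass in
# `ref` inside a window of +/- ppm, centred on the query mass.
match_ppm <- function(query, ref, ppm = 1000) {
  vapply(query, function(m) any(abs(ref - m) <= m * ppm / 1e6), logical(1))
}

# Made-up masses, chosen so that the difference of the second pair (19.99)
# falls between the two windows (20 and 19.98001).
mass1 <- c(2000, 20000)
mass2 <- c(2001.5, 19980.01)

match1 <- match_ppm(mass1, mass2)  # windows of +/- 2 and +/- 20
match2 <- match_ppm(mass2, mass1)  # windows of +/- 2.0015 and +/- 19.98001

sum(match1)  # 2
sum(match2)  # 1
```

Because the window is centred on the query mass, swapping the roles of the two vectors changes the result for the second pair. Computing the tolerance from the mean of the two masses being compared would be one way to make the test symmetric.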

mass1  : ... 3711 3740 3818 3883 ...
match1 : ...    0    1    1    1 ...
mass2  : ... 3687 3747 3769 3817 3883 ...
match2 : ...    0    0    0    1    1 ...

Here you can see that 3818 matches 3817 and 3883 matches 3883. In vector 1, 3740 matches 3747, but the reverse is not true. In the end, the vector of matched masses from vector 1 is one element longer than the one from vector 2. This is where the error comes from. I would like to align two vectors like the ones below, to "add" or "remove" one value so that both have the same length:

matched masses in vector 1 :  ... 2114 3245 3740 3818 3883 4254 4785 ...
matched masses in vector 2 :  ... 2113 3247 3817 3883 4256 4785 ...
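One possible way to line these two vectors up is sketched below in R. It is only an illustration under stated assumptions: `nearest_within_ppm` is a hypothetical helper, the 1000 ppm tolerance is inferred from the window described earlier, and the tolerance is computed from the mean of the two masses so that the test gives the same answer in both directions. It also assumes the second vector is not empty.

```r
# Hypothetical helper: for each mass in `a`, return the index of the nearest
# mass in `b`, or NA when that nearest mass lies outside the tolerance.
# The tolerance uses the mean of the two masses, so a-vs-b and b-vs-a agree.
nearest_within_ppm <- function(a, b, ppm = 1000) {
  vapply(a, function(m) {
    j <- which.min(abs(b - m))
    tol <- (m + b[j]) / 2 * ppm / 1e6
    if (abs(b[j] - m) <= tol) j else NA_integer_
  }, integer(1))
}

# Masses from the example above; 3740 has no acceptable partner in vector 2.
matched1 <- c(2114, 3245, 3740, 3818, 3883, 4254, 4785)
matched2 <- c(2113, 3247, 3817, 3883, 4256, 4785)

idx <- nearest_within_ppm(matched1, matched2)

# Both columns now have the same length, with NA where vector 2 has no partner.
aligned <- data.frame(mass1 = matched1, mass2 = matched2[idx])
```

Here `aligned$mass2` holds NA in the row for 3740. This sketch does not prevent two masses in the first vector from picking the same partner in the second, so a real one-to-one assignment would need an extra check.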

I'm sorry, it's hard to explain!

Would you know how to do this? Or is there a way to compute a correlation coefficient with two vectors of different lengths?
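For reference, once the shorter vector has been padded with NA so that both have the same length, cor() can be told to skip the incomplete pairs through its `use` argument. A minimal sketch in R, with made-up intensities:

```r
# Hypothetical intensities; the NA marks a peak with no partner in the other spectrum.
x <- c(10, 20, 30, 40, 50)
y <- c(12, 18, NA, 41, 52)

r_default  <- cor(x, y)                        # NA, because one pair is incomplete
r_complete <- cor(x, y, use = "complete.obs")  # uses only the four complete pairs
```

The second call gives the same value as dropping the third element from both vectors by hand, so the position of the NA matters: it decides which pairs are compared.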

Thanks a lot

Julien

alignment mass-spec • 2.7k views

I kind of understand it, but it would be much easier to help you with a numerical example. Also, are you coding in R/perl/??


Yes, sorry for that. I edited the post above.

Julien


You should compute the correlation on the data that share the same class in the two vectors, instead of adding 'NA' for missing class values in the shorter vector.


Still not really clear. Could you provide a small version of your problem (even with two fake vectors), so that we can see what exactly you are dealing with? I agree with Dk that adding NAs does not seem like the appropriate solution, but please provide an example, so that we can help you.

12.4 years ago
kstamm ▴ 50

Trying to match up the values based on similar items is going to be error-prone and lead to trouble. How do we define similar numbers? Do you really want the software to arbitrarily guess the best place for an NA? Why did one vector get a double measurement?

A better solution is to ensure the data come in correctly labeled to begin with. If your vectors have rownames, there are easier ways to line them up automatically. If it doesn't have to be automated, this is a good place to use a simple spreadsheet application with each data vector as a column.

Your data are inconsistent at the moment, and those double measurements should probably be pre-processed as the mean of the two, or maybe one should be dropped, but automating the decision is probably the wrong solution. See if you can get rownames onto your data files; the merge() function might then be enough to create a two-column matrix with consistent rownames.
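As a rough illustration of the merge() route, assuming each peak can be given a shared label as its rowname (the labels and intensities below are made up):

```r
# Hypothetical labelled intensities: rownames identify the same peak in both spectra.
spec1 <- data.frame(int1 = c(10, 20, 30, 40), row.names = c("p1", "p2", "p3", "p4"))
spec2 <- data.frame(int2 = c(12, 18, 41),     row.names = c("p1", "p2", "p4"))

# Keep only the peaks present in both spectra.
both <- merge(spec1, spec2, by = "row.names")

# Keep every peak; the peak missing from spec2 gets NA in int2.
all_peaks <- merge(spec1, spec2, by = "row.names", all = TRUE)

r <- cor(both$int1, both$int2)
```

With `by = "row.names"` the labels end up in a column called `Row.names`, and the two intensity columns are guaranteed to have the same length.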
