Question

How to do context dependent data merge in python? help ?

0

Entering edit mode

8.4 years ago

kirannbishwa01 ★ 1.6k

Can some one help with this ??

This question was posted on stackoverflow, I didn't get any answers so referring to Biostars. Post which don't receive any answer will be deleted or updated as required.

I want to merge the table that was generated from a vcf file. This merge is context dependent. I am now at beginner level python programing and was able to do some data extraction and use some level of conditional statement. I am looking the solution specifically in python. Note: I have looked into pandas and scipy but I am not able to apply the conditional statement as I desire. This problem really needs intervention from some experts in here. Thanks much in advance !

Problem description: - I have two separate text files which need to be read first. Each file has several lines of data with 7 different column. The context dependent merging involves reading the values from first two columns in each text file and then proceed to merge if both info are the same.

A few lines from text01.txt

contig  pos id  ref_al-My   alt-al-My   ref-freq-My alt-freq-My
2   15801571    .   G   A   0.667   0.333
2   15801604    .   CAAAAACAAAA    C    0.583   0.417
2   15801610    .   C   CA,CAAA    0.5      0.25,0.25
2   15803330    .   C   T   0.333   0.667
2   15803398    .   G   A   0.667   0.333
2   15803529    .   ATGC    A   0.667   0.333

Similarly some lines from text_02.txt:

contig  pos id  ref_al-Sp   alt-al-Sp   ref-freq-Sp     alt-freq-Sp
2   15801610    .   CAAAAA  C   0.0     1.0
2   15801618    .   A   G   0.0     1.0
2   15802052    .   C   T   0.1     0.9
2   15803398    .   A   G   0.9     0.1
2   15803477    .   G   A   0.1     0.9
2   15803542    .   A   C   0.1     0.9

Context dependent merging:

So, both the text files have 7 columns in which first three (contig, pos, id) column names are same.

This context dependent merging involves reading the values in the first two (contig and pos) columns from both the text files.
If both contig and pos value match, new columns are added and updated to the output_text.txt file.
Eg. in the given text file two lines have same matching contig and pos value.

From text_01.txt:

contig  pos id  ref_al-My   alt-al-My   ref-freq-My alt-freq-My
2   15801610    .   C     CA,CAAA       0.5     0.25,0.25
2   15803398    .   G     A    0.667     0.333

matches text_02.txt:

contig  pos id  ref_al-Sp   alt-al-Sp   ref-freq-Sp     alt-freq-Sp
2   15801610    .   CAAAAA     C     0.0        1.0
2   15803398    .   A     G     0.9      0.1

So, we append several columns and add one new column (i.e all_ref): where,

all_ref = ref_al-My[::] + ref_al-Sp[::]

Now, the output_text.txt should contain following data:

contig  pos   id    all_ref    alt-al-My       alt-al-Sp    ref-freq-My    ref-freq-Sp     alt-freq-My     alt-freq-Sp
2   15801610    .   C,CAAAAA    CA,CAAA     C   0.5    0.0      0.25,0.25     1.0
2   15803398    .   G,A    A      G   0.667    0.9   0.333    0.1

For other lines we will simply be append their respective values in respective columns, with values for null fields updated as periods.

contig     pos  id    all_ref      alt-al-My    alt-al-Sp      ref-freq-My       ref-freq-Sp        alt-freq-My    alt-freq-Sp
2   15801571    .   G   A   .    0.667     .   0.333     .
2   15803477    .   G   .   A     .      0.1      .     0.9

The data values for My should come before Sp for new added column (all_ref).

I understanding this is a long question but any inputs is appreciated.

Thanks, - K

python vcf data merge pandas data frame • 2.7k views

ADD COMMENT • link 8.4 years ago by kirannbishwa01 ★ 1.6k

score 0 · Answer 1 · 2016-11-01

0

Entering edit mode

8.4 years ago

Zaag ▴ 870

import pandas as pd

df1 = pd.read_csv('text01.txt', sep='\s+')
df2 = pd.read_csv('text_02.txt', sep='\s+')

So now we have each file in a dataframe, text01 looks like this. The 0-5 column on the left is the index that is given by pandas.

contig  pos id  ref_al-My   alt-al-My   ref-freq-My alt-freq-My
0   2   15801571    .   G   A   0.667   0.333
1   2   15801604    .   CAAAAACAAAA C   0.583   0.417
2   2   15801610    .   C   CA,CAAA 0.500   0.25,0.25
3   2   15803330    .   C   T   0.333   0.667
4   2   15803398    .   G   A   0.667   0.333
5   2   15803529    .   ATGC    A   0.667   0.333

So now we set the index to the columns we want to use to compare, contig and pos.

df1.set_index(['contig', 'pos'], inplace=True)
df2.set_index(['contig', 'pos'], inplace=True)

and the text01 dataframe looks like this:

contig  pos     id  ref_al-My   alt-al-My   ref-freq-My alt-freq-My                 
2    15801571   .   G   A   0.667   0.333
2    15801604   .   CAAAAACAAAA C   0.583   0.417
2    15801610   .   C   CA,CAAA 0.500   0.25,0.25
2    15803330   .   C   T   0.333   0.667
2    15803398   .   G   A   0.667   0.333
2    15803529   .   ATGC    A   0.667   0.333

Now we join the 2 dataframes on the index we have just created

dfall = df1.join(df2, lsuffix='01', rsuffix='02')

and the dfall looks like this:

contig  pos id01    ref_al-My   alt-al-My   ref-freq-My alt-freq-My id02    ref_al-Sp   alt-al-Sp   ref-freq-Sp alt-freq-Sp                                 
2   15801571    .   G   A   0.667   0.333   NaN NaN NaN NaN NaN
2   15801604    .   CAAAAACAAAA C   0.583   0.417   NaN NaN NaN NaN NaN
2   15801610    .   C   CA,CAAA 0.500   0.25,0.25   .   CAAAAA  C   0.0 1.0
2   15803330    .   C   T   0.333   0.667   NaN NaN NaN NaN NaN
2   15803398    .   G   A   0.667   0.333   .   A   G   0.9 0.1
2   15803529    .   ATGC    A   0.667   0.333   NaN NaN NaN NaN NaN

Last step get rid of the lines that have NaN in them because they don't hold all the information:

dfall = dfall.dropna()

Gives this output:

contig  pos id01    ref_al-My   alt-al-My   ref-freq-My alt-freq-My id02    ref_al-Sp   alt-al-Sp   ref-freq-Sp alt-freq-Sp
2   15801610    .   C   CA,CAAA 0.500   0.25,0.25   .   CAAAAA  C   0.0 1.0
2   15803398    .   G   A   0.667   0.333   .   A   G   0.9 0.1

ADD COMMENT • link 8.4 years ago by Zaag ▴ 870

0

Entering edit mode

Hi @Zaag,

Thanks for writing an answer. But, I don't see any conditional statement in here. Only the rows that have both matching 'contig' and 'pos' value need to be added together, if not they well be added but the information from another table will just have null values (.). I am also not following the logic of each line of the code, though I can read that contig and pos value are treated as true. Can you please put an explanation to each line of code, so I can probably improve it.

Thanks much for answering though !

ADD REPLY • link 8.4 years ago by kirannbishwa01 ★ 1.6k

0

Entering edit mode

Hi kirannbishwa01,

Please refer to the Pandas Documentation.

ADD REPLY • link 8.4 years ago by lmanohara99 ▴ 20

0

Entering edit mode

There you go!

Ensure this value has at least 20 characters (it has 13).

ADD REPLY • link 8.4 years ago by Zaag ▴ 870

0

Entering edit mode

Hi @Zaag,

The code dfall = df1.join(df2, lsuffix='01', rsuffix='02') isn't working for me. I am using python 3.5 but switched to 2.7 just to test it. I am having duplication of ID values (which isn't much of a problem) and the print statement is like this:

id01    ref_al-My alt-al-My  ref-freq-My alt-freq-My id02  \
contig pos                                                                  
 2      15801571    .            G         A        0.667       0.333  NaN   
   15801604    .  CAAAAACAAAA         C        0.583       0.417  NaN   
   15801610    .            C   CA,CAAA        0.500   0.25,0.25    .   
   15803330    .            C         T        0.333       0.667  NaN   
   15803398    .            G         A        0.667       0.333    .   
   15803529    .         ATGC         A        0.667       0.333  NaN   

ref_al-Sp alt-al-Sp  ref-freq-Sp  alt-freq-Sp  
contig pos                                                     
2      15801571       NaN       NaN          NaN          NaN  
   15801604       NaN       NaN          NaN          NaN  
   15801610    CAAAAA         C          0.0          1.0  
   15803330       NaN       NaN          NaN          NaN  
   15803398         A         G          0.9          0.1  
   15803529       NaN       NaN          NaN          NaN

@ Imanohara99, I am a beginner in python and just stumbled upto pandas. I read the documentation and was able to follow through it but its taking some time to align those information with what I want.

But, thanks !

ADD REPLY • link 8.4 years ago by kirannbishwa01 ★ 1.6k

0

Entering edit mode

Why switch to 2.7? Does it works in 3.5?

ADD REPLY • link 8.4 years ago by Zaag ▴ 870

0

Entering edit mode

It didn't work for me in both python 2.7 and 3.5. See the result that I posted.

ADD REPLY • link 8.4 years ago by kirannbishwa01 ★ 1.6k