Can some one help with this ??
This question was posted on stackoverflow, I didn't get any answers so referring to Biostars. Post which don't receive any answer will be deleted or updated as required.
I want to merge the table that was generated from a vcf file. This merge is context dependent. I am now at beginner level python programing and was able to do some data extraction and use some level of conditional statement. I am looking the solution specifically in python. Note: I have looked into pandas and scipy but I am not able to apply the conditional statement as I desire. This problem really needs intervention from some experts in here. Thanks much in advance !
Problem description: - I have two separate text files which need to be read first. Each file has several lines of data with 7 different column. The context dependent merging involves reading the values from first two columns in each text file and then proceed to merge if both info are the same.
A few lines from text01.txt
contig pos id ref_al-My alt-al-My ref-freq-My alt-freq-My
2 15801571 . G A 0.667 0.333
2 15801604 . CAAAAACAAAA C 0.583 0.417
2 15801610 . C CA,CAAA 0.5 0.25,0.25
2 15803330 . C T 0.333 0.667
2 15803398 . G A 0.667 0.333
2 15803529 . ATGC A 0.667 0.333
Similarly some lines from text_02.txt:
contig pos id ref_al-Sp alt-al-Sp ref-freq-Sp alt-freq-Sp
2 15801610 . CAAAAA C 0.0 1.0
2 15801618 . A G 0.0 1.0
2 15802052 . C T 0.1 0.9
2 15803398 . A G 0.9 0.1
2 15803477 . G A 0.1 0.9
2 15803542 . A C 0.1 0.9
Context dependent merging:
So, both the text files have 7 columns in which first three (contig, pos, id) column names are same.
This context dependent merging involves reading the values in the first two (contig and pos) columns from both the text files.
If both contig and pos value match, new columns are added and updated to the output_text.txt file.
Eg. in the given text file two lines have same matching contig and pos value.
From text_01.txt:
contig pos id ref_al-My alt-al-My ref-freq-My alt-freq-My
2 15801610 . C CA,CAAA 0.5 0.25,0.25
2 15803398 . G A 0.667 0.333
matches text_02.txt:
contig pos id ref_al-Sp alt-al-Sp ref-freq-Sp alt-freq-Sp
2 15801610 . CAAAAA C 0.0 1.0
2 15803398 . A G 0.9 0.1
So, we append several columns and add one new column (i.e all_ref): where,
all_ref = ref_al-My[::] + ref_al-Sp[::]
Now, the output_text.txt should contain following data:
contig pos id all_ref alt-al-My alt-al-Sp ref-freq-My ref-freq-Sp alt-freq-My alt-freq-Sp
2 15801610 . C,CAAAAA CA,CAAA C 0.5 0.0 0.25,0.25 1.0
2 15803398 . G,A A G 0.667 0.9 0.333 0.1
For other lines we will simply be append their respective values in respective columns, with values for null fields updated as periods.
contig pos id all_ref alt-al-My alt-al-Sp ref-freq-My ref-freq-Sp alt-freq-My alt-freq-Sp
2 15801571 . G A . 0.667 . 0.333 .
2 15803477 . G . A . 0.1 . 0.9
- The data values for My should come before Sp for new added column (all_ref).
I understanding this is a long question but any inputs is appreciated.
Thanks, - K
Hi @Zaag,
Thanks for writing an answer. But, I don't see any conditional statement in here. Only the rows that have both matching 'contig' and 'pos' value need to be added together, if not they well be added but the information from another table will just have null values (.). I am also not following the logic of each line of the code, though I can read that contig and pos value are treated as true. Can you please put an explanation to each line of code, so I can probably improve it.
Thanks much for answering though !
Hi kirannbishwa01,
Please refer to the Pandas Documentation.
There you go!
Ensure this value has at least 20 characters (it has 13).
Hi @Zaag,
The code
dfall = df1.join(df2, lsuffix='01', rsuffix='02')
isn't working for me. I am using python 3.5 but switched to 2.7 just to test it. I am having duplication of ID values (which isn't much of a problem) and the print statement is like this:@ Imanohara99, I am a beginner in python and just stumbled upto pandas. I read the documentation and was able to follow through it but its taking some time to align those information with what I want.
But, thanks !
Why switch to 2.7? Does it works in 3.5?
It didn't work for me in both python 2.7 and 3.5. See the result that I posted.