Question

What is the fastest way to loop over the columns of a pandas DataFrame and replace NaN with a specific element from each column?

2.3 years ago

Phoebe Magdy • 0

Hello, respectable minds;)

I am analyzing SNP data in which rows represent patients and columns represent SNPs. There are some NaN values because some SNPs are present in some patients but not in others. Initially, every sample is represented by two rows: one showing the reference allele and the other showing the alternative allele at each SNP.

I am trying to replace the NaN values in every column with the reference allele of that specific SNP (column), so my approach was to:

1- Create a variable holding all elements of each column as a pd.Series and get its first valid value.

2- Use this value to replace the NaNs in that column. After that, I will remove all rows representing the REF allele.

My code uses a for loop over every column to get the REF allele and use it to replace NaN, as follows:

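for col in df3.columns:
    s = df3[col]  # df3[col] is already a Series
    valid = s.dropna()
    if valid.empty:  # column is entirely NaN, so there is nothing to fill with
        continue
    first_valid_Ref_value = valid.iloc[0]  # first non-NaN value in the column
    df3[col] = s.fillna(first_valid_Ref_value)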

This code ran for more than 7 hours over 151865 SNPs (columns) without finishing; then Windows suddenly required a restart and shut down my Linux VM, which is hosted on Windows 10.

Now I have two questions:

1- Is there a faster way to loop over the columns and replace NaN values than the way I am doing it?

2- How can I protect my code from being stopped while working in a Jupyter notebook? Is there a command like 'nohup', which is used in the terminal, that we can use in a Jupyter notebook so that if the notebook stops suddenly the code keeps running in the background? Or else, is there a way to resume from where we stopped instead of restarting from the beginning?
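On the "resume from where we stopped" part of question 2, one possible pattern is to process the columns in chunks and save each finished chunk to disk, so that a rerun skips whatever is already saved. The sketch below is only an illustration: `fill_in_chunks`, the `part_*.pkl` file names, and the chunk size are hypothetical, and the fill rule (first non-NaN value per column) simply mirrors the loop above. It does not keep a stopped kernel alive; it only makes restarting cheap.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def fill_in_chunks(df, out_dir, chunk_size=2):
    """Fill NaNs chunk by chunk, skipping chunks already saved in out_dir.

    Hypothetical helper for illustration; names and file layout are assumptions.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for start in range(0, df.shape[1], chunk_size):
        part_path = out_dir / f"part_{start:07d}.pkl"
        if part_path.exists():  # finished in an earlier run, so skip it
            continue
        chunk = df.iloc[:, start:start + chunk_size]
        # First non-NaN value per column; all-NaN columns are dropped from the fill map.
        first_valid = chunk.bfill().iloc[0].dropna()
        tmp_path = part_path.with_suffix(".tmp")
        chunk.fillna(first_valid).to_pickle(tmp_path)
        # Rename last, so a half-written file is never mistaken for a finished part.
        tmp_path.rename(part_path)
    parts = [pd.read_pickle(p) for p in sorted(out_dir.glob("part_*.pkl"))]
    return pd.concat(parts, axis=1)


# Tiny made-up frame, just to exercise the helper.
demo = pd.DataFrame(
    {
        "snp_a": [np.nan, "G", np.nan],
        "snp_b": ["C", np.nan, np.nan],
        "snp_c": [np.nan, np.nan, "T"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    result = fill_in_chunks(demo, tmp)
    rerun = fill_in_chunks(demo, tmp)  # second call only reloads the saved parts
```

With real data, `out_dir` would be a directory that survives a VM restart, and `chunk_size` would be much larger (for example a few thousand columns).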

column loop nohup • 1.2k views

ADD COMMENT • link updated 2.3 years ago by GenoMax 149k • written 2.3 years ago by Phoebe Magdy • 0

GenoMax · Answer 1 · 2022-11-17

2.3 years ago

Shred ★ 1.6k

You haven't posted any sample data, so I can't provide code tested on it. I think you could transpose your dataframe (so that SNPs are on the index and samples on the columns), then keep only the non-NA values by dropping on axis=1, and store the result as a separate dataframe (say df4). Given that you now have two dataframes sharing the same indices (SNPs), you could use combine_first [Reference] without iterating through columns (which is generally a bad idea with pandas). Then:

df3.combine_first(df4)

If you need some code to understand this better, please attach some input data.
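For readers who want something concrete, here is a minimal sketch of a loop-free fill. It is not the combine_first route described above but a different technique (back-fill the REF rows, then pass the resulting Series to fillna), and it rests on assumptions: the toy frame is a small subset of the sample rows posted later in this thread, and the row labels are assumed to be index strings ending in " ALT" or " REF".

```python
import numpy as np
import pandas as pd

# Toy frame: a subset of the sample rows posted in this thread (assumed layout:
# one "<sample> ALT" row and one "<sample> REF" row per sample, SNPs as columns).
df3 = pd.DataFrame(
    {
        "chr1-100111956": [np.nan, np.nan, "G", "A"],
        "chr1-1001177": ["C", "G", "C", "G"],
        "chr1-1001233": ["T", "C", np.nan, np.nan],
        "chr1-1001270": [np.nan, np.nan, np.nan, np.nan],
    },
    index=[
        "G2_Sample_11_SNPs ALT",
        "G2_Sample_11_SNPs REF",
        "G2_Sample_33_SNPs ALT",
        "G2_Sample_33_SNPs REF",
    ],
)

# Boolean mask of the REF rows (assumes the " REF" suffix in the index labels).
is_ref = df3.index.str.endswith(" REF")

# Back-filling pulls the first non-NaN REF value of every column into row 0.
# Columns with no REF value at all are dropped from the fill map.
ref_allele = df3[is_ref].bfill().iloc[0].dropna()

# fillna with a Series fills each column with the value under the matching label,
# so no Python-level loop over columns is needed.
filled = df3[~is_ref].fillna(ref_allele)
```

Here `filled` keeps only the ALT rows, with each missing call replaced by that SNP's reference allele; a column with no REF value in any sample, such as `chr1-1001270` in the toy frame, stays NaN.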

ADD COMMENT • link 2.3 years ago by Shred ★ 1.6k

I really don't know how to add part of my data in a good format here. I've searched a lot and read a lot of posts, and could not find even a single YouTube video illustrating the different options on this site. Unfortunately, this is why most of my questions go unanswered, because I believe I don't post them in a proper way. I would highly appreciate it if you could help me with this.

Anyway, regarding my data, here is part of df3 in the shared link below, if this is allowed here, so that you can replicate it and help me find my answer:

https://docs.google.com/spreadsheets/d/1wiXTEovqV4Uh5HLV2Vkz0jBJocSOAbNN/edit?usp=sharing&ouid=103665810529544354453&rtpof=true&sd=true

many thanks in advance

ADD REPLY • link 2.3 years ago by Phoebe Magdy • 0

Sorry, but I won't open an external spreadsheet and do data extraction for you. That's not how answering questions online works. You must provide an example that is visible to any user opening this question, without external links, which could go offline for any reason. Extract a 10x10 subset with some NaN values and post it here using the code sample formatting.

ADD REPLY • link 2.3 years ago by Shred ★ 1.6k

Thanks for your reply. Here is part of the data to replicate, if possible:

    SNP_ID                 chr1-10002921   chr1-100058793  chr1-10007418   chr1-1000966    chr1-1001028    chr1-100111956  chr1-1001177    chr1-1001233    chr1-1001270    chr1-1001290    chr1-100133176
    G2_Sample_11_SNPs ALT  NaN             NaN             NaN             NaN             NaN             NaN             C               T               NaN             NaN             NaN
    G2_Sample_11_SNPs REF  NaN             NaN             NaN             NaN             NaN             NaN             G               C               NaN             NaN             NaN
    G2_Sample_12_SNPs ALT  NaN             NaN             NaN             NaN             NaN             NaN             C               NaN             NaN             NaN             NaN
    G2_Sample_12_SNPs REF  NaN             NaN             NaN             NaN             NaN             NaN             G               NaN             NaN             NaN             NaN
    G2_Sample_33_SNPs ALT  NaN             NaN             NaN             NaN             NaN             G               C               NaN             NaN             T               NaN
    G2_Sample_33_SNPs REF  NaN             NaN             NaN             NaN             NaN             A               G               NaN             NaN             C               NaN
    G2_Sample_34_SNPs ALT  NaN             NaN             NaN             NaN             NaN             NaN             NaN             NaN             NaN             NaN             NaN
    G2_Sample_34_SNPs REF  NaN             NaN             NaN             NaN             NaN             NaN             NaN             NaN             NaN             NaN             NaN
    G2_Sample_35_SNPs ALT  NaN             NaN             NaN             NaN             NaN             NaN             NaN             NaN             NaN             NaN             NaN
    G2_Sample_35_SNPs REF  NaN             NaN             NaN             NaN             NaN             NaN             NaN             NaN             NaN             NaN             NaN
    G2_Sample_36_SNPs ALT  NaN             NaN             NaN             NaN             NaN             NaN             C               NaN             NaN             NaN             NaN

ADD REPLY • link updated 2.3 years ago by GenoMax 149k • written 2.3 years ago by Phoebe Magdy • 0