Hello, respectable minds;)
I am analyzing SNP data in which rows represent patients and columns represent SNPs. There are some NaN values, because some SNPs are present in some patients but not in others. Initially, every sample was represented by 2 rows: one showing the reference allele and the other showing the alternative allele at every SNP.
I want to replace the NaN values in every column with the reference (REF) allele of that specific SNP (column), so my approach was to:
1- Take all elements of each column as a pd.Series and get the first valid value.
2- Use this value to replace the NaNs in that specific column. After that, I will remove all rows representing the REF allele.
My code uses a for loop over every column to get the REF allele and use it to replace the NaNs, as follows:
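import pandas as pd

for col in df3.columns:
    s = pd.Series(df3[col])
    # First non-NaN value in the column, taken as the REF allele
    first_valid_Ref_value = s.loc[s.first_valid_index()]
    print(first_valid_Ref_value)
    # Replace the NaNs in this column with the REF allele
    df3[[col]] = df3[[col]].fillna(first_valid_Ref_value)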
This piece of code ran for more than 7 hours over 151865 SNPs (columns) and did not finish. Then Windows suddenly required a restart and shut down my Linux VM, which is hosted on Windows 10.
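For comparison, the two steps described above (first valid value per column, then fill) can be sketched without a Python-level loop. This is only a minimal sketch on made-up toy data: the column names `snp1` to `snp3` and the allele values are hypothetical, and it assumes, as in the loop above, that the first non-NaN value in each column is the REF allele. It has not been timed on the real 151865-column table.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real df3 has patients as rows and SNPs as columns.
df3 = pd.DataFrame({
    "snp1": ["A", np.nan, "G", np.nan],
    "snp2": [np.nan, "C", np.nan, "T"],
    "snp3": ["T", "T", np.nan, np.nan],
})

# Step 1: first non-NaN value of every column in one pass.
# bfill() pulls the next valid value upward, so row 0 ends up holding
# the first valid value of each column.
first_valid = df3.bfill().iloc[0]

# Step 2: fillna with a Series fills each column with the value whose
# index label matches that column's name.
filled = df3.fillna(first_valid)

print(filled)
```

Passing a Series to `DataFrame.fillna` fills column by column, so no per-column assignment back into the wide DataFrame is needed.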
Now I have 2 questions:
1- Is there a better, faster way to loop over the columns and replace the NaN values than the way I'm doing it?
2- How can I protect my code from being stopped while it runs in a Jupyter notebook? Is there a command like 'nohup', which is used in the terminal, that can be used in a Jupyter notebook so that the code keeps running in the background if the notebook stops suddenly? Or is there a way to resume from where it stopped instead of restarting from the beginning?
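To make the "resume from where it stopped" idea in question 2 concrete, here is one possible sketch: process the columns in chunks, save each finished chunk to disk, and skip chunks that already exist on the next run. The function name `fill_in_chunks`, the file naming scheme, the chunk size, and the toy data are all assumptions made for illustration, not an established recipe.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def fill_in_chunks(df, out_dir, chunk_size=2):
    """Fill NaNs chunk by chunk, reusing chunks already saved in out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    parts = []
    for start in range(0, df.shape[1], chunk_size):
        path = os.path.join(out_dir, f"chunk_{start}.pkl")
        if os.path.exists(path):
            # Finished in an earlier run: reload instead of recomputing.
            part = pd.read_pickle(path)
        else:
            block = df.iloc[:, start:start + chunk_size]
            # Fill each column with its first valid value.
            part = block.fillna(block.bfill().iloc[0])
            # Write to a temporary name first, then rename, so an
            # interrupted write never looks like a finished chunk.
            tmp_path = path + ".tmp"
            part.to_pickle(tmp_path)
            os.replace(tmp_path, path)
        parts.append(part)
    return pd.concat(parts, axis=1)


# Hypothetical toy data standing in for df3.
df3 = pd.DataFrame({
    "snp1": ["A", np.nan, "G", np.nan],
    "snp2": [np.nan, "C", np.nan, "T"],
    "snp3": ["T", "T", np.nan, np.nan],
})

out_dir = tempfile.mkdtemp()
first_run = fill_in_chunks(df3, out_dir)
second_run = fill_in_chunks(df3, out_dir)  # reloads the saved chunks
```

This does not keep the notebook alive if the VM shuts down; it only limits how much work is lost, since a rerun picks up at the first chunk that was not saved.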
I really don't know how to add part of my data here in a good format. I have searched a lot and read many posts, and could not find even a single YouTube video illustrating the different options on this site. Unfortunately, I believe this is why most of my questions go unanswered: I don't post them in a proper way. I would highly appreciate your help with this.
Anyway, regarding my data, here is part of df3 in a shared link (if this is allowed here), so that you can replicate it and help me find an answer:
https://docs.google.com/spreadsheets/d/1wiXTEovqV4Uh5HLV2Vkz0jBJocSOAbNN/edit?usp=sharing&ouid=103665810529544354453&rtpof=true&sd=true
Many thanks in advance.
Sorry, but I will not open an external spreadsheet and do data extraction for you. That's not how answering questions online works. You must provide an example that is visible to any user opening this question, without external links, which could go offline for any reason. Extract a 10x10 subset, with some NaN, and post it here as a code sample.
Thanks for your reply. Here is part of the data to replicate, if possible: