Hello, I'm having some major chaos with the output table generated and then passing that to my plot function. New to this (1st time), could somebody review this piece of code and suggest some corrections. I saved the output of a data frame as a txt file and it does not look good at all. The numbers and columns are there, but they follow some sort of strange wa, like a staircase.
# Create DataFrame from significant results
df_go = pd.DataFrame(
[(
r.GO,
r.goterm.name,
r.goterm.namespace,
r.p_uncorrected,
len(r.study_items), # Number of study genes
GO_items.count(r.GO),
', '.join([rev_mapper[y] for y in r.study_items]) # Concatenate study genes into a single string
) for r in goea_results_sig],
columns=['GO', 'term', 'p', 'n_genes', 'n_study', 'n_go', 'study_genes']
)
print(df_go)
# Save DataFrame to a text file (CSV format)
df_go.to_csv('df_go_results.txt', sep='\t', index=False)
I can provide more code, because when I tried to commit and push from IDE onto github I got the following message.
Successfully created project 'GSEAplots' on GitHub, but initial commit failed: inflate: data stream error (incorrect data check) corrupt loose object 'a87848b96903648398f723df040fd0be540a025d' unable to read a87848b96903648398f723df040fd0be540a025d
No idea why or what ? Git does not seem to interact with PyCharm and/ or Github really well ?
The plots code :
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
plt.figure(figsize=(6,4),dpi=300)
ax = plt.gca()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.rcParams["font.size"] = 18
plt.rcParams["font.family"] = "Helvetica"
plt.scatter(df_p['Z'],-np.log10(df_p['P(sgRNAs)']),s=18,color='grey',alpha=0.3,linewidths=0.3,edgecolors='grey')
label_ls1 = list(df_up['Gene'])
df_show1 = df_p.loc[label_ls1,:]
lfc_ls1 = list(df_show1['Z']); p_ls1 = list(-np.log10(df_show1['P(sgRNAs)']))
plt.scatter(lfc_ls1,p_ls1,s=20,color='red',linewidths=0.5,edgecolors='grey')
for i, txt in enumerate(label_ls1[-4:]):
plt.annotate(txt, (lfc_ls1[-4:][i]+0.03, p_ls1[-4:][i]-0.08),fontsize=8)
label_ls2 = list(df_down['Gene'])
df_show2 = df_p.loc[label_ls2,:]
lfc_ls2 = list(df_show2['Z']); p_ls2 = list(-np.log10(df_show2['P(sgRNAs)']))
plt.scatter(lfc_ls2,p_ls2,s=20,color='blue',linewidths=0.5,edgecolors='grey')
for i, txt in enumerate(label_ls2[1:5]):
plt.annotate(txt, (lfc_ls2[1:5][i]+0.03, p_ls2[1:5][i]-0.08),fontsize=8)
plt.axhline(y=1.3, linestyle='--', color='lightgray')
plt.axvline(x=-2, linestyle='--', color='lightgray')
plt.axvline(x=2, linestyle='--', color='lightgray')
plt.xlabel('Differential score (PRMT5/DMSO)')
plt.ylabel('-log10(p-value)')
#plt.savefig('DifferentialEssential_volcanoplot.png',dpi=300,facecolor='white')
plt.show()
Data frame that I am trying to plot has the following structure :
total_samples occurrence_total occurrence_CRSsNP \
ENSG00000100592 15 15 8
ENSG00000101336 15 14 8
ENSG00000134285 15 15 8
occurrence_CTL sparsity rlog_av_CTL rlog_av_CRSsNP \
ENSG00000100592 7 0.099193 11.709476 11.413235
ENSG00000101336 6 0.322330 4.084709 6.179892
ENSG00000134285 7 0.098323 9.169283 9.595925
rlog_Regulation vst_av_CTL vst_av_CRSsNP vst_Regulation \
ENSG00000100592 CTL 11.790292 11.367062 CTL
ENSG00000101336 CRSsNP 4.909465 7.051789 CRSsNP
ENSG00000134285 CRSsNP 9.142557 9.734598 CRSsNP
raw_av_CTL raw_av_CRSsNP Raw_Regulation structural_zero \
ENSG00000100592 3973.285714 2428.500 CTL No
ENSG00000101336 10.285714 119.750 CRSsNP No
ENSG00000134285 602.857143 761.625 CRSsNP No
DESeq2Norm_av_CTL DESeq2Norm_av_CRSsNP DESeq_Regulation \
ENSG00000100592 3526.244455 2614.486202 CTL
ENSG00000101336 9.293777 131.796045 CRSsNP
ENSG00000134285 535.929110 828.762618 CRSsNP
baseMean log2FoldChange lfcSE stat \
ENSG00000100592 3039.973387 -0.431423 0.076683 -5.626030
ENSG00000101336 74.628320 3.828859 0.605862 6.319689
ENSG00000134285 692.106981 0.628415 0.105894 5.934376
pvalue_DESEq2 padj_DESEq2 pvalue padj \
ENSG00000100592 1.844041e-08 0.000120 3.694635e-06 0.038127
ENSG00000101336 2.620902e-10 0.000005 2.009572e-07 0.006221
ENSG00000134285 2.949646e-09 0.000029 1.052822e-06 0.016297
errorBarUp errorBarLow Final_Regulation external_gene_name
ENSG00000100592 -0.354739 -0.508106 CTL DAAM1
ENSG00000101336 4.434721 3.222997 CRSsNP HCK
ENSG00000134285 0.734309 0.522521 CRSsNP FKBP11
I tried to push onto Github - unknown error. Also, the data frame would not plot because of discrepancy in the structure of the data frames or the class which gets the df passed onto. ?? Any suggestions, the p-value filters can be adjusted. What I am trying to plot or get at is in bold. Much thanks for your ideas.
**Runing BP Ontology Analysis: current study set of 3 IDs.
100% 3 of 3 study items found in association
100% 3 of 3 study items found in population(20608)
Calculating 12,180 uncorrected p-values using fisher_scipy_stats
12,180 terms are associated with 17,024 of 20,608 population items
33 terms are associated with 3 of 3 study items**
METHOD fdr_bh:
0 GO terms found significant (< 0.1=alpha) ( 0 enriched + 0 purified): statsmodels fdr_bh
0 study items associated with significant GO IDs (enriched)
0 study items associated with significant GO IDs (purified)
**Runing CC Ontology Analysis: current study set of 3 IDs.
100% 3 of 3 study items found in association
100% 3 of 3 study items found in population(20608)
Calculating 1,801 uncorrected p-values using fisher_scipy_stats
1,801 terms are associated with 18,139 of 20,608 population items
18 terms are associated with 3 of 3 study items**
METHOD fdr_bh:
0 GO terms found significant (< 0.1=alpha) ( 0 enriched + 0 purified): statsmodels fdr_bh
0 study items associated with significant GO IDs (enriched)
0 study items associated with significant GO IDs (purified)
**Runing MF Ontology Analysis: current study set of 3 IDs.
100% 3 of 3 study items found in association
100% 3 of 3 study items found in population(20608)
Calculating 4,602 uncorrected p-values using fisher_scipy_stats
4,602 terms are associated with 17,598 of 20,608 population items
11 terms are associated with 3 of 3 study items**
METHOD fdr_bh:
0 GO terms found significant (< 0.1=alpha) ( 0 enriched + 0 purified): statsmodels fdr_bh
0 study items associated with significant GO IDs (enriched)
0 study items associated with significant GO IDs (purified)
What are you showing the output of there? The result of
df_go.to_csv('df_go_results.txt', sep='\t', index=False)
or the result ofprint(df_go)
. You'd need to provide more to go on just starting there. The stuff about git doesn't belong here. That is a different question for a different forum. (Indeed you can just use the browser interface with Github if you really cannot push from the command line/IDE. Or make a gist.) You just need to make a toy example that read in data to a dataframe and results in the same issue as you are seeing. We don't need all your data or actual data. Please learn about sharing fenced code block in forums such as this. One example is here. (Ideally specify the language, too so that it has syntax highlighting, see here.) Right now your example code that beginsimport seaborn as sns
is unreadable and unuseable. Here is an example of you can share what is stored by something likedf_go.to_csv('df_go_results.txt', sep='\t', index=False)
so we can use it. (That example is written in regards to a Jupyter notebook but is valid Python code, too.) Ideally, you'd already have the steps to go from the toy set of data spelled out as part of a minimal MRE that results in what you are seeing as your issue.It seems you updated this some more yesterday. The code formatting is improved. Issues remain though with your post that make it hard to offer any guidance:
You don't give any indication of where the output that starts
**Runing BP Ontology Analysis: current study set of 3 IDs
is coming from? What was the command that resulting in that?However, I cannot see how what what you share under "Data frame that I am trying to plot has the following structure :" relates to
df_go
. Is that the output ofdf_go.to_csv('df_go_results.txt', sep='\t', index=False)
orprint(df_go)
?However, I cannot see how
plt.scatter(df_p['Z'],-np.log10(df_p['P(sgRNAs)']),s=18,color='grey',alpha=0.3,linewidths=0.3,edgecolors='grey'
in your plotting code relates to what you share under "Data frame that I am trying to plot has the following structure :". Or how it relates to other parts of your post?