I have multiple files in a folder where I need to rename the headers, split each value at the first | (keeping only the part before it), and replace 'p.' with '' (empty).
The code looks like this:
import glob
import pandas as pd

path = "/home/kristina/snpeff_analysis/a.a/result/Ann.vcf/TEST_P.G_ann.vcf/PLAY.TEST"
all_files = glob.glob(path + "/*_G.P.vcf")
#print(all_files)
aa_df = []
for filename in all_files:
    aa_df = pd.read_csv(filename, sep='\t')
    new_header = {'Gene':'Gene', 'P':'Aminoacids'}
    aa_df.rename(columns=new_header, inplace=True)
    aa_df.to_csv(filename, index=False, sep='\t')
#%%
#split & replace
def get_element(my_list, position):
    return my_list[position]
df = aa_df
for filename in all_files:
    df.Gene.str.split('|').apply(get_element, position=0), df.Aminoacids.str.split('|').apply(get_element, position=0).str.replace('p.','').to_csv(filename, index=False, sep='\t')
Example, looking into one file:
Gene Aminoacids
gyrA|Rv0007|ppiA|dnaN|recF|Rv0004|gyrB|Rv0008c p.Ser95Thr|.|.|.|.|.|.|.
rpoB|rpoC|atsD|vapB8|vapC8|Rv0666 p.His445Asp|.|.|.|.|.
Rv1313c|Rv1314c|atpC|Rv1312|murA|ogt|rrs .|.|.|.|.|.|.
tlyA|ppnK|recN|Rv1697|mctB|mpg|tyrS|lprJ|Rv1691|Rv1692|Rv1693 p.Leu11Leu|.|.|.|.|.|.|.|.
The issue is that when I run the last part of my script, it only outputs the split Aminoacids column:
Aminoacids
Ser95Thr
His445Asp
.
Leu11Leu
But when I change the last command to end with .head instead of .to_csv, the output in the interactive window looks correct:
(0 gyrA
1 rpoB
2 Rv1313c
3 tlyA
Name: Gene, dtype: object,
<bound method NDFrame.head of
0 Ser95Thr
1 His445Asp
2 .
3 Leu11Leu
Name: Aminoacids, dtype: object>)
What am I doing wrong? I can add that I'm new to programming and that I have posted the same question on Stack Overflow.
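A likely cause: the last line builds a tuple of two Series separated by a comma, and .to_csv is chained only onto the second one, so only the Aminoacids column is written. That also explains the .head output, which shows a tuple holding the Gene Series and an uncalled method. Below is a minimal sketch of one way around it, assuming the columns are already named Gene and Aminoacids. The helper name clean_columns and the small example frame are made up for illustration.

```python
import pandas as pd

def clean_columns(df):
    """Keep the text before the first '|' and drop the 'p.' prefix."""
    out = df.copy()
    # Assign each result back to its column instead of building a tuple
    out["Gene"] = out["Gene"].str.split("|").str[0]
    out["Aminoacids"] = (
        out["Aminoacids"].str.split("|").str[0]
        .str.replace("p.", "", regex=False)  # literal 'p.', not a regex
    )
    return out

# Small stand-in for one file (values shortened from the example above)
example = pd.DataFrame({
    "Gene": ["gyrA|Rv0007|ppiA", "Rv1313c|Rv1314c"],
    "Aminoacids": ["p.Ser95Thr|.|.", ".|."],
})
cleaned = clean_columns(example)
print(cleaned)
```

To apply this per file, the read, clean, and write steps would all go inside the loop, e.g. clean_columns(pd.read_csv(filename, sep='\t')).to_csv(filename, index=False, sep='\t'). Otherwise the data from the last file read is written to every file. Passing regex=False matters because, depending on the pandas version, 'p.' may be treated as a regular expression where '.' matches any character.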
I think a non-Pandas solution may be faster.
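As a rough illustration of what a non-pandas approach could look like (not necessarily the code originally suggested in this thread), here is a sketch using only the standard csv module. It assumes tab-separated files whose headers are already Gene and Aminoacids. The function names clean_rows and clean_file are hypothetical.

```python
import csv
import io

def clean_rows(reader):
    """Yield rows with Gene/Aminoacids cut at the first '|' and 'p.' removed."""
    for row in reader:
        row["Gene"] = row["Gene"].split("|")[0]
        row["Aminoacids"] = row["Aminoacids"].split("|")[0].replace("p.", "")
        yield row

def clean_file(path):
    """Rewrite one tab-separated file in place (assumed layout, see above)."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        fieldnames = reader.fieldnames
        rows = list(clean_rows(reader))
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)

# In-memory example standing in for one of the files
sample = "Gene\tAminoacids\ngyrA|Rv0007|ppiA\tp.Ser95Thr|.|.\nRv1313c|Rv1314c\t.|.\n"
reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
result = list(clean_rows(reader))
print(result)
```

For the real files, clean_file(filename) could be called for each path returned by glob.glob. It overwrites the input, so testing on copies first seems wise.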
Thank you Shred! Your suggestion works fine!