I have multiple files in a folder where I need to rename the headers, split each column at the first | (keeping only the part before it) and replace 'p.' with '' (empty).
The code looks like this:
import glob
import pandas as pd

path = "/home/kristina/snpeff_analysis/a.a/result/Ann.vcf/TEST_P.G_ann.vcf/PLAY.TEST"
all_files = glob.glob(path + "/*_G.P.vcf")
aa_df = []
for filename in all_files:
    aa_df = pd.read_csv(filename, sep='\t')
    new_header = {'Gene':'Gene', 'P':'Aminoacids'}
    aa_df.rename(columns=new_header, inplace=True)
    aa_df.to_csv(filename, index=False, sep='\t')
#split & replace
def get_element(my_list, position):
    return my_list[position]
df = aa_df
for filename in all_files:
    df.Gene.str.split('|').apply(get_element, position=0), df.Aminoacids.str.split('|').apply(get_element, position=0).str.replace('p.','').to_csv(filename, index=False, sep='\t')
Example, looking into one file:
Gene Aminoacids
gyrA|Rv0007|ppiA|dnaN|recF|Rv0004|gyrB|Rv0008c p.Ser95Thr|.|.|.|.|.|.|.
rpoB|rpoC|atsD|vapB8|vapC8|Rv0666 p.His445Asp|.|.|.|.|.
Rv1313c|Rv1314c|atpC|Rv1312|murA|ogt|rrs .|.|.|.|.|.|.
tlyA|ppnK|recN|Rv1697|mctB|mpg|tyrS|lprJ|Rv1691|Rv1692|Rv1693 p.Leu11Leu|.|.|.|.|.|.|.|.
The issue is that when I run the last part of my script, the output file only contains the split Aminoacids column.
But when I change the last command to end with .head instead of .to_csv, the output in the interactive window looks correct:
(0 gyrA
1 rpoB
2 Rv1313c
3 tlyA
Name: Gene, dtype: object,
<bound method NDFrame.head of
0 Ser95Thr
1 His445Asp
2 .
3 Leu11Leu
Name: Aminoacids, dtype: object>)
What am I doing wrong? I can add that I'm new to programming and that I have posted the same question on Stack Overflow.
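For reference, the likely cause is the comma in the last line of the script. It builds a tuple: the first element is the split Gene series, which is computed and then discarded, and .to_csv is called only on the Aminoacids series. In addition, df refers only to the last file that was read, so every file would be overwritten with that one file's data. Below is a minimal sketch of one way to restructure this with pandas, assuming every file has the two tab-separated columns shown above. The function name clean_file is made up for illustration, and regex=False is used so that 'p.' is treated as literal text rather than a pattern.

```python
import glob
import pandas as pd

def clean_file(filename):
    """Keep the first '|'-separated entry of each column and drop the literal 'p.'.

    Hypothetical helper, not part of the original script.
    """
    df = pd.read_csv(filename, sep='\t')
    # Renaming is a no-op if the header is already 'Aminoacids'.
    df = df.rename(columns={'P': 'Aminoacids'})
    # Assign the results back to the DataFrame so both columns are kept.
    df['Gene'] = df['Gene'].str.split('|').str[0]
    df['Aminoacids'] = (
        df['Aminoacids'].str.split('|').str[0].str.replace('p.', '', regex=False)
    )
    # Write the whole DataFrame, not a single Series.
    df.to_csv(filename, index=False, sep='\t')
    return df

# Path and file pattern taken from the question.
path = "/home/kristina/snpeff_analysis/a.a/result/Ann.vcf/TEST_P.G_ann.vcf/PLAY.TEST"
for filename in glob.glob(path + "/*_G.P.vcf"):
    clean_file(filename)
```

Reading, transforming and writing inside the same loop iteration means each file is processed with its own data.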
I think a non-Pandas solution may be faster.
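As an illustration of what a non-pandas approach could look like, here is a rough sketch using only the standard csv module. It is not necessarily what was suggested in this thread, and whether it is actually faster would need to be measured. The function name first_fields and the separate output path are assumptions made for the example. It also assumes two tab-separated columns with a header row.

```python
import csv

def first_fields(in_path, out_path):
    """Write the first '|'-separated entry of each column, dropping a leading 'p.'.

    Hypothetical helper; writes to a separate file instead of overwriting the input.
    """
    with open(in_path, newline='') as src, open(out_path, 'w', newline='') as dst:
        reader = csv.reader(src, delimiter='\t')
        writer = csv.writer(dst, delimiter='\t', lineterminator='\n')
        next(reader, None)  # skip the original header row, if any
        writer.writerow(['Gene', 'Aminoacids'])
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            gene = row[0].split('|')[0]
            aa = row[1].split('|')[0]
            if aa.startswith('p.'):
                aa = aa[2:]  # drop the 'p.' prefix
            writer.writerow([gene, aa])
```

A call such as first_fields("sample_G.P.vcf", "sample_clean.tsv") would process one file; the file names here are placeholders.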
Thank you Shred! Your suggestion works fine!