Question

Clean data with pandas

2.2 years ago

Kristina • 0

I have multiple files in a folder where I need to rename the headers, keep only the part of each field before the first |, and replace 'p.' with '' (empty).

The code looks like this:

import glob
import pandas as pd

path = "/home/kristina/snpeff_analysis/a.a/result/Ann.vcf/TEST_P.G_ann.vcf/PLAY.TEST"
all_files = glob.glob(path + "/*_G.P.vcf")

#print(all_files)

aa_df = []
for filename in all_files:
  aa_df = pd.read_csv(filename, sep='\t')
  new_header = {'Gene':'Gene', 'P':'Aminoacids'}
  aa_df.rename(columns=new_header, inplace=True)

  aa_df.to_csv(filename, index=False, sep='\t')

#%%
#split & replace
def get_element(my_list, position):
    return my_list[position]

df = aa_df
for filename in all_files:
    df.Gene.str.split('|').apply(get_element, position=0), df.Aminoacids.str.split('|').apply(get_element, position=0).str.replace('p.','').to_csv(filename, index=False, sep='\t')

Example of the contents of one file:

   Gene    Aminoacids
gyrA|Rv0007|ppiA|dnaN|recF|Rv0004|gyrB|Rv0008c  p.Ser95Thr|.|.|.|.|.|.|.
rpoB|rpoC|atsD|vapB8|vapC8|Rv0666   p.His445Asp|.|.|.|.|.
Rv1313c|Rv1314c|atpC|Rv1312|murA|ogt|rrs    .|.|.|.|.|.|.
tlyA|ppnK|recN|Rv1697|mctB|mpg|tyrS|lprJ|Rv1691|Rv1692|Rv1693   p.Leu11Leu|.|.|.|.|.|.|.|.

The issue that I have is that when running the last part of my script it only outputs the split in the Aminoacids column.

Aminoacids
Ser95Thr
His445Asp
.
Leu11Leu

But when I change the last command to end with .head instead of .to_csv, the output in the interactive window looks correct.

(0       gyrA
 1       rpoB
 2    Rv1313c
 3       tlyA
 Name: Gene, dtype: object,
 <bound method NDFrame.head of 
 0     Ser95Thr
 1    His445Asp
 2            .
 3     Leu11Leu
 Name: Aminoacids, dtype: object>)

What am I doing wrong? I can add that I'm new to programming and that I have posted the same question on Stack Overflow.

score 2 · Accepted Answer · 2022-10-26

2.2 years ago

Shred ★ 1.6k

Cross-posting questions is bad practice. You're writing only the last column to the CSV because the comma in your line separates two independent expressions: the first (the Gene split) is evaluated and then discarded, and .to_csv() is called only on the second, the Aminoacids column.

df.Gene.str.split('|').apply(get_element, position=0), df.Aminoacids.str.split('|').apply(get_element, position=0).str.replace('p.','').to_csv(filename, index=False, sep='\t')
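To make that concrete, here is a minimal sketch of the same pattern on a two-row toy dataframe (the rows are shortened from the example file, and an in-memory buffer stands in for the output file; both are assumptions for illustration). It uses `.str[0]` instead of `get_element` only to keep the snippet short.

```python
import io
import pandas as pd

# Toy data, shortened from the example file in the question
df = pd.DataFrame({
    "Gene": ["gyrA|Rv0007", "rpoB|rpoC"],
    "Aminoacids": ["p.Ser95Thr|.", "p.His445Asp|."],
})

buffer = io.StringIO()  # stands in for the output file

# The comma builds a tuple of two separate expressions.
# .to_csv() belongs to the second expression only.
result = (
    df.Gene.str.split("|").str[0],
    df.Aminoacids.str.split("|").str[0].to_csv(buffer, index=False, sep="\t"),
)

first_genes, to_csv_return = result
written = buffer.getvalue()
print(written)  # only the Aminoacids column was written
```

The first tuple element holds the split gene names, but nothing ever writes it anywhere, which matches the output file described in the question.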

I'd rather modify the dataframe columns first and then export the whole dataframe to CSV, like this:

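# assign the cleaned values back to the existing columns, then write the whole dataframe
df['Gene'] = df.Gene.str.split('|').apply(get_element, position=0)
# regex=False so that 'p.' is matched literally and not as 'p' followed by any character
df['Aminoacids'] = df.Aminoacids.str.split('|').apply(get_element, position=0).str.replace('p.', '', regex=False)
df.to_csv(filename, index=False, sep='\t')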
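Note also that in the question `df` holds only the last file that was read, so the second loop would write that one dataframe into every file. A possible way to put the pieces together is to read, clean and write each file inside a single loop. The sketch below assumes the original column names are `Gene` and `P`, as in the question's rename step; `clean_table` and `clean_files` are hypothetical helper names, not part of pandas.

```python
import glob
import pandas as pd


def clean_table(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new dataframe with renamed headers and only the first '|' field kept."""
    # A missing 'P' column is ignored, so already-renamed files still work
    out = df.rename(columns={"P": "Aminoacids"})
    out["Gene"] = out["Gene"].str.split("|").str[0]
    out["Aminoacids"] = (
        out["Aminoacids"]
        .str.split("|").str[0]
        .str.replace("p.", "", regex=False)  # literal 'p.', not a regex
    )
    return out


def clean_files(pattern: str) -> None:
    """Clean every tab-separated file matching the glob pattern, overwriting it."""
    for filename in glob.glob(pattern):
        df = pd.read_csv(filename, sep="\t")
        clean_table(df).to_csv(filename, index=False, sep="\t")
```

With the question's variables this would be called as `clean_files(path + "/*_G.P.vcf")`. Since the files are overwritten, it may be worth trying it on a copy of the folder first.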

I think a non-Pandas solution may be faster.

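import os

for filename in all_files:  # all_files comes from the glob call in the question
    tmp_name = f"{filename}_replaced.csv"
    with open(filename, 'r') as iput, open(tmp_name, 'w') as oput:
        oput.write("Gene\tAminoacids\n")
        for idx, line in enumerate(iput):
            # skip the header and any blank lines
            if idx > 0 and line.strip():
                gene_col, aa_col = line.rstrip('\n').split('\t')[:2]
                gene = gene_col.split('|')[0]
                # drop the 'p.' prefix where present; plain '.' entries stay as they are
                aa = aa_col.split('|')[0].replace('p.', '', 1)
                oput.write(f"{gene}\t{aa}\n")
    # swap the cleaned copy in place of the original file
    os.replace(tmp_name, filename)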