Clean data with pandas
1
0
Entering edit mode
2.2 years ago
Kristina • 0

I have multiple files in a folder where I need to rename the headers, keep only the text before the first | in each column, and replace 'p.' with '' (an empty string).

The code looks like this:

import glob
import pandas as pd

path = "/home/kristina/snpeff_analysis/a.a/result/Ann.vcf/TEST_P.G_ann.vcf/PLAY.TEST"
all_files = glob.glob(path + "/*_G.P.vcf")

#print(all_files)

aa_df = []
for filename in all_files:
  aa_df = pd.read_csv(filename, sep='\t')
  new_header = {'Gene':'Gene', 'P':'Aminoacids'}
  aa_df.rename(columns=new_header, inplace=True)

  aa_df.to_csv(filename, index=False, sep='\t')

#%%
#split & replace
def get_element(my_list, position):
    return my_list[position]

df = aa_df
for filename in all_files:
    df.Gene.str.split('|').apply(get_element, position=0), df.Aminoacids.str.split('|').apply(get_element, position=0).str.replace('p.','').to_csv(filename, index=False, sep='\t')

Example of the contents of one file:

   Gene    Aminoacids
gyrA|Rv0007|ppiA|dnaN|recF|Rv0004|gyrB|Rv0008c  p.Ser95Thr|.|.|.|.|.|.|.
rpoB|rpoC|atsD|vapB8|vapC8|Rv0666   p.His445Asp|.|.|.|.|.
Rv1313c|Rv1314c|atpC|Rv1312|murA|ogt|rrs    .|.|.|.|.|.|.
tlyA|ppnK|recN|Rv1697|mctB|mpg|tyrS|lprJ|Rv1691|Rv1692|Rv1693   p.Leu11Leu|.|.|.|.|.|.|.|.

The issue is that when I run the last part of my script, only the split Aminoacids column is written to the file.

Aminoacids
Ser95Thr
His445Asp
.
Leu11Leu

But when I change the last command to end with .head instead of .to_csv, the output in the interactive window looks correct.

(0       gyrA
 1       rpoB
 2    Rv1313c
 3       tlyA
 Name: Gene, dtype: object,
 <bound method NDFrame.head of 
 0     Ser95Thr
 1    His445Asp
 2            .
 3     Leu11Leu
 Name: Aminoacids, dtype: object>)

What am I doing wrong? I should add that I'm new to programming and that I have posted the same question on Stack Overflow.

2.2 years ago
Shred ★ 1.6k

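Cross-posting questions is bad practice. As for the problem: only the last column is written to the CSV because the comma splits the line into two separate expressions, and .to_csv is chained onto the second one (the Aminoacids column) only. The Gene split is computed and then discarded.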

df.Gene.str.split('|').apply(get_element, position=0), df.Aminoacids.str.split('|').apply(get_element, position=0).str.replace('p.','').to_csv(filename, index=False, sep='\t') 
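
To see the effect in isolation, here is a minimal sketch with two made-up rows in the same layout as the question's files. The io.StringIO buffer is only a stand-in for the output file (an assumption for illustration), and .str[0] replaces the get_element helper.

```python
import io
import pandas as pd

# Two made-up rows in the same layout as the files in the question
df = pd.DataFrame({
    "Gene": ["gyrA|Rv0007|ppiA", "rpoB|rpoC"],
    "Aminoacids": ["p.Ser95Thr|.|.", "p.His445Asp|."],
})

buffer = io.StringIO()  # stands in for the output file

# The comma builds a tuple of two expressions;
# .to_csv is chained onto the second Series only
result = (
    df.Gene.str.split("|").str[0],
    df.Aminoacids.str.split("|").str[0]
      .str.replace("p.", "", regex=False)
      .to_csv(buffer, index=False, sep="\t"),
)

print(type(result[0]).__name__)  # the Gene Series is computed but never written
print(result[1])                 # to_csv returns None when given a buffer
print(buffer.getvalue())         # only the Aminoacids column ends up here
```

The first element of the tuple is a perfectly good Series, which is why the interactive window shows both columns, but nothing ever writes it anywhere.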

I'd rather modify the DataFrame columns first and then export the whole DataFrame to CSV, for example:

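# run these lines inside the loop, after reading each file into df
df['Gene'] = df.Gene.str.split('|').str[0]
# regex=False so that 'p.' is matched literally rather than as a regular expression
df['Aminoacids'] = df.Aminoacids.str.split('|').str[0].str.replace('p.', '', regex=False)
df.to_csv(filename, index=False, sep='\t')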
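
Note that in the question's script, df only holds the last file that was read, so the second loop would write that one table into every file. A minimal sketch of doing everything per file could look like the following; clean_file is a hypothetical helper name, and the "PLAY.TEST" folder is a placeholder for the real path in the question.

```python
import glob
import os
import pandas as pd

def clean_file(filename):
    """Rename the headers, keep the first '|' field, strip 'p.', overwrite the file."""
    df = pd.read_csv(filename, sep="\t")
    # A missing 'P' column is ignored, so re-running on a cleaned file is harmless
    df = df.rename(columns={"P": "Aminoacids"})
    df["Gene"] = df["Gene"].str.split("|").str[0]
    # regex=False treats 'p.' literally; as a regex, '.' would match any character
    df["Aminoacids"] = (
        df["Aminoacids"].str.split("|").str[0].str.replace("p.", "", regex=False)
    )
    df.to_csv(filename, index=False, sep="\t")
    return df

# Placeholder folder (an assumption); substitute the real path from the question
path = "PLAY.TEST"
for filename in glob.glob(os.path.join(path, "*_G.P.vcf")):
    clean_file(filename)
```

Reading, transforming and writing inside one function keeps each file's data separate, so no DataFrame is accidentally reused across files.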

I think a non-Pandas solution may be faster.

import os

for filename in all_files:
    tmp_name = f"{filename}_replaced.csv"
    with open(filename, 'r') as iput, open(tmp_name, 'w') as oput:
        next(iput, None)  # skip the original header
        oput.write("Gene\tAminoacids\n")
        for line in iput:
            fields = line.rstrip('\n').split('\t')
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            gene = fields[0].split('|')[0]
            # keep the first annotation; rows without 'p.' (e.g. '.') pass through unchanged
            aa = fields[1].split('|')[0].replace('p.', '')
            oput.write(f"{gene}\t{aa}\n")
    # overwrite the original file with the cleaned version
    os.replace(tmp_name, filename)

Thank you Shred! Your suggestion works fine!
