Add taxonomy information of fileA to species list in fileB
3
0
Entering edit mode
6.1 years ago
aenna_p • 0

Hello everyone, I have the following two .csv files (comma-separated):

A working_file:

genus,species,column3,column4,column5,column6,column7
Staphylococcus,aureus,40000,3.0,7.0,6.0,3.0
Neisseria,gonorrhoea,2300,40.0,1.0,3.0,4.4
Vibrio,cholerae,2961,0,47.7,0.0,3.1,0.8
Pseudomonas,aeruginosa,64404,0,66.6,0.0,2.8,8.0

...

A taxonomy_file

domain,phylum,class,order,family,genus,species
Bacteria,Firmicutes,Bacilli,Bacillales,Staphylococcaceae,Staphylococcus,aureus
Bacteria,Firmicutes,Bacilli,Bacillales,Staphylococcaceae,Staphylococcus,capitis
Bacteria,Firmicutes,Bacilli,Bacillales,Staphylococcaceae,Staphylococcus,saprophyticus
Bacteria,Proteobacteria,Gammaproteobacteria,Pseudomonadales,Pseudomonadaceae,Pseudomonas,aeruginosa
Bacteria,Proteobacteria,Gammaproteobacteria,Pseudomonadales,Pseudomonadaceae,Pseudomonas,brassicacearum

...

I would like to have a script (Python, R, Perl or Bash), which loops through the working_file line-by-line. Whenever the entries in column 1 and 2 of the working_file match the content of column 6 and 7 of the taxonomy_file, I want to add the taxonomy information (domain,phylum,class,order,family) as extra columns to the working_file.

Output file

domain,phylum,class,order,family,genus,species,column3,column4,column5,column6,column7
Bacteria,Firmictues,Bacilli,Bacillales,Staphylococcaceae,Staphylococcus,aureus,40000,3.0,7.0,6.0,3.0

...

Do you have any idea how to do that? Thank you very much in advance!

metagenomics taxonomy • 2.0k views
ADD COMMENT
0
Entering edit mode

Hello and welcome to biostars mariemadlen,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

Thank you!

ADD REPLY
0
Entering edit mode

Ah, I was wondering how to do that! This looks much better. Thank you very much indeed!

ADD REPLY
0
Entering edit mode

Do you just want to print out lines where there is a match? Or what should happen if there is no match?

ADD REPLY
0
Entering edit mode

It would be helpful in the case of non-matching lines, if "NA" is printed into the otherwise empty columns. In this case, I know which species are still missing in my taxonomy file. My overall goal is to make the taxonomy file as complete as possible, so that I get an outcome for each species in the end.

ADD REPLY
0
Entering edit mode

Hello !

I am a young scientist who just started to work

No offense at all, but if you just start dealing with bioinformatics I suggest you to learn a text manipulation language as Python or Perl. Everyday you will have these kind of problematics and learning Python or Perl will save you a lot of time. For example see how to read and write in file in Python

Plus, you can take a look at the Unix commands (awk, sed...). I think your question can be solve in one line command in awk

ADD REPLY
0
Entering edit mode

Exactly, you are right. I started to learn and focus on Python and Linux/Bash a month ago, but I am still a beginner and it is difficult for me to understand how to tackle such problems. Therefore, I am very thankful to receive your suggestions until I learned enough to solve these issues on my own.

ADD REPLY
0
Entering edit mode

You got 2 lines in your working_file

Vibrio,cholerae,2961,0,47.7,0.0,3.1,0.8
Pseudomonas,aeruginosa,64404,0,66.6,0.0,2.8,8.0

Which have 8 attributes instead of 7

ADD REPLY
4
Entering edit mode
6.1 years ago

An awk solution:

$ awk -v FS="," -v OFS="," 'FNR==NR { working[$1 $2] = $3","$4","$5","$6","$7; next; } {if(working[$6 $7]) {print $0,working[$6 $7]} else {print $0,"NA","NA","NA","NA","NA"}}' working_file.txt taxonomy_file.txt
  • -v FS="," -v OFS="," defines the field seperator used in input and output file.

  • FNR==NR { working[$1 $2] = $3","$4","$5","$6","$7; next; }: as long as we read lines from the first file (working_file.txt) we create an associative array with the values of column 1 and 2 as key and the other columns as values.

  • {if(working[$6 $7]) {print $0,working[$6 $7]} else {print $0,"NA","NA","NA","NA","NA"}}': if we iterate over the second file (taxonomy_file.txt) we take the 6th and 7th column to create the key name and check if this exists in the array we created before. If so, we print out the whole line and append the values from working_file.txt. If not we fill the missed vallues by NA.

fin swimmer

ADD COMMENT
0
Entering edit mode

A good answer and well explained.

I've tried a join-approach but it'll be way more complicated, due to sorting and key-merging.

ADD REPLY
0
Entering edit mode

This was a very, very helpful answer! Due to your explanation, I was able to re-arrange the code in a way that it does exactly what I wanted initially. Thank you very, very much. This is the final code now:

$ awk -v FS="," -v OFS="," 'FNR==NR { taxo[$6 $7] = $1","$2","$3","$4","$5; next; } {if(taxo[$1 $2]) {print taxo[$1 $2],$0} else {print "NA","NA","NA","NA","NA",$0}}' taxonomy_file.txt working_file.txt > output_file.txt
ADD REPLY
2
Entering edit mode
6.1 years ago

Try csvtk join (left join).

$ csvtk join -k -f genus,species working_file taxonomy_file \
    | csvtk cut -f domain,phylum,class,order,family,genus,species,column3,column4,column5,column6,column7
ADD COMMENT
0
Entering edit mode
6.1 years ago

A Python solution with comments, using pandas dataframes

#Import pandas
import pandas as pd

#Read each file as dataframe
#sep=',' => attributes separator is ,
#decimal='.' => number separator is .
#header=0 => The first line of the file will be used as column values
df_working = pd.read_csv("working_file.txt", sep=',', decimal=".", header=0)
df_taxonomy = pd.read_csv("taxonomy_file.txt", sep=',', decimal=".", header=0)

#Merge the two dataframes with a left joining, on genus and species and apply the result to df variable
df = df_taxonomy.merge(df_working, left_on=['genus','species'], right_on=['genus','species'], how='left')

#Write dataframe into output file
#sep=',' => attributes separator is ,
#header=True => Column names will be output
#na_rep='NA' => Cells without value will be output as NA
#index=False => Remove the index of the dataframe from the output
df.to_csv("output_file.txt", sep=',', header=True, na_rep='NA', index=False)

I also removed the last attribute of those two lines

Vibrio,cholerae,2961,0,47.7,0.0,3.1,0.8
Pseudomonas,aeruginosa,64404,0,66.6,0.0,2.8,8.0
ADD COMMENT

Login before adding your answer.

Traffic: 1621 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6