Question

Add taxonomy information of fileA to species list in fileB

0

Entering edit mode

6.5 years ago

aenna_p • 0

Hello everyone, I have the following two .csv files (comma-separated):

A working_file:

genus,species,column3,column4,column5,column6,column7
Staphylococcus,aureus,40000,3.0,7.0,6.0,3.0
Neisseria,gonorrhoea,2300,40.0,1.0,3.0,4.4
Vibrio,cholerae,2961,0,47.7,0.0,3.1,0.8
Pseudomonas,aeruginosa,64404,0,66.6,0.0,2.8,8.0

...

A taxonomy_file

domain,phylum,class,order,family,genus,species
Bacteria,Firmicutes,Bacilli,Bacillales,Staphylococcaceae,Staphylococcus,aureus
Bacteria,Firmicutes,Bacilli,Bacillales,Staphylococcaceae,Staphylococcus,capitis
Bacteria,Firmicutes,Bacilli,Bacillales,Staphylococcaceae,Staphylococcus,saprophyticus
Bacteria,Proteobacteria,Gammaproteobacteria,Pseudomonadales,Pseudomonadaceae,Pseudomonas,aeruginosa
Bacteria,Proteobacteria,Gammaproteobacteria,Pseudomonadales,Pseudomonadaceae,Pseudomonas,brassicacearum

...

I would like to have a script (Python, R, Perl or Bash), which loops through the working_file line-by-line. Whenever the entries in column 1 and 2 of the working_file match the content of column 6 and 7 of the taxonomy_file, I want to add the taxonomy information (domain,phylum,class,order,family) as extra columns to the working_file.

Output file

domain,phylum,class,order,family,genus,species,column3,column4,column5,column6,column7
Bacteria,Firmictues,Bacilli,Bacillales,Staphylococcaceae,Staphylococcus,aureus,40000,3.0,7.0,6.0,3.0

...

Do you have any idea how to do that? Thank you very much in advance!

metagenomics taxonomy • 2.4k views

ADD COMMENT • link 3.9 years ago by aenna_p • 0

0

Entering edit mode

Hello and welcome to biostars mariemadlen,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

Thank you!

ADD REPLY • link 6.5 years ago by finswimmer 16k

0

Entering edit mode

Ah, I was wondering how to do that! This looks much better. Thank you very much indeed!

ADD REPLY • link 6.5 years ago by aenna_p • 0

0

Entering edit mode

Do you just want to print out lines where there is a match? Or what should happen if there is no match?

ADD REPLY • link 6.5 years ago by finswimmer 16k

0

Entering edit mode

It would be helpful in the case of non-matching lines, if "NA" is printed into the otherwise empty columns. In this case, I know which species are still missing in my taxonomy file. My overall goal is to make the taxonomy file as complete as possible, so that I get an outcome for each species in the end.

ADD REPLY • link 6.5 years ago by aenna_p • 0

0

Entering edit mode

Hello !

I am a young scientist who just started to work

No offense at all, but if you just start dealing with bioinformatics I suggest you to learn a text manipulation language as Python or Perl. Everyday you will have these kind of problematics and learning Python or Perl will save you a lot of time. For example see how to read and write in file in Python

Plus, you can take a look at the Unix commands (awk, sed...). I think your question can be solve in one line command in awk

ADD REPLY • link 6.5 years ago by Bastien Hervé 6.3k

0

Entering edit mode

Exactly, you are right. I started to learn and focus on Python and Linux/Bash a month ago, but I am still a beginner and it is difficult for me to understand how to tackle such problems. Therefore, I am very thankful to receive your suggestions until I learned enough to solve these issues on my own.

ADD REPLY • link 6.5 years ago by aenna_p • 0

0

Entering edit mode

You got 2 lines in your working_file

Vibrio,cholerae,2961,0,47.7,0.0,3.1,0.8
Pseudomonas,aeruginosa,64404,0,66.6,0.0,2.8,8.0

Which have 8 attributes instead of 7

ADD REPLY • link 6.5 years ago by Bastien Hervé 6.3k

2

Entering edit mode

6.5 years ago

shenwei356 8.7k

Try csvtk join (left join).

$ csvtk join -k -f genus,species working_file taxonomy_file \
    | csvtk cut -f domain,phylum,class,order,family,genus,species,column3,column4,column5,column6,column7

ADD COMMENT • link 6.5 years ago by shenwei356 8.7k

0

Entering edit mode

6.5 years ago

Bastien Hervé 6.3k

A Python solution with comments, using pandas dataframes

#Import pandas
import pandas as pd

#Read each file as dataframe
#sep=',' => attributes separator is ,
#decimal='.' => number separator is .
#header=0 => The first line of the file will be used as column values
df_working = pd.read_csv("working_file.txt", sep=',', decimal=".", header=0)
df_taxonomy = pd.read_csv("taxonomy_file.txt", sep=',', decimal=".", header=0)

#Merge the two dataframes with a left joining, on genus and species and apply the result to df variable
df = df_taxonomy.merge(df_working, left_on=['genus','species'], right_on=['genus','species'], how='left')

#Write dataframe into output file
#sep=',' => attributes separator is ,
#header=True => Column names will be output
#na_rep='NA' => Cells without value will be output as NA
#index=False => Remove the index of the dataframe from the output
df.to_csv("output_file.txt", sep=',', header=True, na_rep='NA', index=False)

I also removed the last attribute of those two lines

Vibrio,cholerae,2961,0,47.7,0.0,3.1,0.8
Pseudomonas,aeruginosa,64404,0,66.6,0.0,2.8,8.0

ADD COMMENT • link 6.5 years ago by Bastien Hervé 6.3k

score 4 · Accepted Answer · 2018-12-04

4

Entering edit mode

6.5 years ago

finswimmer 16k

An awk solution:

$ awk -v FS="," -v OFS="," 'FNR==NR { working[$1 $2] = $3","$4","$5","$6","$7; next; } {if(working[$6 $7]) {print $0,working[$6 $7]} else {print $0,"NA","NA","NA","NA","NA"}}' working_file.txt taxonomy_file.txt

-v FS="," -v OFS="," defines the field seperator used in input and output file.
FNR==NR { working[$1 $2] = $3","$4","$5","$6","$7; next; }: as long as we read lines from the first file (working_file.txt) we create an associative array with the values of column 1 and 2 as key and the other columns as values.
{if(working[$6 $7]) {print $0,working[$6 $7]} else {print $0,"NA","NA","NA","NA","NA"}}': if we iterate over the second file (taxonomy_file.txt) we take the 6th and 7th column to create the key name and check if this exists in the array we created before. If so, we print out the whole line and append the values from working_file.txt. If not we fill the missed vallues by NA.

fin swimmer

ADD COMMENT • link 6.5 years ago by finswimmer 16k

0

Entering edit mode

A good answer and well explained.

I've tried a join-approach but it'll be way more complicated, due to sorting and key-merging.

ADD REPLY • link 6.5 years ago by michael.ante ★ 4.0k

0

Entering edit mode

This was a very, very helpful answer! Due to your explanation, I was able to re-arrange the code in a way that it does exactly what I wanted initially. Thank you very, very much. This is the final code now:

$ awk -v FS="," -v OFS="," 'FNR==NR { taxo[$6 $7] = $1","$2","$3","$4","$5; next; } {if(taxo[$1 $2]) {print taxo[$1 $2],$0} else {print "NA","NA","NA","NA","NA",$0}}' taxonomy_file.txt working_file.txt > output_file.txt

ADD REPLY • link 6.5 years ago by aenna_p • 0