Hello everyone, I have the following two .csv files (comma-separated):
A working_file:
genus,species,column3,column4,column5,column6,column7
Staphylococcus,aureus,40000,3.0,7.0,6.0,3.0
Neisseria,gonorrhoea,2300,40.0,1.0,3.0,4.4
Vibrio,cholerae,2961,0,47.7,0.0,3.1,0.8
Pseudomonas,aeruginosa,64404,0,66.6,0.0,2.8,8.0
...
A taxonomy_file
domain,phylum,class,order,family,genus,species
Bacteria,Firmicutes,Bacilli,Bacillales,Staphylococcaceae,Staphylococcus,aureus
Bacteria,Firmicutes,Bacilli,Bacillales,Staphylococcaceae,Staphylococcus,capitis
Bacteria,Firmicutes,Bacilli,Bacillales,Staphylococcaceae,Staphylococcus,saprophyticus
Bacteria,Proteobacteria,Gammaproteobacteria,Pseudomonadales,Pseudomonadaceae,Pseudomonas,aeruginosa
Bacteria,Proteobacteria,Gammaproteobacteria,Pseudomonadales,Pseudomonadaceae,Pseudomonas,brassicacearum
...
I would like to have a script (Python, R, Perl or Bash) which loops through the working_file line by line. Whenever the entries in columns 1 and 2 of the working_file match the content of columns 6 and 7 of the taxonomy_file, I want to add the taxonomy information (domain,phylum,class,order,family) as extra columns to the working_file.
Output file
domain,phylum,class,order,family,genus,species,column3,column4,column5,column6,column7
Bacteria,Firmicutes,Bacilli,Bacillales,Staphylococcaceae,Staphylococcus,aureus,40000,3.0,7.0,6.0,3.0
...
Do you have any idea how to do that? Thank you very much in advance!
Hello and welcome to biostars mariemadlen,
Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time. Thank you!
Ah, I was wondering how to do that! This looks much better. Thank you very much indeed!
Do you just want to print out lines where there is a match? Or what should happen if there is no match?
It would be helpful in the case of non-matching lines, if "NA" is printed into the otherwise empty columns. In this case, I know which species are still missing in my taxonomy file. My overall goal is to make the taxonomy file as complete as possible, so that I get an outcome for each species in the end.
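One way to approach this in Python is sketched below, using only the standard csv module. It builds a lookup from (genus, species) to the five higher ranks, then prepends those ranks to each row of the working_file, falling back to "NA" when there is no match. The function name annotate and the inline sample data are made up for illustration; with real files you would pass open file handles instead of io.StringIO objects. It assumes both files have the header rows shown in the question.

```python
import csv
import io

# Ranks to copy over from the taxonomy_file (column names taken from its header).
TAX_RANKS = ["domain", "phylum", "class", "order", "family"]


def annotate(working_lines, taxonomy_lines):
    """Return output rows (header first) with the taxonomy ranks prepended.

    Both arguments are iterables of CSV lines, e.g. open file handles.
    """
    # Map (genus, species) -> [domain, phylum, class, order, family].
    lookup = {
        (row["genus"], row["species"]): [row[rank] for rank in TAX_RANKS]
        for row in csv.DictReader(taxonomy_lines)
    }

    reader = csv.reader(working_lines)
    header = next(reader)
    out = [TAX_RANKS + header]
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or truncated lines
        # Species missing from the taxonomy_file get "NA" in every rank column.
        ranks = lookup.get((row[0], row[1]), ["NA"] * len(TAX_RANKS))
        out.append(ranks + row)
    return out


# Small sample taken from the question; real input would come from files.
working = io.StringIO(
    "genus,species,column3,column4,column5,column6,column7\n"
    "Staphylococcus,aureus,40000,3.0,7.0,6.0,3.0\n"
    "Neisseria,gonorrhoea,2300,40.0,1.0,3.0,4.4\n"
)
taxonomy = io.StringIO(
    "domain,phylum,class,order,family,genus,species\n"
    "Bacteria,Firmicutes,Bacilli,Bacillales,Staphylococcaceae,Staphylococcus,aureus\n"
)

rows = annotate(working, taxonomy)
for row in rows:
    print(",".join(row))
```

Rows that come out starting with NA are the species still missing from the taxonomy_file, so filtering the output on the first column gives the list of entries to add.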
Hello !
No offense at all, but if you are just starting out in bioinformatics, I suggest you learn a text manipulation language such as Python or Perl. You will run into this kind of problem every day, and knowing Python or Perl will save you a lot of time. For example, see how to read and write files in Python.
Also, you can take a look at the Unix commands (awk, sed...). I think your question can be solved with a one-line awk command.
Exactly, you are right. I started to learn and focus on Python and Linux/Bash a month ago, but I am still a beginner and it is difficult for me to understand how to tackle such problems. Therefore, I am very thankful for your suggestions until I have learned enough to solve these issues on my own.
You have 2 lines in your working_file (the Vibrio and Pseudomonas rows) which have 8 attributes instead of 7.
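A quick way to spot such rows before merging is to compare each row's field count against the header. The snippet below is only a sketch; find_ragged_rows is a hypothetical helper name, and the sample data is copied from the working_file in the question.

```python
import csv
import io


def find_ragged_rows(lines):
    """Return (line_number, field_count) for rows that do not match the header width."""
    reader = csv.reader(lines)
    expected = len(next(reader))  # number of columns in the header
    return [
        (reader.line_num, len(row))
        for row in reader
        if row and len(row) != expected
    ]


# Sample copied from the working_file in the question.
working = io.StringIO(
    "genus,species,column3,column4,column5,column6,column7\n"
    "Staphylococcus,aureus,40000,3.0,7.0,6.0,3.0\n"
    "Neisseria,gonorrhoea,2300,40.0,1.0,3.0,4.4\n"
    "Vibrio,cholerae,2961,0,47.7,0.0,3.1,0.8\n"
    "Pseudomonas,aeruginosa,64404,0,66.6,0.0,2.8,8.0\n"
)

ragged = find_ragged_rows(working)
print(ragged)  # [(4, 8), (5, 8)]
```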