Importing multiple files, and extracting unique valus to create a Dictionary of terms in r
0
0
Entering edit mode
2.2 years ago

I am currently working with several hundred files that I have stored in a single folder. These files, which are in ".tsv" format, all containing the same metabolic pathway information. Each of these files has the same number of columns (a total of 11).

The main idea is that these files contain the molecules involved in each pathway, which I need to encode for their representation. This encoding would be the combination of: the name of the file (containing the name of the metabolic pathway) + a sequential number for each unique term in each pathway.

An example of how the content of the files would look like would be the following:

|filenames                                             |
|:-----------------------------------------------------|
|PA145011109-Atorvastatin_Pathway_Pharmacokinetics.tsv |
|PA145011110-Pravastatin_Pathway_Pharmacokinetics.tsv  |
|PA145011111-Fluvastatin_Pathway_Pharmacokinetics.tsv  |
|...                                                   |

That means that from the above example, in the case of "Atorvastatin" I would be left with PA145011109 + "-" + sequential numbers (e.g. 01, 02, 03....).

My intention would be to: 1) to be able to automate the process of reading all the the files into different data frames, 2) while adding to each data frame a new columns the corresponding file name, 3) to extract unique values to be able to create a dictionary of terms for the encoding.

In my mind the process would be as follows:

1) Loop to import all the files:

```
##### Import multiple files all at once 
#Read files named PA___.tsv
filenames <- list.files(path = table_pgkb,
                        pattern="PA+.*tsv")

#Create list of data frame names without the ".tsv" part 
fnames <-str_remove(filenames, pattern = "\\.tsv$")

#Create a list of the codes for every path using rebus
pattern = "PA" %R% one_or_more(DGT)
pathnames <- str_extract(fnames, pattern)

#Load all files and unify format
for(i in fnames){
  filepath <- file.path(table_pgkb,paste(i,".tsv",sep=""))
  assign(i, read.delim(filepath,
                       colClasses=c(rep("character",4), rep("NULL", 7)),
                       sep = "\t"))
}
```  

After this, every data frame I get look like this (they all have the same exact format):

From To Genes
atorvastatin lactone 2-hydroxyatorvastatin lactone CYP3A4, CYP3A5
atorvastatin lactone 4-hydroxyatorvastatin lactone CYP3A4.
atorvastatin. 2-hydroxyatorvastatin. CYP2C8.
...

2) The idea is to include the file name to all the data frames, so each one looks like this after the loop for read.delim:

filename From To Genes
PA145011109 atorvastatin lactone 2-hydroxyatorvastatin lactone CYP3A4, CYP3A5
PA145011109 atorvastatin lactone 4-hydroxyatorvastatin lactone CYP3A4.
PA145011109 atorvastatin 2-hydroxyatorvastatin CYP2C8, CYP3A4
...

3) Then I extract unique values:

```
##### Extract unique values from each data frame 
PA145011109 <- unique(c(PA145011109$From, PA145011109$To)) %>%
      sort() %>%
      as.data.frame()
```   

My final aim is to get a dictionary of terms, as follows:

coded.mol molecules.
PA145011109-01 atorvastatin lactone
PA145011109-02 2-hydroxyatorvastatin lactone
PA145011109-03 4-hydroxyatorvastatin lactone
PA145011109-04 atorvastatin
PA145011109-05 2-hydroxyatorvastatin
...

I have never done anything like this with multiple files, so I may be making the analysis much more complicated than it should be.

I have tried the alternatives proposed here (https://stackoverflow.com/questions/59875767/insert-a-column-with-file-name), here (https://stackoverflow.com/questions/54887482/r-adding-filepath-to-each-row-of-a-dataframe), and here (https://stackoverflow.com/questions/54887482/r-adding-filepath-to-each-row-of-a-dataframe) combined with (https://stackoverflow.com/questions/44302912/read-multiple-csv-data-and-create-new-columns-at-one-time/44304004#44304004), (https://stackoverflow.com/questions/58956818/read-multiple-files-create-a-data-frame-and-add-a-new-column-containing-the-nam), and (https://stackoverflow.com/questions/58956818/read-multiple-files-create-a-data-frame-and-add-a-new-column-containing-the-nam), but donĀ“t get to make it work inside the loop.

Also have tried (https://stackoverflow.com/questions/38042226/create-a-new-data-frame-that-will-act-as-a-dictionary-with-key-and-value-pairs), (https://stackoverflow.com/questions/53623416/r-transform-dataframe-column-using-dictionary-list), and (https://stackoverflow.com/questions/64111179/how-to-use-a-dictionary-for-a-large-data-frame-in-r) to create the dictionary.

I think it might have more to do with my approach to the problem. I don't know what your thoughts of the whole process are.

Any suggestion would be appreciate. Thank you all for your time.

dataframe r dictionary • 727 views
ADD COMMENT
1
Entering edit mode

If I understand correctly you want something like this.

library("tidyverse")

files <- list.files(table_pgkb, pattern="PA.*\\.tsv$", full.name=TRUE)
names(files) <- str_remove(basename(files), "\\.tsv$")

df <- files |>
  map(read_tsv) |>
  bind_rows(.id="filename") |>
  pivot_longer(c(From, To), values_to="molecules") |>
  select(filename, molecules) |>
  distinct() |>
  group_by(filename) |>
  mutate(coded.mol=str_c(filename, str_pad(seq(1, n(), 1), side="left", width=3, pad=0), sep="-")) |>
  ungroup()
ADD REPLY
0
Entering edit mode

Thanks for your promt answer!

From this im getting:

Error in dplyr::bind_rows(): ! Can't combine PMIDs <double> and PMIDs <character>.

The column "PMID" in my files refers to the PubMed reference number (eg 63536). It

I've tried to work around this specifying the arguments for

map(read_tsv) #Here I tried to col_select or col_types

even trying to state arguments as

map(read_delim)

but it does not seem to work.

ADD REPLY
1
Entering edit mode

What is probably happening is that in some of the files PMID contains non-numeric values so it's not coerced to a numeric data type. You can get around this by just converting that column to character for all the samples. To do this just add map(mutate, PMID=as.character(PMID)) after the map(read_tsv) line.

You can also just convert every column to a character too map(mutate, across(everything(), as.character)).

ADD REPLY

Login before adding your answer.

Traffic: 2402 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6