Question

Read a text file into a list

0

7.9 years ago

Assa Yeroslaviz ★ 1.9k

Hi,

I have a text file I would like to read into a list structure in R.

The file looks something like this (it could be described as a list of data frames):

[[1]]
                   NAME  MEM.SHIP
FBgn0037415 FBgn0037415 0.8035441
FBgn0010812 FBgn0010812 0.6579683
FBgn0265351 FBgn0265351 0.6443309
...
[[3]]
                   NAME  MEM.SHIP
FBgn0037227 FBgn0037227 0.9997242
FBgn0040682 FBgn0040682 0.9997242
...
[[9]]
                   NAME  MEM.SHIP
FBgn0026620 FBgn0026620 0.5241095
FBgn0263619 FBgn0263619 0.5420427
FBgn0263353 FBgn0263353 0.9812295
FBgn0037424 FBgn0037424 0.9793901
FBgn0037428 FBgn0037428 0.9779420
FBgn0037430 FBgn0037430 0.9540148
FBgn0004777 FBgn0004777 0.8962534
FBgn0004778 FBgn0004778 0.9810570
...

I would like to end up with a list structure like this:

> str(INPUT)
List of 3
 $ : Factor w/ 223 levels "FBgn*****",..: 194 129 222 213 42 130 45 131 132 133 ...
 $ : Factor w/ 210 levels "FBgn*****",..: 185 109 110 146 171 175 111 17 112 209 ...
 $ : Factor w/ 343 levels "FBgn*****",..: 27 296 326 228 229 263 19 39 230 26

I am reading the file in with scan(), but I just get a single character vector with all the elements together. Is there a way to split the text file into a list on the pattern [[.*]] and then extract only the first column from each data frame?

Thanks in advance,

Assa

r list • 2.6k views

ADD COMMENT • link updated 7.9 years ago by zx8754 12k • written 7.9 years ago by Assa Yeroslaviz ★ 1.9k

0

Do you have any reason for not using a data.frame? That's a more straightforward data-container imho.

ADD REPLY • link 7.9 years ago by WouterDeCoster 47k

0

Yes, I know, and I wish I could. I can't change the input files; this is how I got them. I think this is a list structure that was exported to a text file.

ADD REPLY • link 7.9 years ago by Assa Yeroslaviz ★ 1.9k

0

Hello Assa Yeroslaviz!

We believe that this post does not fit the main topic of this site.

Not a bioinformatics question. Please ask on Stack Overflow.

For this reason we have closed your question. This allows us to keep the site focused on the topics that the community can help with.

If you disagree please tell us why in a reply below, we'll be happy to talk about it.

Cheers!

ADD REPLY • link 7.9 years ago by Ram 44k

score 1 · Answer 1 · 2017-04-14

This is the ugliest solution ever. I hope the output is what the OP had in mind.
If anyone knows how to improve this code, please help.

    pathInput <- "FILE"
    library(data.table)
    library(tidyverse)

    # Read in
    d <- pathInput %>%
        read_delim(delim = "[[") %>%
        setDT()
    # This has to be done on a separate object
    # (I don't know why)
    d2 <- as.data.table(d[, X1]) %>%
        .[!is.na(V1)] %>%
        .[, col1 := sapply(strsplit(V1, " "), "[[", 1)] %>%
        .[, col2 := sapply(strsplit(V1, " "), "[[", 2)] %>%
        .[, col3 := sapply(strsplit(V1, " "), "[[", 3)]
    # Rows where a new table starts (the header lines)
    foo <- grep("NAME", d2$V1)
    # Write each table to a list element
    res <- list()
    for (i in seq_along(foo)) {
        if (i < length(foo)) {
            # rows between this header and the next one
            res[[i]] <- d2[(foo[i] + 1):(foo[i + 1] - 1), .(col1, col2, col3)]
        } else {
            # last table: rows from this header to the end
            res[[i]] <- d2[(foo[i] + 1):nrow(d2), .(col1, col2, col3)]
        }
    }

res returns:

[[1]]
          col1        col2      col3
1: FBgn0037415 FBgn0037415 0.8035441
2: FBgn0010812 FBgn0010812 0.6579683
3: FBgn0265351 FBgn0265351 0.6443309

[[2]]
          col1        col2      col3
1: FBgn0037227 FBgn0037227 0.9997242
2: FBgn0040682 FBgn0040682 0.9997242

[[3]]
          col1        col2      col3
1: FBgn0026620 FBgn0026620 0.5241095
2: FBgn0263619 FBgn0263619 0.5420427
3: FBgn0263353 FBgn0263353 0.9812295
4: FBgn0037424 FBgn0037424 0.9793901
5: FBgn0037428 FBgn0037428 0.9779420
6: FBgn0037430 FBgn0037430 0.9540148
7: FBgn0004777 FBgn0004777 0.8962534
8: FBgn0004778 FBgn0004778 0.9810570

From here you can create your tidy tables.
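
To get from such a list of tables to the list of factors shown in the question, one option is to pull out the first column of each table. This is only a minimal sketch: the small `res` built here is a hypothetical stand-in for the real result (plain data frames instead of data.tables), and `INPUT` is just the name used in the question.

```r
# Hypothetical stand-in for `res`: a list of tables with columns col1, col2, col3
res <- list(
  data.frame(col1 = c("FBgn0037415", "FBgn0010812"),
             col2 = c("FBgn0037415", "FBgn0010812"),
             col3 = c("0.8035441", "0.6579683"),
             stringsAsFactors = FALSE),
  data.frame(col1 = "FBgn0037227",
             col2 = "FBgn0037227",
             col3 = "0.9997242",
             stringsAsFactors = FALSE)
)

# Keep only the first column of each table and convert it to a factor
INPUT <- lapply(res, function(tbl) factor(tbl[["col1"]]))
str(INPUT)
```

`[[` is used for the column lookup because it behaves the same way on a data.frame and on a data.table.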

score 1 · Answer 2 · 2017-04-14

Another, cleaner version:

# read as lines, every line is a character
x <- readLines("myFile.txt")

# split on "[["
x <- split(x, cumsum(grepl("[[", x, fixed = TRUE)))

# tidy up
clean <-
  lapply(x, function(i){
    # get list id
    listID <- as.numeric(gsub("\\D+", "", i[1]))
    # column names
    header <- unlist(strsplit(gsub("\\s+", " ", trimws(i[2])), " "))
    # split on " " and convert to dataframe
    res <- as.data.frame(do.call(rbind, strsplit(tail(i, -2), " ")))[, 2:3]
    # add name and list id
    colnames(res) <- header
    res$ListID <- listID
    res
  })
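
If only the first column is needed, as a list of factors like the `str(INPUT)` output in the question, the same split-on-marker idea can be shortened. The following is a sketch under assumptions, not a tested solution for the real file: the temporary file is a made-up sample in the format shown in the question, and it is assumed that every group starts with a `[[n]]` line followed by one header line, and that lines consisting of `...` are elisions to skip.

```r
# Made-up sample in the format from the question, written to a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "[[1]]",
  "                   NAME  MEM.SHIP",
  "FBgn0037415 FBgn0037415 0.8035441",
  "FBgn0010812 FBgn0010812 0.6579683",
  "...",
  "[[3]]",
  "                   NAME  MEM.SHIP",
  "FBgn0037227 FBgn0037227 0.9997242"
), tmp)

x <- readLines(tmp)

# A new group starts at every line of the form [[n]]
is_marker <- grepl("^\\[\\[[0-9]+\\]\\]$", trimws(x))
groups <- split(x, cumsum(is_marker))

INPUT <- unname(lapply(groups, function(g) {
  # drop the [[n]] line and the header line
  rows <- trimws(tail(g, -2))
  # assumption: skip empty lines and "..." elision lines
  rows <- rows[nzchar(rows) & rows != "..."]
  # keep everything before the first whitespace, i.e. the first column
  factor(sub("\\s.*$", "", rows))
}))

str(INPUT)
```

The `cumsum()` over the marker flags gives every line the running count of markers seen so far, which is what lets `split()` cut the file into one chunk per `[[n]]` block.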

score 0 · Answer 3 · 2017-04-14

0

7.9 years ago

ivivek_ngs ★ 5.2k

There are plenty of ways to do that, I believe, depending on how you want to go about it. Take a look at these similar solutions on Stack Overflow.

Link1

Link2

Link3

You can come up with a solution from there. However, this is not a bioinformatics question, to be honest, so try Stack Overflow for such queries.

ADD COMMENT • link 7.9 years ago by ivivek_ngs ★ 5.2k

0

Thanks for the links. I had seen them all before I posted this question here. I tried them, but couldn't get what I was aiming for. I know there are supposedly plenty of ways to achieve it; I just can't do it myself.

IMHO this is still bioinformatics related: it relates to R, it works on biological data processed by bioinformatics tools, and it's related to my work. But I can post this question on Stack Overflow. I'll try it there. Thanks.

ADD REPLY • link 7.9 years ago by Assa Yeroslaviz ★ 1.9k

1

Hmm, it does fall on the border, but this is more CS than actual bioinformatics. Anyway, I'll reopen the question.

ADD REPLY • link 7.9 years ago by Ram 44k

score 0 · Answer 4 · 2017-04-14

0

7.9 years ago

VHahaut ★ 1.2k

Can you not solve that using the split command? Something like this:

    # Read the data (whitespace-separated, so no sep argument is needed):
    a <- list(
        read.table(header = TRUE, text = "rows NAME MEM.SHIP
        FBgn0026620 FBgn0026620 0.5241095
        FBgn0263619 FBgn0263619 0.5420427
        FBgn0263353 FBgn0263353 0.9812295
        FBgn0037424 FBgn0037424 0.9793901
        FBgn0037428 FBgn0037428 0.9779420
        FBgn0037430 FBgn0037430 0.9540148"),
        read.table(header = TRUE, text = "rows NAME MEM.SHIP
        FBgn0037415 FBgn0037415 0.8035441
        FBgn0037430 FBgn0037430 0.6579683
        FBgn0265351 FBgn0265351 0.6443309"))

    # Combine the list elements into one data frame:
    b <- do.call(rbind, a)

    # Extract the MEM.SHIP values for each NAME:
    sapply(split(b, b$NAME), function(x) x["MEM.SHIP"])

    # Which gives you this (excerpt):

        $FBgn0026620.MEM.SHIP
        [1] 0.5241095

        $FBgn0037424.MEM.SHIP
        [1] 0.9793901

        $FBgn0037428.MEM.SHIP
        [1] 0.977942

        $FBgn0037430.MEM.SHIP
        [1] 0.9540148 0.6579683

ADD COMMENT • link 7.9 years ago by VHahaut ★ 1.2k

0

Thanks for the advice, but as I mentioned above, this is not what I want to achieve. I would like to keep each group as one vector in a list of vectors.
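
For illustration, a minimal sketch of that target structure, assuming the groups are already available as a list of data frames like `a` in the answer above (the two small data frames and the name `group_ids` are made up for the example):

```r
# Hypothetical list of data frames, one per group
a <- list(
  read.table(header = TRUE, text = "rows NAME MEM.SHIP
FBgn0026620 FBgn0026620 0.5241095
FBgn0263619 FBgn0263619 0.5420427
FBgn0263353 FBgn0263353 0.9812295"),
  read.table(header = TRUE, text = "rows NAME MEM.SHIP
FBgn0037415 FBgn0037415 0.8035441
FBgn0037430 FBgn0037430 0.6579683")
)

# One factor of gene IDs per group; the groups are not merged
group_ids <- lapply(a, function(df) factor(df$NAME))
str(group_ids)
```

The difference from the `rbind()` plus `split()` approach is that the data frames are never combined, so the original grouping is preserved and each list element holds all IDs of one group.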

ADD REPLY • link 7.9 years ago by Assa Yeroslaviz ★ 1.9k

0

Maybe I still don't understand, but have you simply tried running str() on my output?

str(sapply(split(do.call(rbind, a), do.call(rbind, a)$NAME), function(x) x["MEM.SHIP"]))

List of 8
 $ FBgn0026620.MEM.SHIP: num 0.524
 $ FBgn0037424.MEM.SHIP: num 0.979
 $ FBgn0037428.MEM.SHIP: num 0.978
 $ FBgn0037430.MEM.SHIP: num [1:2] 0.954 0.658

And if you want factors instead of numerical values:

str(sapply(sapply(split(do.call(rbind, a), do.call(rbind, a)$NAME), function(x) x["MEM.SHIP"]), function(y) as.factor(y)))
List of 8
 $ FBgn0026620.MEM.SHIP: Factor w/ 1 level "0.5241095": 1
 $ FBgn0037424.MEM.SHIP: Factor w/ 1 level "0.9793901": 1

ADD REPLY • link 7.9 years ago by VHahaut ★ 1.2k