Separating a column in a list of files using purrr in R
2.8 years ago
pramach1 ▴ 40

I have 108 files of BLAST output. I read them into a list of data frames and filter each one on percent identity and query coverage.

fnames <- list.files()

# Read every file into a data frame and record which file it came from
data <- lapply(fnames, function(x) {
  res <- read.table(x, header = TRUE, sep = "\t", quote = "", fill = FALSE)
  res$sample <- x
  res
})

colnames <- c("qseqid", "sseqid", "stitle", "pident", "qcovs", "Sample")

out <- lapply(data, setNames, colnames)
data <- lapply(out, "[", 3:6)

# Keep hits with query coverage > 90 and percent identity > 90
data1 <- lapply(data, function(x) x[x$qcovs > 90, ])
data2 <- lapply(data1, function(x) x[x$pident > 90, ])

After this, I want to split the stitle column on the opening square bracket and on the "|" character. How do I do that for every data frame in the list?

Here is the example of the stitle column.

gb|AM260957.1|+|4186-5086|ARO:3003071|mphF [uncultured bacterium] 
gb|NC_008618.1|-|1667063-1670624|ARO:3004480|Bifidobacterium adolescentis rpoB conferring resistance to rifampicin [Bifidobacterium adolescentis] 
gb|AP006618.1|+|4835199-4838688|ARO:3000501|Nocardia rifampin resistant beta-subunit of RNA polymerase (rpoB2) [Nocardia farcinica IFM 10152] 
gb|AY043299.1|-|3984-5175|ARO:3000167|tet(C) [Aeromonas salmonicida] 
gb|AB571865.1|-|144312-145536|ARO:3003745|mefC [Photobacterium damselae subsp. damselae] 
gb|AE004091.2|+|2810008-2813197|ARO:3000804|MexF [Pseudomonas aeruginosa PAO1] 
gb|AB219524.1|+|1176-4338|ARO:3003699|mexQ [Pseudomonas aeruginosa] 

I want the column split based on "|" and tab. Thank you for the help.

purr output BLAST R • 1.2k views
2.8 years ago

Example data.

df <- structure(list(V1 = c("gb|AM260957.1|+|4186-5086|ARO:3003071|mphF [uncultured bacterium] ", 
"gb|NC_008618.1|-|1667063-1670624|ARO:3004480|Bifidobacterium adolescentis rpoB conferring resistance to rifampicin [Bifidobacterium adolescentis] ", 
"gb|AP006618.1|+|4835199-4838688|ARO:3000501|Nocardia rifampin resistant beta-subunit of RNA polymerase (rpoB2) [Nocardia farcinica IFM 10152] ", 
"gb|AY043299.1|-|3984-5175|ARO:3000167|tet(C) [Aeromonas salmonicida] ", 
"gb|AB571865.1|-|144312-145536|ARO:3003745|mefC [Photobacterium damselae subsp. damselae] ", 
"gb|AE004091.2|+|2810008-2813197|ARO:3000804|MexF [Pseudomonas aeruginosa PAO1] ", 
"gb|AB219524.1|+|1176-4338|ARO:3003699|mexQ [Pseudomonas aeruginosa] "
)), class = "data.frame", row.names = c(NA, -7L))

A tidyverse answer. Since you have a list of data frames, wrap this call in a function and apply it to each element with lapply or purrr::map.

library("tidyr")

separate(df, 1, into=LETTERS[1:7], sep="\\s(?=\\[)|\\|")

  A     B           C     D               E           F               G         
  <chr> <chr>       <chr> <chr>           <chr>       <chr>           <chr>     
1 gb    AM260957.1  +     4186-5086       ARO:3003071 mphF            "[uncultu…
2 gb    NC_008618.1 -     1667063-1670624 ARO:3004480 Bifidobacteriu… "[Bifidob…
3 gb    AP006618.1  +     4835199-4838688 ARO:3000501 Nocardia rifam… "[Nocardi…
4 gb    AY043299.1  -     3984-5175       ARO:3000167 tet(C)          "[Aeromon…
5 gb    AB571865.1  -     144312-145536   ARO:3003745 mefC            "[Photoba…
6 gb    AE004091.2  +     2810008-2813197 ARO:3000804 MexF            "[Pseudom…
7 gb    AB219524.1  +     1176-4338       ARO:3003699 mexQ            "[Pseudom…
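
To see what the separator pattern does, here is a minimal base R sketch on one stitle value from the question. It uses strsplit with perl = TRUE as a stand-in for separate, on the assumption that both treat this regex the same way: split on "|", or on a whitespace character that is immediately followed by "[".

```r
# One stitle value taken from the question (note the trailing space).
x <- "gb|AY043299.1|-|3984-5175|ARO:3000167|tet(C) [Aeromonas salmonicida] "

# "\\|" matches a literal pipe; "\\s(?=\\[)" matches whitespace only when
# the next character is "[". The lookahead needs perl = TRUE in base R.
parts <- strsplit(x, "\\s(?=\\[)|\\|", perl = TRUE)[[1]]

length(parts)
parts

# The last piece keeps the trailing space; trimws() removes it if unwanted.
trimws(parts[length(parts)])
```

Spaces inside the gene description or the organism name are left alone because they are not followed by "[", which is why the string always breaks into 7 pieces here.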

My 108 files all have the same number of columns but different numbers of rows, ranging from 4,000 to 12,000. If I use the code above, doesn't that assume every file has the same number of rows and exactly the same information? Mine don't. So how would I separate/split column 1 (stitle) in all 108 files? Thank you, and I apologize for not being clear previously.


I think your confusion might come from into=LETTERS[1:7], since the example also happens to have 7 rows. You're splitting the stitle column into 7 separate columns, and that argument just tells the function to name the 7 new columns A through G. The function works for any number of rows.
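
As a quick check of that point, here is a small sketch with two toy data frames of different sizes. The names small and large are made up for illustration; the stitle values are copied from the question.

```r
library(tidyr)

# Two toy data frames with different numbers of rows.
small <- data.frame(
  stitle = "gb|AM260957.1|+|4186-5086|ARO:3003071|mphF [uncultured bacterium]"
)
large <- data.frame(stitle = c(
  "gb|AY043299.1|-|3984-5175|ARO:3000167|tet(C) [Aeromonas salmonicida]",
  "gb|AB571865.1|-|144312-145536|ARO:3003745|mefC [Photobacterium damselae subsp. damselae]",
  "gb|AB219524.1|+|1176-4338|ARO:3003699|mexQ [Pseudomonas aeruginosa]"
))

# `into` only names the 7 new columns; it says nothing about row counts.
split_small <- separate(small, stitle, into = LETTERS[1:7], sep = "\\s(?=\\[)|\\|")
split_large <- separate(large, stitle, into = LETTERS[1:7], sep = "\\s(?=\\[)|\\|")

dim(split_small)
dim(split_large)
```

Both results should have 7 columns, with the row count of each input left unchanged.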


I think I am doing something wrong. The first thing I did was

df <- structure(list(V1 = c("gb|AM260957.1|+|4186-5086|ARO:3003071|mphF [uncultured bacterium] ", 
                        "gb|NC_008618.1|-|1667063-1670624|ARO:3004480|Bifidobacterium adolescentis rpoB conferring resistance to rifampicin [Bifidobacterium adolescentis] ", 
                        "gb|AP006618.1|+|4835199-4838688|ARO:3000501|Nocardia rifampin resistant beta-subunit of RNA polymerase (rpoB2) [Nocardia farcinica IFM 10152] ", 
                        "gb|AY043299.1|-|3984-5175|ARO:3000167|tet(C) [Aeromonas salmonicida] ", 
                        "gb|AB571865.1|-|144312-145536|ARO:3003745|mefC [Photobacterium damselae subsp. damselae] ", 
                        "gb|AE004091.2|+|2810008-2813197|ARO:3000804|MexF [Pseudomonas aeruginosa PAO1] ", 
                        "gb|AB219524.1|+|1176-4338|ARO:3003699|mexQ [Pseudomonas aeruginosa] "
)), class = "data.frame", row.names = c(NA, -7L))

purrr::map

data3 <- separate(df, 1, into=LETTERS[1:7], sep="\\s(?=\\[)|\\|")

I ended up with a single data frame in which this column is split into 7 columns. It is not separating the stitle column in the list of all 108 files; it only creates a single data frame from the example data.

The list of files is shown here


If you want to get into data analysis in R, I would suggest reading R for Data Science by Hadley Wickham. It's going to be difficult to write R code without investing the time to learn it.

That said, the code I provided was an example and was not meant to be copied and pasted directly into your code. In your code it should look something like this.

data3 <- lapply(data2, \(x) separate(x, stitle, into=LETTERS[1:7], sep="\\s(?=\\[)|\\|"))
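
Since the question asked about purrr, the same idea can be sketched with purrr::map. The list below is a hypothetical stand-in for data2 (the file names and the descriptive column names in new_cols are assumptions, not part of the original data), and function(x) is used instead of \(x) so it also runs on R versions older than 4.1.

```r
library(tidyr)
library(purrr)

# Hypothetical stand-in for data2: a named list of filtered data frames.
data2 <- list(
  sampleA.tsv = data.frame(
    stitle = c(
      "gb|AE004091.2|+|2810008-2813197|ARO:3000804|MexF [Pseudomonas aeruginosa PAO1]",
      "gb|AB219524.1|+|1176-4338|ARO:3003699|mexQ [Pseudomonas aeruginosa]"
    ),
    Sample = "sampleA.tsv"
  ),
  sampleB.tsv = data.frame(
    stitle = "gb|AY043299.1|-|3984-5175|ARO:3000167|tet(C) [Aeromonas salmonicida]",
    Sample = "sampleB.tsv"
  )
)

# Assumed, more descriptive names for the 7 new columns.
new_cols <- c("db", "accession", "strand", "coords", "aro", "gene", "organism")

# Apply the same separate() call to every data frame in the list.
data3 <- map(data2, function(x) {
  separate(x, stitle, into = new_cols, sep = "\\s(?=\\[)|\\|")
})

# Optionally stack everything into one data frame; Sample keeps the origin.
combined <- do.call(rbind, data3)
dim(combined)
```

The result data3 is still a list with one data frame per file, so the earlier filtering steps and this split can be chained in any order.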