Question

Parsing .txt file in R and creating a dataframe

3.1 years ago

pramach1 ▴ 40

I am trying to parse a .txt file, whose contents I have pasted here. It is the consolidated SeqSero output for Salmonella-enriched samples. Each file contributes 8 or 9 lines of output. I want to extract this information using R and write it to a CSV file; the expected result is pasted below. Because the number of lines differs between files (sometimes 8, sometimes 9), my code is not working. The code I am using is pasted below; it doesn't seem to do anything. Thank you for the help.

Output_directory:SeqSero_result_08_29_2019_12_13_293213245

Input files: 80_S20_L001_R1_001.fastq.gz

O antigen prediction: O-

H1 antigen prediction(fliC): -

H2 antigen prediction(fljB): -

Predicted antigenic profile: :-:-

Predicted subspecies:

Predicted serotype(s): N/A (The predicted antigenic profile does not exist in the White-Kauffmann-Le Minor scheme)

Output_directory:SeqSero_result_08_29_2019_12_13_369944611

Input files: 82_S22_L001_R1_001.fastq.gz

O antigen prediction: O-9

H1 antigen prediction(fliC): -

H2 antigen prediction(fljB): -

Predicted antigenic profile: 9:-:-

Predicted subspecies: I

Predicted serotype(s): Gallinarum

The serotype(s) is/are the only serotype(s) with the indicated antigenic profile currently recognized in the Kauffmann White Scheme. New serotypes can emerge and the possibility exists that this antigenic profile may emerge in a different subspecies. Identification of strains to the subspecies level should accompany serotype determination; the same antigenic profile in different subspecies is considered different serotypes.

Output_directory:SeqSero_result_08_29_2019_12_13_401740837

Input files: 83_S23_L001_R1_001.fastq.gz

O antigen prediction: O-

H1 antigen prediction(fliC): -

H2 antigen prediction(fljB): -

Predicted antigenic profile: :-:-

Predicted subspecies:

Predicted serotype(s): N/A (The predicted antigenic profile does not exist in the White-Kauffmann-Le Minor scheme)

The expected result is this data frame: [image of the expected data frame, not preserved]

This is the code I am using.

library(dplyr)
x1 <- bind_cols(lapply(
  split(x, cumsum(grepl("Input files|O antigen prediction|H1 antigen prediction|H2 antigen prediction|Predicted antigenic profile|Predicted subspecies|Predicted serotype(s)", x))),
  function(i) {
    i1 <- i[i != ";"]
    nums <- unlist(strsplit(tail(i1), ";"))
    res <- cbind.data.frame(Grp = i1[], matrix(na.omit(as.numeric(nums)), nrow = length(i1), byrow = TRUE), stringsAsFactors = FALSE)
    res
  }
))

But what I get back is the same text file, unchanged.
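One possible reason the text comes back unchanged: the pattern matches most of the field lines, so `cumsum(grepl(...))` starts a new group at almost every line and each group ends up holding a single line. Below is a minimal sketch of grouping by record instead. It assumes every record begins with an `Output_directory` line and that each field sits on its own line; the vector `x` is a shortened stand-in for what `readLines()` would return on the real file, and the shortened pattern in `per_field` is only for illustration.

```r
# Shortened stand-in for x <- readLines(<your file>); values copied from the question
x <- c(
  "Output_directory:SeqSero_result_08_29_2019_12_13_293213245",
  "Input files: 80_S20_L001_R1_001.fastq.gz",
  "O antigen prediction: O-",
  "Output_directory:SeqSero_result_08_29_2019_12_13_369944611",
  "Input files: 82_S22_L001_R1_001.fastq.gz",
  "O antigen prediction: O-9"
)

# A pattern that matches the field lines: the counter increases at nearly
# every line, so split() would return mostly one-line groups
per_field <- cumsum(grepl("Input files|O antigen prediction", x))

# A pattern that matches only the first line of a record: the counter
# increases once per sample
per_record <- cumsum(grepl("^Output_directory", x))

# One list element per sample, each holding that sample's lines
records <- split(x, per_record)
```

With the grouping done per record, each element of `records` can then be parsed on its own, regardless of whether it has 8 or 9 lines.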

R SeqSero csv • 1.3k views

3.0 years ago by pramach1 ▴ 40

This question is unanswerable as written. There is no reference to a text file in your code, and you haven't written it in a way that lets us see what you are reading in or the structure of what you would like to parse. Also, you haven't added anything that would make this relevant to bioinformatics. Please go over your code line by line, and rephrase your question to be more understandable. See also: How to ask good questions in a scientific or technical forum.

3.1 years ago by seidel 11k

I have edited my question. Hope this is clear.

3.1 years ago by pramach1 ▴ 40

score 2 · Accepted Answer · 2022-10-08

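library(gsubfn)  # kept from the original post; none of the calls below use it
L <- readLines("SeqSero2_results.txt")
# Split each line at ":" (V1 = field name, V2 = first value)
df <- read.delim(textConnection(L), header = FALSE, sep = ":", strip.white = TRUE)
# Group the values by field name, then stack the groups as rows
df1 <- unstack(df, V2 ~ V1)
rbindout <- do.call("rbind", df1)
# Drop the rows I don't need (these indices are specific to my file)
myData <- rbindout[-c(1, 2, 8, 11, 12), ]
# Transpose to one row per sample and reorder the columns
myData1 <- t(myData)
data <- myData1[, c(3, 5, 4, 1, 2, 7, 6)]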

I solved it by searching Stack Overflow and combining several approaches. I got the data frame I wanted.
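For files where the row positions differ, here is a sketch of an alternative that avoids hard-coded indices. It is not the method used above. It assumes each record starts with `Output_directory`, that each field sits on its own line as `name: value`, and that lines without a known field name (such as the free-text note after the serotype) can be ignored. The helper `parse_record` and the output file name are hypothetical; the sample lines are copied from the question, with the note shortened.

```r
# Field names as they appear in the SeqSero output shown in the question
fields <- c(
  "Output_directory", "Input files", "O antigen prediction",
  "H1 antigen prediction(fliC)", "H2 antigen prediction(fljB)",
  "Predicted antigenic profile", "Predicted subspecies",
  "Predicted serotype(s)"
)

# Hypothetical helper: turn the lines of one record into a named vector
parse_record <- function(lines) {
  # Split at the FIRST colon only, so values such as "9:-:-" stay intact
  key <- trimws(sub(":.*$", "", lines))
  value <- trimws(sub("^[^:]*:", "", lines))
  # Look up each expected field by exact name; absent fields become NA
  out <- value[match(fields, key)]
  names(out) <- fields
  out
}

# Two records from the question; in practice x would come from readLines()
x <- c(
  "Output_directory:SeqSero_result_08_29_2019_12_13_293213245",
  "Input files: 80_S20_L001_R1_001.fastq.gz",
  "O antigen prediction: O-",
  "H1 antigen prediction(fliC): -",
  "H2 antigen prediction(fljB): -",
  "Predicted antigenic profile: :-:-",
  "Predicted subspecies:",
  "Predicted serotype(s): N/A (The predicted antigenic profile does not exist in the White-Kauffmann-Le Minor scheme)",
  "",
  "Output_directory:SeqSero_result_08_29_2019_12_13_369944611",
  "Input files: 82_S22_L001_R1_001.fastq.gz",
  "O antigen prediction: O-9",
  "H1 antigen prediction(fliC): -",
  "H2 antigen prediction(fljB): -",
  "Predicted antigenic profile: 9:-:-",
  "Predicted subspecies: I",
  "Predicted serotype(s): Gallinarum",
  "The serotype(s) is/are the only serotype(s) with the indicated antigenic profile currently recognized in the Kauffmann White Scheme."
)

x <- x[nzchar(trimws(x))]  # drop blank lines

# One group per record, then one row per record
records <- split(x, cumsum(grepl("^Output_directory", x)))
result <- as.data.frame(do.call(rbind, lapply(records, parse_record)),
                        stringsAsFactors = FALSE)

# Hypothetical output file name; uncomment to write the table
# write.csv(result, "seqsero_summary.csv", row.names = FALSE)
```

Because fields are looked up by name rather than by position, a record with 8 lines and a record with 9 lines both produce one row with the same eight columns.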