Courtesy Rpolicastro

Question

Metagenomic Data management

0

Entering edit mode

4.4 years ago

MSRS ▴ 590

I have taxonomic assignment data from the metagenomic sample. How to merge data as described below?

The data >

Name    Group-1 Name    Group-2 Name    Group-3
A   2   A   1   A   4
B   4   B   2   B   7
C   7   C   4   I   6
D   6   D   7   D   4
E   4   F   6   M   9
        H   4

The output should be >

Name    Group-1 Group-2 Group-3
A   2   1   4
B   4   2   7
C   7   4   0
D   6   7   4
E   4   0   0
F   0   6   0
H   0   4   0
I   0   0   6
M   0   0   9

Any python or other coding or any other tools available to convert such kind of data? (Group number may vary like 4/5/6...) Thanks in advance.

Metagenome Data management taxonomic assignment • 2.3k views

ADD COMMENT • link updated 4.4 years ago by zx8754 12k • written 4.4 years ago by MSRS ▴ 590

1

Entering edit mode

Interesting problem. I'm trying to use dplyr to solve this but it's turning to be pretty challenging.

ADD REPLY • link 4.4 years ago by Ram 44k

0

Entering edit mode

Wouldn't full join do the trick here? (iteratively)

ADD REPLY • link 4.4 years ago by Asaf 10k

0

Entering edit mode

Yes, but you'd need to split the df into tables with 2 cols each and do an iterative full join with a manual stop. Plus, it is not a generalized solution that would work across any number of 2n cols so I'm curious if any sort of pivot can do it.

ADD REPLY • link 4.4 years ago by Ram 44k

score 5 · Accepted Answer · 2020-08-24

input:

> df
  Name Group.1 Name.1 Group.2 Name.2 Group.3
1    A       2      A       1      A       4
2    B       4      B       2      B       7
3    C       7      C       4      I       6
4    D       6      D       7      D       4
5    E       4      F       6      M       9
6           NA      H       4             NA

output:

> df %>%
+     rename_at(vars(starts_with("Name")), ~paste("Name", seq(1:(ncol(df)/2)),sep=".")) %>%
+     pivot_longer(everything(),names_to = c(".value", "set"),names_pattern = "(\\D+)(\\d)") %>%
+     drop_na()%>%
+     pivot_wider(names_from="set", values_from="Group.", names_prefix = "Group.", values_fill = 0) %>%
+     arrange(Name.)

# A tibble: 9 x 4
  Name. Group.1 Group.2 Group.3
  <chr>   <int>   <int>   <int>
1 A           2       1       4
2 B           4       2       7
3 C           7       4       0
4 D           6       7       4
5 E           4       0       0
6 F           0       6       0
7 H           0       4       0
8 I           0       0       6
9 M           0       0       9

score 4 · Accepted Answer · 2020-08-23

example data

df <- structure(list(Name = c("A", "B", "C", NA), Group_1 = c(40L, 
3L, 28L, NA), Name.1 = c("A", "B", "H", "I"), Group_2 = c(38L, 
36L, 20L, 16L), Name.2 = c("A", "D", "G", NA), Group_3 = c(15L, 
47L, 33L, NA)), class = "data.frame", row.names = c(NA, -4L))

> df
  Name Group_1 Name.1 Group_2 Name.2 Group_3
1    A      40      A      38      A      15
2    B       3      B      36      D      47
3    C      28      H      20      G      33
4 <NA>      NA      I      16   <NA>      NA

Tidyverse solution. Someone may come up with a more elegant solution, but this works in the meantime.

library("tidyverse")

df <- seq(2, ncol(df), 2) %>%
  map(~select(df, .x-1, .x) %>% rename(Name=1)) %>%
  reduce(full_join, by="Name") %>%
  mutate(across(starts_with("Group"), ~replace_na(.x, 0))) %>%
  drop_na

And the results.

> df
# A tibble: 7 x 4
  Name  Group_1 Group_2 Group_3
  <chr>   <dbl>   <dbl>   <dbl>
1 A          40      38      15
2 B           3      36       0
3 C          28       0       0
4 H           0      20       0
5 I           0      16       0
6 D           0       0      47
7 G           0       0      33

score 4 · Accepted Answer · 2020-08-25

4

Entering edit mode

4.4 years ago

h.mon 35k

edit: name of the script changed from weird_to_long.pl to weird_to_wide.pl.

I hacked a small Perl script to convert the data. Save the script as weird_to_wide.pl and run as (you may need to chmod +x weird_to_wide.pl):

./weird_to_wide.pl < input.tsv > output.tsv

In the process of writing the script, I got me wondering what program produces such weird tabular format? Or did you somehow joined the output of several programs? Do all lines have the same number of elements, i.e., the last line in your first example (H 4) has trailing tabs indicating it has additional empty columns?

Your original data seems awkward and error prone, I think it deserves some thinking in how to avoid it in the first place.

The script:

#!/usr/bin/env perl

use warnings;
use strict;

my @header;
my %hash;

while(<>) {
    s/\R//;
    next if ( /^\s$/ ); 
    unless( @header ) { 
        @header = split( /\t/, $_ );
        next;
    }
    my (@line) = split( /\t/, $_, -1 );
    for( my $i=1; $i<scalar( @line ); $i+=2 ) {
        my $name = $line[$i-1];
        next if !( length $name );
        @{$hash{$name}}[(($i+1)/2)-1] = $line[$i];
    }
}

print "$header[0]\t";
print join( "\t", map{$header[$_]} grep{$_ % 2 == 1} 1..$#header ), "\n";

for my $key (keys %hash){
    $#{$hash{$key}} = ( scalar( @header ) / 2 ) - 1;
    $_ //= 0 for @{$hash{$key}};
    print "$key\t";
    print join( "\t", @{$hash{$key}} ), "\n";
}

ADD COMMENT • link 4.4 years ago by h.mon 35k

0

Entering edit mode

Thank you H.mon.

did you somehow joined the output of several programs? Do all lines have the same number of elements, i.e., the last line in your first example (H 4) has trailing tabs indicating it has additional empty columns?

Yes, the data from running taxonomic assignment single file seperately and finally join it. As different sample, the taxonomic profilling will also be different thats why the different column has diferent length.

Again thank you for your excellent perl code.

ADD REPLY • link 4.4 years ago by MSRS ▴ 590

0

Entering edit mode

If the join was done by a program, it is badly programmed or being used badly. What is this software you're using?

ADD REPLY • link 4.4 years ago by Ram 44k

0

Entering edit mode

I used PathoScope2 for taxonomic assignment. As unavailability of HPC, I need to run the programm to each seperate sample. Then it is easy to sort out such like data using Microsoft Excel. Thank you

ADD REPLY • link 4.4 years ago by MSRS ▴ 590

4

Entering edit mode

No wonder. Read each file using R and then Reduce using merge. You're creating the problem, so the best way to solve it would be to not create it.

ADD REPLY • link 4.4 years ago by Ram 44k

0

Entering edit mode

Thank you. Perl and R have already solved my issue.

ADD REPLY • link 4.4 years ago by MSRS ▴ 590

1

Entering edit mode

Yes, they did this one time for this particular manner in which the data got messed up. No language can clean up manually created messes. It's good practice to maintain data in a sensible manner from start to finish.

ADD REPLY • link 4.4 years ago by Ram 44k

0

Entering edit mode

Thank you. This is not a manually created mass! No program so far available to run multiple metagenomic data file where data needs to align at least 80GB Refseq. In that case, single file run is the best option as having 64GB RAM and 16 CPU. You will be most welcome if you give me the solution of running the heavy metagenomic data (16 samples) run at a time with having accessibility of such kind of PC.

ADD REPLY • link 4.4 years ago by MSRS ▴ 590

1

Entering edit mode

Maybe I'm misunderstanding your premise. If PathoScoope2 is giving you this mess, where it is putting together two-column files side by side, then you need a robust solution downstream. However, if you're using Excel to paste the PathoScope2 outputs side by side, you can avoid that and feed raw output to an R/perl/python script.

Do you get this 6-column output from a program or does the program give you a bunch of 2 column outputs?

ADD REPLY • link 4.4 years ago by Ram 44k

1

Entering edit mode

For each sample, PathpScope gives the bunch of results, like first taxonomic ID, best hit, percentages... (9/10 columns for each sample).

From there I have to choose taxonomic ID and best hid column only. Here, as the NCBI Refseq database are huge (around 80GB) there are duplicates in the taxonomic ID. In that case, I need to consolidate and remove duplicate. After that I get above kind of sample where I want to set my taxonomic ID in a single column and best hit for each sample.

If not clear, please give me your email, I will send you raw results. Then if you can give the better solution, you are most welcome. Thank you

ADD REPLY • link 4.4 years ago by MSRS ▴ 590

0

Entering edit mode

I'm sorry, I cannot invest so much time in that (even though I'd really like to). If you can de-couple side-by-side datasets produced in Excel from the above pipeline, your life would get easier. In any case, this looks like it can be solved only by someone working closely with the pipeline, and I cannot invest any time right now. Keep looking though.

ADD REPLY • link 4.4 years ago by Ram 44k

1

Entering edit mode

Thank you very much RamRS. I can understand your recommendations. Output from PathoScope will be the input and the desired output will be the above mentioned output. Here, I will share my raw results and will look through.

ADD REPLY • link 4.4 years ago by MSRS ▴ 590

score 3 · Accepted Answer · 2020-08-26

Using base R:

# reproducible example data
df <- read.table(text = "
Name    Group-1 Name    Group-2 Name    Group-3
A   2   A   1   A   4
B   4   B   2   B   7
C   7   C   4   I   6
D   6   D   7   D   4
E   4   F   6   M   9
        H   4       
", stringsAsFactors = FALSE, check.names = FALSE, header = TRUE, sep = "\t", na.strings = "")

# split on every 2 columns
x <- split.default(df, rep(1:3, each = 2))

# remove rows where name is blank
x <- lapply(x, function(i) i[ !is.na(i[, "Name"]), ])

# merge list of dataframes
res <- Reduce(function(...) merge(..., all = TRUE), x)

# convert NAs to zero (I would skip this step)
res[ is.na(res) ] <- 0

res
#   Name Group-1 Group-2 Group-3
# 1    A       2       1       4
# 2    B       4       2       7
# 3    C       7       4       0
# 4    D       6       7       4
# 5    E       4       0       0
# 6    F       0       6       0
# 7    H       0       4       0
# 8    I       0       0       6
# 9    M       0       0       9

Note: For readability I put every command separately, all could be bundled up in one row as below:

res <- Reduce(function(...) merge(..., all = TRUE), 
              lapply(split.default(df, rep(1:3, each = 2)),
                     function(i) i[ !is.na(i[, "Name"]), ]))

score 1 · Accepted Answer · 2020-08-23

Courtesy Rpolicastro

Woking from csv file as below,

Name.1  Group_1 Name.2  Group_2 Name.3  Group_3
1222    2   1222    1   1222    4
1333    4   1333    2   1333    7
1334    7   1334    4   1334    6
1335    6   1335    7   1335    4
1336    4   1337    6   13699   9
1366    4

We can follow like this in RStudio

setwd("C:/Users/SHAKIL/Desktop/metagenome_data_R") #working_directory_RStudio
install.packages("writexl")
library(writexl)
library("tidyverse")
df = read.csv("bar_plot.csv", header = TRUE)

df <- seq(2, ncol(df), 2) %>%
  map(~select(df, .x-1, .x) %>% rename(Name=1)) %>%
  reduce(full_join, by="Name") %>%
  mutate(across(starts_with("Group"), ~replace_na(.x, 0))) %>%
  drop_na

write_xlsx(x = df, path = "df.xlsx", col_names = TRUE)

Output as .xlsx format

Name    Group_1 Group_2 Group_3
1222    2   1   4
1333    4   2   7
1334    7   4   6
1335    6   7   4
1336    4   0   0
1366    4   0   0
1337    0   6   0
13699   0   0   9