Question

Excluding columns in a dataframe based on a character string of column names to exclude

0

9.3 years ago

confusedious ▴ 490

Hello everyone,

I apologise in advance if the terminology used in the title is misleading; I am not totally familiar with all object type terms, but I believe what I have posted is at least mostly correct.

I have a script for extracting sequences from a phyDat object (see packages 'ape' and 'phangorn') in R that uses subset and a character vector of the column names I wish to retain. See code below:

newalign <- as.phyDat(subset(aligndf, select = seqkeep))

In this case, 'aligndf' is the complete original alignment that has been transformed into a data frame in an earlier part of the script. Here I use 'subset' and 'select' to generate a new alignment object via 'as.phyDat' that consists only of the sequences named in the object 'seqkeep'. As an example, the contents of 'seqkeep' look like the following:

[1] "hominin23"                                                           
[2] "hominin33"                                                                      
[3] "hominin47"

This procedure works well, and from this I gain exactly what I wanted, which is a new alignment that consists only of the sequences given in 'seqkeep'.

When I then try to write a second alignment that consists only of the sequences not in 'seqkeep', I run into a problem. No matter what I try, the resulting alignment is the complete original alignment, still including the 'seqkeep' sequences.

Here are my most recent attempts based on some guides I have seen online:

remainalign <- as.phyDat(subset(aligndf, aligndf =! seqkeep))

remainalign <- as.phyDat(subset(aligndf, !(aligndf == seqkeep)))

Could anyone advise me on how to correctly render this task in R?

Thank you for your help.

R data frame • 4.4k views

ADD COMMENT • link updated 9.3 years ago by Michael 55k • written 9.3 years ago by confusedious ▴ 490

1

?subset

Warning

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.

ADD REPLY • link 9.3 years ago by Michael 55k

Ram · Accepted Answer · 2016-01-09

2

9.3 years ago

Michael 55k

Don't use the subset function; normal subsetting is much more readable. I am having trouble determining the structure of your data. Please post head(seqkeep) and aligndf so we can see the column names. You possibly want something like:

aligndf[ ,!(colnames(aligndf) %in% seqkeep)]

Edit: changed to column selection, as it is unclear what your goal is here.

or even simpler

aligndf[, seqkeep] # if all entries of seqkeep
# are present in colnames(aligndf)

Extracting in R is normally very straightforward to code and read.

See ?match, ?Extract and ?Comparison.
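To make the two selections concrete, here is a minimal sketch on a toy data frame. The column names "hominin12" and "hominin58" and the cell contents are made up for illustration; only the three names in seqkeep come from the question. The drop = FALSE argument is an addition, so that the result stays a data frame even if a single column is left.

```r
# Toy stand-in for the alignment data frame: one column per sequence,
# one row per site. "hominin12" and "hominin58" are hypothetical names.
aligndf <- data.frame(
  hominin12 = c("a", "c", "g"),
  hominin23 = c("a", "c", "t"),
  hominin33 = c("a", "t", "g"),
  hominin47 = c("g", "c", "g"),
  hominin58 = c("a", "c", "c"),
  stringsAsFactors = FALSE
)
seqkeep <- c("hominin23", "hominin33", "hominin47")

# Columns named in seqkeep.
keepdf <- aligndf[, seqkeep, drop = FALSE]

# Columns NOT named in seqkeep: negate the membership test.
remaindf <- aligndf[, !(colnames(aligndf) %in% seqkeep), drop = FALSE]

# The same complement written as a set difference on the names.
remaindf2 <- aligndf[, setdiff(colnames(aligndf), seqkeep), drop = FALSE]

print(colnames(keepdf))
print(colnames(remaindf))
```

Either result can then be passed on to as.phyDat() in the same way as in the question.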

In my R build the following is less readable but slightly faster than %in%:

aligndf[, !match(colnames(aligndf), seqkeep, nomatch = 0)] # if you need to do that often

You can speed this up further using the package fastmatch.
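A small sketch of how the match-based variant works, assuming the match is done on colnames(aligndf). The toy data frame and the names "hominin12" and "hominin58" are hypothetical. The fastmatch part only runs if that package happens to be installed; fmatch() is its drop-in replacement for match().

```r
# Hypothetical toy data: one column per sequence.
aligndf <- data.frame(
  hominin12 = c("a", "c"),
  hominin23 = c("a", "c"),
  hominin33 = c("a", "t"),
  hominin47 = c("g", "c"),
  hominin58 = c("a", "c"),
  stringsAsFactors = FALSE
)
seqkeep <- c("hominin23", "hominin33", "hominin47")

# Position of each column name in seqkeep, 0 where there is no match.
idx <- match(colnames(aligndf), seqkeep, nomatch = 0L)

# !idx turns 0 into TRUE and any non-zero position into FALSE,
# so it selects exactly the columns that are not in seqkeep.
remaindf <- aligndf[, !idx, drop = FALSE]
print(colnames(remaindf))

# Optional: same lookup with fastmatch, if it is available.
if (requireNamespace("fastmatch", quietly = TRUE)) {
  idx_fast <- fastmatch::fmatch(colnames(aligndf), seqkeep, nomatch = 0L)
  stopifnot(all(idx_fast == idx))
}
```

Whether this is noticeably faster than %in% will depend on the R build and the number of columns, so it is worth timing on real data before switching.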

Also, we are moving far away from bioinformatics here.

subset(aligndf, aligndf =! seqkeep) # what's wrong?

Also, subset works on rows, not columns, by default.

You are trying to extract the column aligndf from aligndf and compare it to a shorter vector using the non-existent operator =!. You meant !=, but comparison is not the same as a set operation, and == or != are not the right operators here. It is just a coincidence that it didn't throw an error in the first place.
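For anyone who wants to see this for themselves, the sketch below inspects the call without running it, then compares == with %in% on a vector of names. The call is only quoted, so no data is needed; the names "hominin12", "hominin58" and "hominin61" are made up for the example.

```r
# R's parser reads `aligndf =! seqkeep` as `aligndf = !seqkeep`,
# i.e. an argument NAMED aligndf, not a comparison.
e <- quote(subset(aligndf, aligndf =! seqkeep))
print(names(e))          # the last argument carries the name "aligndf"
print(deparse(e[[3]]))   # its value is the expression !seqkeep

# == compares element by element and recycles the shorter vector,
# so it does not test membership.
cols <- c("hominin12", "hominin23", "hominin33",
          "hominin47", "hominin58", "hominin61")
seqkeep <- c("hominin23", "hominin33", "hominin47")

elementwise <- cols == seqkeep    # position-by-position comparison
membership  <- cols %in% seqkeep  # is each name anywhere in seqkeep?

print(elementwise)
print(membership)
```

Here the element-wise comparison finds no matches at all because the names are offset by one position, while %in% correctly flags the three names that occur in seqkeep.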

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 9.3 years ago by Michael 55k

1

Also, I apologise if the content drifted a little too far into basic scripting as opposed to bioinformatics proper.

I have always had a much more positive experience getting answers here than on StackOverflow; people here are more understanding about the fact that it can take time for someone from a biology background to fully grasp handling computational and scripting issues effectively.

ADD REPLY • link 9.3 years ago by confusedious ▴ 490

0

Thank you Michael.

The below solved my problem nicely:

aligndf[ ,!(colnames(aligndf) %in% seqkeep)]

The object 'seqkeep' was a list of column names. I have now successfully altered the script so that it writes out two new .fasta alignments: the first with only the sequences in 'seqkeep' and the second with only those not in 'seqkeep'.

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.3 years ago by confusedious ▴ 490