Hi,
I am trying to filter multiple blastn result files in CSV format so that each file keeps only 3 columns: pident, qcoverage, and stitle. I want to keep only rows where both pident and qcoverage are above 90, and to remove any rows with duplicate stitle values.
To start with I used this:
data <- lapply(out, "[", 3:5)
to reduce my data down to the required 3 columns:
data
list of length 3
[[1]] list [2652 x 3] (S3: data.frame) A data.frame with 2652 rows and 3 columns
[[2]] list [2646 x 3] (S3: data.frame) A data.frame with 2646 rows and 3 columns
[[3]] list [1460 x 3] (S3: data.frame) A data.frame with 1460 rows and 3 columns
The data in each file now looks like this:
gb|AE006468.2|+|1707351-1707789|ARO:3002571|AAC(6')-Iaa [Salmonella enterica subsp. enterica serovar Typhimurium str. LT2] 96.522 46
gb|AY769962|+|2434-5611|ARO:3000781|adeJ [Acinetobacter baumannii] 87.273 22
gb|CP014358.1|-|2161325-2162750|ARO:3001327|mdtK [Salmonella enterica subsp. enterica serovar Typhimurium] 98.387 100
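In case the column positions ever differ between files, I believe the same selection could be done by name, something like this (assuming the columns are actually named pident, qcoverage, and stitle in my files):
data <- lapply(out, function(x) x[, c("pident", "qcoverage", "stitle")])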
However, I am running into trouble with the next step, excluding all rows where the pident column is below 90. I have tried this:
mydata1 <- lapply[data, function (x) x[data$pident > 90]]
but it does not seem to work. Could anyone suggest a better way to accomplish this? I would also like to remove rows with duplicate stitle values, for which I am planning to use the distinct() function from dplyr, as I have seen in another post, something like this:
distinct(dat, stitle, .keep_all = TRUE)
but if this looks foolish, please let me know.
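In case it helps, here is my rough sketch of what I think the whole filtering step should look like, combining both conditions and the deduplication (assuming the columns really are named pident, qcoverage, and stitle, and that pident and qcoverage are numeric):
library(dplyr)
# keep rows where both pident and qcoverage exceed 90, then drop duplicate
# stitle values; note the comma in x[..., ] so rows (not columns) are subset
mydata1 <- lapply(data, function(x) {
  x <- x[x$pident > 90 & x$qcoverage > 90, ]
  distinct(x, stitle, .keep_all = TRUE)
})
Please correct me if that is off.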
Thanks!
(P.S. This is reposted on behalf of someone else, who might also respond.)