Question

Trouble filtering blastn output CSV files by pident, qcov, and stitle using RStudio

3

4.2 years ago

pawhitesell ▴ 30

Hi,

I am trying to filter multiple blastn result files in csv format, such that each csv file will only include 3 columns: pident, qcoverage, and stitle. I want to keep only rows with pident and qcoverage above 90, and I want to remove any duplicate stitle rows.

To start with I used this:

data <- lapply(out, "[", 3:5)

to reduce my data down to the required 3 columns:

data     list [3]                            List of length 3
[[1]]    list [2652 x 3] (S3: data.frame)    A data.frame with 2652 rows and 3 columns
[[2]]    list [2646 x 3] (S3: data.frame)    A data.frame with 2646 rows and 3 columns
[[3]]    list [1460 x 3] (S3: data.frame)    A data.frame with 1460 rows and 3 columns

The data in each file now looks like this:

gb|AE006468.2|+|1707351-1707789|ARO:3002571|AAC(6')-Iaa [Salmonella enterica subsp. enterica serovar Typhimurium str. LT2] 96.522  46
gb|AY769962|+|2434-5611|ARO:3000781|adeJ [Acinetobacter baumannii] 87.273 22
gb|CP014358.1|-|2161325-2162750|ARO:3001327|mdtK [Salmonella enterica subsp. enterica serovar Typhimurium] 98.387 100

However, I am running into trouble with my next step, which is to exclude all rows where the pident column is below 90. I have tried this:

mydata1 <- lapply[data, function (x) x[data$pident > 90]]

but it does not seem to do this. Could anyone suggest how I could accomplish this better? I would also like to remove rows with duplicate stitles, for which I am planning on using the "distinct()" function of dplyr as I have seen in another post, something like this:

distinct(dat, stitle, .keep_all = TRUE)

but if this looks foolish, please let me know.

Thanks!

(P.S. This is reposted on behalf of someone else, who might also respond.)

Blastn CSV R • 1.2k views

ADD COMMENT • link updated 4.2 years ago by zx8754 12k • written 4.2 years ago by pawhitesell ▴ 30

0

Instead of deleting and reposting, next time consider editing the original post.

Link to original deleted post

ADD REPLY • link 4.2 years ago by zx8754 12k

score 0 · Answer 1 · 2020-10-08

0

4.2 years ago

zx8754 12k

Try:

mydata1 <- lapply(data, function (x) x[x$pident > 90, ])
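Since the question asks for both pident and query coverage above 90, the same pattern can combine the two conditions in one pass. This is only a sketch: the toy tables are made up, and the column names pident, qcovs, and stitle are assumptions about how the real files are labelled.

```r
# Toy stand-ins for the list of BLAST result tables (values are made up).
data <- list(
  data.frame(stitle = c("geneA", "geneB", "geneC"),
             pident = c(96.5, 87.3, 98.4),
             qcovs  = c(46, 22, 100)),
  data.frame(stitle = c("geneD", "geneE"),
             pident = c(99.1, 91.0),
             qcovs  = c(95, 100))
)

# Keep rows where both pident and qcovs exceed 90.
# which() skips rows where the comparison is NA instead of returning NA rows.
filtered <- lapply(data, function(x) {
  x[which(x$pident > 90 & x$qcovs > 90), ]
})
```

Wrapping the condition in which() is optional; it only matters if some values are missing.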

ADD COMMENT • link 4.2 years ago by zx8754 12k

0

mydata1 <- lapply(data, function (x) x[(x$qcovs > 90),])

The code you sent didn't work, so I modified it a little bit and now it works. Thank you so much, this saved me a lot of time. I was planning to remove rows with duplicate stitle values after filtering, and your code gave me an idea of how to do that too.

data2 <- lapply(mydata1, function (x) x[!duplicated(x$stitle),])

That worked too. Thank you.
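One thing worth knowing about this approach: !duplicated() keeps whichever row comes first for each stitle, which is not necessarily the best hit. If that matters, sorting before de-duplicating is one option. A minimal sketch on made-up values, assuming the columns are named stitle, pident, and qcovs:

```r
# Made-up hits with repeated subject titles.
hits <- data.frame(
  stitle = c("geneA", "geneB", "geneA", "geneC", "geneB"),
  pident = c(95.0, 99.0, 98.5, 92.0, 93.0),
  qcovs  = c(100, 97, 99, 95, 100)
)

# Plain de-duplication: keeps the first row seen for each stitle.
first_seen <- hits[!duplicated(hits$stitle), ]

# Optional variant: sort by descending pident first, so the row that is
# kept for each stitle is the one with the highest percent identity.
ordered  <- hits[order(-hits$pident), ]
best_hit <- ordered[!duplicated(ordered$stitle), ]
```

For a list of tables, either version can be wrapped in lapply() in the same way as the filtering step.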

ADD REPLY • link 4.2 years ago by pramach1 ▴ 40

0

Yeah, there was a typo, I missed the comma after 90, fixed now.
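For anyone wondering why the comma matters: with a data.frame, x[i, ] selects rows, while x[i] with no comma is treated as column selection. A small sketch on a made-up table (column names are assumptions) shows the difference; the no-comma call is wrapped in tryCatch() because it typically either fails or returns the wrong columns, depending on the data.

```r
# Made-up table with the three columns discussed in the thread.
x <- data.frame(
  stitle = c("geneA", "geneB", "geneC", "geneD"),
  pident = c(96.5, 87.3, 98.4, 99.0),
  qcovs  = c(46, 22, 100, 95)
)

keep <- x$pident > 90   # one logical value per row

# With the comma: the logical vector picks rows, all columns are kept.
rows <- x[keep, ]

# Without the comma: the same vector is interpreted as a column selector,
# which is not what is wanted here. Capture any error instead of stopping.
no_comma <- tryCatch(x[keep], error = function(e) conditionMessage(e))
```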

ADD REPLY • link 4.2 years ago by zx8754 12k