How to remove duplicate and reverse duplicate hits from BLASTn results?
2
0
Entering edit mode
21 months ago

Hello, I have 10 sample files with multiple fasta sequences in each of them. I performed an "all_vs_all" blastn for all the samples for a specific analysis. So, I have obtained output files like sample_1_vs_sample_2.tsv, sample_2_vs_sample_1.tsv and so on. Now I want to filter out all the duplicate and reverse duplicate blast hits and keep only the unique hits for each sample file against all others. For example, consider the following scenario:

query   subject pident
108.tig00000003_nlr_1:522-1277  108.tig00000003_nlr_3:513-1256  92.063
108.tig00000003_nlr_1:522-1277  108.tig00000005_nlr_1:524-1267  88.243
108.tig00000003_nlr_1:522-1277  108.tig00000005_nlr_2:524-1267  85.789
108.tig00000003_nlr_2:495-1265  108.tig00000003_nlr_4:3-374 98.387
108.tig00000003_nlr_3:513-1256  108.tig00000005_nlr_1:524-1267  94.631
108.tig00000003_nlr_3:513-1256  108.tig00000003_nlr_1:522-1277  92.063
108.tig00000003_nlr_3:513-1256  108.tig00000005_nlr_2:524-1267  88.636
108.tig00000003_nlr_4:3-374 108.tig00000003_nlr_2:495-1265  98.387
108.tig00000005_nlr_1:524-1267  108.tig00000003_nlr_3:513-1256  94.631
108.tig00000005_nlr_1:524-1267  108.tig00000005_nlr_2:524-1267  88.503
108.tig00000005_nlr_1:524-1267  108.tig00000003_nlr_1:522-1277  88.243
108.tig00000005_nlr_2:524-1267  108.tig00000003_nlr_3:513-1256  88.667
108.tig00000005_nlr_2:524-1267  108.tig00000005_nlr_1:524-1267  88.533
108.tig00000005_nlr_2:524-1267  108.tig00000003_nlr_1:522-1277  85.827
108.tig00000008_nlr_1:1019-1360 108.tig00000110_nlr_1:1005-1346 88.824
108.tig00000009_nlr_5:618-1259  108.tig00000013_nlr_1:739-1290  81.703
108.tig00000010_nlr_3:1981-2334 108.tig00000010_nlr_4:1026-1376 85.507

You can see that 108.tig00000003_nlr_1:522-1277 108.tig00000003_nlr_3:513-1256 92.063 and 108.tig00000003_nlr_3:513-1256 108.tig00000003_nlr_1:522-1277 92.063 in the first and sixth line as an example of reverse duplicate. I want to keep only one hit and remove all the other occasions where it is in the subject column indicating reverse duplicate hits. How can I do that in R? Thank you.

bash-scripting gene R BLAST • 1.8k views
ADD COMMENT
2
Entering edit mode
21 months ago

You can use R to filter out the duplicate and reverse duplicate hits from your blast output files. Here's a possible solution using the dplyr package:

## 1. Load the dplyr package:
library(dplyr)

## 2. Read in your blast output file using read.table:
blast <- read.table("sample_1_vs_sample_2.tsv", header = TRUE)

## 3. Create a new column that concatenates the query and subject columns, separated by a comma:
blast <- blast %>% 
  mutate(qs = ifelse(query < subject, paste(query, subject, sep = ","), paste(subject, query, sep = ",")))

## 4. Group the data by the qs column and keep only the rows with the minimum query value within each group:

blast_unique <- blast %>% 
  group_by(qs) %>% 
  filter(query == min(query)) %>% 
  ungroup() %>% 
  select(-qs)

## 5. Repeat steps 2-4 for all your blast output files, and then combine the unique hits into a single data frame using rbind:
blast_files <- c("sample_1_vs_sample_2.tsv", "sample_2_vs_sample_1.tsv", ...)
blast_unique <- data.frame()
for (f in blast_files) {
  blast <- read.table(f, header = TRUE)
  blast <- blast %>% 
    mutate(qs = ifelse(query < subject, paste(query, subject, sep = ","), paste(subject, query, sep = ",")))
  blast_unique_file <- blast %>% 
    group_by(qs) %>% 
    filter(query == min(query)) %>% 
    ungroup() %>% 
    select(-qs)
  blast_unique <- rbind(blast_unique, blast_unique_file)
}

## 6. Write the unique hits to a new file using write.table:
write.table(blast_unique, file = "blast_unique.tsv", sep = "\t", quote = FALSE, row.names = FALSE)
ADD COMMENT
0
Entering edit mode

Wow. Thank you so much for this solution. Just FYI, I also found a solution which is as follows:

# get a list of all .tsv files in the directory
blast_files <- list.files(pattern = "\\.tsv$")

# loop through each file and read it into a data frame
blast_dfs <- lapply(blast_files, function(file) {
  read.table(file, header = FALSE, sep = "\t", stringsAsFactors = FALSE,
             col.names = c("query_id", "subject_id", "pct_identity", "alignment_length",
                           "mismatches", "gap_opens", "q_start", "q_end", "s_start", "s_end",
                           "evalue", "bit_score"))
})

# combine all data frames into one
blast_df <- do.call(rbind, blast_dfs)

# find duplicate hits and remove them
dup_rows <- duplicated(blast_df[c("query_id", "subject_id")]) | duplicated(blast_df[c("query_id", "subject_id")], fromLast = TRUE)
blast_df <- blast_df[!dup_rows, ]

# find reverse duplicates and remove them
rev_dup_rows <- which(blast_df$subject_id %in% blast_df$query_id & blast_df$query_id %in% blast_df$subject_id)
blast_df <- blast_df[-rev_dup_rows[seq(2, length(rev_dup_rows), by = 2)],]

# output the unique hits to a new file
unique_df <- unique(blast_df)
write.table(unique_df, file = "unique_hits.tsv", sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
ADD REPLY
0
Entering edit mode
21 months ago
Mensur Dlakic ★ 28k

Seems like a pretty straightforward case of brute-force matching that can be done as a loop in any programming language. After loading the dataset, we go through it row by row, and inspect all the rows underneath. If [row A, column 1] == [row B, column 2] and [row A, column 2] == [row B, column 1], then drop row B.

If you plan to do clustering or some similar thing downstream, it may be desirable to have reciprocal hits.

ADD COMMENT
0
Entering edit mode

Yes, I want to do Markov chain clustering after this with this dataset. But don't you think its necessary to filter out the bidirectional hits from the output files?

ADD REPLY
0
Entering edit mode

If memory serves well, MCL wants a symmetrized matrix. If there are no reciprocal hits, it will create them by flipping the query-hit rows and keep the same score. You already have that.

https://micans.org/mcl/man/mcxio.html

ADD REPLY
0
Entering edit mode

But it is necessary to remove the reverse duplicates, no? because it will count the same hit twice or more in the end.

ADD REPLY
0
Entering edit mode

There is no such thing as reverse duplicates in this case. If A matches B with a score 100, and B matches A with the same score, that is a reciprocal hit and it is expected. MCL likes best when the graphs are undirected, which means that A "attracts" B with the same force as B does to A. If you record only A -> B score but not B -> A, that would be a directed graph.

I think you will need to research this on your own it this explanation is not sufficient.

ADD REPLY
0
Entering edit mode

Ok, thank you very much. I see your point now.

ADD REPLY

Login before adding your answer.

Traffic: 2667 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6