How to filter blast output according to sequence names?
8.0 years ago
bei • 0

Hi, I was wondering if anyone could tell me how to filter BLAST output according to sequence names. Thanks!

For example, BLAST gives the following output:

A11_610 gi|502439232 68.4 57 18 3.1e-14 85.5
A11_1273 gi|951490813 85.3 68 10 1e-24 120.6
A11_1116 gi|476506208 65.3 4 71 2.1e-11 76.6
A11_1132 gi|497849802 97.9 48 48 8.3e-17 94.4

And a second text file contains the sequence names:

A11_610
A11_1273

How can I filter the BLAST output so that it contains only the hits for A11_610 and A11_1273?

blast • 1.8k views
8.0 years ago

Is the following doing what you want or did I misunderstand the question?

grep -w -f file2.txt file1.txt > filtered.txt

Always keep in mind the potential for errors when using grep and grep -f.

For example, the pattern A11_610 matches more than just A11_610:

A11_610
A11_6101
A11_6102
A11_610...
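
The difference is easy to check on a toy file. The sketch below uses three rows from the question plus one made-up row (A11_6101, with placeholder values) and counts matching lines with and without -w. The file names blast.tsv and names.txt are assumptions for illustration.

```shell
# Work in a scratch directory so nothing existing is overwritten.
cd "$(mktemp -d)"

# Three rows from the question plus one hypothetical row (A11_6101).
# printf reuses the format string for every group of seven arguments.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  A11_610  'gi|502439232' 68.4 57 18 3.1e-14 85.5 \
  A11_1273 'gi|951490813' 85.3 68 10 1e-24   120.6 \
  A11_1116 'gi|476506208' 65.3 4  71 2.1e-11 76.6 \
  A11_6101 'gi|000000000' 50.0 10 5  1e-05   40.0 > blast.tsv

printf 'A11_610\nA11_1273\n' > names.txt

# Without -w, the pattern A11_610 also hits the A11_6101 row.
plain_count=$(grep -c -f names.txt blast.tsv)

# With -w, a match must be delimited by non-word characters.
word_count=$(grep -c -w -f names.txt blast.tsv)

echo "without -w: $plain_count lines, with -w: $word_count lines"
```

Note that -w still searches the whole line, not just the first column.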

CSV/TSV tools are a better choice:

  • csvkit - CSV/TSV tools, written in Python.
  • csvtk - CSV/TSV tools, written in Go.
  • GNU datamash - Performs numeric, textual and statistical operations on TSV files. Written in C.
  • dplyr - Tools for tabular data in R storage formats. Runs in an R environment, code is in C++.
  • miller - CSV/TSV and JSON tools, written in C.
  • tsvutils - TSV/CSV tools, especially rich in format converters. Written in Python.
  • xsv - CSV/TSV tools, written in Rust.

With my csvtk, use:

csvtk -t -F grep -f 1 -P LIST_FILE   TAB_FILE

EDIT: Sorry, I overlooked the -w option.

Anyway, CSV/TSV tools that search only the given columns generally run faster.
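
Where none of these tools is installed, a similar column-restricted exact match can be sketched with awk (a stand-in here, not one of the tools listed above). It assumes tab-separated -outfmt 6 output with the query name in column 1. The names blast.tsv, names.txt and filtered.tsv are placeholders, and the A11_6101 row is made up to show that prefixes are not matched.

```shell
# Work in a scratch directory so nothing existing is overwritten.
cd "$(mktemp -d)"

# Three rows from the question plus one hypothetical row (A11_6101).
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  A11_610  'gi|502439232' 68.4 57 18 3.1e-14 85.5 \
  A11_1273 'gi|951490813' 85.3 68 10 1e-24   120.6 \
  A11_1116 'gi|476506208' 65.3 4  71 2.1e-11 76.6 \
  A11_6101 'gi|000000000' 50.0 10 5  1e-05   40.0 > blast.tsv

printf 'A11_610\nA11_1273\n' > names.txt

# First file (NR == FNR): remember each non-empty name.
# Second file: print rows whose first column is exactly one of those names.
awk -F '\t' 'NR == FNR { if ($1 != "") keep[$1] = 1; next } $1 in keep' \
  names.txt blast.tsv > filtered.tsv

cat filtered.tsv
```

Each name is looked up in a hash table, so the cost per BLAST line does not grow with the number of names. The NR == FNR idiom misbehaves if names.txt is completely empty, so check for that case first.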


Indeed, -w would have tackled that. But another danger with -f is an empty line in the pattern file, which will match everything...
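
A minimal sketch of the empty-line problem, assuming GNU grep and placeholder file names (blast.tsv, names.txt, names.clean.txt). The empty pattern is shown without -w, since how -w treats an empty pattern may differ between grep versions. Blank lines are then stripped before filtering.

```shell
# Work in a scratch directory so nothing existing is overwritten.
cd "$(mktemp -d)"

# The four rows from the question, tab-separated.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  A11_610  'gi|502439232' 68.4 57 18 3.1e-14 85.5 \
  A11_1273 'gi|951490813' 85.3 68 10 1e-24   120.6 \
  A11_1116 'gi|476506208' 65.3 4  71 2.1e-11 76.6 \
  A11_1132 'gi|497849802' 97.9 48 48 8.3e-17 94.4 > blast.tsv

# Name list with an accidental blank line in the middle.
printf 'A11_610\n\nA11_1273\n' > names.txt

# The empty pattern matches every line, so all four rows are counted.
all_count=$(grep -c -f names.txt blast.tsv)

# Drop blank (or whitespace-only) lines from the list, then filter again.
# -F treats each name as a fixed string rather than a regular expression.
grep -v '^[[:space:]]*$' names.txt > names.clean.txt
clean_count=$(grep -c -F -w -f names.clean.txt blast.tsv)

echo "with blank line: $all_count rows, after cleaning: $clean_count rows"
```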


Thanks for your quick reply!

I have two files: one is the BLAST output file (-outfmt 6), the other is a text file containing some sequence names. I just want to filter the BLAST output file down to the needed sequences (for example, A11_610 and A11_1273).

I have run your command, however, it doesn't work for me.


> I have run your command, however, it doesn't work for me.

What does that mean? What was the result? In my command, file2.txt would be the file with the needed sequences and file1.txt would be the blast output file.


Thank you! I made a mistake: my file1.txt was the file of sequence names. Your command works!


Do you have a faster command? I have millions of sequences. For now I am just testing 300 sequences, with 50 BLAST hits, and the computer is still filtering.


Sort both files and use join: https://linux.die.net/man/1/join (e.g. "merge two files").
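
A sketch of the sort-and-join approach, assuming tab-separated BLAST output with the query name in column 1 and one name per line in the list. All file names are placeholders. LC_ALL=C is set so that sort and join agree on the ordering.

```shell
# Work in a scratch directory so nothing existing is overwritten.
cd "$(mktemp -d)"

# Use the same byte-wise collation for sort and join.
export LC_ALL=C
TAB="$(printf '\t')"

# The four rows from the question, tab-separated.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  A11_610  'gi|502439232' 68.4 57 18 3.1e-14 85.5 \
  A11_1273 'gi|951490813' 85.3 68 10 1e-24   120.6 \
  A11_1116 'gi|476506208' 65.3 4  71 2.1e-11 76.6 \
  A11_1132 'gi|497849802' 97.9 48 48 8.3e-17 94.4 > blast.tsv

printf 'A11_610\nA11_1273\n' > names.txt

# join needs both inputs sorted on the join field (column 1 in each file).
# sort -u also removes duplicate names, which would duplicate output rows.
sort -u names.txt > names.sorted.txt
sort -t "$TAB" -k1,1 blast.tsv > blast.sorted.tsv

# Keep only the BLAST rows whose first column appears in the name list.
join -t "$TAB" names.sorted.txt blast.sorted.tsv > filtered.tsv

cat filtered.tsv
```

The result comes back ordered by sequence name rather than in the original BLAST order.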
