Question

How would you create a table for multiple organisms vs presence of multiple genes using command line blast? [image in description]


9.4 years ago

Tom ▴ 20

Here's the situation.

I have proteome files for a bunch of strains. Each strain has its own fasta proteome (strain1.faa, strain2.faa, strain3.faa).

I also have a FASTA file of amino acid sequences, and I want to know whether they are present in these strains. That "query" file looks like this:

>gene 1
MKGMF...*
>gene 2
MQWAEA...*

etc...

What I want in the end is a matrix with the strains in the first column and the genes in the first row. I don't want to run a manual BLAST for every cell because that's impractical. I just want the information. Each value in the matrix is the % identity of that gene in that strain. It will look like this: [image]

What is the most parsimonious way to go about this project? I have a lot of strains and hundreds of genes to test, but I'm okay with outputting a CSV for now. It's such a large task that I'm unsure how to start.

blast command line blastp • 2.6k views

updated 9.4 years ago by Michael 56k • written 9.4 years ago by Tom ▴ 20

score 0 · Answer 1 · 2016-03-18


9.4 years ago

5heikki 11k

Once you have the CSV, load it into R (maybe RStudio) and plot it with ggplot2 like here. One very fast way to get a distance matrix is to use the cool new mash algorithm. I think it should work with proteomes too.

p.s. I don't really understand your picture. How is strain X Y percent some gene?



It's not the heatmap I want. It's just the raw information. I don't want to have to individually do a blast search manually for each cell.

I can't find another google image picture that depicts this very type of project.

Reply • 9.4 years ago by Tom ▴ 20

score 0 · Answer 2 · 2016-03-19


9.4 years ago

Michael 56k

You don't need more than one BLAST run to do this. Put all the reference sequences or genomes (the y-axis of your matrix) into one BLAST database. Put all the query sequences (the x-axis) into the query FASTA. Run the appropriate BLAST command (e.g. tblastn or blastp), and you are done.
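To get from that single run to the strain-by-gene table, something like the following could work. It is only a minimal sketch, not a tested pipeline, and it rests on a few assumptions that are not in the question: the BLAST output is in tabular format (`-outfmt 6`, whose first three default columns are query ID, subject ID and % identity), every protein header was prefixed with its strain name before the proteomes were concatenated (a hypothetical convention such as `strain1|protein_id`), and the query headers have unique IDs without spaces (e.g. `gene_1` rather than `gene 1`, since only the first word of a header is normally used as the ID). The function names are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Assumed upstream steps (file names are hypothetical):
#   makeblastdb -in all_strains.faa -dbtype prot -out all_strains
#   blastp -query genes.faa -db all_strains -outfmt 6 -out hits.tsv


def identity_matrix(blast_lines, sep="|"):
    """Return {strain: {gene: best % identity}} from tabular BLAST lines.

    Assumes subject IDs look like '<strain><sep><protein_id>'.
    """
    best = defaultdict(dict)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        gene, subject, pident = fields[0], fields[1], float(fields[2])
        strain = subject.split(sep, 1)[0]
        # Keep only the highest % identity seen for this strain/gene pair.
        if pident > best[strain].get(gene, 0.0):
            best[strain][gene] = pident
    return best


def write_csv(matrix, handle):
    """Write strains as rows and genes as columns; 0.0 means no hit."""
    genes = sorted({gene for row in matrix.values() for gene in row})
    writer = csv.writer(handle)
    writer.writerow(["strain"] + genes)
    for strain in sorted(matrix):
        writer.writerow([strain] + [matrix[strain].get(g, 0.0) for g in genes])


if __name__ == "__main__":
    # Made-up example rows, only to show the expected input shape.
    demo = [
        "gene_1\tstrain1|p1\t98.5\t100\t1\t0\t1\t100\t1\t100\t1e-50\t200",
        "gene_1\tstrain2|p7\t75.0\t100\t25\t0\t1\t100\t1\t100\t1e-30\t120",
    ]
    out = io.StringIO()
    write_csv(identity_matrix(demo), out)
    print(out.getvalue())
```

Two caveats: a gene with no hit in any strain will not appear as a column, because the columns are taken from the hits themselves, and the highest % identity can come from a short local alignment, so you may want to filter hits on alignment length or e-value before building the table.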
