Blastp Of Human Proteins Against A Combined Pig And Gorilla Protein Db, Why 10% Of Best Hits Are Against Pig?
3
0
Entering edit mode
11.0 years ago
5heikki 11k
wget ftp://ftp.ncbi.nih.gov/genomes/Homo_sapiens/protein/protein.fa.gz
wget ftp://ftp.ncbi.nih.gov/genomes/Sus_scrofa/protein/protein.fa.gz
wget ftp://ftp.ncbi.nih.gov/genomes/Gorilla_gorilla/protein/protein.fa.gz

sed 's/>/>pig_/' pig proteins > pig
sed 's/>/>gor_/' gorilla proteins > gor
cat pig gor > piggor.fa
makeblastdb -in piggor.fa -dbtype prot -hash_index -parse_seqids -out piggor

subsample 1,000 human proteins

blastp human subsample against piggor db with tabular output (outfmt 6)

sort for best hits with
sort -k1,1 -k12,12gr -k11,11g blast_output | sort -u -k1,1 --merge > best_hits

grep -c 'pig_gi' best_hits
grep -c 'gor_gi' best_hits

About 10% of best hits are against pig. Why?

blast+ • 2.3k views
ADD COMMENT
0
Entering edit mode

You may want to show a few examples, as I doubt someone wants to reproduce this without results.

ADD REPLY
0
Entering edit mode

It takes like 5-10 min with a modern laptop. Sub-sampling just 100 human proteins gave me two times pretty much the same outcome (with and without -use_sw_tback).

ADD REPLY
0
Entering edit mode
11.0 years ago
Adrian Pelin ★ 2.6k

You didn't show a canonical blastp command, and also you did not show your sub-sampling command, and that's a little embarrassing but I am not sure how to sub-sample.

Anyway, I believe 10% to be slightly high just based on gut feeling. However, a certain percent is fine, and could be due to: 1) The gene being lost in gorillas, pseudogenized or completely absent. 2) Underwent gain of function / rapid evolution in gorillas relative to humans and pigs. 3) The gene is extremely conserved with an evalue of 0, in that case you might get the pig first for odd reasons such as error in sequecing or improper selection of the best hit based on evalues rather than score....

If you really want to know which one it is, take one of those genes, and blast against nr, maybe all of a sudden your best hit is chimp.

Also, you decreased sub-sampling and your result doubled. What if you increase it, will the 10% become 5%? What if you don't sub-sample? Food for thought.

ADD COMMENT
0
Entering edit mode

I subsample with this script. I have it at /usr/local/bin as subsample_fasta.pl
blast was:
blastp -query homo_subset.fa -db piggor -outfmt 6 -out blast_output -num_threads 4
or
blastp -query homo_subset.fa -db piggor -outfmt 6 -seg yes -use_sw_tback -out blast_output -num_threads 4
I also made a pig chimp db and still 5-10% of best hits to pig. I also sorted best hits with percent identity instead of evalue and bitscore. Even then, sometimes the pig wins. I'm sure there's a good explanation to this and it's not that humans are some creepy chimp pig hybrids because as you said, against nr the best hit is probably never against a pig. I'm testing this right now with a 100 protein subset. Maybe there is no unifying explanation to the results but each case is different, some due to rapid change in one lineage, some due to something completely different..

ADD REPLY
0
Entering edit mode
ADD COMMENT
0
Entering edit mode

grep -v "homo" output | grep -v "synthetic" > outNoHomNoSyn and sorting for best hits as above:

Pan troglodytes
Papio anubis;Nomascus leucogenys
Gorilla gorilla gorilla
Pan paniscus
Pan paniscus
Pan troglodytes
Papio anubis
Pan paniscus;Pan troglodytes
Pan troglodytes
Pan paniscus
Pongo abelii
Pan paniscus
Pan paniscus
Gorilla gorilla gorilla
Gorilla gorilla gorilla
Pan paniscus
Ailuropoda melanoleuca
Pan paniscus
Gorilla gorilla gorilla
Gorilla gorilla gorilla
Gorilla gorilla gorilla
Macaca fascicularis
Pan paniscus
Pan paniscus
Pan paniscus
Pan troglodytes
Pan paniscus
Pan troglodytes
Gorilla gorilla gorilla;Pan troglodytes
Gorilla gorilla gorilla;Pan paniscus;Pan troglodytes
Pan troglodytes
Macaca mulatta
Cricetulus griseus
Gorilla gorilla gorilla
Gorilla gorilla;Gorilla gorilla gorilla
Pan paniscus;Pan troglodytes
Pongo abelii
Pan paniscus;Pan troglodytes
Papio anubis
Gorilla gorilla gorilla
Macaca fascicularis;Macaca mulatta;Papio anubis
Pan paniscus;Pan troglodytes
Gorilla gorilla gorilla
Gorilla gorilla gorilla
Gorilla gorilla gorilla
Pan troglodytes
Saimiri boliviensis boliviensis
Papio anubis
Macaca mulatta
Pan troglodytes
Pan troglodytes
Pan troglodytes
Pan troglodytes
Gorilla gorilla gorilla
Gorilla gorilla gorilla
Pan paniscus;Pan troglodytes
Pan paniscus
Pongo abelii
Macaca fascicularis;Macaca mulatta
Pan troglodytes
Pan troglodytes
Pongo abelii
Pongo abelii
Pan paniscus
Pan paniscus
Pan paniscus;Pan troglodytes
Otolemur garnettii
Macaca mulatta
Macaca mulatta
Odobenus rosmarus divergens;Monodelphis domestica;Sorex araneus;Condylura cristata
Gorilla gorilla gorilla
Pan paniscus;Pan troglodytes
Macaca mulatta
Nomascus leucogenys
Pan paniscus;Pan troglodytes
Pan paniscus;Pan troglodytes
Pan paniscus
Pan troglodytes
Pan troglodytes
Pan troglodytes
Macaca mulatta
Pongo abelii
Pan paniscus
Pan paniscus
Pan paniscus
Pan troglodytes
Macaca fascicularis;Macaca mulatta
Pan troglodytes
Pan paniscus
Ochotona princeps
Nomascus leucogenys
Pan paniscus
Pan paniscus;Pan troglodytes
Gorilla gorilla gorilla
Pongo abelii
Gorilla gorilla gorilla
Pan troglodytes
Pongo abelii

98 hits, a few of them quite crazy (chinese hamster, walrus, pika, bear, panda..), likely due to human contamination and/or extremely conserved sequences..

ADD REPLY
0
Entering edit mode
11.0 years ago
cdsouthan ★ 1.9k

Could it not be that the pig assembly and gene model pipeline run was simply better ? i.e. the ORF set is more complete (http://cdsouthan.blogspot.se/2011/08/alas-poor-kamilah-erroneous-ensembl.html)

ADD COMMENT

Login before adding your answer.

Traffic: 1618 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6