Question

Blastp Of Human Proteins Against A Combined Pig And Gorilla Protein Db, Why 10% Of Best Hits Are Against Pig?

11.0 years ago

5heikki 11k

wget -O human.fa.gz ftp://ftp.ncbi.nih.gov/genomes/Homo_sapiens/protein/protein.fa.gz
wget -O pig.fa.gz ftp://ftp.ncbi.nih.gov/genomes/Sus_scrofa/protein/protein.fa.gz
wget -O gor.fa.gz ftp://ftp.ncbi.nih.gov/genomes/Gorilla_gorilla/protein/protein.fa.gz

gunzip -c pig.fa.gz | sed 's/^>/>pig_/' > pig
gunzip -c gor.fa.gz | sed 's/^>/>gor_/' > gor
cat pig gor > piggor.fa
makeblastdb -in piggor.fa -dbtype prot -hash_index -parse_seqids -out piggor

subsample 1,000 human proteins
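
One way to do the sub-sampling step, as a minimal sketch: put each FASTA record on a single line, pick records at random with shuf, then restore the header/sequence layout. The function name subsample_fasta and the file names in the usage line are made up for illustration, and the sketch assumes GNU shuf is installed and that headers contain no TAB characters.

```shell
# Hypothetical helper: print N randomly chosen records from a FASTA file.
# Assumes GNU shuf is available and that header lines contain no TAB characters.
subsample_fasta() {
    # $1 = FASTA file, $2 = number of records to keep
    awk '/^>/ { if (seen) printf "\n"; seen = 1; printf "%s\t", $0; next }
         { printf "%s", $0 }
         END { if (seen) printf "\n" }' "$1" |
        shuf -n "$2" |
        tr '\t' '\n'
}

# Usage (file names are placeholders):
# subsample_fasta human.fa 1000 > human_sub.fa
```

Each sequence comes out unwrapped on a single line, which blastp accepts.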

blastp human subsample against piggor db with tabular output (outfmt 6)
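
The search step might look like the following sketch. The flags (-query, -db, -outfmt, -num_threads, -out) are standard BLAST+ options, but the wrapper name, the file names and the thread count are assumptions, not the exact settings behind the numbers reported here.

```shell
# Hypothetical wrapper around blastp; the thread count is an illustrative choice.
run_blastp() {
    # $1 = query FASTA, $2 = BLAST database name, $3 = output file
    blastp -query "$1" -db "$2" \
        -outfmt 6 \
        -num_threads 4 \
        -out "$3"
}

# Usage (file names are placeholders):
# run_blastp human_sub.fa piggor blast_output
```

With -outfmt 6 the e-value is in column 11 and the bit score in column 12, which is what the sort step relies on.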

sort for best hits with
sort -k1,1 -k12,12gr -k11,11g blast_output | sort -u -k1,1 --merge > best_hits
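
An alternative way to keep one best hit per query, sketched here for comparison with the sort -u --merge idiom: sort by query, then by descending bit score (column 12), then by ascending e-value (column 11), and let awk keep the first row seen for each query. The function name best_hits_per_query is hypothetical. Exact ties are still broken by whichever row sort emits first.

```shell
# Hypothetical helper: one best-scoring row per query from outfmt 6 output.
best_hits_per_query() {
    # $1 = tabular BLAST output (outfmt 6)
    # Inside the awk program, $1 is the query ID column, not the shell argument.
    sort -k1,1 -k12,12gr -k11,11g "$1" | awk '!seen[$1]++'
}

# Usage (file names are placeholders):
# best_hits_per_query blast_output > best_hits
```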

grep -c 'pig_gi' best_hits
grep -c 'gor_gi' best_hits
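
The two counts can also be turned into fractions in one pass. This sketch assumes the subject ID is in column 2 and starts with the pig_ or gor_ prefix added with sed; the function name is hypothetical.

```shell
# Hypothetical helper: count and percentage of best hits per species prefix.
best_hit_fractions() {
    # $1 = best-hit table, one row per query (outfmt 6)
    awk '{ split($2, part, "_"); n[part[1]]++; total++ }
         END { for (sp in n) printf "%s\t%d\t%.1f%%\n", sp, n[sp], 100 * n[sp] / total }' "$1" |
        sort
}

# Usage (file name is a placeholder):
# best_hit_fractions best_hits
```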

About 10% of best hits are against pig. Why?

blast+ • 2.3k views

ADD COMMENT • link updated 10.7 years ago by Biostar 20 • written 11.0 years ago by 5heikki 11k

You may want to show a few examples, as I doubt anyone wants to reproduce this without seeing results.

ADD REPLY • link 11.0 years ago by Adrian Pelin ★ 2.6k

It takes about 5-10 minutes on a modern laptop. Sub-sampling just 100 human proteins gave me pretty much the same outcome twice (with and without -use_sw_tback).

ADD REPLY • link 11.0 years ago by 5heikki 11k

score 0 · Answer 1 · 2013-11-28

You didn't show the actual blastp command or your sub-sampling command. It's a little embarrassing, but I am not sure how to sub-sample.

Anyway, my gut feeling is that 10% is slightly high. However, some percentage is expected, and could be due to: 1) The gene being lost in gorillas (pseudogenized or completely absent). 2) The gene having undergone gain of function / rapid evolution in gorillas relative to humans and pigs. 3) The gene being extremely conserved, with an e-value of 0; in that case you might get the pig first for odd reasons such as a sequencing error or improper selection of the best hit based on e-values rather than score.
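
Possibility 3 can be checked directly: for each query, compare the top pig bit score with the top gorilla bit score and flag ties, where the reported "best" species is arbitrary. A sketch, assuming outfmt 6 with the bit score in column 12 and the pig_/gor_ prefixes from the question; the function name is hypothetical.

```shell
# Hypothetical helper: label each query as pig, gor, or tie by top bit score.
classify_queries() {
    # $1 = full tabular BLAST output (all hits, not only best hits)
    awk '{
            split($2, part, "_"); sp = part[1]; q = $1; queries[q] = 1
            # Track the highest bit score seen for this query in this species.
            if (!((q, sp) in best) || $12 + 0 > best[q, sp]) best[q, sp] = $12 + 0
         }
         END {
            for (q in queries) {
                p = ((q, "pig") in best) ? best[q, "pig"] : -1
                g = ((q, "gor") in best) ? best[q, "gor"] : -1
                verdict = (p > g) ? "pig" : (g > p) ? "gor" : "tie"
                print q "\t" verdict
            }
         }' "$1" | sort
}

# Usage (file name is a placeholder):
# classify_queries blast_output | cut -f2 | sort | uniq -c
```

A query with no gorilla hit at all is reported as pig here, which corresponds to possibility 1 rather than 3.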

If you really want to know which one it is, take one of those genes, and blast against nr, maybe all of a sudden your best hit is chimp.

Also, you decreased sub-sampling and your result doubled. What if you increase it, will the 10% become 5%? What if you don't sub-sample? Food for thought.

score 0 · Answer 2 · 2013-12-01

11.0 years ago

Adrian Pelin ★ 2.6k

You know... this recent news article got me thinking about your recent experiment: http://timesofindia.indiatimes.com/home/science/Humans-emerged-from-male-pig-and-female-chimp-worlds-top-geneticist-says/articleshow/26648981.cms as well as: http://observationdeck.io9.com/no-humans-are-not-chimp-pig-hybrids-1474029809 http://scienceblogs.com/pharyngula/2013/07/02/the-mfap-hypothesis-for-the-origins-of-homo-sapiens/

So... what did you get when you blasted that 100 gene subset against nr?:)

ADD COMMENT • link 11.0 years ago by Adrian Pelin ★ 2.6k

After filtering with grep -v "homo" output | grep -v "synthetic" > outNoHomNoSyn and sorting for best hits as above, the best-hit species were:

Pan troglodytes
Papio anubis;Nomascus leucogenys
Gorilla gorilla gorilla
Pan paniscus
Pan paniscus
Pan troglodytes
Papio anubis
Pan paniscus;Pan troglodytes
Pan troglodytes
Pan paniscus
Pongo abelii
Pan paniscus
Pan paniscus
Gorilla gorilla gorilla
Gorilla gorilla gorilla
Pan paniscus
Ailuropoda melanoleuca
Pan paniscus
Gorilla gorilla gorilla
Gorilla gorilla gorilla
Gorilla gorilla gorilla
Macaca fascicularis
Pan paniscus
Pan paniscus
Pan paniscus
Pan troglodytes
Pan paniscus
Pan troglodytes
Gorilla gorilla gorilla;Pan troglodytes
Gorilla gorilla gorilla;Pan paniscus;Pan troglodytes
Pan troglodytes
Macaca mulatta
Cricetulus griseus
Gorilla gorilla gorilla
Gorilla gorilla;Gorilla gorilla gorilla
Pan paniscus;Pan troglodytes
Pongo abelii
Pan paniscus;Pan troglodytes
Papio anubis
Gorilla gorilla gorilla
Macaca fascicularis;Macaca mulatta;Papio anubis
Pan paniscus;Pan troglodytes
Gorilla gorilla gorilla
Gorilla gorilla gorilla
Gorilla gorilla gorilla
Pan troglodytes
Saimiri boliviensis boliviensis
Papio anubis
Macaca mulatta
Pan troglodytes
Pan troglodytes
Pan troglodytes
Pan troglodytes
Gorilla gorilla gorilla
Gorilla gorilla gorilla
Pan paniscus;Pan troglodytes
Pan paniscus
Pongo abelii
Macaca fascicularis;Macaca mulatta
Pan troglodytes
Pan troglodytes
Pongo abelii
Pongo abelii
Pan paniscus
Pan paniscus
Pan paniscus;Pan troglodytes
Otolemur garnettii
Macaca mulatta
Macaca mulatta
Odobenus rosmarus divergens;Monodelphis domestica;Sorex araneus;Condylura cristata
Gorilla gorilla gorilla
Pan paniscus;Pan troglodytes
Macaca mulatta
Nomascus leucogenys
Pan paniscus;Pan troglodytes
Pan paniscus;Pan troglodytes
Pan paniscus
Pan troglodytes
Pan troglodytes
Pan troglodytes
Macaca mulatta
Pongo abelii
Pan paniscus
Pan paniscus
Pan paniscus
Pan troglodytes
Macaca fascicularis;Macaca mulatta
Pan troglodytes
Pan paniscus
Ochotona princeps
Nomascus leucogenys
Pan paniscus
Pan paniscus;Pan troglodytes
Gorilla gorilla gorilla
Pongo abelii
Gorilla gorilla gorilla
Pan troglodytes
Pongo abelii

98 hits, a few of them quite crazy (Chinese hamster, walrus, pika, bear, panda...), likely due to human contamination and/or extremely conserved sequences.
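
A list like this is easier to read as a tally. A sketch, assuming one best-hit species string per line in a file (the function and file names are hypothetical); lines naming several species are split on ";" so that each species is counted once per hit.

```shell
# Hypothetical helper: count how often each species appears among the best hits.
tally_species() {
    # $1 = file with one line per best hit, tied species separated by ";"
    tr ';' '\n' < "$1" |
        awk '{ c[$0]++ } END { for (s in c) print c[s] "\t" s }' |
        sort -k1,1nr -k2
}

# Usage (file name is a placeholder):
# tally_species best_hit_species.txt
```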

ADD REPLY • link 11.0 years ago by 5heikki 11k

score 0 · Answer 3 · 2013-12-01

11.0 years ago

cdsouthan ★ 1.9k

Could it not be that the pig assembly and gene model pipeline run was simply better? That is, the pig ORF set is more complete (http://cdsouthan.blogspot.se/2011/08/alas-poor-kamilah-erroneous-ensembl.html).

ADD COMMENT • link 11.0 years ago by cdsouthan ★ 1.9k