Hello guys,
I am a bioinformatics student and I am having metagenomic data (contigs) of the content of a fermentation plant. This means there is possible plant, animal and bacterial DNA material in it. I am only interested in some enzyme produced by methanogenic bacteria. So that is why I try to filter my metagenomic data with a self- constructed db with the genome of z. mays, s. tuberosum, b. subtilis and r.norvegicus. For this purpose I wanted to use a nucleotide megablast. To figure out which settings I needed to use to get the cleanest filtering I made a query that consisted of: some random plant genes, some random mammalian animal genes and some subunits gene coding for a methanogenesis pathway enzyme.
This is my command:
rnieuwenhuis@bin211:~$ megablast \
-i /homes/rnieuwenhuis/Documents/Thema09/Project9/Tofindrightsettings.fa \
-d /media/usb1/Data2/Thema09_fliterdb \
-o test5 \
-m 9 \
-a 4 \
-W 24 \
-e 1
Then I get this output:
# BLASTN 2.2.26 [Sep-21-2011]
# Query: gi|312136230:c753499-752768 Methanothermus fervidus DSM 2088 chromosome, complete genome
# Database: /media/usb1/Data2/Thema09_fliterdb
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
# BLASTN 2.2.26 [Sep-21-2011]
# Query: gi|428205073:c5244719-5241819 Chroococcidiopsis thermalis PCC 7203 chromosome, complete genome
# Database: /media/usb1/Data2/Thema09_fliterdb
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
# BLASTN 2.2.26 [Sep-21-2011]
# Query: gi|45357563:1512964-1514625 Methanococcus maripaludis S2 chromosome, complete genome
# Database: /media/usb1/Data2/Thema09_fliterdb
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
# BLASTN 2.2.26 [Sep-21-2011]
# Query: gi|240256243:1440146-1441863 Arabidopsis thaliana chromosome 4, complete sequence
# Database: /media/usb1/Data2/Thema09_fliterdb
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
# BLASTN 2.2.26 [Sep-21-2011]
# Query: gi|240256243:6100743-6101708 Arabidopsis thaliana chromosome 4, complete sequence
# Database: /media/usb1/Data2/Thema09_fliterdb
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
gi|240256243:6100743-6101708 gi|564111623|ref|NW_006239484.1| 100.00 25 0 0 835 859 288651 288627 0.005 50.0
# BLASTN 2.2.26 [Sep-21-2011]
# Query: gi|312136230:c753499-752768 Methanothermus fervidus DSM 2088 chromosome, complete genome
# Database: /media/usb1/Data2/Thema09_fliterdb
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
# BLASTN 2.2.26 [Sep-21-2011]
# Query: gi|312136230:c752768-751938 Methanothermus fervidus DSM 2088 chromosome, complete genome
# Database: /media/usb1/Data2/Thema09_fliterdb
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
# BLASTN 2.2.26 [Sep-21-2011]
# Query: gi|312136230:c751588-750875 Methanothermus fervidus DSM 2088 chromosome, complete genome
# Database: /media/usb1/Data2/Thema09_fliterdb
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
gi|312136230:c751588-750875 gi|109471397|ref|NW_001084817.1| 96.67 30 1 0 502 531 12407878 12407907 8e-04 52.0
gi|312136230:c751588-750875 gi|109506243|ref|NW_001084730.1| 100.00 25 0 0 502 526 17031349 17031325 0.003 50.0
There are 4 Methanothermus fervidus methyltransferase subunit loci in it. And 2 plant genes I found by searching kegg for a common enzyme of plants.
My result show that, whatever setting I use, the subunit gene always matches way better than any of the other genes. It says that the complete subunit is in the genome of the r.norvegicus. How is this possible? Does a rat have some enzyme which could use exact the same subunit?
(p.s. I know a gene is not the same as an enzyme and that I probrably used some wrong terms, feel free to correct me, I want to learn :) )
I would start solving this problem by searching enzyme sequence with online ncbi blast. You'll get approximate results what you should be looking in your db.
It is impossible to answer your question because of the lack of detail and obvious errors in your description. It is very likely that your pipeline is flawed and all wrong. In fact I do not think you need this subtractive filtering at all. What is your motivation to include the organisms you included? Are sequences from these organisms potentially contained in the sample? If not, why would you filter for them on the DNA level? what happens on the DNA level is mostly irrelevant and maybe an artifact for any comparison of distant sequences.
Methanogens are not bacteria but archaea. Furthermore, to answer the question we need exact input and output of your query, in particular:
In fact I guess that you are comparing distant gene sequences on the DNA level which is irrelevant and gives rise to artifacts.
Thank you sir for your comment.
The reason the question is a bit briefly because I expected it to be enough to answer, so my apologies.
I cannot provide all the information you asked for. (I will modify my question tomorrow with the requested information, I cannot reach it right now)
It would be best to cast a wide net including all plant, animal and bacterial genes. The hits of a rat protein against a bacterial protein are probably a cause of this, it isn't impossible for a rat to have a sequence similar to a bacterial one. The reason you're getting hits against bacteria is because you've only included bacteria (and a very small number at that) in your database. BLAST is finding the best possible hit for your sequence, which happens to place a rat gene against a bacterial gene.
This is going to very strongly bias your results, just because you're interested in a specific set of species, doesn't mean that you can ignore everything else. A good scientist finds the best possible explanation given all possible explanations, not the best one given what he or she is interested in. This approach is massively inappropriate, as you can see, limiting your species is drastically biasing your results. Contigs from massively distant species could be seen as hits because BLAST is only designed to find the best possible hit given the search space provided. Yes there are metrics to filter hits, but in the rare occasions of highly conserved genes, you might incorrectly map things.
Since you have reason to believe that there are species from such a very diverse set, you need to include them. You also need to remember that there are multiple species of Rat, Mouse, as well as numerous other rodent genera that could end up in the fermenter either at the point of harvest, transport, or within the plant. What about plant, animal and bacterial viruses/phages? What about fungi? Do birds defecate on grain piles or on anything that ends up in the fermentation? What type of animals graze on the grains? How is the grain collected, by machinery or manual labor? Is there a chance for human urine or feces to contaminate the grain at any point? What about insects?
Unless you have a reason to block species (e.g. the plant adds fungicidal compounds to the grain), you must include them if you want to do good science. Is there previous 16S sequencing that allows you to narrow bacterial species present? Do you have access to records from the pest management at the plant that show you what species of rodent are most common? Even then, this sort of information should be used to include additional species to the mix, if they're trying to kill species/etc, it means there's a real chance of that species/etc being present.
The best way to start with this, and with any large dataset, is to filter out the crap you're not interested in, not try and map everything to the things you're interested in. A basic solution would be to BLAST against the nr/nt databases to assign all of your contigs a taxonomical placement. Then you can reasonably throw things out to focus on the data you're interested in, although since you used the term "metagenome", you need to analyze everything.
I never get why people try to narrow so much to start with, metagenomic datasets are the bioinformatics equivalent of Indiana Jones, you can dig and hunt and look for all of the novel/oddball stuff no one has ever found before! Besides, finding that kind of stuff can mean big impact factors, so why not get as much as you can and only filter as you need to?
Who knows, maybe the gut bacteria from the mice chopped up in the combines are critical to that smooth oaky finish to my bourbon! :)
Thank you for your reply sir,
I was clearly having a wrong idea about 'filtering' my data and I will certainly think more and better about my future plan to not oversimplify them.