Hi all,
I am currently performing blast searches for a species of interest by blasting a genome assembly against custom databases of metagenome shotgun data.
Everything seems to work fine and I get hits if I create a database from a single sample (i.e. paired reads, converted from fastq to fasta). But as soon as I create a custom database from multiple samples, I only get hits for a single sample, although I know there must be hits in the other samples as well, since I also tried making a separate database for every single sample.
So for example:
Lactobacillus crispatus gets the following hits if I blast against individual databases for each sample:
- 578 hits from sample A
- 1606 hits from sample B
- 614 hits from sample C
Combining the three samples in one custom database, I get 614 hits (and the subject IDs in the output are all from sample C). Not a single hit from sample A or B.
I have also had a look at sequence identity between samples (since these are 150 bp reads), and although there are some identical reads, I should still get plenty of unique hits from each of the other samples.
Does anyone have an idea as to why it behaves that way? I have the feeling that the only sample returning hits is the one with the largest number of sequences compared to the others, or alternatively the one that comes last alphabetically in the database.
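To see how hits are distributed across samples, the subject IDs in the output can be tallied by sample. The following is only a minimal sketch: it assumes tabular output (`-outfmt 6`, where the subject ID is the second column) and assumes that each read ID starts with a sample name followed by a dot. The function name `hits_per_sample` and the example IDs are made up for illustration and will likely need adapting to the real headers.

```python
from collections import Counter

def hits_per_sample(lines, sep="."):
    """Count BLAST hits per sample from tabular (-outfmt 6) lines.

    Assumption: the subject ID (column 2) begins with the sample name,
    followed by `sep`, e.g. "sampleA.1". Adjust `sep` to the real IDs.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not a valid tabular hit line
        sample = fields[1].split(sep, 1)[0]
        counts[sample] += 1
    return counts

# Hypothetical hit lines (query ID, subject ID, percent identity)
example = [
    "contig1\tsampleA.1\t99.3",
    "contig1\tsampleA.2\t98.0",
    "contig2\tsampleC.7\t100.0",
]
print(dict(hits_per_sample(example)))
```

With a real output file, the lines could be read with `open("hits.tsv")` (a placeholder file name) and passed to the same function.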
Any thoughts are much appreciated! Thanks, Christina
It is not clear what your BLAST database is. Are you creating the database from the raw shotgun metagenomic reads?
Could you post the commands you used?
I downloaded paired raw read fastq files from the SRA archive. Converted to fasta using:
combine both runs using
then create custom database
I have made an individual database for each of samples A, B and C. But I have also made a database for all three samples, by combining the fasta files using the cat command.
I want to do many more samples, so creating individual databases for each sample is too tedious.
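When many samples are combined into one fasta file, it can help to tag each header with its sample name and to check that no sequence ID occurs twice. The sketch below is only an illustration of that idea, not a confirmed explanation for the missing hits: the helper names `tag_headers` and `duplicate_ids`, the underscore separator, and the example reads are all assumptions.

```python
def tag_headers(lines, sample):
    """Yield FASTA lines with the sample name prepended to each header."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield f">{sample}_{line[1:]}"
        elif line:
            yield line

def duplicate_ids(lines):
    """Return the set of sequence IDs that appear more than once.

    The ID is taken as the first whitespace-separated word of the header.
    """
    seen, dups = set(), set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            if seq_id in seen:
                dups.add(seq_id)
            seen.add(seq_id)
    return dups

# Hypothetical reads from two samples that share an ID
sample_a = [">read1 length=150", "ACGT", ">read2 length=150", "GGCC"]
sample_b = [">read1 length=150", "TTAA"]

plain = sample_a + sample_b  # equivalent to a plain cat of both files
tagged = list(tag_headers(sample_a, "A")) + list(tag_headers(sample_b, "B"))

print(duplicate_ids(plain))   # {'read1'}
print(duplicate_ids(tagged))  # set()
```

If the real headers already carry a unique run accession, the duplicate check should come back empty and ID collisions can be ruled out as a cause.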