I have several sets of relatively short DNA sequences (200 bp to about 2000 bp), stored as FASTA.
They are all supposed to be of bacterial origin.
However, I want to make sure that no human sequences are sneaking in. Some of them could also be only partially human (meaning a part of the entire sequence could be of human origin).
I would simply blastN against the whole `human_genomic.*tar.gz` and `est_human.*.tar.gz` databases. Speed is not much of an issue, so I do not need a solution like Centrifuge or mapping; I would like to go with BLAST for high sensitivity.
Would you add any other databases to the list to search against?
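For context, the kind of search I have in mind would look roughly like the sketch below. File names and thresholds are illustrative, not recommendations: `human_grch38.fna` stands for whatever FASTA the downloaded archives unpack to, and `bacterial_sequences.fasta` is my query set.

```shell
# Build a local BLAST database from the extracted human genome FASTA
# (file name is illustrative -- use whatever the archives unpack to).
makeblastdb -in human_grch38.fna -dbtype nucl -out human_db

# Sensitive nucleotide search; tabular output includes alignment length,
# so partially human (chimeric) sequences show up as short local hits.
blastn -query bacterial_sequences.fasta \
       -db human_db \
       -task blastn \
       -evalue 1e-5 \
       -outfmt "6 qseqid sseqid pident length evalue bitscore" \
       -out hits_vs_human.tsv
```

The tabular `-outfmt 6` output is what I would then filter to decide which sequences to discard.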
Use `bbsplit.sh` to bin your reads (A: Tool to separate human and mouse RNA-seq reads). Use the human genome alone if you don't know which specific bacteria you want to include. Reads aligning to the human genome will go into one file and the rest will be collected in a second.

Thank you! However, I am NOT asking for a tool to split my data; I know `bbsplit.sh`. I am asking whether you would add another database, in addition to the two mentioned above, to make sure very short stretches of human sequence get caught.
The human genome sequence should be a catch-all; there should be no need to add any other database. ESTs etc. are all a subset of the entire genome.
That is a tough criterion. If you want to enforce it, what minimum length are you thinking of using for hits? You may get small stretches of sequence identity between your data and the human genome purely by chance.
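To make that threshold concrete: with tabular (`-outfmt 6`) BLAST output, filtering hits by a minimum alignment length and identity is a one-liner. The sample data and cutoffs below are purely illustrative; the fields assumed are `qseqid sseqid pident length evalue bitscore`.

```shell
# Hypothetical sample of BLAST tabular output (outfmt 6):
# qseqid  sseqid  pident  length  evalue  bitscore
cat > hits_vs_human.tsv <<'EOF'
seq1	chr1	98.5	450	1e-50	800
seq2	chr7	95.0	60	1e-8	90
seq3	chr2	88.0	25	0.01	30
EOF

# Flag queries with a hit of at least 50 bp at >=90% identity.
# These thresholds are illustrative -- raising the minimum length
# reduces chance matches at the cost of missing short human stretches.
awk -F'\t' '$4 >= 50 && $3 >= 90 {print $1}' hits_vs_human.tsv \
    | sort -u > flagged_ids.txt

cat flagged_ids.txt
```

Here only `seq1` and `seq2` would be flagged; `seq3`'s 25 bp hit falls below the length cutoff and is treated as a chance match.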