4.9 years ago
anasofiamoreira94
Good morning, I apologize for this question, and I will try to explain myself as clearly as possible. We analyzed some FASTQ files from Ion Torrent and ran BLAST against the whole nt database. We are now seeing results that don't make sense (environmental samples, vibrios, etc.), in the sense that more things appear than we were expecting. So we wanted to use the RefSeq mitochondrial database instead, but I have doubts about the taxids, because when I search against the mitochondrial database I don't get the species in the results. Could someone indicate the best path for my goal? Thanks
No need for that. Biostars exists specifically for questions.
It would help to describe what this project is about. We have heard what you have tried to do but not why.
NGS can be very sensitive (you can very easily sequence contaminant DNA in any of your samples), and if one is not careful with samples/preps, unexpected results can happen.
I can't really describe the goal of the project... I'm so sorry. Can you point me to the best way to use the RefSeq database?
Maybe you can describe the data. Are they amplicons or full genomes (all DNA)? If amplicons, which marker/gene are you looking at?
We perform targeted sequencing, so we focus on amplicons.
I already expected this when you said you had hits on "environmental sample" entries. The nt database is full of those kinds of reads. In practice, researchers take a sample from water or soil and know that a certain family of species is in it, but don't know exactly which species. It just gets uploaded to GenBank, and if you check the taxonomy of those hits it often will not go deeper than family level.
The easy thing to do is to use a database that is specifically created for your target. (We don't know your target, so we cannot give suggestions.) The more difficult thing, depending on your own skills, is to filter the nt database.
I think your expectation is wrong, depending on your type of sample of course.
As far as I know, they all have taxids in the same way the nt database does. But using this database will probably not help (I can't say for sure, since I don't know your goal). Most of the time a species has been sequenced for a specific target with a certain primer, not for its whole mitochondrion, so you will miss a lot.
Let's say I want to analyse my data against hits of meat and fish, would you suggest to continue using nt?
I personally would suggest filtering nt (you could maybe also filter the hits afterwards), or using BOLD (http://boldsystems.org/). That database is not that easy to use; the API is a bit weird.
Filtering nt also drastically speeds up the BLAST search.
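For the "filter the hits afterwards" option, a minimal sketch in Python, assuming the BLAST run wrote tabular output with a custom column list such as `-outfmt "6 qseqid sseqid pident evalue staxids"`. The column order, the allow-list of taxids (cattle, pig, chicken) and the example rows are illustrative assumptions, not something from this thread.

```python
import csv
import io

# Assumed column order, matching a custom tabular output such as
#   -outfmt "6 qseqid sseqid pident evalue staxids"
COLUMNS = ["qseqid", "sseqid", "pident", "evalue", "staxids"]

# Illustrative allow-list of NCBI taxids: cattle, pig, chicken.
WANTED_TAXIDS = {"9913", "9823", "9031"}


def filter_hits(handle, wanted=WANTED_TAXIDS):
    """Yield hit rows (as dicts) whose staxids field contains a wanted taxid."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        # staxids may hold several IDs separated by ';'
        taxids = set((row["staxids"] or "").split(";"))
        if taxids & wanted:
            yield row


# Made-up rows standing in for a real BLAST output file.
example = io.StringIO(
    "read1\tsubjA\t99.1\t1e-50\t9913\n"
    "read2\tsubjB\t97.0\t1e-40\t666\n"
    "read3\tsubjC\t98.5\t1e-45\t9823;9031\n"
)
kept = [row["qseqid"] for row in filter_hits(example)]
print(kept)
```

With a real file you would pass `open("hits.tsv")` (a hypothetical file name) instead of the `io.StringIO` example.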
But how can I filter the nt? Is it possible to do it remotely?
Remotely, I don't think so. You could download it and use Biopython or maybe blastdbcmd. It is a lot to explain in detail, so it may be better to make a start and ask a new question if you get stuck. Or maybe someone else has another suggestion.
Basically, you can read the nt database with Biopython, and if a FASTA header contains a certain text you can write that sequence to a new file.
I'll probably make a new question, thanks.
Btw, maybe this helps: Downloading all COI sequences from BOLD database
I'm so sorry, but I can't use the BOLD database, only NCBI, but thanks for the suggestion.
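A rough sketch of that header-filtering idea. To keep it free of extra dependencies it uses plain Python rather than Biopython; with Biopython the same loop could be written with `Bio.SeqIO.parse(handle, "fasta")` and a check on `record.description`. The function name, keywords and example records are made up for illustration.

```python
import io


def filter_fasta(in_handle, out_handle, keywords):
    """Copy FASTA records whose header contains any keyword (case-insensitive).

    Returns the number of records written.
    """
    keywords = [k.lower() for k in keywords]
    keep = False
    written = 0
    for line in in_handle:
        if line.startswith(">"):
            # New record: decide once, based on the header line.
            header = line.lower()
            keep = any(k in header for k in keywords)
            if keep:
                written += 1
        if keep:
            # Write the header and every sequence line that follows it.
            out_handle.write(line)
    return written


# Made-up records standing in for a FASTA dump of nt.
source = io.StringIO(
    ">seq1 Bos taurus mitochondrion\nACGT\nTTGA\n"
    ">seq2 uncultured bacterium\nGGCC\n"
    ">seq3 Salmo salar COI\nAATT\n"
)
target = io.StringIO()
n = filter_fasta(source, target, ["bos taurus", "salmo salar"])
print(n)
print(target.getvalue())
```

Because it streams line by line, this never holds the whole database in memory, which matters for something the size of nt.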
You can't filter nt remotely. You could do a blast search against it using a specific entrez query with the species of your interest. As it stands, meat and fish is too broad a term and is not one that could be used with entrez.
I can specify the genes of interest with entrez, but I will continue to have all the species that I don't want...
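An Entrez query can combine organism terms with gene terms, which would restrict the hits to the listed species. A minimal sketch, assuming BLAST+ `blastn` with its `-remote` and `-entrez_query` options. The species list, gene terms and file names are placeholders, and the helper function is hypothetical.

```python
import shlex


def build_entrez_query(organisms, gene_terms=()):
    """Build an Entrez query restricting hits to the given organisms
    and, optionally, to records matching any of the gene terms."""
    org_part = " OR ".join(f'"{name}"[Organism]' for name in organisms)
    query = f"({org_part})"
    if gene_terms:
        gene_part = " OR ".join(f"{term}[All Fields]" for term in gene_terms)
        query += f" AND ({gene_part})"
    return query


# Placeholder species and gene terms; replace with your own targets.
query = build_entrez_query(["Bos taurus", "Sus scrofa"], ["COI", "COX1"])
command = [
    "blastn", "-db", "nt", "-remote",
    "-query", "reads.fasta",   # hypothetical input file
    "-entrez_query", query,
    "-outfmt", "6",
    "-out", "hits.tsv",        # hypothetical output file
]
# Print the command instead of running it, so it can be inspected first.
print(shlex.join(command))
```

A long species list makes the query unwieldy, so for many target species filtering a local copy of nt may still be the more practical route.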