Hello biologists and bioinformaticians around the world!
I'm having serious problems to accomplish a task.
I need to perform a megablast search from metagenomic reads (454) in order to determine the number of hits and extract the reads that mached the reference sequences of a specific genera.
Performing this, I need to compare the results with Bowtie2.
The first goal is choose between megablast and Bowtie, through number of hits.
The second goal is remove these sequences, which belong to this specific genera, from the original set of reads, in order to analyze the community with and without this genera.
First of all I tried to construct my custom database for megablast.
I downloaded from NCBI the list of fasta sequences and the list of GI's from my genera.
Next step was to try run the formatdb for custom databases. My code was:
formatdb -i nt -o T -p F
---> format the original database with parsing active --> it's OKformatdb -F microcystis_gis.txt -B mic_gis.gi
--> format my GI list to binary format --> it's OKformatdb -i nt -p F -L mic_gi -F mic_gis.gi -t my_mic_db
--> format database generating a alias for my GI listHere I have no success, the error was:
[formatdb] FATAL ERROR: Unable to find mic_gis.gi
formatdb -i microcystis_fasta.fasta -p F -n mic_db
--> no success againThen I tried fastacmd:
fastacmd -d nt -p F -i microcystis_gis.txt -D 1 -o microcystis.fasta
---> after a long time of processing, the result was a file with all the sequences, there wasn't the GI's filtering.Then I tried makeblastdb:
makeblastdb -in microcystis_fasta.fasta -dbtype 'nucl' –parse_seqids -title microcystis_db
and the error was: Error: Too many positional arguments (1), the offending value: –parse_seqids
I think that my problem can be the format of my input files. Maybe the header of my microcystis_fasta.fasta
isn't correct, here it is:
>gi|469475955|gb|KC166868.1| Shewanella putrefaciens strain DCH-5 16S ribosomal RNA gene, partial sequence
GTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTA
ATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTTGTTAAGCGAGATGTGAAAGCCCTGGGCTC......
and this file is with a empty line between each new sequence.
My GI list and my fasta file are the ones downloaded from NCBI, I dont understand how they can be in a wrong format.... Maybe it is necessary to change this files in some way? I can do this by Perl script, but I was wondering if you can tell something more effective before.
By constructing my database for Bowtie2, it worked, but with this warning:
Warning: Encountered empty reference sequence
But, as usually, Bowtie2 performed fine and worked.....
I think that this warning is a indication that there is something wrong with the format of my reference fasta file.
Thank you everyone, Brazilian greetings!!
Hello, thanx for the answer! Yes,
mic_gis.gi
is in the same directory. I always use right click to copy the name of the files to avoid typos.I performed makeblastdb with and without quotes, no success again...
Cool. Are you working on a Mac or a Linux machine? You might wanna start dabbling in the command line for basic file operations - you'll need the skills :)
Yes, sure
I ran again and it worked! Thanx, but I have this issue:
I look inside my file and found this:
Searching in the NCBI website I found this:
As I downloaded this fasta file from NCBI, this empty entry can cause a error in the database and must be removed or can I let it this?
Thank you
You can remove that record.
Or leave it, it does no harm.
Yes, I am working on Linux.
Now I am running blastall with this code:
it is running, but with this warning:
I think that I need to use some filtering for low complexity regions, am I correct?
I usually run megablast with formatdb, so I am not sure about the best parameters to mimic a Bowtie-like alignment on Blastall.
It will be great if you could help me with this filtering and masking on makeblastdb and blastall, since all my attempts with formatdb were unsuccessful.... Is it the case of using dustmasker and/or gi_mask on makeblastdb? The filtering for dust is a default parameter on blastall, so maybe this warning that I got may not concern low complexity regions problem?
Might be really small sequences in the DB and low complexity regions, yes. Not sure why you're trying to use BLAST though. Read your question, but I'm unable to understand the exact reason :(
yeah kkk I know, but it is superior orders
The final goal is remove all reads that (previously) match with Microcystis genera from the original metagenomic data set (5 temporal samples). This genera is dominant in my study area and our hypothesis is that, entering the data without it, the functional profile generated my MEGAN will be from the associated community.What I need is a blast output without any hit on this genera. This will change all my biodiversity profile and maybe generate a different functional profile for each sample (without removing Microcystis, all samples appear with the exactly same functional profile).
Bowtie2 performs very well, but in the exact opposite way. It align short reads very well, but is not the same algorithm as blast, so, even I get the list of reads that match in Bowtie2, I will need to do a final blast search anyway.
Therefore, right now I need to find the exact hits for each method and compare. The one with a larger number of matches (under the same parameters) will be choosen (and that reads removed). All of this to ensure that the MEGAN LCA algorithm will not find, in some way, a remaining Microcystis read.
Ah, I see. I am not sure what the problem with the BLAST run is. Any luck with that?
I got my list of reads that match from blastn and Bowtie2 in default mode and mapping quality=>10, but I am running Bowtie2 in a mode now, because I want all the alignments, not just the best.
I filtered my blastn output by identity and score, extracted the reads to a list and now I am fighting with perl codes and unix commands to remove this reads from my original fasta file.
I found some interesting ways to do this (in this forum):
But I don't understand how I put this in a file, neither if this file will contain that sequences or exclude it.
I found something about unix sort and join, but I am confuse about the sort, the kind of sort that it does, and I am studying how to use the join command.
I found a python script from a member, Eric Normandeau, but I couldn't run it because some package limitations.
How To Remove Certain Sequences From A Fasta File
Finally I found a perl script selectSeqs.pl
http://raven.iab.alaska.edu/~ntakebay/teaching/programming/perl-scripts/perl-scripts.html
that seems to work. I used it to remove the sequences, but I am not sure if this will work. I am running blastn of this
no_microcystis
fasta file against nt now, lets see what happens next :)Thanx
I'll deal with your questions one by one.
a.
xargs samtools faidx X.fasta <Y.txt
will, for each line from Y.txt, pass the line as input to
samtools faidx X.fasta <line>
. This will extract fromX.fasta
a region specified by the line (assuming the line is of the formatheader:start-end
(e.g.:gi|390408878|gb|JQ749704.1| 1-100
)If you get this to work, you might not need sort or join. sort sorts a file, join merges columns from multiple files. They are not of much use here on their own.
Oh, I see you have a Perl script that works. I guess I should have read thru your entire post first. And we can solve the Python problem - just give me the exact message.
Hello, yes, the perl script worked.
This samtools faidx seems to extract that sequences from my reads list, but what I want is the opposite, I need all the sequences except those.
Well, I am now waiting for the blastn finishes, its a huge fasta file, even without my target sequences.
for now, my pipeline is:
waiting for the blastn output, I will tell you if worked!
Thanx a lot!
Cool! Glad your path is clear now.