Help with custom database of a genera and count of hits (NGS, 454)
2
0
Entering edit mode
10.0 years ago
marcelelaux ▴ 20

Hello biologists and bioinformaticians around the world!

I'm having serious problems to accomplish a task.

I need to perform a megablast search from metagenomic reads (454) in order to determine the number of hits and extract the reads that mached the reference sequences of a specific genera.

Performing this, I need to compare the results with Bowtie2.

The first goal is choose between megablast and Bowtie, through number of hits.

The second goal is remove these sequences, which belong to this specific genera, from the original set of reads, in order to analyze the community with and without this genera.

First of all I tried to construct my custom database for megablast.

I downloaded from NCBI the list of fasta sequences and the list of GI's from my genera.

Next step was to try run the formatdb for custom databases. My code was:

  1. formatdb -i nt -o T -p F ---> format the original database with parsing active --> it's OK
  2. formatdb -F microcystis_gis.txt -B mic_gis.gi --> format my GI list to binary format --> it's OK
  3. formatdb -i nt -p F -L mic_gi -F mic_gis.gi -t my_mic_db --> format database generating a alias for my GI list

    Here I have no success, the error was: [formatdb] FATAL ERROR: Unable to find mic_gis.gi

  4. formatdb -i microcystis_fasta.fasta -p F -n mic_db --> no success again

    Then I tried fastacmd:

  5. fastacmd -d nt -p F -i microcystis_gis.txt -D 1 -o microcystis.fasta ---> after a long time of processing, the result was a file with all the sequences, there wasn't the GI's filtering.

    Then I tried makeblastdb:

  6. makeblastdb -in microcystis_fasta.fasta -dbtype 'nucl' –parse_seqids -title microcystis_db

and the error was: Error: Too many positional arguments (1), the offending value: –parse_seqids

I think that my problem can be the format of my input files. Maybe the header of my microcystis_fasta.fasta isn't correct, here it is:

>gi|469475955|gb|KC166868.1| Shewanella putrefaciens strain DCH-5 16S ribosomal RNA gene, partial sequence
GTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTA
ATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTTGTTAAGCGAGATGTGAAAGCCCTGGGCTC......

and this file is with a empty line between each new sequence.

My GI list and my fasta file are the ones downloaded from NCBI, I dont understand how they can be in a wrong format.... Maybe it is necessary to change this files in some way? I can do this by Perl script, but I was wondering if you can tell something more effective before.

By constructing my database for Bowtie2, it worked, but with this warning:

Warning: Encountered empty reference sequence

But, as usually, Bowtie2 performed fine and worked.....

I think that this warning is a indication that there is something wrong with the format of my reference fasta file.

Thank you everyone, Brazilian greetings!!

formatdb fastacmd next-gen megablast • 4.8k views
ADD COMMENT
2
Entering edit mode
10.0 years ago
Michael 55k
makeblastdb -in microcystis_fasta.fasta -dbtype 'nucl' –parse_seqids -title microcystis_db
                                                       ^

The highlighted character is not a dash -, but some unicode dash (), not interpreted correctly, delete and retype, copy-pasting has some downsides ;)

not visible in web-sites, but when pasted into the terminal you can see it's a bit longer

ADD COMMENT
0
Entering edit mode
10.0 years ago
Ram 44k

I see a few possible bugs here:

3. Please check you have the mic_gis.gi file in the same directory. Ensure you don't have any typos in the file name and that the file is in the same directory in which you're running the command

6. the value nucl does not need to be quoted. The quoting confuses the program and it seems to then go on and mistake -parse_seqids as an input parameter instead of a flag.

ADD COMMENT
0
Entering edit mode

Hello, thanx for the answer! Yes, mic_gis.gi is in the same directory. I always use right click to copy the name of the files to avoid typos.

I performed makeblastdb with and without quotes, no success again...

ADD REPLY
0
Entering edit mode

Cool. Are you working on a Mac or a Linux machine? You might wanna start dabbling in the command line for basic file operations - you'll need the skills :)

ADD REPLY
0
Entering edit mode

Yes, sure

I ran again and it worked! Thanx, but I have this issue:

Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: No residues given
Ignoring sequence 'gi|389772283' as it has no sequence data

I look inside my file and found this:

>gi|389772283|emb|CAIK00000000.1|CAIK01000000 Microcystis aeruginosa PCC 7941, whole genome shotgun sequencing project

>gi|390408878|gb|JQ749704.1| Microcystis aeruginosa PCC 7941 alkaline phosphatase (phoX) gene, partial cds
AGTCGCGGGGATG.....

Searching in the NCBI website I found this:

This entry is the master record for a whole genome shotgun sequencing project and contains no sequence data

As I downloaded this fasta file from NCBI, this empty entry can cause a error in the database and must be removed or can I let it this?

Thank you

ADD REPLY
0
Entering edit mode

You can remove that record.

ADD REPLY
0
Entering edit mode

Or leave it, it does no harm.

ADD REPLY
0
Entering edit mode

Yes, I am working on Linux.

Now I am running blastall with this code:

blastall -i RL1quality25.fasta -d microcystis_fasta.fasta -e 1e-5 -p blastn -b 1 -v 1 -m 8 -o RL1_1_mic.m8

it is running, but with this warning:

[blastall] WARNING: HD4RJIW01DZ8Z8: Could not calculate ungapped Karlin-Altschul parameters due to an invalid query sequence or its translation. Please verify the query sequence(s) and/or filtering options

I think that I need to use some filtering for low complexity regions, am I correct?

I usually run megablast with formatdb, so I am not sure about the best parameters to mimic a Bowtie-like alignment on Blastall.

It will be great if you could help me with this filtering and masking on makeblastdb and blastall, since all my attempts with formatdb were unsuccessful.... Is it the case of using dustmasker and/or gi_mask on makeblastdb? The filtering for dust is a default parameter on blastall, so maybe this warning that I got may not concern low complexity regions problem?

ADD REPLY
0
Entering edit mode

Might be really small sequences in the DB and low complexity regions, yes. Not sure why you're trying to use BLAST though. Read your question, but I'm unable to understand the exact reason :(

ADD REPLY
0
Entering edit mode

yeah kkk I know, but it is superior orders

The final goal is remove all reads that (previously) match with Microcystis genera from the original metagenomic data set (5 temporal samples). This genera is dominant in my study area and our hypothesis is that, entering the data without it, the functional profile generated my MEGAN will be from the associated community.What I need is a blast output without any hit on this genera. This will change all my biodiversity profile and maybe generate a different functional profile for each sample (without removing Microcystis, all samples appear with the exactly same functional profile).

Bowtie2 performs very well, but in the exact opposite way. It align short reads very well, but is not the same algorithm as blast, so, even I get the list of reads that match in Bowtie2, I will need to do a final blast search anyway.

Therefore, right now I need to find the exact hits for each method and compare. The one with a larger number of matches (under the same parameters) will be choosen (and that reads removed). All of this to ensure that the MEGAN LCA algorithm will not find, in some way, a remaining Microcystis read.

ADD REPLY
0
Entering edit mode

Ah, I see. I am not sure what the problem with the BLAST run is. Any luck with that?

ADD REPLY
0
Entering edit mode

I got my list of reads that match from blastn and Bowtie2 in default mode and mapping quality=>10, but I am running Bowtie2 in a mode now, because I want all the alignments, not just the best.

I filtered my blastn output by identity and score, extracted the reads to a list and now I am fighting with perl codes and unix commands to remove this reads from my original fasta file.

I found some interesting ways to do this (in this forum):

xargs samtools faidx RL1quality25.fasta < RL1_mic_reads.txt

But I don't understand how I put this in a file, neither if this file will contain that sequences or exclude it.

I found something about unix sort and join, but I am confuse about the sort, the kind of sort that it does, and I am studying how to use the join command.

I found a python script from a member, Eric Normandeau, but I couldn't run it because some package limitations.

How To Remove Certain Sequences From A Fasta File

Finally I found a perl script selectSeqs.pl

http://raven.iab.alaska.edu/~ntakebay/teaching/programming/perl-scripts/perl-scripts.html

that seems to work. I used it to remove the sequences, but I am not sure if this will work. I am running blastn of this no_microcystis fasta file against nt now, lets see what happens next :)

Thanx

ADD REPLY
0
Entering edit mode

I'll deal with your questions one by one.

a. xargs samtools faidx X.fasta <Y.txt

will, for each line from Y.txt, pass the line as input to samtools faidx X.fasta <line>. This will extract from X.fasta a region specified by the line (assuming the line is of the format header:start-end (e.g.: gi|390408878|gb|JQ749704.1| 1-100)

If you get this to work, you might not need sort or join. sort sorts a file, join merges columns from multiple files. They are not of much use here on their own.

Oh, I see you have a Perl script that works. I guess I should have read thru your entire post first. And we can solve the Python problem - just give me the exact message.

ADD REPLY
0
Entering edit mode

Hello, yes, the perl script worked.

This samtools faidx seems to extract that sequences from my reads list, but what I want is the opposite, I need all the sequences except those.

Well, I am now waiting for the blastn finishes, its a huge fasta file, even without my target sequences.

for now, my pipeline is:

makeblastdb -in microcystis_fasta.fasta -dbtype nucl -parse_seqids -title microcystis
blastall -i RL1quality25.fasta -d microcystis_fasta.fasta -e 1e-5 -p blastn -b 1 -v 1 -m 8 -o RL1_1_mic.m8
cat RL1_1_mic.m8 | awk '$3>95 && $4>50 {print $0}' > RL1_1_mic_filtered.m8
cat RL1_1_mic_filtered.m8 | awk '{print $1}' > RL1_1_mic_reads.txt
perl extract_seq.pl
perl extract_seq.pl -v -f RL1_1_mic_reads.txt RL1quality25.fasta > RL1quality25_no_mic.fasta
blastall -i RL1quality25_no_mic.fasta -d microcystis_fasta.fasta -e 1e-5 -p blastn -m 8 -o RL1_1_no_mic.m8

waiting for the blastn output, I will tell you if worked!

Thanx a lot!

ADD REPLY
0
Entering edit mode

Cool! Glad your path is clear now.

ADD REPLY

Login before adding your answer.

Traffic: 1476 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6