Error in RGI-CARD database
4.4 years ago
flo21 • 0

Hi,

I have some metagenomic .fasta files that I'm trying to analyze with RGI and the CARD database (https://github.com/arpcard/rgi), which states that .fasta and .fasta.gz are accepted as input sequence formats.

My smallest paired-end file is 5.6 GB. I ran the analysis as a test, but partway through the computer ran out of disk space, and I think that caused the analysis to be cut short (a). As a result I do get the 2 output files, .json and .txt, but both are empty.

I tried compressing the file with

$ gzip filename

But when using the ##.fasta.gz file the analysis is not even carried out, because it reports "no support for this format" (b). I have now tried in both Linux and macOS terminals and still get the same result. I don't have a clue what I'm doing wrong, so any advice or suggestion would be much appreciated.

Observations from the run: during the analysis with the .fasta file I can see 5 temporary files (##.fasta.temp, ##.fasta.temp.potentialGenes, ##.fasta.temp.contigToORF.fsa, ##.fasta.temp.contig.fsa, ##.fasta.temp.contig.fsa.blastRes.xml). Some of them are really large, ~55 GB (is that normal?).

(a).

Error: [blastp] Failed s_BlastXMLAddIteration Q(0/1
Process Process-1:4:
Traceback (most recent call last):
  File "/Users/anaconda3/envs/rgi2/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/Users/anaconda3/envs/rgi2/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/anaconda3/envs/rgi2/lib/python3.6/site-packages/app/Filter.py", line 116, in process_rrna
    self.format_fasta()
  File "/Users/anaconda3/envs/rgi2/lib/python3.6/site-packages/app/Filter.py", line 160, in format_fasta
    fout.write(">{}\n{}\n".format(header, seq))
OSError: [Errno 28] No space left on device
WARNING 2020-08-15 15:30:56,939 : Exception: <class 'OSError'> -> [Errno 28] No space left on device -> model_type: homolog
WARNING 2020-08-15 15:31:14,101 : Exception: <class 'OSError'> -> [Errno 28] No space left on device -> model_type: overexpression
WARNING 2020-08-15 20:49:47,327 : Exception: <class 'xml.parsers.expat.ExpatError'> -> unclosed token: line -2047941080, column 23 -> model_type: variant

(b).

ERROR 2020-08-14 12:17:04,726 : gz
ERROR 2020-08-14 12:17:04,726 : application/gzip
WARNING 2020-08-14 12:17:04,726 : Sorry, no support for this format.
software error fasta fasta.gz card • 2.1k views

My smallest paired-end file is 5.6 GB.

If your file is paired-end and is 5.6 GB, it is probably a fastq (not fasta) with sequencing reads. You don't show the command line you used, but it seems to me you are trying to run sequencing reads through rgi main, which accepts --input_type contig or --input_type protein. You are then running out of disk space:

WARNING 2020-08-15 15:30:56,939 : Exception: <class 'OSError'> -> [Errno 28] No space left on device -> model_type: homolog
WARNING 2020-08-15 15:31:14,101 : Exception: <class 'OSError'> -> [Errno 28] No space left on device -> model_type: overexpression

Even if you didn't, a BLAST search with a 5.6 GB input file would take a very, very long time.

You can use fastq files with rgi bwt, which has the following warning:

This is an unpublished algorithm undergoing beta-testing.


Thanks for your input!

I could try the analysis using .fastq files as you recommend. However, since fastq files are larger, I didn't have much hope after seeing that disk space is already a problem with the uncompressed fasta files.

Yes, this is the command line I'm trying:

rgi main --input_sequence /path/to/nucleotide_input.fasta --output_file /path/to/output_file --input_type contig --local --clean

I originally had myForward_sequence.fastq and myRevervse_sequence.fastq, which I converted and merged into my 5.6 GB fasta. I converted each file as follows:

sed -n '1~4s/^@/>/p;2~4p' in.fastq > out.fasta

Merge them:

cat myForward_sequence.fasta myRevervse_sequence.fasta > my.fasta
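Note that the first~step address form (1~4, 2~4) in the sed command is a GNU sed extension, so the same line may not work with the default sed on macOS. A portable alternative is a few lines of Python. The sketch below assumes strict 4-line FASTQ records (no wrapped sequences); fastq_to_fasta and open_text are hypothetical helper names. It only reproduces the conversion step and says nothing about whether raw reads are a suitable input for a given tool.

```python
import gzip


def open_text(path, mode="rt"):
    """Open a plain or gzip-compressed text file, chosen by the .gz extension."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)


def fastq_to_fasta(fastq_path, fasta_path):
    """Convert 4-line FASTQ records to FASTA; return the number of records."""
    count = 0
    with open_text(fastq_path, "rt") as fin, open_text(fasta_path, "wt") as fout:
        while True:
            header = fin.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # skip stray blank lines between records
            if not header.startswith("@"):
                raise ValueError(f"expected '@' header, got: {header!r}")
            seq = fin.readline()
            fin.readline()  # '+' separator line, discarded
            fin.readline()  # quality line, discarded
            fout.write(">" + header[1:].rstrip("\n") + "\n")
            fout.write(seq.rstrip("\n") + "\n")
            count += 1
    return count
```

For example, fastq_to_fasta("myForward_sequence.fastq", "myForward_sequence.fasta") writes one header line and one sequence line per read, the same output shape as the sed command.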
4.4 years ago
h.mon 35k

You have to assemble the genome (Shovill is very fast and light on resources) and use the assembled contigs (then use --input_type contig), or predict the proteins after assembling the genome (then use --input_type protein). You cannot use the reads, either in fasta or fastq, with rgi main; this is not what it was designed for.

If you want to use the reads without assembling, then you need to use rgi bwt.
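As a rough illustration of the read-based route, the sketch below assembles an rgi bwt command line for the original paired fastq files. It is a sketch under assumptions: the flag names (--read_one, --read_two, --aligner, --output_file, --threads, --local, --clean) are written from memory of the RGI README and should be checked against rgi bwt --help for the installed version; build_rgi_bwt_command is a hypothetical helper, and the output prefix and thread count are placeholders.

```python
import shlex


def build_rgi_bwt_command(read_one, read_two, output_prefix,
                          aligner="bowtie2", threads=4):
    """Return the argument list for an `rgi bwt` run on paired-end reads.

    Flag names are assumptions to verify with `rgi bwt --help`.
    """
    return [
        "rgi", "bwt",
        "--read_one", str(read_one),      # forward reads (fastq)
        "--read_two", str(read_two),      # reverse reads (fastq)
        "--aligner", aligner,             # read aligner to use
        "--output_file", str(output_prefix),
        "--threads", str(threads),
        "--local",                        # use the locally loaded database
        "--clean",                        # remove temporary files afterwards
    ]


if __name__ == "__main__":
    cmd = build_rgi_bwt_command(
        "myForward_sequence.fastq",
        "myRevervse_sequence.fastq",
        "/path/to/output_file",
    )
    # Print a copy-pasteable shell command; run it only after checking the flags.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the command as a list keeps the paths safely quoted and makes it easy to pass to subprocess.run once the options have been confirmed.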
