Dear all,
I have a FASTA file with more than 200 million protein sequences that I would like to cluster into a non-redundant catalogue (100% identity) using cd-hit. Since the file is so big, I thought cd-hit-para.pl could be a good option to speed this up. At my institution we use SGE, and I was trying to run the qsub script below to send the job to a queue, but without success (error message: "no host at /bin/cd-hit-para.pl line 97"). I followed the user guide (http://www.bioinformatics.org/cd-hit/cd-hit-user-guide.pdf), but I don't think I understood how to use it correctly. Could you share an example of how to run cd-hit-para.pl under SGE, or suggest a better way to run cd-hit on a file this large?
Script:
#!/bin/bash
#$ -N cdhit
#$ -o /output/logs/$JOB_NAME_$JOB_ID.out
#$ -e /output/error/$JOB_NAME_$JOB_ID.err
#$ -l virtual_free=20G,h_vmem=20G,h_rt=6:00:00
#$ -q long-sl7
#$ -pe smp 8

cd-hit-para.pl -i file.faa -o file_100.faa -c 1.0 -M 20000 -T $NSLOTS --T "SGE"-Q 20
Command line:
$ qsub cdhit
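
For reference, this is the plain single-node cd-hit command I would otherwise run on this file (just a sketch of my intent, reusing the same flags as in the script above, with -M in MB and -T as the thread count):

# plain cd-hit at 100% identity, ~20 GB memory limit, 8 threads (sketch of what I am trying to parallelize)
cd-hit -i file.faa -o file_100.faa -c 1.0 -M 20000 -T 8

My assumption was that cd-hit-para.pl splits the input into segments and runs something equivalent on each segment through the queue, which is why I passed those same options through to it.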