Entering edit mode
3.4 years ago
foxiw
▴
10
Hi.
I am trying to generate a index genome for my alignment, however I am running into a problem that isn't making any sense. This is the code in my slurm script:
#!/bin/bash
#SBATCH --partition=defq # the requested queue
#SBATCH --nodes=1 # number of nodes to use
#SBATCH --tasks-per-node=1 # for parallel distributed jobs
#SBATCH --cpus-per-task=4 # for multi-threaded jobs
#SBATCH --mem-per-cpu=4G # in megabytes, unless unit explicitly stated
#SBATCH --error=%J.err # redirect stderr to this file
#SBATCH --output=%J.out # redirect stdout to this file
#SBATCH --mail-user=foxiw@cardiff.ac.uk # email address used for event notification
#SBATCH --mail-type=BEGIN,END,FAIL # email on job start, end, and/or failure
# Load modules
module load STAR/2.7.3a
export RefDir=/mnt/scratch/c1818206/fastqs/merged_files/trimmedfiles/trimmedfiles_final
## Change --sjdbOverhang to length of your sequence data minus 1
STAR --runThreadN ${SLURM_CPUS_PER_TASK} \
--limitGenomeGenerateRAM 31G \
--runMode genomeGenerate \
--genomeDir $RefDir/ \
--genomeFastaFiles $RefDir/Mus_musculus.GRCm39.dna.primary_assembly.fa \
--sjdbGTFfile $RefDir/Mus_musculus.GRCm39.104.gtf \
--sjdbOverhang 75
However, I get this error when I run it on the server:
EXITING because of INPUT ERROR: the file format of the genomeFastaFile: /mnt/scratch/c1818206/fastqs/merged_files/trimmedfiles/trimmedfiles_final/Mus_musculus.GRCm39.dna.primary_assembly.fa is not fasta: the first character is '^_' (31), not '>'.
Solution: check formatting of the fasta file. Make sure the file is uncompressed (unzipped).
Jun 17 17:53:03 ...... FATAL ERROR, exiting
This error makes no sense to me. I used to get it when I tried to run the code when the assembly and annotation files were zipped. I then used gunzip to unzip both, changed my code to the one above, but I still get this error. It makes no sense to me? Any help would be much appreciated.
What is the output of:
Hi, I can your coe and got this output. Could it be that my file is corrupted?
Your fasta file is not plain text. Are you sure it was gzipped and you gunzipped it properly? What is the output of:
It says that its gzip compressed data, so obviously it wasn't gunzipped properly! Thanks. How would I go about doing it properly then? Last time I wrote:
Thanks again.
That should have worked. Try
gzip -dc Mus_musculus.GRCm39.dna.primary_assembly.fa.gz > Mus_musculus.GRCm39.dna.primary_assembly.fa && echo "Completed Successfully"
so you have both files and can ensure gzip ran to completion properly.I did as you said and the echo was printed, I then re-ran the file comannd, however I now get this:
I used less command and it is indeed empty.
EDIT:
I guess you're downloading from EnsEMBL FTP and not GENCODE FTP. In that case, here's the code you need:
It seems that EnsEMBL FTP's CHECKSUMS are not MD5SUMS. I can't figure out what the contents in the CHECKSUMS file means.
Can you try re-downloading and decompressing the FASTA file by running this:
The above block ensures the download completed successfully and that the transfer was not corrupted in transit.
I think that may have worked. The output of the last file command states its ASCII file. I'm going to re-run my script and see if it works this time. Thank you so much for your help. Could I run a similar script to get the annotation file (gencode.vM27.primary_assembly.annotation.gtf.gz)?
Thanks again
Yep, that's worked now. It must have become corrupted when I first downloaded the files. Thanks again for your help :)
No problem. You can do this for the GTF file too. By the way, the CHECKSUMS provided by EnsEMBL are generated using the
sum
command (I only knew thatmd5sum
andsha1sum
commands).You may have to manually verify though,
sum
does not seem to have the-c
equivalent ofmd5sum
.