Dear all,
I downloaded Homo_sapiens.GRCh38.dna.toplevel.fa.gz from Ensembl, but when I tried to build the index with Bowtie (and also when I opened the file in a text editor) the file appeared to be empty: there are only Ns.
What did I get wrong?
Thank you,
Luigi
What size does the file show on your computer? If you saw Ns, then the file cannot be 'empty'. Most human chromosome sequences have Ns at the ends because the telomere sequences are repetitive and difficult to sequence, so the first part of the chr1 sequence, for example, would be largely Ns.
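To check whether a record really is nothing but Ns, rather than just starting with a long run of them, you could tally the bases per sequence. This is only a minimal sketch: the function name `summarize_fasta` is made up for illustration, and it assumes an ordinary FASTA file (optionally gzipped) with `>` header lines. On a whole human genome it may take a while.

```python
import gzip

def summarize_fasta(path):
    """Return {sequence_name: (total_bases, n_bases)} for a FASTA file.

    Hypothetical helper: reads plain or gzipped FASTA line by line, so it
    never holds a whole chromosome in memory.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    stats = {}          # name -> [total_bases, n_bases]
    current = None
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first word of the header as the sequence name.
                current = line[1:].split()[0]
                stats.setdefault(current, [0, 0])
            elif current is not None:
                stats[current][0] += len(line)
                stats[current][1] += line.upper().count("N")
    return {name: (total, n) for name, (total, n) in stats.items()}
```

A record whose two numbers are equal is all Ns; a record where the second number is much smaller than the first has real sequence after the leading gap.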
The Homo_sapiens.GRCh38.dna.toplevel.fa.gz file is 1.0 GB, but when I run Bowtie2 with
bowtie2-build -f ./Homo_sapiens.GRCh38.dna.toplevel.fa.gz ./GRCh38_idx.fa
the output is
...
Warning: Encountered reference sequence with only gaps
Warning: Encountered empty reference sequence
Time reading reference sizes: 00:00:26
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
Reference file does not seem to be a FASTA file
Time to join reference sequences: 00:00:01
Total time for call to driver() for forward index: 00:00:27
Error: Encountered internal Bowtie 2 exception (#1)
Command: bowtie2-build --wrapper basic-0 -f ./refSeq/Homo_sapiens.GRCh38.dna.toplevel.fa.gz ./refSeq/GRCh38_idx.fa
Deleting "./refSeq/GRCh38_idx.fa.3.bt2" file written during aborted indexing attempt.
Deleting "./refSeq/GRCh38_idx.fa.4.bt2" file written during aborted indexing attempt.
Deleting "./refSeq/GRCh38_idx.fa.1.bt2" file written during aborted indexing attempt.
Deleting "./refSeq/GRCh38_idx.fa.2.bt2" file written during aborted indexing attempt.
and no index file is generated. The same thing happens with GRCh37. When opened in a text editor, the first 100,001 lines are:
>CHR_HSCHR15_4_CTG8 dna:chromosome chromosome:GRCh38:CHR_HSCHR15_4_CTG8:1:102071387:1 HAP
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN (x100000)
If your gz file is about 1.0 GB, then it's probably about the right size.
Maybe the files are somehow corrupted, or something is going wrong with the download.
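One way to rule out a truncated or corrupted download is to read the whole gzip stream and see whether it decompresses cleanly. The sketch below uses only the Python standard library; the function name `gzip_is_intact` is hypothetical. If the download site publishes checksums, comparing against those is a stronger test.

```python
import gzip
import zlib

def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the gzip file at `path` decompresses without error.

    Hypothetical helper: streams the file in chunks and discards the data,
    so it only checks that the compressed stream is complete and that its
    internal CRC matches.
    """
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers gzip.BadGzipFile (bad header or CRC mismatch);
        # EOFError is raised when the file ends before the stream does.
        return False
    return True
```

If this returns False, downloading the file again is the obvious next step.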
For what it's worth, the copy of the hg19 genome I downloaded from the Broad Institute GATK resource bundle has 4.79 million lines composed entirely of Ns, out of a total 62.7 million lines. (Lines are 50 characters wide in this fasta file).
Could the problem be that bowtie2-build doesn't work with zipped files?
What version of bowtie2 are you using?
The manual I am using states that zipped files are accepted. In fact the file must be getting opened, otherwise there would not be the warnings about empty sequences. The "Encountered internal Bowtie 2 exception (#1)" worries me; from what I found on Google, it seems to mean that Bowtie cannot find a file, possibly the output file ./GRCh38_idx.fa?
I think it's not finding a file because it doesn't recognize the fasta.gz as a fasta file.
I don't know which manual you are using but the online bowtie2 manual,
http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#the-bowtie2-build-indexer
doesn't say anything about zipped fasta files as input for bowtie2-build.
This has been discussed in previous Biostars threads:
Can bowtie2-build index for gzipped file?
I think what you can try is:
gunzip the fa.gz file and see if bowtie2-build will work with that.
If you are not sure that your genome file is OK, download a human genome file from another source, as suggested by Shicheng Guo above, and try whether bowtie2-build works with that.
Or else download the pre-built human genome Bowtie2 indexes from the Bowtie2 web page.
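The first suggestion (decompress, then index the plain FASTA) could be scripted roughly as follows. This is a sketch under assumptions: the helper names `gunzip_to` and `bowtie2_build_command` and the index base name `GRCh38_idx` are made up for illustration, and it assumes `bowtie2-build` is on your PATH. The command itself is just the usual `bowtie2-build -f <reference.fa> <index_base>` form.

```python
import gzip
import shutil
import subprocess

def gunzip_to(src_gz, dest):
    """Decompress `src_gz` to `dest` in chunks, leaving the original in place."""
    with gzip.open(src_gz, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest

def bowtie2_build_command(fasta, index_base):
    """Build the argument list for indexing an uncompressed FASTA file."""
    return ["bowtie2-build", "-f", fasta, index_base]

# Example usage (not run here; file names are placeholders):
# fasta = gunzip_to("Homo_sapiens.GRCh38.dna.toplevel.fa.gz",
#                   "Homo_sapiens.GRCh38.dna.toplevel.fa")
# subprocess.run(bowtie2_build_command(fasta, "GRCh38_idx"), check=True)
```

Keep in mind that the uncompressed genome is many times larger than the .gz file, so check free disk space first.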
Download the GRCh38 from UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromFa.tar.gz