GRCh38 toplevel fasta is empty
8.8 years ago

Dear all,
I downloaded Homo_sapiens.GRCh38.dna.toplevel.fa.gz from Ensembl, but when I built the index with Bowtie (and also when I opened the file in a text editor) the file appeared to be empty: there are only Ns.
What did I get wrong?
Thank you,
Luigi

rna-seq Assembly • 4.0k views
8.8 years ago
mastal511 ★ 2.1k

What file size does your computer report? If you saw Ns, then the file cannot be 'empty'. Most human chromosome sequences have Ns at the ends because the telomere sequences are repetitive and difficult to sequence, so the first part of the chr1 sequence, for example, would be largely Ns.
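One way to check this without opening the file in an editor is to count N and non-N bases per sequence. The following is only a minimal sketch: the function name `fasta_n_summary` is made up for illustration, and it assumes `gzip` and `awk` are available on your system.

```shell
# Hypothetical helper: for each sequence in a gzipped FASTA, print
# "<name> <N count> <non-N count>" on one line.
fasta_n_summary() {
  gzip -dc "$1" | awk '
    /^>/ {
      # New header: report the previous record, then reset the counters.
      if (name != "") print name, n, other
      name = substr($1, 2); n = 0; other = 0
      next
    }
    {
      line = toupper($0)
      total = length(line)
      ncount = gsub(/N/, "", line)   # gsub returns the number of replacements
      n += ncount
      other += total - ncount
    }
    END { if (name != "") print name, n, other }'
}
```

For example, `ls -lh Homo_sapiens.GRCh38.dna.toplevel.fa.gz` shows the size on disk, and `fasta_n_summary Homo_sapiens.GRCh38.dna.toplevel.fa.gz | head` lists the first few sequences. A sequence that starts with a long run of Ns should still show a non-zero count in the last column if real bases follow further down.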


The Homo_sapiens.GRCh38.dna.toplevel.fa.gz file is 1.0 GB, but when I run Bowtie2 with

bowtie2-build -f ./Homo_sapiens.GRCh38.dna.toplevel.fa.gz ./GRCh38_idx.fa

the output is

...
Warning: Encountered reference sequence with only gaps
Warning: Encountered empty reference sequence
  Time reading reference sizes: 00:00:26
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
Reference file does not seem to be a FASTA file
  Time to join reference sequences: 00:00:01
Total time for call to driver() for forward index: 00:00:27
Error: Encountered internal Bowtie 2 exception (#1)
Command: bowtie2-build --wrapper basic-0 -f ./refSeq/Homo_sapiens.GRCh38.dna.toplevel.fa.gz ./refSeq/GRCh38_idx.fa 
Deleting "./refSeq/GRCh38_idx.fa.3.bt2" file written during aborted indexing attempt.
Deleting "./refSeq/GRCh38_idx.fa.4.bt2" file written during aborted indexing attempt.
Deleting "./refSeq/GRCh38_idx.fa.1.bt2" file written during aborted indexing attempt.
Deleting "./refSeq/GRCh38_idx.fa.2.bt2" file written during aborted indexing attempt.

and no file is generated. Same thing when using GRCh37. When opened with a text editor the first 100,001 lines are:

>CHR_HSCHR15_4_CTG8 dna:chromosome chromosome:GRCh38:CHR_HSCHR15_4_CTG8:1:102071387:1 HAP
 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN (x100000)

If your gz file is about 1.0 GB, then it's probably about the right size.

Maybe the files are somehow corrupted, or something is going wrong with the download.
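A quick way to rule out a truncated or corrupted download is to test the gzip stream itself. This is just a sketch: `check_gzip` is a made-up name, and it relies only on the standard `gzip -t` integrity test.

```shell
# Hypothetical helper: report whether a .gz file passes gzip's integrity test.
check_gzip() {
  if gzip -t "$1" 2>/dev/null; then
    echo "OK: $1"
  else
    echo "CORRUPT: $1"
    return 1
  fi
}
```

Running `check_gzip Homo_sapiens.GRCh38.dna.toplevel.fa.gz` only confirms that the archive decompresses cleanly. If the download site publishes checksums, comparing against those is a stronger check.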

For what it's worth, the copy of the hg19 genome I downloaded from the Broad Institute GATK resource bundle has 4.79 million lines composed entirely of Ns, out of a total 62.7 million lines. (Lines are 50 characters wide in this fasta file).

Could the problem be that bowtie2-build doesn't work with gzipped files?

What version of bowtie2 are you using?


The manual I am using states that gzipped files are accepted. In fact, the file must have been opened, otherwise there would be no warnings about empty sequences. The "Encountered internal Bowtie 2 exception (#1)" error worries me; from what I found on Google, it means Bowtie cannot find a file, possibly the output file ./GRCh38_idx.fa?


I think it's not finding a file because it doesn't recognize the fasta.gz as a fasta file.

I don't know which manual you are using but the online bowtie2 manual,

http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#the-bowtie2-build-indexer

doesn't say anything about gzipped FASTA files as input for bowtie2-build.

This has been discussed in previous Biostars threads:

Can bowtie2-build index for gzipped file?

I think what you can try is:

  1. gunzip the fa.gz file and see if bowtie2-build will work with that.

  2. If you are not sure that your genome file is OK, download a human genome file from another source, as suggested by Shicheng Guo above, and see whether bowtie2-build works with that.

  3. Or else download the pre-built human genome Bowtie2 indexes from the Bowtie2 web page.
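Option 1 could look roughly like the sketch below. The function names `decompress_fasta` and `build_index` are made up for illustration, and the index basename `GRCh38_idx` is only an example. The sketch assumes `bowtie2-build` is on your PATH.

```shell
# Decompress a gzipped FASTA to an explicit output path.
# Writing via stdout keeps the original .gz file in place.
decompress_fasta() {
  gzip -dc "$1" > "$2"
}

# Build a Bowtie 2 index from a plain (uncompressed) FASTA file.
# $1 is the FASTA file, $2 is the index basename (a prefix, not a file name).
build_index() {
  bowtie2-build -f "$1" "$2"
}

# Example usage (not executed here):
#   decompress_fasta Homo_sapiens.GRCh38.dna.toplevel.fa.gz Homo_sapiens.GRCh38.dna.toplevel.fa
#   build_index Homo_sapiens.GRCh38.dna.toplevel.fa GRCh38_idx
```

Keeping the two steps separate makes it easy to inspect the decompressed file (for example with `head`) before starting a long indexing run.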


Yes, that must have been the problem. When I used the unzipped version of the file, it worked. Thank you.
