Please help! I am trying to obtain fastq files from the GDSC, all we have in the lab is CRAM files. Unfortunately, the reference genome seems to not exist when pulled from an online source. I have attempted to use the Samtools code which allows me to put in my own reference but I get an error message
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0 10507..19965
[E::cram_next_slice] Failure to decode slice
[main_samview] truncated file.
Has your copy of samtools (htslib actually) been compiled with curl support? If you just did "make" the answer will probably be no. You need ./configure;make. It's dumb and an annoyance of mine that we have two different build routes that by default choose two different combinations of features. Argh. This is almost certainly what the "protocol not supported" means as without curl it won't be able to handle https; only http.
If you do indeed have a curl capable samtools, then is it maybe failing due to a firewall rule? You can use htslib's "hts_file -vvvvv -c in.cram > /dev/null" to read a CRAM file with maximum verbosity enabled, which may shed some light on the manner. (Sadly samtools itself doesn't have any way to enter htslib debug mode, unless you attach a debugger to it.)
In our institute, the tools are all installed on a cluster by the systems admin, so I don't know whether or not ./configure was run. Could you please tell me at which step did you use --enable lib curl? I'm guessing you used it with samtools view -b -T /GRCh37.p13.genome.fa -o /EGAR00001252191_13305_1.bam /EGAR00001252191_13305_1.cram, but in samtools-1.9 (the version on our cluster), there is no --enable lib curl option under samtools view.
For anyone else that runs into this issue, the following worked for me:
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.ref_cache.tar.gz
tar xzf Homo_sapiens_assembly38.ref_cache.tar.gz
export REF_PATH="$(pwd)/ref/cache/%2s/%2s/%s:http://www.ebi.ac.uk/ena/cram/md5/%s"
export REF_CACHE="$(pwd)/ref/cache/%2s/%2s/%s"
...
{command that was generating the '[W::find_file_url] Failed to open reference..' error}
Broad seems to have removed all hg19 files from their google public bucket. Other people have asked about these and we have told them to ask Broad tech support. No one has come back and posted a response as far as I recollect. If you are willing, please do that. It would be useful for all.
I assume this data is coming from European Genome Phenome Archive? You should email datasharing at sanger.ac.uk to get assistance with reference.
Hello,
how have you installed samtools? Is htslib installed as well?
Related: https://github.com/samtools/samtools/issues/857
fin swimmer