Hi Biostars,
This is my first question on the site, so thanks for all the help I've gotten from lurking, and thanks for the help I'm about to get. I'll try to share my research and follow all the etiquette. Can anyone help me with the following situation?
I'm working with the python packages Pysam and Pysamstats. I am using BAM and FASTA files from the 1000genomes project. I would like to provide a 50 base-pair interval on chromosome 22 in HG18 coordinates and in turn receive the average depth of coverage from that interval. My attempts so far have opened the files first, then fed them into pysamstats.stat_coverage_binned
, like this.
fasta_file = pysam.Fastafile(<fasta_ftp_path>)
bamfile = pysam.AlignmentFile(<alignment_ftp_path>)
pysamstats.stat_coverage_binned(bamfile, fasta_file, <chrom>, <start_pos>, <end_pos>).
From the first line, I encounter this:
ValueError: could not locate index file
The same error appears on the third line if I try to feed in the ftp addresses rather than file objects. My understanding is that these packages look for index files by appending ".bai" or ".fai" to the filename I feed them, which should work (you can look at the files available further down my post). This is supported when I RTFS at line 142 of this file, which has my error message. I can't quite read cython, but given "filename.fa" I think it says to look for "filename.fa.fai". Here's TFS: https://github.com/pysam-developers/pysam/blob/master/pysam/cfaidx.pyx
FASTA-associated files: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/GRCh38_reference_genome
BAM-associated files: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/other_exome_alignments/HG00096/exome_alignment/
A minimal example to generate the error (sorry about the formatting; can't figure out how this will render):
import os
import pysam
import pysamstats
alignment_ftp_path = "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/" \
"other_exome_alignments/HG00096/exome_alignment/"
alignment_filename = "HG00096.mapped.illumina.mosaik.GBR.exome.20111114.bam"
alignment_ftp_path = os.path.join(alignment_ftp_path, alignment_filename)
fasta_ftp_path = "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/" \
"GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa"
fasta_file = pysam.Fastafile(fasta_ftp_path)
bamfile2 = pysam.AlignmentFile(filename = alignment_ftp_path)
ex_bin_read_depth = pysamstats.stat_coverage_binned(bamfile2, fasta_ftp_path,\
chrom="22", \
start=14481425, \
end=14481475)
I think the problem stems from the fact that you're trying to access FTP links like local files. Try downloading the data first.
If space is an issue, you could try linking the FTP data to a named pipe, though that's a little more tricky.
That worked; thanks. Is there an "accept" button like on SE?
I notice that
os.path.isfile(<address>)
isFalse
even for correct ftp addresses.os.path.isfile
returnsTrue
for files that exist on your current filesystem. It does not work for remote URLs.There is an accept button on Biostars, but only for actual answers. I was guessing at the problem so I posted a comment instead. :)
I think you found a bug in pysam. I reported it here. You'll probably have to download the FASTA file for now, as Dan D says. The remote BAM file should work, though.
Indeed, the remote BAM works. Thanks.