Can I open a BAM file without a header in Rsamtools?
1
0
Entering edit mode
7.6 years ago
biomagician ▴ 410

Dear all,

I am trying to analyse the quality of reads from a BAM file but the BAM file does not have a header so I get this error:

Rsamtools::scanBam('output/alignment/pacbio/ref/sx.subreads.bam')

Error in value[3L] : failed to open BamFile: SAM/BAM header missing or empty file: 'output/alignment/pacbio/ref/sx.subreads.bam'

Does anybody know how I can get round this, please?

Thanks.

Best wishes,

C.

rsamtools bam scanBam samtools sequencing • 7.0k views
ADD COMMENT
0
Entering edit mode

Is this one of those new BAM files that SMRTanalysis generates as default output?

ADD REPLY
0
Entering edit mode

Yes. This is the subreads.bam file that the pacbio facility sent me.

ADD REPLY
0
Entering edit mode

Have you tried reheader in samtools? http://www.htslib.org/doc/samtools.html

ADD REPLY
0
Entering edit mode

Hi Alex, I don't have the corresponding SAM file for the BAM file. I've tried to convert to SAM and then to BAM again but that did not add a header.

ADD REPLY
0
Entering edit mode

I think, but I'm not 100% certain, that you would just convert from BAM to SAM: samtools view -h -o old.sam old.bam. Then run samtools reheader old.sam new.bam to get a re-headered BAM.

ADD REPLY
0
Entering edit mode

That didn't work. According to the doc:

" samtools reheader [-iP] in.header.sam in.bam

Replace the header in in.bam with the header in in.header.sam. This command is much faster than replacing the header with a BAM→SAM→BAM conversion.

By default this command outputs the BAM or CRAM file to standard output (stdout), but for CRAM format files it has the option to perform an in-place edit, both reading and writing to the same file. No validity checking is performed on the header, nor that it is suitable to use with the sequence data itself. "

So I did:

samtools reheader old.sam old.bam > new.bam

I get the new.bam but it still has no header. The .sam file also has no header when I convert from BAM to SAM.

ADD REPLY
0
Entering edit mode

These BAM files should be in PacBio's BAM format. It appears that sequencing facility that gave you the files may not have generated them properly. You could ask them to regenerate the files.

ADD REPLY
0
Entering edit mode

Hi genomax,

I believe that my BAM file is an unaligned BAM file that contains reads and metadata without the reads having been aligned to the reference sequence I provided them with. That's fine, I can align it myself. So, do you think that even unaligned BAM files should have a header?

ADD REPLY
1
Entering edit mode

I have mostly worked with the older *.h5 PacBio data (we gave up on our RSII) so don't have direct experience with new BAM data (we are waiting on getting access to a local sequel).

PacBio BAM format spec seems to indicate that usual headers should be present in those files but clearly your file does not have them.

I suggest that you also contact PacBio tech support directly with this question.

ADD REPLY
0
Entering edit mode

Hi,

My bad, my BAM does have a header. I can see it with this command:

samtools view -H sx.subreads.bam

@HD VN:1.5  SO:unknown  pb:3.0.3
@RG ID:73c06d55 PL:PACBIO   DS:READTYPE=SUBREAD;Ipd:CodecV1=ip;PulseWidth:CodecV1=pw;BINDINGKIT=100-862-200;SEQUENCINGKIT=100-861-800;BASECALLERVERSION=4.0.0.189308;FRAMERATEHZ=80.000000  PU:m54097_170512_141504 PM:SEQUEL
@PG ID:baz2bam  PN:baz2bam  VN:4.0.0.189308 CL:/opt/pacbio/ppa-4.0.0/bin/baz2bam /data/pb/m54097_170512_141504.baz -o /data/pb/m54097_170512_141504 --metadata /data/pb/.m54097_170512_141504.metadata.xml -j 12 -b 12 --progress --silent --minSubLength 50 --minSnr 4.000000 --adapters /data/pb/m54097_170512_141504.adapters.fasta --controls /etc/pacbio/600bp_Control_c2.fasta 
@PG ID:bazFormat    PN:bazformat    VN:1.3.0
@PG ID:bazwriter    PN:bazwriter    VN:4.0.0.189308

The problem still persists with Rsamtools though:

Rsamtools::scanBam('output/alignment/pacbio/ref/sx.subreads.bam')

Error in value[[3L]](cond) : 
  failed to open BamFile: SAM/BAM header missing or empty
  file: 'output/alignment/pacbio/ref/sx.subreads.bam'
ADD REPLY
0
Entering edit mode

But the file is there:

dir('output/alignment/pacbio/ref/')

[1] "sx.arrow" "sx.fasta" "sx.subreads.bam" "sx.subreads.sam" "sx2.subreads.bam"

ADD REPLY
0
Entering edit mode

And is definitely not empty:

file.size('output/alignment/pacbio/ref/sx.subreads.bam') [1] 5865760774

ADD REPLY
0
Entering edit mode

Do you need to sort and/or index the file?

ADD REPLY
0
Entering edit mode

Follow the directions on this page to convert the PacBio BAM file to fastq reads (see section on subread bam to fastq/a). Then realign (or do whatever you want to do) with the reads.

ADD REPLY
0
Entering edit mode

Hi genomax,

The BAM file I've got contains information about pulse width and inter-pulse width. If I convert this to FASTQ and then align, am I not going to lose this information? I would have liked to extract the pulse information from the BAM file to plot it for some reads.

ADD REPLY
1
Entering edit mode
7.6 years ago
biomagician ▴ 410

My facility manager gave me the answer. It looks like Rsamtools wants a @SQ line in the BAM file so I added this to the SAM file:

@SQ SN:phix LN:5386 M5:c3f78539481dde3e7fb732c446d42d93 UR:N.fa

and then reheaded the BAM.

system('samtools reheader input/test.h1 output/alignment/pacbio/ref/sx.subreads.bam > output/alignment/pacbio/ref/newSx.subreads.bam')

input/test.h1 contains the new header.

C.

ADD COMMENT
1
Entering edit mode

Was this data for phiX?

ADD REPLY
0
Entering edit mode

What's phiX? Bacteriophage?

ADD REPLY
1
Entering edit mode

It's a control genome which is spiked in illumina sequencing for quality control.

ADD REPLY
0
Entering edit mode

mmh The data I have is not for phiX. It's for C. elegans. Can I change it to "Caenorhabditis elegans"?

ADD REPLY
1
Entering edit mode

You are misidentifying the data by choosing the wrong genome. Use the actual genome name the data is from. There will be multiple SQ lines for each chromosome (with correct allied info) for C. elegans.

ADD REPLY
0
Entering edit mode

OK thanks!

ADD REPLY
0
Entering edit mode

Do you know what I can put in the M5 field and UR field? And do you know where I identify the chromosome numbers?

ADD REPLY
0
Entering edit mode

Chromosome numbers/sizes: http://www.genome.jp/kegg-bin/show_organism?org=cel

M5 is the MD5 checksum and UR can be a web link to where you are going to get the genome from or a local path for the genome file. I don't think those are required. Only SN/LN are.

Actually since the data you have is unaligned I am not sure including this information is serving a useful purpose but since you want Rsamtools to be happy ...

ADD REPLY
0
Entering edit mode

Yes. It is a bacteriophage.

ADD REPLY

Login before adding your answer.

Traffic: 1732 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6