How do I filter a bam on length with pybedtools?
2
1
Entering edit mode
8.5 years ago
endrebak ▴ 970

Here is my attempt:

def remove_too_long_reads(bam):

    for read in bam:
        if read.end - read.start < 60:
            yield read

bam = BedTool(input.bam)
bam = BedTool(remove_too_long_reads(bam))

But when I try to use the bam file afterwards like so:

bam.intersect(...

I get an error:

<class 'brokenpipeerror'="">: Broken pipe The command was:

    bedtools intersect -b data/gencode_annotation/tss.bed -a stdin
  

Things to check: Error in job intersect_data_tss while creating output file data/tss/bam/Exp2_9h_PolII.bam. RuleException: KeyError in line 31 of /local/home/endrebak/code/programmable_epigenetics/rules/tss/intersect_bam_gencode.rules: 32 File "/local/home/endrebak/code/programmable_epigenetics/rules/tss/intersect_bam_gencode.rules", line 31, in __rule_intersect_data_tss File "/local/home/endrebak/anaconda3/lib/python3.5/site-packages/pybedtools/bedtool.py", line 773, in decorated File "/local/home/endrebak/anaconda3/lib/python3.5/site-packages/pybedtools/bedtool.py", line 336, in wrapped File "/local/home/endrebak/anaconda3/lib/python3.5/site-packages/pybedtools/helpers.py", line 373, in call_bedtools

pybedtools • 3.9k views
ADD COMMENT
2
Entering edit mode

There seems to be two different jobs you're trying to do here. The first is ditching reads that are longer than a certain length. Presumably you just want to look at small RNAs? But then reads of 59 and smaller (<60 throws out 60) is an odd size for RNA... if you gave us more context here someone might be able to tell you a better way to reach your goal, like maybe trimming reads or marking reads with a custom tag is better? Regardless, filtering is one issue, and has it's own complications like what to do with the mate of a read whose length is too long if you're doing paired-end sequencing. I suggest you first make the BAM file using your filter method, and then validate it with Picard's ValidataBam. It will tell you if theres anything odd with the filtering, which might be the cause of the error. Run it on the original BAM before filtering too, just to be on the safe side. My money is on either mate-pair issues, or truncated file because BedTool probably has a special method for finalizing a BAM file (I have no idea, i've never used pybedtools, but i know BAM files need an EOF marker).

Once you are able to be sure it's not the BAM that's at fault, then try the intersection.

Whilst endrebak's method will work, it's kind of cheating - and worse, its an inefficient cheat :) Sometimes cheats are necessary, like when you have to chain together multiple samtools processes to get it to count reads in truncated BAM files without giving an error - but in this occasion we can do better than leaving the underlying problem for bamToBed to figure out. Furthermore, if you don't attack the real source of your problem, you may miss out on an opportunity to learn how to filter BAM files properly.

But let's make no mistake, "Broken Pipe Error" is the most undescriptive error I have ever seen. First of all - which pipe broke?! The stdin (suggesting you have a truncated BAM file) or the stdout (which might indicate no space left on device, etc), or more likely the stdin/out of some subprocess... sometimes when you put wrappers around wrappers around wrappers, you end up with a lot of trash.

ADD REPLY
1
Entering edit mode

Thanks! The data are not PE though :)

ADD REPLY
1
Entering edit mode

Any suggestions on how to make the broken pipe error more clear? Currently it's printing out the exact command that gave the error, which allows troubleshooting like this. For example, do you know of any ways to check whether it's stdin or stdout that was the cause of the error? As far as I know this is the same error message returned by unix in similar situations.

ADD REPLY
0
Entering edit mode

Ah, now i feel childish for saying wrappers are trash :( I am sorry Ryan, particularly because I didn't offer a solution with my criticism either. That was immature of me, and I'm really sorry.

I can't backtrack on saying over-abstraction is the root of the issue though. At some point, something, either your wrapper or bedtools, tries to read/write to a file descriptor (stdin/stdout/stderr) and couldn't because it's now closed. Unfortunately whether it was a read operation or a write operation didn't bubble up through the stack trace, and we now can't troubleshoot effectively. We know something went wrong during the writing of the Exp2_9h_PolII.bam file, although it is unclear what process was supposed to be writing it as the command in the error doesn't include that path, so we don't know if that was bedtools or pybedtools or the OPs script. But really, whatever closed the file descriptor prematurely should have given a detailed error on the stderr, and either it didn't, or that information was lost during one of the abstractions.

Again, I never intended to be rude or unfair. I just get very frustrated as an onlooker when people can't do simple stuff without running into some sort of issue -- but that's my problem at the end of the day, and I should have been more respectful regarding your work.

ADD REPLY
1
Entering edit mode

No worries, no need to apologize. I totally agree that it's frustrating. As you can see in https://github.com/daler/pybedtools/issues/49 and issues like https://github.com/daler/pybedtools/issues/77 I recognize it's a problem. I'd love to find a good solution. In this case I'm pretty sure a call to saveas() will fix this particular example and I will add a note to the error message suggesting that.

While there's abstraction I personally don't think it's overboard since it opens up a lot of sophisticated analysis otherwise impossible on the command line (for example, the OP's generator function can get arbitrarily complex). It does come at the cost of errors like this, and it would be nice to have bedtools, pybedtools, and the shell communicate to each other about closing file descriptors. Luckily, in the meantime we have biostars to help mitigate the cost of deciphering errors!

ADD REPLY
1
Entering edit mode

Try using .saveas() first on the BAM. The generator may be causing issues like https://github.com/daler/pybedtools/issues/49, which I still don't have a good solution for.

However, saving the intermediate file is inefficient use of I/O. As @John says, knowing what your end goal is will help know what a good strategy will be.

ADD REPLY
0
Entering edit mode

Checking out that issue on github it sounds like you might need the select module to let python read/write without blocking. It's kind of a pain to use select because the docs are so bad, but basically subprocess is not going to be much help to you on that one. Even when you mess with it's buffers, the system will buffer regardless, and reading will block. The only way is to use threads (ew), Twisted (lots of work), or let the OS tell you when it's got something for you to read/write via select (best of a bad situation). Your schematics don't show the stderr being passed on - perhaps that's why the OP doesn't see an error from bedtools?

ADD REPLY
1
Entering edit mode

Thanks, I will look into select. I had tried threads and twisted; they made interactive work in IPython a nightmare. I convinced Aaron and Neil to add -nobuf and -iobuf to bedtools, which helped in some cases but not all, likely from OS buffering as you mention. I was also considering switching to sarge (http://sarge.readthedocs.io) for the calls to bedtools but have not had the time for such an overhaul and I'm not sure it would solve the problem anyway.

stderr is going to subprocess.PIPE, so it is being passed on but just isn't indicated in the schematic.

Thanks for the info . . . to be continued over on https://github.com/daler/pybedtools/issues/49 . . .

ADD REPLY
0
Entering edit mode
bamToBed -i file.bam | head -1000 | awk '{if ($3 - $2 < 60) {print}}' | bedToBam ...
ADD REPLY
6
Entering edit mode
8.5 years ago
Ryan Dale 5.0k

In order to maintain compatibility with VCF/BED/GFF/GTF functionality, pybedtools converts BAM/SAM reads into pybedtools.Interval objects. When just filtering on length as you're doing here, that's unnecessary overhead. Here's a more efficient way to do it with pysam, which also has the benefit of providing access to all the read details if you need to do more sophisticated filtering. You can then use the filtered bam for downstream operations (e.g., histograms of read counts across intervals):

ADD COMMENT
0
Entering edit mode

Another library to learn. Thanks!

ADD REPLY
0
Entering edit mode

True, but FWIW I'd consider it totally worth the time investment to learn pysam if you're doing anything with BAM files

ADD REPLY
0
Entering edit mode

Thanks for this. When I wrote the code in my answer it occurred to me that a pysam solution would be possible, but decided to got with pybedtools because it tidies up the tmp files. Also, everything else in the script was already using pybedtools. But thinking it over, I might replace my code with this pysam solution.

ADD REPLY
3
Entering edit mode
8.5 years ago
A. Domingues ★ 2.7k

I have this function in one my scripts - maybe it will help:


from pybedtools import BedTool

def filterReadsByLength(inbam, minlength, maxlength):
    '''
    Takes BedTool which is a bam, and selects intervals that are within the defined lengths.
    Input: bam file and min/max lengths
    Output: bedTool
    '''
    print 'Filtering reads by length'
    # convert bam to bed
    intervals = inbam.bam_to_bed()
    filt = intervals.filter(lambda x: len(x) > minlength and len(x) < maxlength).saveas()
    return filt

inbam = BedTool(path_to_bam)
filtered = filterReadsByLength(inbam, minlength, maxlength)

ADD COMMENT
0
Entering edit mode

The .saveas() call is the trick here, which saves to an intermediate tempfile.

ADD REPLY
0
Entering edit mode

A trick you taught me at some point during some GitHub exchanges :)

ADD REPLY

Login before adding your answer.

Traffic: 2121 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6