Here is my attempt:
def remove_too_long_reads(bam):
    for read in bam:
        if read.end - read.start < 60:
            yield read
bam = BedTool(input.bam)
bam = BedTool(remove_too_long_reads(bam))
But when I try to use the bam file afterwards like so:
bam.intersect(...
I get an error:
<class 'BrokenPipeError'>: Broken pipe
The command was:
bedtools intersect -b data/gencode_annotation/tss.bed -a stdin
Things to check: Error in job intersect_data_tss while creating output file data/tss/bam/Exp2_9h_PolII.bam.
RuleException:
KeyError in line 31 of /local/home/endrebak/code/programmable_epigenetics/rules/tss/intersect_bam_gencode.rules:
32
  File "/local/home/endrebak/code/programmable_epigenetics/rules/tss/intersect_bam_gencode.rules", line 31, in __rule_intersect_data_tss
  File "/local/home/endrebak/anaconda3/lib/python3.5/site-packages/pybedtools/bedtool.py", line 773, in decorated
  File "/local/home/endrebak/anaconda3/lib/python3.5/site-packages/pybedtools/bedtool.py", line 336, in wrapped
  File "/local/home/endrebak/anaconda3/lib/python3.5/site-packages/pybedtools/helpers.py", line 373, in call_bedtools
There seem to be two different jobs you're trying to do here. The first is ditching reads that are longer than a certain length. Presumably you just want to look at small RNAs? But then reads of 59 bp and smaller (<60 throws out 60) is an odd size for RNA... if you gave us more context, someone might be able to suggest a better way to reach your goal; maybe trimming reads or marking reads with a custom tag would be better? Regardless, filtering is one issue, and it has its own complications, like what to do with the mate of a read whose length is too long if you're doing paired-end sequencing.

I suggest you first make the BAM file using your filter method, and then validate it with Picard's ValidateSamFile. It will tell you if there's anything odd with the filtering, which might be the cause of the error. Run it on the original BAM before filtering too, just to be on the safe side. My money is on either mate-pair issues, or a truncated file, because BedTool probably has a special method for finalizing a BAM file (I have no idea, I've never used pybedtools, but I know BAM files need an EOF marker).
Once you are able to be sure it's not the BAM that's at fault, then try the intersection.
Whilst endrebak's method will work, it's kind of cheating - and worse, it's an inefficient cheat :) Sometimes cheats are necessary, like when you have to chain together multiple samtools processes to get it to count reads in truncated BAM files without giving an error - but on this occasion we can do better than leaving the underlying problem for bamToBed to figure out. Furthermore, if you don't attack the real source of your problem, you may miss out on an opportunity to learn how to filter BAM files properly.
But let's make no mistake, "Broken Pipe Error" is the most undescriptive error I have ever seen. First of all - which pipe broke?! The stdin (suggesting you have a truncated BAM file) or the stdout (which might indicate no space left on device, etc), or more likely the stdin/out of some subprocess... sometimes when you put wrappers around wrappers around wrappers, you end up with a lot of trash.
Thanks! The data are not PE though :)
Any suggestions on how to make the broken pipe error more clear? Currently it's printing out the exact command that gave the error, which allows troubleshooting like this. For example, do you know of any ways to check whether it's stdin or stdout that was the cause of the error? As far as I know this is the same error message returned by unix in similar situations.
Ah, now I feel childish for saying wrappers are trash :( I am sorry Ryan, particularly because I didn't offer a solution with my criticism either. That was immature of me, and I'm really sorry.
I can't backtrack on saying over-abstraction is the root of the issue though. At some point, something, either your wrapper or bedtools, tried to read from or write to a file descriptor (stdin/stdout/stderr) and couldn't, because it was already closed. Unfortunately, whether it was a read operation or a write operation didn't bubble up through the stack trace, and we now can't troubleshoot effectively. We know something went wrong during the writing of the Exp2_9h_PolII.bam file, although it is unclear what process was supposed to be writing it, as the command in the error doesn't include that path - so we don't know if that was bedtools or pybedtools or the OP's script. But really, whatever closed the file descriptor prematurely should have given a detailed error on stderr, and either it didn't, or that information was lost during one of the abstractions.
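For anyone who wants to see this failure mode in isolation: here is a minimal, self-contained sketch (nothing to do with bedtools; the child process and byte count are invented for illustration) of a write to a pipe whose reader has already gone away - exactly the EPIPE condition that Python 3 surfaces as BrokenPipeError:

```python
import subprocess
import sys

# Start a child that exits immediately, without ever reading its stdin.
proc = subprocess.Popen([sys.executable, "-c", "pass"], stdin=subprocess.PIPE)
proc.wait()  # the child is gone, so the read end of the pipe is now closed

caught = False
try:
    # Forcing a real write() syscall on a pipe with no reader fails with
    # EPIPE, which Python 3 raises as BrokenPipeError.
    proc.stdin.write(b"x" * 1000000)
    proc.stdin.flush()
except BrokenPipeError:
    caught = True
finally:
    try:
        proc.stdin.close()
    except BrokenPipeError:
        pass

print("caught BrokenPipeError on the child's stdin:", caught)
```

The same exception fires whether the broken descriptor is the child's stdin or stdout, which is why the traceback alone can't tell you which end failed.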
Again, I never intended to be rude or unfair. I just get very frustrated as an onlooker when people can't do simple stuff without running into some sort of issue -- but that's my problem at the end of the day, and I should have been more respectful regarding your work.
No worries, no need to apologize. I totally agree that it's frustrating. As you can see in https://github.com/daler/pybedtools/issues/49 and issues like https://github.com/daler/pybedtools/issues/77, I recognize it's a problem. I'd love to find a good solution. In this case I'm pretty sure a call to saveas() will fix this particular example, and I will add a note to the error message suggesting that. While there's abstraction, I personally don't think it's overboard, since it opens up a lot of sophisticated analysis otherwise impossible on the command line (for example, the OP's generator function can get arbitrarily complex). It does come at the cost of errors like this, and it would be nice to have bedtools, pybedtools, and the shell communicate with each other about closing file descriptors. Luckily, in the meantime we have Biostars to help mitigate the cost of deciphering errors!
Try using .saveas() first on the BAM. The generator may be causing issues like https://github.com/daler/pybedtools/issues/49, which I still don't have a good solution for. However, saving the intermediate file is an inefficient use of I/O. As @John says, knowing what your end goal is will help determine a good strategy.
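To make the reasoning behind saveas() concrete, here is a stdlib-only sketch of the same idea (the interval data and file names are invented; no pybedtools involved): drain the generator into a finished file on disk first, and only then let anything downstream read it:

```python
import os
import tempfile

def keep_short_intervals(intervals, max_len=60):
    # Analogue of the OP's generator: keep features shorter than max_len.
    for chrom, start, end in intervals:
        if end - start < max_len:
            yield (chrom, start, end)

intervals = [("chr1", 0, 50), ("chr1", 100, 400), ("chr2", 10, 30)]

# The "saveas" idea: materialize the stream as a complete file, then hand
# downstream consumers the file rather than a live generator/pipe.
with tempfile.NamedTemporaryFile("w", suffix=".bed", delete=False) as fh:
    for chrom, start, end in keep_short_intervals(intervals):
        fh.write("%s\t%d\t%d\n" % (chrom, start, end))
    path = fh.name

with open(path) as fh:
    lines = fh.read().splitlines()
os.unlink(path)

print(lines)  # → ['chr1\t0\t50', 'chr2\t10\t30']
```

A downstream tool reading this file can never hit a half-open pipe, which is the trade-off the answer above mentions: correctness at the cost of extra I/O.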
Checking out that issue on GitHub, it sounds like you might need the select module to let Python read/write without blocking. It's kind of a pain to use select because the docs are so bad, but basically subprocess is not going to be much help to you on that one. Even when you mess with its buffers, the system will buffer regardless, and reading will block. The only way is to use threads (ew), Twisted (lots of work), or let the OS tell you when it's got something for you to read/write via select (best of a bad situation). Your schematics don't show the stderr being passed on - perhaps that's why the OP doesn't see an error from bedtools?
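A minimal sketch of that select-based approach (POSIX only; the toy child process here is invented for illustration): multiplex a child's stdout and stderr without ever issuing a read that would block on an empty pipe:

```python
import select
import subprocess
import sys

# Toy child that writes to both stdout and stderr.
child_code = "import sys; sys.stdout.write('out\\n'); sys.stderr.write('err\\n')"
proc = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)

streams = {proc.stdout: "stdout", proc.stderr: "stderr"}
captured = {"stdout": b"", "stderr": b""}
while streams:
    # select() sleeps until at least one descriptor is readable (or at EOF),
    # so we never block on a pipe that has nothing to give us.
    readable, _, _ = select.select(list(streams), [], [])
    for fh in readable:
        chunk = fh.read1(4096)  # read only what is already available
        if chunk:
            captured[streams[fh]] += chunk
        else:  # an empty read after select means EOF on this stream
            del streams[fh]
proc.wait()

print(captured["stdout"], captured["stderr"])
```

Because stderr is drained alongside stdout here, the child's error messages can't get stuck in a full, unread pipe - which is one way the "detailed error on stderr" can otherwise go missing.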
Thanks, I will look into select. I had tried threads and twisted; they made interactive work in IPython a nightmare. I convinced Aaron and Neil to add -nobuf and -iobuf to bedtools, which helped in some cases but not all, likely from OS buffering as you mention. I was also considering switching to sarge (http://sarge.readthedocs.io) for the calls to bedtools but have not had the time for such an overhaul and I'm not sure it would solve the problem anyway.
stderr is going to subprocess.PIPE, so it is being passed on but just isn't indicated in the schematic.
Thanks for the info . . . to be continued over on https://github.com/daler/pybedtools/issues/49 . . .