I'm wondering if it's possible to stream the output of samtools mpileup into R for processing. If someone has done this, can you share the bits of code for accepting the stdin and writing the stdout?
Likewise, if I'm only imagining that I can pipe this into R, please let me know.
The motivation behind this is to process the data to look for exactly what I'm after without creating excessively large intermediate mpileup files.
This is straightforward. Functions can be applied to several lines at a time if the input is read in chunks. Skeleton Rscript below, which can be called in a pipe with samtools mpileup -l my.bed -f my.fa my.bam | Rscript my.R > myR.stdout
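For readers who want something concrete, here is a minimal sketch of that kind of script. It is not the poster's original skeleton: the function names (strip_indels, count_events, summarise_chunk, main) and the chunk size are made up for illustration, and the mismatch count assumes mpileup was run with -f so that matches appear as "." and ",". It reads stdin in chunks, counts mismatches, indels, read starts and read stops per position, and writes tab-separated lines to stdout.

```r
#!/usr/bin/env Rscript
# Sketch only: per-position event counts from `samtools mpileup` text on stdin.
# All function names below are illustrative, not from the original post.

# Remove indel tokens such as "+2AG" or "-1c" from a read-bases string and
# count them. The number after +/- says how many bases belong to the indel.
strip_indels <- function(s) {
  n_indels <- 0L
  repeat {
    m <- regexpr("[+-][0-9]+", s)
    if (m == -1L) break
    start <- as.integer(m)
    mlen <- attr(m, "match.length")
    indel_len <- as.integer(substr(s, start + 1L, start + mlen - 1L))
    end <- start + mlen - 1L + indel_len
    s <- paste0(substr(s, 1L, start - 1L), substr(s, end + 1L, nchar(s)))
    n_indels <- n_indels + 1L
  }
  list(rest = s, n_indels = n_indels)
}

# Count events in one read-bases string (column 5 of the pileup).
count_events <- function(bases) {
  n_starts <- lengths(regmatches(bases, gregexpr("\\^.", bases)))
  x <- gsub("\\^.", "", bases)  # drop "^" plus the mapping-quality character
  n_stops <- nchar(x) - nchar(gsub("$", "", x, fixed = TRUE))
  stripped <- strip_indels(x)
  # With a reference supplied, any remaining letter is a mismatch.
  n_mismatch <- nchar(gsub("[^ACGTNacgtn]", "", stripped$rest))
  c(mismatches = n_mismatch, indels = stripped$n_indels,
    starts = n_starts, stops = n_stops)
}

# Turn a chunk of raw pileup lines into a data frame of counts.
summarise_chunk <- function(lines) {
  fields <- strsplit(lines, "\t", fixed = TRUE)
  col <- function(i) vapply(fields, function(f) f[i], character(1))
  counts <- vapply(col(5), count_events,
                   c(mismatches = 0L, indels = 0L, starts = 0L, stops = 0L),
                   USE.NAMES = FALSE)
  data.frame(chrom = col(1), pos = as.integer(col(2)),
             depth = as.integer(col(4)),
             mismatches = counts[1, ], indels = counts[2, ],
             starts = counts[3, ], stops = counts[4, ],
             stringsAsFactors = FALSE)
}

# Read stdin chunk by chunk so the full pileup is never held in memory.
main <- function(chunk_size = 10000L) {
  con <- file("stdin", open = "r")
  on.exit(close(con))
  while (length(lines <- readLines(con, n = chunk_size)) > 0L) {
    write.table(summarise_chunk(lines), file = stdout(), sep = "\t",
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}

# Run only when executed via Rscript, not when the file is source()d.
if (sys.nframe() == 0L) main()
```

Note that file("stdin") is what reads the pipe under Rscript. Summing these counts over a user-specified window would need extra logic to carry a partially filled window from one chunk to the next, which this sketch leaves out.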
Yes, possible. But before we go into details, what analysis do you want to do? Almost certainly there is a more efficient command line tool that does it faster and with less memory.
Calculate the error rate (number of bases, indels, starts or stops) over a user-specified window... Sure, there are tools I could try to make useful here, but I'd rather just take the direct output from samtools and write my own function for this.
(cough) (cough) (cough) don't use R for this (cough) (cough) (cough)
haha, you should have that cough checked out ;)
Should my take-away be that R can't handle streams? Or something else? Any scripting language that might work? (I'm not going to learn C++ for this...)
Are you learning a new language? Great.
Is Rust easier to learn than C++?
You need a variant caller to reliably call indels and things like that, that has tested heuristics and cutoffs to distinguish signal from noisy calls. I would run this through a variant caller and then process the VCF file rather than the pileup itself.
I get that I could process a VCF file, but my question seems like something so simple and achievable... guess not.
I actually found this to be bad (outdated?) advice. I had no problem piping the output of samtools mpileup into an Rscript (R > 4.1) that loaded packages and functions, had intermediate file structures, and streamed to stdout. Copying a skeleton script below which can be called with samtools mpileup -l my.bed -f my.fa my.bam | Rscript my.R > R.stdout.tsv
Again, my goal here was to have a work-around for creating large intermediate files.