Storage has become an issue due to the volume of TCGA data sets: I'm easily reaching 30 TB. I would like to revise my pipeline so that I can slim down the intermediate output files (if possible).
My current pipe:
- Collect TCGA data with CGHub (paired-end files are .tar.gz compressed into 1 file)
- Extract
- cutadapt trim adapter, low quality (intermediate .fastq output)
- PRINSEQ trim poly-A/T/G/C reads (intermediate .fastq output)
- STAR align (intermediate .bam output)
- picard add RG info (intermediate .bam output)
- picard mark duplicate (intermediate .bam output)
- GATK trim N CIGAR (intermediate .bam output)
- MuTect call variants (intermediate .vcf output)
- snpSift filter variants (intermediate .vcf output)
- snpEff annotate variants (final .vcf output)
My ideal setup:
- Collect TCGA data with CGHub (paired-end files are .tar.gz compressed into 1 file)
- Use compressed input to trim adapter, low quality, poly-A/T/G/C - output to stdout
- Stream stdout as input for alignment (preferably STAR) - output to BAM file
- ideally I would like to add RG and mark duplicates at the same time using picard if possible (output bam file)
- GATK trim N CIGAR (intermediate .bam file)
- MuTect call variants (intermediate .vcf file)
- snpSift filter variants (intermediate .vcf file)
- snpEff annotate variants (final .vcf output)
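A minimal sketch of the "no intermediate files" idea in the list above, assuming every stage can read stdin and write stdout. `run_chain` and `demo` are hypothetical helper names, and the two stages in `demo` are stand-in Python one-liners rather than real trimmer or aligner command lines (those depend on the tools and versions in use).

```python
import os
import subprocess
import sys
import tempfile


def run_chain(commands, out_path):
    """Run commands as stage1 | stage2 | ... and write the last stdout to out_path."""
    procs = []
    with open(out_path, "wb") as out:
        prev = subprocess.DEVNULL  # the first stage gets no input
        for i, cmd in enumerate(commands):
            last = i == len(commands) - 1
            proc = subprocess.Popen(
                cmd, stdin=prev, stdout=out if last else subprocess.PIPE
            )
            if procs:
                # Drop our copy of the upstream pipe so only the child holds it.
                procs[-1].stdout.close()
            procs.append(proc)
            prev = proc.stdout
        codes = [proc.wait() for proc in procs]
    for cmd, code in zip(commands, codes):
        if code != 0:
            raise RuntimeError(f"stage failed with exit code {code}: {cmd!r}")
    return codes


def demo():
    # Stand-in stages: emit one read, then strip a trailing poly-A run.
    emit = [sys.executable, "-c", "print('ACGTAAAA')"]
    trim = [
        sys.executable,
        "-c",
        "import sys\nfor line in sys.stdin: print(line.rstrip().rstrip('A'))",
    ]
    with tempfile.TemporaryDirectory() as tmp:
        out_path = os.path.join(tmp, "out.txt")
        run_chain([emit, trim], out_path)
        with open(out_path) as handle:
            return handle.read().strip()
```

All stages run concurrently, so only the final output ever lands on disk; data between stages lives in kernel pipe buffers.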
Question:
- Which adapter trimmer can take compressed paired files (as 1 file) as input and trim adapters, low-quality bases, and poly-A/T/G/C? Is this Trimmomatic or FASTX?
- Which splice-aware aligner can take a stdout stream as input?
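For the first question, one possible workaround is to read the archive as a stream instead of extracting it. This is a rough sketch using Python's standard `tarfile` module in stream mode (`"r|gz"`); `stream_tar_members`, `open_sink`, and `demo` are hypothetical names, and the member names in `demo` are made up. Note that the two mates are stored one after the other in a tar archive, so a tool that reads both mates in lockstep cannot be fed directly from this stream without buffering one of them.

```python
import io
import shutil
import tarfile


def stream_tar_members(tar_stream, open_sink, chunk_size=1 << 20):
    """Copy each regular file in a .tar.gz stream to the sink chosen by open_sink(name)."""
    names = []
    # "r|gz" reads strictly front to back, so tar_stream may be a pipe or socket.
    with tarfile.open(fileobj=tar_stream, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            source = archive.extractfile(member)
            sink = open_sink(member.name)
            shutil.copyfileobj(source, sink, chunk_size)
            names.append(member.name)
    return names


def demo():
    # Build a tiny two-member archive in memory to stand in for a download.
    raw = io.BytesIO()
    reads = [
        ("s_1.fastq", b"@r1/1\nACGT\n+\nIIII\n"),
        ("s_2.fastq", b"@r1/2\nTTGA\n+\nIIII\n"),
    ]
    with tarfile.open(fileobj=raw, mode="w:gz") as archive:
        for name, payload in reads:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    raw.seek(0)

    sinks = {}

    def open_sink(name):
        sinks[name] = io.BytesIO()
        return sinks[name]

    stream_tar_members(raw, open_sink)
    return {name: sink.getvalue() for name, sink in sinks.items()}
```

In practice `open_sink` could return the stdin of a trimmer process for single-end data, or per-mate destinations for paired data.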
I'm working on a script that transparently wraps a bam file so it can be simultaneously read and written, provided that:
You use it like:
It handles everything else and reads/writes without any blocking, so it's not like the whole BAM is just stored in memory. It does use memory, though, if you're reordering reads or deleting reads. Perhaps with FUSE it could even be made to do random read/write by having an attached process constantly updating the indexes; however, most bioinformatics software probably reads the index into memory once and then caches it, which makes random read/writes more complicated. But for now the above fits 90% of my needs.
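To illustrate the general streaming idea only (this is not the wrapper script itself, and it works on SAM text rather than BAM), here is a small sketch of a record-by-record filter that never holds more than one line in memory. `drop_duplicates` and `demo` are hypothetical names; the only format fact relied on is that the SAM FLAG is the second tab-separated column and that bit 0x400 marks a PCR or optical duplicate.

```python
import io

DUPLICATE_FLAG = 0x400  # SAM FLAG bit: PCR or optical duplicate


def drop_duplicates(sam_in, sam_out):
    """Copy SAM text from sam_in to sam_out, skipping reads flagged as duplicates."""
    kept = 0
    for line in sam_in:
        if line.startswith("@"):
            sam_out.write(line)  # header lines pass through unchanged
            continue
        flag = int(line.split("\t", 2)[1])
        if flag & DUPLICATE_FLAG:
            continue
        sam_out.write(line)
        kept += 1
    return kept


def demo():
    # Toy records, truncated to the first four SAM columns for brevity.
    sam = "@HD\tVN:1.6\nr1\t99\tchr1\t100\nr2\t1123\tchr1\t100\n"
    out = io.StringIO()
    kept = drop_duplicates(io.StringIO(sam), out)
    return kept, out.getvalue()
```

A filter like this can sit between two pipeline stages, reading stdin and writing stdout; reordering reads is a different matter, since that requires buffering.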
It will be done by tomorrow. It would have been done today if there weren't multiple different ways to sort a BAM by QNAME.... grr.