Question

Managing large data sets

8.7 years ago

umn_bist ▴ 390

So storage has become an issue because of the volume of TCGA data sets; I'm easily hitting 30 TB. I would like to revise my pipeline so that I can cut down on intermediate output files (if possible).

My current pipe:

Collect TCGA data with CGHub (paired-end files are compressed into one .tar.gz file)
Extract
cutadapt trim adapter, low quality (intermediate .fastq output)
PRINSEQ trim poly-A/T/G/C reads (intermediate .fastq output)
STAR align (intermediate .bam output)
picard add RG info (intermediate .bam output)
picard mark duplicate (intermediate .bam output)
GATK trim N CIGAR (intermediate .bam output)
MuTect call variants (intermediate .vcf output)
snpSift filter variants (intermediate .vcf output)
snpEff annotate variants (final .vcf output)

My ideal setup:

Collect TCGA data with CGHub (paired-end files are compressed into one .tar.gz file)
Use compressed input to trim adapter, low quality, poly-A/T/G/C - output to stdout
Stream stdout as input for alignment (preferably STAR) - output to .bam file
Ideally, add RG info and mark duplicates in one step using picard, if possible (output .bam file)
GATK trim N CIGAR (intermediate .bam file)
MuTect call variants (intermediate .vcf file)
snpSift filter variants (intermediate .vcf file)
snpEff annotate variants (final .vcf output)

Question:

Which adapter trimmer can take compressed paired-end files (as one file) as input and trim adapters, low-quality bases, and poly-A/T/G/C? Is this trimmomatic or fastx?
Which splice-aware aligner can take a stdout stream as input?

TCGA RNAseq • 2.5k views

I'm working on a script that transparently wraps a bam file so it can be simultaneously read and written, provided that:

you can read the input bam from the stdin (no index requirements)
you write output bam on the stdout

You use it like:

bam2sql.py --make ./input.bam ./output.bam.sql
bam2sql.py --out ./output.bam.sql | picard MarkDuplicates | bam2sql.py --in ./output.bam.sql
... many different jobs like the one above ...
bam2sql.py --out ./output.bam.sql > final.bam # optional

It handles everything else and reads/writes without any blocking, so it's not like the whole BAM is just stored in memory. It does use memory, though, if you're reordering or deleting reads. Perhaps with FUSE it could even be made to do random read/write by having an attached process constantly updating the indexes. However, most bioinformatics software probably reads the index into memory once and then caches it, which makes random read/write more complicated. But for now the above fits 90% of my needs.

It will be done by tomorrow. It would have been done today if there weren't multiple different ways to sort a BAM by QNAME.... grr.

8.7 years ago by John 13k

score 1 · Answer 1 · 2016-03-23

I would recommend BBMap tools (available here) for both.

1) BBDuk can be used for both quality and adapter trimming, and you could handle polynucleotide runs by including those in the adapter reference file. It works on compressed data (.gz), but you'd have to check with Brian Bushnell about .tar files.

2) BBMap is a splice-aware aligner, and pipes using in=stdin/out=stdout syntax.
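
A rough sketch of how the two suggestions might be chained so that no intermediate FASTQ touches the disk. It assumes bbduk.sh and bbmap.sh are on the PATH, that samtools is installed (as far as I know BBMap needs it to write .bam), and that the trimming values shown are placeholders rather than recommendations. The function names stream_member and trim_and_align, the BBDUK/BBMAP/INTERLEAVED variables, and all file names are hypothetical. Whether BBDuk reads .tar input directly is not settled here, so the sketch sidesteps it by letting tar stream one archive member to stdout.

```shell
# Enable pipefail where the shell supports it, so a BBDuk failure is not
# masked by a successful BBMap exit status.
(set -o pipefail) 2>/dev/null && set -o pipefail

# Commands can be overridden, e.g. to point at a specific BBTools install.
BBDUK=${BBDUK:-bbduk.sh}
BBMAP=${BBMAP:-bbmap.sh}
# Set to "t" when the FASTQ stream holds interleaved read pairs.
INTERLEAVED=${INTERLEAVED:-f}

# Stream one member of a .tar.gz to stdout without unpacking it to disk.
# $1 = archive, $2 = member name as listed by `tar -tzf`
stream_member() {
    tar -xzOf "$1" "$2"
}

# Read FASTQ on stdin, trim with BBDuk, align with BBMap, write the result.
# $1 = adapter FASTA (poly-A/T/G/C entries can be added to it)
# $2 = reference FASTA, $3 = output file (.sam or .bam)
trim_and_align() {
    "$BBDUK" in=stdin.fq out=stdout.fq interleaved="$INTERLEAVED" ref="$1" \
        ktrim=r k=23 mink=11 hdist=1 qtrim=rl trimq=10 \
    | "$BBMAP" in=stdin.fq interleaved="$INTERLEAVED" ref="$2" out="$3" \
        maxindel=200000
}
```

It would be called along the lines of: stream_member sample.tar.gz sample_1.fastq | trim_and_align adapters.fa genome.fa sample.bam. For paired-end data the two mates would still have to be interleaved into a single stream first (with INTERLEAVED=t), which this sketch does not attempt. Passing ref= to BBMap builds the index on each run, so for a large genome a prebuilt index is probably the better choice.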

score 0 · Answer 2 · 2016-03-23

8.7 years ago

harold.smith.tarheel ★ 5.0k

P.S. For steps that cannot be piped (e.g., Picard MarkDuplicates), you can always 'rm' the intermediate file once it has been used. Inelegant, but it works...
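
One way to make the 'rm' approach a little safer is to delete the input only when the step has exited cleanly and left a non-empty output behind. This is just a sketch; the helper name consume is made up for illustration, and a non-empty file is of course only a weak check that the output is complete.

```shell
# Run one pipeline step, then delete its input only if the step succeeded
# and produced a non-empty output file.
# Usage: consume INPUT OUTPUT COMMAND [ARGS...]
consume() {
    input=$1
    output=$2
    shift 2
    if "$@" && [ -s "$output" ]; then
        rm -f -- "$input"
    else
        echo "step failed or wrote no output; keeping $input" >&2
        return 1
    fi
}
```

For example: consume rg.bam dedup.bam picard MarkDuplicates I=rg.bam O=dedup.bam M=dedup.metrics.txt (file names are placeholders). If the step fails, rg.bam stays on disk so the step can be rerun.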
