How To Speed Up Picard Tools, Particularly FixMateInformation?
12.3 years ago
C Shao ▴ 140

I am trying to find mutations in whole-exome sequencing data from cancer samples. Right now I am following this guide: http://seqanswers.com/wiki/How-to/exome_analysis

I found that some of the commands run quite slowly, especially FixMateInformation, which took more than 3 hours. Here is the command I used (called from Python):

os.system("java -Xmx4g -jar FixMateInformation  INPUT=%s OUTPUT=%s SO=coordinate CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT"  % (realigned, fixedRealign ))

Could you give me some suggestions for speeding this up? How do you run your Picard or other Java tools?

My other question: I want to find somatic mutations in cancer exomes, while the guide I am using is for SNP calling.

In my view, finding SNPs and finding somatic mutations are quite similar in principle, but I am more interested in tools specific to mutation calling. I found several high-profile papers employing MuTect, which is not publicly available yet.
What do you use for cancer mutation calling?

Thanks!

picard exome mutation calling • 7.1k views

I may be wrong, but I think it's not possible for this command. Wherever possible, picard-tools has a "USE_THREADING=TRUE/FALSE" parameter that improves runtime by 20% or 30% or so. This one doesn't seem to have one.


Only MergeSamFiles has "USE_THREADING=TRUE/FALSE", so sad.


I believe that you can speed up this process if you do not specify the sort order. In my experience, telling Picard tools to sort your output file significantly increases processing time. If you still need to sort, just sort before or after using samtools or Picard SortSam.
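A minimal sketch of that suggestion in Python, since the original command was launched from Python. It assumes the per-tool Picard jars (FixMateInformation.jar, SortSam.jar) live in a directory of your choosing; the helper names, `picard_dir`, and the file names are hypothetical. It only builds the two command lines (mate fixing with no SO= argument, then a separate coordinate sort), so you can inspect or time each step yourself. Whether the split is actually faster is worth measuring on your own data.

```python
import shlex


def fixmate_cmd(picard_dir, in_bam, out_bam, mem="4g"):
    """Build a FixMateInformation command that does NOT request a sort order."""
    jar = picard_dir.rstrip("/") + "/FixMateInformation.jar"  # assumed jar location
    return [
        "java", "-Xmx" + mem, "-jar", jar,
        "INPUT=" + in_bam,
        "OUTPUT=" + out_bam,
        "VALIDATION_STRINGENCY=LENIENT",
    ]


def sortsam_cmd(picard_dir, in_bam, out_bam, mem="4g"):
    """Build a separate SortSam command that sorts by coordinate and indexes."""
    jar = picard_dir.rstrip("/") + "/SortSam.jar"  # assumed jar location
    return [
        "java", "-Xmx" + mem, "-jar", jar,
        "INPUT=" + in_bam,
        "OUTPUT=" + out_bam,
        "SORT_ORDER=coordinate",
        "CREATE_INDEX=true",
        "VALIDATION_STRINGENCY=LENIENT",
    ]


def as_shell(cmd):
    """Render an argument list as a copy-pasteable shell line."""
    return " ".join(shlex.quote(arg) for arg in cmd)
```

Passing each list to `subprocess.check_call(cmd)` runs it without going through a shell, which avoids the quoting problems that `os.system` with `%s` substitution can hit when a path contains spaces.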


Thanks for your tips, I will try to sort it independently.

12.0 years ago

Be careful with that exome sequencing guide you are following: although it states that it has been modified recently, it looks like the same pipeline I saw back in 2010 (at least). In particular, FixMateInformation was a step that had to be performed when GATK's IndelRealigner was not doing its job properly (probably at the time the guide was written), but it is currently an unnecessary step, as you may see here.

As I said, I also went through that guide when I started with NGS, but you have to realize that lots of things have changed since it was written. Mainly, GATK has moved from v1.x to v2.x, which represents major improvements: some walkers and options have been renamed or even deprecated (CountCovariates and TableRecalibration, for instance, have been replaced by BaseRecalibrator and PrintReads), and new threading methods have recently been implemented. Picard tools, though, as mentioned, only allows threading in MergeSamFiles, not when marking duplicates.

In terms of percentage, marking duplicates represents less than 10% of my processing time, and the rest of the job is GATK-centered, so I would focus on this new GATK threading technology (roughly described here, although I am eager to see a well-documented description of that NanoScheduler) if I wanted to improve my timings. Also, I would highly recommend that you check the updated best practices for variant detection, plus the current recommendations for recalibrating variants.
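To make the BaseRecalibrator / PrintReads replacement and the threading option a bit more concrete, here is a rough Python sketch that builds the two GATK 2.x command lines. The helper names, jar path, and file names are hypothetical, and the flags (`-T`, `-R`, `-I`, `-o`, `-knownSites`, `-BQSR`, `-nct`) are written as I recall them from GATK 2.x, so please check them against the documentation of the version you have installed.

```python
def gatk_cmd(gatk_jar, walker, ref, in_bam, out, extra=(), nct=1, mem="4g"):
    """Build a GATK 2.x-style command line for one walker.

    nct is the number of CPU threads per data thread; the flag is only
    added when more than one thread is requested.
    """
    cmd = ["java", "-Xmx" + mem, "-jar", gatk_jar,
           "-T", walker, "-R", ref, "-I", in_bam, "-o", out]
    cmd += list(extra)
    if nct > 1:
        cmd += ["-nct", str(nct)]
    return cmd


def bqsr_cmds(gatk_jar, ref, in_bam, known_sites, table, out_bam, nct=1):
    """Two-step base quality recalibration: build the table, then apply it."""
    # Step 1 (replaces CountCovariates): write the recalibration table.
    recal = gatk_cmd(gatk_jar, "BaseRecalibrator", ref, in_bam, table,
                     extra=["-knownSites", known_sites], nct=nct)
    # Step 2 (replaces TableRecalibration): rewrite the BAM using that table.
    apply_ = gatk_cmd(gatk_jar, "PrintReads", ref, in_bam, out_bam,
                      extra=["-BQSR", table], nct=nct)
    return recal, apply_
```

Each returned list can be passed to `subprocess.check_call`, and timing the two steps with different `nct` values is a simple way to see how much the threading helps on your machine.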

12.3 years ago

If you have paired tumor / normal samples from the same individual, I would highly recommend the following tools for somatic variant discovery, in order of personal preference:

  1. Strelka
  2. VarScan2
  3. SomaticSniper

Strelka is relatively new, but extremely powerful with paired samples. Take a look at the paper (Saunders CT, Wong WSW, Swamy S, Becq J, Murray LJ, and Cheetham RK. 2012. Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics 28: 1811–1817.)

Somatic variant discovery is an emerging methodology, and you would be wise to apply a detection method different than that for germline variant detection.


I will get the paired sample data soon, but right now I can only work with these unpaired data. Maybe I will try different methods and keep the overlapping calls. Thanks for your suggestions.
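A minimal sketch of the "overlap" idea, assuming each caller's output has been converted to VCF-style tab-separated lines (CHROM, POS, ID, REF, ALT, ...). The function names are hypothetical. Calls are matched exactly on chromosome, position, reference allele, and alternate allele, so indels that different callers represent differently will not match unless they are normalized first.

```python
from collections import Counter


def read_calls(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF-style lines."""
    calls = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):  # one key per alternate allele
            calls.add((chrom, pos, ref, alt))
    return calls


def consensus(call_sets, min_callers=2):
    """Keep calls reported by at least min_callers of the given call sets."""
    counts = Counter()
    for calls in call_sets:
        counts.update(calls)  # each set contributes at most 1 per call
    return {call for call, n in counts.items() if n >= min_callers}
```

With real files, something like `read_calls(open(path))` per caller followed by `consensus([...])` gives the shared calls; raising `min_callers` makes the filter stricter when more than two methods are compared.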
