As I continue to add steps to my SNP/indel discovery workflow, the latest recommendation is local realignment around indels using GATK, following the initial alignment step. I have just started the step that generates the target intervals for realignment (RealignerTargetCreator), and it looks like it will take an hour to complete, with the realignment itself still required after that. My test data set is a single sample of approximately 5 million paired-end 100 bp reads.
For an upcoming project, my plan is to run 150 similarly sized samples, so adding such time-consuming steps will have a major impact on timelines. Can anyone with experience in this area comment on the time required for indel realignment vs. the benefits gained? Is it worth it?
I'll give SRMA a go. The GATK RealignerTargetCreator didn't generate a file in the end. Not sure what the problem is...
Also - I do have access to a cluster. I was hoping to avoid home-brew parallelization but it's looking increasingly necessary!
Why is it home-brew? :) I might be misunderstanding you, but running, say, 10 separate jobs in parallel seems efficient to me.
I guess I just mean that the program doesn't do it all automatically for me. Maybe I'm just lazy :)
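For anyone landing here later, the "10 separate jobs" idea can be sketched in a few lines of Python. This is only a rough sketch under some assumptions: one BAM per sample, a GATK 3.x jar (the realignment tools are not part of GATK 4), and placeholder paths `GenomeAnalysisTK.jar` and `reference.fasta`. The helper names `target_creator_cmd` and `run_all` are made up for illustration. On a cluster you would more likely hand each command to the scheduler than run them from one node.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

GATK_JAR = "GenomeAnalysisTK.jar"  # assumed location of a GATK 3.x jar
REFERENCE = "reference.fasta"      # assumed reference genome


def target_creator_cmd(bam, jar=GATK_JAR, reference=REFERENCE):
    """Build the RealignerTargetCreator command line for one sample BAM."""
    # sample.bam -> sample.intervals (output naming is an assumption)
    out = str(Path(bam).with_suffix(".intervals"))
    return ["java", "-jar", jar, "-T", "RealignerTargetCreator",
            "-R", reference, "-I", str(bam), "-o", out]


def run_all(bams, max_jobs=10, runner=subprocess.run):
    """Run one job per BAM, at most max_jobs at once; map BAM -> exit code."""
    def run_one(bam):
        return bam, runner(target_creator_cmd(bam)).returncode

    # Threads are enough here: the real work happens in the child processes.
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        return dict(pool.map(run_one, bams))
```

Calling `run_all(sorted(str(p) for p in Path(".").glob("*.bam")))` would then process every BAM in the current directory, ten at a time. A non-zero exit code in the result is a hint to check that sample's log, which may also help track down cases where no intervals file gets written.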