Variant Analysis on MiSeq Data
9.7 years ago
gkuffel22 ▴ 100

Hi everyone,

I am trying to figure out the most efficient way to perform variant analysis on a large dataset. I have 200 samples with forward and reverse reads, for a total of 400 FASTQ files. I was able to load all of these files into Galaxy and create a workflow that looks like this:

FASTQ Groomer -> Trim -> BWA-MEM -> flagstat -> Generate pileup -> Filter pileup

I have now realized that there is no way to loop through or automate my workflow over all of my FASTQ files. Is there a better way to do this other than running the workflow 200 times manually? Can I create a script on the command line and use my FASTQ files as the input? If anyone has any suggestions, or is aware of software that can handle this type of job, I would really appreciate your help.
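For reference, my understanding is that a single-sample run of roughly the same steps on the command line would look something like the sketch below; the tool choices, options, and file names are just placeholders, not the exact Galaxy tool versions.

    # Rough single-sample equivalent of the Galaxy workflow above.
    # Tool choices, options, and file names are assumptions.
    REF=reference.fasta
    SAMPLE=sample01

    # Quality/adapter trimming (Trimmomatic shown as one option)
    trimmomatic PE ${SAMPLE}_R1.fastq ${SAMPLE}_R2.fastq \
        ${SAMPLE}_R1.trimmed.fastq ${SAMPLE}_R1.unpaired.fastq \
        ${SAMPLE}_R2.trimmed.fastq ${SAMPLE}_R2.unpaired.fastq \
        SLIDINGWINDOW:4:20 MINLEN:36

    # Alignment with BWA-MEM, then basic mapping statistics
    bwa mem "$REF" ${SAMPLE}_R1.trimmed.fastq ${SAMPLE}_R2.trimmed.fastq \
        | samtools sort -o ${SAMPLE}.sorted.bam -
    samtools flagstat ${SAMPLE}.sorted.bam > ${SAMPLE}.flagstat.txt

    # Pileup generation (filtering can follow on this output)
    samtools mpileup -f "$REF" ${SAMPLE}.sorted.bam > ${SAMPLE}.pileup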

Variant-Analysis SNP • 2.7k views

Do you have to use Galaxy? If so, you might want to post on the Galaxy-specific version of this site. If not, you can certainly just create a script to do this for you (that's what most of us do).


I am completely open to writing a script and leaving Galaxy behind; I just don't know where to start. I have some programming experience (Java, Python), so any suggestions would be helpful.


Popular options would be to use shell scripts or a Makefile. You could also use Python, but I imagine that would prove a bit more work. There's also ngsxml, though I have to confess I'm not very familiar with it (the author, Pierre, is a regular here and writes great stuff, so I expect it's good).
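A minimal sketch of the shell-script approach, assuming your files are named like sample_R1.fastq / sample_R2.fastq (adjust the naming pattern and the per-sample commands to your data):

    #!/bin/bash
    # Run the same per-sample steps for every pair of FASTQ files.
    # The naming pattern (*_R1.fastq) and the commands inside the loop
    # are assumptions; swap in whatever your pipeline actually needs.
    set -euo pipefail

    REF=reference.fasta

    for r1 in fastq/*_R1.fastq; do
        r2=${r1/_R1/_R2}                      # matching reverse-read file
        sample=$(basename "$r1" _R1.fastq)

        echo "Processing $sample"
        bwa mem "$REF" "$r1" "$r2" | samtools sort -o "${sample}.sorted.bam" -
        samtools flagstat "${sample}.sorted.bam" > "${sample}.flagstat.txt"
        samtools mpileup -f "$REF" "${sample}.sorted.bam" > "${sample}.pileup"
    done

The same loop body drops straight into a Makefile rule or GNU parallel if you want the samples to run concurrently rather than one after another.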


Do you have a Linux system? I am a bioinformatician, we have a MiSeq and a HiSeq, and I have written a lot of shell scripts for Illumina reads covering filtration, alignment, and variant calling. Do you want me to share them?

9.7 years ago
Yahan ▴ 400

Assuming that you are working in a grid environment, what we use is bpipe, an excellent tool for developing pipelines and workflows. You could use it to perform the different steps needed to arrive at your SNP calling. One of the advantages of bpipe is that it manages the parallelisation of the different steps for you. The documentation has an example of a SNP-calling workflow.

It also depends on which SNP caller you want to use. If you use samtools or GATK with read mappings in BAM files, you will need a full pipeline that does read mapping, sorting, duplicate removal, realignment, indexing, and so on.
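As a rough illustration of those post-alignment steps for a single sample (samtools/bcftools shown here; GATK has its own equivalents, and the exact options are assumptions):

    # Sort, remove duplicates, index, then call variants for one sample.
    # Tool choices and options are assumptions for illustration.
    samtools sort -o sample.sorted.bam sample.bam
    samtools rmdup sample.sorted.bam sample.dedup.bam    # or Picard MarkDuplicates
    samtools index sample.dedup.bam
    bcftools mpileup -f reference.fasta sample.dedup.bam \
        | bcftools call -mv -Oz -o sample.vcf.gz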

However, discoSnp is an interesting alternative that does SNP calling without a reference. That would limit your preprocessing to quality trimming, after which you can do the calling in a single command line that includes all your samples. I'm not sure how it performs on 200 samples, though. It does not support paired-end reads, so you would also have to merge your paired data into one FASTQ per sample, but maybe that's a trade-off you are willing to accept given how much it simplifies the task.
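If you go that route, collapsing each pair into a single per-sample FASTQ is straightforward; the naming pattern below is an assumption:

    # Concatenate forward and reverse reads into one FASTQ per sample,
    # since the reference-free calling described above takes unpaired reads.
    mkdir -p merged
    for r1 in fastq/*_R1.fastq; do
        r2=${r1/_R1/_R2}
        sample=$(basename "$r1" _R1.fastq)
        cat "$r1" "$r2" > "merged/${sample}.fastq"
    done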

9.7 years ago
Zaag ▴ 870

I only use Galaxy for small jobs, but under Workflow Control you can select an Input Dataset Collection; maybe that helps.
