This is the original version of the question. Please jump to the edit below as it makes the question clearer.
I don't have a lot of experience with bacterial NGS but have recently started working on a project that explores the differences (e.g. SNPs, indels) between bacterial strains. My first idea was to take a reference and go for some sort of GATK-based pipeline. I was thinking of first aligning the Illumina reads for strain A against a reference and finding the indels and SNPs, then doing the same for strain B, and then comparing the results. Then I realised that I might be taking an unnecessary detour here. Wouldn't it make sense to directly compare the raw reads of strain A to those of strain B and find the differences without aligning to an intermediate reference?
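To make the reference-based idea concrete, here is a minimal sketch of the final comparison step only. It assumes both strains have already been aligned to the same reference and that the variant calls have been parsed into simple records; the `Variant` type, the `compare_calls` function and the example calls are all made up for illustration and are not part of GATK or any other tool.

```python
from typing import Iterable, NamedTuple


class Variant(NamedTuple):
    """One variant call relative to the shared reference (hypothetical record)."""
    contig: str
    pos: int   # 1-based position on the reference
    ref: str   # reference allele
    alt: str   # alternate allele


def compare_calls(calls_a: Iterable[Variant], calls_b: Iterable[Variant]) -> dict:
    """Split two call sets into variants unique to A, unique to B, and shared."""
    a, b = set(calls_a), set(calls_b)
    return {
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        "shared": sorted(a & b),
    }


# Invented example calls, not real data.
strain_a = [Variant("chr", 1042, "A", "G"), Variant("chr", 5310, "CT", "C")]
strain_b = [Variant("chr", 1042, "A", "G"), Variant("chr", 7755, "G", "T")]

diff = compare_calls(strain_a, strain_b)
for label, variants in diff.items():
    print(label, variants)
```

Note that a variant in `shared` is a difference from the reference, not a difference between the two strains. The exact-match comparison also assumes that indels are represented the same way in both call sets, and a position with no coverage in one strain would look like "no variant" there.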
The closest equivalent setting I can think of would be in cancer genomics, i.e. comparing tumours to healthy cells. Would any of those cancer genomics pipelines be applicable?
EDIT
Judging from the responses, I haven't explained my case very well. Here is my project:
- I have a set of unknown strains of species A taken from, say, the oral cavities of a group of people. Each individual gives rise to a single sample. The bacterial species is cultured and then sequenced on a MiSeq.
- Similarly, I have a set of unknown strains of the same species but taken from another body compartment, say the gut. These may (or may not) come from the same people as above.
The task is to find whether there is anything that systematically differs between the first and the second set.
I was first thinking of picking a reference and then doing alignment for each sample. But this poses the problem of picking the right reference. As we have dozens of samples, the most suitable reference may not even be the same for all of them. Isn't there a way to compare the samples as sets so that you can say something along the lines of:
- the first set lacks gene X
or
- the first set has a particular mutation in a gene Y
To be able to make such statements, do I really need to perform alignment on each sample in isolation?
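To illustrate the kind of set-level statement I mean ("the first set lacks gene X"), here is a rough sketch. It assumes that a list of genes present in each sample is already available by some means (for example from per-sample assemblies), which is exactly the part I am unsure how to obtain. The function names, gene names and sample sets are all invented, and the test used (Fisher's exact test on a 2x2 presence/absence table) is just one plausible choice, not an established pipeline.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are the two sample sets, columns are gene present / gene absent.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x "present" calls in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at least as unlikely as the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def compare_gene_content(set_a: list, set_b: list) -> list:
    """Return (gene, n_present_a, n_present_b, p_value) rows, smallest p first."""
    genes = set().union(*set_a, *set_b)
    rows = []
    for gene in sorted(genes):
        in_a = sum(gene in sample for sample in set_a)
        in_b = sum(gene in sample for sample in set_b)
        p = fisher_exact_two_sided(in_a, len(set_a) - in_a, in_b, len(set_b) - in_b)
        rows.append((gene, in_a, in_b, p))
    return sorted(rows, key=lambda row: row[3])


# Invented example: five oral and five gut samples; oral samples lack geneX.
oral = [{"geneY"} for _ in range(5)]
gut = [{"geneX", "geneY"} for _ in range(5)]

results = compare_gene_content(oral, gut)
for gene, in_oral, in_gut, p in results:
    print(f"{gene}: oral {in_oral}/5, gut {in_gut}/5, p = {p:.4f}")
```

With thousands of genes the p-values would need a multiple-testing correction, and related strains are not independent samples, so this only shows the shape of the comparison. The same table-per-feature idea would apply to "a particular mutation in gene Y" if each sample were reduced to a set of variants instead of a set of genes.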
Thanks, Dan. I've edited my question to make it more precise. Can you, perhaps, suggest a pipeline that would help me address this task?
Not beyond what I've already suggested really. I'm not an expert and have only dabbled in this area.