Dear all,
this might be (is!) very much a n00b question, and I hope you forgive me on this one ;)
To set the context, I'm a biologist, specialised in (terrestrial) ecology with main interests in dispersal and population dynamics. However, I recently started at a microbiology lab to do research on community composition and turnover. Now, I had some microbiology courses during my training, but that's about it. I have no experience whatsoever with sequencing or the analysis thereof.
I am the first one in this lab to use high-throughput techniques (pyrosequencing), but some data are already available (the sequencing was outsourced). The hypervariable V1-V3 regions of the 16S SSU rRNA gene were sequenced. AmpliconNoise was used to clean up the raw data, after which the sequences were run against the RDP database.
So right now I'm processing this particular dataset. I learnt some Perl to perform some quality checks on the sequences (orientation, length, ...). I removed primers and tags and ran the sequences against the RDP to compare 'my' results with the ones already in the database, not only for the sake of double-checking but also to get acquainted with the matter ...
To get to the point: how do you analyse this kind and, not least, this amount of data? I just want to make sure I'm doing everything the right way (assuming everything up to this point was processed correctly).
People here use phylogenetic trees (which I still need to master) to describe community structure, as is standard practice I assume, but this seems impractical for this kind of data. So I'm creating pivot tables to compare the presence (and abundance) of taxa between samples. I compare the 'raw' (though AmpliconNoised) data with selections based on sequence length (200, 250, 300 bp) and similarity (more than .95, .97 or .99), separately for the forward and reverse reads. Further, I total the forward and reverse reads (to look for the 'total diversity') and take the highest abundance as the 'correct' one (notice the quotation marks ...).
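To make the procedure above concrete, here is a minimal Python sketch of the pivot-table step. The record layout, the function names `pivot` and `summarise`, and the example reads are all invented for illustration; they are not my actual data or scripts. I also assume "length" means "at least N bp" and "similarity" means "strictly more than the threshold".

```python
from collections import defaultdict

# Hypothetical records: (sample, direction, taxon, read_length, similarity).
# These values are made up purely to illustrate the filtering and counting.
reads = [
    ("S1", "fwd", "Bacillus", 310, 0.99),
    ("S1", "fwd", "Bacillus", 240, 0.96),
    ("S1", "rev", "Bacillus", 305, 0.98),
    ("S1", "rev", "Pseudomonas", 320, 0.97),
    ("S2", "fwd", "Pseudomonas", 280, 0.99),
    ("S2", "rev", "Pseudomonas", 190, 0.94),
    ("S2", "rev", "Bacillus", 300, 0.95),
]


def pivot(records, min_length=0, min_similarity=0.0):
    """Count reads per taxon, sample and direction after filtering.

    Assumption: a read is kept if its length is >= min_length and its
    similarity is strictly greater than min_similarity.
    """
    table = defaultdict(lambda: defaultdict(lambda: {"fwd": 0, "rev": 0}))
    for sample, direction, taxon, length, similarity in records:
        if length >= min_length and similarity > min_similarity:
            table[taxon][sample][direction] += 1
    return table


def summarise(table):
    """For each (taxon, sample): forward + reverse total, and the larger of the two."""
    summary = {}
    for taxon, samples in table.items():
        for sample, counts in samples.items():
            summary[(taxon, sample)] = {
                "total": counts["fwd"] + counts["rev"],
                "highest": max(counts["fwd"], counts["rev"]),
            }
    return summary


# Compare the unfiltered data with one length/similarity selection.
unfiltered = summarise(pivot(reads))
selected = summarise(pivot(reads, min_length=250, min_similarity=0.95))
for key in sorted(unfiltered):
    print(key, unfiltered[key], selected.get(key, "absent after filtering"))
```

Running the same two functions with each length and similarity threshold gives one table per selection, which can then be compared side by side.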
Does this seem like a correct way to do this? And how do you represent the results (I mean, plots, tables, ...)? At this stage, we really only want to look at diversity (what is present). In a later stage, we want to link the community composition to environmental parameters and find indicator species. Now, one (personal) problem I have with all this is the lack of replicates ... At least with the data I currently have to work with. For future analyses this will be dealt with.
Further, some other questions:
Concerning the possibly dubious nature of pyrosequenced reads, do you think AmpliconNoise is (good) enough to ensure that the reads are biologically relevant? Where do you set the limits when cleaning up your data? I feel like using only those reads of 300 bp or longer for robustness' sake, but perhaps this way you miss a lot of (relevant) information?
Do you think it is possible to infer abundances from pyrosequence reads, taking into account the possibility of multiple rRNA operons per genome? Amend et al. (2010) put forward the possibility of semi-quantitativeness, implying that abundances are only comparable within a species/OTU between samples.
How do you process raw high-throughput data for community analysis?
Other suggestions? (articles, books, experiences, ...)
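On the semi-quantitative point above (Amend et al. 2010), this is how I currently read it, as a small Python sketch: convert counts to proportions per sample, then compare the same OTU across samples rather than different OTUs within one sample. The OTU names, counts and function names are hypothetical, and I am not sure this is the right normalisation.

```python
def relative_abundance(counts):
    """Turn raw read counts per sample into proportions of that sample's total."""
    result = {}
    for sample, otus in counts.items():
        total = sum(otus.values())
        result[sample] = {otu: n / total for otu, n in otus.items()} if total else {}
    return result


def between_sample_ratio(rel, otu, sample_a, sample_b):
    """Ratio of one OTU's proportion in sample_a to its proportion in sample_b.

    Returns None when the OTU is absent from sample_b (ratio undefined).
    """
    a = rel.get(sample_a, {}).get(otu, 0.0)
    b = rel.get(sample_b, {}).get(otu, 0.0)
    return a / b if b else None


# Made-up counts, only to show the comparison.
counts = {
    "S1": {"OTU_1": 60, "OTU_2": 40},
    "S2": {"OTU_1": 30, "OTU_2": 170},
}
rel = relative_abundance(counts)

# Within-OTU, between-sample comparison: the kind the semi-quantitative
# reading would allow. Comparing OTU_1 with OTU_2 inside S1 would not be.
print(between_sample_ratio(rel, "OTU_1", "S1", "S2"))
```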
Many thanks! Kind regards.