Purely based on required header information and data/stats pulled from the BAM file, is there any way to guess the sequencing platform (454, Illumina, Ion Torrent, etc.) used to generate data for a BAM file? Does a tool already exist that does this?
So far all I can come up with are the average read length, the number of reads produced, and the error rate, all of which vary from platform to platform. I also thought the quality encoding might be useful for this guess. I found the thread "How To Determine The Version Used To Generate Solexa/Illumina Fastq Files?" useful, though it too only yields a guess at what the encoding could be.
Any ideas of other stats that may be useful would be extremely appreciated.
Thanks!
To a certain extent this is often possible. Read name formatting is often machine dependent.
Keep in mind SAM and BAM files may include reads from different runs, samples, technologies, etc.
This should be interesting - finding discrete patterns (or sets of patterns) to predict data sources.
The SAM spec defines tags on the read group (@RG) header line that could help: platform/technology (PL) and platform model (PM). Both are optional and are filled in by the aligner or by whoever made the file, so you're completely at their mercy.
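As a rough sketch of checking those tags, the snippet below pulls PL and PM out of the @RG lines of a SAM header. It assumes you already have the header as text (for example from `samtools view -H file.bam`). The function name `read_group_platforms` and the example header are made up for illustration.

```python
def read_group_platforms(header_text):
    """Return {read group ID: (PL, PM)} for every @RG line in a SAM header.

    Missing tags come back as None, since PL and PM are optional.
    """
    platforms = {}
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # After the record type, fields are TAB-separated TAG:VALUE pairs.
        fields = line.split("\t")[1:]
        tags = dict(f.split(":", 1) for f in fields if ":" in f)
        platforms[tags.get("ID")] = (tags.get("PL"), tags.get("PM"))
    return platforms


# Hypothetical header, for illustration only.
example_header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@RG\tID:rg1\tSM:sampleA\tPL:ILLUMINA\tPM:HiSeq 2000\n"
    "@RG\tID:rg2\tSM:sampleB\n"
)
print(read_group_platforms(example_header))
```

If a read group comes back with no PL at all, you are down to guessing from the reads themselves.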
As suggested, read naming schemes are usually machine-dependent. This depends on having the raw data, though. In some cases the reads may be relabeled with uninformative names, and then you're out of luck. For example, the SRA does this, and I've seen published datasets that have been aggressively filtered with renamed reads.
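To make the read-name idea concrete, here is a minimal sketch that tallies read names against a few regular expressions. The patterns are my own approximations of commonly seen naming conventions (colon-separated Illumina names, 14-character 454 names, `RUNID:row:col` Ion Torrent names), not an authoritative list, and the example names are invented. Expect to adjust them for your data.

```python
import re
from collections import Counter

# Heuristic patterns (assumptions, not a complete or official catalogue).
# Order matters: the first matching pattern wins.
NAME_PATTERNS = [
    ("Illumina (CASAVA 1.8+ style)",
     re.compile(r"^[^:\s]+:\d+:[^:\s]+:\d+:\d+:\d+:\d+$")),
    ("Illumina (older style)",
     re.compile(r"^[^:\s]+:\d+:\d+:\d+:\d+(#\S+)?(/[12])?$")),
    ("Ion Torrent", re.compile(r"^[A-Z0-9]{5}:\d+:\d+$")),
    ("454", re.compile(r"^[A-Z0-9]{14}$")),
]


def guess_platform_from_name(read_name):
    """Return the label of the first matching pattern, or 'unknown'."""
    for label, pattern in NAME_PATTERNS:
        if pattern.match(read_name):
            return label
    return "unknown"


def tally_platforms(read_names):
    """Count platform guesses over an iterable of read names."""
    return Counter(guess_platform_from_name(name) for name in read_names)


# Invented example names, one per pattern plus a relabeled SRA-style name.
names = [
    "HWI-ST1234:100:C0ABCACXX:1:1101:1234:5678",
    "HWUSI-EAS100R:6:73:941:1973#0/1",
    "QWGZ4:00017:00049",
    "FTWCYXX01BTPDQ",
    "SRR1234567.1",
]
print(tally_platforms(names))
```

Running this over the first few thousand read names of a BAM is usually enough. A mixed tally would also flag files that combine several runs or technologies.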
Interesting question - it is very likely possible to detect the platform from the data itself. For example, adapter contamination (see whether some of your reads end with GATCGGAA, part of the Illumina adapter), the error distribution, read lengths and orientations (454 produces variable read lengths), and many other pieces of information could be combined to identify the platform. But there is probably no tool that does this, since it is just not what scientists use the data for.
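A minimal sketch of the adapter check, assuming the seed `AGATCGGAAGAGC` (the commonly cited start of the Illumina TruSeq adapter, which contains the GATCGGAA mentioned above) and reads in their original sequencing orientation. Note that BAM stores reverse-strand alignments reverse-complemented, so for those you would need to flip the sequence back first. The function names and the `min_overlap` threshold are illustrative choices.

```python
# Assumed adapter seed; swap in whatever adapter you suspect.
ADAPTER_SEED = "AGATCGGAAGAGC"


def has_adapter(read, seed=ADAPTER_SEED, min_overlap=8):
    """True if the read contains the seed, or ends with a prefix of it
    that is at least min_overlap bases long (partial adapter at the 3' end)."""
    read = read.upper()
    if seed in read:
        return True
    for k in range(len(seed) - 1, min_overlap - 1, -1):
        if read.endswith(seed[:k]):
            return True
    return False


def adapter_fraction(reads, seed=ADAPTER_SEED, min_overlap=8):
    """Fraction of reads showing the adapter; 0.0 for an empty input."""
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(has_adapter(r, seed, min_overlap) for r in reads)
    return hits / len(reads)


# Invented reads: full seed inside, partial seed at the end, no adapter.
reads = [
    "ACGTACGTAGATCGGAAGAGCACACG",
    "TTTTGGGGCCCCAGATCGGAA",
    "ACGTACGTACGTACGT",
]
print(adapter_fraction(reads))
```

A clearly non-zero fraction hints at Illumina, but a zero fraction proves nothing, since the reads may have been adapter-trimmed before alignment.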