Question

Is it possible to guess the sequencing platform used from a FASTQ/BAM file?

9.4 years ago

Andrew ▴ 60

Purely based on required header information and data/stats pulled from the BAM file, is there any way to guess the sequencing platform (454, Illumina, Ion Torrent, etc.) used to generate data for a BAM file? Does a tool already exist that does this?

So far all I can find is the average read length, the number of reads produced, and the error rate, all of which vary from platform to platform. I also thought the quality encoding might be useful for this guess. I found the thread "How To Determine The Version Used To Generate Solexa/Illumina Fastq Files?" useful, though that too only gives a guess at what the encoding could be.
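To make the quality-encoding idea concrete, here is a minimal sketch that guesses the offset from the range of quality characters seen. The thresholds are assumptions based on the commonly cited ranges (Phred+33 starts at `!`, Solexa+64 at `;`, Phred+64 at `@`, and `J` is taken as the usual top of Illumina 1.8+ Phred+33 data); the function names and the 4-lines-per-record FASTQ layout are also assumptions, and a file containing only high qualities stays ambiguous.

```python
def guess_quality_encoding(quality_strings):
    """Guess the quality offset from the range of ASCII codes observed."""
    lo, hi = None, None
    for qual in quality_strings:
        for ch in qual:
            code = ord(ch)
            lo = code if lo is None else min(lo, code)
            hi = code if hi is None else max(hi, code)
    if lo is None:
        return "unknown"      # no quality characters seen
    if lo < 59:
        return "phred+33"     # characters below ';' only occur with offset 33
    if hi <= 74:
        return "ambiguous"    # everything falls in the range shared by all encodings
    if lo < 64:
        return "solexa+64"    # ';' to '?' are negative Solexa scores (assumed)
    return "phred+64"


def fastq_qualities(path, limit=10000):
    """Yield the quality line of the first `limit` records of a 4-line FASTQ file."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i // 4 >= limit:
                break
            if i % 4 == 3:    # fourth line of each record holds the qualities
                yield line.rstrip("\n")
```

With a hypothetical file name, this would be called as `guess_quality_encoding(fastq_qualities("reads.fastq"))`. The result only narrows down the pipeline version, not the instrument itself.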

Any ideas of other stats that may be useful would be extremely appreciated.

Thanks!

BAM FASTQ • 3.8k views

updated 23 months ago by Ram 44k • written 9.4 years ago by Andrew ▴ 60

To a certain extent this is often possible. Read name formatting is often machine dependent.
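As a rough sketch of what matching on read names could look like: the patterns below are assumptions based on commonly seen naming conventions (seven colon-separated fields for newer Illumina pipelines, five for older ones, a 5-character run ID plus row and column for Ion Torrent, and 14-character alphanumeric names for 454). They are heuristics, not an authoritative catalogue, and the labels and function names are made up for illustration.

```python
import re
from collections import Counter

# Heuristic patterns, checked in order; all of them are assumptions.
READ_NAME_PATTERNS = [
    ("illumina", re.compile(r"^[\w-]+:\d+:[\w-]+:\d+:\d+:\d+:\d+")),
    ("illumina_old", re.compile(r"^[\w-]+:\d+:\d+:\d+:\d+(#\S+)?(/[12])?$")),
    ("ion_torrent", re.compile(r"^[A-Z0-9]{5}:\d+:\d+$")),
    ("454", re.compile(r"^[A-Z0-9]{14}$")),
]


def guess_platform_from_name(read_name):
    """Return a platform label for one read name, or 'unknown'."""
    tokens = read_name.lstrip("@").split()
    if not tokens:
        return "unknown"
    name = tokens[0]  # ignore the comment part after the first whitespace
    for label, pattern in READ_NAME_PATTERNS:
        if pattern.match(name):
            return label
    return "unknown"


def tally_platforms(read_names):
    """Count guesses over many reads, since one file may mix sources."""
    return Counter(guess_platform_from_name(n) for n in read_names)
```

Tallying over many reads rather than trusting a single name gives some protection against files that combine several runs or technologies.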

9.4 years ago by Devon Ryan 104k

Keep in mind SAM and BAM files may include reads from different runs, samples, technologies, etc.

9.4 years ago by h.mon 35k

This should be interesting - finding discrete patterns (or sets of patterns) to predict data sources.

9.4 years ago by Ram 44k

The SAM spec defines tags on the read group (@RG) header line that could help: platform/technology (PL) and platform model (PM). These are usually filled in by the aligner or by whoever made the file, so you're completely at their mercy.
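A minimal sketch of pulling those tags out of a header, assuming the header text is already available as a string (for example from `samtools view -H`). The function name is made up for illustration, and missing tags come back as `None` because nothing forces the file's author to set them.

```python
def read_group_platforms(sam_header_text):
    """Map each read group ID to its (PL, PM) tags; missing tags are None."""
    groups = {}
    for line in sam_header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        fields = dict(
            field.split(":", 1)
            for field in line.split("\t")[1:]
            if ":" in field
        )
        rg_id = fields.get("ID")
        if rg_id is not None:
            groups[rg_id] = (fields.get("PL"), fields.get("PM"))
    return groups
```

Returning one entry per read group also shows at a glance when a file combines data from more than one platform.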

As suggested, read naming schemes are usually machine-dependent. This depends on having the raw data, though. In some cases the reads may be relabeled with uninformative names, and then you're out of luck. For example, the SRA does this, and I've seen published datasets that have been aggressively filtered with renamed reads.

updated 23 months ago by Ram 44k • written 9.4 years ago by matted 7.8k

Interesting question. It is very likely possible to detect the platform from the data itself. For example, adapter contamination (check whether some of your reads end with GATCGGAA, part of the Illumina adapter sequence), the error distribution, read lengths and orientations (the 454 produces variable read lengths), and many other pieces of information combined could help identify the platform. But there is probably no tool that does this, since it is just not what scientists use the data for.
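Two of these signals are easy to sketch. The code below assumes the reads are plain sequence strings already extracted from the file; the `tail` window of 30 bases and the function names are arbitrary choices for illustration, and reads that were adapter-trimmed upstream will show nothing.

```python
from statistics import mean, pstdev

ADAPTER_FRAGMENT = "GATCGGAA"  # the fragment mentioned above


def adapter_fraction(reads, fragment=ADAPTER_FRAGMENT, tail=30):
    """Fraction of reads carrying the fragment within their last `tail` bases."""
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(1 for read in reads if fragment in read.upper()[-tail:])
    return hits / len(reads)


def length_summary(reads):
    """Summarise read lengths; uniform lengths hint at fixed-cycle chemistry."""
    lengths = [len(read) for read in reads]
    if not lengths:
        return None
    return {
        "mean": mean(lengths),
        "stdev": pstdev(lengths),
        "fixed_length": len(set(lengths)) == 1,
    }
```

Neither number is conclusive alone: a noticeable adapter fraction points towards Illumina, while widely varying lengths fit the 454 pattern described above but also fit any trimmed data set.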

updated 23 months ago by Ram 44k • written 9.4 years ago by Istvan Albert 102k