I am looking for small, lightweight bioinformatics pipelines to use for demonstration purposes during program development. They should run easily without downloading large reference data sets. Ideally, it would be something I can bundle with a larger program as a basic placeholder to show that a "generic bioinformatics pipeline is working". It should meet criteria such as:
- required input data is small, less than a Megabyte (MB) in size
- required software can be easily bundled with Docker or conda/pip (preferably the latter)
- total execution time does not exceed a minute or so on a lightweight machine (e.g. a cheap laptop)
- the pipeline output data would ideally be in a flat text format of some sort, so that it can be easily parsed by unit testing tools to verify results
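To make the criteria concrete, here is a minimal sketch of the kind of placeholder I have in mind, using only the Python standard library. The embedded FASTA records and the function names (`read_fasta`, `gc_table`, `parse_table`) are made up for illustration and do not come from any existing tool; it only shows a tiny input, a flat-text (TSV) output, and a check that a unit test could run against it.

```python
import csv
import io

# Hypothetical tiny input: two FASTA records, a few bytes in total.
DEMO_FASTA = """>seq1
ACGTACGT
>seq2
GGGCCCAT
"""


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def gc_table(fasta_text):
    """Toy 'pipeline' step: return a TSV string of name, length, GC fraction."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["name", "length", "gc_fraction"])
    for name, seq in read_fasta(io.StringIO(fasta_text)):
        gc = (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0
        writer.writerow([name, len(seq), f"{gc:.3f}"])
    return out.getvalue()


def parse_table(tsv_text):
    """Parse the TSV output back into a dict keyed by record name."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["name"]: row for row in rows}


if __name__ == "__main__":
    print(gc_table(DEMO_FASTA), end="")
```

A unit test could then call `parse_table(gc_table(DEMO_FASTA))` and compare individual fields against known values. A real demonstration pipeline would replace `gc_table` with actual bioinformatics tools, but the shape (tiny input, flat-text output, parsed assertions) would stay the same.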
For a while I thought I had found a good candidate in .vcf file annotation with VEP, since I can easily bundle some tiny demo .vcf files in a git repo along with a VEP install script and Docker container. Unfortunately, the MySQL ports that VEP needs to query its online reference databases are blocked by my employer.
Any other suggestions for this?
We have a small demonstration for polygenic risk score analysis. We simulated the data using the 1000 Genomes Project data, and the resulting file is around 100 MB after compression. Maybe you can use that?
Website is here