What is MLBlocks?
Free software: MIT license
MLBlocks is a simple framework for seamlessly combining any possible set of Machine Learning tools developed in Python, whether they are custom developments or belong to third-party libraries, and building Pipelines out of them that can be fitted and then used to make predictions.
This is achieved by providing a simple and intuitive annotation language that allows the user to specify how to integrate each tool, here called a primitive, in order to provide a common, uniform interface to all of them.
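As a rough illustration of what such an annotation looks like, the sketch below annotates a scikit-learn classifier. The exact schema keys and their layout are an assumption for illustration purposes; the authoritative format is the one shipped with the library's own primitive catalogs:

```json
{
    "name": "sklearn.ensemble.RandomForestClassifier",
    "primitive": "sklearn.ensemble.RandomForestClassifier",
    "fit": {
        "method": "fit",
        "args": [
            {"name": "X", "type": "ndarray"},
            {"name": "y", "type": "array"}
        ]
    },
    "produce": {
        "method": "predict",
        "args": [
            {"name": "X", "type": "ndarray"}
        ],
        "output": [
            {"name": "y", "type": "array"}
        ]
    },
    "hyperparameters": {
        "fixed": {
            "n_jobs": {"type": "int", "default": -1}
        },
        "tunable": {
            "n_estimators": {"type": "int", "default": 100, "range": [10, 500]}
        }
    }
}
```

The key idea is that the annotation tells the framework which native methods play the roles of fitting and producing output, and which hyperparameters exist and can be tuned, so that every primitive can be driven through the same interface.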
At a high level:
Each available primitive has been annotated using a standardized JSON file that specifies its native interface, as well as which hyperparameters can be used to tune its behavior.
A list of primitives that will be combined into a pipeline is provided by the user, optionally passing along the hyperparameters to use for each primitive.
An MLBlock instance is built for each primitive, offering a common interface for all of them.
The MLBlock instances are then combined into an MLPipeline instance, able to run them all in the right order, passing the output from each one as input to the next one.
The training data is passed to the MLPipeline.fit method, which sequentially fits each MLBlock instance following the JSON annotation specification.
The data used to make predictions is passed to the MLPipeline.predict method, which uses each MLBlock sequentially to obtain the desired predictions.
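The steps above can be sketched in plain Python with toy primitives. The class names and interfaces here are illustrative, not the real MLBlocks API, which builds MLBlock instances from JSON annotations; the sketch only shows the sequential fit/produce chaining the list describes:

```python
class ScaleBlock:
    """Toy primitive: learns the maximum of the data, then scales by it."""
    def fit(self, X):
        self.max_ = max(X)

    def produce(self, X):
        return [x / self.max_ for x in X]


class ThresholdBlock:
    """Toy primitive: predicts 1 for values above a fixed threshold."""
    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def fit(self, X):
        pass  # nothing to learn

    def produce(self, X):
        return [int(x > self.threshold) for x in X]


class Pipeline:
    """Runs blocks in order, passing each block's output to the next."""
    def __init__(self, blocks):
        self.blocks = blocks

    def fit(self, X):
        # Fit each block, then transform the data before fitting the next one.
        for block in self.blocks:
            block.fit(X)
            X = block.produce(X)

    def predict(self, X):
        # Run every block sequentially to obtain the final predictions.
        for block in self.blocks:
            X = block.produce(X)
        return X


pipeline = Pipeline([ScaleBlock(), ThresholdBlock(threshold=0.5)])
pipeline.fit([1, 2, 3, 4])
print(pipeline.predict([1, 2, 3, 4]))  # → [0, 0, 1, 1]
```

With MLBlocks itself, the equivalent workflow is to build an MLPipeline from a list of primitive names (optionally with hyperparameters) and call its fit and predict methods.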
In its first iteration, in 2015, MLBlocks was designed only for multi-table, multi-entity temporal data. A good reference for our design rationale at that time is Bryan Collazo's thesis, written under the supervision of Kalyan Veeramachaneni:
Machine learning blocks. Bryan Collazo. Master's thesis, MIT EECS, 2015.
In 2018, with the recent availability of a multitude of libraries and tools, we decided it was time to integrate them and expand the library to address other data types, such as images, text, graphs, and time series, as well as to introduce the usage of deep learning libraries. A second iteration of our work was then started by William Xue:
A Flexible Framework for Composing End to End Machine Learning Pipelines. William Xue. Master's thesis, MIT EECS, 2018.
Later in 2018, Carles Sala joined the project to help it grow into a reliable open-source library that would become part of a larger software ecosystem designed to facilitate the development of robust end-to-end solutions based on Machine Learning tools. This third iteration of our work was presented in 2019 as part of the Machine Learning Bazaar:
The Machine Learning Bazaar: Harnessing the ML Ecosystem for Effective System Development. Micah J. Smith, Carles Sala, James Max Kanter, and Kalyan Veeramachaneni. SIGMOD 2020.