atm package¶
Submodules¶
Module contents¶
Auto Tune Models. A multi-user, multi-data AutoML framework.
Classes

ATM
Model – This class contains everything needed to run an end-to-end ATM classifier pipeline.
class atm.ATM(dialect='sqlite', database='atm.db', username=None, password=None, host=None, port=None, query=None, access_key=None, secret_key=None, s3_bucket=None, s3_folder=None, models_dir='models', metrics_dir='metrics', verbose_metrics=False)[source]¶

Bases: object
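A minimal instantiation sketch. The default call connects to a local SQLite ModelHub file named atm.db; the explicit paths in the second call are illustrative assumptions, not required values:

```python
from atm import ATM

# Default configuration: a SQLite ModelHub database called 'atm.db'
# in the current working directory, with models and metrics stored
# in the 'models' and 'metrics' directories.
atm = ATM()

# The storage locations can be overridden (these paths are examples).
atm_custom = ATM(
    database='path/to/atm.db',
    models_dir='path/to/models',
    metrics_dir='path/to/metrics',
)
```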
Methods

add_datarun(dataset_id[, budget, …]) – Register one or more Dataruns to the Database.
add_dataset(train_path[, test_path, name, …]) – Add a new dataset to the Database.
load_model(classifier_id) – Load a Model from the Database.
run(train_path[, test_path, name, …]) – Create a Dataset and a Datarun and then work on it.
work([datarun_ids, save_files, …]) – Get unfinished Dataruns from the database and work on them.
add_datarun(dataset_id, budget=100, budget_type='classifier', gridding=0, k_window=3, metric='f1', methods=['logreg', 'dt', 'knn'], r_minimum=2, run_per_partition=False, score_target='cv', priority=1, selector='uniform', tuner='uniform', deadline=None)[source]¶

Register one or more Dataruns to the Database.

The hyperparameters of the given methods will be analyzed and Hyperpartitions generated from them. If run_per_partition is True, one Datarun will be created for each Hyperpartition. Otherwise, a single Datarun will be created for all of them.

- Parameters
dataset_id (int) – Id of the Dataset which this Datarun will belong to.
budget (int) – Budget amount. Optional. Defaults to 100.
budget_type (str) – Budget type. Can be 'classifier' or 'walltime'. Optional. Defaults to 'classifier'.
gridding (int) – gridding setting for the Tuner. Optional. Defaults to 0.
k_window (int) – k setting for the Selector. Optional. Defaults to 3.
metric (str) – Metric to use for the tuning and selection. Optional. Defaults to 'f1'.
methods (list) – List of methods to try. Optional. Defaults to ['logreg', 'dt', 'knn'].
r_minimum (int) – r_minimum setting for the Tuner. Optional. Defaults to 2.
run_per_partition (bool) – Whether to create a separate Datarun for each Hyperpartition or not. Optional. Defaults to False.
score_target (str) – Which score to use for the tuning and selection process. It can be 'cv' or 'test'. Optional. Defaults to 'cv'.
priority (int) – Priority of this Datarun. The higher the better. Optional. Defaults to 1.
selector (str) – Type of selector to use. Optional. Defaults to 'uniform'.
tuner (str) – Type of tuner to use. Optional. Defaults to 'uniform'.
deadline (str) – Time deadline. It must be a string representing a datetime in the format '%Y-%m-%d %H:%M'. If given, budget_type will be set to 'walltime'.
- Returns
The created Datarun or list of Dataruns.
- Return type
Datarun
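A usage sketch for add_datarun, assuming a Dataset was previously registered with add_dataset; the dataset id of 1 and the reduced budget are illustrative assumptions:

```python
from atm import ATM

atm = ATM(database='atm.db')

# Register a Datarun for the Dataset with id 1 that tries logistic
# regression and decision trees, with a budget of 50 classifiers.
datarun = atm.add_datarun(
    dataset_id=1,
    budget=50,
    budget_type='classifier',
    methods=['logreg', 'dt'],
    metric='f1',
)
```

The Datarun is only registered here; it is processed later by work().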
add_dataset(train_path, test_path=None, name=None, description=None, class_column=None)[source]¶

Add a new dataset to the Database.

- Parameters
train_path (str) – Path to the training CSV file. It can be a local filesystem path, absolute or relative, an HTTP or HTTPS URL, or an S3 path in the format s3://{bucket_name}/{key}. Required.
test_path (str) – Path to the testing CSV file. It can be a local filesystem path, absolute or relative, an HTTP or HTTPS URL, or an S3 path in the format s3://{bucket_name}/{key}. Optional. If not given, the training CSV will be split in two parts, train and test.
name (str) – Name given to this dataset. Optional. If not given, a hash will be generated from the train_path and used as the Dataset name.
description (str) – Human-friendly description of the Dataset. Optional.
class_column (str) – Name of the column that will be used as the target variable. Optional. Defaults to 'class'.
- Returns
The created dataset.
- Return type
Dataset
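A usage sketch for add_dataset. The CSV path and dataset name are illustrative, and the last line assumes the returned Dataset exposes its database id as dataset.id:

```python
from atm import ATM

atm = ATM(database='atm.db')

# Register a local training CSV. With no test_path given, ATM will
# split the file into train and test parts itself.
dataset = atm.add_dataset(
    train_path='path/to/train.csv',
    name='my_dataset',
    class_column='class',
)

# The new Dataset can then be referenced when registering Dataruns.
datarun = atm.add_datarun(dataset_id=dataset.id)
```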
load_model(classifier_id)[source]¶

Load a Model from the Database.
- Parameters
classifier_id (int) – Id of the Model to load.
- Returns
The loaded model instance.
- Return type
Model
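A usage sketch for load_model, assuming a classifier with id 42 has already been trained and stored in the ModelHub; the id and the CSV path are illustrative assumptions:

```python
import pandas as pd

from atm import ATM

atm = ATM(database='atm.db')

# Load the fitted Model associated with classifier 42.
model = atm.load_model(42)

# Use it like any fitted estimator on new data.
new_data = pd.read_csv('path/to/new_data.csv')
predictions = model.predict(new_data)
```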
run(train_path, test_path=None, name=None, description=None, class_column='class', budget=100, budget_type='classifier', gridding=0, k_window=3, metric='f1', methods=['logreg', 'dt', 'knn'], r_minimum=2, run_per_partition=False, score_target='cv', selector='uniform', tuner='uniform', deadline=None, priority=1, save_files=True, choose_randomly=True, cloud_mode=False, total_time=None, verbose=True)[source]¶

Create a Dataset and a Datarun and then work on it.
- Parameters
train_path (str) – Path to the training CSV file. It can be a local filesystem path, absolute or relative, an HTTP or HTTPS URL, or an S3 path in the format s3://{bucket_name}/{key}. Required.
test_path (str) – Path to the testing CSV file. It can be a local filesystem path, absolute or relative, an HTTP or HTTPS URL, or an S3 path in the format s3://{bucket_name}/{key}. Optional. If not given, the training CSV will be split in two parts, train and test.
name (str) – Name given to this dataset. Optional. If not given, a hash will be generated from the train_path and used as the Dataset name.
description (str) – Human-friendly description of the Dataset. Optional.
class_column (str) – Name of the column that will be used as the target variable. Optional. Defaults to 'class'.
budget (int) – Budget amount. Optional. Defaults to 100.
budget_type (str) – Budget type. Can be 'classifier' or 'walltime'. Optional. Defaults to 'classifier'.
gridding (int) – gridding setting for the Tuner. Optional. Defaults to 0.
k_window (int) – k setting for the Selector. Optional. Defaults to 3.
metric (str) – Metric to use for the tuning and selection. Optional. Defaults to 'f1'.
methods (list) – List of methods to try. Optional. Defaults to ['logreg', 'dt', 'knn'].
r_minimum (int) – r_minimum setting for the Tuner. Optional. Defaults to 2.
run_per_partition (bool) – Whether to create a separate Datarun for each Hyperpartition or not. Optional. Defaults to False.
score_target (str) – Which score to use for the tuning and selection process. It can be 'cv' or 'test'. Optional. Defaults to 'cv'.
priority (int) – Priority of this Datarun. The higher the better. Optional. Defaults to 1.
selector (str) – Type of selector to use. Optional. Defaults to 'uniform'.
tuner (str) – Type of tuner to use. Optional. Defaults to 'uniform'.
deadline (str) – Time deadline. It must be a string representing a datetime in the format '%Y-%m-%d %H:%M'. If given, budget_type will be set to 'walltime'.
verbose (bool) – Whether to be verbose about the process. Optional. Defaults to True.
- Returns
The created Datarun or list of Dataruns.
- Return type
Datarun
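An end-to-end sketch for run; the CSV path and the small budget are illustrative assumptions. The returned Datarun is the database record created for this run:

```python
from atm import ATM

atm = ATM(database='atm.db')

# Create a Dataset and a Datarun from a local CSV and work on it until
# 20 classifiers have been trained and scored.
datarun = atm.run(
    train_path='path/to/train.csv',
    class_column='class',
    budget=20,
    methods=['logreg', 'dt', 'knn'],
)
```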
work(datarun_ids=None, save_files=True, choose_randomly=True, cloud_mode=False, total_time=None, wait=True, verbose=False)[source]¶

Get unfinished Dataruns from the database and work on them.

Check the ModelHub Database for unfinished Dataruns and work on them as they are added. This process will continue to run until it exceeds total_time, until there are no more Dataruns to process, or until it is killed.
- Parameters
datarun_ids (list) – List of IDs of Dataruns to work on. If None, this will work on any unfinished Dataruns found in the database. Optional. Defaults to None.
save_files (bool) – Whether to save the fitted classifiers and their metrics or not. Optional. Defaults to True.
choose_randomly (bool) – If True, work on all the highest-priority dataruns in random order. Otherwise, work on them in sequential order (by ID). Optional. Defaults to True.
cloud_mode (bool) – Save the models and metrics in AWS S3 instead of locally. This option works only if S3 configuration has been provided on initialization. Optional. Defaults to False.
total_time (int) – Total time to run the work process, in seconds. If None, continue to run until interrupted or there are no more Dataruns to process. Optional. Defaults to None.
wait (bool) – If True, wait for more Dataruns to be inserted into the Database once all have been processed. Otherwise, exit the worker loop when they run out. Optional. Defaults to True.
verbose (bool) – Whether to be verbose about the process. Optional. Defaults to False.
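A worker sketch for work, assuming Dataruns have already been registered (for example with add_datarun); the datarun id and the one-hour limit are illustrative:

```python
from atm import ATM

atm = ATM(database='atm.db')

# Process the pending Datarun with id 1 for at most one hour, saving the
# fitted classifiers and their metrics, and exit when nothing is left.
atm.work(
    datarun_ids=[1],
    save_files=True,
    total_time=3600,
    wait=False,
)
```

Because the Dataruns are read from the shared ModelHub Database, several such workers can be pointed at the same database to process them in parallel.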
class atm.Model(method, params, judgment_metric, class_column, testing_ratio=0.3, verbose_metrics=False)[source]¶

Bases: object
This class contains everything needed to run an end-to-end ATM classifier pipeline. It is initialized with a set of parameters and trained like a normal sklearn model. This class can be pickled and saved to disk, then unpickled outside of ATM and used to classify new datasets.
Attributes

ATM_KEYS (list)
MINMAX (str)
N_FOLDS (int)
PCA (str)
PCA_DIMS (str)
SCALE (str)
WHITEN (str)
Methods

load(path) – Loads a saved Model instance from a path.
predict(data) – Generate predictions from new data.
save(path[, force]) – Save this Model using pickle.
train_test(dataset) – Train and test this model using Cross Validation and Holdout.
ATM_KEYS = ['_scale', '_whiten', '_scale_minmax', '_pca', '_pca_dimensions']¶

MINMAX = '_scale_minmax'¶

N_FOLDS = 5¶

PCA = '_pca'¶

PCA_DIMS = '_pca_dimensions'¶

SCALE = '_scale'¶

WHITEN = '_whiten'¶
classmethod load(path)[source]¶

Loads a saved Model instance from a path.
- Parameters
path (str) – Path where the model is saved.
- Returns
New model instance.
- Return type
Model
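A sketch of the pickle round trip described in the class docstring; the paths are illustrative assumptions, and save(path) is the counterpart listed in the Methods table above:

```python
import pandas as pd

from atm import Model

# Load a Model that was previously persisted with save(), for example by
# an ATM worker running with save_files=True.
model = Model.load('path/to/model.pkl')

# Once unpickled, the Model can classify new data outside of ATM.
new_data = pd.read_csv('path/to/new_data.csv')
predictions = model.predict(new_data)
```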
predict(data)[source]¶

Generate predictions from new data.
- Parameters
data (pandas.DataFrame) – Data for which to predict classes.
- Returns
Vector of predictions.
- Return type
pandas.Series