utils module

class utils.Distribution(column=None, summary=None, categorical=False)

Bases: object

estimate_args(data)
fix_args(args)
Fixes the args so that they are valid for the distribution this
object is supposed to represent.
Raises: Exception – if the distribution is unsupported
get_cdf(args)
get_ppf(args)
get_summary()

Returns all the data necessary to recreate this object later.

set_args(args)
class utils.NonVariable(column, is_key=False, regex=None)

Bases: object

generate_new()

Create a new value that could belong in the column using the regex.

exception utils.SDVException

Bases: exceptions.Exception

utils.add_noise(cov)
Add noise to the covariance matrix by dividing all of
the off-diagonal elements by 2.0. This makes the variables less dependent on each other.
Parameters: cov (array) – the covariance matrix
Returns: ndarray
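
The halving step described above can be sketched as follows. This is a minimal illustration based only on the description, not the module's actual source; the name add_noise_sketch is hypothetical and NumPy is assumed to be available.

```python
import numpy as np

def add_noise_sketch(cov):
    """Hypothetical helper: halve every off-diagonal entry of a square matrix."""
    cov = np.asarray(cov, dtype=float)
    noisy = cov / 2.0
    # Put the original variances back on the diagonal so only the
    # covariances (off-diagonal entries) are weakened.
    np.fill_diagonal(noisy, np.diag(cov))
    return noisy

cov = np.array([[4.0, 2.0],
                [2.0, 9.0]])
noisy = add_noise_sketch(cov)
```

The input matrix is left unmodified; whether the real function copies or edits in place is not stated in this reference.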
utils.generate_samples(covariance, ppfs, N, means=None)

Use a Gaussian copula, along with the given quantile functions (ppfs), to generate N samples whose elements are appropriately correlated.
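
A rough sketch of Gaussian copula sampling is shown below, under stated assumptions: the function name generate_samples_sketch, the seed parameter, and the clipping constant are hypothetical, the means argument is ignored (zero means are assumed), and each entry of ppfs is assumed to be a plain callable taking a probability in (0, 1). It illustrates the general technique rather than this module's implementation.

```python
import math
from statistics import NormalDist

import numpy as np

def generate_samples_sketch(covariance, ppfs, n, seed=None):
    """Hypothetical helper: draw n correlated rows via a Gaussian copula."""
    rng = np.random.default_rng(seed)
    covariance = np.asarray(covariance, dtype=float)
    # 1. Draw correlated, zero-mean normal vectors.
    z = rng.multivariate_normal(np.zeros(len(ppfs)), covariance, size=n)
    # 2. Rescale each column to unit variance so the standard normal CDF applies.
    z = z / np.sqrt(np.diag(covariance))
    # 3. Map to uniforms with the standard normal CDF, keeping away from 0 and 1.
    u = np.vectorize(NormalDist().cdf)(z)
    u = np.clip(u, 1e-9, 1.0 - 1e-9)
    # 4. Push each uniform column through its own quantile function.
    columns = [np.vectorize(ppf)(u[:, j]) for j, ppf in enumerate(ppfs)]
    return np.column_stack(columns)

# Example marginals: uniform on [0, 10] and exponential with rate 1.
ppfs = [lambda p: 10.0 * p, lambda p: -math.log(1.0 - p)]
samples = generate_samples_sketch([[1.0, 0.8], [0.8, 1.0]], ppfs, 500, seed=0)
```

Each column keeps its own marginal distribution, while the dependence between columns comes from the covariance matrix of the underlying normal draw.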

utils.get_date_converter(col, missing, meta)
Returns a converter that takes in an integer representing milliseconds
and turns it into a date string.
Parameters:
  • col (str) – name of column
  • missing (bool) – true if column has NULL values
  • meta (str) – type of column values
Returns:

function

utils.get_ll(X, covariance, cdfs, check)
Given a vector X, a covariance matrix, and cdfs for each element,
return the log likelihood of X under that distribution.
Parameters:
  • X (ndarray) – a vector
  • covariance – a covariance matrix
  • cdfs (list<Distribution>) – the cdfs, one for each element of X
  • check (list<bool>) – each check var represents whether or not to check for noise
utils.get_many(ct, regex, unique_set=None)

Synthesizes many new values based on the regex.

Parameters:
  • ct (int) – length of the dataframe
  • regex (str) – regular expression that the generated values should match
Returns:

list

utils.get_normalize_fn(cdf, check=False)
Normalizing should be Phi^-1(F(x)), but because F(x) is sometimes 0 or 1,
we fudge the extremities a little.
Parameters:
  • cdf (function) – the cumulative distribution function F
  • check – whether or not to check for noise
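
The idea of Phi^-1(F(x)) with softened extremes can be sketched as follows. This is an illustration only: get_normalize_fn_sketch and the eps value are hypothetical, the check argument is left out because its exact behavior is not described here, and the cdf is assumed to be a callable returning a value in [0, 1].

```python
from statistics import NormalDist

def get_normalize_fn_sketch(cdf, eps=1e-6):
    """Hypothetical helper: build x -> Phi^-1(F(x)) with clipped extremes."""
    std_normal = NormalDist()

    def normalize(x):
        # Phi^-1 is undefined at exactly 0 and 1, so nudge F(x) inward.
        p = min(max(cdf(x), eps), 1.0 - eps)
        return std_normal.inv_cdf(p)

    return normalize

# Example: the CDF of a uniform distribution on [0, 10].
uniform_cdf = lambda x: min(max(x / 10.0, 0.0), 1.0)
normalize = get_normalize_fn_sketch(uniform_cdf)
middle = normalize(5.0)   # F(x) = 0.5
low = normalize(0.0)      # F(x) = 0, clipped to eps
high = normalize(10.0)    # F(x) = 1, clipped to 1 - eps
```

Without the clipping, the two boundary calls would fail, since the inverse normal CDF is only defined on the open interval (0, 1).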
utils.get_number_converter(col, missing, meta)
Returns a converter that takes in a value and turns it into an
integer, if necessary.
Parameters:
  • col (str) – name of column
  • missing (bool) – true if column has NULL values
  • meta (str) – type of column values
Returns:

function

utils.make_covariance_matrix(dim, triu_vals)
Make a symmetric covariance matrix of shape (dim x dim) given an array of values that belong to the upper triangle. For example, if dim=3 and triu_vals=[1, 2, 3, 4, 5, 6] then the covariance is:

[[1, 2, 3],
 [2, 4, 5],
 [3, 5, 6]]
Parameters:
  • dim (int) – matrix has shape (dim x dim)
  • triu_vals (list<int>) – list of vals to make the covariance matrix from (see summary)
Returns:

ndarray – symmetric covariance matrix
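
One way to build such a matrix is sketched below, assuming NumPy and assuming the upper-triangle values are listed row by row, as in the example above. The name make_covariance_matrix_sketch is hypothetical and this is not necessarily how the module does it.

```python
import numpy as np

def make_covariance_matrix_sketch(dim, triu_vals):
    """Hypothetical helper: fill a symmetric matrix from upper-triangle values."""
    cov = np.zeros((dim, dim))
    # Row-major indices of the upper triangle, diagonal included.
    rows, cols = np.triu_indices(dim)
    cov[rows, cols] = triu_vals
    # Mirror the same values into the lower triangle.
    cov[cols, rows] = triu_vals
    return cov

matrix = make_covariance_matrix_sketch(3, [1, 2, 3, 4, 5, 6])
```

For dim=3 the upper triangle holds 3 * 4 / 2 = 6 values, which is why triu_vals has six entries in the example.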

utils.update(obs, obs_indices, covariance, cdfs, obs_care)

Perform inference to update the covariance and the means based on the observed values. Returns the updated (normalized) mean and covariance matrix.

Parameters:
  • obs (list) – the observations made on original table
  • obs_indices – the indices of those observations (column indices)
  • covariance (ndarray) – the full covariance matrix of table
  • cdfs (list<Distribution>) – list of all the cdfs for each column in the table
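
The inference step can be illustrated with the standard conditioning formulas for a zero-mean multivariate normal. The sketch below is an assumption about the general technique, not the module's code: condition_gaussian_sketch is a hypothetical name, the observations are assumed to be already normalized (so cdfs and obs_care are not used), and NumPy is assumed to be available.

```python
import numpy as np

def condition_gaussian_sketch(z_obs, obs_indices, covariance):
    """Hypothetical helper: condition a zero-mean Gaussian on observed entries."""
    covariance = np.asarray(covariance, dtype=float)
    obs = np.asarray(obs_indices)
    rest = np.setdiff1d(np.arange(covariance.shape[0]), obs)
    s_oo = covariance[np.ix_(obs, obs)]     # observed vs. observed
    s_ro = covariance[np.ix_(rest, obs)]    # unobserved vs. observed
    s_rr = covariance[np.ix_(rest, rest)]   # unobserved vs. unobserved
    # gain = s_ro @ inv(s_oo), computed with a solve instead of an explicit inverse.
    gain = np.linalg.solve(s_oo, s_ro.T).T
    mean = gain @ np.asarray(z_obs, dtype=float)
    cov = s_rr - gain @ s_ro.T
    return mean, cov

# Two columns with covariance 0.5; column 0 is observed at z = 2.0.
mean, cov = condition_gaussian_sketch([2.0], [0], [[1.0, 0.5], [0.5, 1.0]])
```

In this example the unobserved column's mean shifts toward the observation and its variance shrinks, which is the expected effect of conditioning on a correlated value.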