utils module
-
class utils.Distribution(column=None, summary=None, categorical=False)
Bases: object
-
estimate_args(data)
-
fix_args(args)
Fixes the args so that they are valid for the distribution this object is supposed to represent.
Raises: Exception – if the distribution is unsupported.
-
get_cdf(args)
-
get_ppf(args)
-
get_summary()
Returns all the data necessary to recreate this object later.
-
set_args(args)
-
-
class utils.NonVariable(column, is_key=False, regex=None)
Bases: object
-
generate_new()
Create a new value that could belong in the column using the regex.
-
-
exception utils.SDVException
Bases: exceptions.Exception
-
utils.add_noise(cov)
Add noise to the covariance matrix by dividing all of the off-diagonal elements by 2.0. This makes the variables less dependent on each other.
Parameters: cov (array) – the covariance matrix
Returns: ndarray
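The damping step described for add_noise can be sketched as follows. This is a minimal illustration, not the module's own code; the name `damp_off_diagonal` is hypothetical, and it assumes the input is a square NumPy-compatible array.

```python
import numpy as np


def damp_off_diagonal(cov):
    """Halve every off-diagonal entry and leave the variances untouched."""
    cov = np.asarray(cov, dtype=float)
    damped = cov / 2.0                      # new array, every entry halved
    np.fill_diagonal(damped, np.diag(cov))  # restore the original diagonal
    return damped


cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])
damped = damp_off_diagonal(cov)
# damped is [[1.0, 0.4], [0.4, 1.0]]
```

Because the diagonal is preserved while the off-diagonal entries shrink, the implied correlations are halved as well.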
-
utils.generate_samples(covariance, ppfs, N, means=None)
Use a Gaussian Copula along with the given quantile functions to generate N samples whose elements are appropriately correlated.
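As a rough illustration of Gaussian copula sampling, the sketch below draws correlated normal vectors, maps each coordinate to a uniform value through the standard normal CDF, and then applies one quantile function per column. The function name, the `seed` argument, and the two example quantile functions are assumptions made for the example; the `means` argument of generate_samples is not modelled here.

```python
import math

import numpy as np


def std_normal_cdf(z):
    """Standard normal CDF, written with math.erf to avoid extra dependencies."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def copula_samples_sketch(covariance, ppfs, n, seed=0):
    """Draw n rows whose columns follow ppfs and stay correlated."""
    covariance = np.asarray(covariance, dtype=float)
    rng = np.random.default_rng(seed)
    # Correlated draws in the normalized (Gaussian) space.
    z = rng.multivariate_normal(np.zeros(len(ppfs)), covariance, size=n)
    # Rescale each column to unit variance so the CDF yields uniform values.
    z = z / np.sqrt(np.diag(covariance))
    u = np.vectorize(std_normal_cdf)(z)
    # Apply each column's quantile function (inverse CDF).
    return np.column_stack([ppf(u[:, i]) for i, ppf in enumerate(ppfs)])


# Hypothetical marginals: uniform on [0, 10] and exponential with rate 1.
ppfs = [lambda u: 10.0 * u, lambda u: -np.log1p(-u)]
samples = copula_samples_sketch([[1.0, 0.8], [0.8, 1.0]], ppfs, 500)
```

Each column keeps its own marginal distribution, while the dependence between columns comes entirely from the covariance matrix.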
-
utils.get_date_converter(col, missing, meta)
Returns a converter that takes in an integer representing ms and turns it into a string date.
Parameters:
- col (str) – name of column
- missing (bool) – true if column has NULL values
- meta (str) – type of column values
Returns: function
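One way such a converter could be built is as a closure, sketched below. Several details are assumptions rather than documented behaviour: that the integer is milliseconds since the Unix epoch, that the converter receives a dict-like row, and that a hypothetical `fmt` string controls the output format.

```python
from datetime import datetime, timezone


def make_date_converter_sketch(col, missing, fmt="%Y-%m-%d"):
    """Return a function that formats row[col] (epoch milliseconds) as a date string."""

    def convert(row):
        ms = row.get(col)
        if missing and ms is None:
            return None  # keep NULL values as they are
        moment = datetime.fromtimestamp(ms / 1000.0, tz=timezone.utc)
        return moment.strftime(fmt)

    return convert


to_date = make_date_converter_sketch("created", missing=True)
first_day = to_date({"created": 0})            # "1970-01-01"
second_day = to_date({"created": 86_400_000})  # "1970-01-02"
```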
-
utils.get_ll(X, covariance, cdfs, check)
Given a vector X, a covariance matrix, and cdfs for each element, return the log likelihood of X in that distribution.
Parameters:
- X (ndarray) – a vector
- covariance – a covariance matrix
- cdfs (list<Distribution>) – cdfs
- check (list<bool>) – each check var represents whether or not to check for noise
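The sketch below shows one plausible reading of this computation: each element is mapped into Gaussian space with Phi^-1(F(x)), and the zero-mean multivariate normal log density of the result is returned. It is an assumption-laden illustration, not the module's code: the cdfs are plain callables here rather than Distribution objects, the `eps` clipping value is invented for the example, and the `check` flags are ignored.

```python
import math
from statistics import NormalDist

import numpy as np


def log_likelihood_sketch(x, covariance, cdfs, eps=1e-6):
    """Zero-mean Gaussian log density of x after normalizing each element."""
    std_normal = NormalDist()
    # Clip F(x) away from 0 and 1 so the inverse CDF stays finite.
    u = [min(max(cdf(v), eps), 1.0 - eps) for v, cdf in zip(x, cdfs)]
    z = np.array([std_normal.inv_cdf(p) for p in u])

    covariance = np.asarray(covariance, dtype=float)
    _, logdet = np.linalg.slogdet(covariance)
    quad = z @ np.linalg.solve(covariance, z)  # z^T Sigma^-1 z
    return -0.5 * (len(z) * math.log(2.0 * math.pi) + logdet + quad)


# Two uniform(0, 1) columns evaluated at their medians, identity covariance.
uniform_cdf = lambda v: v
ll = log_likelihood_sketch([0.5, 0.5], np.eye(2), [uniform_cdf, uniform_cdf])
```

With both elements at their medians the normalized vector is zero, so the result reduces to -log(2π).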
-
utils.get_many(ct, regex, unique_set=None)
Synthesize many new values based on the regex.
Parameters:
- ct (int) – length of the dataframe
- regex (str) – type of column values
Returns: list
-
utils.get_normalize_fn(cdf, check=False)
Normalizing should be Phi^-1(F(x)), but because F(x) is sometimes 0 or 1, we fudge the extremities a little.
Parameters:
- cdf (bool) – cdf
- check – whether or not to check for noise
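A minimal sketch of the normalization, assuming the cdf is a plain callable and using a hypothetical `eps` margin to keep F(x) strictly inside (0, 1); the exact margin used by the module is not documented here, and the `check` flag is not modelled.

```python
from statistics import NormalDist


def make_normalize_fn(cdf, eps=1e-6):
    """Return x -> Phi^-1(F(x)), with F(x) clipped to [eps, 1 - eps]."""
    std_normal = NormalDist()

    def normalize(x):
        p = min(max(cdf(x), eps), 1.0 - eps)  # avoid Phi^-1(0) and Phi^-1(1)
        return std_normal.inv_cdf(p)

    return normalize


# Hypothetical marginal: uniform on [0, 1], with a CDF that saturates outside it.
uniform_cdf = lambda x: min(max(x, 0.0), 1.0)
normalize = make_normalize_fn(uniform_cdf)
middle = normalize(0.5)   # the median maps to 0.0
low = normalize(-5.0)     # F(x) = 0 is clipped, so the result stays finite
high = normalize(5.0)     # F(x) = 1 is clipped likewise
```

Without the clipping, values at the edge of the support would map to negative or positive infinity.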
-
utils.get_number_converter(col, missing, meta)
Returns a converter that takes in a value and turns it into an integer, if necessary.
Parameters:
- col (str) – name of column
- missing (bool) – true if column has NULL values
- meta (str) – type of column values
Returns: function
-
utils.make_covariance_matrix(dim, triu_vals)
Make a symmetric covariance matrix of shape (dim x dim) given an array of values that belong to the upper triangle. For example, if dim=3 and triu_vals=[1, 2, 3, 4, 5, 6] then the covariance is:
[[1, 2, 3]
 [2, 4, 5]
 [3, 5, 6]]
Parameters:
- dim (int) – matrix has shape (dim x dim)
- triu_vals (list<int>) – list of vals to make the covariance matrix from (see summary)
Returns: ndarray – symmetric covariance matrix
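The dim=3 example can be reproduced with NumPy's upper-triangle indices. This is a sketch under the assumption that triu_vals is ordered row by row across the upper triangle, as the example suggests; the function name is hypothetical.

```python
import numpy as np


def symmetric_from_triu(dim, triu_vals):
    """Fill the upper triangle row by row, then mirror it below the diagonal."""
    cov = np.zeros((dim, dim))
    rows, cols = np.triu_indices(dim)  # (0,0), (0,1), ..., (dim-1, dim-1)
    cov[rows, cols] = triu_vals
    cov[cols, rows] = triu_vals        # mirror; diagonal entries are rewritten unchanged
    return cov


matrix = symmetric_from_triu(3, [1, 2, 3, 4, 5, 6])
# matrix is [[1, 2, 3], [2, 4, 5], [3, 5, 6]]
```

A dim x dim symmetric matrix needs dim * (dim + 1) / 2 upper-triangle values, which is 6 for dim=3.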
-
utils.update(obs, obs_indices, covariance, cdfs, obs_care)
Perform inference to update the covariance and the means based on the observed values. Returns the updated (normalized) mean and covariance matrix.
Parameters:
- obs (list) – the observations made on original table
- obs_indices – the indices of those observations (column indices)
- covariance (ndarray) – the full covariance matrix of table
- cdfs (list<Distribution>) – list of all the cdfs for each column in the table
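The inference step can be pictured with the standard formulas for conditioning a zero-mean multivariate normal on some of its coordinates. The sketch below is that textbook computation, not necessarily what update does internally: it assumes the observations have already been normalized into Gaussian space, and it leaves out the cdfs and obs_care arguments.

```python
import numpy as np


def condition_gaussian_sketch(z_obs, obs_indices, covariance):
    """Mean and covariance of the unobserved coordinates given the observed ones."""
    covariance = np.asarray(covariance, dtype=float)
    obs = np.asarray(obs_indices)
    rest = np.setdiff1d(np.arange(covariance.shape[0]), obs)

    s_oo = covariance[np.ix_(obs, obs)]    # observed block
    s_ro = covariance[np.ix_(rest, obs)]   # cross block
    s_rr = covariance[np.ix_(rest, rest)]  # unobserved block

    gain = np.linalg.solve(s_oo, s_ro.T).T  # s_ro @ inv(s_oo); s_oo is symmetric
    mean = gain @ np.asarray(z_obs, dtype=float)
    cov = s_rr - gain @ s_ro.T
    return mean, cov


# Observe column 0 at z = 2.0 with correlation 0.5 between the two columns.
mean, cov = condition_gaussian_sketch([2.0], [0], [[1.0, 0.5], [0.5, 1.0]])
# mean is [1.0] and cov is [[0.75]]
```

Observing a correlated column shifts the mean of the remaining columns and shrinks their variance.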