core package

Submodules

core.AutoML module

class core.AutoML.autogluon(endogenous_variable, save_path='twinstat_autogluon', num_trials=10, search_strategy='auto', eval_metric='rmse', problem_type='regression', validation_frac=0.1, preload=True)

Bases: object

Enables low effort utilization of autogluon for echelon model development and hyperparameter tuning. Specifically setup for only tabular data.

See https://auto.gluon.ai/dev/index.html for more options and customization.

Parameters:

endogenous_variable (str) – Variable name of the output variable.
save_path (str, optional) – File directory path where the Autogluon files will be saved. The default is ‘twinstat_autogluon’.
num_trials (int, optional) – Number of hyperparameter tuning iterations. The default is 10.
search_strategy (str, optional) – Hyperparameter tuning method. The default is ‘auto’.
eval_metric (str, optional) – Metric used . The default is ‘rmse’.
problem_type (str, optional) – The default is ‘regression’.
validation_frac (float, optional) – Fraction of data held in the validation test. The default is 0.1.
preload (bool, optional) – At inference time, autogluon can be quite slow when making predictions. This option will preload all of the models into RAM enabling significantly faster predictions, albeit at the cost of more RAM usage. The default is True.

Return type:

None.

train(data, num_gpus=0)

Training include Monte Carlo hyperparameter search for each model used in training.

Return type:

None

Parameters:

data (pandas.DataFrame) – Data includes both the model inputs and outputs
num_gpus (int, optional) – The default is 0.

Return type:

None

predict(data)

Return type:: Series

load_models()

Return type:: None

determine_variable_sensitivity()

Performs permutation importance sampling to determine model sensitivity to inputs.

A matplotlib figure is generated with the results.

Return type:: None
Return type:: None

core.LinearRegression module

class core.LinearRegression.LinearModel(backend='torch')

Bases: object

Base linear regression object.

Parameters:: backend (str, optional) – The backend solver can be “numpy” or “torch”. Pytorch provides GPU functionality if needed for large problems. The default is ‘torch’.
Return type:: None.

fit(X, y, beta_relevance=None)

Fit a linear regression to the data.

mse

Type:: mean squared error

betas

Type:: regression coefficients

betas_se

Type:: standard error of the betas

beta_pvalues

Type:: pvalue for significance of betas

Parameters:

X (np.array) – Input data.
y (np.array) – Output data
beta_relevance (list[float], optional) –
If set, the coefficient pvalues will be relative to the beta_relevance value.

E.g. if t_crit = (beta[0] - beta_relevance[0])/se

The default is None.

Return type:

None.

predict(Xnew, return_uncertainty=False, uncertainty='confidence')

Return type:

array

Parameters:

Xnew (np.array) –
return_uncertainty (bool, optional) – If true, return both the prediction and the uncertainty of the prediction. The default is False.
uncertainty (str, optional) – Can be “confidence” or “prediction” intervals. The default is ‘confidence’.

Returns:

prediction (np.array)
If return_uncertainty is True also return the SE of the prediction.
sigma (np.array)

class core.LinearRegression.Polynomial(poly_order, **kwargs)

Bases: LinearModel

Sets the parametric function form of the linear regression to be a polynomial.

Parameters:: poly_order (int) – Polynomial order.
Return type:: None.

class core.LinearRegression.PiecewisePolynomial(poly_order, n_knots=1, **kwargs)

Bases: LinearModel

Sets the parametric function form of the linear regression to be a polynomial. However, add the ability to create piecewise knot points that discontinours change slope. L0 continuity and not L1 continuity.

Parameters:

poly_order (int) – Polynomial order.
n_knots (int, optional) – The default is 1.

Return type:

None.

fit_knots(X, y)

Find the optimal location for the knots.

Return type:

None

Parameters:

X (np.array) –
y (np.array) –

Return type:

None

check_for_duplicates(thelist)

class core.LinearRegression.Exponential(**kwargs)

Bases: LinearModel

Sets the parametric function form of the linear regression to be exponential.

Return type:: None.

core.knn_models module

class core.knn_models.QuantileKNNRegressor(tau=0.5, **kwargs)

Bases: KNeighborsRegressor

Extends the Scikit KNN Regression to enable quantile regressions.

All of the existing functionality of scikit is supported.

Parameters:

tau (float, optional) – The percentile that is used when fitting the data. The default is 0.5.
**kwargs – Send any scikit specific options.

Return type:

None.

predict(X)

Return type:: array
Parameters:: X (np.array) – (n data x n features)
Returns:: y_pred – (n data x 1)
Return type:: np.array

class core.knn_models.OutlierKNNDetector(outlier_distance_threshold=None, outlier_percent_threshold=None, endog_idx=None, removal_iterations=10, **kwargs)

Bases: NearestNeighbors

Automatically find outliers and remove from the data set.

Method #1 —- If outlier_percent_threshold is set to a value between 0 and 1, the endog_idx is checked. If None, a KDTree is generated to determine the distance between all points. The outliers are flagged as the points above and below the outlier_percent_threshold threshold.

If endog_idx is not None, then a QuantileKNNRegressor is used to relate all of the data to the endog_idx variable. All data points above and below the regressed outlier_percent_threshold of QuantileKNNRegressor will be removed from the data set.

Method #2 —- If outlier_distance_threshold is provided a QuantileKNNRegressor will be used to regress the endog_idx on to the remaining data. The residuals of prediction are calculated, i.e. r = y - yhat.

All data points exhibiting residuals larger than outlier_distance_threshold will be flagged as outliers (i.e. they are not following the behavior of surrounding data points.) This process is carried out removal_iterations number of times.

Parameters:

outlier_distance_threshold (float, optional) – The default is None.
outlier_percent_threshold (float, optional) – The default is None.
endog_idx (int, optional) – Index of the response variable. The default is None.
removal_iterations (int, optional) – Small clusters of outliers with varying magnitudes may only flag the most extreme outlier. Repeating the method ensures surrounding outliers are also captured. The default is 10.
**kwargs (scikit arguments) –

Return type:

None.

remove_outliers(X, make_plot=True)

Return type:

array

Parameters:

X (np.array) –
make_plot (bool, optional) – The default is True.

Returns:

newX

Return type:

np.array

core.neural_network_base module

class core.neural_network_base.base_neural_network(AR=5, n_exog_variables=0, n_layers=1, hidden_units=64, activation='relu', lr=0.005, batch_size=32, validation_frac=0.1, epochs=10000, dropout_frac=0.0, optimize=False, include_endog=True, isfilter=False, scale_y=False, test_train_split_method='timeseries')

Bases: object

Base object used to define tensorflow objects. Other downstream TwinStat objects are built off of this object or users can create custom neural networks off of this object.

Parameters:

AR (int, optional) – Number of autoregressive lags to include in the BNN. The default is 5.
n_exog_variables (int, optional) – Number of exogenous predictors to generate a ANN model size. The default is 0.
n_layers (int, optional) – Depth of the ANN. The default is 1.
hidden_units (int, optional) – Width of the ANN. The default is 64.
activation (str, optional) – Activation function used after each dense layer. Accepts all tensorflow input. The default is “relu”.
lr (float, optional) – Optimizer learning rate. The default is 5e-3.
batch_size (int, optional) – SGD batch size. The default is 32.
validation_frac (float, optional) – Fraction of data to be used in the validation set. The default is 0.1.
epochs (int, optional) – SGD epochs. The default is 10000.
dropout_frac (float, optional) – Fraction of dropout ANN nodes. Only used in training. The default is 0.0.
optimize (bool, optional (not setup yet)) – If true, the NN will be optimized with a bayesian optimizer to find the best depth, width, learning rate, and activation function for the provided data. The default is False.
include_endog (bool, optional) – If true, the lagged endogenous variable will be included as a predictor. The default is True.
isfilter (bool, optional) –

If true, the current time endogenous variable will be
included as a predictor. Warning, this is useful when creating a filter, but otherwise the neural network will overfit by learning to simply pass the current time through the graph and to the output to achieve perfect accuracy. The default is False.

scale_ybool, optional
Scale the response variable prior to training and prediction. The default is False.

test_train_split_methodstr, optional
If ‘timeseries’ the split is a cut in the timeseries in which the first (1-validation_frac) is used for training and the last validation_frac is used for validation.

If ‘random’ the split is a randomly selected (1 - validation_frac) for training and the remaining validation_frac used for validation.

The default is ‘timeseries’.

Return type:

None.

train_test_split(y, X)

Timeseries train/test splitting uses the first (1-validation_frac) for the training set and the last validation_frac for the test set.

Unless the ‘test_train_split_method’ is set to random in which the (1-validation_frac) is randomly selected.

Return type:

array

Parameters:

data (np.array) –

Returns:

train_y (np.array)
train_X (np.array)
test_y (np.array)
test_X (np.array)

train(y, X=None, patience=500, weights=None)

Train the neural network.

Return type:

None

Parameters:

y (np.array) –
X (np.array, optional) –
patience – Early stoppage criteria. Number of iterations with no improvement in the validation set. The model with best validation score is kept.
int – Early stoppage criteria. Number of iterations with no improvement in the validation set. The model with best validation score is kept.
optional – Early stoppage criteria. Number of iterations with no improvement in the validation set. The model with best validation score is kept.

get_estimate(y=None, X=None)

Make a prediction with ANN.

Return type:

array

Parameters:

y (np.array) – Endogenous Variable
X (np.array) – Exogenous Variables

Returns:

predictions

Return type:

np.array

plot(save_fig=False)

Generates the following plots: :rtype: None

*a plot of the training and validation
loss functions over the number of epochs.

*Residual vs yhat and yhat vs y *Timeseries of y and the fitted line yhat

Parameters:: save_fig (bool, optional) – Save a jpg of the figure. The default is False.
Return type:: None.

save_model(filename)

save the model for later use

Return type:: None

load_model(filename)

load the model weights and config values

Return type:: None

core.optimization module

class core.optimization.GeneticAlgorithm(fitness_function, population_size, generations, bounds, record_method='csv', sql_info={}, mutation_rate=0.1, best_fraction=0.5, crossover_method='chromosome')

Bases: object

Create object for genetic algorithm global heuristic optimization.

Parameters:

fitness_function – Python exectuable function that accepts an array X and ouputs a fitness score. The algorithm seeks to minimize this function. To perform a maximization, users should output the negative score.
population_size (int) – Total sample size for individual chromosomes
generations (int) – Number of generations to refine the population
bounds (list[tuple]) –
The population will be prevented from going outside of these bounds. The location in the list must correspond to the input variables.

E.g. bounds = [(0,1)]
record_method (str, optional) – Only csv file record is setup. All data saved to ‘GA_generation_data.csv’. Eventually SQL support will be added. The default is ‘csv’.
mutation_rate (float, optional) – What fraction of the population will experience a random mutation. The default is 0.1.
best_fraction (float, optional) – The ‘best_fraction’ of the population will be kept and the rest removed from the population. The default is 0.5.
crossover_method (str, optional) –
chromosome | hillclimbing | switch.

’Chromosome’ uses a random switch point in the two parent vectors to create new offspring.

’Hilleclimbing’ uses the random weighted average to create offspring.

’Switch’ will use chromosome until the last 10 generations and then the method is switched to hillclimbing to fine tune the remainder.

The default is ‘chromosome’.

genetic_algorithm:: perform the optimization

create_images:: create a scatter plot for each generation to show the evolution of the population also creates a convergence plot after each generation to provide some understanding of how the optimization progress.

Return type:: None.

genetic_algorithm()

Perform the genetic algorithm optimization.

Return type:

float

Returns:

np.array – Best vector
float – Best fitness score

create_images(only_convergence=False)

Return type:

None

Parameters:

only_convergence (bool, optional) –

If false, a scatter plot is created for each generation of the optimization.

In additional the minimum fitness score is plotted for each generation.

If true, only the convergence figure is generated.

The default is False.

Return type:

None.

core.sensitivity_analysis module

core.sensitivity_analysis.shapely_sensitivity(inputs, outputs, df)

Determine the shapely sensitivities of each ‘output’ variable.

A TwinStat Gaussian Process is used to map the inputs to outputs and shapely sampling is performed on the Gaussian Process.

The package shap is used for the shap sampling. https://shap.readthedocs.io/en/latest/index.html

Parameters:

inputs (list[str]) – List of the names of inputs.
outputs (list[str]) – List of the names of outputs. The sensivitiy of the inputs to each output will be calculated and a new gaussian process made for each input/output combination.
data (pandas.DataFrame) –

Returns:

keys: output variables values: list of shapely values normalized to sum to 1.0

Return type:

dictionary

core.statistical_tests module

core.statistical_tests.distribution_difference_MC_test(X, Y, n_mixtures_X=50, n_mixtures_Y=50, gmm_type='standard', n_samples=500000)

Fit a Gaussian Mixture Model to the input data sets. Then use Monte Carlo integration to determine the volume of overlap in probablity space. Small overlap volumes indicate there is little evidence that Y was generated from the same distribution as X.

Large dimensional problems will require normalization of the results due to overall parameter space volume. Example: if we have X data for 50 IoT data streams, if a few points are taken from X and used as Y, due to the vastness of the hyper-volume when the overlap is integreated, the result is a small value. Therefore, when using this algorithm, we are looking for small values on the order of 1e-9, not standard statistical inferrencing values such as 0.01.

Return type:

GaussianMixture

Parameters:

X (np.array) – Data set 1 (n data X n features)
Y (np.array) – Data set 2 (n data X n features)
n_mixtures_X (int, optional) – If using a standard Gaussian Mixture Model, the number of clusters must be inputed. When using “bayesian”, this is the max clusters that could be used. The default is 50.
n_mixtures_Y (int, optional) – If using a standard Gaussian Mixture Model, the number of clusters must be inputed. When using “bayesian”, this is the max clusters that could be used. The default is 50.
gmm_type (str, optional) – “standard” or “bayesian”. The default is “standard”.
n_samples (int, optional) – Number of monte carlo samples to use in the integration. The default is 500000.

Returns:

The dictionary includes the pvalues for each clustered distribution, in which a large p-value means it is likely data from Y came from X and a low pvalue means there is little evidence that Y came from X.

Return type:

dict and GaussianMixture and GaussianMixture

core.statistical_tests.distribution_difference_hotelling_test(X, Y, alpha=0.05, n_mixtures=50, gmm_type='standard')

Fit a Gaussian Mixture Model to the input data sets. Determine the Hotelling T^2 and p-values to determine if two data sets are different.

Note this method is determining if the means of multivariate distributions are difference and accounting for the variance in both distributions. It will likely flag all IoT data has being different from the original X data if you looking at sub regions of the original sample set. Hence, this can be used for comparing overal distributions, not individual data from peripherial regions.

Return type:: GaussianMixture

Reference:: https://en.wikipedia.org/wiki/Hotelling%27s_T-squared_distribution

Parameters:

X (np.array) – Data set 1 (n data X n features)
Y (np.array) – Data set 2 (n data X n features)
alpha (float, optional) – Statistical significance level. The default is 0.05.
n_mixtures_Y (int, optional) – If using a standard Gaussian Mixture Model, the number of clusters must be inputed. When using “bayesian”, this is the max clusters that could be used. The default is 50.
gmm_type (str, optional) – “standard” or “bayesian”. The default is “standard”.

Returns:

The dictionary includes the pvalues for each clustered distribution, in which a large p-value means it is likely data from Y came from X and a low pvalue means there is little evidence that Y came from X.

Return type:

dict and GaussianMixture and GaussianMixture

core.statistical_tests.get_optimal_n_cluster(X, max_cluster=20, make_plot=False)

Use the silhouette score to find the optimal number of clusters in the data. This function leverages KMeans unsupervised clustering.

This method looks for 2 or more clusters

Return type:

int

Parameters:

X (np.array) – Data to look for clustering.
max_cluster (int, optional) – The default is 20.
make_plot (bool, optional) – Plot the silhouette score for all number of clusters. The default is False.

Returns:

Number of clusters with the maximum silhouette score.

Return type:

int

core.uncertainty_propagation module

core.uncertainty_propagation.uncertainty_propagation(evaluate_data, config, pce_degree=6, method='monte_carlo', sampling_method='latin_hypercube', seed=0)

chaospy provides utility functions for generating joint probablity sampling distributions.

The actual creation of a the PCE becomes exponentially slower with the number of variables and thus basic MC might be needed for higher dimension count.

Also, if the objective function runtime is fast, no need for PCE.

Function assumes the TwinModules function ‘setup_uncertainty_propagation_db’ was used to setup an RDS database for data storage.

Upon completion of the posterior distribution calculation, this function will automatically run the ‘shapely_sensitivity’ to determine which inputs have the greatest impact on the outputs.

Return type:

None

Parameters:

evaluate_data (function) – Python function that accepts a vector X
config (dict) –
Contains:
’num_samples’: int, number of sobel samples ‘sample_dist_input_#’: list, [str, float]

[sampling distribution, inputs into sampling distribution]

’result_#’: str, outputs of the evaluate_data function ‘secret_name’ : str, AWS secret name providing security credentials ‘region_name’ : str, AWS region of the RDS database ‘mysql_db_endpoint’ : str, location of SQL database such as AWS RDS endpoint

Example:
{ 'num_samples': int, number of sobel samples 'sample_dist_input_0': ["TruncNormal", 1e-3, 100, 0, 0.01], "sample_dist_input_1" : ["Uniform", 1e-3, 0.3], "result_0" : "deg", "result_1" : "ms", "result_2" : "LoadCellTension1_N_1", }
pce_degree (int, optional) – PCE polynomial order. The default is 6.
method (str, optional) –
Either ‘monte_carlo’ or ‘pce’.

Both methods will use use sobol sampling of the user provided sampling distributions.

’pce’polynomial chaos expansion requires appreciably less samples than
traditional monte carlo methods to determine the posterior distribution. However, the complexity of the mapping polynomial scales poorly. Depending on the user defined sampling distribution, pce may become prohibitively slow for higher dimensions, e.g. > ~20

’monte_carlo’ :

The default is ‘monte_carlo’.
sampling_method (str, optional) –
Sets how will the input distributions be sampled.

Using purely random sampling is typically not advisable due to the inefficiency associated with natural clustering.

A space filling method is recommended, but there are many options.

References discussing Sobol and Latin Hyper Cube:

Mainly discussing integration; concludes Sobol is best for integration:

https://arxiv.org/ftp/arxiv/papers/1505/1505.02350.pdf

Specifically discussing uncertainty propagation; concludes Latin Hyper Cube generally is best for uncertainty quantification:

https://www.osti.gov/servlets/purl/806696

For a sample size of 100,000 with 9 dimensions

Latin Hyper Cube is ~21.0% slower than brute Monte Carlo Sobol is ~438.0% slower than brute Monte Carlo

The default is ‘latin_hypercube’.
seed (int, optional) – The default is 0.

Return type:

None

core.util module

core.util.pdf_to_cdf(x, p)

Convert a pdf to cdf

Parameters:

x (np.array) – The random variable output.
p (np.array) – The probability density of x

Returns:

cdf – The cumulative probability

Return type:

np.array

core package

Submodules

core.AutoML module

core.LinearRegression module

core.knn_models module

core.neural_network_base module

core.optimization module

core.sensitivity_analysis module

core.statistical_tests module

core.uncertainty_propagation module

core.util module

Module contents