Title: Algorithms for Testing Models under Stress
Description: Traditional model evaluation metrics fail to capture model performance under less than ideal conditions. This package employs techniques to evaluate models "under stress". This includes testing models' extrapolation ability and testing accuracy on specific sub-samples of the overall model space. Details describing the stress-testing methods in this package are provided in Haycock (2023) <doi:10.26076/2am5-9f67>. The other primary contribution of this package is providing R users access to the 'Python' library 'PyCaret' <https://pycaret.org/> for quick and easy access to auto-tuned machine learning models.
Authors: Sam Haycock [aut, cre], Brennan Bean [aut], Utah State University [cph, fnd], Thermo Fisher Scientific Inc. [fnd]
Maintainer: Sam Haycock <[email protected]>
License: MIT + file LICENSE
Version: 0.2.0
Built: 2024-11-25 04:39:44 UTC
Source: https://github.com/cran/stressor
A subset of the housing data for 506 census tracts of Boston from the 1970 Census. The original data set can be found in the mlbench package.
data(boston)
A data.frame with 506 rows and 13 columns:
- corrected median value of owner-occupied homes in USD 1000's
- per capita crime rate by town
- proportion of residential land zoned for lots over 25,000 sq. ft.
- proportion of non-retail business acres per town
- nitric oxides concentration (parts per 10 million)
- average number of rooms per dwelling
- proportion of owner-occupied units built prior to 1940
- weighted distances to five Boston employment centres
- index of accessibility to radial highways
- full-value property-tax rate per USD 10,000
- pupil-teacher ratio by town
- Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- percentage of lower status of the population
Source: the mlbench package
Create groups for the data by separating it into 10-fold cross-validation, LOO cross-validation, or k-means groups.
create_groups(formula, data, n_folds = 10, k_mult = NULL, repl = FALSE, grouping_formula = NULL)
formula |
A formula object that specifies the model to be fit. |
data |
The data that will be separated into each group. |
n_folds |
An integer value defaulted to 10 for 10-fold cross-validation. If NULL, Leave-One-Out (LOO) cross-validation is used instead. |
k_mult |
When specified, this is passed on to cv_cluster to fit the data into k groups. |
repl |
A Boolean value defaulted to 'FALSE'; change to 'TRUE' when replicates need to be included in the same group. |
grouping_formula |
A formula object that specifies how the groups will be gathered. |
If 'k_mult' is specified as an integer, the formula object is used to determine the features specified by the user. These features are passed to the cv_cluster function, which takes a scaled matrix of features.
This function is called by the cv methods, as it forms the groups necessary to perform cross-validation. It can also be used directly to separate 'data' into groups for training and testing.
A vector with length equal to the number of rows of the data.frame passed to the data argument.
# data generation
lm_data <- data_gen_lm(1000)
# 10 Fold CV group
create_groups(Y ~ ., lm_data)
# Spatial CV
create_groups(Y ~ ., lm_data, n_folds = 10, k_mult = 5)
# LOO CV group
create_groups(Y ~ ., lm_data, n_folds = NULL)
Allows the user to create a stressor 'python' environment with 'PyCaret' installed. This function assumes that you have properly installed 'python'; we recommend version 3.8.10. Existing stressor environments are reused.
create_virtualenv(python = Sys.which("python"), delete_env = FALSE)
python |
Defaults to your installation of 'python'. We prefer version 3.8.10, installed from python.org. 'Anaconda' installations of 'python' are currently not supported. |
delete_env |
Boolean value to indicate if the environments need to be deleted. |
To install 'python', we recommend using 'python' version 3.8.10 from python.org. This is the same version recommended by 'PyCaret', as it is the most stable. Users have reported trouble using the 'Anaconda' distribution of 'python'.
For macOS and Linux users, note that in order to run this package, the 'LightGBM' 'python' package requires an additional compiler, 'cmake', and 'libomp' (the Open Multi-Processing library). See the troubleshooting section of the 'LightGBM' documentation.
A message indicating which environment is being used.
If 'python' is not being found properly, try setting 'RETICULATE_PYTHON' to a blank string. Also ensure that you do not have other 'python' objects in your environment.
Also note that in some instances a warning message may be displayed regarding which version of 'python' is being used.
create_virtualenv()
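If the wrong interpreter keeps being picked up, the note above suggests clearing 'RETICULATE_PYTHON'. A minimal sketch (the commented path is illustrative only, not a package default):

# clear any forced interpreter so reticulate can discover one itself
Sys.setenv(RETICULATE_PYTHON = "")
# or point explicitly at a python.org 3.8.10 installation (path is illustrative)
# Sys.setenv(RETICULATE_PYTHON = "/usr/local/bin/python3.8")
create_virtualenv()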
This is the core of cross-validation, both standard and using k-means groups. This generic is called through the class-specific methods listed below.
cv(object, data, n_folds = 10, k_mult = NULL, repl = FALSE, grouping_formula = NULL)

## S3 method for class 'lm'
cv(object, data, n_folds = 10, k_mult = NULL, repl = FALSE, grouping_formula = NULL)

## S3 method for class 'mlm_stressor'
cv(object, data, n_folds = 10, k_mult = NULL, repl = FALSE, grouping_formula = NULL)

## S3 method for class 'reg_asym'
cv(object, data, n_folds = 10, k_mult = NULL, repl = FALSE, grouping_formula = NULL)

## S3 method for class 'reg_sine'
cv(object, data, n_folds = 10, k_mult = NULL, repl = FALSE, grouping_formula = NULL)
object |
One of the four object classes that are accepted: mlm_stressor, reg_sine, reg_asym, or lm. |
data |
A data.frame object that contains all the entries to be cross-validated on. |
n_folds |
An integer value for the number of folds defaulted to 10. If NULL, it will run LOO cross-validation. |
k_mult |
Used to specify if k-means clustering is to be used, defaulted to NULL. |
repl |
A Boolean value defaulted to 'FALSE'; change to 'TRUE' when replicates need to be included in the same group. |
grouping_formula |
A formula object that specifies how the groups will be gathered. |
If the object is of class mlm_stressor, then a data.frame will be returned. Otherwise, a vector of the predictions will be returned.
cv(lm): Cross-Validation for lm
cv(mlm_stressor): Cross-Validation for mlm_stressor
cv(reg_asym): Cross-Validation for reg_asym
cv(reg_sine): Cross-Validation for reg_sine
# lm example
lm_test <- data_gen_lm(20)
lm <- lm(Y ~ ., lm_test)
cv(lm, lm_test, n_folds = 2)
# mlm_stressor example
lm_test <- data_gen_lm(20)
create_virtualenv()
mlm_lm <- mlm_regressor(Y ~ ., lm_test)
cv(mlm_lm, lm_test, n_folds = 2)
# Asymptotic example
asym_data <- data_gen_asym(10)
asym_fit <- reg_asym(Y ~ ., asym_data)
cv(asym_fit, asym_data, n_folds = 2)
# Sine example
sine_data <- data_gen_sine(10)
sine_fit <- reg_sine(Y ~ ., sine_data)
cv(sine_fit, sine_data, n_folds = 2)
This function creates cluster-based partitions of a sample space based on k-means clustering. Included in the function are algorithms that attempt to produce clusters of roughly equal size.
cv_cluster(features, k, k_mult = 5, ...)
features |
A scaled matrix of features to be used in the clustering. Scaling is usually done with scale, and the matrix should not include the response variable. |
k |
The number of partitions for k-fold cross-validation. |
k_mult |
k*k_mult determines the number of subgroups that will be created as part of the balancing algorithm. |
... |
Additional arguments passed to kmeans as needed. |
More information regarding spatial cross-validation can be found in Robin Lovelace's textbook, Geocomputation with R.
An integer vector with length equal to the number of rows of features, containing the group index of each observation.
# Creating a matrix of predictor variables
x_data <- base::scale(data_gen_lm(30)[, -1])
groups <- cv_cluster(x_data, 5, k_mult = 5)
groups
This is the machinery that runs cross-validation. It subsets the test and training sets based on the groups it receives.
cv_core(object, data, t_groups, ...)
object |
Currently '"reg_sine", "reg_asym", "lm", "mlm_stressor"' objects are accepted. |
data |
A data.frame object containing the variables referenced by the formula used to fit the object. |
t_groups |
The groups for cross-validation: standard cross-validation, LOO cross-validation, or spatial cross-validation. |
... |
Additional arguments that are passed to the predict function. |
Either a vector of predictions (for '"reg_sine", "reg_asym", "lm"') or a data.frame (for '"mlm_stressor"').
# lm example
lm_test <- data_gen_lm(20)
lm <- lm(Y ~ ., lm_test)
cv(lm, lm_test, n_folds = 2)
# mlm_stressor example
lm_test <- data_gen_lm(20)
create_virtualenv()
mlm_lm <- mlm_regressor(Y ~ ., lm_test)
cv(mlm_lm, lm_test, n_folds = 2)
# Asymptotic example
asym_data <- data_gen_asym(10)
asym_fit <- reg_asym(Y ~ ., asym_data)
cv(asym_fit, asym_data, n_folds = 2)
# Sine example
sine_data <- data_gen_sine(10)
sine_fit <- reg_sine(Y ~ ., sine_data)
cv(sine_fit, sine_data, n_folds = 2)
Creates a synthetic data set for an additive asymptotic model. See the details section for clarification.
data_gen_asym(n, weight_mat = matrix(rlnorm(10), nrow = 2, ncol = 5),
  y_int = 0, resp_sd = 1, window = 1e-05, ...)
n |
The number of observations for each parameter. |
weight_mat |
The parameter coefficients, where each column represents the coefficients of one additive term; the matrix has two rows, as each additive equation contains two parameters. Defaulted to 10 random numbers from the log-normal distribution. The second row of the matrix needs to be positive. |
y_int |
The y-intercept term of the additive model. |
resp_sd |
The standard deviation of the epsilon term to be added for noise. |
window |
Used to determine, for any given X variable, how close the generated values get to the asymptote, so that the asymptotic behavior is captured. |
... |
Additional arguments that are not currently implemented. |
Observations are generated from the following model:

$$Y = \sum_{i=1}^{n} -a_i e^{-b_i X_i} + y_{int} + \epsilon$$

where $n$ is the number of parameters to be used, the $a_i$'s are the scaling parameters, the $b_i$'s are the weights associated with each $X_i$, and $y_{int}$ is where the curve crosses the y-axis.
A data.frame object with n rows, containing the response variable and one predictor column for each column of the weight matrix.
# Generates 10 observations
asym_data <- data_gen_asym(10)
asym_data
Creates a synthetic data set for an additive linear model. See details for clarification.
data_gen_lm(n, weight_vec = rep(1, 5), y_int = 0, resp_sd = 1, ...)
n |
The number of observations for each parameter. |
weight_vec |
The parameter coefficients, where each entry represents one coefficient of the additive linear model. |
y_int |
The y-intercept term of the additive model. |
resp_sd |
The standard deviation of the epsilon term to be added for noise. |
... |
Additional arguments that are not currently implemented. |
Observations are generated from the following model:

$$Y = \sum_{i=1}^{n} w_i X_i + y_{int} + \epsilon$$

where $n$ is the number of parameters to be used and the $w_i$'s are the weights associated with each $X_i$, with $y_{int}$ being where the line crosses the y-axis.
A data.frame object with n rows, containing the response variable and one predictor column for each entry of the weight vector.
# Generates 10 observations
lm_data <- data_gen_lm(10)
lm_data
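For intuition, a minimal sketch of the generative model reconstructed above; the standard-normal draw for the predictors is an assumption for illustration, not taken from the package source:

# hand-rolled version of Y = sum(w_i * X_i) + y_int + epsilon
n          <- 10
weight_vec <- rep(1, 5)
y_int      <- 0
resp_sd    <- 1
X <- matrix(rnorm(n * length(weight_vec)), nrow = n)  # assumed predictor draw
Y <- y_int + as.vector(X %*% weight_vec) + rnorm(n, sd = resp_sd)
head(data.frame(Y = Y, X))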
Creates a synthetic data set for an additive sinusoidal regression model. See the details section for clarification.
data_gen_sine(n, weight_mat = matrix(rnorm(15), nrow = 3, ncol = 5),
  y_int = 0, resp_sd = 1, ...)
n |
The number of observations for each parameter. |
weight_mat |
The parameter coefficients, where each column represents the coefficients of one additive term; the matrix has three rows, as each additive equation contains three parameters. Defaulted to 15 random numbers from the normal distribution. |
y_int |
The y-intercept term of the additive model. |
resp_sd |
The standard deviation of the epsilon term to be added for noise. |
... |
Additional arguments that are not currently implemented. |
Observations are generated from the following model:

$$Y = \sum_{i=1}^{n} a_i \sin(b_i (X_i - c_i)) + y_{int} + \epsilon$$

where $n$ is the number of parameters to be used, the $a_i$'s are the amplitudes of each sine wave, the $b_i$'s are the periods of each sine wave (and indirectly the weight on each $X_i$), and the $c_i$'s are the phase shifts associated with each sine wave, with $y_{int}$ being where the curve crosses the y-axis.
A data.frame object with n rows, containing the response variable and one predictor column for each column of the weight matrix.
# Generates 10 observations
sine_data <- data_gen_sine(10)
sine_data
Calculates the distance of each observation from the center of the matrix of predictor variables (the average of all x-dimensions) using Euclidean distance.
dist_cent(formula, data)
formula |
A formula object. |
data |
A data.frame object. |
Formula used to calculate the center point:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

where $\bar{x}$ is the vector giving the center of the x-dimensions, $n$ is the number of rows in the matrix, and $x_i$ is the $i$-th row of the matrix. The returned values are the Euclidean distances from each row to $\bar{x}$.
A vector of distances from the center.
data <- data_gen_lm(10)
dist <- dist_cent(Y ~ ., data)
dist
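Continuing the example above, the same distances can be computed by hand from the formula; this is a sketch for intuition, not the package implementation:

X      <- as.matrix(data[, -1])        # predictor columns only (response is column 1)
center <- colMeans(X)                  # x-bar, the mean of each x-dimension
sqrt(rowSums(sweep(X, 2, center)^2))   # Euclidean distance of each row to x-bar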
A function to calculate the Kappa of binary classification.
kappa_class(confusion_matrix)
confusion_matrix |
A matrix or table that is the confusion matrix. |
A numeric value representing the kappa value.
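A short usage sketch; assuming kappa_class implements the usual Cohen's kappa, the hand computation below should agree with it (the toy labels are illustrative):

# toy 2x2 confusion matrix: rows = predicted, columns = observed
cm <- table(predicted = c(1, 1, 0, 0, 1, 0, 1, 0),
            observed  = c(1, 0, 0, 0, 1, 1, 1, 0))
kappa_class(cm)
# Cohen's kappa by hand: (p_o - p_e) / (1 - p_e)
n   <- sum(cm)
p_o <- sum(diag(cm)) / n                     # observed agreement
p_e <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
(p_o - p_e) / (1 - p_e)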
Through the PyCaret module from 'python', this function fits many machine learning models simultaneously without requiring any 'python' programming on the part of the user. This function is specifically designed for the classification models fitted by 'PyCaret'.
mlm_classification(formula, train_data,
  fit_models = c("ada", "et", "lightgbm", "dummy", "lr", "rf", "ridge", "knn",
    "dt", "gbc", "svm", "lda", "nb", "qda"),
  sort_v = c("Accuracy", "AUC", "Recall", "Precision", "F1", "Kappa", "MCC"),
  n_models = 9999, seed = NULL, ...)
formula |
The classification formula, as a formula object. |
train_data |
A data.frame object that includes data to be trained on. |
fit_models |
A character vector of all the machine learning classifiers that are currently fit; the user may specify a subset of them. |
sort_v |
A character vector indicating what to sort the tuned models on. |
n_models |
An integer value defaulted to a large number so that all possible models are returned. |
seed |
An integer value to set the seed of the 'python' environment. Default value is set to 'NULL'. |
... |
Additional arguments passed on to mlm_init. |
'PyCaret' is a 'python' module where machine learning models can be fitted with little coding by the user. The pipeline that 'PyCaret' uses starts with a setup function that parameterizes the data so that it is easy for all the models to fit on. Then the compare models function is executed, which fits all the models that are currently available. This process takes less than five minutes for data.frame objects with fewer than 10,000 rows.
A list object where the first entry is the models fitted and the second is the initial predictive accuracy on the random test data. Returns as two classes '"mlm_stressor"' and '"classifier"'.
lm_test <- data_gen_lm(20)
binary_response <- sample(c(0, 1), 20, replace = TRUE)
lm_test$Y <- binary_response
mlm_class <- mlm_classification(Y ~ ., lm_test)
Through the 'PyCaret' module from 'python', this function fits many machine learning models simultaneously without requiring any 'python' programming on the part of the user. This is the core function and the backbone to fitting all the initial models.
mlm_init(formula, train_data, fit_models, sort_v = NULL, n_models = 9999,
  classification = FALSE, seed = NULL, ...)
formula |
The regression formula or classification formula. This formula should be linear. |
train_data |
A data.frame object that includes data to be trained on. |
fit_models |
A character vector of the machine learning models to fit; the user may specify a subset of them. If classification is set to 'TRUE', the classification models are used instead (see mlm_classification for the default classification models). |
sort_v |
A character vector indicating what to sort the tuned models on. Default value is 'NULL'. |
n_models |
An integer value, defaulted so that the maximum number of models is returned. |
classification |
A Boolean value tag to indicate if classification methods should be used. |
seed |
An integer value to set the seed of the python environment. Default value is set to 'NULL'. |
... |
Additional arguments passed to the setup function in 'PyCaret'. |
The formula should be linear. However, that does not imply a linear fit. The formula is a convenient way to separate the response variable from the explanatory variables.
'PyCaret' is a 'python' module where machine learning models can be fitted with little coding by the user. The pipeline that 'PyCaret' uses starts with a setup function that parameterizes the data so that it is easy for all the models to fit on. Then the compare models function is executed, which fits all the models that are currently available. This process takes less than five minutes for data.frame objects with fewer than 10,000 rows.
A list object that contains all the fitted models and the CV predictive accuracy, with a class attribute of '"mlm_stressor"'.
lm_test <- data_gen_lm(20)
create_virtualenv()
mlm_lm <- mlm_regressor(Y ~ ., lm_test)
Refits models fitted by mlm_init and returns the predictions.
mlm_refit(mlm_object, train_data, test_data, classification = FALSE)
mlm_object |
A '"mlm_stressor"' object. |
train_data |
A data.frame object used for refitting; it excludes the test data. Can be 'NULL' to allow predictions from the current model. |
test_data |
A data.frame object used for predictions. |
classification |
A Boolean value indicating whether classification methods should be used to refit the data. |
A matrix with the predictions of the various machine learning methods.
lm_train <- data_gen_lm(20)
train_idx <- sample.int(20, 5)
train <- lm_train[train_idx, ]
test <- lm_train[-train_idx, ]
create_virtualenv()
mlm_lm <- mlm_regressor(Y ~ ., lm_train)
mlm_refit(mlm_lm, train, test, classification = FALSE)
Through the 'PyCaret' module from 'python', this function fits many machine learning models simultaneously without requiring any 'python' programming on the part of the user. This function is specifically designed for the regression models.
mlm_regressor(formula, train_data,
  fit_models = c("ada", "et", "lightgbm", "gbr", "lr", "rf", "ridge", "knn",
    "dt", "dummy", "lar", "br", "huber", "omp", "lasso", "en", "llar", "par"),
  sort_v = c("MAE", "MSE", "RMSE", "R2", "RMSLE", "MAPE"),
  n_models = 9999, seed = NULL, ...)
formula |
A linear formula object. |
train_data |
A data.frame object that includes data to be trained on. |
fit_models |
A character vector of all the machine learning regressors that are currently fit; the user may specify a subset of them. |
sort_v |
A character vector indicating what to sort the tuned models on. |
n_models |
An integer value defaulted to a large number so that all possible models are returned. |
seed |
An integer value to set the seed of the 'python' environment. Default value is set to 'NULL'. |
... |
Additional arguments passed on to mlm_init. |
'PyCaret' is a 'python' module where machine learning models can be fitted with little coding by the user. The pipeline that 'PyCaret' uses starts with a setup function that parameterizes the data so that it is easy for all the models to fit on. Then the compare models function is executed, which fits all the models that are currently available. This process takes less than five minutes for data.frame objects with fewer than 10,000 rows.
A list object where the first entry is the models fitted and the second is the initial predictive accuracy on the random test data. Returns as two classes '"mlm_stressor"' and '"regressor"'.
lm_test <- data_gen_lm(20)
create_virtualenv()
mlm_lm <- mlm_regressor(Y ~ ., lm_test)
Predict values for 'mlm_stressor', 'reg_asym', or 'reg_sine' objects. This extends the generic predict function.
## S3 method for class 'mlm_stressor'
predict(object, newdata, train_data = NULL, ...)

## S3 method for class 'reg_asym'
predict(object, newdata, ...)

## S3 method for class 'reg_sine'
predict(object, newdata, ...)
object |
An 'mlm_stressor', 'reg_asym', or 'reg_sine' object. |
newdata |
A data.frame object that is the data to be predicted on. |
train_data |
A data.frame object defaulted to 'NULL'. This is only used when an 'mlm_stressor' object needs to be refitted. |
... |
Additional arguments to match the predict generic; in this case, they are ignored. |
A data.frame of predictions for an 'mlm_stressor' object, or a vector of predicted values otherwise.
# mlm_stressor example
lm_test <- data_gen_lm(20)
create_virtualenv()
mlm_lm <- mlm_regressor(Y ~ ., lm_test)
predict(mlm_lm, lm_test)
# Asymptotic Examples
asym_data <- data_gen_asym(10)
asym_fit <- reg_asym(Y ~ ., asym_data)
predict(asym_fit, asym_data)
# Sinusoidal Examples
sine_data <- data_gen_sine(10)
sine_fit <- reg_sine(Y ~ ., sine_data)
predict(sine_fit, sine_data)
A function that checks whether 'python' is available, allowing examples to run only when appropriate.
python_avail()
A Boolean value is returned.
python_avail()
A simple example of asymptotic regression in the form of $y = -a e^{-bx}$, where the full model is the sum of multiple of these exponential functions with a common intercept term.
reg_asym(formula, data, method = "BFGS", init_guess = rep(1, ncol(data) * 2 - 1), ...)
formula |
A formula object to describe the relationship. |
data |
The response and predictor variables. |
method |
The method that is passed to the optim function. By default, it is the BFGS method which uses a gradient. |
init_guess |
The initial parameter guesses for the optim function. By default, it is all ones. |
... |
Additional arguments passed to the optim function. |
A "reg_asym" object is returned, which contains the results from the optim function.
asym_data <- data_gen_asym(10)
reg_asym(Y ~ ., asym_data)
A simple example of sinusoidal regression in the form of $y = a \sin(b(x - c))$, where the full model is the sum of multiple of these sine functions with a common intercept term.
reg_sine(formula, data, method = "BFGS", init_guess = rep(1, ncol(data) * 3 - 2), ...)
formula |
A formula object to describe the relationship. |
data |
The response and predictor variables. |
method |
The method that is passed to the optim function. By default, it is the BFGS method which uses a gradient. |
init_guess |
The initial parameter guesses for the optim function. By default, it is all ones. |
... |
Additional arguments passed to the optim function. |
A "reg_sine" object is returned, which contains the results from the optim function.
sine_data <- data_gen_sine(10)
reg_sine(Y ~ ., sine_data)
A function to calculate the RMSE.
rmse(predicted, observed)
predicted |
A data.frame or vector object whose number of rows or length matches the length of the observed values. |
observed |
A vector of the observed results. |
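A short usage sketch; the hand computation shows the standard RMSE formula, which is what this function is documented to calculate:

obs  <- c(1.0, 2.0, 3.0)
pred <- c(1.1, 1.9, 3.2)
rmse(pred, obs)
# the same value by hand
sqrt(mean((pred - obs)^2))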
The score function takes the observed and predicted values and returns a vector or data.frame of the various metrics that are reported by 'PyCaret'. For regression, the following metrics are available: 'RMSE', 'MAE', 'MSE', 'R2', 'RMSLE', and 'MAPE'. For classification, the following metrics are available: 'Accuracy', 'AUC', 'Recall', 'Prec.', 'F1', 'MCC', and 'Kappa'.
score(observed, predicted, ...)
observed |
A vector of the observed results. |
predicted |
A data.frame or vector object whose number of rows or length matches the length of the observed values. |
... |
Arguments passed on to score_classification or score_regression, such as 'metrics'. |
A matrix with the various metrics reported.
lm_data <- data_gen_lm(100)
indices <- split_data_prob(lm_data, .2)
train <- lm_data[!indices, ]
test <- lm_data[indices, ]
model <- lm(Y ~ ., train)
pred_lm <- predict(model, test)
score(test$Y, pred_lm)
This function takes the observed and predicted values and computes metrics that are found in 'PyCaret' such as: 'Accuracy', 'AUC', 'Recall', 'Prec.', 'F1', 'MCC', and 'Kappa'.
score_classification(observed, predicted,
  metrics = c("Accuracy", "AUC", "Recall", "Prec.", "F1", "MCC", "Kappa"))
observed |
A vector of the observed results. |
predicted |
A data.frame or vector object whose number of rows or length matches the length of the observed values. |
metrics |
A character vector of the metrics to be computed. This is defaulted to the metrics from 'PyCaret'. |
A vector or data.frame of the methods and metrics.
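A brief usage sketch based on the signature above; the random labels are illustrative, and the metrics are restricted to ones computable from hard class labels:

set.seed(42)
obs  <- sample(c(0, 1), 50, replace = TRUE)
pred <- sample(c(0, 1), 50, replace = TRUE)
score_classification(obs, pred, metrics = c("Accuracy", "F1", "Kappa"))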
This function takes the observed and predicted values and computes metrics that are found in 'PyCaret' such as: 'RMSE', 'MAE', 'MSE', 'R2', 'RMSLE', and 'MAPE'.
score_regression(observed, predicted,
  metrics = c("RMSE", "MAE", "MSE", "R2", "RMSLE", "MAPE"))
observed |
A vector of the observed results. |
predicted |
A data.frame or vector object whose number of rows or length matches the length of the observed values. |
metrics |
A character vector of the metrics to be computed. This is defaulted to the metrics from 'PyCaret'. |
A vector or data.frame of the methods and metrics.
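A brief usage sketch based on the signature above, scoring an lm fit on held-out rows (the manual split is illustrative):

lm_data  <- data_gen_lm(100)
test_idx <- sample.int(100, 20)
fit      <- lm(Y ~ ., lm_data[-test_idx, ])
score_regression(lm_data$Y[test_idx], predict(fit, lm_data[test_idx, ]),
                 metrics = c("RMSE", "MAE", "R2"))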
This function takes in a data.frame object and the test proportion, and returns a logical vector flagging the entries to include in the test data.
split_data_prob(data, test_prop)
data |
A data.frame object used to determine the length of the vector. |
test_prop |
A numeric that is between zero and one that represents the proportion of observations to be included in the test data. |
A logical vector is returned with length equal to the number of rows of the data.
lm_data <- data_gen_lm(10)
indices <- split_data_prob(lm_data, .2)
train <- lm_data[!indices, ]
test <- lm_data[indices, ]
Fits the model at a sequence of training and test set sizes determined by 'max', 'min', and 'iter'.
thinning(model, data, max = 0.95, min = 0.05, iter = 0.05, classification = FALSE)
model |
A model of class "reg_sine", "reg_asym", "lm", or "mlm_stressor". |
data |
A data frame with all the data. |
max |
A numeric value in (0, 1] and greater than 'min', defaulted to .95. |
min |
A numeric value in (0, 1) and less than 'max', defaulted to .05. |
iter |
A numeric value to indicate the step size, defaulted to .05. |
classification |
A Boolean value defaulted to 'FALSE'; used for 'mlm_classification'. |
A list of objects, where the first element is the RMSE values at each iteration and the second element is the predictions.
lm_data <- data_gen_lm(1000)
lm_model <- lm(Y ~ ., lm_data)
thin_results <- thinning(lm_model, lm_data)
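Since the exact shapes of the two list elements are not specified here, str() is a safe first look at the RMSE values and predictions described in the Value section:

str(thin_results[[1]])  # RMSE values at each iteration
str(thin_results[[2]])  # the corresponding predictions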