Preprocess, tune, train, and test supervised learning models in a single call using nested resampling.
Usage
train(
  x,
  dat_validation = NULL,
  dat_test = NULL,
  weights = NULL,
  algorithm = NULL,
  preprocessor_config = NULL,
  hyperparameters = NULL,
  tuner_config = NULL,
  outer_resampling_config = NULL,
  execution_config = setup_ExecutionConfig(),
  question = NULL,
  outdir = NULL,
  verbosity = 1L,
  ...
)

Arguments
- x
tabular data, i.e. data.frame, data.table, or tbl_df (tibble): Training set data.
- dat_validation
tabular data: Validation set data.
- dat_test
tabular data: Test set data.
- weights
Optional vector of case weights.
- algorithm
Character: Algorithm to use. Can be left NULL if hyperparameters is defined.
- preprocessor_config
PreprocessorConfig object or NULL: Setup using setup_Preprocessor.
- hyperparameters
Hyperparameters object: Setup using one of the setup_* functions.
- tuner_config
TunerConfig object: Setup using setup_GridSearch.
- outer_resampling_config
ResamplerConfig object or NULL: Setup using setup_Resampler. This defines the outer resampling method, i.e. the splitting into training and test sets for the purpose of assessing model performance. If NULL, no outer resampling is performed, in which case you might want to use a dat_test dataset to assess model performance on a single test set.
- execution_config
ExecutionConfig object: Setup using setup_ExecutionConfig. This allows you to set the backend ("future", "mirai", or "none"), the number of workers, and the future plan if using backend = "future".
- question
Optional character string defining the question that the model is trying to answer.
- outdir
Character, optional: String defining the output directory.
- verbosity
Integer: Verbosity level.
- ...
Not used.
Value
Object of class Regression (Supervised), RegressionRes (SupervisedRes), Classification (Supervised), or ClassificationRes (SupervisedRes).
Details
Online book & documentation
See rdocs.rtemis.org/train for detailed documentation.
Preprocessing
Preprocessing can be applied at several stages of a supervised learning pipeline with nested resampling. Some operations are best performed before passing data to train():
- Duplicate rows should be removed before resampling, so that duplicates do not end up in different resamples, e.g. one copy in training and one in test.
- Constant columns should be removed before resampling. A column may appear constant within a small resample even if it is not constant in the full dataset, and removing it in some resamples but not others will cause errors during prediction.
- All data-dependent preprocessing steps (e.g. scaling, centering, imputation) must be learned on training data only and then applied to validation and test data.
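The first two cleanup steps above can be sketched in base R (this is illustrative code on a toy data frame, not part of the rtemis API):

```r
# Toy data frame with one duplicated row and one constant column.
dat <- data.frame(
  a = c(1, 2, 2, 3),
  b = c("x", "y", "y", "z"),
  const = 1
)

# Remove duplicate rows before resampling.
dat <- unique(dat)

# Remove constant columns (columns with a single unique value).
is_constant <- vapply(dat, function(col) length(unique(col)) == 1L, logical(1))
dat <- dat[, !is_constant, drop = FALSE]
```

After these steps, dat has three unique rows and no constant columns, and is ready to be passed to train().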
User-defined preprocessing through preprocessor_config is applied to the training set data, the learned parameters are stored in the returned Supervised or SupervisedRes object, and the same preprocessing is then applied to validation and test data.
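As a sketch, this might look like the following. Note that the argument names passed to setup_Preprocessor here (scale, impute) are hypothetical placeholders, as is dat_train; consult the setup_Preprocessor documentation for the actual interface:

```r
# Hypothetical setup_Preprocessor arguments shown for illustration only;
# check the setup_Preprocessor documentation for real parameter names.
mod <- train(
  x = dat_train,  # placeholder for your training data
  algorithm = "LightRF",
  preprocessor_config = setup_Preprocessor(scale = TRUE, impute = TRUE)
)
```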
Binary Classification
For binary classification, the outcome should be a factor where the 2nd level corresponds to the positive class.
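In base R, you can control which level is second (and therefore positive) via the levels argument of factor(); the class names below are just an example:

```r
# Make "case" the positive class by placing it second in levels.
outcome <- factor(
  c("control", "case", "control", "case"),
  levels = c("control", "case")
)
```

Without an explicit levels argument, factor() orders levels alphabetically, which may silently pick the wrong positive class.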
Resampling
Note that you should not use an outer resampling method with replacement if you will also be using an inner resampling (for tuning). The duplicated cases from the outer resampling may appear both in the training and test sets of the inner resamples, leading to underestimated test error.
Reproducibility
If using outer resampling, you can set a seed when defining outer_resampling_config, e.g.
outer_resampling_config = setup_Resampler(n_resamples = 10L, type = "KFold", seed = 2026L)
If using tuning with inner resampling, you can set a seed when defining tuner_config,
e.g.
tuner_config = setup_GridSearch(
resampler_config = setup_Resampler(n_resamples = 5L, type = "KFold", seed = 2027L)
)
Parallelization
There are three levels of parallelization that may be used during training:
- Algorithm training (e.g. a parallelized learner like LightGBM)
- Tuning (inner resampling, where multiple resamples can be processed in parallel)
- Outer resampling (where multiple outer resamples can be processed in parallel)
The train() function will automatically manage parallelization depending
on:
- The number of workers specified by the user using n_workers
- Whether the training algorithm supports parallelization itself
- Whether hyperparameter tuning is needed
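A sketch of parallelizing outer resampling, using the backend and n_workers settings described above (dat_train is a placeholder for your data; exact defaults may differ):

```r
# Run 10 outer resamples over 4 workers using the "future" backend.
mod <- train(
  x = dat_train,  # placeholder for your training data
  algorithm = "LightRF",
  outer_resampling_config = setup_Resampler(n_resamples = 10L, type = "KFold"),
  execution_config = setup_ExecutionConfig(backend = "future", n_workers = 4L)
)
```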
Examples
# \donttest{
iris_c_lightRF <- train(
  iris,
  algorithm = "LightRF",
  outer_resampling_config = setup_Resampler()
)
# }