library(rtemis)

.:rtemis 1.0.0 🌊 aarch64-apple-darwin20
library(data.table)

As an example, we will use the penguins dataset from the palmerpenguins package.
For regression, we will predict body_mass_g from the other features.
In rtemis, the last column is the outcome variable.
We optionally convert the dataset to a data.table and inspect it:
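In code, this step might look as follows (a minimal sketch; the variable name dat is our choice):

dat <- as.data.table(palmerpenguins::penguins)  # convert the penguins tibble to a data.table
str(dat)                                        # inspect the structure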
Classes 'data.table' and 'data.frame': 344 obs. of 8 variables:
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : int 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
$ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
- attr(*, ".internal.selfref")=<externalptr>
Finally, we use set_outcome to move “body_mass_g” to the last column, thereby making it the outcome variable:
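A sketch of this call, assuming set_outcome() takes the data followed by the outcome column name:

dat <- set_outcome(dat, "body_mass_g")  # move body_mass_g to the last column (signature assumed)
dat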
species island bill_length_mm bill_depth_mm flipper_length_mm sex
<fctr> <fctr> <num> <num> <int> <fctr>
1: Adelie Torgersen 39.1 18.7 181 male
2: Adelie Torgersen 39.5 17.4 186 female
3: Adelie Torgersen 40.3 18.0 195 female
4: Adelie Torgersen NA NA NA <NA>
5: Adelie Torgersen 36.7 19.3 193 female
---
340: Chinstrap Dream 55.8 19.8 207 male
341: Chinstrap Dream 43.5 18.1 202 female
342: Chinstrap Dream 49.6 18.2 193 male
343: Chinstrap Dream 50.8 19.0 210 male
344: Chinstrap Dream 50.2 18.7 198 female
year body_mass_g
<int> <int>
1: 2007 3750
2: 2007 3800
3: 2007 3250
4: 2007 NA
5: 2007 3450
---
340: 2009 4000
341: 2009 3400
342: 2009 3775
343: 2009 4100
344: 2009 3775
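Before modeling, we can summarize the dataset with check_data(); the exact call below is assumed:

check_data(dat)  # summarize data types, missingness, and potential issues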
dat: A data.table with 344 rows and 8 columns.
Data types
* 2 numeric features
* 3 integer features
* 3 factors, of which 0 are ordered
* 0 character features
* 0 date features
Issues
* 0 constant features
* 0 duplicate cases
* 5 features include 'NA' values; 19 'NA' values total
* 1 factor; 2 integer; 2 numeric
* 2 missing values in the last column
Recommendations
* Consider using algorithms that can handle missingness or imputing missing values.
* Filter cases with missing values in the last column if using dataset for supervised learning.
There are 2 missing values in our chosen outcome, body_mass_g. As suggested, we must filter out these rows before training a model.
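Since the outcome is the last column, a standard data.table filter removes those cases; a minimal sketch:

dat <- dat[!is.na(body_mass_g)]  # drop the 2 cases with a missing outcome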
Let’s verify the last column has no missing values now:
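For instance, running check_data() again (call assumed as above):

check_data(dat)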
dat: A data.table with 342 rows and 8 columns.
Data types
* 2 numeric features
* 3 integer features
* 3 factors, of which 0 are ordered
* 0 character features
* 0 date features
Issues
* 0 constant features
* 0 duplicate cases
* 1 feature includes 'NA' values; 9 'NA' values total
* 1 factor
Recommendations
* Consider using algorithms that can handle missingness or imputing missing values.
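Next, we split the data into training and test sets. A sketch of the call, assuming resample() defaults to a single stratified subsample with train_p = 0.75 when given the full dataset:

res <- resample(dat)  # stratified subsampling on the last column (defaults assumed)
res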
2026-02-21 19:40:25 Input contains more than one column; stratifying on last. [resample]
<Resampler>
type: StratSub
resamples:
Subsample_1: 2, 3, 4, 5...
config:
<StratSubConfig>
n: 1
train_p: 0.75
stratify_var: NULL
strat_n_bins: 4
id_strat: NULL
seed: NULL
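The resample indices can then be used to form the training and test sets, which here have 254 and 88 cases, respectively. The $Subsample_1 accessor below is an assumption about the Resampler object's structure:

dat_train <- dat[res$Subsample_1]   # training cases (accessor assumed)
dat_test <- dat[-res$Subsample_1]   # remaining cases form the test set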
254 x 8
88 x 8
As an example, we’ll train a LightRF model:
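A sketch of the call; the exact train() signature, including how the test set is passed, is assumed:

mod_lightrf <- train(dat_train, dat_test = dat_test, algorithm = "LightRF")  # argument names assumed
mod_lightrf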
2026-02-21 19:40:25 ▶ [train]
2026-02-21 19:40:25 Training set: 254 cases x 7 features. [summarize_supervised]
2026-02-21 19:40:25 Test set: 88 cases x 7 features. [summarize_supervised]
2026-02-21 19:40:25 // Max workers: 7 => Algorithm: 7; Tuning: 1; Outer Resampling: 1 [get_n_workers]
2026-02-21 19:40:25 Training LightRF Regression... [train]
2026-02-21 19:40:25 Checking data is ready for training... ✓ [check_supervised]
2026-02-21 19:40:25 Converting 3 factors to integer... [preprocess]
2026-02-21 19:40:25 Preprocessing done. [preprocess]
<Regression>
LightRF (LightGBM Random Forest)
<Training Regression Metrics>
MAE: 303.05
MSE: 155068.47
RMSE: 393.79
Rsq: 0.76
<Test Regression Metrics>
MAE: 308.08
MSE: 156026.43
RMSE: 395.00
Rsq: 0.74
2026-02-21 19:40:26 ✓ Done in 0.71 seconds. [train]
The present() method for Supervised objects combines the describe() and plot() methods:
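For example, using the model object from the sketch above:

present(mod_lightrf)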
train() can just as easily train on multiple resamples, which for regression produces an object of class RegressionRes. All you need to do is specify the outer resampling configuration using the outer_resampling_config argument.
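A sketch of such a call; setup_Resampler() below is a hypothetical constructor standing in for however the resampling configuration is built, and we request the 10 folds used in the run that follows:

mod_lightrf_res <- train(dat_train, algorithm = "LightRF", outer_resampling_config = setup_Resampler(n = 10))  # helper and arguments assumed
mod_lightrf_res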
2026-02-21 19:40:26 ▶ [train]
2026-02-21 19:40:26 Training set: 254 cases x 7 features. [summarize_supervised]
2026-02-21 19:40:26 // Max workers: 7 => Algorithm: 7; Tuning: 1; Outer Resampling: 1 [get_n_workers]
2026-02-21 19:40:26 <> Training LightRF Regression using 10 independent folds... [train]
2026-02-21 19:40:26 Input contains more than one column; stratifying on last. [resample]
2026-02-21 19:40:31 </> Outer resampling done. [train]
<Resampled Regression Model>
LightRF (LightGBM Random Forest)
⟳ Tested using 10 independent folds.
<Resampled Regression Training Metrics>
Showing mean (sd) across resamples.
MAE: 305.177 (4.806)
MSE: 156996.551 (4306.611)
RMSE: 396.194 (5.438)
Rsq: 0.759 (0.007)
<Resampled Regression Test Metrics>
Showing mean (sd) across resamples.
MAE: 315.671 (41.402)
MSE: 169824.939 (47410.975)
RMSE: 408.514 (57.167)
Rsq: 0.739 (0.057)
2026-02-21 19:40:31 ✓ Done in 4.75 seconds. [train]
Now, train() produced a RegressionRes object:
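A quick check, using the object name from the sketch above:

class(mod_lightrf_res)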
The present() method for RegressionRes objects combines the describe() and plot() methods:
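For example:

present(mod_lightrf_res)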