5 Preprocess

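Preprocessing prepares raw data for modeling, for example by imputing missing values, converting types, scaling, or encoding factors.
Let’s start with the Sonar dataset from the mlbench package and set five values to NA in each of the first two columns for this example.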

data(Sonar, package = "mlbench")
dat <- Sonar
dat[c(10, 20, 30, 40, 50), 1] <- NA
dat[c(15, 25, 35, 45, 55), 2] <- NA
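
Before running any checks, it can help to confirm that the NA values landed where intended. The sketch below uses base R only and a small synthetic data frame (toy, a hypothetical stand-in for Sonar) so that it runs without mlbench; the same calls should work on dat.

```r
# Hypothetical stand-in for Sonar: two numeric columns and a factor, 60 rows
set.seed(2026)
toy <- data.frame(
  V1 = runif(60),
  V2 = runif(60),
  Class = factor(rep(c("R", "M"), times = 30))
)

# Same injection pattern as above
toy[c(10, 20, 30, 40, 50), 1] <- NA
toy[c(15, 25, 35, 45, 55), 2] <- NA

# Count missing values per column and list the affected rows
na_counts <- colSums(is.na(toy))
na_rows_v1 <- which(is.na(toy$V1))
na_rows_v2 <- which(is.na(toy$V2))

print(na_counts)
print(na_rows_v1)
print(na_rows_v2)
```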

5.1 Check data

To check your data, use the check_data() function:

check_data(dat)

  dat: A data.table with 208 rows and 61 columns.

  Data types
  * 60 numeric features
  * 0 integer features
  * 1 factor, which is not ordered
  * 0 character features
  * 0 date features

  Issues
  * 0 constant features
  * 0 duplicate cases
  * 2 features include 'NA' values; 10 'NA' values total
    * 2 numeric

  Recommendations
  * Consider using algorithms that can handle missingness or imputing missing values.

The output summarizes the data types in your dataset and any issues found, followed by recommendations, if any.
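
The counts reported above can be cross-checked with a few lines of base R. This is only a rough sketch of comparable checks on a tiny made-up data frame (toy is hypothetical), not a description of how check_data() is implemented.

```r
# Tiny hypothetical data frame with two numeric features and one factor
toy <- data.frame(
  V1 = c(0.02, NA, 0.03, 0.01),
  V2 = c(0.04, 0.05, NA, 0.02),
  Class = factor(c("R", "R", "M", "M"))
)

# Data types: non-integer numeric columns and factor columns
n_numeric <- sum(vapply(toy, function(x) is.numeric(x) && !is.integer(x), logical(1)))
n_factor <- sum(vapply(toy, is.factor, logical(1)))

# Issues: constant features, duplicate cases, missing values
n_constant <- sum(vapply(
  toy,
  function(x) length(unique(x[!is.na(x)])) <= 1,
  logical(1)
))
n_duplicate <- sum(duplicated(toy))
n_na_features <- sum(colSums(is.na(toy)) > 0)
n_na_total <- sum(is.na(toy))

print(c(
  numeric = n_numeric, factor = n_factor, constant = n_constant,
  duplicate = n_duplicate, na_features = n_na_features, na_total = n_na_total
))
```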

5.2 Preprocess

To preprocess the data, use the preprocess() function. In this case we want to impute missing data. By default, preprocess() uses the missRanger package, which predicts missing values from the available data with random forests in an iterative procedure and then applies predictive mean matching (pmm.k is 3 in the default config shown further down).

dat_pre <- preprocess(
  dat,
  config = setup_Preprocessor(impute = TRUE)
)

2026-02-21 19:40:09 Imputing missing values using predictive mean matching with missRanger... [preprocess]
Missing value imputation by random forests


Variables to impute:        V1, V2
Variables used to impute:   V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, V13, V14, V15, V16, V17, V18, V19, V20, V21, V22, V23, V24, V25, V26, V27, V28, V29, V30, V31, V32, V33, V34, V35, V36, V37, V38, V39, V40, V41, V42, V43, V44, V45, V46, V47, V48, V49, V50, V51, V52, V53, V54, V55, V56, V57, V58, V59, V60, Class

iter 1
iter 2
iter 3
iter 4

2026-02-21 19:40:10 Preprocessing done. [preprocess]
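
The log above mentions predictive mean matching (PMM). As a rough illustration of the idea, and not of missRanger's actual code, the sketch below imputes one variable in base R: a model predicts every case, and each missing value is replaced by the observed value of a donor drawn from the k cases with the closest predictions. A linear model stands in for the random forest here, and all names (x, y_obs, k) are made up for the example.

```r
set.seed(1)

# Synthetic data: y depends on x, and three y values are set to missing
n <- 50
x <- runif(n)
y <- 2 * x + rnorm(n, sd = 0.1)
miss <- c(5, 17, 33)
y_obs <- y
y_obs[miss] <- NA

# Fit on observed cases (lm drops NA rows by default), then predict all cases
fit <- lm(y_obs ~ x)
pred_all <- predict(fit, newdata = data.frame(x = x))

obs <- which(!is.na(y_obs))
k <- 3 # number of candidate donors, analogous to pmm.k
y_imp <- y_obs

for (i in miss) {
  # The k observed cases whose predictions are closest to the prediction for case i
  donors <- obs[order(abs(pred_all[obs] - pred_all[i]))[seq_len(k)]]
  # Copy the observed value of one randomly chosen donor
  y_imp[i] <- y_obs[donors[sample.int(k, 1)]]
}

print(y_imp[miss])
```

Because imputed values are copied from observed cases, they always stay within the range of the observed data.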

preprocess() returns a Preprocessor S7 object, which holds the preprocessed data together with the configuration and any values computed during preprocessing.

class(dat_pre)

[1] "rtemis::Preprocessor" "S7_object"

Printing the object gives you a look at its structure:

dat_pre

<Preprocessor>
      config:
              <PreprocessorConfig>
                             complete_cases: <lgc> FALSE
                      remove_features_thres: <NUL> NULL
                         remove_cases_thres: <NUL> NULL
                                missingness: <lgc> FALSE
                                     impute: <lgc> TRUE
                                impute_type: <chr> missRanger
                   impute_missRanger_params: 
                                                 pmm.k: <nmr> 3.00
                                               maxiter: <nmr> 10.00
                                             num.trees: <nmr> 500.00
                            impute_discrete: <chr> get_mode
                          impute_continuous: <chr> mean
                             integer2factor: <lgc> FALSE
                            integer2numeric: <lgc> FALSE
                             logical2factor: <lgc> FALSE
                            logical2numeric: <lgc> FALSE
                             numeric2factor: <lgc> FALSE
                      numeric2factor_levels: <NUL> NULL
                              numeric_cut_n: <nmr> 0.00
                         numeric_cut_labels: <lgc> FALSE
                            numeric_quant_n: <nmr> 0.00
                       numeric_quant_NAonly: <lgc> FALSE
                          unique_len2factor: <nmr> 0.00
                           character2factor: <lgc> FALSE
                           factorNA2missing: <lgc> FALSE
                     factorNA2missing_level: <chr> missing
                             factor2integer: <lgc> FALSE
                    factor2integer_startat0: <lgc> TRUE
                                      scale: <lgc> FALSE
                                     center: <lgc> FALSE
                              scale_centers: <NUL> NULL
                         scale_coefficients: <NUL> NULL
                           remove_constants: <lgc> FALSE
              remove_constants_skip_missing: <lgc> TRUE
                          remove_duplicates: <lgc> FALSE
                            remove_features: <NUL> NULL
                                    one_hot: <lgc> FALSE
                             one_hot_levels: <NUL> NULL
                          add_date_features: <lgc> FALSE
                              date_features: <chr> weekday, month, year
                               add_holidays: <lgc> FALSE
                                    exclude: <NUL> NULL
preprocessed: 
              (data.frame with 208 rows and 61 columns.)
      values: 
                   scale_centers: <NUL> NULL
              scale_coefficients: <NUL> NULL
                  one_hot_levels: <NUL> NULL
                 remove_features: <NUL> NULL
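
The config lists impute_continuous: mean and impute_discrete: get_mode alongside the missRanger settings. How rtemis selects between them is not shown here, so the following is only a base R sketch of what mean and mode imputation generally look like; impute_simple() and stat_mode() are hypothetical helpers, not rtemis functions.

```r
# Most frequent non-missing value (hypothetical helper)
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Replace NA with the column mean (numeric columns) or mode (all other columns)
impute_simple <- function(df) {
  for (j in seq_along(df)) {
    na_idx <- is.na(df[[j]])
    if (!any(na_idx)) next
    df[[j]][na_idx] <- if (is.numeric(df[[j]])) {
      mean(df[[j]], na.rm = TRUE)
    } else {
      stat_mode(df[[j]])
    }
  }
  df
}

# Tiny made-up example with one missing numeric and one missing factor value
toy <- data.frame(
  V1 = c(1, NA, 3),
  Class = factor(c("R", NA, "R"), levels = c("R", "M"))
)
toy_imp <- impute_simple(toy)
print(toy_imp)
```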

Use the preprocessed() method to extract the preprocessed data, much as you would use base R's fitted() to extract fitted values from a model:

dat_p <- preprocessed(dat_pre)
head(dat_p)

      V1     V2     V3     V4     V5     V6     V7     V8     V9    V10    V11
1 0.0200 0.0371 0.0428 0.0207 0.0954 0.0986 0.1539 0.1601 0.3109 0.2111 0.1609
2 0.0453 0.0523 0.0843 0.0689 0.1183 0.2583 0.2156 0.3481 0.3337 0.2872 0.4918
3 0.0262 0.0582 0.1099 0.1083 0.0974 0.2280 0.2431 0.3771 0.5598 0.6194 0.6333
4 0.0100 0.0171 0.0623 0.0205 0.0205 0.0368 0.1098 0.1276 0.0598 0.1264 0.0881
5 0.0762 0.0666 0.0481 0.0394 0.0590 0.0649 0.1209 0.2467 0.3564 0.4459 0.4152
6 0.0286 0.0453 0.0277 0.0174 0.0384 0.0990 0.1201 0.1833 0.2105 0.3039 0.2988
     V12    V13    V14    V15    V16    V17    V18    V19    V20    V21    V22
1 0.1582 0.2238 0.0645 0.0660 0.2273 0.3100 0.2999 0.5078 0.4797 0.5783 0.5071
2 0.6552 0.6919 0.7797 0.7464 0.9444 1.0000 0.8874 0.8024 0.7818 0.5212 0.4052
3 0.7060 0.5544 0.5320 0.6479 0.6931 0.6759 0.7551 0.8929 0.8619 0.7974 0.6737
4 0.1992 0.0184 0.2261 0.1729 0.2131 0.0693 0.2281 0.4060 0.3973 0.2741 0.3690
5 0.3952 0.4256 0.4135 0.4528 0.5326 0.7306 0.6193 0.2032 0.4636 0.4148 0.4292
6 0.4250 0.6343 0.8198 1.0000 0.9988 0.9508 0.9025 0.7234 0.5122 0.2074 0.3985
     V23    V24    V25    V26    V27    V28    V29    V30    V31    V32    V33
1 0.4328 0.5550 0.6711 0.6415 0.7104 0.8080 0.6791 0.3857 0.1307 0.2604 0.5121
2 0.3957 0.3914 0.3250 0.3200 0.3271 0.2767 0.4423 0.2028 0.3788 0.2947 0.1984
3 0.4293 0.3648 0.5331 0.2413 0.5070 0.8533 0.6036 0.8514 0.8512 0.5045 0.1862
4 0.5556 0.4846 0.3140 0.5334 0.5256 0.2520 0.2090 0.3559 0.6260 0.7340 0.6120
5 0.5730 0.5399 0.3161 0.2285 0.6995 1.0000 0.7262 0.4724 0.5103 0.5459 0.2881
6 0.5890 0.2872 0.2043 0.5782 0.5389 0.3750 0.3411 0.5067 0.5580 0.4778 0.3299
     V34    V35    V36    V37    V38    V39    V40    V41    V42    V43    V44
1 0.7547 0.8537 0.8507 0.6692 0.6097 0.4943 0.2744 0.0510 0.2834 0.2825 0.4256
2 0.2341 0.1306 0.4182 0.3835 0.1057 0.1840 0.1970 0.1674 0.0583 0.1401 0.1628
3 0.2709 0.4232 0.3043 0.6116 0.6756 0.5375 0.4719 0.4647 0.2587 0.2129 0.2222
4 0.3497 0.3953 0.3012 0.5408 0.8814 0.9857 0.9167 0.6121 0.5006 0.3210 0.3202
5 0.0981 0.1951 0.4181 0.4604 0.3217 0.2828 0.2430 0.1979 0.2444 0.1847 0.0841
6 0.2198 0.1407 0.2856 0.3807 0.4158 0.4054 0.3296 0.2707 0.2650 0.0723 0.1238
     V45    V46    V47    V48    V49    V50    V51    V52    V53    V54    V55
1 0.2641 0.1386 0.1051 0.1343 0.0383 0.0324 0.0232 0.0027 0.0065 0.0159 0.0072
2 0.0621 0.0203 0.0530 0.0742 0.0409 0.0061 0.0125 0.0084 0.0089 0.0048 0.0094
3 0.2111 0.0176 0.1348 0.0744 0.0130 0.0106 0.0033 0.0232 0.0166 0.0095 0.0180
4 0.4295 0.3654 0.2655 0.1576 0.0681 0.0294 0.0241 0.0121 0.0036 0.0150 0.0085
5 0.0692 0.0528 0.0357 0.0085 0.0230 0.0046 0.0156 0.0031 0.0054 0.0105 0.0110
6 0.1192 0.1089 0.0623 0.0494 0.0264 0.0081 0.0104 0.0045 0.0014 0.0038 0.0013
     V56    V57    V58    V59    V60 Class
1 0.0167 0.0180 0.0084 0.0090 0.0032     R
2 0.0191 0.0140 0.0049 0.0052 0.0044     R
3 0.0244 0.0316 0.0164 0.0095 0.0078     R
4 0.0073 0.0050 0.0044 0.0040 0.0117     R
5 0.0015 0.0072 0.0048 0.0107 0.0094     R
6 0.0089 0.0057 0.0027 0.0051 0.0062     R

Let’s check the preprocessed data. The missing values should now be imputed:

check_data(dat_p)

  dat_p: A data.table with 208 rows and 61 columns.

  Data types
  * 60 numeric features
  * 0 integer features
  * 1 factor, which is not ordered
  * 0 character features
  * 0 date features

  Issues
  * 0 constant features
  * 0 duplicate cases
  * 0 missing values

  Recommendations
  * Everything looks good
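
check_data() confirms that no missing values remain. A complementary check is that imputation filled only the cells that were missing and left every observed value untouched. The sketch below shows one way to test this in base R; only_na_changed() is a hypothetical helper, and the small data frames stand in for dat and dat_p.

```r
# TRUE if `after` has no NA and equals `before` in every originally observed cell
only_na_changed <- function(before, after) {
  stopifnot(
    identical(names(before), names(after)),
    nrow(before) == nrow(after)
  )
  observed_unchanged <- all(mapply(
    function(b, a) {
      keep <- !is.na(b)
      identical(b[keep], a[keep])
    },
    before, after
  ))
  observed_unchanged && !anyNA(after)
}

# Made-up data: `after` fills the two NA cells, `tampered` also alters an observed value
before <- data.frame(V1 = c(0.02, NA, 0.03), V2 = c(NA, 0.05, 0.06))
after <- data.frame(V1 = c(0.02, 0.025, 0.03), V2 = c(0.055, 0.05, 0.06))
tampered <- after
tampered$V1[1] <- 0.99

ok <- only_na_changed(before, after)
bad <- only_na_changed(before, tampered)
print(c(ok = ok, bad = bad))
```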

If you do not need to keep the Preprocessor object, you can pipe the result of preprocess() straight into preprocessed():

dat_p <- preprocess(
  dat,
  config = setup_Preprocessor(impute = TRUE)
) |> preprocessed()

2026-02-21 19:40:10 Imputing missing values using predictive mean matching with missRanger... [preprocess]
Missing value imputation by random forests


Variables to impute:        V1, V2
Variables used to impute:   V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, V13, V14, V15, V16, V17, V18, V19, V20, V21, V22, V23, V24, V25, V26, V27, V28, V29, V30, V31, V32, V33, V34, V35, V36, V37, V38, V39, V40, V41, V42, V43, V44, V45, V46, V47, V48, V49, V50, V51, V52, V53, V54, V55, V56, V57, V58, V59, V60, Class

iter 1
iter 2
iter 3

2026-02-21 19:40:11 Preprocessing done. [preprocess]
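
The |> in the call above is R's native pipe (available since R 4.1.0), which passes the left-hand result as the first argument of the call on the right. A minimal base R illustration, unrelated to rtemis itself:

```r
squares <- c(1, 4, 9)

# Piped form: take square roots, then sum them
piped <- squares |> sqrt() |> sum()

# Equivalent nested form
nested <- sum(sqrt(squares))

print(identical(piped, nested))
```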