# Configuration
This page explains how to configure training, prediction, and optional root cause analysis (ARCANA).
## Quick start: minimal configuration
A minimal configuration that clips outliers, imputes missing values, and scales features:
```yaml
train:
  # clip training data to remove outliers (only applied for training)
  data_clipping:  # (optional) if not specified, not applied.
    # Use features_to_exclude or features_to_clip: [feature] to skip or to apply to specific features
    lower_percentile: 0.001
    upper_percentile: 0.999
  data_preprocessor:
    steps:
      # This drops features where > 20% is missing
      - name: column_selector
        params:
          max_nan_frac_per_col: 0.2
      # This drops constants by default (controlled by `min_unique_value_count`)
      - name: low_unique_value_filter
      # SimpleImputer and StandardScaler are always added
  data_splitter:
    # How to split data into train and validation sets for the autoencoder
    type: sklearn
    validation_split: 0.2
    shuffle: true
  autoencoder:
    name: default
    params:
      layers:  # Symmetric autoencoder: inputs - 200 - 100 - 50 - 20 - 50 - 100 - 200 - outputs
        - 200
        - 100
        - 50
      code_size: 20  # Size of the bottleneck layer
  anomaly_score:
    name: rmse
  threshold_selector:
    fit_on_val: true
    name: quantile
    params:
      quantile: 0.95
  root_cause_analysis:
    alpha: 0.8
    init_x_bias: recon
    num_iter: 1000
```
This setup:

- Applies `DataClipper` if specified.
- Builds a `DataPreprocessor` with:
  - `ColumnSelector`, which drops columns with more than 20% NaNs (configurable).
  - `LowUniqueValueFilter`, which removes constant features by default (configurable).
  - `SimpleImputer` (mean) and a scaler (`StandardScaler` by default). If you do not add an imputer/scaler explicitly, the pipeline ensures that mean imputation and a `StandardScaler` are added.
- Trains a default autoencoder (with the provided architecture, otherwise default values), with an RMSE anomaly score and a quantile threshold selector.
- Runs ARCANA with the provided parameters when calling `FaultDetector.predict(..., root_cause_analysis=True)`. If not provided, default ARCANA parameters are used (see the ARCANA docs).
If you leave out the data preprocessor configuration (i.e., `data_preprocessor: {}`), a default preprocessing pipeline
is generated. It drops constant features and features where more than 5% of the data is missing, imputes the remaining
missing values with the mean, and scales the data to zero mean and unit standard deviation.
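For reference, that default pipeline can be written out as an explicit `steps` configuration. The sketch below is inferred from the defaults stated above, not taken from the library source, so verify the exact step order and defaults against the DataPreprocessor docs:

```yaml
train:
  data_preprocessor:
    steps:
      - name: low_unique_value_filter   # drops constant features
      - name: column_selector
        params:
          max_nan_frac_per_col: 0.05    # drop columns where > 5% is missing
      - name: simple_imputer
        params:
          strategy: mean
      - name: standard_scaler           # zero mean, unit standard deviation
```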
## Detailed configuration
Below is a more thorough configuration. It shows how to specify preprocessing steps and more model parameters.
```yaml
train:
  # clip training data to remove outliers (only applied for training)
  data_clipping:  # (optional) if not specified, not applied.
    lower_percentile: 0.01
    upper_percentile: 0.99
    # Choose one of:
    # features_to_exclude:
    #   - do_not_clip_this_feature
    # features_to_clip:
    #   - clip_only_this_feature
  data_preprocessor:
    steps:
      # Replace consecutive duplicate 0-values with NaN
      - name: duplicate_to_nan
        params:
          value_to_replace: 0
          n_max_duplicates: 6
          features_to_exclude:
            - do_not_replace_value_with_nan_for_this_feature
      # Normalize counters to differences (configure your counter columns)
      # If needed, you can create multiple counter_diff_transformer steps
      # with different settings for different counters
      - name: counter_diff_transformer
        step_name: counter_diff_energy
        params:
          counters:
            - energy_total_kwh
          compute_rate: false
          reset_strategy: zero
          fill_first: nan
      # Column selection: drop columns where > 20% is missing and exclude specific features
      - name: column_selector
        params:
          max_nan_frac_per_col: 0.20
          features_to_exclude:
            - feature1
            - feature2
          # Alternatively, keep only selected features:
          # features_to_select:
          #   - temp_outdoor
          #   - flow
          #   - power
      # Filter low-unique-value features or high-zero-fraction columns
      - name: low_unique_value_filter
        params:
          min_unique_value_count: 2
          max_col_zero_frac: 0.99
      # Transform angles to sin/cos
      - name: angle_transformer
        params:
          angles:
            - angle1
            - angle2
      # Imputer (explicit; will be auto-inserted if omitted)
      - name: simple_imputer
        params:
          strategy: mean
      # Scaler (choose one; StandardScaler is auto-added by default if omitted)
      - name: standard_scaler
        params:
          with_mean: true
          with_std: true
  data_splitter:
    # How to split data into train and validation sets for the autoencoder
    type: sklearn
    validation_split: 0.2
    shuffle: true  # false by default (the last part of the data is then taken as validation data)
    # or block splitting, 4 weeks training, 1 week validation:
    # type: DataSplitter
    # train_block_size: 4032
    # val_block_size: 1008
  autoencoder:
    name: MultilayerAutoencoder
    params:
      batch_size: 128
      # Use an ExponentialDecay schedule for the learning rate:
      learning_rate: 0.001  # starting point
      decay_rate: 0.99
      decay_steps: 100000
      # Early stopping with max 1000 epochs, minimal improvement of 1e-4 and patience of 5 epochs
      early_stopping: true
      min_delta: 0.0001
      patience: 5
      epochs: 1000
      # architecture settings
      layers: [200, 100, 50]
      code_size: 20
      act: prelu  # activation to use for hidden layers
      last_act: linear  # output layer activation
  anomaly_score:
    name: rmse
    params:
      scale: false
  threshold_selector:
    name: fbeta
    params:
      beta: 0.5
  root_cause_analysis:
    alpha: 0.5
    init_x_bias: recon
    num_iter: 1000
    verbose: true
predict:
  criticality:
    max_criticality: 144
```
## DataPreprocessor specification

A steps-based preprocessing pipeline can be configured under `train.data_preprocessor.steps`. Each step is a dict
with the following keys:

- `name` (str): the registered step name (see table below).
- `enabled` (bool, optional): default `True`; set to `False` to skip a step.
- `params` (dict, optional): constructor arguments for the step.
- `step_name` (str, optional): custom key for the sklearn pipeline; useful if a step is repeated.
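For instance, `enabled` and `step_name` can be combined to skip one step while repeating another under distinct pipeline keys. The volume counter below is a hypothetical feature name used only for illustration:

```yaml
steps:
  - name: angle_transformer
    enabled: false                     # keep the step in the config but skip it
  - name: counter_diff_transformer
    step_name: counter_diff_energy     # first instance
    params:
      counters:
        - energy_total_kwh
  - name: counter_diff_transformer
    step_name: counter_diff_volume     # repeated step needs a distinct step_name
    params:
      counters:
        - volume_total_m3              # hypothetical counter column
```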
Allowed step names and aliases:

| Step name | Purpose | Aliases |
|---|---|---|
| `column_selector` | Drop columns with too many NaNs | - |
| `low_unique_value_filter` | Drop columns with low variance/many zeros | - |
| `angle_transformer` | Convert angles to sin/cos pairs | `angle_transform` |
| `counter_diff_transformer` | Convert counters to differences/rates | `counter_diff`, `counter_diff_transform` |
| `simple_imputer` | Impute missing values | `imputer` |
| `standard_scaler` | Standardize features (z-score) | `standardize`, `standardscaler`, `standard` |
| `minmax_scaler` | Scale to [0, 1] | `minmax` |
| `duplicate_to_nan` | Replace consecutive duplicate values with NaN | `duplicate_value_to_nan`, `duplicate_values_to_nan` |
For detailed documentation of the data preprocessor pipeline, refer to the
DataPreprocessor docs.
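As an illustration of the `duplicate_to_nan` step, here is a plain-Python sketch of the rule described above: within a run of repeated `value_to_replace` entries, keep the first `n_max_duplicates` values and replace the rest with NaN. This mirrors the config keys, but the actual transformer's edge-case handling may differ:

```python
import math

def duplicates_to_nan(values, value_to_replace=0, n_max_duplicates=6):
    # Keep at most n_max_duplicates consecutive occurrences of
    # value_to_replace; replace further repeats in the run with NaN.
    out, run_length = [], 0
    for v in values:
        if v == value_to_replace:
            run_length += 1
            out.append(math.nan if run_length > n_max_duplicates else v)
        else:
            run_length = 0
            out.append(v)
    return out

# A sensor stuck at 0 for 8 samples: the 7th and 8th zeros become NaN
signal = [1.0] + [0.0] * 8 + [2.0]
cleaned = duplicates_to_nan(signal, value_to_replace=0, n_max_duplicates=6)
```

The NaNs introduced here are later filled by the imputer step, so a stuck sensor no longer feeds long runs of identical values into training.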
## Other training configuration sections

- Data clipping: `DataClipper` supports `features_to_exclude` and `features_to_clip` for fine-grained control.
- Data splitter (`train.data_splitter`):
  - `type`: one of `BlockDataSplitter` (aliases: `blocks`, `DataSplitter`) or `sklearn` (alias `train_test_split`).
  - For sklearn: `validation_split` (float in (0, 1)) and `shuffle` (bool).
  - For `BlockDataSplitter`: `train_block_size` and `val_block_size`.
  - Early stopping guard: if `train.autoencoder.params.early_stopping` is true, you must either set a valid `validation_split` in (0, 1), or use `BlockDataSplitter` with a positive `val_block_size`.
- Autoencoder (`train.autoencoder`):
  - `name`: class name in the registry.
  - `params`: architecture and training args (e.g., `layers`, `epochs`, `learning_rate`, `early_stopping`). Refer to the autoencoder class docs (autoencoders) for specific params and their defaults.
- Anomaly score (`train.anomaly_score`):
  - `name`: score name (e.g., `rmse`, `mahalanobis`).
  - `params`: score-specific parameters. Refer to the `anomaly_scores` docs.
- Threshold selector (`train.threshold_selector`):
  - `name`: e.g., `quantile`, `fbeta`, etc.
  - `fit_on_val`: fit the threshold on validation data only.
  - `params`: selector-specific parameters (e.g., `quantile` for the quantile selector). See the `threshold_selectors` docs for more info on the settings.
## Prediction options

Under `predict`, you can set:

- `criticality.max_criticality`: cap the calculated criticality (anomaly counter) at this value.
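As a rough sketch of how such a cap behaves: the counter grows with anomalous samples but saturates at `max_criticality` instead of growing without bound. The reset-to-zero on a normal sample is an assumption made for illustration; check the library docs for the exact criticality definition:

```python
def criticality_series(is_anomaly, max_criticality=144):
    # Capped anomaly counter: increments on anomalous samples,
    # saturates at max_criticality, resets on normal samples (assumed).
    out, counter = [], 0
    for flag in is_anomaly:
        counter = min(counter + 1, max_criticality) if flag else 0
        out.append(counter)
    return out

flags = [True] * 5 + [False] + [True] * 2
crit = criticality_series(flags, max_criticality=3)
```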
## Root cause analysis (ARCANA)

If `root_cause_analysis` is provided, ARCANA will attempt to attribute anomalies to specific features using the
provided settings. If not provided, default settings are used. For detailed documentation, refer to the
ARCANA docs.
## Old params data preprocessing configuration (for older versions)

Older configurations use `params` under `train.data_preprocessor.params`.
These remain supported but are deprecated in favor of steps mode.
When both `steps` and legacy `params` are present, `steps` takes precedence and the legacy `params` are ignored with a warning.
```yaml
train:
  # ...
  data_preprocessor:
    # only imputation and scaling are done by default, other steps can be skipped.
    params:
      imputer_strategy: 'mean'
      scale: 'standardize'  # standard scaling or minmax scaling (minmax)
      include_column_selector: true  # whether to apply the ColumnSelector
      max_nan_frac_per_col: 0.05  # ColumnSelector option - drop columns where >5% is NaN
      features_to_exclude:  # ColumnSelector option - features to always exclude
        - feature1
        - feature2
      angles:  # list of angles to transform to sine/cosine values; skipped if none provided
        - angle1
        - angle2
      include_low_unique_value_filter: true  # whether to apply the LowVarianceFilter
      min_unique_value_count: 2  # LowVarianceFilter option - drop columns with fewer unique values than this
      max_col_zero_frac: 0.99  # LowVarianceFilter option - drop columns whose zero fraction exceeds this
      include_duplicate_value_to_nan: false  # whether to apply DuplicateValuesToNan
      value_to_replace: 0  # DuplicateValuesToNan option - value whose duplicates are replaced with NaN
      n_max_duplicates: 6  # DuplicateValuesToNan option - replace duplicates beyond the n_max_duplicates-th consecutive value with NaN
      duplicate_features_to_exclude:  # DuplicateValuesToNan option - features to not transform with DuplicateValuesToNan
        - do_not_replace_value_with_nan
      # ...
```
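As a migration aid, the legacy `params` configuration above corresponds roughly to the following `steps` configuration. This mapping is a sketch based on the step table above; verify the exact defaults against the DataPreprocessor docs (the `duplicate_to_nan` step is omitted because `include_duplicate_value_to_nan` is false):

```yaml
train:
  data_preprocessor:
    steps:
      - name: column_selector
        params:
          max_nan_frac_per_col: 0.05
          features_to_exclude:
            - feature1
            - feature2
      - name: low_unique_value_filter
        params:
          min_unique_value_count: 2
          max_col_zero_frac: 0.99
      - name: angle_transformer
        params:
          angles:
            - angle1
            - angle2
      - name: simple_imputer
        params:
          strategy: mean
      - name: standard_scaler
```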