Configuration

This page explains how to configure training, prediction, and optional root cause analysis (ARCANA).

Quick start: minimal configuration 

A minimal configuration that clips outliers, imputes missing values, and scales features:

train:
  # clip training data to remove outliers (only applied for training)
  data_clipping:  # (optional) if not specified, not applied.
    # Use features_to_exclude or features_to_clip: [feature] to skip or to apply to specific features
    lower_percentile: 0.001
    upper_percentile: 0.999

  data_preprocessor:
    steps:
      # This drops features where > 20% is missing
      - name: column_selector
        params:
          max_nan_frac_per_col: 0.2
      # This drops constants by default (controlled by `min_unique_value_count`)
      - name: low_unique_value_filter
      # SimpleImputer and StandardScaler are always added

  data_splitter:
    # How to split data in train and validation sets for the autoencoder
    type: sklearn
    validation_split: 0.2
    shuffle: true

  autoencoder:
    name: default
    params:
      layers:  # Symmetric autoencoder: inputs - 200 - 100 - 50 - 20 - 50 - 100 - 200 - outputs
      - 200
      - 100
      - 50
      code_size: 20  # Size of the bottleneck layer

  anomaly_score:
    name: rmse

  threshold_selector:
    fit_on_val: true
    name: quantile
    params:
      quantile: 0.95

root_cause_analysis:
  alpha: 0.8
  init_x_bias: recon
  num_iter: 1000

This setup:

Applies DataClipper if specified.
Builds a DataPreprocessor with:
- ColumnSelector that drops columns with more than 20% NaNs (configurable).
- LowUniqueValueFilter that removes constant features by default (configurable).
- SimpleImputer (mean) and a scaler (StandardScaler by default). If you do not add an imputer/scaler explicitly, the pipeline ensures mean-imputation and StandardScaler are added.
Trains a default autoencoder (with provided architecture, otherwise default values), with an RMSE anomaly score and a quantile threshold selector.
Runs ARCANA with provided parameters when calling FaultDetector.predict(..., root_cause_analysis=True). If not provided, default ARCANA parameters are used (see ARCANA docs).

If you leave out the data_preprocessor configuration (i.e., data_preprocessor: {}), a default preprocessing pipeline is generated, which drops constant and binary features, features where >5% of the data is missing, imputes remaining missing values with the mean value and scales the data to zero mean and unit standard deviation.

You can also generate this kind of configuration programmatically using generate_quickstart_config:

from energy_fault_detector.config.quickstart_config import generate_quickstart_config
# generate a Config instance and save as YAML to base_config.yaml
config = generate_quickstart_config(output_path="base_config.yaml")

You can look up the names for the available model classes in the class registry:

from energy_fault_detector import registry

registry.print_available_classes()

Configuration updates 

To update the configuration ‘on the fly’ (for example for hyperparameter optimization), you provide a new configuration dictionary via the Config.update_config method:

from energy_fault_detector.config import Config
from copy import deepcopy

config = Config('configs/base_config.yaml')

# update some parameters:
new_config_dict = deepcopy(config.config_dict)
new_config_dict['train']['anomaly_score']['name'] = 'mahalanobis'
config = Config(new_config_dict)

# or create a new configuration object and model
new_model = FaultDetector(Config(config_dict=new_config_dict))

Detailed configuration 

Below is a more thorough configuration. It shows how to specify preprocessing steps and more model parameters.

train:
  # clip training data to remove outliers (only applied for training)
  data_clipping:  # (optional) if not specified, not applied.
    lower_percentile: 0.01
    upper_percentile: 0.99
    # Choose one of:
    # features_to_exclude:
    #   - do_not_clip_this_feature
    # features_to_clip:
    #   - clip_only_this_feature

  data_preprocessor:
    steps:
      # Replace consecutive duplicate 0-values with NaN
      - name: duplicate_to_nan
        params:
          value_to_replace: 0
          n_max_duplicates: 6
          features_to_exclude:
            - do_not_replace_value_with_nan_for_this_feature
      # Normalize counters to differences (configure your counter columns)
      # If needed, you can create multiple counter_diff_transformer steps with different settings for different counters
      - name: counter_diff_transformer
        step_name: counter_diff_energy
        params:
          counters:
            - energy_total_kwh
          compute_rate: false
          reset_strategy: zero
          fill_first: nan
      # Column selection: drop columns where > 20% is missing and exclude specific features
      - name: column_selector
        params:
          max_nan_frac_per_col: 0.20
          features_to_exclude:
            - feature1
            - feature2
          # Alternatively, keep only selected features:
          # features_to_select:
          #   - temp_outdoor
          #   - flow
          #   - power
      # Filter low unique value features or high-zero-fraction columns
      - name: low_unique_value_filter
        params:
          min_unique_value_count: 2
          max_col_zero_frac: 0.99
      # Transform angles to sin/cos
      - name: angle_transformer
        params:
          angles:
            - angle1
            - angle2
      # Imputer (explicit; will be auto-inserted if omitted)
      - name: simple_imputer
        params:
          strategy: mean
      # Scaler (choose one; StandardScaler is auto-added by default if omitted)
      - name: standard_scaler
        params:
          with_mean: true
          with_std: true

  data_splitter:
    # How to split data in train and validation sets for the autoencoder
    type: sklearn
    validation_split: 0.2
    shuffle: true  # false by default (last part of the data is taken as validation data in this case)
    # or block splitting, 4 weeks training, 1 week validation
    # type: DataSplitter
    # train_block_size: 4032
    # val_block_size: 1008

  autoencoder:
    name: MultilayerAutoencoder
    params:
      batch_size: 128
      # Use a ExponentialDecay schedule for the learning rate:
      learning_rate: 0.001  # starting point
      decay_rate: 0.99
      decay_steps: 100000
      # Set early stopping with max 1000 epochs, minimal improvement of 1e-4 and patience of 5 epochs
      early_stopping: True
      min_delta: 0.0001
      patience: 5
      epochs: 1000
      # architecture settings
      layers: [200, 100, 50]
      code_size: 20
      act: prelu  # activation to use for hidden layers
      last_act: linear  # output layer activation

  anomaly_score:
    name: rmse
    params:
      scale: false

  threshold_selector:
    name: fbeta
    params:
      beta: 0.5

root_cause_analysis:
  alpha: 0.5
  init_x_bias: recon
  num_iter: 1000
  verbose: true

predict:
  criticality:
    max_criticality: 144

DataPreprocessor specification 

A steps-based preprocessing pipeline can be configured under train.data_preprocessor.steps. Each step is a dict with the following keys:

name (str): the registered step name (see table below).
enabled (bool, optional): default True; set to False to skip a step.
params (dict, optional): constructor arguments for the step.
step_name (str, optional): custom key for the sklearn pipeline; useful if a step is repeated.

Allowed step names and aliases:

Step name	Purpose	Aliases
column_selector	Drop columns with too many NaNs	-
low_unique_value_filter	Drop columns with low variance/many zeros	-
angle_transformer	Convert angles to sin/cos pairs	angle_transform
counter_diff_transformer	Convert counters to differences/rates	counter_diff, counter_diff_transform
timestamp_transformer	Extract time features (hour, day, etc.)	timestamp_transform,timestamp_features
simple_imputer	Impute missing values	imputer
standard_scaler	Standardize features (z-score)	standardize, standardscaler, standard
minmax_scaler	Scale to [0, 1]	minmax
duplicate_to_nan	Replace consecutive duplicate values with NaN	duplicate_value_to_nan, duplicate_values_to_nan

For detailed documentation of the data preprocessor pipeline, refer to the DataPreprocessor docs.

Multi-device data (MultiIndex support)

Both counter_diff_transformer and timestamp_transformer support multi-device data via pandas MultiIndex. By default (groupby_level='auto'), they automatically detect and group by non-datetime index levels:

train:
  data_preprocessor:
    steps:
      - name: counter_diff_transformer
        params:
          counters: ['energy_total']
          compute_rate: true
          groupby_level: 'auto'  # Default: auto-detects device_id from (device_id, timestamp)
      - name: timestamp_transformer
        params:
          features: ['hour_of_day', 'day_of_week']
          groupby_level: 'auto'  # Default: auto-detects device_id

For a MultiIndex like (device_id, timestamp), this automatically computes counter diffs and time features per device. You can also specify the level explicitly: groupby_level: 'device_id' or groupby_level: 0. Set groupby_level: null to disable grouping for single-device data with a simple DatetimeIndex.

Please note that the sequence-based autoencoders do not (yet) support MultiIndex data.

Other training configuration sections 

Data clipping: DataClipper supports features_to_exclude and features_to_clip for fine-grained control.
Data splitter (train.data_splitter):
- type: one of BlockDataSplitter (aliases: blocks, DataSplitter), or sklearn (alias train_test_split).
- For sklearn: validation_split (float in (0, 1)) and shuffle (bool).
- For BlockDataSplitter: train_block_size and val_block_size.
- Early stopping guard: if train.autoencoder.params.early_stopping is true, you must either set a valid validation_split in (0, 1), or use BlockDataSplitter with a positive val_block_size.

Note

MultiIndex / multi-device data: When using MultiIndex data (device_id, timestamp) with shuffle: false, the validation split is taken positionally from the end of the sorted data — which may contain only one device. Set shuffle: true to ensure both devices appear in the validation set, leading to more representative early stopping and validation metrics.

Autoencoder (train.autoencoder):
- name: class name in the registry.
- params: architecture and training args (e.g., layers, epochs, learning_rate, early_stopping). See (autoencoders) for specific params and their defaults.
Anomaly score (train.anomaly_score):
- name: score name (e.g., rmse, mahalanobis).
- params: score-specific parameters. See anomaly_scores docs.
Threshold selector (train.threshold_selector):
- name: e.g., quantile, fbeta, etc.
- fit_on_val: fit the threshold on validation only.
- params: selector-specific parameters (e.g., quantile for the quantile selector). See the threshold_selectors docs for more info on the settings.

Prediction options 

Under predict, you can set:

criticality.max_criticality: cap the calculated criticality (anomaly counter) to this value.

Root cause analysis (ARCANA)

If root_cause_analysis is provided, ARCANA will attempt to attribute anomalies to specific features using the provided settings. If not provided, default settings are used. For detailed documentation refer to ARCANA docs and ARCANA: Root cause analysis for a full example.

Configuration

Quick start: minimal configuration

Configuration updates

Detailed configuration

DataPreprocessor specification

Other training configuration sections

Prediction options