energy_fault_detector.data_preprocessing.data_preprocessor

Generic class for building a preprocessing pipeline.

class DataPreprocessor(angles=None, imputer_strategy='mean', imputer_fill_value=None, scale='standardize', include_column_selector=True, features_to_exclude=None, max_nan_frac_per_col=0.05, include_low_unique_value_filter=True, min_unique_value_count=2, max_col_zero_frac=1.0, include_duplicate_value_to_nan=False, value_to_replace=0, n_max_duplicates=144, duplicate_features_to_exclude=None)

Bases: Pipeline, SaveLoadMixin

A data preprocessing pipeline with configurable steps, based on the extended pipeline. The available steps are:

(optional) Replace runs of consecutive duplicate zero values (or another configured value) with NaN. Use this step
if 0 can also represent missing values in the data.

(optional) Column selection: A ColumnSelector object filters out columns/features with too many NaN values.

(optional) Features containing angles are transformed to sine/cosine values.

(optional) Low unique value filter: Removes columns/features with a low number of unique values or a
high fraction of zeros. Use the zero-fraction setting if 0 can also represent missing values in the data.

Imputation: Fill missing values with sklearn’s SimpleImputer.

Scaling: Apply either sklearn’s StandardScaler or MinMaxScaler.
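The column-filtering steps above can be sketched in plain Python (used here instead of pandas to keep the example dependency-free). This is only an illustration of the idea, not the library's implementation: the helper name keep_column is hypothetical, and the boundary handling (a column is dropped only when a fraction strictly exceeds its limit, and kept when it has at least min_unique distinct values) is an assumption.

```python
import math

def keep_column(values, max_nan_frac=0.05, min_unique=2, max_zero_frac=1.0):
    """Return True if a column passes the NaN, unique-value and zero checks.

    Hypothetical helper: thresholds mirror max_nan_frac_per_col,
    min_unique_value_count and max_col_zero_frac; boundaries are assumed.
    """
    n = len(values)
    if n == 0:
        return False
    present = [v for v in values if not math.isnan(v)]
    # Column selection: too many missing values.
    if (n - len(present)) / n > max_nan_frac:
        return False
    # Low unique value filter: too few distinct values.
    if len(set(present)) < min_unique:
        return False
    # Low unique value filter: too many zeros.
    zero_frac = sum(1 for v in present if v == 0) / n
    return zero_frac <= max_zero_frac

nan = float("nan")
columns = {
    "power": [1.0, 2.0, 3.0, 4.0],
    "mostly_missing": [1.0, nan, nan, 2.0],
    "constant": [5.0, 5.0, 5.0, 5.0],
    "mostly_zero": [0.0, 0.0, 0.0, 1.0],
}

# With a zero-fraction limit of 0.5, only "power" survives all checks.
kept = [name for name, vals in columns.items()
        if keep_column(vals, max_zero_frac=0.5)]
print(kept)
```

With the default max_zero_frac of 1.0 the zero check never removes a column, which is why the "mostly_zero" column is only dropped once the limit is lowered.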

Parameters:

angles (Optional[List[str]]) – List of angle features for transformation. Defaults to None. If none provided (or empty list), this step is skipped.
imputer_strategy (str) – Strategy for imputation (‘mean’, ‘median’, ‘most_frequent’, ‘constant’). Defaults to ‘mean’.
imputer_fill_value (Optional[int]) – Fill value used for imputation when imputer_strategy is ‘constant’. Defaults to None.
scale (str) – Type of scaling (‘standardize’ or ‘normalize’). Defaults to ‘standardize’.
include_column_selector (bool) – Whether to include the column selector step. Defaults to True.
features_to_exclude (Optional[List[str]]) – ColumnSelector option, list of features to exclude from processing.
max_nan_frac_per_col (float) – ColumnSelector option, max fraction of NaN values allowed per column. Defaults to 0.05.
include_low_unique_value_filter (bool) – Whether to include the low unique value filter step. Defaults to True.
min_unique_value_count (int) – Minimum number of unique values for low unique value filter. Defaults to 2.
max_col_zero_frac (float) – Maximum fraction of zeros for low unique value filter. Defaults to 1.0.
include_duplicate_value_to_nan (bool) – Whether to include the duplicate value replacement step. Defaults to False.
value_to_replace (float) – Value to replace with NaN (if using duplicate value replacement). Defaults to 0.
n_max_duplicates (int) – Max number of consecutive duplicates to replace with NaN. Defaults to 144.
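To make the imputer_strategy and scale options concrete, here is a minimal plain-Python sketch of what ‘mean’ imputation followed by ‘standardize’ or ‘normalize’ scaling computes on a single column. The function names are hypothetical; the pipeline itself delegates to sklearn's SimpleImputer, StandardScaler and MinMaxScaler. The sketch assumes the column is not constant, so the divisions are well defined.

```python
import math
from statistics import fmean, pstdev

def impute_mean(values):
    """Replace NaN entries with the mean of the non-missing entries."""
    present = [v for v in values if not math.isnan(v)]
    fill = fmean(present)
    return [fill if math.isnan(v) else v for v in values]

def standardize(values):
    """Zero mean, unit variance (population standard deviation, as in StandardScaler)."""
    mean, std = fmean(values), pstdev(values)
    return [(v - mean) / std for v in values]

def normalize(values):
    """Rescale to the range [0, 1] (the default range of MinMaxScaler)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

column = [2.0, float("nan"), 4.0, 6.0]
imputed = impute_mean(column)        # NaN becomes the mean of 2, 4 and 6
standardized = standardize(imputed)
normalized = normalize(imputed)
print(imputed, standardized, normalized)
```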

Configuration example:

train:
  data_preprocessor:
    params:
      scale: normalize
      imputer_strategy: mean
      max_nan_frac_per_col: 0.05
      include_low_unique_value_filter: true
      min_unique_value_count: 2
      max_col_zero_frac: 0.99
      angles:
      - angle1
      - angle2
      features_to_exclude:
      - feature1
      - feature2
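For comparison, the same settings can be written as keyword arguments in Python. This is a sketch under the assumption that the entries under params map one-to-one onto the constructor arguments listed in the class signature. The import is guarded so the snippet also runs where the package is not installed.

```python
# Parameters from the configuration example, written as a Python dict.
params = {
    "scale": "normalize",
    "imputer_strategy": "mean",
    "max_nan_frac_per_col": 0.05,
    "include_low_unique_value_filter": True,
    "min_unique_value_count": 2,
    "max_col_zero_frac": 0.99,
    "angles": ["angle1", "angle2"],
    "features_to_exclude": ["feature1", "feature2"],
}

try:
    from energy_fault_detector.data_preprocessing.data_preprocessor import DataPreprocessor
except ImportError:
    # Package not available; the dict above still documents the settings.
    DataPreprocessor = None

if DataPreprocessor is not None:
    # Assumption: config params are passed through as keyword arguments.
    preprocessor = DataPreprocessor(**params)
```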

fit_transform(x, **kwargs)

Fits the pipeline on the input data and returns the transformed data.

Parameters: x (DataFrame) – Input DataFrame.
Return type: DataFrame
Returns: Transformed DataFrame with the same index as the input DataFrame.

inverse_transform(x, **kwargs)

Reverses the scaler and angle transforms applied to the data. Other transformations are not reversed.

Parameters: x (DataFrame) – The transformed data.
Return type: DataFrame
Returns: A DataFrame with the inverse transformed data.
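The angle transform is reversible because a sine/cosine pair identifies an angle uniquely. The sketch below shows the round trip for single values. It assumes angles are given in degrees and maps the result back to [0, 360); neither the unit nor the output range is confirmed by this page, and the function names are hypothetical.

```python
import math

def angle_to_sin_cos(angle_deg):
    """Encode an angle (assumed to be in degrees) as a (sine, cosine) pair."""
    rad = math.radians(angle_deg)
    return math.sin(rad), math.cos(rad)

def sin_cos_to_angle(sin_value, cos_value):
    """Recover the angle in degrees, mapped to [0, 360), using atan2."""
    return math.degrees(math.atan2(sin_value, cos_value)) % 360.0

angles = [0.0, 90.0, 225.0, 359.0]
recovered = [sin_cos_to_angle(*angle_to_sin_cos(a)) for a in angles]

# Each recovered angle matches the original up to floating-point error.
for original, back in zip(angles, recovered):
    print(original, round(back, 6))
```

Steps such as imputation or column removal discard information (the original NaN positions, the dropped columns), which is consistent with only the scaler and angle transforms being reversed.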

transform(x, **kwargs)

Transforms the input DataFrame using the pipeline.

Parameters: x (DataFrame) – Input DataFrame.
Return type: DataFrame
Returns: A DataFrame with the same index as the input DataFrame.