energy_fault_detector.data_preprocessing.data_preprocessor

Generic class for building a preprocessing pipeline.

class DataPreprocessor(angles=None, imputer_strategy='mean', imputer_fill_value=None, scale='standardize', include_column_selector=True, features_to_exclude=None, max_nan_frac_per_col=0.05, include_low_unique_value_filter=True, min_unique_value_count=2, max_col_zero_frac=1.0, include_duplicate_value_to_nan=False, value_to_replace=0, n_max_duplicates=144, duplicate_features_to_exclude=None)

Bases: Pipeline, SaveLoadMixin

A data preprocessing pipeline that allows for configurable steps based on the extended pipeline.

  1. (optional) Replace any consecutive duplicate zero-values (or another value) with NaN. This step should be used if 0 can also represent missing values in the data.

  2. (optional) Column selection: A ColumnSelector object filters out columns/features with too many NaN values.

  3. (optional) Features containing angles are transformed to sine/cosine values.

  4. (optional) Low unique value filter: Remove columns/features with a low number of unique values or a high fraction of zeros. The high fraction of zeros setting should be used if 0 can also represent missing values in the data.

  5. Imputation with sklearn’s SimpleImputer

  6. Scaling: Apply either sklearn’s StandardScaler or MinMaxScaler.
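
To illustrate why step 3 encodes angles as sine/cosine pairs, here is a minimal plain-Python sketch. It is not the library's implementation: the helper names `encode_angle` and `encoded_distance` are hypothetical, and the sketch assumes angles are given in degrees.

```python
import math
from typing import Tuple


def encode_angle(degrees: float) -> Tuple[float, float]:
    """Map an angle in degrees (assumed unit) to its (sine, cosine) pair."""
    radians = math.radians(degrees)
    return math.sin(radians), math.cos(radians)


def encoded_distance(a: float, b: float) -> float:
    """Euclidean distance between two angles in the encoded (sine, cosine) space."""
    sin_a, cos_a = encode_angle(a)
    sin_b, cos_b = encode_angle(b)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)


# 359 and 1 degrees differ by 358 as raw numbers, but only by 2 degrees on the circle.
near = encoded_distance(359.0, 1.0)
# Opposite directions end up as far apart as the encoding allows.
far = encoded_distance(0.0, 180.0)
```

In the encoded space the wrap-around at 0/360 disappears, so nearby directions get nearby feature values.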

Parameters:
  • angles (Optional[List[str]]) – List of angle features for transformation. Defaults to None. If none provided (or empty list), this step is skipped.

  • imputer_strategy (str) – Strategy for imputation (‘mean’, ‘median’, ‘most_frequent’, ‘constant’). Defaults to ‘mean’.

  • imputer_fill_value (Optional[int]) – Value to fill for imputation (if imputer_strategy==’constant’).

  • scale (str) – Type of scaling (‘standardize’ or ‘normalize’). Defaults to ‘standardize’.

  • include_column_selector (bool) – Whether to include the column selector step. Defaults to True.

  • features_to_exclude (Optional[List[str]]) – ColumnSelector option, list of features to exclude from processing.

  • max_nan_frac_per_col (float) – ColumnSelector option, max fraction of NaN values allowed per column. Defaults to 0.05.

  • include_low_unique_value_filter (bool) – Whether to include the low unique value filter step. Defaults to True.

  • min_unique_value_count (int) – Minimum number of unique values for low unique value filter. Defaults to 2.

  • max_col_zero_frac (float) – Maximum fraction of zeroes for low unique value filter. Defaults to 1.0.

  • include_duplicate_value_to_nan (bool) – Whether to include the duplicate value replacement step. Defaults to False.

  • value_to_replace (float) – Value to replace with NaN (if using duplicate value replacement). Defaults to 0.

  • n_max_duplicates (int) – Max number of consecutive duplicates to replace with NaN. Defaults to 144.
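
The column-filter thresholds above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's code: `select_columns` is a hypothetical helper, it merges the column selector and the low unique value filter into one pass (the pipeline runs them as separate steps), and the strict comparisons at the thresholds are an assumption.

```python
import math
from typing import Dict, List


def select_columns(
    columns: Dict[str, List[float]],
    max_nan_frac_per_col: float = 0.05,
    min_unique_value_count: int = 2,
    max_col_zero_frac: float = 1.0,
) -> List[str]:
    """Return the names of the columns that pass all three thresholds."""
    kept = []
    for name, values in columns.items():
        if not values:
            continue  # an empty column carries no information
        present = [v for v in values if not math.isnan(v)]
        nan_frac = (len(values) - len(present)) / len(values)
        if nan_frac > max_nan_frac_per_col:
            continue  # too many missing values
        if len(set(present)) < min_unique_value_count:
            continue  # (nearly) constant column
        zero_frac = sum(v == 0 for v in present) / len(values)
        if zero_frac > max_col_zero_frac:
            continue  # zeros dominate, possibly encoding missing data
        kept.append(name)
    return kept


nan = float("nan")
columns = {
    "power": [1.0, 2.0, 3.0, 4.0],
    "mostly_missing": [nan, nan, 1.0, 2.0],  # 50% NaN
    "constant": [5.0, 5.0, 5.0, 5.0],        # one unique value
    "mostly_zero": [0.0, 0.0, 0.0, 1.0],     # 75% zeros
}
kept_default = select_columns(columns)
kept_strict = select_columns(columns, max_col_zero_frac=0.5)
```

With the default `max_col_zero_frac` of 1.0 the zero-fraction check never removes a column; lowering it only makes sense when 0 can stand for a missing value.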

Configuration example:

train:
  data_preprocessor:
    params:
      scale: normalize
      imputer_strategy: mean
      max_nan_frac_per_col: 0.05
      include_low_unique_value_filter: true
      min_unique_value_count: 2
      max_col_zero_frac: 0.99
      angles:
      - angle1
      - angle2
      features_to_exclude:
      - feature1
      - feature2

fit_transform(x, **kwargs)

Fit the model and transform with the final estimator.

Parameters:

x (DataFrame) – Input DataFrame.

Return type:

DataFrame

Returns:

Transformed DataFrame with the same index as the input DataFrame.

inverse_transform(x, **kwargs)

Reverses the scaler and angle transforms applied to the data. Other transformations are not reversed.

Parameters:

x (DataFrame) – The transformed data.

Return type:

DataFrame

Returns:

A DataFrame with the inverse transformed data.
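
The two reversible steps can be sketched in plain Python as follows. This is a hedged illustration, not the library's code: the helper names are hypothetical, and recovering angles in degrees within [0, 360) is an assumption.

```python
import math


def standardize(value: float, mean: float, std: float) -> float:
    """Forward scaling: subtract the mean, divide by the standard deviation."""
    return (value - mean) / std


def unstandardize(scaled: float, mean: float, std: float) -> float:
    """Inverse scaling: undo the division, then the subtraction."""
    return scaled * std + mean


def decode_angle(sine: float, cosine: float) -> float:
    """Recover an angle in degrees, in [0, 360), from its (sine, cosine) pair."""
    return math.degrees(math.atan2(sine, cosine)) % 360.0


# Round trip for a scaled value.
restored_value = unstandardize(standardize(12.5, mean=10.0, std=2.0), mean=10.0, std=2.0)

# Round trip for an angle: atan2 returns -110 degrees, the modulo maps it back to 250.
angle = math.radians(250.0)
restored_angle = decode_angle(math.sin(angle), math.cos(angle))
```

Dropped columns and imputed values fall under the other transformations, so they do not return to their original state after the inverse.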

transform(x, **kwargs)

Transforms the input DataFrame using the pipeline.

Parameters:

x (DataFrame) – Input DataFrame.

Return type:

DataFrame

Returns:

A DataFrame with the same index as the input DataFrame.
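
As a rough sketch of how fitting and transforming relate for steps 5 and 6, the plain-Python class below learns a mean and standard deviation once and reuses them for new data. `MeanImputeStandardize` is a hypothetical stand-in, not part of the library; it assumes the 'mean' imputation strategy, 'standardize' scaling with the population standard deviation, and a single column.

```python
import math
from statistics import fmean, pstdev
from typing import List


class MeanImputeStandardize:
    """Single-column stand-in for mean imputation followed by standardization."""

    def fit(self, values: List[float]) -> "MeanImputeStandardize":
        present = [v for v in values if not math.isnan(v)]
        self.mean_ = fmean(present)
        # The scaler sees the imputed column, so compute the spread after filling.
        filled = [self.mean_ if math.isnan(v) else v for v in values]
        self.std_ = pstdev(filled) or 1.0  # avoid dividing by zero on constant data
        return self

    def transform(self, values: List[float]) -> List[float]:
        # Reuse the statistics learned in fit; nothing is re-estimated here.
        filled = [self.mean_ if math.isnan(v) else v for v in values]
        return [(v - self.mean_) / self.std_ for v in filled]


train = [1.0, 2.0, float("nan"), 3.0]
scaler = MeanImputeStandardize().fit(train)

# New data is filled and scaled with the training mean (2.0) and spread.
new_scaled = scaler.transform([2.0, float("nan"), 4.0])
```

Missing values in new data are filled with the training mean, so they map to 0 after scaling.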