energy_fault_detector.data_preprocessing.data_preprocessor
Generic class for building a preprocessing pipeline.
- class DataPreprocessor(angles=None, imputer_strategy='mean', imputer_fill_value=None, scale='standardize', include_column_selector=True, features_to_exclude=None, max_nan_frac_per_col=0.05, include_low_unique_value_filter=True, min_unique_value_count=2, max_col_zero_frac=1.0, include_duplicate_value_to_nan=False, value_to_replace=0, n_max_duplicates=144, duplicate_features_to_exclude=None)
Bases: Pipeline, SaveLoadMixin
A data preprocessing pipeline that allows for configurable steps based on the extended pipeline.
The pipeline is built from the following steps:
- (optional) Duplicate value replacement: replace consecutive duplicate zero-values (or another value) with NaN. Use this step if 0 can also represent missing values in the data.
- (optional) Column selection: a ColumnSelector object filters out columns/features with too many NaN values.
- (optional) Angle transformation: features containing angles are transformed to sine/cosine values.
- (optional) Low unique value filter: remove columns/features with a low number of unique values or a high fraction of zeroes. Use the zero-fraction setting if 0 can also represent missing values in the data.
- Imputation with sklearn’s SimpleImputer.
- Scaling: apply either sklearn’s StandardScaler or MinMaxScaler.
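The two filtering steps can be illustrated with a small dependency-free sketch. It is not the library's ColumnSelector or low unique value filter; the helper names (`nan_fraction`, `zero_fraction`, `n_unique`, `filter_columns`) are hypothetical, and whether the real thresholds are strict or inclusive is an assumption. Only the default values are taken from the class signature.

```python
import math


def nan_fraction(values):
    """Fraction of entries that are NaN (assumes a non-empty column)."""
    return sum(1 for v in values if math.isnan(v)) / len(values)


def zero_fraction(values):
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in values if v == 0) / len(values)


def n_unique(values):
    """Number of distinct non-NaN values."""
    return len({v for v in values if not math.isnan(v)})


def filter_columns(table, max_nan_frac=0.05, min_unique=2, max_zero_frac=1.0):
    """Keep columns that pass the NaN, unique-value and zero-fraction checks.

    `table` maps column names to lists of floats. The comparison
    operators below are assumptions, not the library's documented behavior.
    """
    kept = {}
    for name, values in table.items():
        if nan_fraction(values) > max_nan_frac:
            continue  # too many missing values
        if n_unique(values) < min_unique:
            continue  # (nearly) constant column
        if zero_fraction(values) > max_zero_frac:
            continue  # mostly zeroes, possibly disguised missing values
        kept[name] = values
    return kept


nan = float("nan")
table = {
    "power": [1.0, 2.0, 3.0, 4.0],
    "mostly_missing": [nan, nan, 1.0, 2.0],
    "constant": [5.0, 5.0, 5.0, 5.0],
}
kept = filter_columns(table)
```

With the default thresholds, only the `power` column survives: `mostly_missing` exceeds the NaN fraction and `constant` has a single unique value.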
- Parameters:
angles (Optional[List[str]]) – List of angle features for transformation. Defaults to None. If none are provided (or the list is empty), this step is skipped.
imputer_strategy (str) – Strategy for imputation (‘mean’, ‘median’, ‘most_frequent’, ‘constant’). Defaults to ‘mean’.
imputer_fill_value (Optional[int]) – Value to fill for imputation (if imputer_strategy==’constant’).
scale (str) – Type of scaling (‘standardize’ or ‘normalize’). Defaults to ‘standardize’.
include_column_selector (bool) – Whether to include the column selector step. Defaults to True.
features_to_exclude (Optional[List[str]]) – ColumnSelector option, list of features to exclude from processing.
max_nan_frac_per_col (float) – ColumnSelector option, max fraction of NaN values allowed per column. Defaults to 0.05.
include_low_unique_value_filter (bool) – Whether to include the low unique value filter step. Defaults to True.
min_unique_value_count (int) – Minimum number of unique values for low unique value filter. Defaults to 2.
max_col_zero_frac (float) – Maximum fraction of zeroes for low unique value filter. Defaults to 1.0.
include_duplicate_value_to_nan (bool) – Whether to include the duplicate value replacement step. Defaults to False.
value_to_replace (float) – Value to replace with NaN (if using duplicate value replacement). Defaults to 0.
n_max_duplicates (int) – Max number of consecutive duplicates to replace with NaN. Defaults to 144.
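To make the two `scale` options concrete, the sketch below shows what sklearn's StandardScaler and MinMaxScaler compute per column with their default settings. It assumes that ‘standardize’ selects StandardScaler and ‘normalize’ selects MinMaxScaler, which the names suggest but the text above does not state explicitly. The functions `standardize` and `normalize` are illustrative helpers, not library API, and they ignore the constant-column case that sklearn handles separately.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Zero mean, unit variance (population std, as StandardScaler uses)."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]


def normalize(values):
    """Rescale to the range [0, 1], MinMaxScaler's default feature range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


values = [2.0, 4.0, 6.0, 8.0]
standardized = standardize(values)
normalized = normalize(values)
```

Standardizing keeps the shape of the distribution but is sensitive to outliers through the mean and standard deviation; normalizing bounds the training data to a fixed range, at the cost of being driven by the two extreme values.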
Configuration example:
    train:
      data_preprocessor:
        params:
          scale: normalize
          imputer_strategy: mean
          max_nan_frac_per_col: 0.05
          include_low_unique_value_filter: true
          min_unique_value_count: 2
          max_col_zero_frac: 0.99
          angles:
            - angle1
            - angle2
          features_to_exclude:
            - feature1
            - feature2
- fit_transform(x, **kwargs)
Fit the model and transform with the final estimator.
- Parameters:
x (DataFrame) – Input DataFrame.
- Return type:
DataFrame
- Returns:
Transformed DataFrame with the same index as the input DataFrame.
- inverse_transform(x, **kwargs)
Reverses the scaler and angle transforms applied to the data. Other transformations are not reversed.
- Parameters:
x (DataFrame) – The transformed data.
- Return type:
DataFrame
- Returns:
A DataFrame with the inverse transformed data.
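The angle transform is one of the two steps that can be reversed, because a sine/cosine pair determines the angle uniquely via `atan2`. The following is a minimal sketch of that round trip, not the library's implementation: the function names are hypothetical, and the use of degrees and of a [0, 360) output range are assumptions.

```python
import math


def angle_to_sin_cos(angle_deg):
    """Encode an angle (assumed to be in degrees) as a (sin, cos) pair."""
    rad = math.radians(angle_deg)
    return math.sin(rad), math.cos(rad)


def sin_cos_to_angle(sin_value, cos_value):
    """Recover the angle in degrees, mapped to [0, 360)."""
    return math.degrees(math.atan2(sin_value, cos_value)) % 360.0


angles = [0.0, 90.0, 180.0, 359.0]
encoded = [angle_to_sin_cos(a) for a in angles]
recovered = [sin_cos_to_angle(s, c) for s, c in encoded]
```

Steps such as column removal or imputation discard information (dropped columns, the positions of the original NaN values), which is consistent with them not being reversed.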
- transform(x, **kwargs)
Transforms the input DataFrame using the pipeline.
- Parameters:
x (DataFrame) – Input DataFrame.
- Return type:
DataFrame
- Returns:
A DataFrame with the same index as the input DataFrame.
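The difference between `fit_transform` and `transform` follows the usual scikit-learn convention: statistics are learned once during fitting and then reused on new data. The toy class below illustrates this for mean imputation of a single column. `MeanImputer` is a hypothetical stand-in written for this example, not SimpleImputer or any part of DataPreprocessor.

```python
import math
from statistics import fmean


class MeanImputer:
    """Toy single-column imputer showing the fit/transform split."""

    def fit(self, column):
        # Learn the fill value from the non-missing training entries.
        self.mean_ = fmean([v for v in column if not math.isnan(v)])
        return self

    def transform(self, column):
        # Reuse the learned mean; nothing is re-estimated here.
        return [self.mean_ if math.isnan(v) else v for v in column]

    def fit_transform(self, column):
        return self.fit(column).transform(column)


nan = float("nan")
imputer = MeanImputer()
train_out = imputer.fit_transform([1.0, nan, 3.0])
new_out = imputer.transform([nan, 10.0])
```

The missing value in the new data is filled with the training mean (2.0), not with a statistic of the new data, so data seen at prediction time is processed consistently with the training data.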