Exploratory Data Analysis

EDA module for exploratory analysis of synthetic and real 5G network datasets.

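This module contains functions to load data, preprocess it, visualize it, and perform feature selection using several strategies: Random Forest (RF) importance, Recursive Feature Elimination (RFE), RFE with cross-validation (RFECV), Sequential Feature Selection (SFS), and permutation importance.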

Functions in this module should be called explicitly from a main script or notebook.

eda.compute_class_weights(y)

Compute class weights for imbalanced classes.

Args
  • y (pd.Series): Target variable.

Returns
  • dict: Dictionary mapping class labels to their corresponding weights.
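
As an illustration of the kind of mapping this function returns, here is a minimal sketch of the common "balanced" heuristic, weight = n_samples / (n_classes * class_count). Whether eda.compute_class_weights uses exactly this formula is an assumption; the helper name balanced_class_weights and the example labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {label: weight} with weight = n_samples / (n_classes * count).

    Hypothetical stand-in for eda.compute_class_weights; the real
    function may compute its weights differently.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Illustrative labels only: three samples of one class, one of another.
weights = balanced_class_weights(["eMBB", "eMBB", "eMBB", "URLLC"])
print(weights)
```

Rare classes receive larger weights, so a classifier that accepts a class-weight dictionary penalizes their misclassification more heavily.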

eda.load_dataset(path)

Load dataset from CSV file.

Args
  • path (str): Path to the CSV file.

Returns
  • pd.DataFrame: Loaded dataset.

eda.load_maps(log_map_path='./json/log_map.json', app_map_path='./json/app_map.json', uc_map_path='./json/uc_map.json')

Load mapping dictionaries from JSON files.

Args
  • log_map_path (str): Path to the log type mapping JSON file.

  • app_map_path (str): Path to the application mapping JSON file.

  • uc_map_path (str): Path to the use case mapping JSON file.

Returns
  • tuple: A tuple containing three dictionaries:
    • log_map (dict): Mapping of log types to integers.

    • app_map (dict): Mapping of applications to integers.

    • uc_map (dict): Mapping of use cases to integers.
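
A rough sketch of loading three JSON mapping files into a tuple of dictionaries, in the order described above. The helper name load_json_maps and the file contents are assumptions made for the example; the actual keys in log_map.json, app_map.json and uc_map.json are not documented here.

```python
import json
import tempfile
from pathlib import Path


def load_json_maps(log_map_path, app_map_path, uc_map_path):
    """Read three JSON files and return their contents as a tuple of dicts.

    Hypothetical stand-in for eda.load_maps.
    """
    maps = []
    for path in (log_map_path, app_map_path, uc_map_path):
        with open(path, encoding="utf-8") as fh:
            maps.append(json.load(fh))
    return tuple(maps)


# Write small illustrative mapping files to a temporary directory.
example_maps = {
    "log_map.json": {"info": 0, "error": 1},
    "app_map.json": {"video": 0, "voice": 1},
    "uc_map.json": {"eMBB": 0, "URLLC": 1},
}
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, content in example_maps.items():
        file_path = Path(tmp) / name
        file_path.write_text(json.dumps(content), encoding="utf-8")
        paths.append(file_path)
    log_map, app_map, uc_map = load_json_maps(*paths)

print(log_map, app_map, uc_map)
```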

eda.permutation_importance_stable(X, y, selected_features, n_runs=10)

Calculate stable permutation importances over multiple runs.

Args
  • X (pd.DataFrame): Feature matrix.

  • y (pd.Series): Target variable.

  • selected_features (list): List of selected feature names.

  • n_runs (int): Number of repeated runs over which importances are aggregated. Defaults to 10.

Returns
  • pd.DataFrame: DataFrame containing the mean and standard deviation of the importances across runs.
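
One way such a multi-run estimate could be computed with scikit-learn is sketched below. The choice of model (a Random Forest), the train/test split, the per-run seeds, and the function name stable_permutation_importance are all assumptions; the module's actual procedure may differ. The data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split


def stable_permutation_importance(X, y, selected_features, n_runs=10):
    """Average permutation importances over n_runs differently seeded runs."""
    X_sel = X[selected_features]
    runs = []
    for seed in range(n_runs):
        # Assumption: a fresh split and a fresh model per run.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_sel, y, test_size=0.3, random_state=seed, stratify=y)
        model = RandomForestClassifier(n_estimators=50, random_state=seed)
        model.fit(X_tr, y_tr)
        result = permutation_importance(
            model, X_te, y_te, n_repeats=5, random_state=seed)
        runs.append(result.importances_mean)
    runs = np.array(runs)  # shape: (n_runs, n_features)
    summary = pd.DataFrame(
        {"mean": runs.mean(axis=0), "std": runs.std(axis=0)},
        index=selected_features)
    return summary.sort_values("mean", ascending=False)


# Synthetic stand-in data; column names f0..f4 are illustrative.
X_arr, y_arr = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
y = pd.Series(y_arr)
result = stable_permutation_importance(X, y, ["f0", "f1", "f2"], n_runs=3)
print(result)
```

A large standard deviation relative to the mean suggests that a feature's importance depends heavily on the particular split or seed.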

eda.preprocess_data(df, log_map, app_map, uc_map)

Preprocess dataset: fill NA, map strings to ints, scale numeric columns.

Args
  • df (pd.DataFrame): Input DataFrame to preprocess.

  • log_map (dict): Mapping of log types to integers.

  • app_map (dict): Mapping of applications to integers.

  • uc_map (dict): Mapping of use cases to integers.

Returns
  • tuple: A tuple containing:
    • X_scaled (np.ndarray): Scaled feature matrix.

    • X (pd.DataFrame): Original feature matrix.

    • y (pd.Series): Target variable.
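
The three preprocessing steps (fill missing values, map strings to integers, scale) might look roughly like the sketch below. The column names log_type, application, throughput and use_case, the fill value 0, and the use of StandardScaler are assumptions for illustration; the real dataset schema and scaler are not specified in this reference.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess_sketch(df, log_map, app_map, uc_map, target="use_case"):
    """Fill NA, map categorical strings to ints, and scale the features."""
    df = df.copy()
    # Assumption: missing numeric values are replaced with 0.
    num_cols = df.select_dtypes(include="number").columns
    df[num_cols] = df[num_cols].fillna(0)
    # Hypothetical column names for the three mappings.
    df["log_type"] = df["log_type"].map(log_map)
    df["application"] = df["application"].map(app_map)
    df[target] = df[target].map(uc_map)
    X = df.drop(columns=[target])
    y = df[target]
    X_scaled = StandardScaler().fit_transform(X)
    return X_scaled, X, y


# Tiny illustrative dataset with one missing numeric value.
df = pd.DataFrame({
    "log_type": ["info", "error", "info", "error"],
    "application": ["video", "voice", "video", "iot"],
    "throughput": [10.0, None, 30.0, 20.0],
    "use_case": ["eMBB", "URLLC", "eMBB", "URLLC"],
})
X_scaled, X, y = preprocess_sketch(
    df,
    log_map={"info": 0, "error": 1},
    app_map={"video": 0, "voice": 1, "iot": 2},
    uc_map={"eMBB": 0, "URLLC": 1},
)
print(X_scaled.shape, y.tolist())
```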

eda.random_forest_importance(X_scaled, X, y)

Train Random Forest and return feature importances.

Args
  • X_scaled (np.ndarray): Scaled feature matrix.

  • X (pd.DataFrame): Original feature matrix.

  • y (pd.Series): Target variable.

Returns
  • tuple: A tuple containing:
    • pd.Series: Feature importances sorted in descending order.

    • RandomForestClassifier: Trained Random Forest model.
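
A minimal sketch of how the two return values could be produced with scikit-learn: the model is fitted on the scaled array, and the column names of the unscaled DataFrame label the importances. The hyperparameters and the name rf_importance_sketch are assumptions, and the data is synthetic.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler


def rf_importance_sketch(X_scaled, X, y):
    """Fit a Random Forest and return (sorted importances, fitted model)."""
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_scaled, y)
    # X is used only for its column names.
    importances = pd.Series(rf.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False), rf


# Synthetic stand-in data; column names f0..f4 are illustrative.
X_arr, y_arr = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
y = pd.Series(y_arr)
X_scaled = StandardScaler().fit_transform(X)

importances, rf = rf_importance_sketch(X_scaled, X, y)
print(importances)
```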

eda.rfe_selection(X_scaled, y, X, rf)

Select features using Recursive Feature Elimination (RFE).

Args
  • X_scaled (np.ndarray): Scaled feature matrix.

  • y (pd.Series): Target variable.

  • X (pd.DataFrame): Original feature matrix.

  • rf (RandomForestClassifier): Trained Random Forest model.

Returns
  • pd.Series: Boolean mask indicating selected features.
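
The following sketch shows one way to obtain such a mask with scikit-learn's RFE class. The number of features to keep (3 here) and the helper name rfe_sketch are assumptions; the value used by eda.rfe_selection is not documented in this reference. RFE clones the estimator it is given, so passing an already trained model is harmless.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler


def rfe_sketch(X_scaled, y, X, rf, n_features_to_select=3):
    """Return a boolean mask (indexed by column name) of RFE-selected features."""
    selector = RFE(estimator=rf, n_features_to_select=n_features_to_select)
    selector.fit(X_scaled, y)
    return pd.Series(selector.support_, index=X.columns)


# Synthetic stand-in data; column names f0..f4 are illustrative.
X_arr, y_arr = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
y = pd.Series(y_arr)
X_scaled = StandardScaler().fit_transform(X)
rf = RandomForestClassifier(n_estimators=30, random_state=0)

rfe_mask = rfe_sketch(X_scaled, y, X, rf)
print(rfe_mask[rfe_mask].index.tolist())  # names of the kept features
```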

eda.rfecv_selection(X_scaled, y, X, rf)

Select features using RFECV: Recursive Feature Elimination with cross-validation.

Args
  • X_scaled (np.ndarray): Scaled feature matrix.

  • y (pd.Series): Target variable.

  • X (pd.DataFrame): Original feature matrix.

  • rf (RandomForestClassifier): Trained Random Forest model.

Returns
  • pd.Series: Boolean mask indicating selected features.
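
Unlike plain RFE, RFECV chooses the number of features to keep by cross-validated score. A minimal sketch with scikit-learn follows; the fold count, the accuracy scoring, and the name rfecv_sketch are assumptions rather than the module's documented settings.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler


def rfecv_sketch(X_scaled, y, X, rf):
    """Return a boolean mask of the features RFECV keeps."""
    selector = RFECV(
        estimator=rf,
        step=1,                          # drop one feature per iteration
        cv=StratifiedKFold(n_splits=3),  # assumed fold count
        scoring="accuracy",              # assumed metric
        min_features_to_select=1,
    )
    selector.fit(X_scaled, y)
    return pd.Series(selector.support_, index=X.columns)


# Synthetic stand-in data; column names f0..f4 are illustrative.
X_arr, y_arr = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
y = pd.Series(y_arr)
X_scaled = StandardScaler().fit_transform(X)
rf = RandomForestClassifier(n_estimators=30, random_state=0)

rfecv_mask = rfecv_sketch(X_scaled, y, X, rf)
print(rfecv_mask)
```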

eda.sfs_selection(X_scaled, y, X, rf)

Select features using a Sequential Feature Selector (SFS).

Args
  • X_scaled (np.ndarray): Scaled feature matrix.

  • y (pd.Series): Target variable.

  • X (pd.DataFrame): Original feature matrix.

  • rf (RandomForestClassifier): Trained Random Forest model.

Returns
  • pd.Series: Boolean mask indicating selected features.
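
A sketch of sequential selection using scikit-learn's SequentialFeatureSelector is shown below. Which SFS implementation the module relies on (scikit-learn or another library), the search direction, and the number of features to keep are all assumptions; sfs_sketch is a hypothetical name.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.preprocessing import StandardScaler


def sfs_sketch(X_scaled, y, X, rf, n_features_to_select=2):
    """Return a boolean mask of features chosen by forward selection."""
    selector = SequentialFeatureSelector(
        rf,
        n_features_to_select=n_features_to_select,
        direction="forward",  # add one feature at a time
        cv=3,                 # assumed fold count
    )
    selector.fit(X_scaled, y)
    return pd.Series(selector.get_support(), index=X.columns)


# Synthetic stand-in data; column names f0..f4 are illustrative.
X_arr, y_arr = make_classification(
    n_samples=150, n_features=5, n_informative=3, n_redundant=0,
    random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
y = pd.Series(y_arr)
X_scaled = StandardScaler().fit_transform(X)
rf = RandomForestClassifier(n_estimators=20, random_state=0)

sfs_mask = sfs_sketch(X_scaled, y, X, rf)
print(sfs_mask)
```

Forward selection refits the estimator for every candidate feature at every step, so it is usually the slowest of the three wrapper methods on wide datasets.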