Exploratory Data Analysis

EDA module for exploratory analysis of synthetic and real 5G network datasets.

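This module contains functions to load data, preprocess it, visualize it, and perform feature selection using several strategies: Random Forest (RF) importance, Recursive Feature Elimination (RFE), RFE with cross-validation (RFECV), Sequential Feature Selection (SFS), and permutation importance.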

Functions in this module should be called explicitly from a main script or notebook.

eda.compute_class_weights(y)

Compute class weights for imbalanced classes.

Args

y (pd.Series): Target variable.

Returns

dict: Dictionary mapping class labels to their corresponding weights.
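As an illustration of what such a dictionary can look like, the sketch below applies the common "balanced" heuristic, n_samples / (n_classes * class_count). This is an assumption: the weighting scheme eda.compute_class_weights actually uses is not stated here, and balanced_class_weights is a hypothetical name.

```python
from collections import Counter


def balanced_class_weights(y):
    """Hypothetical helper: weight each class by n_samples / (n_classes * count).

    Accepts any iterable of labels, including a pd.Series.
    """
    counts = Counter(y)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Three samples of class 0 and one of class 1: the rare class gets the larger weight.
weights = balanced_class_weights([0, 0, 0, 1])
print(weights)
```

A dictionary of this shape can be passed as the class_weight argument of scikit-learn classifiers such as RandomForestClassifier.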

eda.load_dataset(path)

Load dataset from CSV file.

Return type: DataFrame

Args

path (str): Path to the CSV file.

Returns

pd.DataFrame: Loaded dataset.
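A minimal sketch of the loading step, assuming it is a thin wrapper around pandas.read_csv (not confirmed here). The column names and values are made up, and an in-memory buffer stands in for a file on disk.

```python
import io

import pandas as pd

# Made-up CSV content; a real call would pass a file path instead of a buffer.
csv_text = "latency_ms,throughput_mbps,label\n12.5,300.0,0\n48.1,85.2,1\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
print(df.dtypes)
```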

eda.load_maps(log_map_path='./json/log_map.json', app_map_path='./json/app_map.json', uc_map_path='./json/uc_map.json')

Load mapping dictionaries from JSON files.

Args

log_map_path (str): Path to the log type mapping JSON file.
app_map_path (str): Path to the application mapping JSON file.
uc_map_path (str): Path to the use case mapping JSON file.

Returns

tuple: A tuple containing three dictionaries:
- log_map (dict): Mapping of log types to integers.
- app_map (dict): Mapping of applications to integers.
- uc_map (dict): Mapping of use cases to integers.
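The sketch below shows one way three JSON mapping files could be read into a tuple of dictionaries. It is a hypothetical stand-in (load_maps_sketch), not the module's code, and the file contents are invented for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_maps_sketch(log_map_path, app_map_path, uc_map_path):
    """Hypothetical stand-in: read three JSON files and return them as dicts."""
    maps = []
    for path in (log_map_path, app_map_path, uc_map_path):
        with open(path, encoding="utf-8") as fh:
            maps.append(json.load(fh))
    return tuple(maps)


# Write three tiny mapping files (invented contents) to a temporary directory.
tmp_dir = Path(tempfile.mkdtemp())
examples = {
    "log_map.json": {"info": 0, "error": 1},
    "app_map.json": {"video": 0, "voice": 1},
    "uc_map.json": {"eMBB": 0, "URLLC": 1},
}
for name, content in examples.items():
    (tmp_dir / name).write_text(json.dumps(content), encoding="utf-8")

log_map, app_map, uc_map = load_maps_sketch(
    tmp_dir / "log_map.json",
    tmp_dir / "app_map.json",
    tmp_dir / "uc_map.json",
)
print(log_map, app_map, uc_map)
```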

eda.permutation_importance_stable(X, y, selected_features, n_runs=10)

Calculate stable permutation importances over multiple runs.

Args

X (pd.DataFrame): Feature matrix.
y (pd.Series): Target variable.
selected_features (list): List of selected feature names.
n_runs (int): Number of repeated runs over which importances are aggregated.

Returns

pd.DataFrame: DataFrame containing mean and std of importances.
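The idea of averaging permutation importances over several runs can be sketched as follows. The estimator, the train/test split, and the per-run seeding are assumptions, permutation_importance_sketch is a hypothetical name, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split


def permutation_importance_sketch(X, y, selected_features, n_runs=10):
    """Hypothetical sketch: repeat fit + permutation importance with different seeds."""
    rows = []
    for seed in range(n_runs):
        X_train, X_test, y_train, y_test = train_test_split(
            X[selected_features], y, test_size=0.3, random_state=seed, stratify=y
        )
        model = RandomForestClassifier(n_estimators=25, random_state=seed)
        model.fit(X_train, y_train)
        result = permutation_importance(
            model, X_test, y_test, n_repeats=5, random_state=seed
        )
        rows.append(result.importances_mean)
    scores = np.vstack(rows)  # shape: (n_runs, n_features)
    summary = pd.DataFrame(
        {"mean": scores.mean(axis=0), "std": scores.std(axis=0)},
        index=selected_features,
    )
    return summary.sort_values("mean", ascending=False)


# Synthetic data with made-up feature names.
features, target = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
names = [f"f{i}" for i in range(5)]
X = pd.DataFrame(features, columns=names)
y = pd.Series(target)

summary = permutation_importance_sketch(X, y, names, n_runs=3)
print(summary)
```

A large std relative to the mean suggests that a feature's ranking depends on the particular split or seed and should be read with caution.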

eda.preprocess_data(df, log_map, app_map, uc_map)

Preprocess the dataset: fill missing values, map string categories to integers, and scale numeric columns.

Args

df (pd.DataFrame): Input DataFrame to preprocess.
log_map (dict): Mapping of log types to integers.
app_map (dict): Mapping of applications to integers.
uc_map (dict): Mapping of use cases to integers.

Returns

tuple: A tuple containing:
- X_scaled (np.ndarray): Scaled feature matrix.
- X (pd.DataFrame): Original feature matrix.
- y (pd.Series): Target variable.
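The three steps can be sketched as below. The column names (log_type, application, use_case, latency_ms), the choice of use_case as the target, the fill value of 0, and the use of StandardScaler are all assumptions made for illustration; preprocess_sketch is a hypothetical name.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess_sketch(df, log_map, app_map, uc_map):
    """Hypothetical sketch: fill NA, map strings to ints, scale the features."""
    df = df.copy()
    # 1. Fill missing numeric values (the fill value 0 is an assumption).
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(0)
    # 2. Map string categories to integers using the supplied dictionaries.
    df["log_type"] = df["log_type"].map(log_map)
    df["application"] = df["application"].map(app_map)
    df["use_case"] = df["use_case"].map(uc_map)
    # 3. Split off the (assumed) target and scale the remaining columns.
    y = df["use_case"]
    X = df.drop(columns=["use_case"])
    X_scaled = StandardScaler().fit_transform(X)
    return X_scaled, X, y


raw = pd.DataFrame(
    {
        "log_type": ["info", "error", "info", "error"],
        "application": ["video", "voice", "video", "voice"],
        "latency_ms": [10.0, None, 30.0, 20.0],
        "use_case": ["eMBB", "URLLC", "eMBB", "URLLC"],
    }
)
X_scaled, X, y = preprocess_sketch(
    raw,
    log_map={"info": 0, "error": 1},
    app_map={"video": 0, "voice": 1},
    uc_map={"eMBB": 0, "URLLC": 1},
)
print(X_scaled.shape, list(X.columns), y.tolist())
```

Returning both X_scaled and X keeps the column names available alongside the scaled array, which matches how the selection functions in this module accept both.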

eda.random_forest_importance(X_scaled, X, y)

Train Random Forest and return feature importances.

Args

X_scaled (np.ndarray): Scaled feature matrix.
X (pd.DataFrame): Original feature matrix.
y (pd.Series): Target variable.

Returns

tuple: A tuple containing:
- pd.Series: Feature importances sorted in descending order.
- RandomForestClassifier: Trained Random Forest model.
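A minimal sketch of this step with scikit-learn, on synthetic data. The hyperparameters are assumptions, and the reading that X is passed only to supply column names for the result is an inference from the signature, not something the module states.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed dataset (made-up feature names).
features, target = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(features, columns=[f"f{i}" for i in range(6)])
y = pd.Series(target)
X_scaled = StandardScaler().fit_transform(X)

# Train on the scaled array; use the DataFrame only to label the importances.
rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X_scaled, y)
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(
    ascending=False
)
print(importances)
```

Impurity-based importances from a Random Forest are normalized to sum to 1, so each value can be read as a share of the total.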

eda.rfe_selection(X_scaled, y, X, rf)

Recursive Feature Elimination.

Args

X_scaled (np.ndarray): Scaled feature matrix.
y (pd.Series): Target variable.
X (pd.DataFrame): Original feature matrix.
rf (RandomForestClassifier): Trained Random Forest model.

Returns

pd.Series: Boolean mask indicating selected features.
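A sketch of Recursive Feature Elimination with scikit-learn's RFE that produces a boolean mask indexed by feature name. The number of features to keep (3) and the forest settings are assumptions, and the data is synthetic.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler

features, target = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(features, columns=[f"f{i}" for i in range(6)])
y = pd.Series(target)
X_scaled = StandardScaler().fit_transform(X)

# RFE refits the estimator repeatedly, dropping the weakest feature each round.
rf = RandomForestClassifier(n_estimators=25, random_state=0)
selector = RFE(estimator=rf, n_features_to_select=3, step=1)
selector.fit(X_scaled, y)

mask = pd.Series(selector.support_, index=X.columns)
print(mask[mask].index.tolist())  # names of the retained features
```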

eda.rfecv_selection(X_scaled, y, X, rf)

Recursive Feature Elimination with cross-validation (RFECV).

Args

X_scaled (np.ndarray): Scaled feature matrix.
y (pd.Series): Target variable.
X (pd.DataFrame): Original feature matrix.
rf (RandomForestClassifier): Trained Random Forest model.

Returns

pd.Series: Boolean mask indicating selected features.
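Unlike plain RFE, RFECV chooses the number of features to keep by cross-validated score. The sketch below uses scikit-learn's RFECV; the fold count, scoring metric, and forest settings are assumptions, and the data is synthetic.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

features, target = make_classification(
    n_samples=150, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(features, columns=[f"f{i}" for i in range(6)])
y = pd.Series(target)
X_scaled = StandardScaler().fit_transform(X)

rf = RandomForestClassifier(n_estimators=20, random_state=0)
selector = RFECV(
    estimator=rf,
    step=1,
    cv=StratifiedKFold(n_splits=3),
    scoring="accuracy",
    min_features_to_select=1,
)
selector.fit(X_scaled, y)

mask = pd.Series(selector.support_, index=X.columns)
print(selector.n_features_, mask[mask].index.tolist())
```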

eda.sfs_selection(X_scaled, y, X, rf)

Sequential Feature Selector.

Args

X_scaled (np.ndarray): Scaled feature matrix.
y (pd.Series): Target variable.
X (pd.DataFrame): Original feature matrix.
rf (RandomForestClassifier): Trained Random Forest model.

Returns

pd.Series: Boolean mask indicating selected features.
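A sketch of sequential selection using scikit-learn's SequentialFeatureSelector. Whether the module uses this class or another implementation is not stated here; forward direction, two selected features, and three folds are assumptions chosen to keep the example fast, and the data is synthetic.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.preprocessing import StandardScaler

features, target = make_classification(
    n_samples=150, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(features, columns=[f"f{i}" for i in range(5)])
y = pd.Series(target)
X_scaled = StandardScaler().fit_transform(X)

# Forward selection: start empty and add the feature that most improves the CV score.
rf = RandomForestClassifier(n_estimators=10, random_state=0)
selector = SequentialFeatureSelector(
    rf, n_features_to_select=2, direction="forward", cv=3
)
selector.fit(X_scaled, y)

mask = pd.Series(selector.get_support(), index=X.columns)
print(mask[mask].index.tolist())
```

Sequential selection scores candidate subsets by cross-validation rather than by the model's own importances, so it is typically slower than RFE but does not depend on feature_importances_.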