Exploratory Data Analysis
EDA module for exploratory analysis of synthetic and real 5G network datasets.
It provides functions to load, preprocess, and visualize data, and to select features using several strategies: Random Forest (RF) importance, Recursive Feature Elimination (RFE), RFE with cross-validation (RFECV), Sequential Feature Selection (SFS), and permutation importance.
Functions in this module should be called explicitly from a main script or notebook.
- eda.compute_class_weights(y)
Compute class weights for imbalanced classes.
- Args
y (pd.Series): Target variable.
- Returns
dict: Dictionary mapping class labels to their corresponding weights.
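The entry above does not say which weighting scheme is used. A minimal sketch, assuming scikit-learn's "balanced" heuristic (weight = n_samples / (n_classes * class_count)); the function name `compute_class_weights_sketch` is hypothetical and is not the module's implementation:

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight


def compute_class_weights_sketch(y: pd.Series) -> dict:
    """Hypothetical stand-in for eda.compute_class_weights."""
    classes = np.unique(y)
    # Assumption: "balanced" weighting; rarer classes receive larger weights.
    weights = compute_class_weight(
        class_weight="balanced", classes=classes, y=y.to_numpy()
    )
    return dict(zip(classes.tolist(), weights.tolist()))


# Three samples of class 0 and one of class 1.
y = pd.Series([0, 0, 0, 1])
weights = compute_class_weights_sketch(y)
print(weights)
```

With this toy target, the minority class 1 gets weight 2.0 and the majority class 0 gets about 0.67. A dictionary in this shape can be passed as `class_weight` to many scikit-learn classifiers.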
- eda.load_dataset(path)
Load a dataset from a CSV file.
- Args
path (str): Path to the CSV file.
- Returns
pd.DataFrame: Loaded dataset.
- eda.load_maps(log_map_path='./json/log_map.json', app_map_path='./json/app_map.json', uc_map_path='./json/uc_map.json')
Load mapping dictionaries from JSON files.
- Args
log_map_path (str): Path to the log type mapping JSON file.
app_map_path (str): Path to the application mapping JSON file.
uc_map_path (str): Path to the use case mapping JSON file.
- Returns
- tuple: A tuple containing three dictionaries:
log_map (dict): Mapping of log types to integers.
app_map (dict): Mapping of applications to integers.
uc_map (dict): Mapping of use cases to integers.
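The contents of the mapping files are not shown in this reference. The sketch below assumes each file holds a flat JSON object of string keys to integer codes; the example keys and the name `load_maps_sketch` are invented for illustration:

```python
import json
import tempfile
from pathlib import Path


def load_maps_sketch(log_map_path, app_map_path, uc_map_path):
    """Hypothetical stand-in for eda.load_maps: read three JSON objects."""
    maps = []
    for path in (log_map_path, app_map_path, uc_map_path):
        with open(path, encoding="utf-8") as fh:
            maps.append(json.load(fh))
    return tuple(maps)


# Invented example contents; the real files may use different keys.
example = {
    "log_map.json": {"info": 0, "warning": 1},
    "app_map.json": {"video": 0, "voice": 1},
    "uc_map.json": {"eMBB": 0, "URLLC": 1},
}

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, content in example.items():
        p = Path(tmp) / name
        p.write_text(json.dumps(content), encoding="utf-8")
        paths.append(p)
    # The tuple order follows the documented return: log, app, use case.
    log_map, app_map, uc_map = load_maps_sketch(*paths)

print(log_map, app_map, uc_map)
```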
- eda.permutation_importance_stable(X, y, selected_features, n_runs=10)
Calculate stable permutation importances over multiple runs.
- Args
X (pd.DataFrame): Feature matrix.
y (pd.Series): Target variable.
selected_features (list): List of selected feature names.
n_runs (int): Number of repeated runs over which importances are aggregated. Defaults to 10.
- Returns
pd.DataFrame: DataFrame containing mean and std of importances.
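One way to obtain run-to-run stability is to refit a model with a different seed on each run and aggregate the permutation importances. The sketch below assumes a Random Forest estimator and scikit-learn's `permutation_importance`; the estimator choice, the seeds, and the name `permutation_importance_stable_sketch` are assumptions, not the module's code:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance


def permutation_importance_stable_sketch(X, y, selected_features, n_runs=10):
    """Hypothetical stand-in: mean/std of permutation importance over runs."""
    per_run = []
    for seed in range(n_runs):
        # Assumption: a fresh Random Forest per run, seeded differently.
        model = RandomForestClassifier(n_estimators=20, random_state=seed)
        model.fit(X[selected_features], y)
        result = permutation_importance(
            model, X[selected_features], y, n_repeats=3, random_state=seed
        )
        per_run.append(result.importances_mean)
    runs = np.vstack(per_run)  # shape: (n_runs, n_features)
    summary = pd.DataFrame(
        {"mean": runs.mean(axis=0), "std": runs.std(axis=0)},
        index=selected_features,
    )
    return summary.sort_values("mean", ascending=False)


# Toy data: only "signal" determines the label.
rng = np.random.default_rng(0)
X = pd.DataFrame({"signal": rng.normal(size=200), "noise": rng.normal(size=200)})
y = (X["signal"] > 0).astype(int)

summary = permutation_importance_stable_sketch(X, y, ["signal", "noise"], n_runs=3)
print(summary)
```

A small standard deviation relative to the mean suggests the ranking of a feature is not an artifact of one particular seed.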
- eda.preprocess_data(df, log_map, app_map, uc_map)
Preprocess the dataset: fill missing values, map string categories to integers, and scale numeric columns.
- Args
df (pd.DataFrame): Input DataFrame to preprocess.
log_map (dict): Mapping of log types to integers.
app_map (dict): Mapping of applications to integers.
uc_map (dict): Mapping of use cases to integers.
- Returns
- tuple: A tuple containing:
X_scaled (np.ndarray): Scaled feature matrix.
X (pd.DataFrame): Original feature matrix.
y (pd.Series): Target variable.
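The three steps named above can be sketched as follows. The column names (`log_type`, `app`, `use_case`, `throughput`), the choice of `use_case` as target, filling with 0, and `StandardScaler` are all assumptions made for illustration; the reference does not specify them:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess_sketch(df, log_map, app_map, uc_map):
    """Hypothetical stand-in for eda.preprocess_data."""
    df = df.copy()
    # Step 1: fill missing numeric values (assumption: fill with 0).
    numeric_cols = df.select_dtypes("number").columns
    df[numeric_cols] = df[numeric_cols].fillna(0)
    # Step 2: map string categories to integer codes.
    df["log_type"] = df["log_type"].map(log_map)
    df["app"] = df["app"].map(app_map)
    df["use_case"] = df["use_case"].map(uc_map)
    # Assumption: the use case column is the prediction target.
    y = df["use_case"]
    X = df.drop(columns=["use_case"])
    # Step 3: scale features to zero mean and unit variance.
    X_scaled = StandardScaler().fit_transform(X)
    return X_scaled, X, y


df = pd.DataFrame(
    {
        "log_type": ["info", "warning", "info"],
        "app": ["video", "voice", "video"],
        "use_case": ["eMBB", "URLLC", "eMBB"],
        "throughput": [10.0, None, 30.0],
    }
)
X_scaled, X, y = preprocess_sketch(
    df,
    log_map={"info": 0, "warning": 1},
    app_map={"video": 0, "voice": 1},
    uc_map={"eMBB": 0, "URLLC": 1},
)
print(X_scaled.shape, list(y))
```

Returning both `X_scaled` and `X` keeps the column names available, since the scaled array no longer carries them.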
- eda.random_forest_importance(X_scaled, X, y)
Train Random Forest and return feature importances.
- Args
X_scaled (np.ndarray): Scaled feature matrix.
X (pd.DataFrame): Original feature matrix.
y (pd.Series): Target variable.
- Returns
- tuple: A tuple containing:
pd.Series: Feature importances sorted in descending order.
RandomForestClassifier: Trained Random Forest model.
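A minimal sketch of this step, assuming impurity-based importances from scikit-learn's `feature_importances_` attribute; the hyperparameters and the name `random_forest_importance_sketch` are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler


def random_forest_importance_sketch(X_scaled, X, y):
    """Hypothetical stand-in for eda.random_forest_importance."""
    rf = RandomForestClassifier(n_estimators=50, random_state=0)
    rf.fit(X_scaled, y)
    # X supplies the feature names that the scaled array has lost.
    importances = pd.Series(rf.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False), rf


# Toy data: only "signal" determines the label.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    {
        "signal": rng.normal(size=200),
        "noise_a": rng.normal(size=200),
        "noise_b": rng.normal(size=200),
    }
)
y = (X["signal"] > 0).astype(int)
X_scaled = StandardScaler().fit_transform(X)

importances, rf = random_forest_importance_sketch(X_scaled, X, y)
print(importances)
```

The returned model can then be passed as the `rf` argument of the selection functions documented below.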
- eda.rfe_selection(X_scaled, y, X, rf)
Select features with Recursive Feature Elimination (RFE).
- Args
X_scaled (np.ndarray): Scaled feature matrix.
y (pd.Series): Target variable.
X (pd.DataFrame): Original feature matrix.
rf (RandomForestClassifier): Trained Random Forest model.
- Returns
pd.Series: Boolean mask indicating selected features.
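A sketch of how such a mask could be produced with scikit-learn's `RFE`. The number of features to keep is not stated in this reference, so the `n_features` parameter here is a hypothetical addition, as is the function name:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler


def rfe_selection_sketch(X_scaled, y, X, rf, n_features=1):
    """Hypothetical stand-in for eda.rfe_selection."""
    # RFE clones rf and repeatedly drops the least important feature.
    selector = RFE(estimator=rf, n_features_to_select=n_features)
    selector.fit(X_scaled, y)
    return pd.Series(selector.support_, index=X.columns)


# Toy data: only "signal" determines the label.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    {
        "signal": rng.normal(size=200),
        "noise_a": rng.normal(size=200),
        "noise_b": rng.normal(size=200),
    }
)
y = (X["signal"] > 0).astype(int)
X_scaled = StandardScaler().fit_transform(X)
rf = RandomForestClassifier(n_estimators=20, random_state=0)

mask = rfe_selection_sketch(X_scaled, y, X, rf)
print(mask)
```

The mask can be applied as `X.loc[:, mask]` to keep only the selected columns.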
- eda.rfecv_selection(X_scaled, y, X, rf)
Select features with RFECV (Recursive Feature Elimination with cross-validation).
- Args
X_scaled (np.ndarray): Scaled feature matrix.
y (pd.Series): Target variable.
X (pd.DataFrame): Original feature matrix.
rf (RandomForestClassifier): Trained Random Forest model.
- Returns
pd.Series: Boolean mask indicating selected features.
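Unlike plain RFE, RFECV chooses the number of features itself by scoring each candidate subset with cross-validation. A minimal sketch using scikit-learn's `RFECV`; the fold count, scoring metric, and function name are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.preprocessing import StandardScaler


def rfecv_selection_sketch(X_scaled, y, X, rf):
    """Hypothetical stand-in for eda.rfecv_selection."""
    # Assumption: 3-fold cross-validation scored by accuracy.
    selector = RFECV(estimator=rf, step=1, cv=3, scoring="accuracy")
    selector.fit(X_scaled, y)
    return pd.Series(selector.support_, index=X.columns)


# Toy data: only "signal" determines the label.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    {
        "signal": rng.normal(size=200),
        "noise_a": rng.normal(size=200),
        "noise_b": rng.normal(size=200),
    }
)
y = (X["signal"] > 0).astype(int)
X_scaled = StandardScaler().fit_transform(X)
rf = RandomForestClassifier(n_estimators=20, random_state=0)

mask = rfecv_selection_sketch(X_scaled, y, X, rf)
print(mask)
```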
- eda.sfs_selection(X_scaled, y, X, rf)
Select features with a Sequential Feature Selector (SFS).
- Args
X_scaled (np.ndarray): Scaled feature matrix.
y (pd.Series): Target variable.
X (pd.DataFrame): Original feature matrix.
rf (RandomForestClassifier): Trained Random Forest model.
- Returns
pd.Series: Boolean mask indicating selected features.
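The reference does not say which SFS implementation or search direction is used. The sketch below assumes scikit-learn's `SequentialFeatureSelector` in forward mode; the `n_features` parameter, the fold count, and the function name are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.preprocessing import StandardScaler


def sfs_selection_sketch(X_scaled, y, X, rf, n_features=1):
    """Hypothetical stand-in for eda.sfs_selection."""
    # Forward selection: start empty and greedily add the feature
    # that most improves the cross-validated score.
    selector = SequentialFeatureSelector(
        rf, n_features_to_select=n_features, direction="forward", cv=3
    )
    selector.fit(X_scaled, y)
    return pd.Series(selector.get_support(), index=X.columns)


# Toy data: only "signal" determines the label.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    {
        "signal": rng.normal(size=200),
        "noise_a": rng.normal(size=200),
        "noise_b": rng.normal(size=200),
    }
)
y = (X["signal"] > 0).astype(int)
X_scaled = StandardScaler().fit_transform(X)
rf = RandomForestClassifier(n_estimators=20, random_state=0)

mask = sfs_selection_sketch(X_scaled, y, X, rf)
print(mask)
```

Because SFS scores subsets by model performance rather than by importance values, it can disagree with RFE; comparing the masks from both is a cheap sanity check.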