Feature Engineering: The Art of Creating Better Input Features
Feature engineering is often the difference between a mediocre model and an exceptional one. It is the process of using domain knowledge to construct input features that make machine learning algorithms work better.
Why Feature Engineering Matters
The quality of features directly impacts model performance: you can think of performance as a function of data quality, feature quality, algorithm choice, and hyperparameters, and in practice feature quality is often the term with the most leverage.
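As a quick illustration (a toy sketch with synthetic data, not from the original post): a plain linear regression cannot capture a quadratic relationship from the raw feature alone, but adding a single engineered squared feature lets the same model fit it almost perfectly.

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a quadratic relationship: y = x^2 + noise
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(500, 1))
y = x[:, 0] ** 2 + rng.normal(0, 0.1, size=500)

# Raw feature only: the linear model fits poorly
print(LinearRegression().fit(x, y).score(x, y))          # R^2 near 0

# Add an engineered squared feature: the same model fits well
x_eng = np.hstack([x, x ** 2])
print(LinearRegression().fit(x_eng, y).score(x_eng, y))  # R^2 near 1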
Common Feature Engineering Techniques
1. Numerical Transformations
Log Transformations
For skewed distributions, log transformation can help:
import numpy as np
import pandas as pd
from scipy.stats import boxcox

# Original skewed feature; log1p handles zeros
df['income_log'] = np.log1p(df['income'])

# Box-Cox transformation for normality (requires strictly positive values)
df['sales_boxcox'], lambda_param = boxcox(df['sales'] + 1)
Scaling and Normalization
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Standard scaling (z-score normalization)
scaler = StandardScaler()
df['age_scaled'] = scaler.fit_transform(df[['age']])

# Min-Max scaling to [0, 1]
minmax = MinMaxScaler()
df['salary_normalized'] = minmax.fit_transform(df[['salary']])

# Robust scaling (median and IQR, less sensitive to outliers)
robust = RobustScaler()
df['score_robust'] = robust.fit_transform(df[['score']])
2. Categorical Encoding
One-Hot Encoding
# Traditional one-hot encoding with pandas
df_encoded = pd.get_dummies(df, columns=['category'], prefix='cat')

# Or using sklearn
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded = encoder.fit_transform(df[['category']])
Target Encoding
def target_encode(df, cat_col, target_col, smoothing=10):
    """Target encoding with smoothing to prevent overfitting on rare categories."""
    # Global mean of the target
    global_mean = df[target_col].mean()

    # Per-category count and mean
    cat_stats = df.groupby(cat_col)[target_col].agg(['count', 'mean'])

    # Smoothed encoding: shrink rare categories toward the global mean
    smoothed_mean = (
        cat_stats['count'] * cat_stats['mean'] + smoothing * global_mean
    ) / (cat_stats['count'] + smoothing)

    return df[cat_col].map(smoothed_mean)

df['category_encoded'] = target_encode(df, 'category', 'target')
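One caveat: computing the encoding on the same rows it will be used to train on leaks the target. A common remedy, sketched below under the same column-name assumptions as above, is to compute the encoding out of fold so each row is encoded using statistics from the other folds only.

from sklearn.model_selection import KFold

def target_encode_oof(df, cat_col, target_col, smoothing=10, n_splits=5):
    """Out-of-fold target encoding: each row is encoded without its own target value."""
    encoded = pd.Series(index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(df):
        train_fold = df.iloc[train_idx]
        global_mean = train_fold[target_col].mean()
        stats = train_fold.groupby(cat_col)[target_col].agg(['count', 'mean'])
        smoothed = (stats['count'] * stats['mean'] + smoothing * global_mean) / (stats['count'] + smoothing)
        # Unseen categories in the validation fold fall back to the global mean
        encoded.iloc[val_idx] = df.iloc[val_idx][cat_col].map(smoothed).fillna(global_mean).values
    return encoded

df['category_encoded_oof'] = target_encode_oof(df, 'category', 'target')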
3. Polynomial Features
Create interaction terms and polynomial features:
from sklearn.preprocessing import PolynomialFeatures

# Create polynomial and interaction features up to degree 2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Manual interaction features
df['age_income_interaction'] = df['age'] * df['income']
df['education_experience'] = df['education_years'] * df['experience_years']
4. Time-Based Features
For temporal data, extract meaningful time components:
def create_time_features(df, date_col):
    """Extract comprehensive time-based features."""
    df[date_col] = pd.to_datetime(df[date_col])

    # Basic time components
    df['year'] = df[date_col].dt.year
    df['month'] = df[date_col].dt.month
    df['day'] = df[date_col].dt.day
    df['dayofweek'] = df[date_col].dt.dayofweek
    df['hour'] = df[date_col].dt.hour

    # Cyclical encoding for seasonal patterns
    df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
    df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
    df['dayofweek_sin'] = np.sin(2 * np.pi * df['dayofweek'] / 7)
    df['dayofweek_cos'] = np.cos(2 * np.pi * df['dayofweek'] / 7)

    # Business day vs weekend
    df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int)

    return df
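To see why the sine/cosine pair helps: a quick check (values computed directly from the formulas above, not from the original post) shows that December (12) and January (1) land close together on the encoded circle, whereas the raw month numbers put them 11 apart.

# December and January are adjacent in (sin, cos) space
for month in (12, 1, 6):
    s = np.sin(2 * np.pi * month / 12)
    c = np.cos(2 * np.pi * month / 12)
    print(month, round(s, 2), round(c, 2))
# 12 -> (-0.0, 1.0), 1 -> (0.5, 0.87), 6 -> (0.0, -1.0)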
5. Binning and Discretization
# Equal-width binning
df['age_binned'] = pd.cut(df['age'], bins=5,
                          labels=['young', 'adult', 'middle', 'senior', 'elderly'])

# Quantile-based binning
df['income_quartiles'] = pd.qcut(df['income'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Custom binning with domain knowledge
age_bins = [0, 18, 25, 35, 50, 65, 100]
age_labels = ['child', 'young_adult', 'adult', 'middle_aged', 'senior', 'elderly']
df['age_category'] = pd.cut(df['age'], bins=age_bins, labels=age_labels)
Advanced Techniques
1. Feature Selection
Mutual information measures how much knowing a feature reduces uncertainty about the target: for a feature X and target Y, I(X; Y) = Σ p(x, y) · log[ p(x, y) / (p(x) p(y)) ], summed over their joint values. Features with the highest scores are kept:
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

# Mutual information feature selection: keep the 10 most informative features
selector = SelectKBest(score_func=mutual_info_regression, k=10)
X_selected = selector.fit_transform(X, y)

# Recursive Feature Elimination with a random forest
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rfe = RFE(estimator=rf, n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)
2. Creating Lag Features
For time series data:
def create_lag_features(df, feature_cols, lags=[1, 2, 3, 7, 14]):
    """Create lag features for time series."""
    for col in feature_cols:
        for lag in lags:
            df[f'{col}_lag_{lag}'] = df[col].shift(lag)

    # Rolling statistics over a 7-period window
    for col in feature_cols:
        df[f'{col}_rolling_mean_7'] = df[col].rolling(window=7).mean()
        df[f'{col}_rolling_std_7'] = df[col].rolling(window=7).std()

    return df
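Because shift() and rolling() leave NaNs in the earliest rows, typical usage (here assuming a hypothetical sales_df sorted chronologically with a 'sales' column) drops those rows before training:

# sales_df and 'sales' are illustrative names, not from the original post
sales_df = create_lag_features(sales_df, ['sales'])
sales_df = sales_df.dropna()  # the first max(lags) rows have no lag values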
Feature Engineering Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

def create_feature_pipeline():
    """Create a comprehensive feature engineering pipeline."""
    # Numerical features: impute with the median, then standardize
    numeric_features = ['age', 'income', 'score']
    numeric_transformer = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])

    # Categorical features: impute a 'missing' label, then one-hot encode
    categorical_features = ['category', 'region']
    categorical_transformer = Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(drop='first', sparse_output=False))
    ])

    # Combine transformers
    preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

    return preprocessor

# Usage: fit on the training data only, then transform
pipeline = create_feature_pipeline()
X_processed = pipeline.fit_transform(X_train)
Best Practices
- Start simple: Begin with basic transformations before complex features
- Domain knowledge: Leverage subject matter expertise
- Iterative process: Feature engineering is rarely a one-time activity
- Validation: Always validate features on holdout data, fitting every transformation on the training split only (see the sketch after this list)
- Documentation: Keep track of feature engineering decisions
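A minimal sketch of the validation point, assuming a generic numeric X and target y (names not from the original post): fit each feature transformation on the training fold only, apply the already-fitted transformation to the holdout fold, and judge the features by the holdout score rather than the training score.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the transformation on the training split only...
scaler = StandardScaler().fit(X_train)

# ...then apply the already-fitted transformation to the holdout split
model = Ridge().fit(scaler.transform(X_train), y_train)
print("holdout R^2:", model.score(scaler.transform(X_test), y_test))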
Measuring Feature Importance
# Feature importance from a Random Forest
rf = RandomForestRegressor()
rf.fit(X_train, y_train)

importance_df = pd.DataFrame({
    'feature': feature_names,   # feature_names: the column names of X_train
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

# Permutation importance for more reliable estimates
from sklearn.inspection import permutation_importance

perm_importance = permutation_importance(rf, X_test, y_test, n_repeats=10)
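The permutation result can be turned into the same kind of ranked table (a short follow-up under the same feature_names assumption as above):

perm_df = pd.DataFrame({
    'feature': feature_names,
    'importance_mean': perm_importance.importances_mean,
    'importance_std': perm_importance.importances_std
}).sort_values('importance_mean', ascending=False)
print(perm_df.head(10))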
Conclusion
Feature engineering remains one of the most impactful skills in machine learning. While automated feature engineering tools exist, domain expertise and creative thinking still play crucial roles in creating features that capture the underlying patterns in your data.
Remember: garbage in, garbage out. Great features can make even simple algorithms perform exceptionally well!