What are Correlated Features?
Before we delve into how to deal with correlated features, we first need to understand what they are. In the context of machine learning, features are the input variables we use to predict our target variable. Correlation, then, is a statistical measure that describes the degree of relationship between two variables.
Correlated features, therefore, are input variables that have a strong correlation or relationship with one another. This might mean they share similar information, or one might be a derivative of the other.
For example, in a dataset about houses, you might have the features ‘number of rooms’ and ‘total area’. These two features could be correlated because houses with more rooms usually have a larger total area.
Statistical Definitions of Correlations
Correlation is a statistical technique that can show whether, and how strongly, pairs of variables are related. In this section, we’ll cover the definitions and formulas for three types of correlations: Pearson correlation, Spearman correlation, and Kendall correlation. Each correlation method is used under different conditions and measures different types of relationships.
Pearson Correlation
Pearson correlation measures the linear relationship between two datasets. The Pearson correlation coefficient (also known as Pearson’s r) is a measure of the strength and direction of association that exists between two continuous variables.
The formula for Pearson correlation is:

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

where:

- $x_i$, $y_i$ are the individual sample points indexed with $i$
- $\bar{x}$, $\bar{y}$ are the means of $x$ and $y$
- $\sum_{i=1}^{n}$ denotes the sum from $i = 1$ to $n$
Python implementation:
```python
def pearson_correlation(x, y):
    # Sample means of x and y
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Covariance term (numerator) and product of the standard deviations (denominator)
    num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    den = (sum((xi - mean_x) ** 2 for xi in x) * sum((yi - mean_y) ** 2 for yi in y)) ** 0.5
    if den == 0:
        return 0
    return num / den
```
The Pearson correlation coefficient is a value between -1 and 1. A value closer to 1 indicates a strong positive linear relationship, a value closer to -1 indicates a strong negative linear relationship, and a value near 0 indicates little or no linear correlation.
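As a quick check, here is how the function behaves on a couple of small, made-up series (the values are purely illustrative):

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]   # perfectly linear in x
z = [5, 3, 4, 1, 2]    # roughly decreasing with x

print(pearson_correlation(x, y))  # 1.0  (perfect positive linear relationship)
print(pearson_correlation(x, z))  # -0.8 (strong negative relationship)
```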
Spearman Correlation
Spearman’s rank correlation is a non-parametric test that is used to measure the degree of association between two variables. It’s based on the ranked values for each variable rather than the raw data. Spearman correlation is often used when the data is not normally distributed.
The formula for the Spearman correlation is:

$$
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}
$$

where:

- $d_i$ = difference between the ranks of corresponding values of the two variables
- $n$ = number of observations
Python implementation:
```python
def spearman_correlation(x, y):
    # Rank each value by its position in the sorted list
    # (note: this simple ranking, and the formula below, assume there are no tied values)
    x_rank = [sorted(x).index(value) for value in x]
    y_rank = [sorted(y).index(value) for value in y]
    n = len(x_rank)
    # Sum of squared differences between the ranks
    d_sq = sum((x_rank[i] - y_rank[i]) ** 2 for i in range(n))
    return 1 - (6 * d_sq) / (n * (n ** 2 - 1))
```
Like the Pearson correlation, Spearman correlation also results in a value between -1 and 1.
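To see how it differs from Pearson correlation, consider a relationship that is monotonic but not linear (again with made-up values):

```python
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]   # y = x**3: monotonic, but not linear

print(pearson_correlation(x, y))   # about 0.94 (strong, but not perfect)
print(spearman_correlation(x, y))  # 1.0 (the ranks agree exactly)
```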
Kendall Correlation
Kendall’s tau correlation coefficient is another non-parametric correlation method used to measure the degree of correspondence between two rankings. It’s used when the data is ordinal, which means the values can be sorted and ranked.
The formula for the Kendall correlation is:

$$
\tau = \frac{n_c - n_d}{\tfrac{1}{2}\, n (n - 1)}
$$

where:

- $n_c$ = number of concordant pairs
- $n_d$ = number of discordant pairs
- $n$ = number of observations
Python implementation:
```python
def kendall_correlation(x, y):
    n = len(x)
    n_c = n_d = 0  # counts of concordant and discordant pairs
    # Compare every pair of observations (i, j)
    for i in range(n):
        for j in range(i + 1, n):
            if (x[i] - x[j]) * (y[i] - y[j]) > 0:
                n_c += 1  # differences have the same sign: concordant
            elif (x[i] - x[j]) * (y[i] - y[j]) < 0:
                n_d += 1  # differences have opposite signs: discordant
    # With no ties, n_c + n_d equals n * (n - 1) / 2
    if (n_c + n_d) == 0:
        return 0
    return (n_c - n_d) / (n_c + n_d)
```
Again, like both Pearson and Spearman correlation, Kendall correlation results in a value between -1 and 1.
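As a small illustration with made-up ordinal scores:

```python
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]   # mostly, but not perfectly, in the same order

print(kendall_correlation(x, y))  # 0.6 (8 concordant pairs, 2 discordant)
```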
In the next sections, we will use these formulas to calculate the correlation between our features in a machine learning context.
Why are Correlated Features a Problem?
Correlated features can pose several problems in machine learning:
- Overfitting: If features are highly correlated, the model may end up considering the same information twice. This redundancy can lead to overfitting, where the model performs well on the training data but poorly on unseen data.

- Multicollinearity: In linear regression and other algorithms that use weight coefficients, multicollinearity (a situation where two or more predictors are highly correlated) can lead to unstable and unreliable estimates of the model coefficients, as the sketch after this list illustrates.

- Interpretability: Correlated features can make it difficult to interpret a model’s output. If two features are correlated, it’s hard to determine which one is contributing more to the prediction.
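To make the multicollinearity point concrete, here is a minimal sketch, assuming NumPy is available; the data is synthetic and purely illustrative. Because `x2` is almost a copy of `x1`, very different coefficient vectors produce nearly identical predictions, which is exactly why the fitted coefficients are unstable:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)    # x2 is almost a copy of x1
y = 3 * x1 + rng.normal(scale=0.1, size=n)  # the "true" relationship uses only x1

# Ordinary least squares on both features (no intercept, for brevity)
X = np.column_stack([x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # the individual coefficients need not be near (3, 0),
             # although their sum will be close to 3

# Very different coefficient vectors give nearly identical predictions,
# so the data cannot pin down how credit is split between x1 and x2.
pred_a = X @ np.array([3.0, 0.0])
pred_b = X @ np.array([10.0, -7.0])
print(np.max(np.abs(pred_a - pred_b)))  # small relative to the spread of y
```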
How to Identify Correlated Features
You can identify correlations with plain Python. The usual tool is a correlation matrix, which shows the correlation coefficient for every pair of features.
Here’s a simple example of how to generate a correlation matrix using Python’s standard library:
```python
def correlation_matrix(data):
    # `data` is a list of features, where each feature is a list of
    # observed values (one column per feature).
    n = len(data)  # number of features
    corr_matrix = [[0 for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            corr_matrix[i][j] = correlation(data[i], data[j])
    return corr_matrix

def correlation(x, y):
    # Pearson correlation via the computational (sums-based) formula
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_x_sq = sum(xi * xi for xi in x)
    sum_y_sq = sum(yi * yi for yi in y)
    prod_sum = sum(xi * yi for xi, yi in zip(x, y))
    numerator = prod_sum - (sum_x * sum_y / n)
    denominator = ((sum_x_sq - sum_x**2 / n) * (sum_y_sq - sum_y**2 / n)) ** 0.5
    if denominator == 0:
        return 0
    return numerator / denominator
```
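For example, on a small, entirely made-up version of the house dataset from earlier (number of rooms, total area in square metres, and age in years), the matrix makes the strong rooms/area correlation obvious:

```python
rooms = [3, 4, 2, 5, 4, 3]
area  = [80, 110, 55, 150, 120, 90]   # larger houses tend to have more rooms
age   = [30, 5, 12, 42, 8, 25]

matrix = correlation_matrix([rooms, area, age])
for row in matrix:
    print([round(value, 2) for value in row])
# The rooms/area entry is close to 1, flagging those two features as highly correlated.
```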
Handling Correlated Features
1. Feature Selection
The most straightforward way to handle correlated features is to keep one and remove the others. This method is known as feature selection. There are several strategies to do this:
- Remove based on domain knowledge: If you have a strong understanding of the dataset, you might be able to decide which feature to keep based on your knowledge of the subject.

- Remove based on correlation coefficient: You could calculate the correlation coefficient between each pair of features. If the coefficient is above a certain threshold (like 0.8), you remove one of the features. A sketch of this threshold-based approach follows below.
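Here is a minimal sketch of the threshold-based strategy, reusing the `correlation_matrix` helper and the made-up house data from the previous section (the 0.8 cutoff and the `feature_names` argument are illustrative choices, not fixed rules):

```python
def drop_correlated_features(data, feature_names, threshold=0.8):
    # `data` is a list of feature columns; greedily drop the later feature
    # of any pair whose absolute correlation exceeds the threshold.
    matrix = correlation_matrix(data)
    n = len(data)
    to_drop = set()
    for i in range(n):
        for j in range(i + 1, n):
            if i not in to_drop and j not in to_drop and abs(matrix[i][j]) > threshold:
                to_drop.add(j)
    kept = [k for k in range(n) if k not in to_drop]
    return [data[k] for k in kept], [feature_names[k] for k in kept]

kept_data, kept_names = drop_correlated_features(
    [rooms, area, age], ["rooms", "area", "age"]
)
print(kept_names)  # ['rooms', 'age'] -- area dropped: highly correlated with rooms
```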
2. Dimensionality Reduction
Dimensionality reduction is a method that reduces the number of features in your dataset by creating new features that capture the most important information from the old features.
The most common dimensionality reduction technique is Principal Component Analysis (PCA). It transforms the data into a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
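As a rough illustration of the idea (not a production implementation), here is a small PCA sketch assuming NumPy is available. It centres the data, diagonalizes the covariance matrix, and projects onto the top components:

```python
import numpy as np

def pca(X, n_components):
    # X: array of shape (n_samples, n_features)
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)       # ascending order
    order = np.argsort(eigenvalues)[::-1][:n_components]  # largest variance first
    components = eigenvectors[:, order]
    return X_centered @ components

# Two highly correlated features collapse almost entirely onto one component.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + rng.normal(scale=0.05, size=200)])
print(pca(X, n_components=1).shape)   # (200, 1)
```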
3. Regularization
Regularization is a technique used to prevent overfitting, which can be caused by correlated features. It works by adding a penalty on the model’s parameters, which constrains the model and in turn reduces overfitting.
For instance, in linear models, L1 regularization (as used in Lasso regression) can help handle multicollinearity by forcing some coefficient estimates to be exactly zero, effectively dropping redundant, highly correlated features.
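As a brief sketch (assuming scikit-learn and NumPy are installed; the data is the same kind of synthetic, highly correlated pair used above), Lasso will typically zero out one of two near-duplicate features, whereas ordinary least squares splits the weight between them unpredictably:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)    # near-duplicate of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=200)

print(LinearRegression().fit(X, y).coef_)  # weight split unpredictably across x1, x2
print(Lasso(alpha=0.1).fit(X, y).coef_)    # typically one coefficient driven to zero
```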
Conclusion
In this post, we’ve discussed what correlated features are, why they can be problematic, and how to identify and handle them. Each approach has its strengths and weaknesses, and the best approach to use will depend on the specifics of your dataset and problem. Understanding these methods will help you make more effective use of your data and create more accurate and reliable machine learning models.