Wednesday, November 11, 2020

Principal Component Analysis

 


PCA

Principal Component Analysis explains the variance-covariance structure of a set of variables through a small number of linear combinations of those variables. It is often used as a dimensionality-reduction technique.


Need for PCA in Machine Learning - 


PCA helps overcome feature redundancy in a dataset. It captures the directions of highest variance, retaining the information most useful for downstream models. It also makes high-dimensional data easier to visualize, decreases the complexity of the model, and increases computational efficiency.
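
The code snippets in the steps below assume a numeric pandas DataFrame named df; the original dataset is not shown in this post, so the setup below uses hypothetical stand-in data purely to make the snippets runnable.

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

# Hypothetical stand-in data: any all-numeric DataFrame will do
rng = np.random.default_rng(42)

df = pd.DataFrame(rng.normal(size=(100, 5)), columns=['x1', 'x2', 'x3', 'x4', 'x5'])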


Steps for PCA-


Step 1: Standardization of data


Before proceeding with PCA, we need to standardize the data. This is a crucial step because the original variables may be on very different scales; without rescaling, variables with large ranges would dominate the covariance analysis.

From the sklearn library, we can use the code below to standardize the data.


from sklearn.preprocessing import StandardScaler

# Rescale each column to zero mean and unit variance
df_std = StandardScaler().fit_transform(df)

df_std
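
As a quick sanity check, after standardization every column should have (approximately) zero mean and unit variance:

print(df_std.mean(axis=0).round(6))  # ~0 for every column

print(df_std.std(axis=0).round(6))  # ~1 for every column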


Step 2: Computing the covariance matrix of the standardized data


The covariance matrix captures how every pair of variables varies together. This helps us understand which variables are heavily dependent on each other and exposes redundancy in the dataset.


A negative entry in the matrix means the two variables are inversely related (one tends to decrease as the other increases); a positive entry means they are directly related (they tend to increase together).


# np.cov expects variables as rows, hence the transpose
df_cov_matrix = np.cov(df_std.T)

df_cov_matrix
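
Since the variables were standardized first, this covariance matrix is essentially the correlation matrix; the two differ only by the n/(n-1) factor that np.cov applies by default, as the quick check below illustrates.

# After undoing np.cov's n/(n-1) factor, the covariance of the
# standardized data matches the correlation matrix exactly
n = len(df_std)

print(np.allclose(df_cov_matrix * (n - 1) / n, np.corrcoef(df_std.T)))  # True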


Step 3: Calculating Eigenvectors and Eigenvalues of the covariance matrix


These two quantities are always computed as a pair, in what is known as Eigendecomposition, and they are what allow us to reduce the dimension of the space by compressing the data. The core of Principal Component Analysis is built on these values.


Each Eigenvector has a corresponding Eigenvalue, and the sum of all the Eigenvalues equals the overall variance in the dataset. Computing the Eigenvalues is important because they tell us where the maximum variance lies in the dataset.


# Eigendecomposition of the covariance matrix
# (np.linalg.eigh is a good alternative for symmetric matrices)
eig_vals, eig_vecs = np.linalg.eig(df_cov_matrix)

print('Eigenvectors \n%s' % eig_vecs)

print('\nEigenvalues \n%s' % eig_vals)
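
A useful sanity check: the Eigenvalues should add up to the total variance in the data, which is the trace of the covariance matrix.

print(np.isclose(eig_vals.sum(), np.trace(df_cov_matrix)))  # True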

Step 4: Sorting the Eigenvalues list in decreasing order

After completing the Eigendecomposition, we order the Eigenvalues in descending order; the largest value is the most significant, and its Eigenvector forms our first principal component.


eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]

# np.linalg.eig does not guarantee any ordering, so sort the pairs
# from largest to smallest Eigenvalue
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)

print('Eigenvalues in descending order:')

for i in eig_pairs:

    print(i[0])

Step 5: Selecting the number of Principal Components

The first principal component captures the largest share of the variance in the original variables, the second principal component captures the next largest share, and so on.


total = sum(eig_vals)

# Percentage of variance explained by each component, largest first
var_exp = [(i / total) * 100 for i in sorted(eig_vals, reverse=True)]

cum_var_exp = np.cumsum(var_exp)

print("Variance captured by each component is \n", var_exp)

print("Cumulative variance captured as we travel with each component \n", cum_var_exp)
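
A common rule of thumb is to keep enough components to reach a target share of cumulative variance; the 90% cutoff below is just an illustrative choice, not something fixed by the method.

# Index of the first component at which cumulative variance reaches 90%
n_keep = int(np.searchsorted(cum_var_exp, 90) + 1)

print(n_keep, 'components explain at least 90% of the variance')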


We can visualize the same through the scree plot below, which shows the cumulative sum of the explained variance ratio.

from sklearn.decomposition import PCA

# Fit PCA with all components to inspect the cumulative explained variance
pca = PCA().fit(df_std)

plt.plot(np.cumsum(pca.explained_variance_ratio_))

plt.xlabel('No of components')

plt.ylabel('Cumulative explained variance')

plt.show()
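
In the scree plot, look for the point where the curve starts to flatten (the "elbow"); components beyond it contribute little additional variance.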




Step 6: Creating Principal Components

From the steps above, we have determined that three components are enough to capture most of the variance in our dataset.

from sklearn.decomposition import PCA

pca = PCA(n_components=3)

pcs = pca.fit_transform(df_std)

# Use a list (not a set) for the column names so their order is preserved
df_new = pd.DataFrame(data=pcs, columns=['PC1', 'PC2', 'PC3'])

df_new['target'] = df1['Brand']  # re-attach the target from the original (unshown) dataset

df_new.head()
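
The fitted PCA object also lets us relate the components back to the original variables: each row of pca.components_ holds the weights of the linear combination that forms one principal component.

print(pca.explained_variance_ratio_)  # variance share captured by PC1, PC2, PC3

print(pca.components_)  # one row of weights per principal component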



This PCA technique can be a great help if you are dealing with multicollinearity issues in your dataset, since the resulting components are uncorrelated with one another.
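
A short check makes the multicollinearity point concrete: the covariance matrix of the principal components is (numerically) diagonal, meaning the components are uncorrelated with each other.

pc_cov = np.cov(pcs.T)

print(np.round(pc_cov, 6))  # off-diagonal entries are ~0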
