What Is Principal Component Analysis and Its Application in Ischemic Heart Disease

Definition: Ischemic means that an organ (e.g., the heart) is not getting enough blood and oxygen. Ischemic heart disease, also called coronary heart disease (CHD) or coronary artery disease, is the term given to heart problems caused by narrowing of the coronary arteries that supply blood to the heart muscle.

What is Principal component analysis (PCA)?

  1. It is a statistical technique that applies when the analyst is dealing with a single set of variables that are correlated with each other and wants to discover which subsets of correlated variables combine into coherent factors, in such a way that the newly formed factors are uncorrelated with each other.
  2. Here, a set of correlated variables means variables that correlate with each other but are largely uncorrelated with the other subsets of variables, which combine to form other factors.
  3. The basic idea is that these newly formed factors drive the underlying process that makes the variables in the data set correlate with each other. The specific goals of component analysis are to summarize patterns of correlations among observed variables, to reduce a large number of observed variables to a smaller number of factors, and to provide an operational definition (a regression equation) for an underlying process by using observed variables.
  4. Since the number of factors is usually far smaller than the number of observed variables, the analysis offers considerable parsimony. Furthermore, when factor scores are estimated for each subject, they are often more reliable than scores on individual observed variables.

Steps in principal component analysis:

  1. Selecting a data set consisting of a set of variables.
  2. Preparing the correlation matrix (to perform principal component analysis).
  3. Extracting a set of factors from the correlation matrix.
  4. Determining the number of factors.
  5. (Probably) Rotating the factors to increase interpretability and, finally, interpreting the results.
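The steps above can be sketched in a few lines of Python. This is only a minimal illustration, assuming NumPy is available; the function name pca_from_correlation and the synthetic data are hypothetical and are not the study's data or software. The optional rotation in step 5 is left out, and the eigenvalue-greater-than-1 rule is used for step 4 because that is the rule applied later in this post.

```python
import numpy as np

def pca_from_correlation(data):
    """Apply steps 2-4 to an (n_subjects, n_variables) array."""
    # Step 2: correlation matrix of the variables (columns).
    corr = np.corrcoef(data, rowvar=False)
    # Step 3: extract components as eigenvectors of the correlation matrix.
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    # eigh returns ascending order; put the largest eigenvalue first.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Step 4: keep the components whose eigenvalue exceeds 1.
    n_keep = int(np.sum(eigenvalues > 1.0))
    # Loadings: correlations between variables and retained components.
    loadings = eigenvectors[:, :n_keep] * np.sqrt(eigenvalues[:n_keep])
    return eigenvalues, loadings

# Step 1: a synthetic data set (hypothetical, NOT the study's data):
# six observed variables driven by two latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
noise = rng.normal(scale=0.5, size=(200, 6))
data = latent @ mixing + noise

eigenvalues, loadings = pca_from_correlation(data)
print("eigenvalues:", np.round(eigenvalues, 3))
print("loadings shape:", loadings.shape)
```

Because the analysis starts from a correlation matrix, the eigenvalues always sum to the number of variables, which is why an eigenvalue of 1 is a natural cut-off: a retained component explains at least as much variance as one standardized variable.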

The following table shows the descriptive statistics of ten blood chemistry tests for ischemic heart disease patients and a control group.

Observation 1: 

It was found that the average values of cholesterol, triglyceride, Apo protein B, low density lipoprotein, phospholipids, total lipid, glucose and uric acid were higher in ischemic heart disease patients than in the control group, while high density lipoprotein and Apo protein A-1 were higher in the control group. Although the control group otherwise has normal values, its raised average cholesterol puts these people at a high risk of developing ischemic heart disease.

 

Observation 2:

Correlation Analysis: It can be observed from this table that

  1. Cholesterol has a high positive correlation with low density lipoprotein (0.606) and a moderate correlation with total lipid (0.421).
  2. High density lipoprotein and low density lipoprotein are negatively correlated.
  3. The observed correlation of triglyceride with total lipid (0.271) is weak and positive.
  4. Apo protein A-1 has a weak negative correlation with low density lipoprotein, whereas Apo protein B has a weak positive correlation with it.
  5. A high positive correlation is observed between low density lipoprotein and total lipid.

Observation 3:

Initial and Extraction Communalities Analysis:

  1. In principal component analysis, the initial and extraction communalities of a variable are the proportions of its variance accounted for by the factors.
  2. For a principal component analysis of the correlation matrix, the initial communalities are always equal to 1.0.
  3. The extraction communalities are the proportion of each variable's variance accounted for by the retained factors.
  4. The extraction communality is the sum of squared loadings for a variable across factors. In Table 4, the extraction communality for cholesterol is (0.770)² + (0.180)² + (0.0823)² + (0.0877)² ≈ 0.639.
  5. That is, factors 1 to 4 together accounted for 63.9% of the variance in cholesterol.
  6. The proportions of variance accounted for in high density lipoprotein, triglyceride, Apo protein A-1, Apo protein B, low density lipoprotein, phospholipids, total lipid, glucose and uric acid were 42.1%, 57.6%, 435.4%, 74.7%, 45.5%, 64.1%, 53.6%, 81.6% and 72.3% respectively.
  7. As already stated, variables for which only a small variance is explained do not fit the factor solution and should be discarded from the analysis. The empirical results in Table 4 show that high density lipoprotein and Apo protein B can be discarded because they explain a very small proportion of the variance. The proportion of variance explained by a factor is calculated by dividing its sum of squared loadings by the number of variables used; for these two it was calculated as 99.1%, whereas the proportion of covariance was calculated as 100%.

The communality for a variable is the variance accounted for by the factors. It is the squared multiple correlation for the variable as predicted from the factors. Communality is the sum of squared loadings (SSL) for a variable across factors. The proportion of variance explained by a factor is calculated by dividing that factor's SSL by the number of variables, particularly when the rotation of factors is orthogonal. Likewise, the proportion of covariance for a particular factor is obtained by dividing the SSL for that factor by the sum of communalities.
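The three quantities defined above can be written out as a short Python sketch. The cholesterol loadings are the ones quoted from Table 4; the second row of the matrix is a made-up placeholder (the full loading matrix is not reproduced in this post), and all function names are hypothetical.

```python
def communalities(loadings):
    """Sum of squared loadings across factors, one value per variable (row)."""
    return [sum(l * l for l in row) for row in loadings]

def factor_ssl(loadings):
    """Sum of squared loadings down each factor (column)."""
    n_factors = len(loadings[0])
    return [sum(row[j] ** 2 for row in loadings) for j in range(n_factors)]

def proportion_of_variance(loadings):
    """SSL of each factor divided by the number of variables."""
    n_variables = len(loadings)
    return [ssl / n_variables for ssl in factor_ssl(loadings)]

def proportion_of_covariance(loadings):
    """SSL of each factor divided by the sum of communalities."""
    total = sum(communalities(loadings))
    return [ssl / total for ssl in factor_ssl(loadings)]

# Row 1: cholesterol loadings on factors 1-4, as quoted from Table 4.
# Row 2: a hypothetical variable, included only to make a matrix.
loadings = [
    [0.770, 0.180, 0.0823, 0.0877],
    [0.100, 0.600, 0.2000, 0.1000],
]

h2 = communalities(loadings)
print("cholesterol communality:", h2[0])
print("proportion of variance:", proportion_of_variance(loadings))
print("proportion of covariance:", proportion_of_covariance(loadings))
```

With the rounded loadings shown, the cholesterol communality comes out at roughly 0.64, which agrees with the reported 0.639 up to rounding of the loadings. The proportions of covariance always sum to 1, because the factor SSLs and the communalities are two ways of adding up the same squared loadings.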

Observation 4:

Component Analysis:

This section reports the eigenvalues and total variance explained for our factor solution. Ten variables are used in the sample, and the aim of factor analysis is to summarize the pattern of their correlations with as few factors as possible. Every eigenvalue corresponds to a different factor, and usually only factors with large eigenvalues are retained. In a good factor analysis, these few factors reproduce the whole correlation matrix. It is clear from Table 6 that the first four components have eigenvalues over 1 and are large enough to be retained; they account for 26.65%, 12.02%, 11.79% and 10.19% of the variance respectively. Together these four components describe 60.67% of the total variance.
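The retention rule can be checked with simple arithmetic. The sketch below assumes, as in the steps listed earlier, that the components were extracted from a correlation matrix, so the total variance equals the number of variables (ten) and each eigenvalue is the component's share of variance multiplied by ten. The eigenvalues themselves are not quoted in this post; the values computed here are implied by the percentages, not taken from Table 6.

```python
n_variables = 10  # ten blood chemistry tests

# Percent of variance explained by the first four components (from the text).
percent_explained = [26.65, 12.02, 11.79, 10.19]

# Implied eigenvalues under the correlation-matrix assumption.
eigenvalues = [p / 100 * n_variables for p in percent_explained]

# Retain components whose eigenvalue exceeds 1.
retained = [ev for ev in eigenvalues if ev > 1.0]

cumulative = sum(percent_explained)

print("implied eigenvalues:", [round(ev, 3) for ev in eigenvalues])
print("components retained:", len(retained))
print("cumulative percent:", round(cumulative, 2))
```

All four implied eigenvalues lie above 1, consistent with retaining four components. The four percentages as printed add up to 60.65%, which matches the reported 60.67% up to rounding of the individual figures.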

Observation 5:

Component Score Coefficient Analysis:

The component loading matrix is a matrix of correlations between components and variables. The first column holds the correlation between the first component and each variable, the second column the correlation between the second component and each variable, and so on. Table 6 reveals that only a few strong correlations were observed between components and their respective variables. A component is interpreted from the variables that are highly correlated with it, that is, the variables that load on it. It is evident from Table 6 that the variables with high loadings on component 1, and therefore assumed to belong to it, are cholesterol (0.289), Apo protein B (0.205) and low density lipoprotein (0.302). Likewise, the variables with high weights on component 2 were Apo protein A-1 (0.302), phospholipids (0.566) and uric acid (0.259). Similarly, triglyceride (0.239) and total lipid (0.300) belonged to component 3. In the same way, the variables with high scores on component 4 were high density lipoprotein (0.330) and glucose (0.803).

 

Conclusion:

  1. Stepwise principal component analysis was used to explore the effects of the chemical tests that significantly affect patients suffering from ischemic heart disease, namely cholesterol, high density lipoprotein, triglyceride, Apo protein A-1, Apo protein B, low density lipoprotein, phospholipids, total lipid, glucose and uric acid.
  2. In the correlation analysis, cholesterol was found to be highly correlated with low density lipoprotein and moderately correlated with total lipid.
  3. The extraction communalities showed that high density lipoprotein and Apo protein B have small variances and do not fit the factor solution.
  4. As for the variability between the components, the first four components account for 60.67% of the total variability. These four components adequately describe the whole correlation matrix. Cholesterol, Apo protein B and low density lipoprotein had high values on component 1 and belong to it. On component 2, Apo protein A-1, phospholipids and uric acid had the high scores. The variables with high loadings on component 3 were triglyceride and total lipid. The variables with high weights on component 4 were high density lipoprotein and glucose.
  5. The important finding of this study is that the average cholesterol level, which is considered to be the main factor increasing the risk of ischemic heart disease, is higher than normal values even in the control group.
  6. So, cholesterol is not the factor contributing to ischemic heart disease.
