Shadow Planets…the most controlling planets….!

🙂 Good Evening, everyone…! 🙂

It has been a long time since I posted anything on my personal blog, though I have written many technical blogs on other sites…! So today, I just wanted to take a break from the mechanical life and write a blog that electrifies me. I am an engineer by education, so I always use some engineering terms. Yes, all the work that I do passionately to support my profession, directly or indirectly, belongs to my mechanical life, and the other things I do with equal passion, the ones that electrify and re-energize me, are the parts of me that connect me with myself. 🙂 Cooking, gardening, interior designing, clay pot designing, glass painting and the most relaxing of them all… writing…! I can never spend a day without writing…! And to write, I need to do a lot of analysis on the topic I am writing about. Astrology has always been a very interesting area for me to write about; every topic I start analyzing brings another abundance of knowledge to discover. A few months back, I wrote about the conjunction of Venus and Saturn in one's horoscope and the effect of the major and minor time periods of these two planets in one's life, after analyzing my horoscope along with five other horoscopes to draw a conclusion about that placement. So it takes time, energy and concentration to analyze these things.

Today, I will write about a topic with which I am personally never satisfied; each time, my hunger only intensifies the more I discover about it… Yes, it is about the shadow planets Rahu and Ketu…. These are very interesting facets of the Universe. They are called shadow planets because they represent what we cannot see within ourselves but which emerges sporadically, that which is called the shadow self. When the shadow planets change signs, the shadow self is evoked.

There is a story in Hindu mythology about the origin of these shadow planets. According to the Puranas, the birth of Rahu and Ketu dates back to the earliest of times. 'Samudra Manthan' is regarded as one of the most important events in the history of Hindu civilization, and the solar and lunar eclipses are also associated with it. When the ocean was churned by the Asuras and Devas, 'Amrit', the nectar of immortality, was produced. The Amrit was stolen by the Asuras, and to recover it, Lord Vishnu took incarnation in the form of a beautiful damsel, 'Mohini', and tried to please and distract the demons. On receiving the Amrit, Mohini came to the Devas to distribute it among them. 'Svarbhanu', one of the Asuras, changed his appearance to that of a Deva to obtain a portion of the Amrit. However, Surya (the Sun) and Chandra (the Moon) realized that Svarbhanu was an Asura and not one of the Devas. Knowing this, Lord Vishnu severed Svarbhanu's head with his discus, the Sudarshan Chakra.

However, even though his head and body were separated, both remained immortal as separate entities, because before his head was severed he had managed to drink a drop of the nectar. The head is known as Rahu and the headless body as Ketu. Since then, Rahu and Ketu constantly chase the Sun and the Moon for revenge, as they were the cause of the separation of Svarbhanu's head and body. It is a popular belief that when they succeed in catching the Sun or the Moon, they swallow them, causing a solar or lunar eclipse; but they cannot hold them for long, and the Sun and Moon emerge again intact, as they too had the nectar and are immortal.

Scientifically, Rahu and Ketu denote the two points where the paths of the Sun and the Moon intersect as they move around the celestial sphere. Therefore, Rahu and Ketu are respectively called the North and South lunar nodes. Sometimes, when the Moon passes these nodes, it aligns with the Earth and the Sun to create eclipses.

Rahu and Ketu are considered two strong planets as per the principles of Vedic Astrology, although astronomically they do not exist as physical bodies. Since Rahu and Ketu are believed to have a strong impact on our lives, they are a crucial part of Jyotish Shastra and are treated as mathematical points in the calculations of Vedic Astrology. The general explanation is that the Sun represents the body whereas the Moon represents the mind; hence these intersection points strongly affect the energies of these two parts of our being.

Since the topic of the Sun and the Moon has come up, one point I just want to include: the moonsign of a person, the house of the Moon's placement, speaks about one's emotions and subconscious, what one needs in relationships and how one will get along living with someone else, while the ascendant, the house of the Sun's placement in one's horoscope, speaks about one's personality. I belong to the Cancer moonsign with Leo ascendant. The Cancer moonsign has made me very emotional and caring, which is well balanced by the Leo ascendant, making me a perfect Leo… but we will discuss that later…!

Coming back to Rahu and Ketu: generally, astrology mentions that Rahu represents indulgence. Rahu is also behind the instant success or failure of a person. However, if Rahu is well placed, it can bestow the native with courage and fame. On the other hand, the negative effects of Ketu can cause diseases related to the lungs, ear problems, brain disorders, problems in the intestine, etc. It represents mystic activities, wounds, sufferings, bad company, false pride, etc. Ketu also stands for moksha, sudden gains, and interest in philosophical and spiritual pursuits.

In Vedic Astrology, Rahu and Ketu have important functions with respect to Maya, the veil of illusion, and Karma, the repository of actions done in the past. We are all born within Maya, illusion: we cannot see our true selves, and we do not know what actions we undertook in the past or how we came to be born again. Ketu is thought of as the karmic repository, the cauldron of selected past karma we have chosen to consume and expire in this life. Rahu is the principle of desire, of action and reaction, which causes karma to be delivered right on time, according to schedule. Karma is of many kinds and has several degrees of "density", as it were. There are three kinds of karma: fixed, mutable and movable. Fixed karmas are those events, consequences of our past actions, which we will experience in this life. Mutable karma can often be ameliorated by wearing certain gems on special places of the body and by offering prayers and sacred offerings to the planets. Movable karma can be spent or reduced by way of penance, prayer, puja, offerings and, most especially, selfless service to any other life form in need of alms, aid, food or clothing.

It is in this area of past karmas that the shadow planets Rahu and Ketu get busy. They may bring rewards of past good actions; some placements of Rahu and Ketu are brilliant for this. They may give a certain wash of temperament or character to an entire lifetime by way of the Rahu-Ketu axis in the birth chart. These planets also have a special effect when Kala Sarpa Yoga is present in the birth chart, that is to say, when all the planets are located between the positions of Rahu and Ketu. Therein emerges captivity to the shadow self of the past. That is the duty of these planets.

One has to clearly understand the contribution of these planets, sitting in different houses of a person's horoscope, towards that person's destiny. As these two planets depict the head and the tail, they are always placed opposite each other, six houses apart, in a horoscope. In my horoscope, Rahu is placed in the first house, that is, in the ascendant, and Ketu is placed six houses further, in the seventh house…! Interesting placements… 🙂 Most of the astrologers I consulted had already told me that this is the worst placement of Rahu and Ketu. Rahu becomes strong if it is placed within the first six houses, and Ketu becomes strong if it is placed within the seventh to twelfth houses. Since these are malefic planets, their strong placement brings many failures in different areas of a person's life. So I started analyzing my horoscope and drew the conclusions below after also analyzing five other horoscopes, even though there are slight differences in the placements of the other planets alongside these shadow planets.

Rahu (the North Node) is the head that always wants to eat; it is obsession, the present-life desire we are born with. Rahu signifies obsession with the matters of the house where it is placed in the chart. Ketu (the South Node) is the body without the head; it is our past-life experience, already fulfilled, and so it signifies detachment and disinterest in the matters of the house where it is placed. Although they are not physical planets, they are among the most influential forces in a chart, and it is along the Rahu/Ketu axis that the main force of karmic desire can be seen. So what happens when the Rahu/Ketu axis falls in the 1st and 7th houses? Rahu's placement shows the area of life in which we need to develop, and Ketu's placement shows the area of life from which we withdraw.

Summary: Rahu (North Node) in the 1st house means obsession with one's identity, while Ketu (South Node) in the 7th house indicates detachment from the spouse and strained relationships.

The 1st house is the house of identity, physical body, appearance and overall general health, while the 7th house is the house of other people, the spouse, and dealings with others. Rahu is the head without the body, and Rahu in the 1st house shows that the person may have a big head. Yes, I do, with a broad forehead and well-proportioned facial features.

Here, Rahu is obsessed with its own identity while Ketu withdraws from others. People with Rahu (North Node) in the 1st house (Rahu in Lagna) feel that no one understands them. The native withdraws from others because other people criticize them too much. These people like to define their personality in their own way; they do not like someone else defining it for them. Rahu is a rebellious force: the native dislikes other people's criticism, becomes lonely and unusual in behavior, and keeps trying to find the answer to their own identity. Because of this, they suffer in married life; initially the native suffers by compromising with the partner, but after some time they do not feel satisfied with their spouse and may wander. Ketu is a planet greatly involved in the spiritualization process, and the 7th house primarily stands for the spouse and relationships. Ketu in the 7th house usually indicates that the spouse may be a difficult person with some problems, and that relationships will be very strained. But with time these people learn how to live alone, in solitude, and this nature makes them introspect and learn about themselves. If evolved, the person becomes very spiritual. These people need to listen to others to overcome their shortcomings, try to understand those shortcomings themselves, and improve.

For the last four to five years I have become restless to find my self-identity, which is the effect of Rahu, after coming out of an incident that was the effect of Ketu… We will discuss more about this in my next blog…

<<To be continued>>

What is Principal Component Analysis and its application in Ischemic Heart Disease

Definition: Ischemic means that an organ (e.g., the heart) is not getting enough blood and oxygen. Ischemic heart disease, also called coronary heart disease (CHD) or coronary artery disease, is the term given to heart problems caused by narrowed heart (coronary) arteries that supply blood to the heart muscle.

What is Principal component analysis (PCA)?

  1. It is a statistical technique applicable when the statistician is dealing with a single set of variables that show some correlation with each other, and wants to discover which sets of correlated variables combine into coherent factors, in such a way that there is no correlation between the newly formed factors.
  2. Here, a set of correlated variables means variables that are largely uncorrelated with the other subsets of variables, and that are combined together to form a factor.
  3. The basic idea is that these newly formed factors drive the underlying process that makes the variables in the data set correlate with each other. The specific goals of the component analysis are to summarize patterns of correlations among observed variables, to reduce a large number of observed variables to a smaller number of factors, and to provide an operational definition (a regression equation) for an underlying process in terms of the observed variables.
  4. Since the number of factors is usually far smaller than the number of observed variables, there is considerable parsimony in using factor analysis. Furthermore, when factor scores are estimated for each subject, they are often more reliable than scores on the individual observed variables.

Steps in principal component analysis:

  1. Selecting a data set consisting of a set of variables,
  2. Preparing the correlation matrix (to perform principal component analysis),
  3. Extracting a set of factors from the correlation matrix,
  4. Determining the number of factors,
  5. (Optionally) rotating the factors to increase interpretability, and finally, interpreting the results. A short R sketch of these steps follows.
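Below is a minimal sketch of these steps in R. The blood-chemistry dataset used in the study is not reproduced here, so the sketch simulates a few correlated variables purely for illustration; the variable names and numbers are assumptions, not the study data.

R code:

set.seed(42)
n <- 100
cholesterol <- rnorm(n, mean = 200, sd = 30)
ldl <- 0.7 * cholesterol + rnorm(n, 0, 15)   # built to correlate with cholesterol
hdl <- rnorm(n, 50, 8)
glucose <- rnorm(n, 95, 12)
X <- data.frame(cholesterol, ldl, hdl, glucose)   # step 1: the data set

round(cor(X), 2)                                  # step 2: correlation matrix
pca <- prcomp(X, center = TRUE, scale. = TRUE)    # step 3: extract the components
summary(pca)                                      # step 4: variance explained per component
pca$rotation                                      # loadings, used to interpret the components
# step 5 (rotation) needs a factor-analysis tool, e.g. the psych package:
# psych::principal(X, nfactors = 2, rotate = "varimax")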

The following table shows the descriptive statistics of ten blood chemistry tests for ischemic heart disease patients and for a control group.

Observation 1: 

It was found that the average values of cholesterol, triglyceride, Apo protein B, low density lipoprotein, phospholipids, total lipid, glucose and uric acid were higher in ischemic heart disease patients compared to the control group, while high density lipoprotein and Apo protein A-1 were higher in the control group. Though the control group had otherwise normal values, the elevated average cholesterol puts even these people at heavy risk of developing ischemic heart disease.


Observation 2:

Correlation Analysis: It could be observed from this table that

  1. cholesterol has a high positive correlation with low density lipoprotein (0.606) and is moderately correlated with total lipid (0.421),
  2. there is a negative correlation between high density lipoprotein and low density lipoprotein,
  3. the observed correlation of triglyceride with total lipid (0.271) is weakly positive,
  4. Apo protein A-1 has a weak negative correlation, whereas Apo protein B has a weak positive correlation, with low density lipoprotein,
  5. there is a highly positive correlation between low density lipoprotein and total lipid.

Observation 3:

Initial and Extraction Communalities Analysis:

  1. In principal component analysis, the initial and extraction communalities of the variables are the variances accounted for by the factors.
  2. For the analysis of variance and covariance, the initial communalities always remain equal to 1.0.
  3. The extraction communalities are the proportion of each variable's variance explained by the extracted factors.
  4. The extraction communality is the sum of squared loadings for a variable across factors; in Table 4, the extraction communality for cholesterol is (0.770)² + (0.180)² + (0.0823)² + (0.0877)² = 0.639.
  5. That is, Factors 1 to 4 together accounted for 63.9% of the variance in cholesterol.
  6. The variances accounted for in high density lipoprotein, triglyceride, Apo protein A-1, Apo protein B, low density lipoprotein, phospholipids, total lipid, glucose and uric acid were 42.1%, 57.6%, 435.4%, 74.7%, 45.5%, 64.1%, 53.6%, 81.6% and 72.3% respectively.
  7. As already stated in the last section, variables whose variance is poorly explained do not fit the factor solution and should be discarded from the analysis; our empirical results in Table 4 show that high density lipoprotein and Apo protein B can easily be discarded, as only a very small proportion of their variance is explained. Since the proportion of variance explained by a factor is calculated by dividing its sum of squared loadings by the number of variables used, for these two factors it was calculated as 99.1%, whereas the covariance proportion for these factors was calculated as 100%.

The communality of a variable is the variance accounted for by the factors. It is the squared multiple correlation of the variable as predicted from the factors, and it equals the sum of squared loadings (SSL) for the variable across factors. The proportion of variance explained by a factor is found by dividing that factor's sum of squared loadings by the number of variables, particularly when the rotation of factors is orthogonal. Likewise, the covariance proportion for a particular factor is obtained by dividing its sum of squared loadings by the sum of the communalities. The arithmetic is illustrated below.
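A quick check of the communality arithmetic quoted above, using the four cholesterol loadings given in the text:

R code:

loadings_chol <- c(0.770, 0.180, 0.0823, 0.0877)   # loadings on Factors 1-4
sum(loadings_chol^2)                               # ≈ 0.64, the 63.9% quoted above

# For a full loading matrix L (variables in rows, factors in columns):
#   communalities  <- rowSums(L^2)
#   var_proportion <- colSums(L^2) / nrow(L)              # per factor (orthogonal rotation)
#   cov_proportion <- colSums(L^2) / sum(rowSums(L^2))    # per factor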

Observation 4:

Component Analysis:

Table 6 shows the eigenvalues and the total variance explained for our factor solution. Ten variables were used in the sample, and the aim of factor analysis is to summarize the pattern of correlations with as few factors as possible. Every eigenvalue corresponds to a different factor, and usually only factors with large eigenvalues are retained; in a good factor analysis, these few factors reproduce the whole correlation matrix. It is quite clear from Table 6 that the first four components have eigenvalues over 1 and are large enough to be retained; their variances are 26.65%, 12.02%, 11.79% and 10.19% respectively. Together these four components describe 60.67% of the total variance.
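The retention rule just described is easy to compute directly. Below is a hedged sketch, assuming a data frame X of numeric variables such as the simulated one earlier; the study's own eigenvalues come from Table 6, not from this code.

R code:

ev <- eigen(cor(X))$values     # one eigenvalue per component
ev
which(ev > 1)                  # Kaiser criterion: retain components with eigenvalues over 1
cumsum(ev) / sum(ev)           # cumulative proportion of total variance explained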

Observation 5:

Component Score Coefficient Analysis:

The component loading matrix is a matrix of correlations between components and variables: the first column holds the correlations between the first component and each variable, the second column those between the second component and each variable, and so on. Table 6 reveals that only a few strong correlations were observed between the components and their respective variables. A component is interpreted from the variables that are highly correlated with it, that is, the variables that have high loadings on it. It is evident from Table 6 that the variables with high loadings on component 1 are cholesterol (0.289), Apo protein B (0.205) and low density lipoprotein (0.302). Likewise, in component 2 the variables with high weights were Apo protein A-1 (0.302), phospholipids (0.566) and uric acid (0.259). Similarly, triglyceride (0.239) and total lipid (0.300) belong to component 3. In the same way, in component 4 the variables with high scores were high density lipoprotein (0.330) and glucose (0.803).


Conclusion:

  1. Step-wise Principal Component Analysis was used to explore the effects of the chemical tests that significantly affect patients suffering from Ischemic Heart Disease, namely cholesterol, high density lipoprotein, triglyceride, Apo protein A-1, Apo protein B, low density lipoprotein, phospholipids, total lipid, glucose and uric acid.
  2. Regarding correlations between variables, it was observed that cholesterol was highly correlated with low density lipoprotein and moderately correlated with total lipid.
  3. It was observed from the extraction communalities that high density lipoprotein and Apo protein B have small variances and do not fit in the factor solution.
  4. Regarding the variability between the components, it was found that the first four components account for 60.67% of the total variability and adequately describe the whole correlation matrix. Cholesterol, Apo protein B and low density lipoprotein had high loadings and belong to component 1. In component 2, Apo protein A-1, phospholipids and uric acid had the high scores. The variables with high loadings on component 3 were triglyceride and total lipid, and those belonging to component 4 were high density lipoprotein and glucose.
  5. The important finding of this study is that the average cholesterol level, which is considered the main factor increasing the risk of Ischemic Heart Disease, was found to be higher than normal even in the control group.
  6. So, cholesterol alone is not the factor that distinguishes the Ischemic Heart Disease patients in this data.

Application of ANOVA and Regression Analysis in e-commerce business

Let's assume that eCommerce organisations like Amazon and Flipkart would like to understand whether the shopping habit for a specific category has any relationship with the Gender and Income Group of the customers. As there are two factors (i.e. two independent categorical columns) being considered in this example, we are talking about Two-way ANOVA. So there would be three hypotheses: one for each independent categorical column, and a third to cater for the interaction effect of the two independent variables.

As another example, suppose an eCommerce organisation would like to understand whether page crashes have anything to do with the education level of the customers. ANOVA would be the right choice to find whether there is any statistically significant effect on the probability of a page crash when measured against a single factor, education level.

For any business, and specifically for an eCommerce organisation, conversion/purchase is the final goal. Hence, to find out what impact each activity has on sales, we can build a regression equation with all the activities, such as email campaigns, TV ads, social media broadcasts, personalized communication to frequent customers and cold calls, as independent variables; each coefficient then measures the impact on sales of a unit increase in the cost of that activity, keeping the other independent variables constant. For example:
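Below is a minimal sketch of such a model in R, with simulated spend data; the column names and coefficients are illustrative assumptions, not real campaign data.

R code:

set.seed(1)
n <- 60
sales_df <- data.frame(
  email_campaign = runif(n, 0, 10),   # spend per period on each activity
  tv_ads         = runif(n, 0, 50),
  social_media   = runif(n, 0, 20),
  cold_calls     = runif(n, 0, 5)
)
sales_df$sales <- 100 + 3 * sales_df$email_campaign + 1.5 * sales_df$tv_ads +
  2 * sales_df$social_media + 4 * sales_df$cold_calls + rnorm(n, 0, 10)

sales_model <- lm(sales ~ ., data = sales_df)
summary(sales_model)   # each coefficient: change in sales per unit of spend,
                       # holding the other activities constant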

Two Way ANOVA:

  1. In any business, Customer Satisfaction and Customer Loyalty play a vital role; usability of the e-commerce site, confidential protection of user information and better response time are some of the contributing factors for determining the areas of improvement in the business. Two-way ANOVA can be used to establish the statistical significance of these factors towards the major goal.
  2. To determine the significance of a customer's shopping habit based on demographic factors such as Gender or Annual Income, by computing the association with, and the interaction between, these two independent factors.
  3. To come up with the significance levels of the channels used in a multi-channel marketing campaign.

We can express shopping habit as below:

Shopping Habit = a·Gender + b·Annual Income,

where Shopping Habit is the DV (dependent variable) and Gender and Annual Income are the IVs (independent variables).

As there are two IVs, each with its own levels, we have to perform Two-way ANOVA.

Suppose the following is a sample of the data:

Cust Id | Gender | Annual Income | Shopping Habit
--------|--------|---------------|---------------
1       | F      | >1L           | Occasionally
2       | M      | 1L-3L         | Weekly
3       | F      | 3L-5L         | Monthly
4       | M      | >1L           | Occasionally
5       | F      | 1L-3L         | Weekly
6       | M      | 3L-5L         | Monthly
7       | F      | >1L           | Occasionally
8       | F      | 1L-3L         | Weekly
9       | F      | 1L-3L         | Monthly


Hypothesis for Gender

H0: there is no effect of Gender on Shopping Habit

Ha: there is an effect of Gender on Shopping Habit

Hypothesis for Annual Income

H0: there is no effect of Annual Income on Shopping Habit

Ha: there is an effect of Annual Income on Shopping Habit.

Perform normality and homogeneity tests on the data distribution.

Considering Gender alone, we need to calculate MS_within for both the Male and Female groups and MS_between across them.

Then F_ratio = MS_between / MS_within.

If F_ratio > F_0.05 (the critical value at the 5% level), we can conclude that there is an effect of Gender on Shopping Habit.

Considering Annual Income alone, we need to calculate MS_within for each of the three income groups and MS_between across the three groups.

Then F_ratio = MS_between / MS_within.

If F_ratio > F_0.05, we can conclude that there is an effect of Annual Income on Shopping Habit.

Interaction effect: through the TukeyHSD post-hoc test, we can examine the interaction between the two factors, as sketched below.
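Here is a hedged sketch of this two-way ANOVA in R. Since ANOVA needs a numeric response, Shopping Habit is assumed here to be recorded as a numeric purchase frequency rather than the category labels above, and the data are simulated for illustration only.

R code:

set.seed(7)
n <- 90
shop <- data.frame(
  gender = factor(sample(c("F", "M"), n, replace = TRUE)),
  income = factor(sample(c(">1L", "1L-3L", "3L-5L"), n, replace = TRUE))
)
# assumed numeric response: purchases per month
shop$habit <- 2 + (shop$gender == "F") + as.numeric(shop$income) + rnorm(n)

fit <- aov(habit ~ gender * income, data = shop)   # main effects + interaction
summary(fit)      # F-ratios (MS_between / MS_within) and p-values
TukeyHSD(fit)     # post-hoc comparisons, including the interaction term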

Objectives of Multi-Channel Marketing:

  1. Low-cost marketing channels
  2. Better customer experience
  3. Better integration and interaction of channels

A CRM (Customer Relationship Management) system can be a data source for this. From it we can get insight into the customers across the various marketing channels: response and customer acquisition, the most statistically significant (most effective) channels for a group (cluster) of customers through linear regression, and whether there is any increase in the performance of a specific channel in association with another channel (ANOVA interaction).

  1. Do factors such as Product Description, One-day Delivery option, Availability of Cash on Delivery, Quality of packaging of the product, and Free returns with pickup facility have any significant impact on customer buying behaviour?
  2. Factors such as Application User Interface, Information Quality, User Information Security and Service Feedback can be hypothesized about to reach a conclusion on customer satisfaction and trust, by achieving which an e-commerce organisation can gain Customer Loyalty.

So, here hypothesis testing can be summarised as below:

H0(UI): Application User Interface does not have any impact on Customer Loyalty

Ha(UI): Application User Interface has a significant impact on Customer Loyalty

H0(IQ): Information Quality does not have any impact on Customer Loyalty

Ha(IQ): Information Quality has a significant impact on Customer Loyalty

H0(IS): User Information Security does not have any impact on Customer Loyalty

Ha(IS): User Information Security has a significant impact on Customer Loyalty

H0(SF): Service Feedback does not have any impact on Customer Loyalty

Ha(SF): Service Feedback has a significant impact on Customer Loyalty.

An example of Multiple ANOVA and Regression in Error Correction:

1. During a Seasonal Offer, I always wonder why the Discount Sale runs for only 3 days and not at least 7 days. Below might be a reason for that:

A Manager may believe that extending the Discount Sale duration will greatly increase sales.

Multiple ANOVA can establish the statistical significance of the optimum number of days for the Discount Sale, whether it is 3, 4 or 5 days.

Regression analysis, however, may indicate that the increase in revenue might not be sufficient to cover the base price of the products or the rise in operating expenses due to the longer support hours needed to handle the huge load (such as any additional IT infrastructure and support costs related to this).

Here, there is a possibility of lower profit due to the additional support cost, which might be ignored by the Manager at the time of decision making.

Hence, regression analysis along with ANOVA can provide quantitative support for decisions and prevent mistakes driven by a manager's intuition.

Failure to understand the components of correlation and regression, and their implications and limitations, can lead to poor business decisions; applied correctly, correlation and regression analysis are powerful decision-support tools.

Applications of statistics in Marketing and Customer Analytics are vast, but as the topic of discussion is limited to One-way ANOVA, Two-way ANOVA and Regression (no specific kind of regression being mentioned), I am adding a few more use cases:

1. Predictive Model: Linear Regression, a type of predictive model, can be used to enhance the pricing model of an e-commerce business by analyzing historical data for different products, customer responses to past pricing trends, and competitors' pricing models, which helps build suitable pricing models.

2. Logistic Regression can be used for fraud detection through customer behavior analytics: algorithms analyze suspicious activities and find inconsistencies in historical sets of personal data, for scenarios in which a scammer breaches a user account, alters personal data, and tries to get money or goods from a retailer using this semi-fake personal information. A sketch follows.
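Below is a minimal logistic-regression sketch in R for fraud scoring. The features (a recent address change, how unusual the order value is, a new device) and the simulated data are illustrative assumptions only.

R code:

set.seed(3)
n <- 500
txn <- data.frame(
  addr_changed = rbinom(n, 1, 0.1),   # personal data altered recently
  value_zscore = rnorm(n),            # how unusual the order value is
  new_device   = rbinom(n, 1, 0.2)
)
# simulate fraud labels with these features driving the odds
p <- plogis(-3 + 2.5 * txn$addr_changed + 1.2 * txn$value_zscore + 1.5 * txn$new_device)
txn$fraud <- rbinom(n, 1, p)

fraud_model <- glm(fraud ~ addr_changed + value_zscore + new_device,
                   data = txn, family = binomial)
summary(fraud_model)
head(predict(fraud_model, type = "response"))   # fraud probability per transaction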

3. Cox Regression: Cox regression can be used for time-to-event analysis, for example analyzing the time between user account creation and the first event triggered by the user, whether that is a purchase or the closure of the account. So here there are two target variables: the time difference, and the occurrence of the specific event, as sketched below.
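A hedged sketch of that time-to-event idea using R's survival package, on simulated data (all variable names are assumptions); accounts that never triggered the event are treated as censored.

R code:

library(survival)
set.seed(5)
n <- 200
users <- data.frame(
  days_to_event  = rexp(n, rate = 0.02),   # days from account creation to event/censoring
  event_observed = rbinom(n, 1, 0.7),      # 1 = purchase observed, 0 = censored
  from_campaign  = rbinom(n, 1, 0.4)       # an assumed covariate
)
cox_model <- coxph(Surv(days_to_event, event_observed) ~ from_campaign, data = users)
summary(cox_model)   # hazard ratio for campaign-acquired users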

What is Multicollinearity, how to test it in R and how to handle the problems caused by it?

What is Multicollinearity?

Multicollinearity is a state of very high intercorrelations or inter-associations among the independent variables. It is therefore a type of disturbance in the data, and if it is present, the statistical inferences made from the data may not be reliable.

There are certain reasons why multicollinearity occurs:

  1. It is caused by the inclusion of a variable which is computed from other variables in the data set. For example, in a dataset of the physical attributes of a person, Weight, Height and BMI may all be present, where BMI = Weight/Height². BMI is thus a derived variable and is positively correlated with Weight, so multicollinearity occurs between Weight and BMI (see the small sketch after this list).
  2. Multicollinearity can also result from the repetition of the same kind of variable.
  3. It generally occurs when the variables are highly correlated with each other.
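A tiny illustration of the BMI case in R, with simulated heights and weights:

R code:

set.seed(9)
height <- rnorm(100, 1.7, 0.1)    # metres
weight <- rnorm(100, 70, 10)      # kilograms
bmi    <- weight / height^2       # derived from the other two columns
cor(weight, bmi)                  # strongly positive by construction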

How to test Multicollinearity in R? 

For a given predictor p, multicollinearity can be assessed by computing a score called the variance inflation factor (VIF), which measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model. Formally, VIF_p = 1 / (1 − R²_p), where R²_p is obtained by regressing predictor p on all the other predictors.

The smallest possible value of VIF is one (absence of multicollinearity). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.

R code:

# Clear the workspace and load the red-wine quality dataset
rm(list = ls())
winequality <- read.csv("D:/sahubackup/GL/winequality-red.csv")
head(winequality)

library(tidyverse)   # provides the %>% pipe
library(caret)       # provides RMSE() and R2()

# 70/30 train-test split (the printed output below came from an unseeded run)
set.seed(123)
ind <- sample(2, nrow(winequality), replace = TRUE, prob = c(0.7, 0.3))
trainset <- winequality[ind == 1, ]
testset  <- winequality[ind == 2, ]

# Linear model: wine quality regressed on all other variables
model <- lm(quality ~ ., data = trainset)
predictions <- model %>% predict(testset)

# Hold-out accuracy
data.frame(
  RMSE = RMSE(predictions, testset$quality),
  R2   = R2(predictions, testset$quality)
)

# Variance inflation factor for each predictor
car::vif(model)

Output:

       RMSE        R2
1 0.6304587 0.3623385

> car::vif(model)
       fixed.acidity     volatile.acidity          citric.acid       residual.sugar
            7.842042             1.835592             3.169716             1.670375
           chlorides  free.sulfur.dioxide total.sulfur.dioxide              density
            1.410560             2.073626             2.235259             6.422401
                  pH            sulphates              alcohol
            3.268406             1.417728             2.972688

So here, fixed.acidity (VIF ≈ 7.8) and density (VIF ≈ 6.4) exceed the threshold of 5 and show high multicollinearity.

Problems with Multicollinearity:

Multicollinearity causes the following basic types of problems:

  • The coefficient estimates can swing wildly and become very sensitive to small changes in the model.
  • Multicollinearity reduces the precision of the estimated coefficients, which weakens the statistical power of the regression model. We cannot trust the p-values to identify the independent variables that are statistically significant.
  • The partial regression coefficients may not be estimated precisely; the standard errors are likely to be high.
  • Multicollinearity results in changes in the signs as well as the magnitudes of the partial regression coefficients from one sample to another.
  • Multicollinearity makes it tedious to assess the relative importance of the independent variables in explaining the variation in the dependent variable.

In the presence of high multicollinearity, the confidence intervals of the coefficients tend to become very wide and the t-statistics tend to be very small, making it difficult to reject the null hypothesis in any study where multicollinearity is present in the data.

Types of Multicollinearity:

There are two basic kinds of multicollinearity:

  1. Structural multicollinearity: this type occurs when we create a model term from other terms. In other words, it is a byproduct of the model that we specify rather than being present in the data itself; the Weight/Height/BMI case above, where BMI = Weight/Height² is derived from the other two variables, is an example.
  2. Data multicollinearity: this type is present in the data itself rather than being an artifact of our model. Observational experiments are more likely to exhibit this kind of multicollinearity.

Fixing Multicollinearity:

Multicollinearity makes it hard to interpret the model coefficients, and it reduces the power of the model to identify independent variables that are statistically significant. These are definitely serious problems.

The need to reduce multicollinearity depends on its severity and on our primary goal for the regression model. Keep the following three points in mind:

  1. The severity of the problems increases with the degree of the multicollinearity. Therefore, if we have only moderate multicollinearity, we may not need to resolve it.
  2. Multicollinearity affects only the specific independent variables that are correlated. Therefore, if multicollinearity is not present for the independent variables that we are particularly interested in, we may not need to resolve it. Suppose our model contains the experimental variables of interest and some control variables: if high multicollinearity exists for the control variables but not for the experimental variables, then we can interpret the experimental variables without problems.
  3. Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, the precision of the predictions, or the goodness-of-fit statistics. If our primary goal is to make predictions, and we don't need to understand the role of each independent variable, we don't need to reduce even severe multicollinearity.

How to Deal with Multicollinearity

But what if we have severe multicollinearity in our data and find that we must deal with it? What do we do then? There are a variety of methods we can try, but each one has some drawbacks. We need to use our subject-area knowledge, and factor in the goals of our study, to pick the solution that provides the best mix of advantages and disadvantages.

The potential solutions include the following:

  • Remove some of the highly correlated independent variables (a quick sketch of this follows the list).
  • Linearly combine the independent variables, such as adding them together.
  • Perform an analysis designed for highly correlated variables, such as principal components analysis or partial least squares regression.
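Continuing the wine-quality example above, the first remedy might look like this: drop the highest-VIF predictor and refit, then check that the remaining VIFs settle down. This is only a sketch and assumes the trainset object from the earlier code.

R code:

# Drop one collinear predictor and refit
model2 <- lm(quality ~ . - fixed.acidity, data = trainset)
car::vif(model2)    # density's VIF is expected to drop once fixed.acidity is removed
summary(model2)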