What is Factor Analysis? | Explained Factor Analysis
Data analysis is important for businesses because data-driven choices are the only way to be truly confident in business decisions.
Data analysis is also important in research because it makes studying data a lot simpler and more accurate.
One such method is factor analysis.
1 - What is factor analysis
2 - Latent variables
3 - Assumptions in factor analysis
4 - Purpose of factor analysis
5 - Types of factor analysis
6 - Issues with factor analysis
7 - Basic logic of factor analysis
1 - What is Factor Analysis
Factor analysis is a statistical technique that is used to reduce a large number of variables into a smaller number of factors.
For example, it is possible that variations in five observed variables mainly reflect the variation in one unobserved variable.
It is also known as dimension reduction, since it reduces the dimension, or the total number of variables, in the data set.
Factor analysis is a kind of latent variable model.
Consider a job satisfaction questionnaire.
A person's satisfaction with a job can be based on numerous factors, such as satisfaction with the job role (whether it matches the person's qualifications), satisfaction with the supervisor (which can in turn depend on appraisal or communication), satisfaction with co-workers, pay, etc.
Example of factor analysis: say that you are a foodie and you want to pick a restaurant to go to, so you start checking reviews of different restaurants.
You find that the reviews are categorized on six variables, which are
Waiting time
Cleanliness
Staff behavior
Taste of food
Food freshness and
Food temperature.
Too many variables make it difficult for you to pick a particular restaurant to go to.
Let's say the two factors you really care about when picking a restaurant are service and food quality.
The variables in the reviews can be broadly categorized into these two factors: waiting time, cleanliness and staff behavior fall under service, while taste, freshness and temperature of the food fall under food quality.
This is what factor analysis does.
Service and food quality are not really present in the data but are derived out of the data; these are called latent variables.
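To make the restaurant example concrete, here is a minimal sketch in Python with NumPy. It is only an illustration under stated assumptions: the review scores are simulated, the names `service` and `food_quality` and the weight of 0.8 are made up for the demo, and nothing here comes from real review data. It shows how two hidden factors leave a visible pattern in the correlations of the six observed variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n_reviews = 2000

# Two hypothetical latent factors; in real data these are never observed.
service = rng.normal(size=n_reviews)
food_quality = rng.normal(size=n_reviews)

def observed(latent, weight=0.8):
    """Simulated rating = weight * latent factor + independent noise."""
    noise = rng.normal(size=n_reviews)
    return weight * latent + np.sqrt(1 - weight**2) * noise

names = ["waiting_time", "cleanliness", "staff_behavior",
         "taste", "freshness", "temperature"]
data = np.column_stack([
    observed(service), observed(service), observed(service),
    observed(food_quality), observed(food_quality), observed(food_quality),
])

# Correlation matrix of the six observed variables.
corr = np.corrcoef(data, rowvar=False)

within_service = corr[0, 1]   # waiting_time vs cleanliness (same factor)
across_factors = corr[0, 3]   # waiting_time vs taste (different factors)
print("same factor:", round(within_service, 2))
print("different factors:", round(across_factors, 2))
```

Variables driven by the same latent factor come out strongly correlated, while variables driven by different factors are close to uncorrelated. Factor analysis works backwards from this kind of pattern to the factors.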
2 - Latent Variables
In statistics, latent variables are variables that are not directly observed.
These are what we call factors; they are actually difficult to measure numerically.
Mathematical models that aim to explain observed variables in terms of latent variables are called latent variable models.
Hence factor analysis is a latent variable model.
Examples of latent variables are
Quality of life
Business confidence
Morale
Happiness and
Conservatism, among others.
3 - Assumptions in Factor Analysis
Let's have a look at the assumptions.
Firstly, we assume that our data is clean: there should be no outliers or missing values.
Secondly, the sample size is expected to be greater than the number of factors.
Thirdly, the variables are expected to be interrelated.
The concept of factor analysis is based on correlation in the data, so that variables can be grouped together.
We can perform something called Bartlett's test to analyze the correlation.
Fourthly, metric variables are expected, that is, the variables are expected to be of numeric type and measured on an interval scale.
Lastly, the data is preferred to be normally distributed; however, multivariate normality is not necessary.
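The correlation check mentioned above can be sketched as follows. This is a minimal illustration of Bartlett's test of sphericity using NumPy only; the function name `bartlett_sphericity` and the simulated data are assumptions for the demo. It returns the chi-square statistic and its degrees of freedom, and leaves the p-value lookup to a statistics library.

```python
import numpy as np

def bartlett_sphericity(data):
    """Chi-square statistic and degrees of freedom for Bartlett's test.

    Null hypothesis: the variables are uncorrelated, that is, the
    correlation matrix is an identity matrix.
    """
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    _, log_det = np.linalg.slogdet(corr)      # log of the determinant
    statistic = -(n - 1 - (2 * p + 5) / 6) * log_det
    dof = p * (p - 1) // 2
    return statistic, dof

rng = np.random.default_rng(1)

# Four simulated variables that share one common source of variation.
common = rng.normal(size=(500, 1))
correlated = common + 0.5 * rng.normal(size=(500, 4))

# Four simulated variables with no relationship at all.
independent = rng.normal(size=(500, 4))

stat_corr, dof = bartlett_sphericity(correlated)
stat_ind, _ = bartlett_sphericity(independent)
print("correlated data:", round(stat_corr, 1), "with", dof, "degrees of freedom")
print("independent data:", round(stat_ind, 1), "with", dof, "degrees of freedom")
```

A large statistic compared with the chi-square critical value (about 12.59 at the 5% level for 6 degrees of freedom) suggests the variables are interrelated enough for factor analysis to be worth trying.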
4 - Purpose of Factor Analysis
The primary purpose of factor analysis is data reduction.
Having too many related fields can make it difficult to analyze the data, so factor analysis reduces the number of variables.
Factor analysis also helps in latent variable discovery, as we saw in the examples before.
Some factors, such as empathy, cannot be measured directly, but they can be formulated using other variables.
Factor analysis supports simplification of items into a subset of concepts.
Sometimes many fields in our data signify the same thing: in the restaurant example, delay in serving, staff behavior and cleanliness signify the same factor, which is service.
Moreover, with factor analysis you can assess the dimensionality and homogeneity of the data.
5 - Types of Factor Analysis
Factor analysis can be broadly classified into two types: EFA and CFA.
Exploratory factor analysis (EFA) is used to discover the underlying structure in the data using something like a correlation matrix; it is used for getting insights out of the data.
Confirmatory factor analysis (CFA) is based on the insights derived in EFA.
CFA is used to test those expectations, and it makes use of equations for modeling the structure.
EFA is further divided into many types:
the very popular PCA, or principal component analysis,
common factor analysis, or just factor analysis,
image factoring, which makes use of a correlation matrix derived out of OLS regression,
the maximum likelihood method, which is again based on the correlation matrix,
and other methods such as alpha factoring and weighted least squares.
Out of these, the most commonly used ones are principal component analysis and common factor analysis.
6 - Issues with Factor Analysis
First, you need to understand whether to use principal component analysis or factor analysis.
Next, you should know how to interpret the results of your analysis.
Finally, you need to figure out how many factors to pick.
Let's address these issues one by one.
Principal component analysis tries to find variables that are composites of the observed variables.
For example, in a house pricing data set, PCA would identify that the air quality index is closely determined by the number of parks in the locality.
In factor analysis, on the other hand, we assume that there are some latent factors, some immeasurable factors, which can only be derived out of the given numeric variables.
Secondly, in the case of PCA we take into account the total variance in the data, that is, the sum of unique variance, variance due to error and common variance.
In factor analysis, however, only the common variance, or shared variance, is considered.
So when you want to find latent variables using many variables, use factor analysis, and when you want to eliminate some variables that have high variance, use PCA.
When the number of variables is more than 30, the results of PCA and FA are practically the same.
Next, you need to address how to interpret the results of the analysis.
For this we use something called loadings.
A factor loading is basically the correlation coefficient between a variable and a factor.
Let's say you have 10 variables that you want to reduce to 3 factors.
For that you make a table of loadings, with one row per variable and one column per factor, to account for how much of the variance of each variable is explained by each factor.
The squared loading is the share of the variable's variance explained by the factor, and it ranges from 0 to 1.
So if a significant amount of the variance is explained by a factor, the variable can be denoted using that factor.
For deeper analysis you can calculate the communality of a variable.
It is given by the horizontal sum of squares of the values in its row; for example, for a variable 1 with loadings 0.7, 0.2 and 0.1 it would be 0.7² + 0.2² + 0.1² = 0.54.
Similarly, the vertical sum of squares of the values for a factor is called its eigenvalue; for example, for factor 1 the eigenvalue will be 0.7² + 0.4² + 0.7² + 0.1² and so on down the column.
Also, sometimes a particular variable shows high correlation with more than one factor.
This is called cross loading, and in this scenario factor rotation should be performed.
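The row and column sums described above are easy to check by hand or in a few lines of Python. The loading table below is hypothetical: it uses only four variables instead of ten, it reuses the numbers quoted in the text where they are given, and the remaining values are made up. The cutoff of 0.4 for a "high" loading is also just an assumption for the demo.

```python
# Hypothetical loading table: one row per variable, one column per factor.
loadings = [
    [0.7, 0.2, 0.1],  # variable 1
    [0.4, 0.6, 0.2],  # variable 2
    [0.7, 0.1, 0.3],  # variable 3
    [0.1, 0.2, 0.8],  # variable 4
]
n_factors = len(loadings[0])

# Communality: horizontal (row-wise) sum of squared loadings.
communalities = [sum(value**2 for value in row) for row in loadings]

# Eigenvalue: vertical (column-wise) sum of squared loadings.
eigenvalues = [sum(row[j]**2 for row in loadings) for j in range(n_factors)]

# Cross loading: a variable with a high loading on more than one factor.
threshold = 0.4  # assumed cutoff for the demo
cross_loaded = [
    index for index, row in enumerate(loadings)
    if sum(abs(value) >= threshold for value in row) > 1
]

print("communality of variable 1:", round(communalities[0], 2))
print("eigenvalue of factor 1:", round(eigenvalues[0], 2))
print("cross-loaded variables (0-based index):", cross_loaded)
```

With this table, variable 2 loads at or above the cutoff on both factor 1 and factor 2, so it is the one flagged as cross loaded.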
So we know how to interpret the results of the analysis; now how do we know how many factors to select?
When we talk about the sample size, the rule of thumb is to have a minimum of 5 observations per variable; that is, for 5 variables you should have 25 observations, for 10 variables 50 observations, and so on.
But when it comes to deriving the factors, let's say from a hundred variables, how do we know how many factors to take: 5, 8, 10, how many?
For this you can make a scree plot and notice the bend in the plot.
However, this is not very intuitive.
You can instead use the latent root criterion, which states that if the eigenvalue of a particular factor, that is the vertical sum of squares of all its values, is greater than 1, you should include that factor in your analysis.
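The latent root criterion can be sketched in a few lines of NumPy. This is a minimal illustration on simulated data, assuming six variables driven by two hidden factors; the eigenvalues are taken from the correlation matrix, which is one common way to apply the rule.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Two hypothetical latent factors, each driving three observed variables.
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
data = np.column_stack([
    0.8 * factor + 0.6 * rng.normal(size=n)
    for factor in (f1, f1, f1, f2, f2, f2)
])

corr = np.corrcoef(data, rowvar=False)

# Eigenvalues of the correlation matrix, largest first.
eigenvalues = np.linalg.eigvalsh(corr)[::-1]

# Latent root criterion: keep the factors whose eigenvalue exceeds 1.
n_factors = int(np.sum(eigenvalues > 1.0))

print("eigenvalues:", np.round(eigenvalues, 2))
print("factors to keep:", n_factors)
```

Plotting these same eigenvalues against their rank gives the scree plot; the bend appears right after the eigenvalues that the criterion keeps.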
7 - Basic Logic of Factor Analysis
Factor analysis basically starts with the items that you want to reduce.
It creates a mathematical combination of variables that maximizes the variance you can predict in all variables; this is the first principal component or factor.
A new combination of items from the residual variance, one that maximizes the variance you can predict in what is left, is your second component or factor.
Continue this until all the variance is accounted for, and then select the minimal number of factors.
With that, you can finally interpret the factors using the rotated matrix and loadings.
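The step-by-step logic above can be sketched as a small loop. This is a minimal illustration using principal component style extraction on a correlation matrix, not a full factor analysis routine: the function name `extract_sequentially`, the simulated data and the 70% cutoff for choosing the number of factors are all assumptions for the demo, and no rotation is performed.

```python
import numpy as np

def extract_sequentially(corr):
    """Pull out one component at a time from what is left of the variance."""
    residual = corr.copy()
    total_variance = np.trace(corr)
    loadings, explained = [], []
    for _ in range(corr.shape[0]):
        # Combination of variables that explains the most remaining variance.
        eigenvalues, eigenvectors = np.linalg.eigh(residual)
        top_value = max(eigenvalues[-1], 0.0)
        loading = eigenvectors[:, -1] * np.sqrt(top_value)
        loadings.append(loading)
        explained.append(top_value / total_variance)
        # Remove what this component explains; the rest is the residual.
        residual = residual - np.outer(loading, loading)
    return np.column_stack(loadings), np.array(explained), residual

rng = np.random.default_rng(3)
n = 1000
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
data = np.column_stack([
    0.8 * factor + 0.6 * rng.normal(size=n)
    for factor in (f1, f1, f1, f2, f2, f2)
])
corr = np.corrcoef(data, rowvar=False)

loadings, explained, residual = extract_sequentially(corr)
cumulative = np.cumsum(explained)

# Smallest number of components that reaches the assumed 70% cutoff.
n_keep = int(np.searchsorted(cumulative, 0.70) + 1)

print("cumulative share of variance:", np.round(cumulative, 2))
print("components kept:", n_keep)
```

Each pass explains less variance than the one before it, and once every component has been extracted the residual is essentially zero, which is what "all the variance is accounted for" means here.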