text-analysis Wiki

Status: Alpha

Brought to you by: kostia76

PCA

Principal Component Analysis

Principal Component Analysis (PCA) is a statistical method that reduces the dimensionality of data, based on the singular value decomposition of the data matrix. The data matrix is usually standardized first, so that every variable has zero mean and unit variance. After the transformation, the first few components often explain more than 80% of the total variability in the data.

For example, we can apply PCA to a dataset of food consumption in Europe (remember that the columns are the "variables" and the rows are the "observations"):

Dataset dataset = new ConsumesDataset();
//Note that the data are standardized by default
PCA pca = new PCA(dataset);
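To make the standardization step concrete, here is a minimal sketch in plain Java. It is not the library's code: the class `Standardize` and its method are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption about the convention.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the text-analysis library.
class Standardize {

    // Returns a copy of data (rows = observations, columns = variables)
    // in which every column has zero mean and unit sample variance.
    static double[][] standardize(double[][] data) {
        int n = data.length;
        int p = data[0].length;
        double[][] z = new double[n][p];
        for (int j = 0; j < p; j++) {
            // Column mean
            double mean = 0.0;
            for (int i = 0; i < n; i++) {
                mean += data[i][j];
            }
            mean /= n;

            // Sum of squared deviations from the mean
            double sumSq = 0.0;
            for (int i = 0; i < n; i++) {
                double d = data[i][j] - mean;
                sumSq += d * d;
            }
            // Sample standard deviation (n - 1 denominator); an assumption.
            double sd = Math.sqrt(sumSq / (n - 1));

            for (int i = 0; i < n; i++) {
                z[i][j] = (data[i][j] - mean) / sd;
            }
        }
        return z;
    }

    public static void main(String[] args) {
        // Made-up values, only to exercise the method.
        double[][] data = { {1.0, 10.0}, {2.0, 20.0}, {3.0, 60.0} };
        System.out.println(Arrays.deepToString(standardize(data)));
    }
}
```

Standardizing puts all variables on the same scale, so a variable measured in large units does not dominate the components.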

The method pca.getLoadings() returns the loadings matrix (also called the rotation matrix). Its columns contain the eigenvectors that result from the singular value decomposition of the data matrix, one column per principal component and one row per variable:

Loadings:
9 x 9 Matrix
                  PC1       PC2       PC3       PC4       PC5       PC6       PC7       PC8       PC9 
-------------------------------------------------------------------------------------------------------
Cereals      -0,34506  -0,17348   0,30008   0,56794   0,31430  -0,36281   0,29007  -0,13582  -0,32440 
Rice         -0,38674  -0,11027  -0,21989  -0,61732   0,08136   0,02645   0,11913  -0,36114  -0,50686 
Potatos      -0,15576   0,04827  -0,82593   0,29677   0,10746   0,14487   0,34987   0,21063   0,06217 
Sugar         0,46075  -0,09044   0,13497   0,17829   0,15298   0,70030   0,23249  -0,20414  -0,34638 
Vegetables   -0,47398   0,14198   0,23867  -0,00570  -0,09063   0,37585   0,32170  -0,35291   0,56629 
Meat         -0,06099   0,58728  -0,18647   0,29702   0,07486   0,02969  -0,52267  -0,48203  -0,12791 
Milk          0,37994  -0,33982  -0,23511   0,08746  -0,32460  -0,34206   0,16914  -0,62352   0,19359 
Butter        0,33329   0,35260   0,01921  -0,28320   0,67956  -0,26269   0,29842  -0,09142   0,23621 
Eggs          0,10492   0,58512   0,11765  -0,02015  -0,52788  -0,16765   0,48048   0,10515  -0,28931 
-------------------------------------------------------------------------------------------------------
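Because the columns of the loadings matrix are eigenvectors, they should have unit length and be orthogonal to each other. The sketch below checks this for the PC1 and PC2 columns, copied from the table above with decimal points instead of commas. The class `LoadingsCheck` is a hypothetical name used only for illustration.

```java
// Hypothetical check, not part of the text-analysis library.
class LoadingsCheck {

    // PC1 and PC2 columns of the loadings table (Cereals ... Eggs).
    static final double[] PC1 = {
        -0.34506, -0.38674, -0.15576, 0.46075, -0.47398,
        -0.06099, 0.37994, 0.33329, 0.10492
    };
    static final double[] PC2 = {
        -0.17348, -0.11027, 0.04827, -0.09044, 0.14198,
        0.58728, -0.33982, 0.35260, 0.58512
    };

    // Dot product of two vectors of equal length.
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // Squared lengths should be close to 1, the cross product close to 0
        // (not exact, because the table values are rounded to five decimals).
        System.out.println("PC1 . PC1 = " + dot(PC1, PC1));
        System.out.println("PC2 . PC2 = " + dot(PC2, PC2));
        System.out.println("PC1 . PC2 = " + dot(PC1, PC2));
    }
}
```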

Multiplying the standardized data by the loadings matrix gives the principal components, with one row per observation and one column per component. This is done by pca.getPrincipalComponents():

System.out.println(pca.getPrincipalComponents());

16 x 9 Matrix
         PC1      PC2      PC3      PC4      PC5      PC6      PC7      PC8      PC9 
--------------------------------------------------------------------------------------
BE   0,37849  1,99585 -0,33641 -0,18682  0,91013  0,55990  0,51312  0,16614 -0,22159 
DA   1,67816  1,16483  0,61827  0,87212 -0,81844 -0,19138 -0,87928 -0,01961 -0,07333 
GE   1,09153  1,94225  0,60132  0,39377  0,15590 -0,11882 -0,01142  0,81436  0,40630 
GR  -3,72041 -0,61806  0,92070  0,62920  0,14676  0,40787  0,76932 -0,04321  0,10951 
SP  -2,61001  1,43674 -0,89916 -0,43703 -1,81333  0,35335  0,23325 -0,27324 -0,01114 
FR   0,40232  2,55468 -0,00765 -0,56813  0,75972 -0,60384  0,15397 -0,26876 -0,05770 
IR   0,01593 -0,18156 -2,38952  2,31616  0,24270 -0,67552  0,14623 -0,30885  0,03887 
IT  -2,82628 -0,23894  2,22012  0,34721  0,44512 -0,56555 -0,46001 -0,29781 -0,03353 
OL   0,59225 -0,03335 -0,85894 -1,21065  0,34232  0,94398 -0,47259 -0,43417  0,38551 
PO  -2,15038 -1,86833 -1,01304 -0,24998  0,77325  0,14638 -0,79978  0,44463 -0,01240 
GB   0,28252 -1,04596 -0,57846 -0,13599  0,16043  0,50054 -0,08680  0,41699 -0,36286 
AU   0,46628  0,66384  0,34435 -0,46455 -0,41499  0,08343 -0,45497 -0,06536 -0,25601 
FI   1,08732 -1,57011 -0,22076 -1,52243  0,04219 -0,90776  0,21904 -0,75699  0,05784 
IS   3,18860 -1,47859  1,27266  1,21922  0,02909  0,86902  0,32071 -0,57305 -0,01343 
NO   0,89505 -2,02622  0,17441  0,04788 -0,81142 -0,33060  0,17350  0,65421  0,11255 
SV   1,22863 -0,69709  0,15211 -1,04998 -0,14943 -0,47101  0,63572  0,54473 -0,06859 
--------------------------------------------------------------------------------------
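The multiplication behind these values can be sketched as a plain matrix product. The class `Scores` and the tiny matrices in `main` are hypothetical and serve only as an illustration; the library may compute the same result differently.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the text-analysis library.
class Scores {

    // Computes Z * L, where Z is the standardized data (n observations x p
    // variables) and L is the loadings matrix (p variables x k components).
    static double[][] multiply(double[][] z, double[][] loadings) {
        int n = z.length;
        int p = loadings.length;
        int k = loadings[0].length;
        double[][] scores = new double[n][k];
        for (int i = 0; i < n; i++) {
            for (int c = 0; c < k; c++) {
                double sum = 0.0;
                for (int j = 0; j < p; j++) {
                    sum += z[i][j] * loadings[j][c];
                }
                scores[i][c] = sum;
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // Made-up 2 x 2 example: these loadings simply swap the two variables.
        double[][] z = { {1.0, 2.0}, {3.0, 4.0} };
        double[][] loadings = { {0.0, 1.0}, {1.0, 0.0} };
        System.out.println(Arrays.deepToString(multiply(z, loadings)));
    }
}
```

Each entry of the result is the dot product of one standardized observation with one column of the loadings matrix.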

Looking at the summary, we can see that the first four components explain about 85% of the variability in the data (row "Cumulative Proportion"):

System.out.println(pca.getSummary());

                             PC1      PC2      PC3      PC4      PC5      PC6      PC7      PC8      PC9 
----------------------------------------------------------------------------------------------------------
Standard deviation       1,85987  1,47071  1,06743  0,96667  0,69584  0,57018  0,49004  0,46293  0,20075 
Proportion of Variance   0,38435  0,24033  0,12660  0,10383  0,05380  0,03612  0,02668  0,02381  0,00448 
Cumulative Proportion    0,38435  0,62468  0,75128  0,85511  0,90891  0,94503  0,97171  0,99552  1,00000 
----------------------------------------------------------------------------------------------------------
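The three rows of the summary are linked: the variance of a component is the square of its standard deviation, and its proportion is that variance divided by the sum of all variances. The sketch below recomputes the cumulative proportions from the "Standard deviation" row (decimal points instead of commas). The class `ExplainedVariance` and its methods are hypothetical names, not part of the library.

```java
import java.util.Locale;

// Hypothetical helper, not part of the text-analysis library.
class ExplainedVariance {

    // "Standard deviation" row of the summary, PC1 ... PC9.
    static final double[] SD = {
        1.85987, 1.47071, 1.06743, 0.96667, 0.69584,
        0.57018, 0.49004, 0.46293, 0.20075
    };

    // Cumulative share of the total variance after each component.
    static double[] cumulativeProportion(double[] sd) {
        double total = 0.0;
        for (double s : sd) {
            total += s * s;
        }
        double[] cumulative = new double[sd.length];
        double running = 0.0;
        for (int i = 0; i < sd.length; i++) {
            running += sd[i] * sd[i];
            cumulative[i] = running / total;
        }
        return cumulative;
    }

    // Smallest number of leading components whose cumulative proportion
    // reaches the threshold (for example 0.80).
    static int componentsNeeded(double[] sd, double threshold) {
        double[] cumulative = cumulativeProportion(sd);
        for (int i = 0; i < cumulative.length; i++) {
            if (cumulative[i] >= threshold) {
                return i + 1;
            }
        }
        return sd.length;
    }

    public static void main(String[] args) {
        double[] cumulative = cumulativeProportion(SD);
        for (int i = 0; i < cumulative.length; i++) {
            System.out.println(String.format(Locale.ROOT, "PC%d %.5f", i + 1, cumulative[i]));
        }
        System.out.println("Components needed for 80%: " + componentsNeeded(SD, 0.80));
    }
}
```

With these values the sum of the variances is very close to 9, the number of standardized variables, which is what we expect when every variable has unit variance.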

The first four components can be used as input for further analysis (for example, to solve clustering or classification problems). This reduces the computational complexity and running time, and it also removes "noise" in the data and features that are not relevant.

Sometimes it may be interesting to know which variables have the most influence on the principal components. In this case, the correlations between the variables and the principal components help (method pca.computeCorrelations()):

                     PC1      PC2  
    -------------------------------
    Cereals     -0,64177 -0,25514  
    Rice        -0,71928 -0,16217 
    Potatos     -0,28969  0,07098 
    Sugar        0,85693 -0,13302  
    Vegetables  -0,88153  0,20880  
    Meat        -0,11344  0,86372 
    Milk         0,70664 -0,49978 
    Butter       0,61988  0,51857  
    Eggs         0,19513  0,86054  
    -------------------------------
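For standardized variables, the correlation between a variable and a component equals the variable's loading on that component multiplied by the component's standard deviation. The sketch below applies this to the PC1 column, using values copied from the loadings table and the summary (decimal points instead of commas). The class `Correlations` is a hypothetical name; how pca.computeCorrelations() works internally is not shown here.

```java
import java.util.Locale;

// Hypothetical helper, not part of the text-analysis library.
class Correlations {

    static final String[] VARIABLES = {
        "Cereals", "Rice", "Potatos", "Sugar", "Vegetables",
        "Meat", "Milk", "Butter", "Eggs"
    };

    // PC1 column of the loadings table.
    static final double[] PC1_LOADINGS = {
        -0.34506, -0.38674, -0.15576, 0.46075, -0.47398,
        -0.06099, 0.37994, 0.33329, 0.10492
    };

    // Standard deviation of PC1 from the summary.
    static final double PC1_SD = 1.85987;

    // Correlation of each standardized variable with one component:
    // loading times the standard deviation of that component.
    static double[] correlations(double[] loadings, double componentSd) {
        double[] result = new double[loadings.length];
        for (int i = 0; i < loadings.length; i++) {
            result[i] = loadings[i] * componentSd;
        }
        return result;
    }

    public static void main(String[] args) {
        double[] corr = correlations(PC1_LOADINGS, PC1_SD);
        for (int i = 0; i < corr.length; i++) {
            System.out.println(String.format(Locale.ROOT, "%-11s %8.5f", VARIABLES[i], corr[i]));
        }
    }
}
```

The results agree with the PC1 column of the table above up to rounding in the last decimal.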

We can see that the first component is characterized by a strong correlation with Sugar (positive) and Vegetables (negative), while the second is most influenced by Meat and Eggs (note that only the first two components are reported). It is also possible to visualize the correlations in a so-called Correlation Circle:

[Figure: Correlation Circle]

The diagram above was generated, in SVG format, with this code:

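//pca is the PCA instance created above
CorrelationCircle cc = new CorrelationCircle(pca);
//Collect the SVG output in memory (java.io.StringWriter)
StringWriter writer = new StringWriter();
SvgGraphics2D g2 = new SvgGraphics2D(writer);
g2.begin();
cc.paint(g2, null, 500, 550);
g2.end();
//Print the generated SVG document
System.out.println(writer.toString());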
