Principal Component Analysis (PCA) is a statistical method that reduces the dimensionality of data, based on the singular value decomposition of the data matrix. The data matrix is usually standardized so that each variable has zero mean and unit variance. After the transformation, the first few components often explain more than 80% of the total variability in the data.
For example, we can apply PCA to the dataset of food consumption in Europe (remember that the columns are called "variables" and the rows "observations"):
Dataset dataset = new ConsumesDataset();
//Note that the data are standardized by default
PCA pca = new PCA(dataset);
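To make the standardization step concrete, here is a minimal sketch in plain Java. It is not the library's implementation: the class name StandardizeSketch and the choice of the sample variance (divisor n - 1) are assumptions made for illustration only.

```java
import java.util.Arrays;

// Illustrative sketch only (hypothetical class, not part of the library).
class StandardizeSketch {

    // Returns a copy of data in which every column has mean 0 and variance 1.
    // Assumption: the sample variance (divisor n - 1) is used.
    static double[][] standardize(double[][] data) {
        int n = data.length;       // number of observations (rows)
        int p = data[0].length;    // number of variables (columns)
        double[][] z = new double[n][p];
        for (int j = 0; j < p; j++) {
            // Column mean
            double mean = 0.0;
            for (int i = 0; i < n; i++) {
                mean += data[i][j];
            }
            mean /= n;
            // Column standard deviation
            double ss = 0.0;
            for (int i = 0; i < n; i++) {
                double d = data[i][j] - mean;
                ss += d * d;
            }
            double sd = Math.sqrt(ss / (n - 1));
            // Center and scale
            for (int i = 0; i < n; i++) {
                z[i][j] = (data[i][j] - mean) / sd;
            }
        }
        return z;
    }

    public static void main(String[] args) {
        double[][] data = { {1.0, 10.0}, {2.0, 20.0}, {3.0, 60.0} };
        for (double[] row : standardize(data)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Standardizing puts variables measured on different scales on an equal footing, so that no single variable dominates the components merely because of its units.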
The method pca.getLoadings() returns the loadings matrix (also called the rotation matrix), whose columns contain the eigenvectors resulting from the singular value decomposition of the data matrix:
Loadings:
9 x 9 Matrix
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
-------------------------------------------------------------------------------------------------------
Cereals -0,34506 -0,17348 0,30008 0,56794 0,31430 -0,36281 0,29007 -0,13582 -0,32440
Rice -0,38674 -0,11027 -0,21989 -0,61732 0,08136 0,02645 0,11913 -0,36114 -0,50686
Potatos -0,15576 0,04827 -0,82593 0,29677 0,10746 0,14487 0,34987 0,21063 0,06217
Sugar 0,46075 -0,09044 0,13497 0,17829 0,15298 0,70030 0,23249 -0,20414 -0,34638
Vegetables -0,47398 0,14198 0,23867 -0,00570 -0,09063 0,37585 0,32170 -0,35291 0,56629
Meat -0,06099 0,58728 -0,18647 0,29702 0,07486 0,02969 -0,52267 -0,48203 -0,12791
Milk 0,37994 -0,33982 -0,23511 0,08746 -0,32460 -0,34206 0,16914 -0,62352 0,19359
Butter 0,33329 0,35260 0,01921 -0,28320 0,67956 -0,26269 0,29842 -0,09142 0,23621
Eggs 0,10492 0,58512 0,11765 -0,02015 -0,52788 -0,16765 0,48048 0,10515 -0,28931
-------------------------------------------------------------------------------------------------------
Multiplying the standardized data by the loadings gives the principal components, which are returned by pca.getPrincipalComponents():
System.out.println(pca.getPrincipalComponents());
16 x 9 Matrix
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
--------------------------------------------------------------------------------------
BE 0,37849 1,99585 -0,33641 -0,18682 0,91013 0,55990 0,51312 0,16614 -0,22159
DA 1,67816 1,16483 0,61827 0,87212 -0,81844 -0,19138 -0,87928 -0,01961 -0,07333
GE 1,09153 1,94225 0,60132 0,39377 0,15590 -0,11882 -0,01142 0,81436 0,40630
GR -3,72041 -0,61806 0,92070 0,62920 0,14676 0,40787 0,76932 -0,04321 0,10951
SP -2,61001 1,43674 -0,89916 -0,43703 -1,81333 0,35335 0,23325 -0,27324 -0,01114
FR 0,40232 2,55468 -0,00765 -0,56813 0,75972 -0,60384 0,15397 -0,26876 -0,05770
IR 0,01593 -0,18156 -2,38952 2,31616 0,24270 -0,67552 0,14623 -0,30885 0,03887
IT -2,82628 -0,23894 2,22012 0,34721 0,44512 -0,56555 -0,46001 -0,29781 -0,03353
OL 0,59225 -0,03335 -0,85894 -1,21065 0,34232 0,94398 -0,47259 -0,43417 0,38551
PO -2,15038 -1,86833 -1,01304 -0,24998 0,77325 0,14638 -0,79978 0,44463 -0,01240
GB 0,28252 -1,04596 -0,57846 -0,13599 0,16043 0,50054 -0,08680 0,41699 -0,36286
AU 0,46628 0,66384 0,34435 -0,46455 -0,41499 0,08343 -0,45497 -0,06536 -0,25601
FI 1,08732 -1,57011 -0,22076 -1,52243 0,04219 -0,90776 0,21904 -0,75699 0,05784
IS 3,18860 -1,47859 1,27266 1,21922 0,02909 0,86902 0,32071 -0,57305 -0,01343
NO 0,89505 -2,02622 0,17441 0,04788 -0,81142 -0,33060 0,17350 0,65421 0,11255
SV 1,22863 -0,69709 0,15211 -1,04998 -0,14943 -0,47101 0,63572 0,54473 -0,06859
--------------------------------------------------------------------------------------
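The projection behind the principal components is an ordinary matrix product. The sketch below illustrates it with plain arrays; the class ScoresSketch and the method project are hypothetical names, not part of the library, and the input is assumed to be already standardized.

```java
import java.util.Arrays;

// Illustrative sketch only (hypothetical class, not part of the library).
class ScoresSketch {

    // Computes scores = z * loadings, where
    //   z        is n x p (standardized observations in rows),
    //   loadings is p x k (one eigenvector per column).
    static double[][] project(double[][] z, double[][] loadings) {
        int n = z.length;
        int p = loadings.length;
        int k = loadings[0].length;
        double[][] scores = new double[n][k];
        for (int i = 0; i < n; i++) {
            for (int c = 0; c < k; c++) {
                double sum = 0.0;
                for (int j = 0; j < p; j++) {
                    sum += z[i][j] * loadings[j][c];
                }
                scores[i][c] = sum;
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // Toy example: loadings that simply swap the two variables.
        double[][] z = { {1.0, 2.0}, {3.0, 4.0} };
        double[][] loadings = { {0.0, 1.0}, {1.0, 0.0} };
        System.out.println(Arrays.deepToString(project(z, loadings)));
    }
}
```

Each row of the result holds the coordinates of one observation in the space of the principal components, as in the 16 x 9 matrix above.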
Looking at the summary, we can see that the first four components explain about 85% of the data variability (row "Cumulative Proportion"):
System.out.println(pca.getSummary());
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
----------------------------------------------------------------------------------------------------------
Standard deviation 1,85987 1,47071 1,06743 0,96667 0,69584 0,57018 0,49004 0,46293 0,20075
Proportion of Variance 0,38435 0,24033 0,12660 0,10383 0,05380 0,03612 0,02668 0,02381 0,00448
Cumulative Proportion 0,38435 0,62468 0,75128 0,85511 0,90891 0,94503 0,97171 0,99552 1,00000
----------------------------------------------------------------------------------------------------------
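The three rows of the summary are linked: the variance of a component is the square of its standard deviation, and its proportion is that variance divided by the sum of all variances. The sketch below recomputes the last two rows from the first. The class VarianceSketch is a hypothetical helper, not part of the library, and the standard deviations are copied from the summary above.

```java
// Illustrative sketch only (hypothetical class, not part of the library).
class VarianceSketch {

    // proportion[k] = sd[k]^2 / sum over all components of sd^2
    static double[] proportionOfVariance(double[] sd) {
        double total = 0.0;
        for (double s : sd) {
            total += s * s;
        }
        double[] proportion = new double[sd.length];
        for (int k = 0; k < sd.length; k++) {
            proportion[k] = sd[k] * sd[k] / total;
        }
        return proportion;
    }

    // cumulative[k] = proportion[0] + ... + proportion[k]
    static double[] cumulative(double[] proportion) {
        double[] cum = new double[proportion.length];
        double running = 0.0;
        for (int k = 0; k < proportion.length; k++) {
            running += proportion[k];
            cum[k] = running;
        }
        return cum;
    }

    public static void main(String[] args) {
        // Standard deviations taken from the summary table.
        double[] sd = { 1.85987, 1.47071, 1.06743, 0.96667, 0.69584,
                        0.57018, 0.49004, 0.46293, 0.20075 };
        double[] cum = cumulative(proportionOfVariance(sd));
        for (int k = 0; k < cum.length; k++) {
            System.out.printf("PC%d cumulative proportion: %.5f%n", k + 1, cum[k]);
        }
    }
}
```

Because the nine variables are standardized, the squared standard deviations sum to approximately 9, the number of variables.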
The first four components can be used as input for further analysis (for example, to solve clustering or classification problems). This reduces the computational complexity and time, and also filters out "noise" in the data and features that aren't relevant.
Sometimes it may be interesting to know which variables have the most influence on the principal components. In this case the correlations between the variables and the principal components help (method pca.computeCorrelations()):
PC1 PC2
-------------------------------
Cereals -0,64177 -0,25514
Rice -0,71928 -0,16217
Potatos -0,28969 0,07098
Sugar 0,85693 -0,13302
Vegetables -0,88153 0,20880
Meat -0,11344 0,86372
Milk 0,70664 -0,49978
Butter 0,61988 0,51857
Eggs 0,19513 0,86054
-------------------------------
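For standardized data, the correlation between a variable and a principal component equals the variable's loading on that component multiplied by the component's standard deviation. The sketch below checks this against the values printed above; the class CorrelationSketch is a hypothetical helper, and how pca.computeCorrelations() works internally is not shown in this document.

```java
// Illustrative sketch only (hypothetical class, not part of the library).
class CorrelationSketch {

    // Correlation between a standardized variable and a component:
    // loading of the variable on the component times the component's
    // standard deviation.
    static double correlation(double loading, double componentSd) {
        return loading * componentSd;
    }

    public static void main(String[] args) {
        double sdPc1 = 1.85987;  // from the summary, PC1
        double sdPc2 = 1.47071;  // from the summary, PC2
        // Loadings taken from the loadings matrix.
        System.out.printf("Cereals/PC1: %.5f%n", correlation(-0.34506, sdPc1));
        System.out.printf("Sugar/PC1:   %.5f%n", correlation(0.46075, sdPc1));
        System.out.printf("Meat/PC2:    %.5f%n", correlation(0.58728, sdPc2));
    }
}
```

The results agree with the table up to rounding, for example -0.64177 for Cereals on PC1 and 0.86372 for Meat on PC2.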
We can see that the first component is characterized by a strong correlation with Sugar (positive) and Vegetables (negative), while the second is most influenced by Meat and Eggs (note that only the first two components are reported). It's also possible to visualize the correlations in a so-called Correlation Circle:
The diagram above was generated in SVG format with this code:
CorrelationCircle cc = new CorrelationCircle(getPCA());
StringWriter writer = new StringWriter();
SvgGraphics2D g2 = new SvgGraphics2D(writer);
g2.begin();
cc.paint(g2, null, 500, 550);
g2.end();
System.out.println(writer.toString());