Principal Component Analysis (PCA) is a statistical method that reduces the dimensionality of data, based on the singular value decomposition of the data matrix. The data matrix is usually standardized so that each variable has zero mean and unit variance. After the transformation, the first few components often explain more than 80% of the total variability in the data.
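The standardization step can be sketched in plain Java as follows. This is only an illustration, not the library's implementation: the class name Standardizer is hypothetical, and the use of the sample standard deviation (divisor n - 1) is an assumption.

```java
// Hypothetical helper, not part of the library described in this document.
final class Standardizer {

    /**
     * Returns a standardized copy of the data matrix.
     * Rows are observations, columns are variables.
     * Assumes every column has a non-zero variance; a constant column
     * would lead to a division by zero (NaN values).
     */
    static double[][] standardize(double[][] data) {
        int n = data.length;
        int p = data[0].length;
        double[][] z = new double[n][p];
        for (int j = 0; j < p; j++) {
            // Column mean
            double mean = 0.0;
            for (int i = 0; i < n; i++) {
                mean += data[i][j];
            }
            mean /= n;
            // Sample standard deviation (divisor n - 1, an assumption)
            double sumSquares = 0.0;
            for (int i = 0; i < n; i++) {
                double d = data[i][j] - mean;
                sumSquares += d * d;
            }
            double sd = Math.sqrt(sumSquares / (n - 1));
            // Center and scale the column
            for (int i = 0; i < n; i++) {
                z[i][j] = (data[i][j] - mean) / sd;
            }
        }
        return z;
    }
}
```

After this step every column has mean 0 and variance 1, so variables measured on different scales contribute equally to the analysis.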
For example, we can apply PCA to a dataset of food consumption in Europe (recall that the columns are the "variables" and the rows are the "observations"):
Dataset dataset = new ConsumesDataset();
// Note that the data are standardized by default
PCA pca = new PCA(dataset);
The method pca.getLoadings() returns the loadings matrix (also called the rotation matrix), whose columns contain the eigenvectors obtained from the singular value decomposition of the data matrix:
Loadings: 9 x 9 Matrix
                 PC1       PC2       PC3       PC4       PC5       PC6       PC7       PC8       PC9
----------------------------------------------------------------------------------------------------
Cereals     -0,34506  -0,17348   0,30008   0,56794   0,31430  -0,36281   0,29007  -0,13582  -0,32440
Rice        -0,38674  -0,11027  -0,21989  -0,61732   0,08136   0,02645   0,11913  -0,36114  -0,50686
Potatos     -0,15576   0,04827  -0,82593   0,29677   0,10746   0,14487   0,34987   0,21063   0,06217
Sugar        0,46075  -0,09044   0,13497   0,17829   0,15298   0,70030   0,23249  -0,20414  -0,34638
Vegetables  -0,47398   0,14198   0,23867  -0,00570  -0,09063   0,37585   0,32170  -0,35291   0,56629
Meat        -0,06099   0,58728  -0,18647   0,29702   0,07486   0,02969  -0,52267  -0,48203  -0,12791
Milk         0,37994  -0,33982  -0,23511   0,08746  -0,32460  -0,34206   0,16914  -0,62352   0,19359
Butter       0,33329   0,35260   0,01921  -0,28320   0,67956  -0,26269   0,29842  -0,09142   0,23621
Eggs         0,10492   0,58512   0,11765  -0,02015  -0,52788  -0,16765   0,48048   0,10515  -0,28931
----------------------------------------------------------------------------------------------------
Multiplying the (standardized) data matrix by the loadings gives the principal components. This is done by pca.getPrincipalComponents():
System.out.println(pca.getPrincipalComponents());

16 x 9 Matrix
           PC1       PC2       PC3       PC4       PC5       PC6       PC7       PC8       PC9
----------------------------------------------------------------------------------------------
BE     0,37849   1,99585  -0,33641  -0,18682   0,91013   0,55990   0,51312   0,16614  -0,22159
DA     1,67816   1,16483   0,61827   0,87212  -0,81844  -0,19138  -0,87928  -0,01961  -0,07333
GE     1,09153   1,94225   0,60132   0,39377   0,15590  -0,11882  -0,01142   0,81436   0,40630
GR    -3,72041  -0,61806   0,92070   0,62920   0,14676   0,40787   0,76932  -0,04321   0,10951
SP    -2,61001   1,43674  -0,89916  -0,43703  -1,81333   0,35335   0,23325  -0,27324  -0,01114
FR     0,40232   2,55468  -0,00765  -0,56813   0,75972  -0,60384   0,15397  -0,26876  -0,05770
IR     0,01593  -0,18156  -2,38952   2,31616   0,24270  -0,67552   0,14623  -0,30885   0,03887
IT    -2,82628  -0,23894   2,22012   0,34721   0,44512  -0,56555  -0,46001  -0,29781  -0,03353
OL     0,59225  -0,03335  -0,85894  -1,21065   0,34232   0,94398  -0,47259  -0,43417   0,38551
PO    -2,15038  -1,86833  -1,01304  -0,24998   0,77325   0,14638  -0,79978   0,44463  -0,01240
GB     0,28252  -1,04596  -0,57846  -0,13599   0,16043   0,50054  -0,08680   0,41699  -0,36286
AU     0,46628   0,66384   0,34435  -0,46455  -0,41499   0,08343  -0,45497  -0,06536  -0,25601
FI     1,08732  -1,57011  -0,22076  -1,52243   0,04219  -0,90776   0,21904  -0,75699   0,05784
IS     3,18860  -1,47859   1,27266   1,21922   0,02909   0,86902   0,32071  -0,57305  -0,01343
NO     0,89505  -2,02622   0,17441   0,04788  -0,81142  -0,33060   0,17350   0,65421   0,11255
SV     1,22863  -0,69709   0,15211  -1,04998  -0,14943  -0,47101   0,63572   0,54473  -0,06859
----------------------------------------------------------------------------------------------
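The multiplication behind these scores can be sketched in plain Java. This is a minimal illustration under the assumption that the data matrix is already standardized; the class name ScoreProjection and its method are hypothetical and not part of the library.

```java
// Hypothetical helper, not part of the library described in this document.
final class ScoreProjection {

    /**
     * Computes the principal component scores as (data x loadings).
     *
     * @param z        standardized data, n observations x p variables
     * @param loadings loadings matrix, p variables x k components
     * @return scores, n observations x k components
     */
    static double[][] project(double[][] z, double[][] loadings) {
        int n = z.length;
        int p = loadings.length;
        int k = loadings[0].length;
        double[][] scores = new double[n][k];
        for (int i = 0; i < n; i++) {
            for (int c = 0; c < k; c++) {
                double sum = 0.0;
                // Dot product of observation i with loading column c
                for (int j = 0; j < p; j++) {
                    sum += z[i][j] * loadings[j][c];
                }
                scores[i][c] = sum;
            }
        }
        return scores;
    }
}
```

Passing only the first few columns of the loadings matrix yields the reduced representation directly, which is how the dimensionality reduction is obtained in practice.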
Looking at the summary, we can see that the first four components explain about 85% of the data variability (row "Cumulative Proportion"):
System.out.println(pca.getSummary());

                             PC1       PC2       PC3       PC4       PC5       PC6       PC7       PC8       PC9
----------------------------------------------------------------------------------------------------------------
Standard deviation       1,85987   1,47071   1,06743   0,96667   0,69584   0,57018   0,49004   0,46293   0,20075
Proportion of Variance   0,38435   0,24033   0,12660   0,10383   0,05380   0,03612   0,02668   0,02381   0,00448
Cumulative Proportion    0,38435   0,62468   0,75128   0,85511   0,90891   0,94503   0,97171   0,99552   1,00000
----------------------------------------------------------------------------------------------------------------
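The "Proportion of Variance" and "Cumulative Proportion" rows follow from the standard deviations: each component's variance (the squared standard deviation) is divided by the total variance. A minimal sketch of that arithmetic, with a hypothetical class name VarianceSummary that is not part of the library:

```java
// Hypothetical helper, not part of the library described in this document.
final class VarianceSummary {

    /** Proportion of the total variance explained by each component. */
    static double[] proportions(double[] standardDeviations) {
        double total = 0.0;
        for (double sd : standardDeviations) {
            total += sd * sd; // variance = squared standard deviation
        }
        double[] result = new double[standardDeviations.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = standardDeviations[i] * standardDeviations[i] / total;
        }
        return result;
    }

    /** Running sum of the proportions. */
    static double[] cumulative(double[] proportions) {
        double[] result = new double[proportions.length];
        double running = 0.0;
        for (int i = 0; i < proportions.length; i++) {
            running += proportions[i];
            result[i] = running;
        }
        return result;
    }
}
```

With the nine standard deviations listed in the summary, the first proportion comes out at roughly 0.384 and the fourth cumulative value at roughly 0.855, in line with the printed rows. Because the data are standardized, the total variance equals the number of variables (9).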
The first four components can be used as input for further analysis (for example, to solve clustering or classification problems). This reduces the computational complexity and running time, and it also filters out "noise" and features that aren't relevant.
Sometimes it may be interesting to know which variables have the most influence on the principal components. In this case the correlations between the variables and the principal components help (method pca.computeCorrelations()):
                 PC1       PC2
------------------------------
Cereals     -0,64177  -0,25514
Rice        -0,71928  -0,16217
Potatos     -0,28969   0,07098
Sugar        0,85693  -0,13302
Vegetables  -0,88153   0,20880
Meat        -0,11344   0,86372
Milk         0,70664  -0,49978
Butter       0,61988   0,51857
Eggs         0,19513   0,86054
------------------------------
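For standardized data, the correlation between a variable and a principal component equals the corresponding loading multiplied by the standard deviation of that component. The sketch below illustrates this relation; the class name LoadingCorrelations is hypothetical, and whether pca.computeCorrelations() works this way internally is an assumption.

```java
// Hypothetical helper, not part of the library described in this document.
final class LoadingCorrelations {

    /**
     * Correlation between variable j and component k for standardized data:
     * r[j][k] = loadings[j][k] * standardDeviations[k].
     *
     * @param loadings           p variables x k components
     * @param standardDeviations standard deviation of each component
     */
    static double[][] correlations(double[][] loadings, double[] standardDeviations) {
        int p = loadings.length;
        int k = standardDeviations.length;
        double[][] r = new double[p][k];
        for (int j = 0; j < p; j++) {
            for (int c = 0; c < k; c++) {
                r[j][c] = loadings[j][c] * standardDeviations[c];
            }
        }
        return r;
    }
}
```

As a check against the tables in this document: the Cereals loading on PC1 (-0,34506) times the PC1 standard deviation (1,85987) gives about -0,6418, which agrees with the correlation reported above.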
We can see that the first component is characterized by a strong correlation with Sugar (positive) and Vegetables (negative), while the second is most influenced by Meat and Eggs (note that only the first two components are reported). It's also possible to visualize the correlations in a so-called Correlation Circle:
The above diagram was generated with the following code (in SVG format):
CorrelationCircle cc = new CorrelationCircle(getPCA());
// Collect the SVG output in memory
StringWriter writer = new StringWriter();
SvgGraphics2D g2 = new SvgGraphics2D(writer);
g2.begin();
cc.paint(g2, null, 500, 550);
g2.end();
// Print the generated SVG markup
System.out.println(writer.toString());