
# Linear compression in python: PCA vs unsupervised feature selection

By Jonathan Landy

We illustrate the application of two linear compression algorithms in python: Principal component analysis (PCA) and least-squares feature selection. Both can be used to compress a passed array, and they both work by stripping out redundant columns from the array. The two differ in that PCA operates in a particular rotated frame, while the feature selection solution operates directly on the original columns. As we illustrate below, PCA always gives a stronger compression. However, the feature selection solution is often comparably strong, and its output has the benefit of being relatively easy to interpret — a virtue that is important for many applications.

We use our python package linselect to carry out efficient feature selection-based compression below — this is available on pypi (pip install linselect) and GitHub.

## Linear compression algorithms

To compress a data array having $n$ columns, linear compression algorithms begin by fitting a $k$-dimensional line, or hyperplane, to the data (with $k < n$). Any point in the hyperplane can be uniquely identified using a basis of $k$ components. Marking down each point's projected location in the hyperplane using these components then gives a $k$-column, compressed representation of the data. This idea is illustrated in Fig. 1, where a line is fit to some two-component data. Projecting the points onto the line and then marking down how far along the line each projected point sits, we obtain a one-column compression. Carrying out this process can be useful if storage space is at a premium or if any operations need to be applied to the array (operations will usually run much faster on the compressed format). Further, compressed data is often easier to interpret and visualize, thanks to its reduced dimension.
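To make the fit-then-project recipe concrete, here is a minimal sketch on synthetic data (not the post's code; all names are illustrative). It fits the best line through a two-column array via the SVD and compresses the array down to one column:

```python
import numpy as np

rng = np.random.RandomState(0)

# Two correlated columns: x2 is roughly 2 * x1 plus a little noise.
x1 = rng.randn(200)
X = np.column_stack([x1, 2.0 * x1 + 0.1 * rng.randn(200)])

# Fit the best 1-d line through the (centered) data via the SVD.
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
direction = Vt[0]  # unit vector along the fitted line

# Compress: each point's signed distance along the line (one column).
z = (X - mu) @ direction

# Decompress: place each point back on the fitted line.
X_approx = mu + np.outer(z, direction)

# The two-column array is well approximated by the one-column encoding.
print(np.abs(X - X_approx).max())
```

Because the second column is nearly a multiple of the first, almost nothing is lost in the round trip; the maximum reconstruction error is on the order of the injected noise.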

In this post, we consider two automated linear compression algorithms: principal component analysis (PCA) and least-squares unsupervised feature selection. These differ because they are obtained from different hyperplane fitting strategies: The PCA approach is obtained from the $k$-dimensional hyperplane fit that minimizes the data's total squared-projection error. In general, the independent variables of this fit — i.e., the $k$ components specifying locations in the fit plane — end up being some linear combinations of the original $x_i$'s. In contrast, the feature selection strategy intelligently picks a subset of the original array columns as predictors and then applies the usual least-squares fit to the others for compression. These approaches are illustrated in the left and right panels of Fig. 2 below. The two fit lines there look very similar, but the encodings returned by these strategies differ qualitatively: The 1-d compression returned by PCA is how far along the $PCA_1$ direction a point sits (this is some linear combination of $x_1$ and $x_2$ — see figure), while the feature selection solution simply returns each point's $x_1$ value. One of our goals here is to explain why this difference can favor the feature selection approach in certain applications.
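The contrast can be sketched in a few lines of numpy and scikit-learn on synthetic two-column data (illustrative names; not the original post's code). Both one-column encodings reconstruct the data comparably well, but only the feature-selection encoding is itself an original column:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
x1 = rng.randn(300)
X = np.column_stack([x1, x1 + 0.2 * rng.randn(300)])

# PCA encoding: distance along the top principal direction, a mix of x1 and x2.
pca = PCA(n_components=1)
z_pca = pca.fit_transform(X)

# Feature-selection encoding: keep x1 itself, then fit x2 ~ x1 by least squares.
z_fs = X[:, [0]]
slope, intercept = np.polyfit(X[:, 0], X[:, 1], 1)
X_fs_approx = np.column_stack([z_fs[:, 0], slope * z_fs[:, 0] + intercept])

# Both 1-d encodings reconstruct the 2-d data to similar accuracy, though the
# PCA error is never larger (it minimizes the squared projection error).
err_pca = ((X - pca.inverse_transform(z_pca)) ** 2).mean()
err_fs = ((X - X_fs_approx) ** 2).mean()
print(err_pca, err_fs)
```

The trade-off previewed above is visible here: `z_pca` is an abstract mixture of the two columns, while `z_fs` is just $x_1$, which keeps its original meaning.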

Our post proceeds as follows: In the next section, we consider two representative applications in python: (1) the compression of a data set of tech-sector stock price quotes, and (2) the visualization of some economic summary statistics on the G20 nations. Working through these applications, we familiarize ourselves with the output of the two algorithms and, through contrast, highlight their relative virtues. The discussion section summarizes what we learn. Finally, a short appendix covers some of the formal mathematics of compression. There, we prove that linear compression-decompression operators are always projections.

Fig. 2. A cartoon illustrating the projection that results when applying PCA (left) and unsupervised feature selection via linselect (right): the original 2-d points (big dots) are replaced by their effectively 1-d approximations (small dots).

## Applications

Both data sets explored below are available on our GitHub.

### Stock prices

In this section, we apply our algorithms to a prepared data set of one year's worth of daily percentage price lifts on 50 individual tech stocks. We expect these stocks to each be governed by a common set of market forces, motivating the idea that a substantial compression might be possible. This is true, and the compressed arrays that result may be more efficiently operated on, as noted above. In addition, we'll see below that we can learn something about the full data set by examining the compression outputs.

The code below loads our data, smooths it over a running 30-day window (to remove idiosyncratic noise that is not of much interest), prints out the first three rows, compresses the data using our two methods, and finally prints out the first five PCA components and the top five selected stocks.

```python
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from linselect import FwdSelect

# CONSTANTS
KEEP = 5          # compression dimension
WINDOW_SIZE = 30  # smoothing window size

# LOAD AND SMOOTH THE DATA
df = pd.read_csv('stocks.csv')
df = df.rolling(WINDOW_SIZE).mean().iloc[WINDOW_SIZE:]
print(df.iloc[:3])

TICKERS = df.iloc[:, 1:].columns.values
X = df.iloc[:, 1:].values

# PCA COMPRESSION
s = StandardScaler()
pca = PCA(n_components=KEEP)
pca.fit(s.fit_transform(X))
X_compressed_pca = pca.transform(s.fit_transform(X))

# FEATURE SELECTION COMPRESSION
selector = FwdSelect()
selector.fit(X)
X_compressed_linselect = X[:, selector.ordered_features[:KEEP]]

# PRINT FIRST FIVE PCA COMPONENTS, TOP FIVE SELECTED STOCKS
print(pca.components_[:KEEP])
print(TICKERS[selector.ordered_features][:KEEP])
```

The output of the above print statements:

```
# The first three rows of the data frame:
          date      AAPL      ADBE       ADP      ADSK      AMAT      AMZN  \
30  2017-05-31  0.002821  0.002994  0.000248  0.009001  0.006451  0.003237
31  2017-06-01  0.003035  0.002776  0.000522  0.008790  0.005487  0.003450
32  2017-06-02  0.003112  0.002964 -0.000560  0.008573  0.005523  0.003705

        ASML      ATVI      AVGO  ...         T      TSLA       TSM  \
30  0.000755  0.005933  0.003988  ... -0.001419  0.004500  0.003590
31  0.002174  0.006369  0.003225  ... -0.001125  0.003852  0.004279
32  0.001566  0.006014  0.005343  ... -0.001216  0.004130  0.004358

        TWTR       TXN       VMW        VZ      WDAY       WDC      ZNGA
30  0.008292  0.001467  0.001984 -0.001741  0.006103  0.002916  0.007811
31  0.008443  0.001164  0.002026 -0.001644  0.006303  0.003510  0.008379
32  0.007796  0.000637  0.001310 -0.001333  0.006721  0.002836  0.008844

# PCA top components:
[[ 0.10548148,  0.20601986, -0.0126039 ,  0.20139121, ...],
 [-0.11739195,  0.02536787, -0.2044143 ,  0.08462741, ...],
 [ 0.03251305,  0.10796197, -0.00463919, -0.17564998, ...],
 [ 0.08678107,  0.1931497 , -0.16850867,  0.16260134, ...],
 [-0.0174396 ,  0.01174769, -0.11617622, -0.01036602, ...]]

# Feature selector output:
['WDAY', 'PYPL', 'AMZN', 'LRCX', 'HPQ']
```

The arrays `X_compressed_pca` and `X_compressed_linselect` in the code block above are the two compressed versions of the original data array `X`. For each row, the first compression stores the amplitude of that date's stock changes along each of the first five PCA components (printed under `# PCA top components:` above), while the second compression is simply equal to the five columns of the original array corresponding to the stocks picked out by the selector (printed under `# Feature selector output:`).

#### Exploring the encodings

Working with the compressed arrays obtained above provides some immediate operational benefits: Manipulations of the compressed arrays can be carried out more quickly, and they require less memory for storage. Here, we review how valuable insight can also be obtained from our compressions — via study of the compression components.

First, we consider the PCA components. It turns out that these components are the eigenvectors of the correlation matrix of our data set ($X^T \cdot X$, with $X$ standardized) — that is, they are the collective fluctuation modes present in the data set (for those who have studied classical mechanics: imagine a system where the different stocks are masses connected by springs; these eigenvectors are the normal modes of that system). Using this fact, one can show that the components evolve in an uncorrelated manner. Further, one can show that projecting the data set down onto the top $k$ modes gives the minimum squared projection error of all possible $k$-component projections. The first component then describes the largest-amplitude fluctuation pattern exhibited in the data. From the PCA printout above, this is $[0.105, 0.206, -0.013, 0.201, \ldots]$. These coefficients tell us that when the first stock (AAPL) goes up by some amount, the second (ADBE) typically goes up by about twice as much (this follows from the fact that $0.206$ is about twice as big as $0.105$), etc. This isn't the full story, of course, because each day's movements are a superposition (sum) of the amplitudes along each of the PCA components. Including more of these components in a compression allows one to capture more of the detailed correlation patterns exhibited in the data. However, each additional PCA component provides progressively less value as one moves down the ranking — it is this fact that allows a good compression to be obtained using only a minority of these modes.
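Both facts quoted above (components as eigenvectors of the correlation matrix; uncorrelated evolution of the component amplitudes) are easy to verify numerically. A sketch on synthetic data, using scikit-learn's PCA:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(500, 4) @ rng.randn(4, 4)  # four correlated synthetic columns

Z = StandardScaler().fit_transform(X)
pca = PCA().fit(Z)
scores = pca.transform(Z)

# Correlation matrix of the standardized data.
corr = Z.T @ Z / len(Z)

# Each PCA component v satisfies corr @ v = lambda * v (eigenvector condition).
v = pca.components_[0]
print(np.allclose(corr @ v, (v @ corr @ v) * v))

# The component amplitudes are mutually uncorrelated: the off-diagonal
# entries of their covariance matrix vanish.
cov = np.cov(scores, rowvar=False)
print(np.allclose(cov, np.diag(np.diag(cov))))
```

Both checks print `True`: the components diagonalize the correlation matrix, which is exactly why their amplitudes fluctuate independently of one another.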

Whereas the PCA components directly encode the collective, correlated fluctuations exhibited in our data, the feature selection solution attempts to identify a minimally-redundant subset of the original array's columns — one that is representative of the full set. This strategy is best understood in the limit where the original columns fall into a set of discrete clusters (in our example, we might expect the businesses operating in a particular sub-sector to fall into a single cluster). In such cases, a good compression is obtained by selecting one representative column from each cluster: once the representatives are selected, each of the other members of a given cluster can be approximately reconstructed using its selected representative as a predictor. In the above, we see that our automated feature selector has worked well, in that the companies selected ('WDAY', 'PYPL', 'AMZN', 'LRCX', and 'HPQ') each operate in a different part of the tech landscape. In general, we can expect the feature selector to mimic the PCA approach, in that it will seek columns that fluctuate in a nearly orthogonal manner. However, whereas the PCA components highlight which columns fluctuate together, the feature selector attempts to throw out all but one of the columns that fluctuate together — a sort of dual approach.
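The cluster picture can be sketched on synthetic data (illustrative, assuming two latent drivers; not the post's code): keeping one representative per cluster and reconstructing the remaining columns by least squares recovers nearly all of the variance.

```python
import numpy as np

rng = np.random.RandomState(0)

# Two "clusters" of columns: columns 0-2 track one latent driver, 3-5 another.
a, b = rng.randn(400), rng.randn(400)
X = np.column_stack([a, a, a, b, b, b]) + 0.05 * rng.randn(400, 6)

kept = [0, 3]  # one representative column per cluster
S = np.column_stack([np.ones(400), X[:, kept]])  # kept columns plus intercept

# Least-squares coefficients reconstructing every column from the kept ones.
coef, *_ = np.linalg.lstsq(S, X, rcond=None)
X_approx = S @ coef

# Variance explained by the 2-column compression of the 6-column array.
r2 = 1 - ((X - X_approx) ** 2).sum() / ((X - X.mean(axis=0)) ** 2).sum()
print(round(r2, 3))
```

The kept columns reconstruct themselves exactly, and the other four are recovered up to the injected noise, so the two-column compression explains nearly all of the six-column array's variance.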

#### Compression strength

To decide how many compression components are needed for a given application, one need only consider the variance explained as a function of the compression dimension — this is equal to one minus the average squared error of the projections that result from the compressions (see the footnote for a visualization of the compression error). In the two python packages we're using, one can access these values as follows:

```python
>>> print(np.cumsum(pca.explained_variance_ratio_))
[ 0.223  0.367  0.493  0.598  0.696]
>>> print([var / 50.0 for var in selector.ordered_cods[:KEEP]])
[0.169, 0.316, 0.428, 0.530, 0.612]
```

The printed lines above show that both algorithms capture more than $50\%$ of the variance exhibited in the data using only four components; for the feature selector, that means just 4 of the 50 stocks. The PCA compressions are stronger at each dimension because PCA is unconstrained: it can use any linear combination of the initial features as its compression components, whereas the feature selector is constrained to use a subset of the original features.
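The identity used above, that variance explained equals one minus the normalized squared projection error, can be checked directly on synthetic data (a sketch, not the post's code):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(300, 8) @ rng.randn(8, 8)  # correlated synthetic columns
Xc = X - X.mean(axis=0)                  # center the data

k = 3
pca = PCA(n_components=k).fit(Xc)
X_approx = pca.inverse_transform(pca.transform(Xc))

# Variance explained = 1 - (squared projection error / total variance).
r2 = 1 - ((Xc - X_approx) ** 2).sum() / (Xc ** 2).sum()
print(np.isclose(r2, pca.explained_variance_ratio_.sum()))
```

This prints `True`: the cumulative `explained_variance_ratio_` reported by scikit-learn is exactly the $r^2$ of the compress-then-decompress round trip.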

A plot of the values above across all compression dimensions is shown in Fig. 3 below. Looking at this plot, we see an elbow somewhere between $5$ and $10$ retained components. This implies that our $50$-dimensional data set mostly lies within a subspace of dimension $k \in (5, 10)$. Using any $k$ in that interval will provide a decent compression and a satisfyingly large dimensional reduction — a typical result of applying these algorithms to large, raw data sets. Again, this is useful because it allows one to stop tracking redundant columns that offer little incremental value.

Fig. 3. Plots of the compression strength (coefficient of determination, $r^2$) for our two compression algorithms versus compression dimension. We see two things: (1) PCA gives a slightly stronger compression at each dimension, and (2) the full data set spans 50 dimensions, but the elbow in the plots suggests the data largely sits in a subspace of dimension between 5 and 10.