Today one of my followers on Twitter provided a link to an excellent article on Medium, written by a student of machine learning and titled “Stacked Neural Networks for Prediction.” The article was well-written and demonstrated good knowledge of the subject. In addition, the author understands the importance of expert knowledge in feature engineering. As I have written in several articles in the past, for example here, ML cannot find gold where there is none. Feature engineering is the key to success.
The author of the referenced article used a wavelet transform to denoise the data and extracted features with stacked autoencoders, then trained an LSTM on the extracted features. Stacked autoencoders are basically a compression method that attempts to identify key features of the data. Do not forget that classical TA is also essentially a lossy compression method.
After several steps involving the choice of optimizer, regularization and dropout, a model was trained on data from 01/2000 to 12/2008 with the following prediction for CVX stock in a validation sample starting on 01/2016 (as inferred from the chart of the stock below).
The author used MSE as the metric (equal to 2.11 in the CVX case) and made the following comment in the article:
It is evident that the results of using this neural network architecture is impressive and can be profitable if implemented into a strategy.
To start with, this was an excellent article coming from a student (A+ student of course) but the conclusion about the usefulness of machine learning as a prediction tool in the particular application is premature. Let us see why.
Although visually it appears that the model can closely track actual prices in unseen data (hopefully unseen), a 3-day moving average would accomplish the same objective with about the same MSE, as shown below:
In the above chart, price (red) is tracked closely by the 3-day simple moving average (black) with an MSE of 2.38 versus 2.11 for the ML model. In fact, changing to a 2-day moving average brings the MSE down to 2.03.
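For readers who want to run this kind of baseline check on their own data, here is a minimal sketch in plain Python. The function names, the `lag` parameter and the sample prices are my own illustrative assumptions; they are not from the referenced article, and they do not reproduce the CVX figures quoted above. With `lag=0` the function measures how closely the average tracks price on the same day; with `lag=1` it treats the average as a 1-step-ahead forecast.

```python
def moving_average(prices, window):
    """Simple moving average; entry i averages prices[i-window+1 .. i].

    The first window-1 entries are None because the window is incomplete.
    """
    out = [None] * len(prices)
    for i in range(window - 1, len(prices)):
        out[i] = sum(prices[i - window + 1 : i + 1]) / window
    return out


def ma_mse(prices, window, lag=1):
    """Mean squared error of a moving-average predictor.

    lag=0: compare MA(i) with price(i)     (same-day tracking error)
    lag=1: compare MA(i) with price(i + 1) (1-step-ahead forecast error)
    """
    ma = moving_average(prices, window)
    errors = [
        (prices[i + lag] - ma[i]) ** 2
        for i in range(window - 1, len(prices) - lag)
    ]
    if not errors:
        raise ValueError("not enough prices for this window and lag")
    return sum(errors) / len(errors)


# Hypothetical closing prices, for illustration only.
sample = [100.0, 101.0, 99.0, 102.0, 103.0]
for window in (2, 3):
    print(window, ma_mse(sample, window, lag=0), ma_mse(sample, window, lag=1))
```

Any ML forecast should be compared against a baseline of this kind on the same validation sample before its error metric is called impressive.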
Therefore, what the ML model essentially did was find a low-pass filter, but in a very complicated way. Put differently, the complicated ML model in a sense reinvented the moving average.
This is the first problem with ML applications in financial market forecasting. The other problem is that even if we have an abstract model developed via ML, how do we build an actual trading model? Low MSE, or any error metric for that matter, is not a sufficient condition for profitability. A metric that is more relevant in trading is the Sharpe ratio.
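A toy example makes the point about error metrics concrete. The sketch below uses made-up prices and two made-up forecasts that miss every next price by exactly 1.0, so their MSE is identical, yet one always gets the direction right and the other always gets it wrong. The trading rule (one share long when the forecast is above the current price, one share short otherwise) and all names are assumptions chosen for illustration.

```python
def mse(actual, forecast):
    """Mean squared error between two equally long sequences."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)


def directional_pnl(prices, forecasts):
    """P&L of trading one share per step on the forecast direction.

    forecasts[i] is the forecast of prices[i + 1]. We go long when the
    forecast is above the current price and short otherwise.
    """
    pnl = 0.0
    for i, f in enumerate(forecasts):
        position = 1 if f > prices[i] else -1
        pnl += position * (prices[i + 1] - prices[i])
    return pnl


# Hypothetical prices oscillating by 0.5 per step.
prices = [100.0, 100.5, 100.0, 100.5, 100.0]
next_prices = prices[1:]

# Both forecasts are off by exactly 1.0 at every step.
forecast_a = [101.5, 99.0, 101.5, 99.0]   # overshoots in the right direction
forecast_b = [99.5, 101.0, 99.5, 101.0]   # errs in the wrong direction

print(mse(next_prices, forecast_a), directional_pnl(prices, forecast_a))
print(mse(next_prices, forecast_b), directional_pnl(prices, forecast_b))
```

Since squaring discards the sign of the error, the metric cannot distinguish between these two forecasts, while the trading results are opposite.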
For example, consider a simple forecasting method based on moving averages, here the 3-day moving average from the example above. At every time step n, the value of the moving average MA3(n) is the 1-step-ahead forecast of price P(n+1). The error at step n is e(n) = P(n) - MA3(n). But notice that in MSE the error is squared, so its sign is lost. Therefore, that metric does not offer any indication of how to use the predictor to trade. This is a practical problem and its solution requires going beyond ML and understanding the dynamics of price action.
Specifically, traders usually go long when e(n) > 0 and short when e(n) < 0. Below is the backtest of this model; I have included a $0.01 commission per share for fully invested equity.
This is a disaster and points to an inverse model, i.e., long when e(n) < 0 and short when e(n) > 0. Below is the backtest of the inverse model with the same commission:
CAGR is 36% versus 13.8% for buy and hold and Sharpe is 1.82 versus 0.68 for buy and hold. But this is like a mean-reversion model: we buy when the predictor overshoots price and sell when it undershoots price. If mean-reversion changes to momentum, then the first model could apply but not always.
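The two backtests can be approximated with a few lines of plain Python. This is only a sketch under my own simplifying assumptions, not the platform used for the results above: the position decided at the close of step n is held until the close of step n+1, the account is always fully invested, a position flip trades twice the shares, a zero error keeps the previous position, the Sharpe ratio uses a zero risk-free rate and 252 periods per year, and the prices are hypothetical. The signal is the error defined earlier, price minus its 3-day moving average.

```python
import math
from statistics import mean, stdev


def backtest(prices, window=3, inverse=False, commission=0.01):
    """Per-step returns of a fully invested long/short rule.

    Signal: error = price - moving average at step n.
    Normal rule: long if error > 0, short if error < 0.
    Inverse rule: the opposite (mean-reversion style).
    commission is in dollars per share traded.
    """
    returns = []
    position = 0  # +1 long, -1 short, 0 flat (only before the first trade)
    for n in range(window - 1, len(prices) - 1):
        ma = sum(prices[n - window + 1 : n + 1]) / window
        error = prices[n] - ma
        if error == 0:
            target = position            # no signal: keep current position
        else:
            target = 1 if error > 0 else -1
            if inverse:
                target = -target
        # Shares traded per dollar of equity: |change| / price.
        cost = commission * abs(target - position) / prices[n]
        position = target
        returns.append(position * (prices[n + 1] / prices[n] - 1.0) - cost)
    return returns


def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe ratio with a zero risk-free rate (assumption)."""
    sd = stdev(returns)
    if sd == 0:
        return float("nan")
    return mean(returns) / sd * math.sqrt(periods_per_year)


def cagr(returns, periods_per_year=252):
    """Compound annual growth rate from per-step returns."""
    growth = math.prod(1.0 + r for r in returns)
    return growth ** (periods_per_year / len(returns)) - 1.0


# Hypothetical prices, for illustration only.
sample = [100.0, 101.0, 99.0, 102.0, 103.0, 101.0, 100.0, 102.0]
for inverse in (False, True):
    r = backtest(sample, window=3, inverse=inverse)
    print("inverse" if inverse else "normal", cagr(r), sharpe_ratio(r))
```

Before commission the two rules produce exactly opposite returns, which is why a disastrous normal model points to the inverse one; commission lowers both.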
Is advanced machine learning a very complicated way of reproducing the results of trivial forecasting methods? In many cases it appears so. The value of ML, in my experience working with successful hedge fund managers, is in its use as an added layer after features that have economic value are engineered. Abstract models that construct features by compressing data usually default to trivial forecasting methods. For example, one of our customers uses features engineered by our software DLPAL LS with SVM and hidden Markov models as an added layer to determine the best mix of securities for long/short models, on top of the initial rank based on our features. In such cases, ML can provide a significant advantage and higher returns. ML is extremely useful but it can also turn into an exercise in futility.