Getting Returns from RL Algorithms

Using algos from robotics and videogames to tackle the stock market

Recently I tried out a very interesting PyTorch challenge: getting two robots to play tennis against each other.

Of course, as entertaining as this is, not many people are working in videogame AI. If you’re exploring reinforcement learning, chances are you’re one of the many people fantasizing about one thing in particular: using reinforcement learning algorithms for timing the stock market.

If you’re reading an article on doing this, chances are you’ve already seen all the articles warning you about rookie mistakes. You may have come across statistics pointing to the average career length for a day-trader being only 3 weeks. Perhaps after all that, you’re still undeterred. After all, you work with numbers all the time. How hard could it be?

Since you’ve probably gone through all the stages of seeing those stock-trading horror stories, I’m probably wasting my breath trying to add further disclaimers. With that in mind, let’s see just how easy (or hard) it is to apply reinforcement learning to securities trading.

For the sake of our experiments, let’s suppose you have \$10,000 to trade. Let’s suppose you have the ability to buy and sell a given stock with negligible commissions paid to a broker. Let’s also suppose we’re leaving the capital gains tax hairball to a later time. Let’s also suppose that the closing prices we have to work with are all released at exactly the closing time from the stock exchange, and that there’s no movement happening based on these numbers coming out earlier. These are all pretty big assumptions, but let’s see how well our approaches work in this extremely, incredibly, almost insultingly simplified playground environment.

For each RL agent that we’re considering, we’re limiting each agent to be able to buy or sell just 1 share per transaction (no more than 1 share, and no fractional shares). We’ll pick a few blue chips from Nasdaq like MSFT, NVDA, AMD, and INTC, with the reasoning that trading volumes are usually high and that the price is slightly less likely to swing due to things like pump-and-dump schemes or insider trading.

Let’s review the agents we’re going to be using:

Non-RL and Dynamic-programming-based strategies

Classical RL Algorithms

Expansions on Q-Learning-Based RL

Expansions on Evolutionary Strategy

Let’s get started…

Non-RL and Dynamic-programming-based strategies

Before getting into fancy machine learning algorithms, we shouldn’t discount far simpler tools (even if only to use them as control groups).

1. Turtle-Trading Agent

It doesn’t get much simpler than turtle trading. This strategy revolves around buying a stock as close as possible to the start of a breakout, and selling just as quickly at the start of a drawdown.

Our own turtle-trading agent tracks two main signals, the 40-day maximum, and the 40-day minimum. Buy/Sell signals are dependent on whether the price has passed these. We apply this strategy to our four Nasdaq stocks for a full year.
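A minimal sketch of that breakout logic (the function name and exact tie-breaking rules here are my own, not necessarily the original implementation):

```python
import numpy as np

def turtle_signals(closes, window=40):
    """Emit +1 (buy), -1 (sell), or 0 (hold) for each day, based on
    whether the close breaks the prior `window`-day high or low."""
    signals = np.zeros(len(closes), dtype=int)
    for t in range(window, len(closes)):
        recent = closes[t - window:t]       # the prior `window` closes
        if closes[t] > recent.max():        # breakout above the 40-day high
            signals[t] = 1
        elif closes[t] < recent.min():      # drawdown below the 40-day low
            signals[t] = -1
    return signals
```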

Hmmmm…we can do better than this.

2. Moving-Average Agent

The popularity of this strategy is probably on par with Turtle trading. Moving Average (MA) analysis is exactly what it sounds like. Buy and Sell signals are based around price data that’s been smoothed out according to the average price in a given time window (this could be 30 minutes, 20 days, 10 weeks, 5 months, or whatever time interval the trader decides on). This average price is constantly updated as old data exits the time window and new data joins.

Our own moving-average agent is set to a 20-day moving average across the price data.
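As a sketch of how those signals might be computed, here is a simple price/moving-average crossover rule (the crossover criterion is my assumption about the setup, not necessarily the original one):

```python
import numpy as np

def moving_average_signals(closes, window=20):
    """Buy when the close crosses above its rolling mean, sell when it
    crosses below, and hold otherwise."""
    ma = np.convolve(closes, np.ones(window) / window, mode="valid")
    signals = np.zeros(len(closes), dtype=int)
    for i in range(1, len(ma)):
        t = i + window - 1                       # align ma[i] with closes[t]
        above_now = closes[t] > ma[i]
        above_before = closes[t - 1] > ma[i - 1]
        if above_now and not above_before:
            signals[t] = 1                       # upward crossover -> buy
        elif above_before and not above_now:
            signals[t] = -1                      # downward crossover -> sell
    return signals
```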

3. Signal Rolling Agent

In keeping with the theme of the previous agent, a signal-rolling agent works by looking for given buy/sell signals and evaluating whether those signals still hold after a given delay. If a signal does not hold, we flip the buy/sell signal.

Our version of a signal-rolling agent checks similar signals as the turtle-trader, but combines this with checking the movement of the market over 4 days.
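A rough sketch of the delay-confirmation idea (the flip-on-failure rule follows the description above; the names and the exact confirmation test are mine):

```python
import numpy as np

def roll_signals(raw_signals, delay=4):
    """Confirm each raw buy/sell signal against the signal `delay` steps
    later: keep it if it still holds, flip it otherwise."""
    rolled = np.zeros_like(raw_signals)
    for t in range(len(raw_signals) - delay):
        if raw_signals[t] != 0:
            still_holds = raw_signals[t + delay] == raw_signals[t]
            rolled[t] = raw_signals[t] if still_holds else -raw_signals[t]
    return rolled
```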

4. ABCD Strategy Agent

I’ll be honest, I personally find this to be one of the most annoying dynamic-programming-based strategies. That’s not actually due to the algorithm itself, or the process of coding it. It’s annoying because this is basically what so many of the “Quit your job and become a day trader” clickbait ads are hawking. These scammers charge you hundreds (or thousands) of dollars to teach you what is essentially a basic strategy you can find in many introductory algorithmic-finance textbooks (or even freshman-level algorithms classes). The worst part is that, while this algorithm is completely automatable, these creeps dwelling in the parts of webpages usually cut out with adblock insist on getting people to implement it manually, sitting in front of a computer screen all day buying and selling by hand (and of course this way they can claim “Oh, there’s nothing wrong with our system, you’re just not cut out for the day-trader life”). Even if you actually manage to earn money doing this manually, you will likely lose many days of your life to drudgery you’re not getting back (and that’s not even counting the premature aging all that stress will cause you).

And to top it all off, this ABCD strategy is the primary source of perhaps >90% of the low-effort technical-analysis memes out there.

Anyway…

Our ABCD-strategy agent scans the price history for the four pivot points (A, B, C, D) that define the pattern, and issues buy/sell signals based on where the current price sits in that pattern.
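A brute-force sketch of that pattern search, with the four nested loops that give the strategy its painful runtime (the specific pivot conditions below are illustrative, not the canonical ABCD rules):

```python
def find_abcd(closes):
    """Brute-force O(n^4) scan for a simplified ABCD pattern: a low A,
    a swing high B, a higher low C on the retrace, and a breakout D
    above B."""
    n = len(closes)
    patterns = []
    for a in range(n):
        for b in range(a + 1, n):
            for c in range(b + 1, n):
                for d in range(c + 1, n):
                    if (closes[a] < closes[b]                      # AB leg up
                            and closes[a] < closes[c] < closes[b]  # BC retrace
                            and closes[d] > closes[b]):            # CD breakout
                        patterns.append((a, b, c, d))
    return patterns
```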

It’s worth pointing out that, with 4 nested for loops, this strategy has an incredibly inefficient runtime complexity of $O(n^4)$. In fact, just running this backtest on only 365 data points took a full 2.45 seconds on my CPU. If you try to run this for longer, especially on a stream of data, you can see how the electricity costs alone would probably start to eat up any monetary gains. By comparison, RL algorithms with constant input sizes and near-constant inference times seem like a welcome change. Of course, the returns on this strategy are also an example of why one shouldn’t completely discount the simple, non-machine-learning-based approaches. It will be a lot easier to justify using reinforcement learning if we can get better returns with it than we can with a 20-lines-of-Python strategy.

With that in mind, let’s get into the reinforcement learning approaches to BUY/SELL signals.

Classical RL Algorithms

While this is an oversimplified taxonomy, we can focus on three main areas of reinforcement learning: Policy-gradient agents, Q-learning agents, and evolutionary algorithms.

Now we’re getting to our first truly trainable agents.

5. Policy-Gradient Agent

Policy gradient methods are a class of reinforcement learning techniques that optimize parametrized policies with respect to the expected return (the long-term cumulative reward) by gradient ascent.

Lilian Weng gives a really good overview of policy-gradient methods.

Of course, the concept of a policy gradient is not unique to any single sub-area of reinforcement learning (see this giant list of policy gradient algorithms). For our experiment, we create an agent that looks at the price data within given windows, and learns through gradient descent to predict the reward landscape of various actions (which in this case are just buy, sell, or do nothing). We also include terms in our training algorithm to properly discount rewards further in the future.

Our policy-gradient agent behaves as follows.
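Since the full agent is too long to inline here, this is the core REINFORCE-style update for a linear softmax policy over {buy, sell, hold}, sketched in NumPy (the linear architecture and learning rate are placeholders, not the agent's actual network):

```python
import numpy as np

def softmax(logits):
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def reinforce_step(W, window, action, discounted_return, lr=0.01):
    """One REINFORCE update for a linear softmax policy: scale the
    log-likelihood gradient of the sampled action by the discounted
    return, so profitable actions become more likely."""
    probs = softmax(W @ window)
    grad_logp = -np.outer(probs, window)   # gradient term for every action row
    grad_logp[action] += window            # extra term for the chosen action
    return W + lr * discounted_return * grad_logp
```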

6. Q-Learning Agent

Our previous policy-gradient agent worked by learning to predict rewards at various states. While an improvement on a lot of the dynamic-programming strategies we started with, this is still a little short-sighted. In fact, this problem frustrated reinforcement-learning researchers for a long time, until Q-learning was introduced by Chris Watkins in 1989 as part of his PhD thesis (a convergence proof was later presented by Watkins and Dayan in 1992). Q-learning improves our agent’s abilities enormously. Instead of calculating expectations of state-values, we learn to predict action-values (or “Q-values”, with Q standing for quality). We use these to anticipate the best actions while also summing across future Q-values.

Q-learning for reinforcement learning took off even more with the advent of Deep Q-Networks (DQNs), in which deep neural networks approximate the Q-values. This approach was famously developed by DeepMind in 2015, and made headlines by solving a wide range of Atari games (some to superhuman level) by combining classical Q-learning with deep neural networks, and a newer technique called experience replay. While we’re not interested in playing Atari games in this context, this is still a technique we can readily apply to our stock values. Our Q-learning agent behaves as follows.
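Stripped of the neural network, the heart of the agent is the classic Q-learning update; a tabular sketch (step size and discount are illustrative):

```python
import numpy as np

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.95):
    """Classic Q-learning update: move Q[s, a] toward the bootstrapped
    target r + gamma * max_a' Q[s', a']."""
    target = reward + gamma * Q[next_state].max()
    Q[state, action] += alpha * (target - Q[state, action])
    return Q
```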

7. Evolution-Strategy Agent

As long as we’re touching on the early days of reinforcement learning, it’s important to bring evolutionary strategies (ES) to attention. While Q-learning became dominant thanks to DQNs, evolutionary strategies never completely fell out of use. There are certainly clear benefits: ES is simpler to implement (no backpropagation needed), there are few hyperparameters to tune, and it’s easy to scale across distributed computing clusters. In fact, OpenAI demonstrated that ES can still compete with DQN-based methods on many complex RL tasks (including making videogame characters walk), especially on tasks where rewards are very sparse.

Much like the name implies, the evolution strategy involves setting our agent’s behavior according to a “genetic code”. This code is randomly mutated, with new versions in the subsequent generation being evaluated on a fitness function (in our case, stock-market performance). Our evolution-strategy agent behaves as follows.
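A compact sketch of one OpenAI-style ES generation (population size, noise scale, and learning rate here are illustrative defaults):

```python
import numpy as np

def es_step(theta, fitness, pop_size=50, sigma=0.1, lr=0.03, rng=None):
    """One evolution-strategy generation: sample Gaussian perturbations
    of theta, score each perturbed candidate with the fitness function,
    and move theta along the fitness-weighted average of the noise."""
    if rng is None:
        rng = np.random.default_rng(0)
    noise = rng.standard_normal((pop_size, theta.size))
    rewards = np.array([fitness(theta + sigma * n) for n in noise])
    # Standardize rewards so the step size is insensitive to their scale.
    advantage = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return theta + lr / (pop_size * sigma) * noise.T @ advantage
```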

Expansions on Q-learning-based RL

As mentioned before, Q-learning really took off when Deepmind demonstrated combining it with deep learning. Since then there have been plenty of upgrades to this technique, including double, recurrent, duel, and curiosity Q-learning, in addition to combining Q-learning with policy-gradient methods (also known as Actor-critic methods). In fact, it may be worthwhile to just keep a checklist of these various features, since most of these can be combined with others in some form.

8. Double Q-Learning Agent

double ✅, recurrent ❌, duel ❌, curiosity ❌, actor-critic ❌

As mentioned previously, Q-learning agents struggle compared to ES in environments or on tasks that require handling sparse rewards. More specifically, the Q-learning algorithm is commonly known to suffer from the overestimation of the value function. This overestimation can propagate through the training iterations and negatively affect the policy. This property directly motivated double Q-learning: the action selection and Q-value update are decoupled by using two value networks.

Our double Q-learning agent behaves as follows.
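The key change is only a few lines: one estimator selects the action, the other evaluates it. A sketch of the target computation (with lookup tables standing in for the two networks):

```python
import numpy as np

def double_q_target(q_online, q_target, reward, next_state, gamma=0.95):
    """Double-Q target: the online estimator *selects* the argmax action,
    the target estimator *evaluates* it, curbing the max-operator's
    overestimation bias."""
    best_action = int(np.argmax(q_online[next_state]))
    return reward + gamma * q_target[next_state, best_action]
```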

9. Recurrent Q-Learning Agent

double ❌, recurrent ✅, duel ❌, curiosity ❌, actor-critic ❌

By now we’ve established how flexible deep reinforcement learning is, especially Q-learning algorithms. Still, our DQNs are limited by the fact that they only really consume information right at the decision point. We’d like to extend the memory of our agent so it can better learn to integrate past price movements. This was generally the idea behind “Deep Recurrent Q-Learning for Partially Observable MDPs”, which improved upon DQNs by replacing the first post-convolutional fully-connected layer with a recurrent LSTM. While we’re not using convolutional networks in our architecture (the authors were benchmarking this technique on Atari games, not stock prices), this can still be useful to us.

Our recurrent Q-learning agent behaves as follows.

10. Double Recurrent Q-Learning Agent

double ✅, recurrent ✅, duel ❌, curiosity ❌, actor-critic ❌

If we have two advancements for Q-learning agents, one involving recurrent layers and the other involving adding a second network, then the next logical step is to see what happens when we combine these tricks. Our double recurrent Q-learning agent behaves as follows.

11. Duel Q-Learning Agent

double ❌, recurrent ❌, duel ✅, curiosity ❌, actor-critic ❌

Like double Q-learning agents, dueling DQNs were motivated by decoupling action selection from Q-value estimation. Despite the name, duel DQNs are actually a completely different technique from double DQNs. To reiterate, double DQNs make use of two networks to avoid overly optimistic Q-values. Dueling instead splits the estimator into two new streams, value and advantage, which are then aggregated back into Q-values.

Our duel Q-learning agent behaves as follows.
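The aggregation step is the crux; a sketch using the mean-advantage identifiability trick from the dueling-architecture paper (the stream outputs here are just placeholder numbers):

```python
import numpy as np

def dueling_aggregate(value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).
    Subtracting the mean advantage keeps V and A identifiable."""
    return value + advantages - advantages.mean()
```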

12. Double Duel Q-Learning Agent

double ✅, recurrent ❌, duel ✅, curiosity ❌, actor-critic ❌

As a further illustration of how double DQNs and dueling DQNs represent separate upgrades to the architecture, we can combine the “double” and “duel” aspects into a double duel Q-learning agent. Like the name suggests, this agent makes use of two networks to better judge Q-values. Within each of these two networks, the estimation of value and advantage is divided into designated value-estimators and advantage-estimators before being recombined.

Our double duel Q-learning agent behaves as follows.

13. Duel Recurrent Q-Learning Agent

double ❌, recurrent ✅, duel ✅, curiosity ❌, actor-critic ❌

Since both the recurrent layers and the dueling architecture influence how our network deals with time-weighted representations of value and advantage, it’s worth checking to make sure they’re capable of working together. Our duel recurrent Q-learning agent behaves as follows.

14. Double Duel Recurrent Q-Learning Agent

double ✅, recurrent ✅, duel ✅, curiosity ❌, actor-critic ❌

It is now time to combine all 3 Q-learning upgrades: Q-value estimation done via 2 networks, with each having separate value/advantage estimators as well as recurrent layers. Our double duel recurrent Q-learning agent behaves as follows.

15. Actor-Critic Agent

double ❌, recurrent ❌, duel ❌, curiosity ❌, actor-critic ✅

Actor-critic continues the trend of double-network strategies. In our actor-critic setup, we have a critic that updates the value function parameters (could be action-value or state-value) and an actor that updates the policy parameters in the direction suggested by the critic. One could think of the actor-critic architecture as an asymmetric version of the Double DQN.

Our actor-critic agent behaves as follows.
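A minimal one-step actor-critic update with linear function approximation (the linear actor/critic and the learning rates are illustrative stand-ins for the agent's actual networks):

```python
import numpy as np

def softmax(logits):
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def actor_critic_step(theta, w, state, action, reward, next_state,
                      gamma=0.95, lr_actor=0.01, lr_critic=0.1):
    """One-step actor-critic: the critic (w) learns state values by
    TD(0); the actor (theta) takes a policy-gradient step scaled by
    the critic's TD error."""
    td_error = reward + gamma * (w @ next_state) - (w @ state)
    w = w + lr_critic * td_error * state             # critic update
    probs = softmax(theta @ state)
    grad_logp = -np.outer(probs, state)
    grad_logp[action] += state
    theta = theta + lr_actor * td_error * grad_logp  # actor update
    return theta, w
```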

16. Actor-Critic Duel Agent

double ❌, recurrent ❌, duel ✅, curiosity ❌, actor-critic ✅

Like with the more basic Q-learning agents, we can incorporate “dueling” behavior in our actor-critic setup. Our actor-critic duel agent behaves as follows.

17. Actor-Critic Recurrent Agent

double ❌, recurrent ✅, duel ❌, curiosity ❌, actor-critic ✅

Just as with previous DQNs, we can upgrade our actor-critic architecture with recurrent layers. Our actor-critic recurrent agent behaves as follows.

18. Actor-Critic Duel Recurrent Agent

double ❌, recurrent ✅, duel ✅, curiosity ❌, actor-critic ✅

It only makes sense to test dueling behavior and recurrent layers together in our new actor-critic architecture, just like we did with the previous DQNs. Our actor-critic duel recurrent agent behaves as follows.

19. Curiosity Q-Learning Agent

double ❌, recurrent ❌, duel ❌, curiosity ✅, actor-critic ❌

Reinforcement learning algorithms are built to find reward signals even where feedback is limited. But in some cases, like navigating an enormous maze or playing a very long game, rewards may be extremely sparse. There might be nothing dissuading an agent from running in circles in a maze, or doing absolutely nothing when given multi-year time horizons to invest over. Our saving grace might simply be that curiosity compels us to seek out stimuli or environments that are unfamiliar. This has been the main idea behind incorporating curiosity into reinforcement learning. In our case, we’re focusing on the curiosity formulation described in “Curiosity-driven Exploration by Self-supervised Prediction” (also known as curiosity through prediction-based surprise, or the ICM method).

Our curiosity Q-learning agent behaves as follows.
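The curiosity bonus itself is just the forward model's prediction error; a sketch (with `forward_model` standing in for the ICM's learned dynamics model, and `eta` a scaling hyperparameter of my choosing):

```python
import numpy as np

def intrinsic_reward(forward_model, state, action, next_state, eta=0.5):
    """ICM-style curiosity bonus: the squared prediction error of a
    learned forward dynamics model. Transitions the model predicts
    poorly (i.e., novel ones) earn a larger exploration bonus."""
    predicted = forward_model(state, action)
    return eta * 0.5 * np.sum((predicted - next_state) ** 2)
```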

20. Recurrent Curiosity Q-Learning Agent

double ❌, recurrent ✅, duel ❌, curiosity ✅, actor-critic ❌

Like our previous agents, we can upgrade our curiosity Q-learning agent by adding temporally-sensitive recurrent layers to our networks. Our recurrent curiosity Q-learning agent behaves as follows.

21. Duel Curiosity Q-Learning Agent

double ❌, recurrent ❌, duel ✅, curiosity ✅, actor-critic ❌

Curiosity offers a new way to determine what information is valuable in a reward-sparse environment. We can combine it with the previous way we saw of determining what value predictions are unnecessary in an information rich environment: the dueling architecture. Our duel curiosity Q-learning agent behaves as follows.

Expansions on Evolutionary Strategy

22. Neuro-Evolution Agent

We’ve gone a long way with upgrading the DQN architecture we’re using, but we can also make changes to our fundamental training strategy by incorporating evolutionary algorithms.

Our neuro-evolution agent behaves as follows.

23. Neuro-Evolution with Novelty Search Agent

Novelty search is something we can add to our evolutionary agent to make it more likely to latch onto novel signals. In theory, this serves a similar purpose to our earlier addition of curiosity to the Q-learning agent’s training. The biological analogy behind this algorithm is that individuals are judged by a fitness function in an environment. However, the environment isn’t immune to change, and by extension the fitness function can change as well. Therefore, our evolution strategy should also reward behaviors that can be adapted to unusual environmental conditions.

Our neuro-evolution with novelty search agent behaves as follows.
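The novelty score is typically the mean distance from an individual's behavior to its k nearest neighbors in an archive of behaviors already seen; a sketch (the behavior descriptors and `k` are illustrative):

```python
import numpy as np

def novelty_score(behavior, archive, k=3):
    """Novelty-search score: mean Euclidean distance from an individual's
    behavior descriptor to its k nearest neighbors in the archive of
    previously seen behaviors."""
    distances = np.sort(np.linalg.norm(archive - behavior, axis=1))
    return distances[:k].mean()
```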

We can also improve upon the NES algorithm even further by creating Monte-Carlo-based simulations of our four picks and training the novelty search on them as well. It may also help if we extend the maximum trading volume our agent is allowed to perform.

What do we take away from all this?

Aside from code-snippets and demos, there are some other important takeaways here.

Backtesting is hard

With securities-trading strategies, we often assume that if our algorithm performs well on a set of historical data, then it should be all good for the future. If you’re not entirely sold on this assumption, then you’re probably thinking how nice it would be if backtesting were just some rookie technique you could graduate beyond. While backtesting is based on assumptions that are quickly tested and refuted in real life, finding good alternatives to it is unfortunately an unsolved problem in algorithmic finance. For evolutionary algorithms, or any other algorithm type that improves with more training data, creating Monte Carlo simulations of the stock in question can help somewhat. This can bring its own problems: if you don’t put enough effort into distinguishing the stock’s behavior from random noise, you’ll learn the meaning of “Garbage in, Garbage out” the hard way.

Humans are bad at processing high-dimensional signals; Leave that to machines

For most of these experiments, we processed the close price data in much the same way as a human trader (especially a novice) would look at it. While this approach might have worked well half a century ago, we saw how difficult and computationally intensive it can be to analyze one data feed and get any positive return. This is why you need plenty of other data sources besides the price of the stock you’re interested in. If you don’t have additional data that’s correlated, anticorrelated, or just related to the price movements of the stock you’re trading, you need to spend a lot more time researching the stock than researching RL.

Consider value investing (or anything focused on longer time-horizons)

We put a lot of effort into our stock buying-and-selling simulations. Of course, in real life many of our gains would probably be eaten away by capital gains taxes. With that in mind, perhaps it’s best to leave the day-to-day, week-to-week, and intra-day swings to the bots. While he might not beat the market as much as he used to, Warren Buffett might have had the right idea in promoting value investing (its longer time-horizons are certainly less stressful than all of this nonsense we just covered).

References & Resources

Cited as:

@article{mcateer2018rlstocks,
title   = "Getting Returns from RL",
author  = "McAteer, Matthew",
journal = "matthewmcateer.me",
year    = "2018",
url     = "https://matthewmcateer.me/blog/getting-returns-from-rl/"
}

If you notice mistakes and errors in this post, don’t hesitate to contact me at [contact at matthewmcateer dot me] and I will be very happy to correct them right away! Alternatively, you can follow me on Twitter and reach out to me there.

See you in the next post 😄

I write about AI, Biotech, and a bunch of other topics. Subscribe to get new posts by email!