Getting Returns from RL Algorithms

Using algos from robotics and videogames to tackle the stock market

UPDATE 7/3/2019: Added more about curiosity mechanisms.

Recently I tried out a very interesting PyTorch challenge: getting two robots to play tennis against each other.

Of course, as entertaining as this is, not many people are working in videogame AI. If you’re exploring reinforcement learning, chances are you’re one of the many people fantasizing about one thing in particular: using reinforcement learning algorithms to time the stock market.

If you’re reading an article on doing this, chances are you’ve already seen all the articles warning you about rookie mistakes. You may have come across statistics pointing to the average career length for a day-trader being only 3 weeks. Perhaps after all that, you’re still undeterred. After all, you work with numbers all the time. How hard could it be?

Listen to Randall Munroe of XKCD. His comic is supposed to be a warning, not a guide.

You’ve probably already gone through all the stages of seeing those stock-trading horror stories, so I’m probably wasting my breath trying to add further disclaimers. With that in mind, let’s see just how easy (or hard) it is to apply reinforcement learning to securities trading.

For the sake of our experiments, let’s suppose you have $10,000 to trade. Let’s suppose you have the ability to buy and sell a given stock with negligible commissions paid to a broker. Let’s also suppose we’re leaving the capital gains tax hairball to a later time. Let’s also suppose that the closing prices we have to work with are all released at exactly the closing time from the stock exchange, and that there’s no movement happening based on these numbers coming out earlier. These are all pretty big assumptions, but let’s see how well our approaches work in this extremely, incredibly, almost insultingly simplified playground environment.

For each RL agent that we’re considering, we’re limiting each agent to be able to buy or sell just 1 share per transaction (no more than 1 share, and no fractional shares). We’ll pick a few blue chips from Nasdaq like MSFT, NVDA, AMD, and INTC, with the reasoning that trading volumes are usually high and that the price is slightly less likely to swing due to things like pump-and-dump schemes or insider trading.

Let’s review the agents we’re going to be using:

Non-RL and Dynamic-programming-based strategies

  1. Turtle-Trading Agent
  2. Moving-Average Agent
  3. Signal Rolling Agent
  4. ABCD Strategy Agent

Classical RL Algorithms

  1. Policy-gradient Agent
  2. Q-Learning Agent
  3. Evolution-strategy Agent

Expansions on Q-Learning-Based RL

  1. Double Q-Learning Agent
  2. Recurrent Q-Learning Agent
  3. Double Recurrent Q-Learning Agent
  4. Duel Q-Learning Agent
  5. Double Duel Q-Learning Agent
  6. Duel Recurrent Q-Learning Agent
  7. Double Duel Recurrent Q-Learning Agent
  8. Actor-Critic Agent
  9. Actor-Critic Duel Agent
  10. Actor-Critic Recurrent Agent
  11. Actor-Critic Duel Recurrent Agent
  12. Curiosity Q-Learning Agent
  13. Recurrent Curiosity Q-Learning Agent
  14. Duel Curiosity Q-Learning Agent

Expansions on Evolutionary Strategy

  1. Neuro-Evolution Agent
  2. Neuro-Evolution with Novelty Search Agent

Let’s get started…

Non-RL and Dynamic-programming-based strategies

1. Turtle-Trading Agent

Before getting into fancy machine learning algorithms, we shouldn’t discount far simpler tools (even if only to use them as control groups).

It doesn’t get much simpler than Turtle-trading. This strategy revolves around buying a stock as close as possible to the start of a breakout, and selling just as quickly at the start of a drawdown.

There are plenty of “trend-following” strategies out there, but Turtle-trading is probably the most famous. This is in no small part due to Richard Dennis and William Eckhardt’s experiment in the 1980s, where they bet on how easy it would be to teach futures trading even to inexperienced people. After placing an ad in the Wall Street Journal and getting thousands of applications, only 14 traders were ultimately selected (based in large part on their responses to four true/false questions). The traders in these classes were given a selection of very simple rules to follow when trading (detailed in the book “The Complete TurtleTrader: The Legend, the Lessons, the Results” (2007)). Some of the rules in our turtle-trading agent are a bit different from Dennis’s original rules (we don’t need to worry about getting distracted from prices by TV news, because we’re only working with prices), but the agent still revolves its strategy around breakouts and drawdowns.

Our own turtle-trading agent tracks two main signals: the 40-day maximum and the 40-day minimum. Buy/sell signals depend on whether the price has broken past these levels. We apply this strategy to each of our four Nasdaq stocks for a full year.
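To make that concrete, here’s a minimal sketch of the breakout/drawdown logic in pandas. The window length, the use of closing prices, and the one-share-per-signal convention are assumptions for illustration, not the agent’s exact code.

```python
import pandas as pd

def turtle_signals(prices: pd.Series, window: int = 40) -> pd.Series:
    """Return +1 (buy one share), -1 (sell one share), or 0 (hold) per day.

    Buys on a breakout above the prior `window`-day high and sells on a
    drawdown below the prior `window`-day low.
    """
    rolling_high = prices.shift(1).rolling(window).max()
    rolling_low = prices.shift(1).rolling(window).min()

    signals = pd.Series(0, index=prices.index)
    signals[prices > rolling_high] = 1    # breakout: buy
    signals[prices < rolling_low] = -1    # drawdown: sell
    return signals
```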

MSFT with turtle-trading agent

NVDA with turtle-trading agent

AMD with turtle-trading agent

INTC with turtle-trading agent

Hmmmm…we can do better than this.

2. Moving-Average Agent

The popularity of this strategy is probably on par with Turtle trading. Moving Average (MA) analysis is exactly what it sounds like. Buy and Sell signals are based around price data that’s been smoothed out according to the average price in a given time window (this could be 30 minutes, 20 days, 10 weeks, 5 months, or whatever time interval the trader decides on). This average price is constantly updated as old data exits the time window and new data joins.

Our own moving-average agent is set to a 20-day moving average across the price data.
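As a rough sketch (one common variant, not necessarily the exact rule set used here), buy/sell signals can be generated from crossovers between the closing price and its 20-day moving average:

```python
import pandas as pd

def moving_average_signals(prices: pd.Series, window: int = 20) -> pd.Series:
    """+1 when the close crosses above its moving average, -1 when it crosses below."""
    ma = prices.rolling(window).mean()
    above = prices > ma
    crossed_up = above & ~above.shift(1, fill_value=False)
    crossed_down = ~above & above.shift(1, fill_value=False)

    signals = pd.Series(0, index=prices.index)
    signals[crossed_up] = 1
    signals[crossed_down] = -1
    return signals
```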

MSFT with moving-average agent

NVDA with moving-average agent

AMD with moving-average agent

INTC with moving-average agent

3. Signal Rolling Agent

In keeping with the theme of the previous agent, a signal-rolling agent works by looking for a given buy/sell signal, then evaluating whether that signal still holds after a given delay. If the signal does not hold, we switch the buy/sell signal.

Our version of a signal-rolling agent checks similar signals to the turtle-trader, but combines this with checking the movement of the market over 4 days.
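A minimal sketch of that re-checking step might look like the following, where `raw_signals` could come from the breakout rules above and the 4-day delay is the confirmation window. (In a live setting the confirmed signal would only fire once the delay has elapsed; here we line everything up by index for backtesting convenience.)

```python
import pandas as pd

def rolled_signals(prices: pd.Series, raw_signals: pd.Series, delay: int = 4) -> pd.Series:
    """Re-check each raw buy/sell signal after `delay` days.

    If the market has kept moving in the signal's direction, keep it;
    otherwise flip it. `raw_signals` uses +1 (buy), -1 (sell), 0 (hold).
    """
    future_move = prices.shift(-delay) - prices   # price change over the delay window
    confirmed = raw_signals.copy()
    flip = (raw_signals != 0) & (future_move * raw_signals < 0)
    confirmed[flip] = -raw_signals[flip]          # the signal didn't hold, so switch it
    return confirmed
```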

MSFT with signal-rolling agent

NVDA with signal-rolling agent

AMD with signal-rolling agent

INTC with signal-rolling agent

4. ABCD Strategy Agent

I’ll be honest: I personally find this to be one of the most annoying dynamic-programming-based strategies. That isn’t actually due to the algorithm itself, or the process of coding it. It’s annoying because this is basically what so many of the “Quit your job and become a day trader” clickbait ads are hawking. These scammers charge you hundreds (or thousands) of dollars to teach you what is essentially a basic strategy you can find in many introductory algorithmic-finance textbooks (or even freshman-level algorithms classes). The worst part is that, while this algorithm is completely automatable, these creeps dwelling in the parts of webpages usually cut out by adblock insist on getting people to implement it by hand, sitting in front of a computer screen all day and manually buying and selling (and of course this way they can claim “Oh, there’s nothing wrong with our system, you’re just not cut out for the day-trader life”). Even if you actually manage to earn money doing this manually, you will likely lose many days of your life to drudgery that you’re not getting back (and that’s not even counting the premature aging all this stress will cause you).

And to top it all off, this ABCD strategy is the primary source of perhaps >90% of the low-effort technical analysis memes out there:

Going with another one of Randall Munroe’s classics here. Like I said before, treat these comics of his as warnings, not as examples to follow.

Anyway…

Our ABCD-strategy agent searches the price history for the four pivot points that give the pattern its name: an initial up-leg from A to B, a partial retracement from B to C, and a continuation leg from C to D, with buy/sell signals generated around points C and D.
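The exact rules our agent uses differ a bit, but a brute-force version of the pattern search looks something like the sketch below (the tolerance and pattern conditions are illustrative assumptions). It also makes the runtime cost discussed after the charts painfully obvious.

```python
def find_abcd_patterns(prices, tolerance=0.02):
    """Brute-force search for A-B-C-D pivots: an up-leg A->B, a partial
    retracement B->C that stays above A, and a second up-leg C->D roughly
    matching the first. The four nested loops are what make this O(n^4)."""
    n = len(prices)
    patterns = []
    for a in range(n):
        for b in range(a + 1, n):
            if prices[b] <= prices[a]:
                continue                       # need an up-leg A -> B
            for c in range(b + 1, n):
                if not (prices[a] < prices[c] < prices[b]):
                    continue                   # retracement must hold above A
                for d in range(c + 1, n):
                    ab = prices[b] - prices[a]
                    cd = prices[d] - prices[c]
                    if cd > 0 and abs(cd - ab) <= tolerance * ab:
                        patterns.append((a, b, c, d))   # buy near C, sell near D
    return patterns
```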

MSFT with ABCD agent

NVDA with ABCD agent

AMD with ABCD agent

INTC with ABCD agent

It’s worth pointing out that, with 4 nested for loops, this strategy has an incredibly inefficient runtime complexity of O(n^4). In fact, just running this backtest on only 365 data points took a full 2.45 seconds on my CPU. If you try to run this for longer, especially on a stream of data, you can see how the electricity costs alone would probably start to eat up any monetary gains. By comparison, RL algorithms with constant input sizes and near-constant inference times seem like a welcome change. Of course, the returns on this strategy are also an example of why one shouldn’t completely discount the simple, non-machine-learning-based approaches. It will be a lot easier to justify using reinforcement learning if we can get better returns with it than we can with a 20-lines-of-Python strategy.

With that in mind, let’s get into the reinforcement learning approaches to BUY/SELL signals.

Classical RL Algorithms

While this is an oversimplified taxonomy, we can focus on three main areas of reinforcement learning: policy-gradient methods, Q-learning methods, and evolutionary strategies.

5. Policy-Gradient Agent

Now we’re getting to our first truly trainable agents.

Policy gradient methods are a class of reinforcement learning techniques that rely on optimizing parametrized policies with respect to the expected return (long-term cumulative reward) by gradient descent.

Lilian Weng gives a really good overview of policy-gradient methods.

Of course, the concept of a policy gradient is not unique to any single sub-area of reinforcement learning (see this giant list of policy gradient algorithms). For our experiment, we create an agent that looks at the price data within a given window, and learns through gradient descent to predict the reward landscape of the various actions (which in this case are just buy, sell, or do nothing). We also include terms in our training algorithm to properly discount rewards further in the future.
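For a sense of what that training loop involves, here is a bare-bones REINFORCE-style sketch in PyTorch. The window size, network shape, and return normalization are assumptions for illustration, not the exact training code used for the charts below.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps a window of recent price changes to action probabilities
    (0 = hold, 1 = buy one share, 2 = sell one share)."""
    def __init__(self, window: int = 30, hidden: int = 128, n_actions: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(window, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )

    def forward(self, x):
        return torch.softmax(self.net(x), dim=-1)

def reinforce_update(policy, optimizer, log_probs, rewards, gamma=0.99):
    """One REINFORCE step: discount the episode's rewards and push up the
    log-probability of each action in proportion to its discounted return."""
    returns, g = [], 0.0
    for r in reversed(rewards):                # accumulate discounted returns
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # simple normalization
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```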

Our policy-gradient agent behaves as follows.

MSFT with policy-gradient agent

NVDA with policy-gradient agent

AMD with policy-gradient agent

INTC with policy-gradient agent

6. Q-Learning Agent

Our previous policy-gradient agent worked by learning to predict rewards at various states. While an improvement on a lot of the dynamic-programming strategies we started with, this is still a little short-sighted. In fact, this problem frustrated a lot of people in reinforcement learning for a long time, until Q-learning was introduced by Chris Watkins in 1989 as part of his PhD thesis (a convergence proof was later presented by Watkins and Dayan in 1992). Q-learning improves our agent’s abilities enormously. Instead of calculating expectations of state-values, we learn to predict action-values (or “Q-values”, with Q standing for quality), which lets us pick the best action while also accounting for discounted future Q-values.
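Stripped of the trading context, the core Q-learning update is just a nudge toward the Bellman target. Here it is in tabular form for illustration; our actual agent approximates Q with a neural network, as described next.

```python
import numpy as np

def q_learning_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step: move Q(s, a) toward the Bellman target
    r + gamma * max_a' Q(s', a')."""
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])
    return Q
```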

Q-learning took off even more with the advent of Deep Q-Networks (DQNs), in which deep neural networks are used to approximate the Q-function. This approach was famously developed by DeepMind in 2015, and made headlines by solving a wide range of Atari games (some to superhuman level) by combining classical Q-learning with deep neural networks and a technique called experience replay. While we’re not interested in playing Atari games in this context, this is still a technique we can readily apply to our stock values. Our Q-learning agent behaves as follows.

MSFT with Q-learning agent

NVDA with Q-learning agent

AMD with Q-learning agent

INTC with Q-learning agent

7. Evolution-Strategy Agent

As long as we’re touching on the early days of reinforcement learning, it’s important to bring evolutionary strategies (ES) to attention. While Q-learning became dominant thanks to DQNs, evolutionary strategies never completely fell out of use, and they have clear benefits: ES is simpler to implement (no backpropagation needed), there are few hyperparameters to tune, and it’s easy to scale across distributed computing clusters. In fact, OpenAI demonstrated that ES can still compete with DQN-based methods on many complex RL tasks (including making videogame characters walk), especially on tasks where rewards are very sparse.

Much like the name implies, the evolution strategy involves setting our agent’s behavior according to a “genetic code”. This code is randomly mutated, with the new versions in each subsequent generation being evaluated on a fitness function (in our case, stock-market performance). Our evolution-strategy agent behaves as follows.
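For concreteness, here is a minimal sketch of the kind of ES parameter update this relies on, in the spirit of OpenAI’s formulation. The population size, noise scale, and learning rate are illustrative assumptions, and `fitness_fn` stands in for something like backtested profit.

```python
import numpy as np

def es_step(params, fitness_fn, pop_size=50, sigma=0.1, lr=0.03, rng=None):
    """One evolution-strategy step: sample Gaussian perturbations of the
    parameter vector, score each perturbed "genome" with the fitness
    function, and move the parameters in the fitness-weighted direction
    of the noise."""
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal((pop_size, params.size))
    fitness = np.array([fitness_fn(params + sigma * eps) for eps in noise])
    fitness = (fitness - fitness.mean()) / (fitness.std() + 1e-8)  # normalize scores
    grad_estimate = noise.T @ fitness / (pop_size * sigma)
    return params + lr * grad_estimate
```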

MSFT with evolution-strategy agent

NVDA with evolution-strategy agent

AMD with evolution-strategy agent

INTC with evolution-strategy agent

Expansions on Q-Learning-Based RL

As mentioned before, Q-learning really took off when DeepMind demonstrated combining it with deep learning. Since then there have been plenty of upgrades to this technique, including double, recurrent, duel, and curiosity Q-learning, in addition to combining Q-learning with policy-gradient methods (also known as actor-critic methods). In fact, it may be worthwhile to just keep a checklist of these various features, since most of them can be combined with the others in some form.

8. Double Q-Learning Agent

double ✅, recurrent ❌, duel ❌, curiosity ❌, actor-critic ❌

As mentioned previously, Q-learning agents struggle compared to ES on tasks with sparse rewards. More specifically, the Q-learning algorithm is commonly known to suffer from overestimation of the value function. This overestimation can propagate through the training iterations and negatively affect the policy. This property directly motivated double Q-learning: the action selection and the Q-value update are decoupled by using two value networks.
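The key change is easy to show in isolation: the online network selects the next action, while the target network evaluates it. A minimal PyTorch sketch of the target computation follows; the tensor shapes and network interfaces are assumptions for illustration.

```python
import torch

def double_q_target(online_net, target_net, reward, next_state, gamma=0.99, done=False):
    """Double-DQN target: the online network *selects* the next action,
    the target network *evaluates* it, which damps the overestimation bias
    of plain Q-learning."""
    with torch.no_grad():
        best_action = online_net(next_state).argmax(dim=-1, keepdim=True)
        next_q = target_net(next_state).gather(-1, best_action).squeeze(-1)
        return reward + gamma * next_q * (1.0 - float(done))
```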

Our double Q-learning agent behaves as follows.

MSFT with Double Q-learning agent

NVDA with Double Q-learning agent

AMD with Double Q-learning agent

INTC with Double Q-learning agent

9. Recurrent Q-Learning Agent

double ❌, recurrent ✅, duel ❌, curiosity ❌, actor-critic ❌

By now we’ve established how flexible deep reinforcement learning is, especially Q-learning algorithms. Still, our DQNs are limited by the fact that they only really consume information right at the decision point. We’d like to extend the memory of our agent so it can better learn to integrate past price movements. This was generally the idea behind “Deep Recurrent Q-Learning for Partially Observable MDPs”, which improved upon DQNs by replacing the first post-convolutional fully-connected layer with a recurrent LSTM layer. The authors were benchmarking this technique on Atari games rather than stock prices, and we’re not using convolutional networks in our architecture, but the core idea is still useful to us.
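In our setting, that amounts to feeding a window of prices through an LSTM and reading Q-values off its final hidden state. A minimal sketch of such a network (layer sizes are assumptions):

```python
import torch
import torch.nn as nn

class RecurrentQNet(nn.Module):
    """DRQN-style Q-network for price windows: an LSTM summarizes the recent
    price history, and a linear head maps its final hidden state to Q-values
    for hold / buy / sell."""
    def __init__(self, n_features: int = 1, hidden: int = 64, n_actions: int = 3):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, price_window, hidden_state=None):
        # price_window: (batch, time_steps, n_features)
        out, hidden_state = self.lstm(price_window, hidden_state)
        q_values = self.head(out[:, -1, :])   # Q-values from the last time step
        return q_values, hidden_state
```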

Our recurrent Q-learning agent behaves as follows.

MSFT with recurrent Q-learning agent

NVDA with recurrent Q-learning agent

AMD with recurrent Q-learning agent

INTC with recurrent Q-learning agent

10. Double Recurrent Q-Learning Agent

double ✅, recurrent ✅, duel ❌, curiosity ❌, actor-critic ❌

If we have two advancements for Q-learning agents, one involving recurrent layers and the other involving adding a second network, then the next logical step is to see what happens when we combine these tricks. Our double recurrent Q-learning agent behaves as follows.

MSFT with double recurrent Q-learning agent

NVDA with double recurrent Q-learning agent

AMD with double recurrent Q-learning agent

INTC with double recurrent Q-learning agent

11. Duel Q-Learning Agent

double ❌, recurrent ❌, duel ✅, curiosity ❌, actor-critic ❌

Like double Q-learning agents, dueling DQNs were motivated by splitting up how Q-values get estimated. Despite the similar name, though, duel DQNs are a completely different technique from double DQNs. To reiterate, double DQNs make use of two networks to avoid overly optimistic Q-values. Dueling DQNs instead separate a single network’s estimator into two new streams, value and advantage, which are then aggregated back into Q-values.
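The aggregation step is the part worth seeing in code: Q-values are reassembled as the state value plus a mean-centered advantage. A minimal sketch of such a head (layer sizes are assumptions):

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling DQN head: a shared trunk feeds separate value and advantage
    streams, which are re-combined as Q = V + (A - mean(A))."""
    def __init__(self, window: int = 30, hidden: int = 128, n_actions: int = 3):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(window, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)
        self.advantage = nn.Linear(hidden, n_actions)

    def forward(self, x):
        h = self.trunk(x)
        v = self.value(h)                       # (batch, 1)
        a = self.advantage(h)                   # (batch, n_actions)
        return v + a - a.mean(dim=-1, keepdim=True)
```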

Our duel Q-learning agent behaves as follows.

MSFT with duel Q-learning agent

NVDA with duel Q-learning agent

AMD with duel Q-learning agent

INTC with duel Q-learning agent

12. Double Duel Q-Learning Agent

double ✅, recurrent ❌, duel ✅, curiosity ❌, actor-critic ❌

As a further illustration of how double DQNs and dueling DQNs represent separate upgrades to the architecture, we can combine the “double” and “duel” aspects into a double duel Q-learning agent. Like the name suggests, this agent makes use of two networks to better judge Q-values, and within each of these networks the estimator is split into designated value and advantage streams before being recombined.

Our double duel Q-learning agent behaves as follows.

MSFT with double duel Q-learning agent

NVDA with double duel Q-learning agent

AMD with double duel Q-learning agent

INTC with double duel Q-learning agent

13. Duel Recurrent Q-Learning Agent

double ❌, recurrent ✅, duel ✅, curiosity ❌, actor-critic ❌

Since both the recurrent layers and the duelling architecture influence how our network deals with time-weighted representations of value and advantage, it’s worth checking to make sure they’re capable of working together. Our duel recurrent Q-learning agent behaves as follows.

MSFT with duel recurrent Q-learning agent

NVDA with duel recurrent Q-learning agent

AMD with duel recurrent Q-learning agent

INTC with duel recurrent Q-learning agent

14. Double Duel Recurrent Q-Learning Agent

double ✅, recurrent ✅, duel ✅, curiosity ❌, actor-critic ❌

It is now time to combine all 3 Q-learning upgrades: Q-value estimation done via 2 networks, with each having separate value/advantage estimators as well as recurrent layers. Our double duel recurrent Q-learning agent behaves as follows.

MSFT with double duel recurrent Q-learning agent

NVDA with double duel recurrent Q-learning agent

AMD with double duel recurrent Q-learning agent

INTC with double duel recurrent Q-learning agent

15. Actor-Critic Agent

double ❌, recurrent ❌, duel ❌, curiosity ❌, actor-critic ✅

Actor-critic continues the trend of double-network strategies. In our actor-critic setup, we have a critic that updates the value function parameters (could be action-value or state-value) and an actor that updates the policy parameters in the direction suggested by the critic. One could think of the actor-critic architecture as an asymmetric version of the Double DQN.
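A minimal one-step advantage actor-critic update might look like the sketch below, where a single optimizer is assumed to cover both networks’ parameters and terminal-state handling is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def actor_critic_update(actor, critic, optimizer, state, action_log_prob,
                        reward, next_state, gamma=0.99):
    """One-step advantage actor-critic update: the critic's TD error tells the
    actor whether the chosen action was better or worse than expected."""
    value = critic(state)
    with torch.no_grad():
        td_target = reward + gamma * critic(next_state)
    advantage = td_target - value

    actor_loss = -(action_log_prob * advantage.detach()).mean()  # push policy toward good actions
    critic_loss = F.mse_loss(value, td_target)                   # fit the value estimate
    loss = actor_loss + critic_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```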

Our actor-critic agent behaves as follows.

MSFT with actor-critic agent

NVDA with actor-critic agent

AMD with actor-critic agent

INTC with actor-critic agent

16. Actor-Critic Duel Agent

double ❌, recurrent ❌, duel ✅, curiosity ❌, actor-critic ✅

Like with the more basic Q-learning agents, we can incorporate “dueling” behavior in our actor-critic setup. Our actor-critic duel agent behaves as follows.

MSFT with actor-critic duel agent

NVDA with actor-critic duel agent

AMD with actor-critic duel agent

INTC with actor-critic duel agent

17. Actor-Critic Recurrent Agent

double ❌, recurrent ✅, duel ❌, curiosity ❌, actor-critic ✅

Just as with previous DQNs, we can upgrade our actor-critic architecture with recurrent layers. Our actor-critic recurrent agent behaves as follows.

MSFT with actor-critic recurrent agent

NVDA with actor-critic recurrent agent

AMD with actor-critic recurrent agent

INTC with actor-critic recurrent agent

18. Actor-Critic Duel Recurrent Agent

double ❌, recurrent ✅, duel ✅, curiosity ❌, actor-critic ✅

It only makes sense to test duelling behavior and recurrent layers together in our new actor-critic architecture, just like we did with the previous DQNs. Our actor-critic duel recurrent agent behaves as follows.

MSFT with actor-critic duel recurrent agent

NVDA with actor-critic duel recurrent agent

AMD with actor-critic duel recurrent agent

INTC with actor-critic duel recurrent agent

19. Curiosity Q-Learning Agent

double ❌, recurrent ❌, duel ❌, curiosity ✅, actor-critic ❌

Reinforcement learning algorithms are built to learn from reward signals, but in some cases, like navigating an enormous maze or playing a very long game, those rewards may be very sparse. There might be nothing dissuading an agent from running in circles in a maze, or from doing absolutely nothing when given a multi-year time horizon to invest over. Our saving grace might simply be that curiosity compels us to seek out stimuli or environments that are unfamiliar. This has been the main idea behind incorporating curiosity into reinforcement learning. In our case, we’re focusing on the curiosity formulation described in “Curiosity-driven Exploration by Self-supervised Prediction” (also known as curiosity through prediction-based surprise, or the ICM method).
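The full ICM formulation also learns a feature encoder and an inverse model; the sketch below keeps only the forward-model “surprise” term, which is the part added to the trading reward as an intrinsic bonus. The dimensions and scaling factor are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForwardModel(nn.Module):
    """ICM-style forward model: predict the next state (here, the next price
    window) from the current state and the chosen action. The prediction
    error becomes an intrinsic "curiosity" bonus added to the trading reward."""
    def __init__(self, state_dim: int = 30, n_actions: int = 3, hidden: int = 64):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def intrinsic_reward(self, state, action, next_state, scale=0.1):
        # `action` is a LongTensor of action indices, shape (batch,)
        one_hot = F.one_hot(action, self.n_actions).float()
        predicted = self.net(torch.cat([state, one_hot], dim=-1))
        # Surprise = how badly we predicted what the market did next.
        return scale * F.mse_loss(predicted, next_state, reduction="none").mean(dim=-1)
```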

Our curiosity Q-learning agent behaves as follows.

MSFT with curiosity Q-learning agent

NVDA with curiosity Q-learning agent

AMD with curiosity Q-learning agent

INTC with curiosity Q-learning agent

20. Recurrent Curiosity Q-Learning Agent

double ❌, recurrent ✅, duel ❌, curiosity ✅, actor-critic ❌

Like our previous agents, we can upgrade our curiosity Q-learning agent by adding temporally-sensitive recurrent layers to our networks. Our recurrent curiosity Q-learning agent behaves as follows.

MSFT with Recurrent Curiosity Q-learning agent

NVDA with Recurrent Curiosity Q-learning agent

AMD with Recurrent Curiosity Q-learning agent

INTC with Recurrent Curiosity Q-learning agent

21. Duel Curiosity Q-Learning Agent

double ❌, recurrent ❌, duel ✅, curiosity ✅, actor-critic ❌

Curiosity offers a new way to determine what information is valuable in a reward-sparse environment. We can combine it with the technique we saw earlier for deciding which value predictions are unnecessary in an information-rich environment: the dueling architecture. Our duel curiosity Q-learning agent behaves as follows.

MSFT with duel curiosity Q-learning agent

NVDA with duel curiosity Q-learning agent

AMD with duel curiosity Q-learning agent

INTC with duel curiosity Q-learning agent

Expansions on Evolutionary Strategy

22. Neuro-Evolution Agent

We’ve gone a long way with upgrading the DQN architecture we’re using, but we can also change our fundamental training strategy by incorporating evolutionary algorithms. Instead of updating the network’s weights by backpropagation, a neuro-evolution agent treats those weights as a genome and evolves them directly with the same mutate-and-select loop we used in the evolution-strategy agent.
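One generation of that loop might look like the following sketch, where each individual is a flattened vector of network weights and `fitness_fn` stands in for backtested profit. The elite fraction and mutation scale are illustrative assumptions.

```python
import numpy as np

def evolve_population(population, fitness_fn, elite_frac=0.2, mutation_std=0.02, rng=None):
    """One generation of simple neuro-evolution: keep the best-performing
    weight vectors and refill the population with mutated copies of them."""
    rng = rng or np.random.default_rng()
    scores = np.array([fitness_fn(w) for w in population])
    elite_count = max(1, int(len(population) * elite_frac))
    elites = [population[i] for i in np.argsort(scores)[::-1][:elite_count]]

    next_gen = list(elites)                       # elites survive unchanged
    while len(next_gen) < len(population):
        parent = elites[rng.integers(len(elites))]
        child = parent + rng.normal(0.0, mutation_std, size=parent.shape)
        next_gen.append(child)
    return next_gen
```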

Our neuro-evolution agent behaves as follows.

MSFT with neuro-evolution agent

NVDA with neuro-evolution agent

AMD with neuro-evolution agent

INTC with neuro-evolution agent

23. Neuro-Evolution with Novelty Search Agent

Novelty search is something we can add to our evolutionary agent to make it more likely to latch onto novel signals. In principle, this serves a similar purpose to our earlier addition of curiosity to the Q-learning agents. The biological analogy behind this algorithm is that individuals are judged by a fitness function in an environment. However, the environment isn’t immune to change, and by extension the fitness function can change as well. Therefore, our evolution strategy should also reward behaviors that can adapt to unusual environmental conditions.
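The core addition is a novelty score: characterize each individual’s behavior (for example, its sequence of buy/sell/hold decisions, or its equity curve) and measure how far it sits from behaviors already seen. A minimal sketch, where the distance metric, archive, and k are assumptions:

```python
import numpy as np

def novelty_score(behavior, archive, k=5):
    """Novelty of a candidate's behavior characterization = mean distance to
    its k nearest neighbors in the archive of previously seen behaviors."""
    if len(archive) == 0:
        return float("inf")
    dists = np.linalg.norm(np.asarray(archive) - np.asarray(behavior), axis=1)
    nearest = np.sort(dists)[:k]
    return float(nearest.mean())

# Selection can then blend fitness (profit) with novelty, e.g.:
# score = (1 - w) * fitness + w * novelty_score(behavior, archive)
```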

Our neuro-evolution with novelty search agent behaves as follows.

MSFT with neuro-evolution with Novelty search agent

NVDA with neuro-evolution with Novelty search agent

AMD with neuro-evolution with Novelty search agent

INTC with neuro-evolution with Novelty search agent

We can also improve upon the neuro-evolution-with-novelty-search agent even further by creating Monte-Carlo-based simulations of our four picks and training the novelty search on those as well (a sketch of such a simulation is shown below). It may also help to extend the maximum trading volume our agent is allowed to perform.
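A simple version of that augmentation is to fit the historical daily returns and generate synthetic price paths from them, roughly in the spirit of geometric Brownian motion. Real prices are not this well behaved, so treat it as data augmentation rather than a faithful simulator; the function name and parameters below are assumptions.

```python
import numpy as np

def simulate_price_paths(last_price, daily_returns, n_paths=100, n_days=252, rng=None):
    """Generate synthetic price paths by compounding random daily returns
    drawn from a normal distribution fitted to the historical returns."""
    rng = rng or np.random.default_rng()
    mu, sigma = np.mean(daily_returns), np.std(daily_returns)
    shocks = rng.normal(mu, sigma, size=(n_paths, n_days))
    return last_price * np.cumprod(1.0 + shocks, axis=1)
```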

What do we take away from all this?

Aside from code-snippets and demos, there are some other important takeaways here.

There are plenty of bigger fish

When you started out on your RL stock-trading adventure, you probably assumed that your engineering chops would give you an enormous leg up over all the day-traders and Robinhood gamblers. You may also have been surprised that our more convoluted trading algorithms fell apart. Unfortunately, your out-of-the-box reinforcement learning algorithms have the Herculean task of meta-gaming entire algorithmic trading firms. Unlike large tech companies or university CS departments, these organizations are often much less forthcoming about the algorithms they use. After all, if a trading strategy is used too frequently, it will cease to beat the market and instead become part of the market itself (another warning for those of you who read this post expecting some kind of free-lunch stock-trading algorithm).

Backtesting is hard

With securities-trading strategies, we often assume that if our algorithm performs well on a set of historical data, then it should be fine for the future. If you’re not entirely sold on this assumption, you’re probably wishing backtesting were just some rookie technique you could graduate beyond. Unfortunately, while backtesting rests on assumptions that are quickly tested and refuted in real life, finding good alternatives to it is an unsolved problem in algorithmic finance. For evolutionary algorithms, or any other algorithm type that improves with more training data, creating Monte Carlo simulations of the stock in question can help somewhat. This brings its own problems: if you don’t put enough effort into distinguishing the stock’s behavior from random noise, you’ll learn the meaning of “garbage in, garbage out” the hard way.

Humans are bad at processing high-dimensional signals; Leave that to machines

For most of these experiments, we processed the close price data in much the same way as a human trader (especially a novice) would look at it. While this approach might have worked well half a century ago, we saw how difficult and computationally intensive it can be to analyze one data feed and get any positive return. This is why you need plenty of other data sources besides the price of the stock you’re interested in. If you don’t have additional data that’s correlated, anticorrelated, or just related to the price movements of the stock you’re trading, you need to spend a lot more time researching the stock than researching RL.

Consider value investing (or anything focused on longer time-horizons)

We put a lot of effort into our stock buying-and-selling simulations. Of course, in real life many of our gains would probably be eaten away by capital gains taxes. With that in mind, perhaps it’s best to leave the day-to-day, week-to-week, and intra-day swings to the bots. While he might not beat the market as much as he used to, Warren Buffett might have had the right idea in promoting value investing (its longer time horizons are certainly less stressful than all of this nonsense we just covered).

References & Resources

  1. Turtle Trading: A Market Legend (Investopedia)
  2. Breakout Definition and Example (Investopedia)
  3. Drawdown Definition and Example (Investopedia)
  4. How to Use a Moving Average to Buy Stocks (Investopedia)
  5. Warren Buffett: How He Does It (Investopedia)
  6. Watkins, Christopher John Cornish Hellaby. “Learning from delayed rewards.” (1989).
  7. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., … & Petersen, S. (2015). Human-level control through deep reinforcement learning. nature, 518(7540), 529-533.
  8. Weng, Lilian. “Policy gradient algorithms.” (2019).
  9. Covel, M. (2007). The Complete TurtleTrader: The Legend, the Lessons, the Results. Collins.
  10. Watkins, Christopher JCH, and Peter Dayan. “Q-learning.” Machine learning 8.3-4 (1992): 279-292.
  11. Sham Kakade. “A Natural Policy Gradient.”. NIPS. 2002.
  12. John Schulman, et al. “High-dimensional continuous control using generalized advantage estimation.” ICLR 2016.
  13. Hasselt, H. V. (2010). Double Q-learning. In Advances in neural information processing systems (pp. 2613-2621).
  14. Ziyu Wang, et al. “Sample efficient actor-critic with experience replay.” ICLR 2017.
  15. Salimans, T., Ho, J., Chen, X., Sidor, S., & Sutskever, I. (2017). Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864.
  16. Thomas Degris, Martha White, and Richard S. Sutton. “Off-policy actor-critic.” ICML 2012.
  17. David Silver, et al. “Deterministic policy gradient algorithms.” ICML. 2014.
  18. Timothy P. Lillicrap, et al. “Continuous control with deep reinforcement learning.” arXiv preprint arXiv:1509.02971 (2015).
  19. John Schulman, et al. “Trust region policy optimization.” ICML. 2015.
  20. Mnih, Volodymyr, et al. “Asynchronous methods for deep reinforcement learning.” ICML. 2016.
  21. Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. “Safe and efficient off-policy reinforcement learning” NIPS. 2016.
  22. Hausknecht, M., & Stone, P. (2015). Deep recurrent q-learning for partially observable mdps. arXiv preprint arXiv:1507.06527.
  23. Van Hasselt, H., Guez, A., & Silver, D. (2015). Deep reinforcement learning with double q-learning. arXiv preprint arXiv:1509.06461.
  24. Ryan Lowe, et al. “Multi-agent actor-critic for mixed cooperative-competitive environments.” NIPS. 2017.
  25. Pathak, D., Agrawal, P., Efros, A. A., & Darrell, T. (2017). Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 16-17).
  26. Yuhuai Wu, et al. “Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation.” NIPS. 2017.
  27. “Going Deeper Into Reinforcement Learning: Fundamentals of Policy Gradients.” - Seita’s Place, Mar 2017.
  28. “Notes on the Generalized Advantage Estimation Paper.” - Seita’s Place, Apr, 2017.
  29. Tuomas Haarnoja, et al. “Soft Actor-Critic Algorithms and Applications.” arXiv preprint arXiv:1812.05905 (2018).
  30. Scott Fujimoto, Herke van Hoof, and Dave Meger. “Addressing Function Approximation Error in Actor-Critic Methods.” arXiv preprint arXiv:1802.09477 (2018).
  31. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.” arXiv preprint arXiv:1801.01290 (2018).
  32. Gabriel Barth-Maron, et al. “Distributed Distributional Deterministic Policy Gradients.” ICLR 2018 poster.
  33. Doncieux, S., Laflaquière, A., & Coninx, A. (2019, July). Novelty search: a theoretical perspective. In Proceedings of the Genetic and Evolutionary Computation Conference (pp. 99-106).

Cited as:

@article{mcateer2018rlstocks,
  title   = "Getting Returns from RL",
  author  = "McAteer, Matthew",
  journal = "matthewmcateer.me",
  year    = "2018",
  url     = "https://matthewmcateer.me/blog/getting-returns-from-rl/"
}

If you notice mistakes and errors in this post, don’t hesitate to contact me at [contact at matthewmcateer dot me] and I will be very happy to correct them right away! Alternatively, you can follow me on Twitter and reach out to me there.

See you in the next post 😄
