Wolfram Summer School


Shan Huang

Science and Technology

Class of 2018


Shan Huang received his BSc in physics at Nanjing University, China. He began pursuing his PhD in physics at Boston University in 2015. His main research field is condensed matter theory, particularly phase transitions and the nucleation process. Most of his research is based on data obtained from numerical simulations. Currently, he is interested in applying machine learning to analyze such data.

Computational Essay

Marriage Equality and LGBT Movement »

Project: Reinforcement Learning with Policy Gradient Methods


Deep reinforcement learning (RL) is the problem of training a neural network to act optimally in an environment. One basic environment frequently used to study reinforcement learning is the CartPole problem. Our goal is to train a neural net to play simple games like CartPole using policy gradient methods. We also aim to improve on the simple policy gradient by implementing a method such as Advantage Actor-Critic (A2C). Finally, we will try to train a network to play Atari games.
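As background, the simple policy gradient method (REINFORCE) adjusts the policy parameters $\theta$ in the direction that makes high-return actions more likely. The standard form of the gradient estimate is

$$\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right], \qquad G_t = \sum_{k \ge 0} \gamma^k r_{t+k},$$

where $\pi_\theta(a \mid s)$ is the policy, $G_t$ is the discounted return from time step $t$, and $\gamma$ is the discount factor. An actor-critic variant such as A2C replaces $G_t$ with the advantage estimate $A_t = G_t - V_\phi(s_t)$, where $V_\phi$ is a learned value baseline; this typically reduces the variance of the gradient estimate.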

Main Results in Detail

I implemented a simple policy gradient (PG) RL algorithm in the Wolfram Language and used it to train a network on a simple problem like CartPole. I achieved an average survival time of over 195 on CartPole, the threshold at which the problem is commonly considered solved. I also implemented an A2C-style PG in the Wolfram Language and tested it on CartPole. It converges much faster than the simple PG, although its final performance is worse and less stable. I tested the PG on the Atari game Pong, but more training time is needed to get a meaningful result. Finally, I wrote a Wolfram Language package that contains both the PG and A2C implementations and can be easily applied to any OpenAI Gym environment.
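To illustrate the computation at the core of both methods, here is a minimal Wolfram Language sketch of the discounted-return calculation performed at the end of each episode (the function name `discountedReturns` is hypothetical and not part of the project's package):

```wolfram
(* Discounted returns G_t = r_t + gamma*G_(t+1), accumulated backward over an episode *)
discountedReturns[rewards_List, gamma_] :=
 Reverse[Rest[FoldList[#2 + gamma #1 &, 0., Reverse[rewards]]]]

(* Example: three steps of reward 1 with gamma = 0.5 gives {1.75, 1.5, 1.} *)
discountedReturns[{1, 1, 1}, 0.5]
```

In a simple PG, these returns weight the log-probabilities of the chosen actions in the loss; in the A2C-style variant, the learned value estimate for each state would be subtracted from them first to form advantages.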

Future Work

  1. Do more PG training on Atari game environments.
  2. Improve the performance of the A2C-style PG.
  3. Try a more complicated RL method, such as proximal policy optimization.