This library lets you organize TensorFlow machine learning projects.
It is based on a set of utility functions and classes I wrote while trying out some machine learning algorithms. I recently moved them into a separate repo because I had been reusing them across different projects, and it was easier to keep track of them as a single project. Most of this is only about two weeks old, so it still needs a lot of improvement.
What does it do?
- Keeps checkpoints, TensorBoard summaries, and logs organized
- Keeps track of experiments with reference to git commits
- Produces pretty console output
- Maintains and writes histograms and moving averages
- Monitors the time taken by different sections of code
- Estimates the time remaining for experiments to run
- Helps make code more readable
Why did I make it?
I started coding existing reinforcement learning algorithms to play Atari games for fun. It was not easy to keep track of things once I started trying variations, fixing bugs, and so on. This library helps organize your experiments. It organizes the folders for checkpoints, logs, and TensorBoard summaries by experiment. It also keeps track of the git commit at which each experiment was run, so if some other change in the code affected the results of an experiment, you can easily track down what caused it.
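A rough sketch of the idea described above, not the library's actual code: group each experiment's checkpoints, logs, and summaries under one folder, and note the git commit alongside every run. The folder layout, the `trials.txt` file name, and all function names are assumptions made for illustration.

```python
import subprocess
from pathlib import Path
from typing import Dict, Optional


def experiment_paths(root: Path, name: str) -> Dict[str, Path]:
    """Return one folder per artifact type, grouped under the experiment name."""
    base = root / "experiments" / name  # hypothetical layout
    return {
        "checkpoints": base / "checkpoints",
        "logs": base / "logs",
        "summaries": base / "summaries",
    }


def current_commit(repo: Path) -> Optional[str]:
    """Return the current git commit hash, or None if it cannot be determined."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=str(repo),
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is missing, the folder does not exist, or it is not a repository
        return None
    return result.stdout.strip()


def record_run(root: Path, name: str, comment: str) -> Path:
    """Create the experiment folders and append one line describing this run."""
    paths = experiment_paths(root, name)
    for folder in paths.values():
        folder.mkdir(parents=True, exist_ok=True)
    trials = paths["logs"] / "trials.txt"  # hypothetical file name
    with trials.open("a") as f:
        f.write(f"{current_commit(root)}\t{comment}\n")
    return trials
```

With one line per run, a change in results can be matched to the commit that was checked out when the run started.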
I also wrote a logger to display pretty results on screen and to make it easy to write TensorBoard summaries. It also keeps track of training times which makes it easy to spot what’s taking up most resources. Here’s the output of this sample program (sample.py):
The time module is used only to sleep, so that you can watch how this sample works when you run it.
import time
import tensorflow as tf
from lab.experiment import Experiment
You should keep the project-level lab details defined in a Python file at the top of the project. Here’s the example lab_globals.py used for this sample.
from lab_globals import lab
Create the sample experiment
EXPERIMENT = Experiment(lab=lab, name="Sample", run_file=__file__, comment="Sample lab experiment", check_repo_dirty=False)
Get a reference to logger
logger = EXPERIMENT.logger
This is a sample monitored section. I use monitored sections to keep track of what’s going on from the console output. They are also useful for organizing code into sections when separating it into functions is difficult.
with logger.monitor("Create model") as m:
Indicate that this section failed. You don’t have to set this if it is successful.
logger.info(one=1, two=2, string="string")
Episode is a queue; this is useful when you want to track the moving average of something.
By default everything is a set of values and will create a TensorBoard histogram
We specify that fps is a scalar.
If you store multiple values for this it will output the mean.
This will produce a histogram
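The three kinds of indicator described above (a queue for moving averages, a plain collection that becomes a histogram, and a scalar reported as a mean) can be pictured with a small stand-in. This is a hypothetical sketch; the class `IndicatorStore` and its methods are assumptions for illustration and are not the library's logger.

```python
from collections import deque
from statistics import mean
from typing import Deque, Dict, List, Set


class IndicatorStore:
    """Hypothetical store that mimics the three indicator kinds."""

    def __init__(self) -> None:
        self.queues: Dict[str, Deque[float]] = {}
        self.values: Dict[str, List[float]] = {}
        self.scalars: Set[str] = set()

    def add_queue(self, name: str, size: int) -> None:
        # A bounded queue keeps only the latest `size` values
        self.queues[name] = deque(maxlen=size)

    def add_scalar(self, name: str) -> None:
        self.scalars.add(name)

    def store(self, **kwargs: float) -> None:
        for name, value in kwargs.items():
            if name in self.queues:
                self.queues[name].append(value)
            else:
                self.values.setdefault(name, []).append(value)

    def flush(self) -> Dict[str, object]:
        """Summarise stored values; queues persist, other values are cleared."""
        out: Dict[str, object] = {}
        for name, queue in self.queues.items():
            if queue:
                out[name] = mean(queue)  # moving average over the queue
        for name, vals in self.values.items():
            if not vals:
                continue
            # Scalars report their mean; everything else keeps the raw
            # values, which is what a histogram would be built from
            out[name] = mean(vals) if name in self.scalars else list(vals)
            vals.clear()
        return out
```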
Create a TensorFlow session
with tf.Session() as session:
Start the experiment from scratch, without loading from a saved checkpoint.
This will clear all the old checkpoints and summaries for this experiment.
If you start with the continued non-zero
the experiment will load from the last saved checkpoint.
Create monitored iterator
monitor = logger.iterator(range(1000))
This is the main training loop of this project.
for global_step in monitor:
A sample monitored section inside iterator
with monitor.section("sample"): time.sleep(0.5)
An unmonitored section is used only to organize code. It produces no output
logger.store(episode=global_step, fps=12)
Store a collection of values
for i in range(global_step, global_step + 10): logger.store(loss=i)
Another monitored section
with monitor.section("process_samples"): time.sleep(0.5)
A third monitored section to make it real
Let it run for multiple iterations. We’ll track the progress of that too
iterations = 100
progress = logger.progress(iterations)
for i in range(100): time.sleep(0.01)
Progress is tracked manually, unlike in the top-level iterator. The progress updates do not have to be sequential.
Clears the progress when complete
Log stored values. This will output to the console and write TensorBoard summaries.
Show iterator progress. This will display how much time is remaining.
We will overwrite the same console line, and produce a new line after ten such overwrites. This helps keep the console output concise.
logger.clear_line(reset=(global_step + 1) % 10 != 0)
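To make the timing side of the sample concrete, here is a minimal sketch of how a monitored iterator could time named sections and estimate the time remaining. It is an illustration under stated assumptions, not how the library does it: the class `TimedLoop`, its methods, and the injectable `clock` parameter are all hypothetical.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator, Sequence


class TimedLoop:
    """Hypothetical helper: iterate, time named sections, estimate time left."""

    def __init__(self, items: Sequence, clock: Callable[[], float] = time.monotonic):
        self.items = items
        self.clock = clock
        self.section_time: Dict[str, float] = {}
        self.done = 0
        self.started = 0.0
        self.finished_at = 0.0

    def __iter__(self) -> Iterator:
        self.started = self.clock()
        self.finished_at = self.started
        self.done = 0
        for item in self.items:
            yield item
            # Reached only once the loop body for `item` has finished
            self.done += 1
            self.finished_at = self.clock()

    @contextmanager
    def section(self, name: str):
        """Accumulate the wall-clock time spent inside a named section."""
        start = self.clock()
        try:
            yield
        finally:
            elapsed = self.clock() - start
            self.section_time[name] = self.section_time.get(name, 0.0) + elapsed

    def remaining(self) -> float:
        """Seconds left, assuming future iterations cost as much as past ones."""
        if self.done == 0:
            return float("inf")  # nothing measured yet
        per_item = (self.finished_at - self.started) / self.done
        return per_item * (len(self.items) - self.done)
```

The estimate is a plain average over completed iterations, so it is only as good as the assumption that iterations take roughly equal time. Passing a fake `clock` makes the helper easy to check without sleeping.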
There are some other tools built around this that I need to clean up and include.
This lets you start TensorBoard without having to type in all the log paths. For instance, you can start it with
tb ppo ppo_transformed_bellman
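A launcher like this mostly has to turn experiment names into a TensorBoard command line. The sketch below only builds the argument list and does not run anything; the function name and the per-experiment `summaries` folder layout are assumptions for illustration. TensorBoard 1.x accepts comma-separated `name:path` pairs in `--logdir`, while 2.x versions expect them in `--logdir_spec`, so the flag is left as a parameter.

```python
from pathlib import Path
from typing import List, Sequence


def tensorboard_command(
    root: Path, names: Sequence[str], flag: str = "--logdir"
) -> List[str]:
    """Build a TensorBoard invocation with one named log path per experiment."""
    # Hypothetical layout: <root>/<experiment name>/summaries
    specs = [f"{name}:{root / name / 'summaries'}" for name in names]
    return ["tensorboard", flag + "=" + ",".join(specs)]
```

The resulting list could be handed to `subprocess.run` to actually start TensorBoard.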
Custom analysis of TensorBoard summaries
TensorBoard is nice, but sometimes you need custom charts to debug algorithms. The following is an example of a custom chart:
And sometimes TensorBoard doesn’t do a good job. For instance, say you have a histogram with 90% of the data points between 1 and 2 and a few outliers at 1000: you won’t be able to see the distribution between 1 and 2 because the graph is scaled to 1000.
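One simple way around the outlier problem, shown here as a sketch and not as the tooling mentioned in this post, is to bin only a chosen range and count everything outside it separately. The function name and its parameters are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def clipped_histogram(
    values: Sequence[float], low: float, high: float, bins: int = 10
) -> Tuple[List[int], int]:
    """Count values in equal-width bins over [low, high].

    Returns (counts, outliers), where outliers is the number of values
    that fall outside the range and are left out of the bins.
    """
    counts = [0] * bins
    outliers = 0
    width = (high - low) / bins
    for v in values:
        if v < low or v > high:
            outliers += 1
            continue
        # A value equal to `high` belongs in the last bin
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts, outliers
```

Binning between 1 and 2 keeps the shape of the bulk of the data visible, and the outlier count still tells you that the points near 1000 exist.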
I think TensorBoard will evolve to handle these cases. The main reason I wrote this tooling is for custom charts, and because it’s not that hard to do.