Parsing Histograms

This page demonstrates the parsing process for histogram events.

Preparing Sample Event Logs 

First, let’s import some libraries and prepare the environment for our sample event logs:

>>> import os
>>> import tempfile
>>> import numpy as np
>>> # Define some constants
>>> RND_STATE = 1234
>>> N_EVENTS = 10
>>> N_PARTICLES = 1000
>>> MU = 0
>>> SIGMA = 2
>>> # Prepare temp dirs for storing event files
>>> tmpdirs = {}

Before parsing an event file, we need to generate it. The sample event files are generated by three commonly used event log writers.

We can generate the events with PyTorch:

>>> tmpdirs['torch'] = tempfile.TemporaryDirectory()
>>> from torch.utils.tensorboard import SummaryWriter
>>> log_dir = tmpdirs['torch'].name
>>> writer = SummaryWriter(log_dir)
>>> rng = np.random.RandomState(RND_STATE)
>>> for i in range(N_EVENTS):
...   x = rng.normal(MU, SIGMA, size=N_PARTICLES)
...   writer.add_histogram('dist', x + i, i)
>>> writer.close()

and quickly check the results:

>>> from tbparse import SummaryReader
>>> SummaryReader(log_dir, pivot=True).histograms.columns
Index(['step', 'dist/counts', 'dist/limits'], dtype='object')

We can generate the events with TensorFlow2 / Keras:

>>> tmpdirs['tensorflow'] = tempfile.TemporaryDirectory()
>>> import tensorflow as tf
>>> log_dir = tmpdirs['tensorflow'].name
>>> writer = tf.summary.create_file_writer(log_dir)
>>> writer.set_as_default()
>>> rng = np.random.RandomState(RND_STATE)
>>> for i in range(N_EVENTS):
...   x = rng.normal(MU, SIGMA, size=N_PARTICLES)
...   assert tf.summary.histogram('dist', x + i, i)
>>> writer.close()

and quickly check the results:

>>> from tbparse import SummaryReader
>>> SummaryReader(log_dir, pivot=True).tensors.columns
Index(['step', 'dist'], dtype='object')

Warning

In recent versions of TensorFlow, the histogram method actually stores the events as tensor events inside the event file. Thus, if the event file is generated by TensorFlow2, you need an extra step: convert the tensors with tensor_to_histogram() first. (An example is shown later.)
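
To make the warning concrete, the sketch below assumes the TensorFlow 2 layout, in which each histogram is stored as a tensor of shape [k, 3] whose rows are (left edge, right edge, count). The helper buckets_to_counts_edges and the sample values are hypothetical and only illustrate the idea; this is not tbparse's implementation, and tensor_to_histogram() may use a different convention for its limits.

```python
import numpy as np


def buckets_to_counts_edges(buckets):
    """Split a [k, 3] bucket tensor into k counts and k + 1 bin edges.

    Hypothetical helper for illustration only. It assumes contiguous
    buckets, i.e. each row's right edge equals the next row's left edge.
    """
    buckets = np.asarray(buckets, dtype=float)
    if buckets.ndim != 2 or buckets.shape[1] != 3:
        raise ValueError("expected an array of shape [k, 3]")
    counts = buckets[:, 2]
    # The first left edge followed by every right edge gives k + 1 edges.
    edges = np.concatenate([buckets[:1, 0], buckets[:, 1]])
    return counts, edges


# Made-up sample: three contiguous buckets covering [-1, 2].
sample = [[-1.0, 0.0, 3.0],
          [0.0, 1.0, 5.0],
          [1.0, 2.0, 2.0]]
counts, edges = buckets_to_counts_edges(sample)
```

The result follows the numpy.histogram convention (one more edge than counts), which is a choice made for this sketch.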

We can generate the events with TensorboardX:

>>> tmpdirs['tensorboardX'] = tempfile.TemporaryDirectory()
>>> from tensorboardX import SummaryWriter
>>> log_dir = tmpdirs['tensorboardX'].name
>>> writer = SummaryWriter(log_dir)
>>> rng = np.random.RandomState(RND_STATE)
>>> for i in range(N_EVENTS):
...   x = rng.normal(MU, SIGMA, size=N_PARTICLES)
...   writer.add_histogram('dist', x + i, i)
>>> writer.close()

and quickly check the results:

>>> from tbparse import SummaryReader
>>> SummaryReader(log_dir, pivot=True).histograms.columns
Index(['step', 'dist/counts', 'dist/limits'], dtype='object')

As shown above, the event logs can be read in two lines of code: one to import tbparse and one to read the events.

Parsing Event Logs 

Different use cases call for reading the event logs in different formats. The following sections show several configurations of the tbparse.SummaryReader class.

Load Event File / Run Directory 

>>> from tbparse import SummaryReader
>>> log_dir = tmpdirs['torch'].name
>>> # Long Format
>>> df = SummaryReader(log_dir).histograms
>>> df.columns
Index(['step', 'tag', 'counts', 'limits'], dtype='object')
>>> # Wide Format
>>> df = SummaryReader(log_dir, pivot=True).histograms
>>> df.columns
Index(['step', 'dist/counts', 'dist/limits'], dtype='object')

>>> from tbparse import SummaryReader
>>> log_dir = tmpdirs['tensorflow'].name
>>> # Long Format
>>> df = SummaryReader(log_dir).tensors
>>> df.columns
Index(['step', 'tag', 'value'], dtype='object')
>>> hist_dict_arr = df['value'].apply(SummaryReader.tensor_to_histogram)
>>> df['counts'] = hist_dict_arr.apply(lambda x: x['counts'])
>>> df['limits'] = hist_dict_arr.apply(lambda x: x['limits'])
>>> df.drop(columns=['value'], inplace=True)
>>> df.columns
Index(['step', 'tag', 'counts', 'limits'], dtype='object')
>>> # Wide Format
>>> df = SummaryReader(log_dir, pivot=True).tensors
>>> df.columns
Index(['step', 'dist'], dtype='object')
>>> hist_dict_arr = df['dist'].apply(SummaryReader.tensor_to_histogram)
>>> df['dist/counts'] = hist_dict_arr.apply(lambda x: x['counts'])
>>> df['dist/limits'] = hist_dict_arr.apply(lambda x: x['limits'])
>>> df.drop(columns=['dist'], inplace=True)
>>> df.columns
Index(['step', 'dist/counts', 'dist/limits'], dtype='object')

>>> from tbparse import SummaryReader
>>> log_dir = tmpdirs['tensorboardX'].name
>>> # Long Format
>>> df = SummaryReader(log_dir).histograms
>>> df.columns
Index(['step', 'tag', 'counts', 'limits'], dtype='object')
>>> # Wide Format
>>> df = SummaryReader(log_dir, pivot=True).histograms
>>> df.columns
Index(['step', 'dist/counts', 'dist/limits'], dtype='object')
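
The relationship between the long and wide formats can be sketched in plain Python. The function long_to_wide and the sample rows below are hypothetical; they only mirror the column names shown above and are not how tbparse builds its DataFrames.

```python
def long_to_wide(rows):
    """Group long-format rows by step and emit one wide row per step.

    Each (tag, field) pair becomes a "<tag>/<field>" column, mirroring
    the 'dist/counts' and 'dist/limits' columns shown above.
    """
    by_step = {}
    for row in rows:
        wide = by_step.setdefault(row["step"], {"step": row["step"]})
        wide[f"{row['tag']}/counts"] = row["counts"]
        wide[f"{row['tag']}/limits"] = row["limits"]
    return [by_step[step] for step in sorted(by_step)]


# Made-up long-format rows: one row per (step, tag).
long_rows = [
    {"step": 0, "tag": "dist", "counts": [1, 2], "limits": [0.0, 1.0]},
    {"step": 1, "tag": "dist", "counts": [3, 4], "limits": [1.0, 2.0]},
]
wide_rows = long_to_wide(long_rows)
```

In the long format every row carries its tag, so new tags add rows; in the wide format new tags add columns and each step appears once.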

Warning

When accessing SummaryReader.histograms, the events stored in each event file are collected internally. The best practice is to store the returned results in a DataFrame as shown in the samples, instead of repeatedly accessing SummaryReader.histograms.
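
The cost described in this warning can be illustrated with a small stand-in class. CountingReader is hypothetical and is not part of tbparse; it simply assumes that the property repeats its collection work on every access, as the warning suggests.

```python
class CountingReader:
    """Hypothetical reader whose property redoes its work on every access."""

    def __init__(self, events):
        self._events = list(events)
        self.collect_calls = 0

    @property
    def histograms(self):
        # Stand-in for the internal collection step described above.
        self.collect_calls += 1
        return list(self._events)


reader = CountingReader(range(3))

# Repeated property access: one collection per access.
for _ in range(3):
    _ = reader.histograms
calls_without_caching = reader.collect_calls

# Store the result once and reuse the local variable instead.
reader.collect_calls = 0
df = reader.histograms
for _ in range(3):
    _ = len(df)
calls_with_caching = reader.collect_calls
```

Under this assumption, three accesses trigger three collections, while storing the result triggers only one.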

Plotting Events

Plotting a Distribution

The data from TensorBoard event logs, written by PyTorch:

import matplotlib.pyplot as plt
from tbparse import SummaryReader

log_dir = tmpdirs['torch'].name
reader = SummaryReader(log_dir, pivot=True)
df = reader.histograms
df.set_index('step', inplace=True)
counts0 = df.at[0, 'dist/counts']
limits0 = df.at[0, 'dist/limits']
# draw PDF
x = np.linspace(limits0[0], limits0[-1], 11)
x, y = SummaryReader.histogram_to_pdf(counts0, limits0, x)
plt.plot(x, y)
# draw CDF
x = np.linspace(limits0[0], limits0[-1], 1000)
y = SummaryReader.histogram_to_cdf(counts0, limits0, x)
plt.plot(x, y)
plt.title('dist0')
plt.show()

The data from TensorBoard event logs, written by TensorFlow:

import matplotlib.pyplot as plt
from tbparse import SummaryReader

log_dir = tmpdirs['tensorflow'].name
reader = SummaryReader(log_dir, pivot=True)
df = reader.tensors
buckets0 = df.at[0, 'dist']
hist_dict0 = SummaryReader.tensor_to_histogram(buckets0)
counts0 = hist_dict0['counts']
limits0 = hist_dict0['limits']
# draw PDF
x = np.linspace(limits0[0], limits0[-1], 11)
x, y = SummaryReader.histogram_to_pdf(counts0, limits0, x)
plt.plot(x, y)
# draw CDF
x = np.linspace(limits0[0], limits0[-1], 1000)
y = SummaryReader.histogram_to_cdf(counts0, limits0, x)
plt.plot(x, y)
plt.title('dist0')
plt.show()

The data from TensorBoard event logs, written by TensorboardX:

import matplotlib.pyplot as plt
from tbparse import SummaryReader

log_dir = tmpdirs['tensorboardX'].name
reader = SummaryReader(log_dir, pivot=True)
df = reader.histograms
df.set_index('step', inplace=True)
counts0 = df.at[0, 'dist/counts']
limits0 = df.at[0, 'dist/limits']
# draw PDF
x = np.linspace(limits0[0], limits0[-1], 11)
x, y = SummaryReader.histogram_to_pdf(counts0, limits0, x)
plt.plot(x, y)
# draw CDF
x = np.linspace(limits0[0], limits0[-1], 1000)
y = SummaryReader.histogram_to_cdf(counts0, limits0, x)
plt.plot(x, y)
plt.title('dist0')
plt.show()

The ground truth data:

import scipy.stats
import matplotlib.pyplot as plt
from tbparse import SummaryReader

rng = np.random.RandomState(RND_STATE)
x = rng.normal(MU, SIGMA, size=N_PARTICLES)
counts, limits = np.histogram(x)
hist = (counts, limits)
hist_dist = scipy.stats.rv_histogram(hist)
centers = (limits[1:]+limits[:-1])/2
pdf = hist_dist.pdf(centers)
cdf = hist_dist.cdf(centers)
plt.plot(centers, pdf)
plt.plot(centers, cdf)
plt.hist(x, density=True)
plt.title('dist0')
plt.show()

Reference: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_histogram.html
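
For readers who want to see what a histogram-based PDF and CDF amount to, here is a minimal NumPy sketch, assuming a piecewise-constant density inside each bin and a piecewise-linear CDF between bin edges. The function histogram_pdf_cdf is hypothetical; it is not the implementation used by scipy.stats.rv_histogram or by tbparse, and the constants simply repeat the ones defined at the top of this page.

```python
import numpy as np


def histogram_pdf_cdf(counts, edges, query):
    """Evaluate a histogram's PDF and CDF at the given query points."""
    counts = np.asarray(counts, dtype=float)
    edges = np.asarray(edges, dtype=float)
    query = np.asarray(query, dtype=float)
    total = counts.sum()
    # Density of each bin: probability mass divided by bin width.
    density = counts / (total * np.diff(edges))
    # Find the bin of each query point; clip so edges[-1] maps to the last bin.
    idx = np.clip(np.searchsorted(edges, query, side="right") - 1,
                  0, len(counts) - 1)
    inside = (query >= edges[0]) & (query <= edges[-1])
    pdf = np.where(inside, density[idx], 0.0)
    # Cumulative mass at each edge, linearly interpolated in between.
    cum = np.concatenate([[0.0], np.cumsum(counts) / total])
    cdf = np.interp(query, edges, cum)
    return pdf, cdf


rng = np.random.RandomState(1234)
samples = rng.normal(0, 2, size=1000)
counts, edges = np.histogram(samples)
pdf, cdf = histogram_pdf_cdf(counts, edges, edges)
```

Outside the histogram's range the sketch returns a density of 0 and a CDF of 0 or 1, which is a design choice of this example.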

Plotting Multiple (Stacked) Distributions 

import seaborn as sns
import matplotlib.pyplot as plt
log_dir = tmpdirs['torch'].name
reader = SummaryReader(log_dir, pivot=True)
df = reader.histograms
# Set background
sns.set_theme(style="white", rc={"axes.facecolor": (0, 0, 0, 0)})
# Choose color palettes for the distributions
pal = sns.color_palette("Oranges", 20)[5:-5]
# Initialize the FacetGrid object (stacking multiple plots)
g = sns.FacetGrid(df, row='step', hue='step', aspect=15, height=.4, palette=pal)

def plot_subplots(x, color, label, data):
  ax = plt.gca()
  ax.text(0, .08, label, fontweight="bold", color=color,
          ha="left", va="center", transform=ax.transAxes)
  counts = data['dist/counts'].iloc[0]
  limits = data['dist/limits'].iloc[0]
  x = np.linspace(limits[0], limits[-1], 15)
  x, y = SummaryReader.histogram_to_pdf(counts, limits, x)
  # Draw the densities in a few steps
  sns.lineplot(x=x, y=y, clip_on=False, color="w", lw=2)
  ax.fill_between(x, y, color=color)
# Plot each subplot with df[df['step']==i]
g.map_dataframe(plot_subplots, None)

# Add a bottom line for each subplot
# passing color=None to refline() uses the hue mapping
g.refline(y=0, linewidth=2, linestyle="-", color=None, clip_on=False)
# Set the subplots to overlap (i.e., height of each distribution)
g.figure.subplots_adjust(hspace=-.9)
# Remove axes details that don't play well with overlap
g.set_titles("")
g.set(yticks=[], xlabel="", ylabel="")
g.despine(bottom=True, left=True)

import seaborn as sns
import matplotlib.pyplot as plt
log_dir = tmpdirs['tensorflow'].name
reader = SummaryReader(log_dir, pivot=True)
df = reader.tensors
# Set background
sns.set_theme(style="white", rc={"axes.facecolor": (0, 0, 0, 0)})
# Choose color palettes for the distributions
pal = sns.color_palette("Oranges", 20)[5:-5]
# Initialize the FacetGrid object (stacking multiple plots)
g = sns.FacetGrid(df, row='step', hue='step', aspect=15, height=.4, palette=pal)

def plot_subplots(x, color, label, data):
  ax = plt.gca()
  ax.text(0, .08, label, fontweight="bold", color=color,
          ha="left", va="center", transform=ax.transAxes)
  buckets = data['dist'].iloc[0]
  hist_dict = SummaryReader.tensor_to_histogram(buckets)
  counts = hist_dict['counts']
  limits = hist_dict['limits']
  x = np.linspace(limits[0], limits[-1], 15)
  x, y = SummaryReader.histogram_to_pdf(counts, limits, x)
  # Draw the densities in a few steps
  sns.lineplot(x=x, y=y, clip_on=False, color="w", lw=2)
  ax.fill_between(x, y, color=color)
# Plot each subplot with df[df['step']==i]
g.map_dataframe(plot_subplots, None)

# Add a bottom line for each subplot
# passing color=None to refline() uses the hue mapping
g.refline(y=0, linewidth=2, linestyle="-", color=None, clip_on=False)
# Set the subplots to overlap (i.e., height of each distribution)
g.figure.subplots_adjust(hspace=-.9)
# Remove axes details that don't play well with overlap
g.set_titles("")
g.set(yticks=[], xlabel="", ylabel="")
g.despine(bottom=True, left=True)

import seaborn as sns
import matplotlib.pyplot as plt
log_dir = tmpdirs['tensorboardX'].name
reader = SummaryReader(log_dir, pivot=True)
df = reader.histograms
# Set background
sns.set_theme(style="white", rc={"axes.facecolor": (0, 0, 0, 0)})
# Choose color palettes for the distributions
pal = sns.color_palette("Oranges", 20)[5:-5]
# Initialize the FacetGrid object (stacking multiple plots)
g = sns.FacetGrid(df, row='step', hue='step', aspect=15, height=.4, palette=pal)

def plot_subplots(x, color, label, data):
  ax = plt.gca()
  ax.text(0, .08, label, fontweight="bold", color=color,
          ha="left", va="center", transform=ax.transAxes)
  counts = data['dist/counts'].iloc[0]
  limits = data['dist/limits'].iloc[0]
  x = np.linspace(limits[0], limits[-1], 15)
  x, y = SummaryReader.histogram_to_pdf(counts, limits, x)
  # Draw the densities in a few steps
  sns.lineplot(x=x, y=y, clip_on=False, color="w", lw=2)
  ax.fill_between(x, y, color=color)
# Plot each subplot with df[df['step']==i]
g.map_dataframe(plot_subplots, None)

# Add a bottom line for each subplot
# passing color=None to refline() uses the hue mapping
g.refline(y=0, linewidth=2, linestyle="-", color=None, clip_on=False)
# Set the subplots to overlap (i.e., height of each distribution)
g.figure.subplots_adjust(hspace=-.9)
# Remove axes details that don't play well with overlap
g.set_titles("")
g.set(yticks=[], xlabel="", ylabel="")
g.despine(bottom=True, left=True)

Reference: https://seaborn.pydata.org/examples/kde_ridgeplot.html

Plotting Multiple (Stacked) Histograms 

import seaborn as sns
import matplotlib.pyplot as plt
log_dir = tmpdirs['torch'].name
reader = SummaryReader(log_dir, pivot=True)
df = reader.histograms
# Set background
sns.set_theme(style="white", rc={"axes.facecolor": (0, 0, 0, 0)})
# Choose color palettes for the distributions
pal = sns.color_palette("Oranges", 20)[5:-5]
# Initialize the FacetGrid object (stacking multiple plots)
g = sns.FacetGrid(df, row='step', hue='step', aspect=15, height=.4, palette=pal)

def plot_subplots(x, color, label, data):
  ax = plt.gca()
  ax.text(0, .08, label, fontweight="bold", color=color,
          ha="left", va="center", transform=ax.transAxes)
  counts = data['dist/counts'].iloc[0]
  limits = data['dist/limits'].iloc[0]
  x, y = SummaryReader.histogram_to_bins(counts, limits, limits[0], limits[-1], 15)
  # Draw the densities in a few steps
  sns.lineplot(x=x, y=y, clip_on=False, color="w", lw=2)
  ax.fill_between(x, y, color=color)
# Plot each subplot with df[df['step']==i]
g.map_dataframe(plot_subplots, None)

# Add a bottom line for each subplot
# passing color=None to refline() uses the hue mapping
g.refline(y=0, linewidth=2, linestyle="-", color=None, clip_on=False)
# Set the subplots to overlap (i.e., height of each distribution)
g.figure.subplots_adjust(hspace=-.9)
# Remove axes details that don't play well with overlap
g.set_titles("")
g.set(yticks=[], xlabel="", ylabel="")
g.despine(bottom=True, left=True)

import seaborn as sns
import matplotlib.pyplot as plt
log_dir = tmpdirs['tensorflow'].name
reader = SummaryReader(log_dir, pivot=True)
df = reader.tensors
# Set background
sns.set_theme(style="white", rc={"axes.facecolor": (0, 0, 0, 0)})
# Choose color palettes for the distributions
pal = sns.color_palette("Oranges", 20)[5:-5]
# Initialize the FacetGrid object (stacking multiple plots)
g = sns.FacetGrid(df, row='step', hue='step', aspect=15, height=.4, palette=pal)

def plot_subplots(x, color, label, data):
  ax = plt.gca()
  ax.text(0, .08, label, fontweight="bold", color=color,
          ha="left", va="center", transform=ax.transAxes)
  buckets = data['dist'].iloc[0]
  hist_dict = SummaryReader.tensor_to_histogram(buckets)
  counts = hist_dict['counts']
  limits = hist_dict['limits']
  x, y = SummaryReader.histogram_to_bins(counts, limits, limits[0], limits[-1], 15)
  # Draw the densities in a few steps
  sns.lineplot(x=x, y=y, clip_on=False, color="w", lw=2)
  ax.fill_between(x, y, color=color)
# Plot each subplot with df[df['step']==i]
g.map_dataframe(plot_subplots, None)

# Add a bottom line for each subplot
# passing color=None to refline() uses the hue mapping
g.refline(y=0, linewidth=2, linestyle="-", color=None, clip_on=False)
# Set the subplots to overlap (i.e., height of each distribution)
g.figure.subplots_adjust(hspace=-.9)
# Remove axes details that don't play well with overlap
g.set_titles("")
g.set(yticks=[], xlabel="", ylabel="")
g.despine(bottom=True, left=True)

import seaborn as sns
import matplotlib.pyplot as plt
log_dir = tmpdirs['tensorboardX'].name
reader = SummaryReader(log_dir, pivot=True)
df = reader.histograms
# Set background
sns.set_theme(style="white", rc={"axes.facecolor": (0, 0, 0, 0)})
# Choose color palettes for the distributions
pal = sns.color_palette("Oranges", 20)[5:-5]
# Initialize the FacetGrid object (stacking multiple plots)
g = sns.FacetGrid(df, row='step', hue='step', aspect=15, height=.4, palette=pal)

def plot_subplots(x, color, label, data):
  ax = plt.gca()
  ax.text(0, .08, label, fontweight="bold", color=color,
          ha="left", va="center", transform=ax.transAxes)
  counts = data['dist/counts'].iloc[0]
  limits = data['dist/limits'].iloc[0]
  x, y = SummaryReader.histogram_to_bins(counts, limits, limits[0], limits[-1], 15)
  # Draw the densities in a few steps
  sns.lineplot(x=x, y=y, clip_on=False, color="w", lw=2)
  ax.fill_between(x, y, color=color)
# Plot each subplot with df[df['step']==i]
g.map_dataframe(plot_subplots, None)

# Add a bottom line for each subplot
# passing color=None to refline() uses the hue mapping
g.refline(y=0, linewidth=2, linestyle="-", color=None, clip_on=False)
# Set the subplots to overlap (i.e., height of each distribution)
g.figure.subplots_adjust(hspace=-.9)
# Remove axes details that don't play well with overlap
g.set_titles("")
g.set(yticks=[], xlabel="", ylabel="")
g.despine(bottom=True, left=True)

SummaryReader.histogram_to_bins aims to reproduce the visualization in the TensorBoard dashboard.

Reference: https://github.com/tensorflow/tensorboard/blob/master/tensorboard/plugins/histogram/tf_histogram_dashboard/histogramCore.ts#L83
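
One way to picture this kind of re-binning is to spread each original bucket's count over uniform bins in proportion to how much they overlap. The sketch below is written under that assumption; rebin_uniform is a hypothetical helper, not a port of histogram_to_bins or of the TensorBoard source, and the sample values are made up.

```python
import numpy as np


def rebin_uniform(counts, edges, lower, upper, n_bins):
    """Redistribute bucket counts onto n_bins uniform bins in [lower, upper].

    Each original bucket contributes to a new bin in proportion to the
    length of their overlap. Zero-width buckets are skipped in this sketch.
    """
    counts = np.asarray(counts, dtype=float)
    edges = np.asarray(edges, dtype=float)
    new_edges = np.linspace(lower, upper, n_bins + 1)
    new_counts = np.zeros(n_bins)
    for left, right, count in zip(edges[:-1], edges[1:], counts):
        width = right - left
        if width <= 0:
            continue
        # Overlap length between this bucket and every new bin.
        overlap = np.clip(
            np.minimum(right, new_edges[1:]) - np.maximum(left, new_edges[:-1]),
            0, None)
        new_counts += count * overlap / width
    centers = (new_edges[:-1] + new_edges[1:]) / 2
    return centers, new_counts


# Four unit-width buckets merged into two bins of width 2.
centers, new_counts = rebin_uniform([4, 0, 8, 4], [0, 1, 2, 3, 4], 0.0, 4.0, 2)
```

When the new range covers the original one, the total count is preserved, which makes the result easy to sanity-check.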
