TensorFlow Debugger (tfdbg) Command-Line-Interface Tutorial: MNIST

(Under development, subject to change)

This tutorial showcases the features of the TensorFlow Debugger (tfdbg) command-line interface. It contains an example of how to debug a frequently encountered problem in TensorFlow model development: bad numerical values (nans and infs) causing training to fail.

To observe such an issue, run the following code without the debugger:

bazel build -c opt tensorflow/python/debug:debug_mnist && \
    bazel-bin/tensorflow/python/debug/debug_mnist

This code trains a simple NN for MNIST digit image recognition. Notice that the accuracy increases slightly after the first training step, but then gets stuck at a low (near-chance) level:

Accuracy at step 0: 0.1113
Accuracy at step 1: 0.3183
Accuracy at step 2: 0.098
Accuracy at step 3: 0.098
Accuracy at step 4: 0.098
Accuracy at step 5: 0.098
Accuracy at step 6: 0.098
Accuracy at step 7: 0.098
Accuracy at step 8: 0.098
Accuracy at step 9: 0.098

Scratching your head, you suspect that certain nodes in the training graph generated bad numeric values such as infs and nans. The computation-graph paradigm of TensorFlow makes it hard to debug such internal model states with general-purpose debuggers such as Python's pdb. tfdbg specializes in diagnosing these types of issues and pinpointing the exact node where the problem first surfaced.

Adding tfdbg to TensorFlow Sessions

To add support for tfdbg in our example, we just need to add the following three lines of code, which wrap the Session object with a debugger wrapper when the --debug flag is provided:

if FLAGS.debug:
  sess = tf_debug.LocalCLIDebugWrapperSession(sess)
  sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)

This wrapper has the same interface as Session, so enabling debugging requires no other changes to the code. But the wrapper provides additional features including:

  • Bringing up a terminal-based user interface (UI) before and after each run() call, to let you control the execution and inspect the graph's internal state.
  • Allowing you to register special "filters" for tensor values, to facilitate the diagnosis of issues.
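To see how these pieces fit together, here is a minimal, self-contained sketch (the graph and variable names are illustrative, not taken from debug_mnist.py):

import tensorflow as tf
from tensorflow.python import debug as tf_debug

# Build a trivial graph; tf.log produces -inf when its input is 0.
x = tf.placeholder(tf.float32, name="x")
y = tf.log(x, name="y")

sess = tf.Session()
# Wrap the Session; the wrapper is a drop-in replacement.
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)

# This run() call now starts at the tfdbg run-start UI; entering
# run -f has_inf_or_nan would flag y:0, which contains -inf.
sess.run(y, feed_dict={x: 0.0})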

In this example, we are registering a tensor filter called "has_inf_or_nan", which simply determines if there are any nan or inf values in any intermediate tensor of the graph. (This filter is a common enough use case that we ship it with the debug_data module.)

import numpy as np

def has_inf_or_nan(datum, tensor):
  return np.any(np.isnan(tensor)) or np.any(np.isinf(tensor))

TIP: You can also write your own custom filters. See tensorflow/python/debug/debug_data.py for additional filter examples.
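For instance, a custom filter that flags tensors containing values of large magnitude might look like this (a sketch; the threshold and names are illustrative):

import numpy as np

def has_large_values(datum, tensor):
  # datum: a DebugTensorDatum carrying metadata about the dumped tensor.
  # tensor: the dumped tensor value, as a numpy array.
  return np.any(np.abs(tensor) > 1e3)

# Register it on the wrapped Session, as with has_inf_or_nan above:
sess.add_tensor_filter("has_large_values", has_large_values)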

Debugging Model Training with tfdbg

Let's try training the model again with debugging enabled. Execute the command from above, this time with the --debug flag added:

bazel build -c opt tensorflow/python/debug:debug_mnist && \
    bazel-bin/tensorflow/python/debug/debug_mnist --debug

The debug wrapper session will prompt you when it is about to execute the first run() call, with information regarding the fetched tensor and feed dictionaries displayed on the screen.

--- run-start: run #1: fetch: accuracy/accuracy/Mean:0; 2 feeds
======================================
About to enter Session run() call #1:

Fetch(es):
  accuracy/accuracy/Mean:0

Feed dict(s):
  input/x-input:0
  input/y-input:0
======================================

Select one of the following commands to proceed ---->
  run:
      Execute the run() call with the debug tensor-watching
  run -n:
      Execute the run() call without the debug tensor-watching
  run -f <filter_name>:
      Keep executing run() calls until a dumped tensor passes
      a given, registered filter. Registered filter(s):
        * has_inf_or_nan
--- Scroll: 0.00% ----------------------------------------------
tfdbg>

This is what we refer to as the run-start UI. If the screen size is too small to display the content of the message in its entirety, you can use the PageUp / PageDown / Home / End keys to navigate the screen output.

As the screen output indicates, the first run() call calculates the accuracy using a test data set, i.e., a forward pass on the graph. You can enter the command run to launch the run() call. This will bring up another screen right after the run() call has ended, which will display all dumped intermediate tensors from the run. (These tensors can also be obtained by running the command lt after you have executed run.) This is called the run-end UI:

--- run-end: run #1: fetch: accuracy/accuracy/Mean:0; 2 feeds --
21 dumped tensor(s):

[0.000 ms] accuracy/correct_prediction/ArgMax/dimension:0
[0.000 ms] softmax/biases/Variable:0
[0.013 ms] hidden/biases/Variable:0
[0.112 ms] softmax/weights/Variable:0
[1.953 ms] hidden/weights/Variable:0
[4.566 ms] accuracy/accuracy/Const:0
[5.188 ms] accuracy/correct_prediction/ArgMax_1:0
[6.901 ms] hidden/biases/Variable/read:0
[9.140 ms] softmax/biases/Variable/read:0
[11.145 ms] softmax/weights/Variable/read:0
[19.563 ms] hidden/weights/Variable/read:0
[171.189 ms] hidden/Wx_plus_b/MatMul:0
[209.433 ms] hidden/Wx_plus_b/add:0
[245.143 ms] hidden/Relu:0
--- Scroll: 0.00% ----------------------------------------------
tfdbg>

Try a few commands at the tfdbg> prompt (referencing the code at third_party/tensorflow/python/debug/examples/debug_mnist.py). For example, to print the value of the tensor hidden/Relu:0:

tfdbg> pt hidden/Relu:0

In this first run() call, there happen to be no problematic numerical values. You can exit the run-end UI by entering the command exit. Then you will be at the second run-start UI:

--- run-start: run #2: fetch: train/Adam; 2 feeds --------------
======================================
About to enter Session run() call #2:

Fetch(es):
  train/Adam

Feed dict(s):
  input/x-input:0
  input/y-input:0
======================================
...

Instead of entering run repeatedly and manually searching for nans and infs in the run-end UI after every run() call, you can use the following command to let the debugger repeatedly execute run() calls without stopping at the run-start or run-end prompt, until the first nan or inf value shows up in the graph:

tfdbg> run -f has_inf_or_nan

NOTE: This works because we have previously registered a filter for nans and infs called has_inf_or_nan (as explained above). If you have registered any other filters, you can likewise let tfdbg run until a tensor passes that filter, e.g.,

# In python code:
sess.add_tensor_filter('my_filter', my_filter_callable)

# Run at tfdbg run-start prompt:
tfdbg> run -f my_filter

After you enter run -f has_inf_or_nan, you will see the following screen with a red-colored title line, indicating that tfdbg stopped immediately after a run() call generated intermediate tensors that passed the specified filter has_inf_or_nan:

--- run-end: run #4: fetch: train/Adam; 2 feeds ----------------
30 dumped tensor(s) passing filter "has_inf_or_nan":

[13.255 ms] cross_entropy/Log:0
[13.499 ms] cross_entropy/mul:0
[14.426 ms] train/gradients/cross_entropy/mul_grad/mul:0
[14.681 ms] train/gradients/cross_entropy/mul_grad/Sum:0
[14.885 ms] train/gradients/cross_entropy/Log_grad/Inv:0
[15.239 ms] train/gradients/cross_entropy/Log_grad/mul:0
[15.378 ms] train/gradients/softmax/Softmax_grad/mul:0
--- Scroll: 0.00% ----------------------------------------------
tfdbg>

As the screen display indicates, the has_inf_or_nan filter is first passed during the fourth run() call: an Adam optimizer forward-backward training pass on the graph. In this run, 30 (out of a total of 87) intermediate tensors contain nan or inf values. These tensors are listed in chronological order, with their timestamps displayed on the left. At the top of the list, you can see the tensor in which the bad numerical values first surfaced: cross_entropy/Log:0.

To view the value of the tensor, run

tfdbg> pt cross_entropy/Log:0

Scroll down a little and you will notice some scattered inf values. If the instances of inf and nan are difficult to spot by eye, you can use the following command to perform a regex search and highlight the output:

tfdbg> /inf

Or, alternatively:

tfdbg> /(inf|nan)

To go back to the list of "offending" tensors, use the up-arrow key to navigate to the following command, and hit Enter:

tfdbg> lt -f has_inf_or_nan

To further debug, display more information about cross_entropy/Log:

tfdbg> ni cross_entropy/Log
--- run-end: run #4: fetch: train/Adam; 2 feeds ---
Node cross_entropy/Log

  Op: Log
  Device: /job:localhost/replica:0/task:0/cpu:0

  1 input(s) + 0 control input(s):
    1 input(s):
      [Softmax] softmax/Softmax

  3 recipient(s) + 0 control recipient(s):
    3 recipient(s):
      [Mul] cross_entropy/mul
      [Shape] train/gradients/cross_entropy/mul_grad/Shape_1
      [Mul] train/gradients/cross_entropy/mul_grad/mul

You can see that this node has the op type Log and that its input is the node softmax/Softmax. Run the following command to take a closer look at the input tensor:

tfdbg> pt softmax/Softmax:0

Examine the values in the input tensor, and search to see if there are any zeros:

tfdbg> /0\.000

Indeed, there are zeros. Now it is clear that the origin of the bad numerical values is the node cross_entropy/Log taking the log of zeros. You can go back to the source code in debug_mnist.py and infer that the culprit line is:

diff = y_ * tf.log(y)

Apply value clipping to the input of tf.log to resolve this problem:

diff = y_ * tf.log(tf.clip_by_value(y, 1e-8, 1.0))
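Alternatively (this is not part of the original example), you could sidestep the explicit log altogether by computing the softmax and the cross entropy in a single numerically stable op from the pre-softmax logits:

# Assumes `logits` holds the pre-softmax output of the final layer,
# i.e., y = tf.nn.softmax(logits) in the original model. Note that this
# returns the per-example cross entropy directly, so the surrounding
# reduction code would change accordingly.
diff = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_)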

Now, try training again with --debug:

bazel build -c opt tensorflow/python/debug:debug_mnist && \
    bazel-bin/tensorflow/python/debug/debug_mnist --debug

Enter run -f has_inf_or_nan at the tfdbg> prompt and confirm that no tensors are flagged as containing nan or inf values and that the accuracy no longer gets stuck. Success!

Other Features of the tfdbg Diagnostics CLI

  • Navigation through command history using the Up and Down arrow keys. Prefix-based navigation is also supported.
  • Tab completion of commands and some command arguments.

Frequently Asked Questions

Q: Do the timestamps on the left side of the lt output reflect actual performance in a non-debugging session?

A: No. The debugger inserts additional special-purpose debug nodes into the graph to record the values of intermediate tensors. These nodes slow down graph execution. If you are interested in profiling your model, check out tfprof and other profiling tools for TensorFlow.

Q: How do I link tfdbg against my Session in Bazel?

A: In your BUILD rule, declare the dependency: "//tensorflow:tensorflow_py". In your Python file, add:

from tensorflow.python import debug as tf_debug

# Then wrap your TensorFlow Session with the local-CLI wrapper.
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
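For reference, a minimal BUILD rule carrying that dependency might look like this (the target and file names are illustrative):

py_binary(
    name = "my_debug_example",
    srcs = ["my_debug_example.py"],
    deps = ["//tensorflow:tensorflow_py"],
)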

Q: Can I use tfdbg if I am using tf-learn Estimators, instead of managing my own Session objects?

A: Currently, tfdbg can only debug the fit() method of tf-learn Estimators. Support for debugging evaluate() will come soon. To debug Estimator.fit(), create a monitor and supply it as an argument. For example:

from tensorflow.python import debug as tf_debug

# Create a local CLI debug hook and use it as a monitor when calling fit().
classifier.fit(x=training_set.data,
               y=training_set.target,
               steps=1000,
               monitors=[tf_debug.LocalCLIDebugHook()])

For a detailed example based on tf-learn's iris tutorial, run:

bazel build -c opt tensorflow/python/debug:debug_tflearn_iris && \
    bazel-bin/tensorflow/python/debug/debug_tflearn_iris

Q: Does tfdbg help debugging runtime errors such as shape mismatches?

A: Yes. tfdbg intercepts errors generated by ops during runtime and presents the errors with some debug instructions to the user in the CLI. See examples:

# Debugging shape mismatch during matrix multiplication.
bazel build -c opt tensorflow/python/debug:debug_errors && \
    bazel-bin/tensorflow/python/debug/debug_errors \
        -error shape_mismatch --debug

# Debugging uninitialized variable.
bazel build -c opt tensorflow/python/debug:debug_errors && \
    bazel-bin/tensorflow/python/debug/debug_errors \
        -error uninitialized_variable --debug
