SAVE_RESTORE_MODEL

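Saving and restoring a model involves two parts: 1. the model's structure (the code that creates the model), and 2. the model's trained weights (its parameters). So don't keep only the parameters and forget the structure. Also, there are many ways to save a model in TF code, depending on the API; this post uses `tf.keras`, TF's "high-level API" (which differs from TF's "raw" save and load code). P.S. Reading the official material feels almost excessively reassuring.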
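
To make the structure/weights split concrete before any TF code appears, here is a toy sketch with no TF in it at all. Everything in it (`create_toy_model`, `save_toy_weights`, `load_toy_weights`, the JSON file) is made up for illustration and is not how `tf.keras` stores weights; it only shows why weights alone are useless without a matching structure.

```python
import json
import os
import random
import tempfile

def create_toy_model(n_in=4, n_out=2):
    """Structure: the layer sizes are fixed by this code; weights start out random."""
    return {"w": [[random.random() for _ in range(n_out)] for _ in range(n_in)],
            "b": [0.0] * n_out}

def save_toy_weights(model, path):
    """Write only the numbers to disk; the structure lives in create_toy_model()."""
    with open(path, "w") as f:
        json.dump(model, f)

def load_toy_weights(model, path):
    """Copy saved weights into an existing model, refusing a structure mismatch."""
    with open(path) as f:
        saved = json.load(f)
    if len(saved["w"]) != len(model["w"]) or len(saved["b"]) != len(model["b"]):
        raise ValueError("saved weights do not fit this model structure")
    model["w"], model["b"] = saved["w"], saved["b"]

trained = create_toy_model()  # stands in for a trained model
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_weights.json")
    save_toy_weights(trained, path)

    restored = create_toy_model()     # same structure, different random weights
    load_toy_weights(restored, path)  # now carries the saved weights

    mismatched = create_toy_model(n_in=3)  # different structure
    try:
        load_toy_weights(mismatched, path)
        mismatch_rejected = False
    except ValueError:
        mismatch_rejected = True

print(restored == trained, mismatch_rejected)
```

The same idea shows up later in the demo: a fresh `create_model()` call rebuilds the structure, and `load_weights` fills in the numbers.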

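Rough steps of this demo
1. Load the MNIST dataset and define a model for the demo;
2. Define a `cp_callback` callback and pass it as the `callbacks` argument of `model.fit()` (this tells Keras to save once at the end of every epoch). This step produces a set of related save files in the folder that contains `training_1/cp.ckpt`;
3. When the original model's parameters are needed, restore them, e.g. `model.load_weights(checkpoint_path)`; after this step the model holds the trained parameters.
Other topics
1. Callback options (settings), 2. saving weights manually, 3. saving the entire model (as an `.h5` file (HDF5 standard) or in `saved_model` form).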

The code follows (no code => empty talk):

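# Annoyingly, the code exported from the '.ipynb' sets this shebang; no wonder modules kept going missing: the script was running in the default environment;
#!/usr/bin/env python
# coding: utf-8
# Install the dependencies from a shell first (the exported `get_ipython().system(...)` calls only work inside IPython):
#   pip install h5py pyyaml
#   pip install tf_nightly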

from __future__ import absolute_import, division, print_function, unicode_literals
import os
import tensorflow as tf
from tensorflow import keras

# ##############################################################
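# Step 1. Load the MNIST dataset and define a model for the demo;
# Notes: 1. MNIST is used for the save/load demo, 2. only the first 1000 samples are used, to speed things up;
# #############################################
# The API already has a built-in load function;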
(train_images,train_labels),(test_images,test_labels) = tf.keras.datasets.mnist.load_data()
# Take the first 1000 samples, flatten them (this model uses no Conv layers), and normalize;
train_labels = train_labels[:1000]
test_labels  = test_labels[:1000]
train_images = train_images[:1000].reshape(-1, 28 * 28) / 255.0
test_images  = test_images[:1000].reshape(-1, 28 * 28) / 255.0
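# Define the model (a short sequential model) to demonstrate how its parameters are saved and loaded;
# The data is about as complex as FashionMNIST, so the model has a similar form: a plain dense network, no CNN;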
def create_model():
    model = tf.keras.models.Sequential([
        keras.layers.Dense(512, activation=tf.keras.activations.relu, input_shape=(784,)),
        keras.layers.Dropout(0.2),
        keras.layers.Dense(10, activation=tf.keras.activations.softmax)
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                loss=tf.keras.losses.sparse_categorical_crossentropy,
                metrics=['accuracy'])
    return model
# Create a basic model instance;
model = create_model()
# model.summary()

# ############################################################
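# Step 2. Define a `cp_callback` callback and pass it as the `callbacks` argument of `model.fit()` (this tells Keras to save once at the end of every epoch). This step produces a set of related save files in the folder that contains `training_1/cp.ckpt`;
# #############################################
# The checkpoint callback automatically saves checkpoints during training or at the end of training, so that you can:
# 1. use the trained model without retraining it, 2. resume training from where it was paused.
# Train the model and pass the `ModelCheckpoint` callback to it (via model.fit()):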
checkpoint_path = "training_1/cp.ckpt"
checkpoint_dir  = os.path.dirname(checkpoint_path)
# Create a Keras checkpoint callback and configure it in the same call;
cp_callback = tf.keras.callbacks.ModelCheckpoint(
    checkpoint_path, save_weights_only=True, verbose=1)
model = create_model()
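# This time the fit() call is more involved than in the FashionMNIST example (some arguments are set explicitly instead of being left at their defaults);
# The code below creates a single set of TF checkpoint files that is updated at the end of every epoch, rather than producing new checkpoint files each time;
# Register the callback in fit() so that training knows which checkpoint callback to invoke;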
model.fit(train_images, train_labels, epochs = 10,
          validation_data = (test_images,test_labels),
          callbacks = [cp_callback])  # Pass callback to training;
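# The checkpoint files have now been generated; the comparison test follows;

# Comparison test: a fresh model versus the same model after loading the saved weights;
# Create a new, untrained model (when restoring from weights only, the new model must have the same architecture as the original, even though it is a different model instance);
# A new untrained model performs at chance level (accuracy around 10%);
model     = create_model()  # another model with the same structure;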
loss, acc = model.evaluate(test_images, test_labels)
print("Untrained model, accuracy: {:5.2f}%".format(100*acc))
# 1000/1000 [==============================] - 0s 117us/sample - loss: 2.3292 - acc: 0.0990
# Untrained model, accuracy:  9.90%
# Next, load the weights from the checkpoint and re-evaluate the accuracy:
model.load_weights(checkpoint_path)
loss, acc = model.evaluate(test_images, test_labels)
print("Restored model, accuracy: {:5.2f}%".format(100*acc))
# 1000/1000 [==============================] - 0s 40us/sample - loss: 0.4149 - acc: 0.8670
# Restored model, accuracy: 86.70%

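# This block demonstrates some configuration of the checkpoint callback;
# Checkpoint callback options: unique names for the resulting checkpoints, and the checkpointing frequency;
# Below: train a new model, saving a uniquely named checkpoint every 5 epochs;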
# Include the epoch in the file name. (uses `str.format`)
checkpoint_path = "training_2/cp-{epoch:04d}.ckpt"
checkpoint_dir  = os.path.dirname(checkpoint_path)
cp_callback     = tf.keras.callbacks.ModelCheckpoint(
    checkpoint_path, verbose=1, save_weights_only=True,
    period=5)  # Save weights every 5 epochs (later TF releases deprecate `period` in favor of `save_freq`).
# Start this round of the demo;
model = create_model()
model.save_weights(checkpoint_path.format(epoch=0))
model.fit(train_images, train_labels, epochs = 50, callbacks = [cp_callback],
          validation_data = (test_images,test_labels), verbose=0)
# Now, look at the resulting checkpoints and choose the latest one:
latest = tf.train.latest_checkpoint(checkpoint_dir)
print(latest)
# Output: training_2/cp-0050.ckpt
# Reload the weights from `latest`;
model = create_model()
model.load_weights(latest)
loss, acc = model.evaluate(test_images, test_labels)
print("Restored model, accuracy: {:5.2f}%".format(100*acc))
# 1000/1000 [==============================] - 0s 271us/sample - loss: 0.4796 - acc: 0.8780
# Restored model, accuracy: 87.80%
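
The file-name template above is plain `str.format`, so the naming scheme can be checked without TF. The sketch below rebuilds the prefixes the run above should produce (the manual save at epoch 0, then every 5th epoch up to 50, assuming `period=5` behaves as described) and picks the newest one by parsing the epoch out of the name. The helper `epoch_of` is made up for illustration; `tf.train.latest_checkpoint` does not parse names like this, it reads the `checkpoint` bookkeeping file in the directory.

```python
import re

# Same template as in the demo above; `{epoch:04d}` zero-pads the epoch to 4 digits.
checkpoint_template = "training_2/cp-{epoch:04d}.ckpt"

# Epochs expected to be saved: the manual save at 0, then every 5th epoch up to 50.
saved_epochs = [0] + list(range(5, 51, 5))
prefixes = [checkpoint_template.format(epoch=e) for e in saved_epochs]

def epoch_of(prefix):
    """Hypothetical helper: recover the epoch number from a checkpoint prefix."""
    match = re.search(r"cp-(\d+)\.ckpt$", prefix)
    return int(match.group(1)) if match else -1

# Assumption: "newest" here simply means "highest epoch in the name".
newest = max(prefixes, key=epoch_of)

print(prefixes[:3])  # ['training_2/cp-0000.ckpt', 'training_2/cp-0005.ckpt', 'training_2/cp-0010.ckpt']
print(newest)        # training_2/cp-0050.ckpt
```

Zero-padding the epoch also keeps an alphabetical directory listing in training order, which is handy when eyeballing the folder.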

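This completes the demo of saving and loading model weights. Everything above covers only the weights; the sections below cover saving and loading the entire model, along with the format of the saved files.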

About checkpoint files: the above code stores the weights to a collection of checkpoint-formatted files that contain only the trained weights in a binary format. Checkpoints contain: 1. one or more shards that contain your model's weights, 2. an index file that indicates which weights are stored in which shard. If you are only training a model on a single machine, you'll have one shard with the suffix `.data-00000-of-00001`.
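
To see that layout on disk, a small stdlib-only sketch can group the files in a checkpoint folder by prefix. The helper `describe_checkpoints` and the placeholder files are made up for illustration (the demo files are empty, not real checkpoints); the only things taken from the text above are the `.index` suffix and the `.data-XXXXX-of-XXXXX` shard suffix. The extra `checkpoint` file in the demo stands for the bookkeeping file TF typically writes next to the shards.

```python
import os
import re
import tempfile

# Shard files look like "<prefix>.data-00000-of-00001".
SHARD_RE = re.compile(r"^(?P<prefix>.+)\.data-\d{5}-of-\d{5}$")

def describe_checkpoints(directory):
    """Hypothetical helper: map each checkpoint prefix to its index flag and shard files."""
    groups = {}
    for name in sorted(os.listdir(directory)):
        if name.endswith(".index"):
            prefix = name[:-len(".index")]
            entry = groups.setdefault(prefix, {"has_index": False, "shards": []})
            entry["has_index"] = True
        else:
            match = SHARD_RE.match(name)
            if match:
                entry = groups.setdefault(match.group("prefix"),
                                          {"has_index": False, "shards": []})
                entry["shards"].append(name)
    return groups

# Demo with empty placeholder files named like the ones in `training_1/`.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("checkpoint", "cp.ckpt.index", "cp.ckpt.data-00000-of-00001"):
        open(os.path.join(demo_dir, name), "w").close()
    layout = describe_checkpoints(demo_dir)

print(layout)
# {'cp.ckpt': {'has_index': True, 'shards': ['cp.ckpt.data-00000-of-00001']}}
```

Pointing `describe_checkpoints` at the real `training_1` or `training_2` folder after running the demo should show one index file and one shard per saved prefix on a single machine.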

Saving weights manually:

# Manually saving the weights is just as simple, use the `Model.save_weights` method.
model.save_weights('./checkpoints/my_checkpoint')  # save the weights manually;
# Restore the weights
model = create_model()
model.load_weights('./checkpoints/my_checkpoint')
loss,acc = model.evaluate(test_images, test_labels)
print("Restored model, accuracy: {:5.2f}%".format(100*acc))
# 1000/1000 [==============================] - 0s 242us/sample - loss: 0.4796 - acc: 0.8780
# Restored model, accuracy: 87.80%

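Next, saving the entire model: the whole model can be saved to a single file that contains the weight values, the model configuration, and even the optimizer configuration. This lets you checkpoint a model and later resume training from exactly the same state, without access to the original code.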

# Save the entire model, 1/2: as an HDF5 file;
# Keras provides a basic save format using the HDF5 standard. For our purposes, the saved model can be treated as a single binary blob.
# Ref: https://en.wikipedia.org/wiki/Hierarchical_Data_Format;
model = create_model()
model.fit(train_images, train_labels, epochs=5)
# Save entire model to a HDF5 file;
model.save('my_model.h5')
# Recreate the exact same model from that file, including weights and optimizer;
new_model = keras.models.load_model('my_model.h5')
new_model.summary()
# Check its accuracy:
loss, acc = new_model.evaluate(test_images, test_labels)
print("Restored model, accuracy: {:5.2f}%".format(100*acc))
# 1000/1000 [==============================] - 0s 258us/sample - loss: 0.4280 - acc: 0.8670
# Restored model, accuracy: 86.70%
#
# The technique above saves everything:
# * The weight values;
# * The model's configuration (architecture);
# * The optimizer configuration;
# Keras saves models by inspecting the architecture, but it cannot save TensorFlow optimizers (from tf.train). When using one of those, recompile the model after loading it, because the optimizer's state is not saved;

# Save the entire model, 2/2: as a 'saved_model';
# Caution: This method of saving a `tf.keras` model is experimental and may change in future versions (`tf.contrib` was removed in TF 2.x, so this block needs a TF 1.x build).
# Build a fresh model:
model = create_model()
model.fit(train_images, train_labels, epochs=5)
# Create a `saved_model`:
import time
saved_model_path = "./saved_models/"+str(int(time.time()))
tf.contrib.saved_model.save_keras_model(model, saved_model_path)
# Have a look in the directory:
print(os.listdir(saved_model_path))
# Reload a fresh keras model from the saved model.
new_model = tf.contrib.saved_model.load_keras_model(saved_model_path)
new_model.summary()
# Run the restored model.
# The model has to be compiled before evaluating;
# This step is not required if the saved model is only being deployed.
new_model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss=tf.keras.losses.sparse_categorical_crossentropy,
              metrics=['accuracy'])
# Evaluate the restored model.
loss, acc = new_model.evaluate(test_images, test_labels)
print("Restored model, accuracy: {:5.2f}%".format(100*acc))
# 1000/1000 [==============================] - 0s 338us/sample - loss: 0.4217 - acc: 0.8580
# Restored model, accuracy: 85.80%

Postscript

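Saving a model doesn't take any great cleverness. The point is that "saving a model", as we talk about it, is only a blueprint; what makes it concrete is: 1. what format the generated files have (e.g. a binary blob), 2. how the API's parameters can be set when saving, and so on. Questions like these shape what "saving a model" actually means in practice: to truly understand something, you have to do it yourself.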

Reference

Save and restore models, Learn and use ML, [tensorflow.google.cn].