The Four Kinds of tf.data Iterators
In day-to-day work the one-shot iterator is probably the most heavily used: datasets are usually large, and a single pass over them is all you need. Still, the other iterators are worth knowing, because each has situations where it is exactly the right tool.
The classic MNIST example
This post works through the classic MNIST example. For two kinds of source data, CSV files and TFRecord files, it uses tf.data.TextLineDataset() and tf.data.TFRecordDataset() respectively to build a Dataset, and then applies the four kinds of Iterator — one-shot, initializable, reinitializable, and feedable — to preprocess the source data.
- make_one_shot_iterator
- make_initializable_iterator
- Reinitializable iterator
- Feedable iterator
tf.data.TFRecordDataset() & make_one_shot_iterator()
tf.data.TFRecordDataset() takes the paths of .tfrecords files directly as input; precisely because of this, it gets around the problem of a dataset being too large to load at once for single-machine training. In this post the file path is /Users/***/Desktop/train_output.tfrecords, which is a path on my own machine; change it to whatever path fits your setup. make_one_shot_iterator() creates a one-shot iterator, the simplest form of iterator: it supports only a single pass over the dataset and requires no explicit initialization.
"One-shot" means the dataset is traversed only once. You can still repeat the data for multiple epochs, but even then the repeated data is consumed in that single traversal.
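To make that point concrete, here is a minimal pure-Python sketch — an analogy only, not the TensorFlow API — of how repeat() interacts with one-shot semantics: the data is duplicated across epochs, but the resulting stream can still only be walked once.

```python
# Pure-Python analogy: repeat() duplicates epochs, but the stream is one-shot.
def one_shot(data, num_epochs):
    # a generator can only be traversed once, like a one-shot iterator
    for _ in range(num_epochs):
        for item in data:
            yield item

it = one_shot([1, 2, 3], num_epochs=2)
result = list(it)          # a single pass covers both epochs
print(result)              # [1, 2, 3, 1, 2, 3]
print(next(it, "done"))    # "done" -- exhausted, cannot be re-initialized
```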
Combined with the MNIST dataset and tf.data.TFRecordDataset(), the implementation is as follows.
```python
# Validate tf.data.TFRecordDataset() using make_one_shot_iterator()
import tensorflow as tf
import numpy as np

num_epochs = 2
num_class = 10
sess = tf.Session()

# Use `tf.parse_single_example()` to extract data from a `tf.Example`
# protocol buffer, and perform any additional per-record preprocessing.
def parser(record):
    keys_to_features = {
        "image_raw": tf.FixedLenFeature((), tf.string, default_value=""),
        "pixels": tf.FixedLenFeature((), tf.int64,
                                     default_value=tf.zeros([], dtype=tf.int64)),
        "label": tf.FixedLenFeature((), tf.int64,
                                    default_value=tf.zeros([], dtype=tf.int64)),
    }
    parsed = tf.parse_single_example(record, keys_to_features)

    # Parse the string into an array of pixels corresponding to the image
    images = tf.decode_raw(parsed["image_raw"], tf.uint8)
    images = tf.reshape(images, [28, 28, 1])
    labels = tf.cast(parsed['label'], tf.int32)
    labels = tf.one_hot(labels, num_class)
    pixels = tf.cast(parsed['pixels'], tf.int32)
    print("IMAGES", images)
    print("LABELS", labels)
    return {"image_raw": images}, labels

filenames = ["/Users/***/Desktop/train_output.tfrecords"]  # replace the filenames with your own path
dataset = tf.data.TFRecordDataset(filenames)
print("DATASET", dataset)

# Use `Dataset.map()` to build a pair of a feature dictionary and a label
# tensor for each example.
dataset = dataset.map(parser)
print("DATASET_1", dataset)
dataset = dataset.shuffle(buffer_size=10000)
print("DATASET_2", dataset)
dataset = dataset.batch(32)
print("DATASET_3", dataset)
dataset = dataset.repeat(num_epochs)
print("DATASET_4", dataset)
iterator = dataset.make_one_shot_iterator()

# `features` is a dictionary in which each value is a batch of values for
# that feature; `labels` is a batch of labels.
features, labels = iterator.get_next()
print("FEATURES", features)
print("LABELS", labels)
print("SESS_RUN_LABELS \n", sess.run(labels))
```
tf.data.TFRecordDataset() & Initializable iterator
make_initializable_iterator() creates an initializable iterator. Before using it you must explicitly run the iterator.initializer op; only then can elements be fetched. Moreover, an initializable iterator can be used to switch between the training set and the validation set. Combined with the MNIST dataset, the implementation is as follows.
The code below shares a single dataset-processing pipeline: when several datasets are preprocessed the same way, one pipeline really can serve them all. An initializable iterator can be initialized from different data sources and then iterate over whichever source it was initialized with.
"Initializable" here means you can pass arguments through a placeholder: feed different arguments and you get different data, while the processing logic stays identical, which gives you a degree of customizability.
```python
# Validate tf.data.TFRecordDataset() using make_initializable_iterator()
# In order to switch between train and validation data
num_epochs = 2
num_class = 10

def parser(record):
    keys_to_features = {
        "image_raw": tf.FixedLenFeature((), tf.string, default_value=""),
        "pixels": tf.FixedLenFeature((), tf.int64,
                                     default_value=tf.zeros([], dtype=tf.int64)),
        "label": tf.FixedLenFeature((), tf.int64,
                                    default_value=tf.zeros([], dtype=tf.int64)),
    }
    parsed = tf.parse_single_example(record, keys_to_features)

    # Parse the string into an array of pixels corresponding to the image
    images = tf.decode_raw(parsed["image_raw"], tf.uint8)
    images = tf.reshape(images, [28, 28, 1])
    labels = tf.cast(parsed['label'], tf.int32)
    labels = tf.one_hot(labels, 10)
    pixels = tf.cast(parsed['pixels'], tf.int32)
    print("IMAGES", images)
    print("LABELS", labels)
    return {"image_raw": images}, labels

filenames = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(parser)  # Parse the record into tensors
# print("DATASET", dataset)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(32)
dataset = dataset.repeat(num_epochs)
print("DATASET", dataset)
iterator = dataset.make_initializable_iterator()
features, labels = iterator.get_next()
print("ITERATOR", iterator)
print("FEATURES", features)
print("LABELS", labels)

# Initialize `iterator` with training data.
training_filenames = ["/Users/honglan/Desktop/train_output.tfrecords"]  # replace the filenames with your own path
sess.run(iterator.initializer, feed_dict={filenames: training_filenames})
print("TRAIN\n", sess.run(labels))
# print(sess.run(features))

# Initialize `iterator` with validation data.
validation_filenames = ["/Users/honglan/Desktop/val_output.tfrecords"]  # replace the filenames with your own path
sess.run(iterator.initializer, feed_dict={filenames: validation_filenames})
print("VAL\n", sess.run(labels))
```
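As a plain-Python analogy (the class and names here are invented for illustration, not TensorFlow API), an initializable iterator is one fixed pipeline whose data source is supplied at initialization time, so the same preprocessing serves both the training and the validation files:

```python
# Pure-Python analogy of an initializable iterator: the preprocessing is fixed,
# while the data source is supplied each time initialize() runs.
class InitializableIterator:
    def __init__(self, pipeline):
        self.pipeline = pipeline   # fixed preprocessing, shared by all sources
        self._it = None

    def initialize(self, source):  # like sess.run(iterator.initializer, feed_dict=...)
        self._it = iter(self.pipeline(source))

    def get_next(self):            # like sess.run on iterator.get_next()
        return next(self._it)

pipeline = lambda source: (x * 2 for x in source)  # stand-in for map/shuffle/batch
it = InitializableIterator(pipeline)

it.initialize([1, 2, 3])      # "training" source
first_train = it.get_next()   # 2

it.initialize([10, 20])       # "validation" source: same pipeline, new data
first_val = it.get_next()     # 20
print(first_train, first_val)
```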
tf.data.TextLineDataset() & Reinitializable iterator
The reinitializable iterator: how does it differ from the initializable iterator above?
A reinitializable iterator can be initialized from multiple different Dataset objects. For example, you might have a training input pipeline that applies random perturbations to the input images to improve generalization, and a validation input pipeline that evaluates predictions on unmodified data. These pipelines typically use different Dataset objects that share the same structure (i.e., the same types and compatible shapes for each component).
```python
# Define training and validation datasets with the same structure.
training_dataset = tf.data.Dataset.range(100).map(
    lambda x: x + tf.random_uniform([], -10, 10, tf.int64))
validation_dataset = tf.data.Dataset.range(50)

# A reinitializable iterator is defined by its structure. We could use the
# `output_types` and `output_shapes` properties of either `training_dataset`
# or `validation_dataset` here, because they are compatible.
iterator = tf.data.Iterator.from_structure(training_dataset.output_types,
                                           training_dataset.output_shapes)
next_element = iterator.get_next()

training_init_op = iterator.make_initializer(training_dataset)
validation_init_op = iterator.make_initializer(validation_dataset)

# Run 20 epochs in which the training dataset is traversed, followed by the
# validation dataset.
for _ in range(20):
  # Initialize an iterator over the training dataset.
  sess.run(training_init_op)
  for _ in range(100):
    sess.run(next_element)

  # Initialize an iterator over the validation dataset.
  sess.run(validation_init_op)
  for _ in range(50):
    sess.run(next_element)
```
As the loop shows, across the 20 epochs the iterator is re-initialized over and over: run one epoch over the training data, then traverse the entire validation set to check how the model is doing.
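A pure-Python sketch of that epoch loop (an analogy, not the TensorFlow API) makes the pattern explicit: there is one shared "next element" position, and each epoch it is re-pointed at the training data and then at the validation data, which must have the same structure.

```python
# Pure-Python analogy of a reinitializable iterator: one shared position,
# re-pointed at different same-structured datasets each epoch.
training_data = [0, 1, 2, 3, 4]
validation_data = [100, 101, 102]

state = {"current": None}

def make_initializer(data):
    # returns an "init op": running it re-points the shared position at `data`
    def init():
        state["current"] = iter(data)
    return init

training_init_op = make_initializer(training_data)
validation_init_op = make_initializer(validation_data)

history = []
for epoch in range(2):
    training_init_op()                                      # like sess.run(training_init_op)
    train_seen = [next(state["current"]) for _ in range(5)]
    validation_init_op()                                    # validation restarts every epoch
    val_seen = [next(state["current"]) for _ in range(3)]
    history.append((train_seen, val_seen))
print(history)
```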
tf.data.TextLineDataset() & Feedable iterator
The feedable iterator is arguably the most complex of the four. Unlike the reinitializable iterator introduced above, it needs no initialization step when switching between datasets, which the example below demonstrates well. We define an infinitely repeating training set; since it loops forever, iterating over it once is enough, so the while loop below simply keeps calling run to fetch the next element.
In my view, this "no initialization needed" situation typically arises with large streams of data (whether raw, or already repeated across epochs): during training you just walk through the stream in order, with no need to re-initialize anything. As the code shows, a feedable iterator actually builds on the iterators described earlier — the training set uses a one-shot iterator, while the validation set uses an initializable iterator — so "switching datasets without re-initialization" is only a relative notion.
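The "switch without re-initializing" behavior can be sketched in plain Python (an analogy only; the dict of handles stands in for string_handle()): each underlying iterator keeps its own position, and the handle merely selects which one the next element is drawn from.

```python
# Pure-Python analogy of a feedable iterator: a handle chooses among
# independently created iterators, each of which keeps its own position.
def training_stream():
    step = 0
    while True:          # "infinite" training set, like dataset.repeat()
        yield step
        step += 1

iterators = {
    "train": training_stream(),
    "val": iter([100, 101, 102]),
}

def get_next(handle):    # like sess.run(next_element, feed_dict={handle: ...})
    return next(iterators[handle])

a = get_next("train")    # 0
b = get_next("val")      # 100
c = get_next("train")    # 1 -- the training position survived the switch
print(a, b, c)
```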
```python
# Define training and validation datasets with the same structure.
training_dataset = tf.data.Dataset.range(100).map(
    lambda x: x + tf.random_uniform([], -10, 10, tf.int64)).repeat()
validation_dataset = tf.data.Dataset.range(50)

# A feedable iterator is defined by a handle placeholder and its structure. We
# could use the `output_types` and `output_shapes` properties of either
# `training_dataset` or `validation_dataset` here, because they have
# identical structure.
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(
    handle, training_dataset.output_types, training_dataset.output_shapes)
next_element = iterator.get_next()

# You can use feedable iterators with a variety of different kinds of iterator
# (such as one-shot and initializable iterators).
training_iterator = training_dataset.make_one_shot_iterator()
validation_iterator = validation_dataset.make_initializable_iterator()

# The `Iterator.string_handle()` method returns a tensor that can be evaluated
# and used to feed the `handle` placeholder.
training_handle = sess.run(training_iterator.string_handle())
validation_handle = sess.run(validation_iterator.string_handle())

# Loop forever, alternating between training and validation.
while True:
  # Run 200 steps using the training dataset. Note that the training dataset is
  # infinite, and we resume from where we left off in the previous `while` loop
  # iteration.
  for _ in range(200):
    sess.run(next_element, feed_dict={handle: training_handle})

  # Run one pass over the validation dataset.
  sess.run(validation_iterator.initializer)
  for _ in range(50):
    sess.run(next_element, feed_dict={handle: validation_handle})
```