Training TensorFlow models with big tabular datasets (ii)

April 25, 2022

In my last post I talked about how I used TensorFlow datasets to speed up the training phase. Today I’ve discovered another game changer: the prefetch method.

With this method, your dataset prefetches (that is, prepares ahead of time) some batches while the current element is being processed. This improves latency and throughput at the cost of using more memory. Also, according to the TensorFlow documentation:

Most dataset input pipelines should end with a call to prefetch
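To make the idea concrete, here is a minimal pure-Python sketch of prefetching: a background thread fills a bounded buffer while the consumer works on the current item. This only illustrates the concept and is not how TensorFlow implements it; the names prefetch and slow_batches and the sleep delays are made up for the example, and error handling is left out.

```python
import queue
import threading
import time

_DONE = object()  # sentinel marking the end of the source iterable


def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable` while a background thread prepares the next ones."""
    buffer = queue.Queue(maxsize=buffer_size)  # bounded, so memory use stays limited

    def producer():
        for item in iterable:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buffer.get()  # blocks until the producer has something ready
        if item is _DONE:
            return
        yield item


def slow_batches(n, delay=0.01):
    """Hypothetical data source: each batch takes `delay` seconds to prepare."""
    for i in range(n):
        time.sleep(delay)
        yield i


results = []
for batch in prefetch(slow_batches(5), buffer_size=2):
    time.sleep(0.01)  # simulated training step, overlapping with batch preparation
    results.append(batch)
```

The buffer is where the extra memory goes: a larger buffer_size lets the producer run further ahead, but more prepared batches sit in memory at once.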

To call the prefetch method you need to specify buffer_size, which is the maximum number of elements that will be buffered when prefetching. If you want to set this value dynamically to the “optimal” one ¹, you can just use tf.data.experimental.AUTOTUNE.

Finally, just by adding the line

dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

I’ve cut the training time of my model in half.
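For context, this is roughly where that line sits in a full input pipeline. It is a sketch on synthetic data rather than my actual dataset: the array shapes, the batch size, and the shuffle step are assumptions chosen for illustration. Recent TensorFlow releases (2.4 and later) also expose the same constant as tf.data.AUTOTUNE.

```python
import numpy as np
import tensorflow as tf

# Synthetic tabular data standing in for a real dataset (hypothetical shapes).
num_rows, num_features, batch_size = 1_000, 8, 32
features = np.random.rand(num_rows, num_features).astype("float32")
labels = np.random.randint(0, 2, size=num_rows).astype("float32")

dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(buffer_size=num_rows)
    .batch(batch_size)
    # Last step of the pipeline: prepare upcoming batches while the
    # current one is being consumed.
    .prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
)

# The dataset is consumed exactly as before, e.g. model.fit(dataset, epochs=...).
first_features, first_labels = next(iter(dataset))
num_batches = sum(1 for _ in dataset)
```

Since prefetch comes after batch, the buffered elements are whole batches, not individual rows.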

¹ I didn’t find any TensorFlow documentation explaining how the optimal value is computed. I’ll research this and write a post about it in the future.