Multi-GPU training of Large, Sparse-Matrix on Wide #NeuralNetwork

Author: Preetham V Vishwanatha, VP DataSciences ~ Deep Learning

Multi-GPU training of neural network on TensorFlow (v0.12 as of this blog) is a pain. It is a pain if you get off the beaten path that is. I found out a bit early that the SparseTensors in the contrib.learn package does not play well on GPUs. This is a tale of re-writing a separate utility on Keras/TensorFlow that plays well for a very large sparse matrix on multiple GPUs.

This is a Sparse-Matrix !!!

I expected the contrib.learn packages to scale on GPUs. But found out, that is not the case. You can check the issue reported here >> contrib.learn.Estimator multi-gpu issue

The specific piece of code that did not work for me is as follows:

Note that a large portion of input data, which is loaded onto SparseTensor, being not available on GPU defies the purpose of a multi-GPU setup. Hence, the issue is raised.

My Setup: I tested this on a (CUDA 8) — 3units of Titan-X Pascal on Ubuntu 16.04 with TensorFlow v0.12. (Nvidia Driver Version: 370.28)

The Solution: Large, Sparse-Matrix on Multi-GPU

After several attempts to help fix the SparseTensor in the TensorFlow library, in the interest of time, I resorted to write a sparse tensor utility myself for a large sparse matrix. When I say large, I have tested the utility on 1.5 billion samples in input data, with about 20+ categorical features where some of the features have about 100K values per feature. I have tested this on 3 Tesla-X GPUs, and I am thoroughly satisfied with the results.

The link to the code is available here >> Wide Multi-GPU Network with Sparse Matrix. Feel free to use this code as you deem fit as a copy-left.

There are mainly two parts to this code:

  1. A quick-hash and sparse function for categorical features
  2. A multi-GPU parallelization routine

Sparse Hash

The objective of the sparse hash is two-fold as shown in the code section:

  1. Create a hash look-up table for unique features values per feature.
  2. Create a hash to sparse matrix conversion routine.

Hashing Code:

I used a non-cryptographic hash to speed up on large payloads. xx Hash is an extremely fast hash function with throughputs of 5.4GB/s or above.

Control the bucket size:

I have used a reduction of SQRT of the maximum length of the total possible feature values in the sample space per feature. This is a good thumb rule that reduces hash collision as I have observed it over the years. I need four such buckets per hash in the 4-hash-tuple received from the four_bit_split_hash procedure.

The 4-bit-split (or 4 hot encoding) is a good compressor for large sparse datasets while keeping hash collision in mind. I could not see a single collision for the buckets on 4-bit-split.

Sparse Conversion Code:

This is a simple routine that converts the hash to sparse-matrix. I used the row-wise csr_matrix from scipy.sparse module to do the trick. You can easily replace this with column-wise if your downstream operations are sharded differently. I presume most Neural Network data is row-wise sharded on key-codes per sample space (Unless you are resorting to Model Parallelism).

Also, the hstack procedure is used to concatenate the splits in the hash, which are converted to its own sparse array. Just for simplicity, I have also added a one-hot function.

At runtime, you have to convert the sparse matrix to a dense matrix to feed it to the Keras model (unfortunately, Keras does not yet support sparse feature as I understand). You can do so using the following procedure.

Note that I use the toarray() function on the sparse matrix to convert this to dense.

Multi-GPU Parallelization

The multi-GPU parallelization can be done in three different ways:

  1. Tower parallelization: Here you load the data, cost functions and gradients and regularizers all on the same GPU. This allows you to run an exact replica of the entire model on multiple GPUs and then you can aggregate the results using a mixing routine or a simple mean.
  2. Data parallelization: You can strip the input data into ’N’ parts (N being the number of GPUs you want to run on) and merge the resulting output into a single stream. I have used Keras for this exact reason as merging is fairly simpler in this library. This is similar to tower model, except that the data on different GPUs shall be different sample slices of the original data space. Each GPU gets the same model parameters, but different data-slice. All other neural network functions will be a replica.
  3. Model parallelization: The idea here is to modularize the entire model-parameters into different section and spread it out into different GPU, but send the same input dataset to different model-parameters on different GPU. I have stayed away from this for now as it can get quite dirty and complex pretty soon and hard to debug on an unstable backend (I am aware that TensorFlow released a rc 1.0. I haven’t checked the stability on that yet as of this blog).

Tower Parallelization:

Line #24 is where I am loading the exact same replica of the model across multiple GPU.

Line #27 is a Keras Lambda layer that allows the mix of outputs as defined in the prediction_mean function (line #15). I am just taking the mean of the outputs for now. This can be converted to a gaussian mix as explained in the mixture of experts in the past post titled : Committee of Intelligent Machines.

Data Parallelization:

On Line #34, we split the batch to multiple slices to fit as many GPUs as suggested.

Line #60 merges all the outputs from different GPU into a single output post the final layer of the model.

CAUTION: I am not able to successfully run this code on a linear regressor consistently without getting a ‘nan’ on the regression loss function during model evaluation (NOTE: This is a problem during evaluation and NOT during training and is not consistent). I believe this is a Keras issue and have raised the concern here >> Loss turns into ‘nan’ when running on GPU

Here is a sample template to use the utility. It’s incomplete for brevity…

On Line #62 in the figure, creates a feature dictionary with all the features and feature_length (to be used by the hashing and sparse functions). The feature length is the maximum sample space per feature.

Line #71, obtain the model by passing the feature dictionary along with type of parallelization needed (Defaults to Tower).

Line #73, obtain the hash lookup for all uniques.

Line #75, get the sparse matrix and convert it to dense matrix with categorical encoding.

Line #77, train the model with the dense matrix.

Model Usage

Note that the same model architecture should be used if you are planning to save_weights and load_weights of the trained model. Also, needless to state, change in feature lengths shall change the model architecture by changing the input size drastically.

Hope this helps reduce your workloads and improve model throughputs. Specifically for my training, I could reduce 48hr model training workloads to nearly 10–12 hours. I shall take that any-day !

Now, set those GPUs on fire people ! figuratively…