Comparing TensorFlow with MapReduce

During last month’s Innovation Day, a group of colleagues took the opportunity to play around with TensorFlow.

Google released TensorFlow, its own distributed computing engine, as an open-source software library for machine intelligence. We started by implementing linear regression in TensorFlow and tried to understand the principles behind it. I was even more interested to find out that Google uses TensorFlow for its own deep learning neural networks.

Insights after linear regression
TensorFlow combines open-ended machine learning research with system engineering and Google-scale computing resources. We decided to take a closer look at its computational model and compare it with familiar technologies like MapReduce.

We found that both TensorFlow and MapReduce achieve parallelization by decomposing a computation into parallelizable basic blocks. MapReduce uses the notions of a pure function (the map step) and a commutative monoid (an associative, commutative binary operation with an identity element, used in the reduce step) as building blocks. TensorFlow uses the notion of a computational graph, where the nodes are operations on tensors (addition, multiplication, etc.) and the edges carry the tensors (multidimensional arrays) that flow between them.
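To make the MapReduce building blocks concrete, here is a minimal sketch in plain Python. It does not use Hadoop or Spark; the helper `map_reduce`, the mapper `square`, and the chunking into three "workers" are illustrative assumptions. The pure function is the mapper, and the commutative monoid is integer addition with identity 0.

```python
from functools import reduce
from operator import add


def square(x):
    """Pure function: the output depends only on the input."""
    return x * x


def map_reduce(chunk, mapper, combine, identity):
    """Map every element, then fold the results with the monoid operation."""
    return reduce(combine, map(mapper, chunk), identity)


data = list(range(1, 11))

# Reference result computed in one pass.
sequential = map_reduce(data, square, add, 0)

# Simulated workers: each chunk is processed independently of the others.
chunks = [data[0:3], data[3:7], data[7:10]]
partials = [map_reduce(chunk, square, add, 0) for chunk in chunks]

# Because addition is associative and commutative, the partial results
# can be combined in any order (here: reversed) without changing the answer.
combined = reduce(add, reversed(partials), 0)
```

Since each chunk only needs the mapper and the monoid operation, the chunks could be handed to separate cores or machines, which is the property MapReduce relies on.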

To run a computational procedure in parallel using one of these models, you need to express the procedure in terms of the building blocks of the selected model. In both cases, working directly with the basic building blocks is too low-level to be practical, so both technologies provide higher-level primitives.
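As an illustration of what expressing a procedure as a computational graph can look like, the following plain-Python sketch builds the linear model y = w * x + b as a small graph and evaluates it. This is not the TensorFlow API: the `Node` class and the helpers `placeholder`, `constant`, `add`, `mul`, and `run` are hypothetical names chosen for this example.

```python
class Node:
    """A graph node: an operation plus the nodes that feed it."""

    def __init__(self, op, inputs=(), name=None):
        self.op = op
        self.inputs = tuple(inputs)
        self.name = name


def placeholder(name):
    """A node whose value is supplied when the graph is run."""
    return Node(op=None, name=name)


def constant(value):
    """A node that always produces the same value."""
    return Node(op=lambda: value)


def add(a, b):
    return Node(lambda x, y: x + y, (a, b))


def mul(a, b):
    return Node(lambda x, y: x * y, (a, b))


def run(node, feed):
    """Evaluate a node by first evaluating the nodes it depends on."""
    if node.op is None:
        return feed[node.name]
    args = [run(inp, feed) for inp in node.inputs]
    return node.op(*args)


# Build the graph once; nothing is computed at this point.
x = placeholder("x")
w = constant(2.0)
b = constant(1.0)
y = add(mul(w, x), b)

# Run the graph with a concrete input value.
result = run(y, {"x": 3.0})
```

Separating graph construction from execution is what lets a runtime decide where and in which order independent nodes are evaluated.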

More TensorFlow, MapReduce and Spark observations
In the case of MapReduce, there are many open-source frameworks that provide higher-level APIs on top of the MapReduce model, and Spark is one of the most popular. TensorFlow also provides higher-level APIs (for example, a computational graph node that performs gradient-based optimization, which can be used to train your linear regression models), but the TensorFlow community is less mature than the MapReduce community.
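To show what such an optimization node does conceptually, here is a minimal sketch of gradient descent for linear regression in plain Python. It is not TensorFlow code; the function name `train_linear_regression`, the learning rate, the number of steps, and the toy data are assumptions made for this example.

```python
def train_linear_regression(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every training point.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1, so the fit should recover w = 2, b = 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = train_linear_regression(xs, ys)
```

Each step is an iteration over the full data set, which is why the cost of repeatedly reading the data matters so much for this kind of training.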

[Image: our brainstorm notes]

MapReduce can run in distributed mode on a cluster using resource managers like YARN or Mesos. Spark allows you to run a MapReduce-style computational model on a single machine (even on a laptop), using the machine's cores for parallelization. TensorFlow runs on multiple machines without an additional external resource manager. TensorFlow can also be executed on a single machine, using CPUs or even GPUs for parallelization.

One of the known weaknesses of MapReduce for machine learning use cases is the IO overhead when dealing with iterative processes (most AI training procedures are iterative). Spark resolves this problem by using an in-memory computational model, which minimizes IO operations. TensorFlow does not seem to have this problem, and machine learning is one of its main use cases.

MapReduce supports a broad range of programming languages, whereas TensorFlow currently supports only C++ and Python. TensorFlow supports most standard machine learning techniques. Spark seems to offer more versatile support for machine learning techniques than TensorFlow, thanks to a more mature community.

More information on TensorFlow specifically is available on Google Cloud Platform, or contact Borys to share your knowledge. At Anchormen we will keep following how TensorFlow and related technologies develop.