## Map/Reduce in Python

My interest in grid computing over the last few weeks is beginning to show. After reading Google's MapReduce paper, I tried my hand at a client-side toy problem.

For formatting purposes, I was interested in the length of the longest string in a sequence. There are lots of ways to do this, but I wanted to try Python’s `map` and `reduce` functions.

This is my sample input sequence:

```
values = ('abc', 'xy', 'abcdef', 'xyz')
```

The first step is to transform the `values` sequence into a new list containing the length of each string:

```
mapped = map(lambda x: len(x), values)
```

After that, the `mapped` list is reduced to a single value:

```
max_len = reduce(lambda x, y: max(x, y), mapped)
```

The map step is trivially parallel, since each element is transformed independently of the others; in theory I could hand each map operation off to a different node in a grid. And because `max` is associative and commutative, the reduce step could be executed in multiple stages on a grid, too: each node reduces its own partial results, and those partial maxima are then reduced to the final value.
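Simulated sequentially, the staged reduce might look like the following sketch (the two-chunk split stands in for two grid nodes; `reduce` is imported from `functools`, where it lives in Python 3):

```python
from functools import reduce  # a built-in in Python 2, in functools since Python 3

values = ('abc', 'xy', 'abcdef', 'xyz')

# Split the input into chunks, as if handing one chunk to each grid node.
chunks = [values[:2], values[2:]]

# "Map" phase: each node independently maps its chunk to string lengths.
mapped = [list(map(len, chunk)) for chunk in chunks]

# First "reduce" stage: each node reduces its own chunk to a local maximum.
partials = [reduce(max, lengths) for lengths in mapped]

# Second "reduce" stage: combine the per-node maxima into the final result.
max_len = reduce(max, partials)
print(max_len)  # 6
```

Because `max` is associative and commutative, the result is the same no matter how the input is chunked or in which order the partial maxima are combined.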

BTW: While map/reduce was a nice experiment, in the end I used a different implementation, based on a list comprehension and the powerful `max()` function:

```
max([len(x) for x in values])
```

If I had been interested in the longest string itself, I would have used the following (Python 2.5 and later only, since that's where `max()` gained its `key` argument):

```
max(values, key=lambda x: len(x))
```
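For what it's worth, the whole pipeline still works on Python 3 with two small changes: `map()` returns a lazy iterator there, and `reduce()` moved to `functools`. A quick sanity check that the variants agree:

```python
from functools import reduce  # no longer a built-in in Python 3

values = ('abc', 'xy', 'abcdef', 'xyz')

# map/reduce version -- map() returns a lazy iterator in Python 3,
# which reduce() consumes just fine
max_len = reduce(max, map(len, values))

# generator-expression version, no intermediate list needed
assert max_len == max(len(x) for x in values)

# the longest string itself, via the key argument
longest = max(values, key=len)
print(max_len, longest)  # 6 abcdef
```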

### 4 Responses to Map/Reduce in Python

1. It can also be done in one simple line:

reduce(max, map(len, values))

2. mafr says:

Nice, thank you!

3. V says:

This is an old post, but just passing by:

You don’t have to use square brackets in your max line, this works too:

>>> max(len(x) for x in values)

and you don’t have to use lambda in the next:

>>> max(values, key=len)

4. I enjoy all the tips helping me to think “pythonically” (I’ve been coding in Perl for the last 13 years, got into Python only in the last six months, and am enjoying it), but isn’t the Hadoop (or original Google cluster) map/reduce concept more generic as well?

For example, I recently wrote a Hadoop-capable Apache weblog parser (in Python). I parse literally thousands of logs a day; the process takes some 20 hours on 5 machines. The idea here is to be able to hand a single log to each machine in the Hadoop map array and the resulting output (using Python’s shelve) to a smaller set of reduction machines.

The “map step” itself consists of parsing a log into a rather complex dictionary (consisting of other dictionaries and lists), which is shelved. The “reduction step” takes the shelved dictionary of dictionaries as input and generates statements suitable for handing off to a DBMS bulk load.

Is this a legitimate way to think of map/reduce?
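For illustration, the map and reduce steps described in that comment might be sketched in-process like this (the log format and field positions are hypothetical, and real Hadoop streaming would handle the distribution, shelving, and I/O):

```python
# Toy sketch of a per-log "map" and a merging "reduce"; everything here
# runs in a single process purely to show the shape of the pipeline.

def map_log(lines):
    """'Map' step: parse one log's lines into a dictionary of hit counts."""
    stats = {}
    for line in lines:
        path = line.split()[1]  # assumes the request path is the second field
        stats[path] = stats.get(path, 0) + 1
    return stats

def reduce_stats(partials):
    """'Reduce' step: merge per-log dictionaries into rows for a bulk load."""
    merged = {}
    for stats in partials:
        for path, count in stats.items():
            merged[path] = merged.get(path, 0) + count
    return sorted(merged.items())

logs = [["GET /index HTTP/1.1", "GET /about HTTP/1.1"],
        ["GET /index HTTP/1.1"]]
rows = reduce_stats([map_log(log) for log in logs])
print(rows)  # [('/about', 1), ('/index', 2)]
```

Since the merge is just addition per key, the reduce here is also associative and commutative, so it could likewise run in stages across several reduction machines.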