Going through old CACM issues I discovered a paper (PDF) on stream processing. A common problem in this field is to find frequent items in a data stream when you only get one pass through the data and you need answers in real time. This is interesting in situations where you don’t have enough memory to store counters for each distinct item you see in the stream. One example is keeping track of the most frequent destinations in a high traffic network router.
Among others, the authors present the classic Majority algorithm that finds the majority item in a stream. An item is the majority item if it occurs more than 50% of the time.
Here’s an implementation in Python with an example stream of integers:
stream = (1, 2, 3, 1, 3, 3, 1, 3, 2, 3, 3)

# Start with the first item as the current candidate.
majority = stream[0]
counter = 1

for item in stream[1:]:
    if item == majority:
        counter += 1
    elif counter == 0:
        # The old candidate has been cancelled out; adopt this item.
        majority = item
        counter = 1
    else:
        counter -= 1

print(majority)
One drawback of this and the other algorithms presented in the paper is that they always flag some item as the majority, but without a second pass you can’t actually tell whether it really is one. The algorithm only guarantees that if there is a majority item, it is the one reported. This sounds like a big drawback, but in some cases you know the frequency distribution of your items beforehand and can tell that there will be a majority.
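When you can afford a second pass, verifying the candidate is just an exact count. A minimal sketch (the function name is mine, not from the paper):

```python
def verify_majority(stream, candidate):
    # Second pass: count the candidate's occurrences exactly.
    count = sum(1 for item in stream if item == candidate)
    return count > len(stream) / 2

stream = (1, 2, 3, 1, 3, 3, 1, 3, 2, 3, 3)
print(verify_majority(stream, 3))  # prints True: 3 occurs 6 times out of 11
```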
The other algorithms presented in the paper are generalizations that find all items exceeding a 1/k fraction of the total number of items (there can be at most k − 1 such items). They are useful when you know that a few items occur a lot of the time while most items are quite rare (see Power Law Distribution). Page impressions on your website could be an example, or the frequency of certain words in a natural language.
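One such generalization, often called the Frequent or Misra–Gries algorithm, keeps at most k − 1 counters and decrements all of them whenever an unmatched item arrives. A sketch of the idea (the function name is mine; as with Majority, a second pass is needed to confirm which candidates really exceed the threshold):

```python
def frequent_candidates(stream, k):
    """One-pass candidates for items occurring more than len(stream)/k times.

    Keeps at most k - 1 counters (Misra-Gries style). Every item that truly
    exceeds the 1/k threshold survives as a candidate, but some candidates
    may be false positives, so their counts should be verified afterwards.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # All counter slots are taken: decrement everything,
            # dropping counters that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)

stream = (1, 2, 3, 1, 3, 3, 1, 3, 2, 3, 3)
print(frequent_candidates(stream, 2))  # prints {3}: with k=2 this is Majority
```

With k = 2 this degenerates to exactly the Majority algorithm above; larger k trades more memory (more counters) for a lower frequency threshold.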