Suppose that the streaming majority-winner algorithm is given as input a random ordering of the input sequence A,A,A,B,B,C. Each different ordering of this input is equally likely. What is the probability that the algorithm returns A as its result? (You may write a computer program to solve this, although it should also be possible to solve it by hand calculation.)
Solution: There are $6!/(3!\,2!\,1!)=60$ distinct orderings of the input, few enough to go through by hand, especially with some tricks to find symmetric families of inputs and reduce the number of cases. But if I do a calculation like this by hand I'm likely to make mistakes. So instead I wrote a Python program:
    def stringsOfLength(L, alphabet):
        # All strings of length L over the given alphabet.
        if L == 0:
            return ['']
        return [s + a for s in stringsOfLength(L - 1, alphabet) for a in alphabet]

    # The distinct orderings of A,A,A,B,B,C.
    Seqs = [X for X in stringsOfLength(6, 'ABC')
            if X.count('A') == 3 and X.count('B') == 2 and X.count('C') == 1]

    def streamingMajority(Sequence):
        maj, count = None, 0
        for element in Sequence:
            if count == 0:
                # No current candidate: adopt this element.
                maj, count = element, 1
            else:
                # +1 if the element matches the candidate, -1 otherwise.
                count += 2 * (maj == element) - 1
        return maj

    print("%d/%d" % (len([S for S in Seqs if streamingMajority(S) == 'A']), len(Seqs)))
It says that the answer is $42/60$, which simplifies to $7/10$.
Recall that a priority queue maintains a set of elements with associated priority values, subject to operations that insert an element, delete an element, or find the element with the minimum priority. You may also assume that a priority queue has two more operations that return its length and that return a list of its elements. Give pseudocode that uses these operations to implement a streaming algorithm for maintaining the MinHash sample of a data stream (the $k$ elements of the stream with minimum hash value for some predetermined hash function $h(x)$). Your pseudocode should define two functions, process(x) which handles a new element of the stream, and sample() which returns the current MinHash sample. Your answer should describe only these two functions; do not describe how to implement a priority queue.
Solution:
    Q = a priority queue, initially empty

    def process(x):
        insert x into Q with priority -h(x)
        if len(Q) > k:
            delete the item with minimum priority in Q

    def sample():
        list the elements in Q
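The pseudocode above can be sketched in runnable Python using the standard `heapq` module as the priority queue. The class name `MinHashSampler` and the hash function `h` below are illustrative assumptions, not part of the original solution, and the sketch assumes (as the pseudocode does) that the stream elements are distinct. Negating the hash value makes the element with the largest hash the minimum-priority item, so it is the one evicted when the queue grows past $k$.

```python
import heapq

def h(x):
    # Hypothetical hash for integers (an assumption for illustration only).
    return (x * 2654435761) % 2**32

class MinHashSampler:
    """Keeps the k stream elements with the smallest hash values."""

    def __init__(self, k, h):
        self.k = k
        self.h = h
        self.Q = []  # min-heap of (priority, element) with priority = -h(x)

    def process(self, x):
        # Insert x with priority -h(x).
        heapq.heappush(self.Q, (-self.h(x), x))
        if len(self.Q) > self.k:
            # The minimum priority is the largest hash value: evict it.
            heapq.heappop(self.Q)

    def sample(self):
        # List the elements currently in the queue (in heap order).
        return [x for _, x in self.Q]

sampler = MinHashSampler(k=3, h=h)
for x in range(20):
    sampler.process(x)
print(sampler.sample())
```

Each `process` call costs $O(\log k)$ time here, since the heap never holds more than $k+1$ entries.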
Note: although this is the intended solution to this problem, it is not the most efficient way to maintain the MinHash sample. An alternative solution avoids priority queues and instead keeps a set $S$ of size $|S|\lt 2k$ that includes the $k$ elements of the MinHash sample and possibly some other elements. In this alternative solution, the process operation adds $x$ to $S$ and then, if $S$ has grown to size $2k$, uses a linear-time median-finding algorithm to reduce it to the $k$ elements with the smallest hash values. The sample operation again uses a median-finding algorithm, this time to select the $k$ elements of $S$ with the smallest hash values. This alternative solution takes constant amortized time per process operation and $O(k)$ time per sample operation, both optimal.
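A rough sketch of this buffered alternative follows. The names `BufferedMinHash` and `h` are hypothetical, and the sketch uses `sorted` as a stand-in for the selection step, so it runs in $O(k\log k)$ per shrink rather than the $O(k)$ that a true linear-time median-finding algorithm would give. It illustrates the structure of the idea only, not the stated time bounds.

```python
def h(x):
    # Hypothetical hash for integers (an assumption for illustration only).
    return (x * 2654435761) % 2**32

class BufferedMinHash:
    """Keeps fewer than 2k elements, always including the k smallest by hash."""

    def __init__(self, k, h):
        self.k = k
        self.h = h
        self.S = []

    def _k_smallest(self):
        # Stand-in for linear-time selection: sort by hash and keep k.
        return sorted(self.S, key=self.h)[:self.k]

    def process(self, x):
        self.S.append(x)
        if len(self.S) >= 2 * self.k:
            # Buffer is full: discard all but the k smallest hash values.
            self.S = self._k_smallest()

    def sample(self):
        return self._k_smallest()

buf = BufferedMinHash(k=3, h=h)
for x in range(20):
    buf.process(x)
print(buf.sample())
```

The shrink step runs only once per $k$ insertions, which is what makes the amortized cost per `process` call constant when a linear-time selection is substituted for the sort.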
In a stream of distinct elements, consider the following streaming approximate-median algorithm: we maintain a random sample of three elements, and then estimate the median of the stream by returning the middle element of our sample. What is the probability that this estimate is within the middle third of the stream elements (that is, that it is greater than at least 1/3 of the stream and less than at least 1/3 of the stream)? (You may state the answer as a number rather than a formula, ignoring terms in the answer that go to zero as the stream length becomes large.)
Solution: Each of the three samples has a 1/3 probability of being in each of the first third, middle third, or last third, so there are $27$ different outcomes with probability $1/27$ each. The median of the three sampled elements is below the middle third if all three samples are in the first third (probability $1/27$) or if two of the three are in the first third and the remaining one is not (probability $6/27$, as there are three choices of which two are in the first third and two choices of where the remaining one goes). So the total probability of the median of the samples being below the middle third is $7/27$. By symmetry, the same calculation gives $7/27$ as the total probability of the median of the three samples being above the middle third. The remaining probability, that it is in the middle third, is $(27-7-7)/27=13/27$.
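As a quick check of this count, the 27 equally likely outcomes can be enumerated directly. This is a small sketch under the same simplifying assumption as the solution (each sample independently lands in one of the three thirds); the labels 0, 1, 2 for the first, middle, and last third are chosen here for illustration.

```python
from fractions import Fraction
from itertools import product

# Each sample falls in third 0 (first), 1 (middle), or 2 (last).
outcomes = list(product(range(3), repeat=3))

# The sample median lies in the middle third exactly when the
# middle value of the three labels is 1.
middle = sum(1 for t in outcomes if sorted(t)[1] == 1)

p_middle = Fraction(middle, len(outcomes))
print(p_middle)  # 13/27
```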
Describe a data structure that can handle a stream of insertion and deletion operations on a collection of numbers from $0$ to $N-1$ (allowing the same number to appear multiple times in the collection), and that after each operation reports a number $x$ such that, if there is a majority element, $x$ is that element. That is, if $k$ numbers have been inserted but not deleted, and one number appears more than $k/2$ times among them, then $x$ should be that number. For full credit, your structure should use at most $O(\log N)$ words of memory. You may assume that the stream does not include any deletion operations for numbers that have not been inserted. (Hint: Solve it for $N=2$ first, and then consider the binary representation of the numbers.)
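Solution: For $N=2$, the numbers in the sequence are $0$ and $1$, and we can just vote on which one is more popular: maintain a counter $c$, add one to $c$ for each insertion of $1$ or deletion of $0$, and subtract one from $c$ for each deletion of $1$ or insertion of $0$. If $c$ is positive, the majority element is $1$; if $c$ is negative, the majority element is $0$; and if $c$ is zero, there is no majority element. Now, to solve the original problem, maintain a separate vote (with a separate counter) for each of the $\lceil\log_2 N\rceil$ bits in the binary representation of the given numbers, and report the number $x$ whose bit in each position is the winner of that position's vote (breaking ties arbitrarily). If some number is a majority element, then in every bit position its bit value is held by more than half of the collection, so it wins every vote and $x$ equals that number. The structure stores one counter per bit, for $O(\log N)$ words of memory in total. (This folklore data structure was part of the inspiration for the count-min sketch.)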
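The per-bit voting scheme can be sketched as follows. The class name `BitwiseMajority` and its method names are assumptions made for this illustration. When a vote is tied, the sketch reports a 0 bit, which is one arbitrary choice among several, since any answer is acceptable when there is no majority element.

```python
class BitwiseMajority:
    """One signed counter per bit of the numbers 0..N-1."""

    def __init__(self, N):
        # Number of bits needed to write N-1 in binary (at least one).
        self.bits = max(1, (N - 1).bit_length())
        self.counters = [0] * self.bits

    def _update(self, x, sign):
        # sign is +1 for an insertion, -1 for a deletion.
        for i in range(self.bits):
            if (x >> i) & 1:
                self.counters[i] += sign  # a vote for bit value 1
            else:
                self.counters[i] -= sign  # a vote for bit value 0

    def insert(self, x):
        self._update(x, +1)

    def delete(self, x):
        self._update(x, -1)

    def report(self):
        # Assemble x from the winning bit in each position (ties give 0).
        return sum(1 << i for i, c in enumerate(self.counters) if c > 0)

m = BitwiseMajority(8)
for x in (5, 5, 3):
    m.insert(x)
print(m.report())  # 5
```

After `m.delete(5)` the collection is $\{5, 3\}$ with no majority, so the reported value carries no guarantee; inserting another 3 makes 3 the majority and `report()` returns 3.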