0xDE ([info]11011110) wrote,
@ 2007-10-09 16:57:00
Previous Entry  Add to memories!  Tell a Friend!  Next Entry
Entry tags:algorithms

Blum-style analysis of Quickselect

So you're all familiar with Avrim Blum's Motwani and Raghavan's (?) slick analysis of quicksort in CLRS, avoiding probabilistic recurrences and easily getting the correct constant in the leading term, right? (If not, I review it below.) What CLRS doesn't tell you is that the same analysis works almost as well for quickselect.

Quicksort

Randomized quicksort is a recursive algorithm that, in each recursive call chooses a "pivot" element uniformly at random from the values given as input to the call, partitions the data into subsets that are less than, equal to, and greater than the pivot, and recursively sorts the subsets that are less and greater. With some care the partition can be done with one comparison of each data item to the pivot:

def quicksort(L):
    if len(L) < 2: return L
    pivot_pos = random integer in range 0 .. len(L)-1
    x = L[pivot_pos]
    parts = [[],[x],[]]
    for y in L[:pivot_pos] + L[pivot_pos+1:]:
        parts[cmp(y,x)+1].append(y)
    return quicksort(parts[0]) + parts[1] + quicksort(parts[2])

An equivalent view of the algorithm (though not how one would usually program it) is that, before the recursion starts, we pick a random permutation σ; then, in each recursive call, we pick the pivot to be the value among the inputs given to that call that's earliest in σ. It's not hard to see that this gives the same uniform probability that any pivot is chosen. What we want to count is the number of calls to cmp over the course of the algorithm.

The expected number of comparisons is (by linearity of expectation) the same as the sum, over pairs of data values, of the probability that the algorithm compares that pair. If xi denotes the value in position i of the sorted output array, and i < j, then xi and xj are compared exactly when one of these two values is the earliest in σ in the range of values xi, xi+1, ..., xj-1, xj. For, until a pivot within this range is chosen, xi and xj will stay in the same subproblem as each other, but after such a pivot is chosen, they will be separated. Each item in this range is equally likely to be the first one chosen as pivot, so the probability that the first in this range is one of the two endpoints is exactly 2/(j-i+1). Thus, using this expression as the probability that a pair is compared, the expected number of comparisons is (using the known logarithmic evaluation of harmonic numbers)

\sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1}\frac2{j-i+1}\,<\,2nH_n=2n\ln{n}+O(n).

Quickselect

Quickselect is a variant of quicksort that finds the kth smallest of a set of items by using the same partition strategy as quicksort but then only recursing within one of the subsets of the partition, the one containing the desired value:

def quickselect(L,k):
    if len(L) < 2: return L[k]
    pivot_pos = random integer in range 0 .. len(L)-1
    x = L[pivot_pos]
    parts = [[],[x],[]]
    for y in L[:pivot_pos] + L[pivot_pos+1:]:
        parts[cmp(y,x)+1].append(y)
    if k < len(parts[0]):
        return quickselect(parts[0],k)
    if k >= len(parts[0]) + len(parts[1]):
        return quickselect(parts[2],k - len(parts[0]) - len(parts[1]))
    return x

The results of calling this should be the same as calling quicksort on L and then indexing the kth entry in the sorted array.

Pretty much the same analysis as before goes through, changed only in the last part. The expected number of comparisons is the same as the sum, over pairs of data values, of the probability that the algorithm compares that pair. Values xi and xj are compared exactly when one of these two values is the earliest in σ in a certain range of values, but now the range is bigger (making the probability of comparison smaller): it's the minimal consecutive range of values containing xi, xj, and xk. For, if a pivot is chosen between xi and xj then they are placed in separate subproblems and not compared, while if a pivot is chosen between both of these values and xk then both values are eliminated and not compared. Thus, we need only replace the probability 2/(j-i+1) in the quicksort analysis by the slightly more complicated formula

\frac2{\max(j-i+1,j-k+1,k-i+1)},

sum over all pairs, and we're done. The following diagram shows graphically how the sum decomposes the (i,j) plane into regions, for k < n/2 (the case for larger k is symmetric); along the colored lines within each region the probability given by the formula is constant.

  • Within the tangerine-colored triangle on the lower left, i and j are both less than k. The qth column within the triangle, through the vertical line i = k - q, has approximately q entries within it, each of which is approximately 2/q, and there are k columns, so the sum within this triangle is (ignoring lower order terms) 2k.
  • Within the salmon-colored triangle on the upper right, i and j are both greater than k. The qth row within the triangle, through the horizontal line j = k + q, has approximately q entries within it, each of which is approximately 2/q, and there are n - k columns, so the sum within this triangle is (ignoring lower order terms) 2(n - k).
  • The remaining terms in the sum have i less than k and j greater than k, but it is convenient to break them down further into three smaller regions. Within the avocado-colored triangle above and to the left of the point (k,k), the distance between i and j is less than k. The qth diagonal within this triangle has approximately q entries within it, each of which is approximately 2/q, and there are k diagonals, so the sum within this triangle is (ignoring lower order terms) 2k.
  • Within the lilac-colored parallelogram, each diagonal has k equal terms, each of the form 2/q for some q. If the same pattern of terms continued down to the main diagonal, the sum of these terms would be 2k times a harmonic number; instead, we get 2k times the difference of two harmonic numbers: 2k(ln(n - k) - ln k).
  • Within the turqoise-colored triangle at the upper right, the diagonal j - i = q has approximately n - q terms, each approximately 2/q. The sum of 2(n - q)/q, for q ranging from n - k to n, can be broken into two parts, the sum of 2n/q and the sum of -2q/q; the latter sum is simply -2k. The first sum is again (factoring out the constant numerator) the difference of two harmonic numbers, so the total is 2(n ln n - n ln(n - k) - k).

Adding all of these parts of the sum together, we find that the total expected number of comparisons made by quickselect is

\left(2n\left(1+\ln\frac{n}{n-k}\right)+2k\ln\frac{n-k}k\right)(1+o(1)).

The worst case happens when k = n/2, for which the number of comparisons can be simplified to

2n(1+\ln 2+o(1))\le3.3863n+o(n).

A simplified analysis, perhaps more suitable for an undergraduate class, would be to expand the upper left rectangle into a larger triangle, with n diagonals each contributing less than 2 to the total; this sloppier analysis shows more easily that the number of comparisons is at most 4n.



(Post a new comment)

Avrim Blum?
[info]erniepan
2007-10-10 01:58 am UTC (link)
Hmm. I've always thought of the indicator-variable analysis of randomized quicksort as Raimund Seidel's idea. The trick is implicit in Raimund's analysis of treaps, and explicit in his survey on backwards analysis. Did Raimund learn it from Avrim?

(Reply to this) (Thread)

Re: Avrim Blum?
[info]11011110
2007-10-10 03:20 am UTC (link)
I don't know, but CLRS say Avrim.

(Reply to this) (Parent)

Quickselect simpler analysis?
(Anonymous)
2007-10-10 12:26 pm UTC (link)
Let X_i be the size of the subproblem in the ith level of the recurrence.
E[ X_i | X_{i-1} ] = X_{i-1}/2.
As such, E[X_i ] = E[ E[X_i| X_i-1] ] = E[X_{i-1}]/2 = ... = n/2^i.
Thus, the expected running time is
E[\sum_i X_i ] = \sum_i n/2^i = O(n).

No?

--Sariel

(Reply to this) (Thread)

Re: Quickselect simpler analysis?
(Anonymous)
2007-10-10 12:39 pm UTC (link)
BTW, E[X_i|X_{i-1}] = X_{i-1}/2 is over simplifying things.
E[ X_i | X_{i-1} ] <= X_{i-1} * (15/16) is easier to show, and it is enough...

--S

(Reply to this) (Parent)(Thread)

Re: Quickselect simpler analysis?
[info]11011110
2007-10-10 02:50 pm UTC (link)
How easy is it to get the correct leading term without O-notation this way?

(Reply to this) (Parent)(Thread)

Re: Quickselect simpler analysis?
(Anonymous)
2007-10-10 04:38 pm UTC (link)
Yeh. This analysis only gives you the asymptotic bound, not the constant. A fair trade off between elegance and preciseness, IMHO.

Now, if you argue that that the worst case is when the element to be picked is exactly in the middle (which is true), then
E[X_i| X_{i-1}] = (3/4)X_i -1 <= 3/4(X_i)
This would give you a reasonable constant (i.e., 4n comparisons), but NOT the right constant which is slightly smaller.

In any case, QuickSelect is the wrong algorithm if you care about the constant - one can do 1.5n + noise comparisons by a randomized algorithm (see Raghavan and Motwani book). Any deterministic algorithm need to do 2n comparisons. Randomization wins again.

--S

(Reply to this) (Parent)(Thread)

Re: Quickselect simpler analysis?
[info]11011110
2007-10-10 05:50 pm UTC (link)
More precisely, n + min(k,n-k) + O((n log n)2/3) comparisons. I included it in some old lecture notes. But I think it's easiest to understand that algorithm if you already know quickselect.

(Reply to this) (Parent)

Re: Quickselect simpler analysis?
(Anonymous)
2007-10-11 03:04 am UTC (link)
Talking of all this, has anyone looked into nailing down the right constant for deterministic median-finding? The last effort I know of is by Dor and Zwick who show (in two amazing papers) that the constant is less than 3 and greater than 2.

(Reply to this) (Parent)(Thread)

Re: Quickselect simpler analysis?
[info]11011110
2007-10-11 03:25 am UTC (link)
That's the last one I remember seeing.

(Reply to this) (Parent)

Re: Quickselect simpler analysis?
(Anonymous)
2007-10-11 04:24 am UTC (link)
Sorry, forgot to sign above post.
- Amit Chakrabarti

(Reply to this) (Parent)

Software for figures
(Anonymous)
2007-10-16 04:04 pm UTC (link)
What do you use to make your figures? In particular, the diagram in this post. Thanks.

(Reply to this) (Thread)

Re: Software for figures
[info]11011110
2007-10-16 04:22 pm UTC (link)
Adobe Illustrator.

(Reply to this) (Parent)(Thread)

Re: Software for figures
(Anonymous)
2007-10-17 06:36 am UTC (link)
Thanks. They look very nice.

(Reply to this) (Parent)(Thread)

Re: Software for figures
[info]11011110
2007-10-17 06:41 am UTC (link)
You're welcome, and thanks!

(Reply to this) (Parent)

Re: Software for figures
[info]11011110
2007-10-16 05:45 pm UTC (link)
Oh, and the formulae were copied and pasted in from LaTeX Equation Editor.

(Reply to this) (Parent)


(Anonymous)
2007-11-02 05:48 pm UTC (link)
I just ran across this today. I tried giving this as part of an assignment in my undergraduate algorithms course a few years ago (I thought it was easier than it actually is):

http://cg.scs.carleton.ca/~morin/teaching/4804/notes/questions/master.pdf
(question 2.17)

The students solved it with varying degrees of accuracy, but a few of them pushed right through and got the right answer.

Pat Morin

(Reply to this) (Thread)


[info]11011110
2007-11-02 07:34 pm UTC (link)
Yeah, it seems quite easy if you only think about it up to the "thus, we need only replace the probability 2/(j-i+1) in the quicksort analysis by the slightly more complicated formula" part. But then you have to actually solve the sum... Thanks for the pointer.

(Reply to this) (Parent)

Not originally from me -Avrim
(Anonymous)
2007-11-29 03:32 am UTC (link)
Hi. Was browsing around and saw this thread. I've been using this analysis in my undergrad algorithms class, but it definitely didn't originally come from me. I think I got it from the Motwani & Raghavan Randomized Algorithms book (which has lots of nice stuff - I highly recommend it!). BTW, if you want a strange probability puzzle, check out problem 6 here (http://www.cs.cmu.edu/~avrim/Randalgs98/hwk1.ps). (Problem 5 is more of a classic, but problem 6 is pretty mind-bending).
-Avrim

(Reply to this) (Thread)

Re: Not originally from me -Avrim
[info]11011110
2007-11-29 04:02 am UTC (link)
Hi Avrim; thanks for helping clear this up. Now that you mention it, I have this vague feeling of hearing you discuss this same issue of who this analysis should really be credited to sometime before. Maybe at SODA 2003 when we met for the STOC PC?

Problem 5 gives me the feeling that it was designed to lead students towards thinking about priors, no? And ∞ - ∞ = indeterminate.

(Reply to this) (Parent)(Thread)

Re: Not originally from me -Avrim
(Anonymous)
2007-11-29 03:38 pm UTC (link)
Hi. I don't exactly remember but could be, actually.

Problem 5 is one of those classic "be careful about your intuition" kinds of problems. (How could you possibly have a better than 50/50 shot at picking the larger number no matter what x and y are?). Yes, you're absolutely right about problem 6. We have a quantity G (the gain of the always switch strategy) such that E[G|x] > 0 for any x you see, and a quantity G' (the gain of the always-stay strategy) such that E[G'|x] < 0 for any x you see, and yet the two strategies are really identical. They're basically just pairing up the divergent series in different ways. Still, it's very strange -- it does show one has to be careful with infinities. (If there were only a finite number of cards, the strategy to optimize expected gain would be to always switch *unless* you see the largest number). Actually, I like to use this as a segue into talking about utilities...

(Reply to this) (Parent)


Create an Account
Forgot your login or password?
Login w/ OpenID
English • Español • Deutsch • Русский…