Monday, October 20, 2008

Multicolumn querying

Suppose we have a database table equipped with a multicolumn (also called concatenated or complex) index like this:

‎CREATE TABLE user (last_name varchar, first_name varchar)
‎CREATE INDEX user_index ON user (last_name, first_name)
Most platforms take advantage of user_index in queries involving both user attributes or just last_name:
‎SELECT * from user WHERE first_name="Robert" AND last_name="Grant";
‎SELECT * from user WHERE last_name="Smith";
while, on the other hand, queries like the following:
‎SELECT * from user WHERE first_name="Mary";

trigger a full table scan or use the index in a suboptimal fashion. The simple reason behind this is that database indexes are usually implemented with B-trees or similar structures that sort the data by the values of the indexed column: for a multicolumn index on (c1,...,cn), sorting is performed according to the lexicographical order associated to the n columns, with ci taking precedence over cj if i < j. So, in our example the index sorts the rows by last_name and, within clusters with equal last_name, data is further sorted by first_name. This explains why querying on last_name benefits from the index (rows are sorted by the first attribute to begin with) while querying on first_name does not. Database designers learn to carefully consider the column specification order when creating multicolumn indexes.

So, in general, an index on (c1,...,cn) serves to accelerate queries of the form

c1 = v1 AND c2 = v2 AND ··· AND ci = vi

for all in. The question arises: how many multicolumn indexes, and how, must be created so that all possible attribute combinations are covered by some index? For instance, if we have three attributes:

‎CREATE TABLE user (last_name varchar, first_name varchar, age integer)

three indexes suffice:

‎CREATE INDEX user_index1 ON user (last_name, first_name, age)
‎CREATE INDEX user_index2 ON user (first_name, last_name, age)
‎CREATE INDEX user_index3 ON user (age, last_name, first_name)

(You might find it entertaining to check out manually that all the 7 different attribute combinations are covered.) It turns out that for n attributes, which generate 2n − 1 combinations, the minimum number of indexes needed is

C(n,floor(n/2)) = n!/(ceil(n/2)!·floor(n/2)!),

where C(a,b) denotes the binomial coefficient a choose b. The values of C(n,floor(n/2)) for n = 1,2,... are: 1, 2, 3, 6, 10, 20, 35, 70, 126,... Using Stirling's approximation, the sequence can be proved to be O(2n/√n). The mathematical formulation of this problem has been studied in a previous entry. Also, an algorithm for generating the index set is provided.

2 comments:

  1. You could use shorter indexes like these and still full fill all combinations of the three attributes.

    ‎CREATE INDEX user_index1 ON user (last_name, first_name, age)
    ‎CREATE INDEX user_index2 ON user (first_name, age)
    ‎CREATE INDEX user_index3 ON user (age, last_name)

    ReplyDelete
  2. Hi Mathieu, yes, you're right, but still you'll need some of the indices to have full length, and still you'll need C(n,floor(n/2)) in total. It is not difficult to modify the generating algorithm to otuput full indices only when needed.

    ReplyDelete