Bannalia: trivial notes on themes diverse: Multicolumn querying

Suppose we have a database table equipped with a multicolumn (also called concatenated or complex) index like this:

‎CREATE TABLE user (last_name varchar, first_name varchar)
‎CREATE INDEX user_index ON user (last_name, first_name)

Most platforms take advantage of user_index in queries involving both user attributes or just last_name:

‎SELECT * from user WHERE first_name="Robert" AND last_name="Grant";
‎SELECT * from user WHERE last_name="Smith";

while, on the other hand, queries like the following:

‎SELECT * from user WHERE first_name="Mary";

trigger a full table scan or use the index in a suboptimal fashion. The simple reason behind this is that database indexes are usually implemented with B-trees or similar structures that sort the data by the values of the indexed column: for a multicolumn index on (c₁,...,c_n), sorting is performed according to the lexicographical order associated to the n columns, with c_i taking precedence over c_j if i < j. So, in our example the index sorts the rows by last_name and, within clusters with equal last_name, data is further sorted by first_name. This explains why querying on last_name benefits from the index (rows are sorted by the first attribute to begin with) while querying on first_name does not. Database designers learn to carefully consider the column specification order when creating multicolumn indexes.

So, in general, an index on (c₁,...,c_n) serves to accelerate queries of the form

c₁ = v₁ AND c₂ = v₂ AND ··· AND c_i = v_i

for all i ≤ n. The question arises: how many multicolumn indexes, and how, must be created so that all possible attribute combinations are covered by some index? For instance, if we have three attributes:

‎CREATE TABLE user (last_name varchar, first_name varchar, age integer)

three indexes suffice:

‎CREATE INDEX user_index1 ON user (last_name, first_name, age)
‎CREATE INDEX user_index2 ON user (first_name, last_name, age)
‎CREATE INDEX user_index3 ON user (age, last_name, first_name)

(You might find it entertaining to check out manually that all the 7 different attribute combinations are covered.) It turns out that for n attributes, which generate 2ⁿ − 1 combinations, the minimum number of indexes needed is

C(n,floor(n/2)) = n!/(ceil(n/2)!·floor(n/2)!),

where C(a,b) denotes the binomial coefficient a choose b. The values of C(n,floor(n/2)) for n = 1,2,... are: 1, 2, 3, 6, 10, 20, 35, 70, 126,... Using Stirling's approximation, the sequence can be proved to be O(2ⁿ/√n). The mathematical formulation of this problem has been studied in a previous entry. Also, an algorithm for generating the index set is provided.

2 comments :

MathieuOctober 20, 2008 at 5:20 PM
You could use shorter indexes like these and still full fill all combinations of the three attributes.

‎CREATE INDEX user_index1 ON user (last_name, first_name, age)
‎CREATE INDEX user_index2 ON user (first_name, age)
‎CREATE INDEX user_index3 ON user (age, last_name)
Joaquín M López MuñozOctober 20, 2008 at 5:47 PM
Hi Mathieu, yes, you're right, but still you'll need some of the indices to have full length, and still you'll need C(n,floor(n/2)) in total. It is not difficult to modify the generating algorithm to otuput full indices only when needed.

Monday, October 20, 2008

Multicolumn querying

2 comments :