Suppose we have a database table equipped with a multicolumn (also called concatenated or complex) index like this:

`CREATE TABLE user (last_name varchar, first_name varchar)`

CREATE INDEX user_index ON user (last_name, first_name)

`user_index`

in queries involving both `user`

attributes or just `last_name`

:`SELECT * from user WHERE first_name="Robert" AND last_name="Grant";`

SELECT * from user WHERE last_name="Smith";

while, on the other hand, queries like the following:`SELECT * from user WHERE first_name="Mary";`

trigger a full table scan or use the index in a suboptimal fashion. The simple reason behind this is that database indexes are usually implemented with B-trees or similar structures that sort the data by the values of the indexed column: for a multicolumn index on (c_{1},...,c_{n}), sorting is performed according to the lexicographical order associated to the n columns, with c_{i} taking precedence over c_{j} if i < j. So, in our example the index sorts the rows by `last_name`

and, within clusters with equal `last_name`

, data is further sorted by `first_name`

. This explains why querying on `last_name`

benefits from the index (rows are sorted by the first attribute to begin with) while querying on `first_name`

does not. Database designers learn to carefully consider the column specification order when creating multicolumn indexes.

So, in general, an index on (c_{1},...,c_{n}) serves to accelerate queries of the form

c_{1} = v_{1} `AND`

c_{2} = v_{2} `AND`

··· `AND`

c_{i} = v_{i}

for all i ≤ n. The question arises: how many multicolumn indexes, and how, must be created so that all possible attribute combinations are covered by some index? For instance, if we have three attributes:

`CREATE TABLE user (last_name varchar, first_name varchar, age integer)`

three indexes suffice:

`CREATE INDEX user_index1 ON user (last_name, first_name, age)`

CREATE INDEX user_index2 ON user (first_name, last_name, age)

CREATE INDEX user_index3 ON user (age, last_name, first_name)

(You might find it entertaining to check out manually that all the 7 different attribute combinations are covered.) It turns out that for n attributes, which generate 2^{n} − 1 combinations, the minimum number of indexes needed is

C(n,floor(n/2)) = n!/(ceil(n/2)!·floor(n/2)!),

where C(a,b) denotes the binomial coefficient a choose b. The values of C(n,floor(n/2)) for n = 1,2,... are: 1, 2, 3, 6, 10, 20, 35, 70, 126,... Using Stirling's approximation, the sequence can be proved to be O(2^{n}/√n). The mathematical formulation of this problem has been studied in a previous entry. Also, an algorithm for generating the index set is provided.

You could use shorter indexes like these and still full fill all combinations of the three attributes.

ReplyDeleteCREATE INDEX user_index1 ON user (last_name, first_name, age)

CREATE INDEX user_index2 ON user (first_name, age)

CREATE INDEX user_index3 ON user (age, last_name)

Hi Mathieu, yes, you're right, but still you'll need some of the indices to have full length, and still you'll need

ReplyDeleteC(n,floor(n/2)) in total. It is not difficult to modify the generating algorithm to otuput full indices only when needed.