Sunday, July 27, 2008

Indirect survey

The usual way to statistically determine the prevalence of a given trait T in a population involves selecting a representative sample and having its members answer the question "Do you have trait T?"; the number of affirmative answers divided by the size of the sample is an estimate of the actual prevalence p = P(T). How can we estimate p if the following question is used instead?

Do you have some friend with trait T?

Let us assume that each friend of a person has T with probability p, independently of whether that person's other friends have T. The probability that a person has at least one friend with T is then given by:

p' = 1 − ∑_i f_i (1 − p)^i,
f_i := P(a person has exactly i friends).

So, p' is simply 1 − G_F(1 − p), where G_F is the probability-generating function of F, the random variable expressing the number of friends a person has. If we model F by a Poisson distribution with mean λ, we have

G_F(x) = e^{λ(x − 1)},

and hence

p' = 1 − G_F(1 − p) = 1 − e^{−λp},
p = −(1/λ) ln(1 − p').
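
The formulas above are easy to check numerically. The following is only a minimal sketch: the function names are mine, and poisson_pmf is an illustrative helper that truncates the Poisson distribution at 100 friends (an assumption that is harmless for λ = 10). It compares the general sum over f_i with the Poisson closed form, then inverts the result to recover p.

```python
import math

def p_prime_general(p, friend_dist):
    """p' = 1 - G_F(1 - p) for a friend-count distribution given as {i: f_i}."""
    return 1.0 - sum(f * (1.0 - p) ** i for i, f in friend_dist.items())

def p_prime_poisson(p, lam):
    """Closed form when F ~ Poisson(lam): p' = 1 - exp(-lam * p)."""
    return 1.0 - math.exp(-lam * p)

def p_from_p_prime(p_prime, lam):
    """Inverse of the closed form: p = -(1/lam) * ln(1 - p')."""
    return -math.log(1.0 - p_prime) / lam

def poisson_pmf(lam, max_i):
    """Poisson probabilities f_0..f_max_i (truncated; illustrative helper)."""
    pmf = {}
    term = math.exp(-lam)  # f_0
    for i in range(max_i + 1):
        pmf[i] = term
        term *= lam / (i + 1)  # f_{i+1} = f_i * lam / (i + 1)
    return pmf

lam, p = 10.0, 0.001
closed = p_prime_poisson(p, lam)
summed = p_prime_general(p, poisson_pmf(lam, 100))
recovered = p_from_p_prime(closed, lam)
print(f"closed form p' = {closed:.6f}")
print(f"summed      p' = {summed:.6f}")
print(f"recovered   p  = {recovered:.6f}")
```

Passing a different dictionary to p_prime_general gives p' for any other friend-count distribution, although the inversion then generally has to be done numerically.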

The figure shows p' as a function of p for various values of λ.
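
The curve p' = 1 − e^{−λp} can also be cross-checked by simulation. The sketch below makes the same assumptions as the text (Poisson-distributed number of friends, each friend having T independently with probability p); the function names, the seed and the sample size of 20000 are arbitrary choices of mine. Poisson draws use Knuth's multiplication method, which is adequate for small λ.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw from Poisson(lam) using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def simulate_p_prime(p, lam, n_respondents, rng):
    """Fraction of simulated respondents with at least one friend having T."""
    yes = 0
    for _ in range(n_respondents):
        n_friends = sample_poisson(lam, rng)
        if any(rng.random() < p for _ in range(n_friends)):
            yes += 1
    return yes / n_respondents

rng = random.Random(2008)  # fixed seed so that the run is repeatable
for lam in (2.0, 10.0):
    for p in (0.01, 0.1):
        simulated = simulate_p_prime(p, lam, 20000, rng)
        predicted = 1.0 - math.exp(-lam * p)
        print(f"lam={lam:4.1f}  p={p:.2f}  simulated={simulated:.3f}  predicted={predicted:.3f}")
```

The simulated and predicted columns should agree up to sampling noise, which shrinks as the number of simulated respondents grows.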

There are situations where it can be interesting to estimate p through this indirect survey method:

  1. For typical values of λ, which are much greater than 1, p' is notably larger than p when the latter lies in the vicinity of 0. This implies that indirect estimation of p through p' results in better (narrower) confidence intervals when p is very small. For instance, for p = 1/1000 and λ = 10, a direct survey obtains on average one affirmative answer in a sample of 1000 individuals, with an associated 95% Wilson score confidence interval of (0.0001, 0.0065) (calculated via VassarStats). With an indirect survey we typically get 10 affirmative answers for the same sample size, with a confidence interval of (0.0051, 0.019) for p'; applying the transformation p = −(1/λ) ln(1 − p'), the latter confidence interval maps to (0.0005, 0.0019), whose width is about 22% of that of the interval for the direct survey.
  2. When doing surveys on sensitive issues (illegal drug use, paying for sex), results can be severely biased by people's reluctance to answer truthfully. In these cases, survey respondents might be more inclined to answer accurately when the questions are indirect: it is easier for a person to acknowledge having friends who have paid for sex than to admit to having paid for it oneself.
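
The interval comparison in the first item can be reproduced with a few lines of code. This is a sketch under one assumption of mine: the quoted figures match the continuity-corrected form of the Wilson score interval (Newcombe's formulation), so that is what wilson_cc implements. The names wilson_cc and to_direct are illustrative.

```python
import math

Z95 = 1.959964  # two-sided 95% quantile of the standard normal

def wilson_cc(successes, n, z=Z95):
    """Wilson score interval with continuity correction (assumed variant)."""
    p_hat = successes / n
    q_hat = 1.0 - p_hat
    denom = 2.0 * (n + z * z)
    lo_root = math.sqrt(z * z - 2.0 - 1.0 / n + 4.0 * p_hat * (n * q_hat + 1.0))
    hi_root = math.sqrt(z * z + 2.0 - 1.0 / n + 4.0 * p_hat * (n * q_hat - 1.0))
    lower = (2.0 * n * p_hat + z * z - 1.0 - z * lo_root) / denom
    upper = (2.0 * n * p_hat + z * z + 1.0 + z * hi_root) / denom
    lower = 0.0 if successes == 0 else max(0.0, lower)
    upper = 1.0 if successes == n else min(1.0, upper)
    return lower, upper

def to_direct(p_prime, lam):
    """Map a value of p' back to p via p = -(1/lam) * ln(1 - p')."""
    return -math.log(1.0 - p_prime) / lam

n, lam = 1000, 10.0
direct = wilson_cc(1, n)             # direct survey: 1 yes out of 1000
indirect_pp = wilson_cc(10, n)       # indirect survey: 10 yes out of 1000
indirect = tuple(to_direct(x, lam) for x in indirect_pp)
ratio = (indirect[1] - indirect[0]) / (direct[1] - direct[0])

print("direct CI for p:    (%.4f, %.4f)" % direct)
print("indirect CI for p': (%.4f, %.4f)" % indirect_pp)
print("mapped CI for p:    (%.4f, %.4f)" % indirect)
print("width ratio: %.2f" % ratio)
```

Because the transformation from p' to p is monotonically increasing, mapping the two endpoints of the interval for p' is enough to obtain an interval for p with the same coverage, provided λ is known exactly.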

A serious drawback of this method is its reliance on the parameter λ, which is unknown a priori. One technique for solving this is the following: include in the survey an additional control question for which the associated statistics are already known, so that the survey results can be used to estimate λ. Ideally, the control question's subject matter should be similar to that of the actual question, so that any bias due to evasive answers cancels out.
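
As a rough illustration of the control-question idea: if the control trait has known prevalence p_c and a fraction p'_c of respondents report having a friend with it, then solving p'_c = 1 − e^{−λ p_c} for λ gives λ ≈ −ln(1 − p'_c)/p_c, which can be plugged into the formula for p. The sketch below assumes that the same Poisson model and the same λ apply to both questions; the function names and all the survey counts are hypothetical.

```python
import math

def estimate_lambda(control_yes, n, control_prevalence):
    """Solve p'_c = 1 - exp(-lam * p_c) for lam, given a known prevalence p_c."""
    p_prime_c = control_yes / n
    return -math.log(1.0 - p_prime_c) / control_prevalence

def estimate_p(target_yes, n, lam):
    """Estimate p for the target trait from its indirect yes-fraction."""
    return -math.log(1.0 - target_yes / n) / lam

# Hypothetical survey: 1000 respondents, control trait with known prevalence 2%.
n = 1000
lam_hat = estimate_lambda(control_yes=181, n=n, control_prevalence=0.02)
p_hat = estimate_p(target_yes=10, n=n, lam=lam_hat)
print(f"estimated lambda = {lam_hat:.2f}")
print(f"estimated p      = {p_hat:.5f}")
```

Note that the estimate of p reduces to p_c · ln(1 − p') / ln(1 − p'_c), so λ never has to be known on its own; its sampling error does, however, widen the resulting confidence interval for p.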

