For many applications of machine learning the goal is to predict the value of a vector c given the value of a vector x of input features. In a classification problem c represents a discrete class label, whereas in a regression problem it corresponds to one or more continuous variables. From a probabilistic perspective, the goal is to find the conditional distribution p(c|x). The most common approach to this problem is to represent the conditional distribution using a parametric model, and then to determine the parameters using a training set consisting of pairs (x_n, c_n) of input vectors along with their corresponding target output vectors. The resulting conditional distribution can be used to make predictions of c for new values of x. This is known as a discriminative approach, since the conditional distribution discriminates directly between the different values of c.

An alternative approach is to find the joint distribution p(x, c), expressed for instance as a parametric model, and then subsequently use this joint distribution to evaluate the conditional p(c|x) in order to make predictions of c for new values of x. This is known as a generative approach since by sampling from the joint distribution it is possible to generate synthetic examples of the feature vector x. In practice, the generalization performance of generative models is often found to be poorer than that of discriminative models due to differences between the model and the true distribution of the data.

When labelled training data is plentiful, discriminative techniques are widely used since they give excellent generalization performance. However, although collection of data is often easy, the process of labelling it can be expensive. Consequently there is increasing interest in generative methods since these can exploit unlabelled data in addition to labelled data.

Although the generalization performance of generative models can often be improved by 'training them discriminatively', they can then no longer make use of unlabelled data. In an attempt to gain the benefit of both generative and discriminative approaches, heuristic procedures have been proposed which interpolate between these two extremes by taking a convex combination of the generative and discriminative objective functions.

Here we discuss a new perspective which says that there is only one correct way to train a given model, and that a 'discriminatively trained' generative model is fundamentally a new model (Minka, 2006). From this viewpoint, generative and discriminative models correspond to specific choices for the prior over parameters. As well as giving a principled interpretation of 'discriminative training', this approach opens the door to very general ways of interpolating between generative and discriminative extremes through alternative choices of prior. We illustrate this framework using both synthetic data and a practical example in the domain of multi-class object recognition. Our results show that, when the supply of labelled training data is limited, the optimum performance corresponds to a balance between the purely generative and the purely discriminative. We conclude by discussing how to use a Bayesian approach to find automatically the appropriate trade-off between the generative and discriminative extremes.
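
The distinction between the two objectives, and the heuristic convex combination mentioned above, can be made concrete with a small sketch. It is an illustration under stated assumptions, not the method of the paper: the one-dimensional Gaussian class-conditional densities, the weight name `alpha`, the function names, and the toy data are all hypothetical choices made here. The sketch evaluates the generative objective (the sum of log p(x_n, c_n)), the discriminative objective (the sum of log p(c_n | x_n), obtained from the same joint model via Bayes' rule), and a weighted blend of the two. It does not implement the prior-based interpolation that the abstract proposes.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian N(x | mean, var)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_joint(x, c, params):
    """Generative model: log p(x, c) = log p(c) + log p(x | c)."""
    prior, mean, var = params[c]
    return math.log(prior) + log_gauss(x, mean, var)

def log_conditional(x, c, params):
    """Bayes' rule: log p(c | x) = log p(x, c) - log sum_k p(x, k)."""
    logs = [log_joint(x, k, params) for k in params]
    m = max(logs)  # log-sum-exp shift for numerical stability
    log_evidence = m + math.log(sum(math.exp(v - m) for v in logs))
    return log_joint(x, c, params) - log_evidence

def blended_objective(data, params, alpha):
    """Convex combination of the two log-likelihood objectives.

    alpha = 1 gives the discriminative objective, sum_n log p(c_n | x_n);
    alpha = 0 gives the generative objective, sum_n log p(x_n, c_n).
    The name `alpha` is an assumption of this sketch.
    """
    disc = sum(log_conditional(x, c, params) for x, c in data)
    gen = sum(log_joint(x, c, params) for x, c in data)
    return alpha * disc + (1 - alpha) * gen

# Hypothetical toy problem: two classes, one input feature.
# Each class maps to (prior, mean, variance).
params = {0: (0.5, -1.0, 1.0), 1: (0.5, 1.0, 1.0)}
data = [(-1.2, 0), (-0.4, 0), (0.9, 1), (1.5, 1)]

for alpha in (0.0, 0.5, 1.0):
    print(alpha, blended_objective(data, params, alpha))
```

In a training loop one would maximize `blended_objective` over `params` for a fixed `alpha`. Only the generative term can be extended to unlabelled inputs, by summing the joint over c, which is why purely discriminative training cannot exploit unlabelled data.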
| Title of host publication | Bayesian Statistics 8 |
| Publisher | Oxford University Press |
| Number of pages | 22 |
| Publication status | Published - 2007 |