TY - GEN
T1 - Dimensionality reduction for active learning with nearest neighbour classifier in text categorisation problems
AU - Davy, Michael
AU - Luz, Saturnino
PY - 2008/2/25
Y1 - 2008/2/25
AB - Dimensionality reduction techniques are commonly used in text categorisation problems to improve training and classification efficiency as well as to avoid overfitting. The best-performing dimensionality reduction techniques for text categorisation are supervised and hence utilise the label information of the training data. Active learning is used to reduce the number of labelled training examples for problems where obtaining label information is expensive. Since the vast majority of data supplied to active learning are unlabelled, supervised dimensionality reduction techniques cannot be readily employed. For this reason, active learning in text categorisation problems does not perform dimensionality reduction, thereby restricting the choice of classifier. In this paper we investigate unsupervised dimensionality reduction techniques in active learning for text categorisation problems. Two unsupervised techniques are investigated, namely Document Frequency and Principal Components Analysis. We empirically show increased performance of active learning, using a k-Nearest Neighbour classifier, when dimensionality reduction is applied using the unsupervised techniques.
UR - http://www.scopus.com/inward/record.url?scp=47349119706&partnerID=8YFLogxK
U2 - 10.1109/ICMLA.2007.39
DO - 10.1109/ICMLA.2007.39
M3 - Conference contribution
AN - SCOPUS:47349119706
SN - 0769530699
SN - 9780769530697
T3 - Proceedings - 6th International Conference on Machine Learning and Applications, ICMLA 2007
SP - 292
EP - 297
BT - Proceedings - 6th International Conference on Machine Learning and Applications, ICMLA 2007
T2 - 6th International Conference on Machine Learning and Applications, ICMLA 2007
Y2 - 13 December 2007 through 15 December 2007
ER -