TY - GEN
T1 - Multi-document Topic Segmentation
AU - Jeong, Minwoo
AU - Titov, Ivan
PY - 2010
Y1 - 2010
N2 - Multiple documents describing the same or closely related sets of events are common and often easy to obtain: for example, consider document clusters on a news aggregator site or multiple reviews of the same product or service. Even though each such document discusses a similar set of topics, they provide alternative views or complimentary information on each of these topics. We argue that revealing hidden relations by jointly segmenting the documents, or, equivalently, predicting links between topically related segments in different documents would help to visualize documents of interest and construct friendlier user interfaces. In this paper, we refer to this problem as multi-document topic segmentation. We propose an unsupervised Bayesian model for the considered problem that models both shared and document-specific topics, and utilizes Dirichlet process priors to determine the effective number of topics. We show that topic segmentation can be inferred efficiently using a simple split-merge sampling algorithm. The resulting method outperforms baseline models on four datasets for multi-document topic segmentation.
AB - Multiple documents describing the same or closely related sets of events are common and often easy to obtain: for example, consider document clusters on a news aggregator site or multiple reviews of the same product or service. Even though each such document discusses a similar set of topics, they provide alternative views or complimentary information on each of these topics. We argue that revealing hidden relations by jointly segmenting the documents, or, equivalently, predicting links between topically related segments in different documents would help to visualize documents of interest and construct friendlier user interfaces. In this paper, we refer to this problem as multi-document topic segmentation. We propose an unsupervised Bayesian model for the considered problem that models both shared and document-specific topics, and utilizes Dirichlet process priors to determine the effective number of topics. We show that topic segmentation can be inferred efficiently using a simple split-merge sampling algorithm. The resulting method outperforms baseline models on four datasets for multi-document topic segmentation.
U2 - 10.1145/1871437.1871579
DO - 10.1145/1871437.1871579
M3 - Conference contribution
SN - 978-1-4503-0099-5
T3 - CIKM '10
SP - 1119
EP - 1128
BT - Proceedings of the 19th ACM International Conference on Information and Knowledge Management
PB - ACM
CY - New York, NY, USA
ER -