TY - GEN
T1 - Context-Aware Latent Dirichlet Allocation for Topic Segmentation
AU - Li, Wenbo
AU - Matsukawa, Tetsu
AU - Saigo, Hiroto
AU - Suzuki, Einoshin
N1 - Funding Information:
Part of this work was supported by Grant-in-Aid for Scientific Research JP18H03290 from the Japan Society for the Promotion of Science (JSPS) and by the State Scholarship Fund of the China Scholarship Council (grant 201706680067).
Publisher Copyright:
© Springer Nature Switzerland AG 2020.
PY - 2020
Y1 - 2020
N2 - We propose a new generative model for topic segmentation based on Latent Dirichlet Allocation. The task is to divide a document into a sequence of topically coherent segments, while preserving the change-points of long topics (coherency) and keeping short topic segments from being merged (saliency). Most existing models either fuse topic segments by keywords or focus on modeling word co-occurrence patterns without merging; they can hardly achieve both coherency and saliency, since many words have high uncertainty in topic assignment due to their polysemous nature. To solve this problem, we model topic-specific co-occurrences of word pairs within contexts, generating more coherent segments and alleviating the influence of irrelevant words on topic assignment. We also design an optimization algorithm to eliminate redundant items in the generated topic segments. Experimental results show that our proposal yields significant improvements in both topic coherence and topic segmentation.
AB - We propose a new generative model for topic segmentation based on Latent Dirichlet Allocation. The task is to divide a document into a sequence of topically coherent segments, while preserving the change-points of long topics (coherency) and keeping short topic segments from being merged (saliency). Most existing models either fuse topic segments by keywords or focus on modeling word co-occurrence patterns without merging; they can hardly achieve both coherency and saliency, since many words have high uncertainty in topic assignment due to their polysemous nature. To solve this problem, we model topic-specific co-occurrences of word pairs within contexts, generating more coherent segments and alleviating the influence of irrelevant words on topic assignment. We also design an optimization algorithm to eliminate redundant items in the generated topic segments. Experimental results show that our proposal yields significant improvements in both topic coherence and topic segmentation.
UR - http://www.scopus.com/inward/record.url?scp=85085734754&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85085734754&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-47426-3_37
DO - 10.1007/978-3-030-47426-3_37
M3 - Conference contribution
AN - SCOPUS:85085734754
SN - 9783030474256
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 475
EP - 486
BT - Advances in Knowledge Discovery and Data Mining - 24th Pacific-Asia Conference, PAKDD 2020, Proceedings
A2 - Lauw, Hady W.
A2 - Lim, Ee-Peng
A2 - Wong, Raymond Chi-Wing
A2 - Ntoulas, Alexandros
A2 - Ng, See-Kiong
A2 - Pan, Sinno Jialin
PB - Springer
T2 - 24th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2020
Y2 - 11 May 2020 through 14 May 2020
ER -