Abstract
A key task in information content security is evaluating the theme words of a text. The variety of Chinese expression, especially abbreviations, makes theme words harder to supervise. The goal of this paper is to quickly and accurately discover intercept abbreviations in text crawled within a short time period. The paper first segments the target texts and then uses a Support Vector Machine (SVM) to recognize abbreviations in the wrongly segmented text as candidates. Second, it presents a collaborative recovery method: an improved Conditional Random Field (CRF) predicts the word corresponding to each character of the abbreviation, and, to handle the 1:n relationship problem, the ranked list from the prediction step is merged with the matches from a thesaurus of abbreviations. Experiments show that the recognition stage achieves 76.5% accuracy and 77.8% recall. At the recovery stage, accuracy is 62.1%, which is 20.8% higher than a method based on the Hidden Markov Model (HMM).
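The abstract does not say how the ranked list from the prediction step is combined with the thesaurus matches. As an illustration only, the following minimal sketch assumes a simple additive scheme: each candidate expansion keeps its predicted score, and any expansion that also appears in the thesaurus receives a fixed bonus. The function name `merge_candidates`, the `thesaurus_bonus` parameter, and all scores and example strings are hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterable, List, Tuple


def merge_candidates(
    predicted: Iterable[Tuple[str, float]],
    thesaurus_matches: Iterable[str],
    thesaurus_bonus: float = 0.5,  # hypothetical weight, not from the paper
) -> List[Tuple[str, float]]:
    """Merge a scored candidate list with thesaurus matches into one ranking."""
    scores: Dict[str, float] = {}
    # Keep the best predicted score for each candidate expansion.
    for expansion, score in predicted:
        scores[expansion] = max(score, scores.get(expansion, 0.0))
    # Add a fixed bonus to every expansion found in the thesaurus;
    # thesaurus-only expansions enter the ranking with the bonus alone.
    for expansion in set(thesaurus_matches):
        scores[expansion] = scores.get(expansion, 0.0) + thesaurus_bonus
    # Highest score first; ties are broken by the expansion string.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical candidates for one abbreviation (1:n relationship).
predicted = [("北京大学", 0.5), ("北方大学", 0.25)]
thesaurus = ["北京大学", "北京大会"]
merged = merge_candidates(predicted, thesaurus)
for expansion, score in merged:
    print(expansion, score)
```

In this sketch, a candidate supported by both sources rises to the top, while candidates supported by only one source remain in the list at a lower rank. Other merging schemes, such as rank-based fusion, would fit the abstract's description equally well.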
Original language | English |
---|---|
Title of host publication | Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data |
Subtitle of host publication | 16th China National Conference, CCL 2017, and 5th International Symposium, NLP-NABD 2017, Nanjing, China, October 13-15, 2017, Proceedings |
Editors | Maosong Sun, Xiaojie Wang, Baobao Chang, Deyi Xiong |
Place of Publication | Cham |
Publisher | Springer |
Pages | 224-236 |
Number of pages | 13 |
ISBN (Electronic) | 978-3-319-69005-6 |
ISBN (Print) | 978-3-319-69005-6 |
Publication status | Published - 7 Oct 2017 |
Event | The 16th China National Conference on Computational Linguistics & The 5th International Symposium on Natural Language Processing Based on Naturally Annotated Big Data - Nanjing, China. Duration: 13 Oct 2017 → 15 Oct 2017. http://www.cips-cl.org/static/CCL2017/en/home.html |
Publication series
Name | Lecture Notes in Computer Science |
---|---|
Publisher | Springer, Cham |
Volume | 10565 |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | The 16th China National Conference on Computational Linguistics & The 5th International Symposium on Natural Language Processing Based on Naturally Annotated Big Data |
---|---|
Abbreviated title | CCL 2017 and NLP-NABD 2017 |
Country/Territory | China |
City | Nanjing |
Period | 13/10/17 → 15/10/17 |
Keywords / Materials (for Non-textual outputs)
- Collaborative recovery
- Improved CRF
- Chinese abbreviation