TY - GEN
T1 - Significance of Low-Frequent Words in Concept Describing Document
AU - Okumura, Yuki
AU - Hirokawa, Sachio
AU - Takeuchi, Kazuhiro
PY - 2019/7
Y1 - 2019/7
N2 - In applications of information retrieval, text mining, and natural language processing, tf-idf (term frequency-inverse document frequency) is a numerical statistic intended to reflect how significant a word is to a document in a collection. The value of tf-idf increases proportionally to the number of times a word occurs in the document and is offset by the number of documents in the corpus that contain the word, reflecting the fact that some words appear more frequently in general. Therefore, tf-idf is designed to assign a word a higher value in a given document when the word occurs frequently in that document. In other words, document classification using tf-idf largely ignores the role of infrequent words. In this paper, we focus on words that appear infrequently in a document. Specifically, we use an SVM (Support Vector Machine) based feature extraction method to examine the features that characterize document sets describing specific knowledge. As a result, we confirmed that words appearing only once in a document are found in documents describing specific knowledge and contribute to distinguishing them from documents that describe general knowledge.
AB - In applications of information retrieval, text mining, and natural language processing, tf-idf (term frequency-inverse document frequency) is a numerical statistic intended to reflect how significant a word is to a document in a collection. The value of tf-idf increases proportionally to the number of times a word occurs in the document and is offset by the number of documents in the corpus that contain the word, reflecting the fact that some words appear more frequently in general. Therefore, tf-idf is designed to assign a word a higher value in a given document when the word occurs frequently in that document. In other words, document classification using tf-idf largely ignores the role of infrequent words. In this paper, we focus on words that appear infrequently in a document. Specifically, we use an SVM (Support Vector Machine) based feature extraction method to examine the features that characterize document sets describing specific knowledge. As a result, we confirmed that words appearing only once in a document are found in documents describing specific knowledge and contribute to distinguishing them from documents that describe general knowledge.
UR - http://www.scopus.com/inward/record.url?scp=85080863106&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85080863106&partnerID=8YFLogxK
U2 - 10.1109/IIAI-AAI.2019.00214
DO - 10.1109/IIAI-AAI.2019.00214
M3 - Conference contribution
T3 - Proceedings - 2019 8th International Congress on Advanced Applied Informatics, IIAI-AAI 2019
SP - 1035
EP - 1036
BT - Proceedings - 2019 8th International Congress on Advanced Applied Informatics, IIAI-AAI 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 8th IIAI International Congress on Advanced Applied Informatics, IIAI-AAI 2019
Y2 - 7 July 2019 through 11 July 2019
ER -