Significance of Low-Frequent Words in Concept Describing Document

Yuki Okumura, Sachio Hirokawa, Kazuhiro Takeuchi

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    In information retrieval, text mining, and natural language processing, tf-idf (term frequency-inverse document frequency) is a numerical statistic intended to reflect how significant a word is to a document in a collection. The tf-idf value increases in proportion to the number of times a word occurs in the document and is offset by the number of documents in the corpus that contain the word, which compensates for the fact that some words appear more frequently in general. The tf-idf value of a word in a given document is therefore designed to be larger when the word occurs frequently in that document. In other words, document classification using tf-idf does not take the role of infrequent words into account. In this paper, we focus on words that appear infrequently in a document. Specifically, we use an SVM (Support Vector Machine) based feature extraction method to examine the features that characterize sets of documents describing specific knowledge. As a result, we confirmed that words appearing only once in a document occur in documents describing specific knowledge and contribute to distinguishing them from documents that describe general knowledge.
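
The weighting described in the abstract can be illustrated with a minimal sketch. It assumes the common variant with raw term count for tf and log(N / df) for idf; the paper's exact formulation, its corpus, and its SVM setup are not given here, so the function names (`tf_idf`, `once_only_words`) and the toy corpus below are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per tokenized document.

    Assumed variant: tf = raw count, idf = log(N / df). Other variants exist.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights


def once_only_words(doc):
    """Words that occur exactly once in a single tokenized document."""
    return {w for w, count in Counter(doc).items() if count == 1}


# Toy corpus, invented for illustration (not data from the paper).
docs = [
    "kernel margin kernel margin kernel hyperplane".split(),
    "kernel data data model".split(),
    "model data text text".split(),
]

weights = tf_idf(docs)
singletons = once_only_words(docs[0])

# Print the weights of the first document, largest first.
for word, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    marker = " (occurs once)" if word in singletons else ""
    print(f"{word}: {weight:.3f}{marker}")
```

In this toy corpus, "hyperplane" appears in only one document yet receives the smallest weight in that document because it occurs there only once. That is the kind of word the abstract singles out as overlooked by frequency-based weighting.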

    Original language: English
    Title of host publication: Proceedings - 2019 8th International Congress on Advanced Applied Informatics, IIAI-AAI 2019
    Publisher: Institute of Electrical and Electronics Engineers Inc.
    Pages: 1035-1036
    Number of pages: 2
    ISBN (Electronic): 9781728126272
    Publication status: Published - Jul 2019
    Event: 8th IIAI International Congress on Advanced Applied Informatics, IIAI-AAI 2019 - Toyama, Japan
    Duration: Jul 7 2019 to Jul 11 2019

    Publication series

    Name: Proceedings - 2019 8th International Congress on Advanced Applied Informatics, IIAI-AAI 2019

    Conference

    Conference: 8th IIAI International Congress on Advanced Applied Informatics, IIAI-AAI 2019
    Country/Territory: Japan
    City: Toyama
    Period: 7/7/19 to 7/11/19

    All Science Journal Classification (ASJC) codes

    • Computer Networks and Communications
    • Computer Science Applications
    • Information Systems
    • Information Systems and Management
    • Social Sciences (miscellaneous)
