In information retrieval, texts are usually retrieved by matching them with queries. This study proposes an approach in which texts are automatically classified into categories and retrieved by matching them with queries classified in the same way. For efficient information retrieval using automatic classification, methods for extracting words from texts and methods for matching are essential. Several methods for extracting words from Japanese texts have been proposed in natural language processing. However, it is difficult to extract significant words from Japanese texts because Japanese is written without spaces separating words. As for matching, many weighting methods have been proposed, as well as vector space models and probabilistic models. This article reports the results of an experiment in classifying Japanese texts into Nippon Decimal Classification (NDC) categories based on the title information in Japanese MARC records. In this experiment, three extraction methods (juman, MHSA, and n-gram) are tested on a set of 1,000 books. Four weighting methods are tested, including relative term frequency between categories, tf·idf, and tf(max)·idf. The results indicate that extraction with juman performed best and that the best weighting method was relative term frequency between categories, which selected the correct classification category (the upper three digits of the NDC) for about 55.9% of the 1,000 books.
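To make the n-gram extraction and tf·idf weighting mentioned in the abstract concrete, the following is a minimal Python sketch, not the procedure used in the study. It assumes character bigrams as terms, treats each category as one "document" when computing idf, and uses a smoothed idf of log(N/df) + 1; the function names, the toy titles, and the category labels are all hypothetical illustrations.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Return overlapping character n-grams (no word segmentation needed)."""
    text = "".join(text.split())  # drop any whitespace
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(titles_by_category, n=2):
    """Build per-category term frequencies and an idf table.

    Assumption: each category counts as one 'document' for idf.
    """
    tf = {
        cat: Counter(g for title in titles for g in char_ngrams(title, n))
        for cat, titles in titles_by_category.items()
    }
    # Number of categories in which each n-gram occurs.
    df = Counter(g for counts in tf.values() for g in counts)
    num_cats = len(tf)
    idf = {g: math.log(num_cats / d) + 1.0 for g, d in df.items()}
    return tf, idf


def classify(title, tf, idf, n=2):
    """Return the category with the highest summed tf * idf score."""
    grams = char_ngrams(title, n)
    scores = {
        cat: sum(counts[g] * idf[g] for g in grams if g in counts)
        for cat, counts in tf.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy training data: category label -> book titles.
TOY_TITLES = {
    "007": ["情報検索入門", "情報科学概論"],
    "910": ["日本文学史", "近代日本文学"],
}

tf, idf = train(TOY_TITLES)
predicted = classify("情報検索の基礎", tf, idf)
```

With this toy data the query title shares bigrams only with the first category, so `predicted` is that category's label. A morphological analyzer such as juman would replace `char_ngrams` with a word-level tokenizer; the scoring step would stay the same.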
Library and Information Science
Published - 1998
All Science Journal Classification (ASJC) codes
- Library and Information Sciences