Toward Automatic Identification of Dataset Names in Scholarly Articles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

As the number of scholarly articles we can access increases, it is becoming possible to read them freely. However, scholarly articles are difficult to understand because they are written primarily for experts. Our long-term goal is to facilitate open innovation by developing methods that extract the essential elements of articles. To this end, this paper considers the automatic identification of dataset names in articles. Because evaluation requires a dictionary of datasets, existing methods have focused on a specific discipline. To make our method applicable to any discipline, we adopt a machine learning approach that uses a large number of scholarly papers. Because we handle papers from multiple disciplines, evaluating the experimental results is challenging. To address this, we evaluate the results quantitatively with precision@N, which does not require knowing all the dataset names in the papers we use, and we qualitatively check whether candidate tokens are dataset names using a GUI tool we have developed. About one third of the top 20 tokens output by our method were dataset names; the rest were names of methods, models, or organizations. Removing such noisy results, using the additive compositionality of word vectors, is therefore important future work.
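The precision@N measure mentioned in the abstract can be illustrated with a minimal sketch. It assumes the method outputs a ranked list of candidate tokens and that a human judge has marked which of the inspected candidates are dataset names. The function name `precision_at_n` and the sample tokens are hypothetical, chosen for illustration only, and are not taken from the paper.

```python
from typing import Sequence, Set


def precision_at_n(ranked: Sequence[str], judged_positive: Set[str], n: int) -> float:
    """Return the fraction of the top-n ranked candidates judged to be dataset names.

    Only the top-n candidates need a judgment, so no complete list of
    dataset names in the corpus is required.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    top_n = ranked[:n]
    hits = sum(1 for token in top_n if token in judged_positive)
    return hits / n


# Hypothetical ranked output and manual judgments (illustrative only).
ranked_candidates = ["MNIST", "SVM", "ImageNet", "LSTM", "Google", "CIFAR-10"]
judged_as_dataset = {"MNIST", "ImageNet", "CIFAR-10"}

print(precision_at_n(ranked_candidates, judged_as_dataset, 3))
print(precision_at_n(ranked_candidates, judged_as_dataset, 6))
```

Because the denominator is N rather than the total number of dataset names in the corpus, the measure stays computable even when no exhaustive dataset dictionary exists, which is the property the abstract relies on.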

Original language: English
Title of host publication: Proceedings - 2019 8th International Congress on Advanced Applied Informatics, IIAI-AAI 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 379-382
Number of pages: 4
ISBN (Electronic): 9781728126272
Publication status: Published - Jul 2019
Event: 8th IIAI International Congress on Advanced Applied Informatics, IIAI-AAI 2019 - Toyama, Japan
Duration: Jul 7 2019 → Jul 11 2019

Publication series

Name: Proceedings - 2019 8th International Congress on Advanced Applied Informatics, IIAI-AAI 2019

Conference

Conference: 8th IIAI International Congress on Advanced Applied Informatics, IIAI-AAI 2019
Country/Territory: Japan
City: Toyama
Period: 7/7/19 → 7/11/19

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Computer Science Applications
  • Information Systems
  • Information Systems and Management
  • Social Sciences (miscellaneous)
