TY - GEN
T1 - Toward Automatic Identification of Dataset Names in Scholarly Articles
AU - Ikeda, Daisuke
AU - Taniguchi, Yuta
PY - 2019/7
Y1 - 2019/7
N2 - As the number of accessible scholarly articles grows, it becomes possible to read them freely. However, scholarly articles are difficult to understand because they are written primarily for experts. Our long-term goal is to develop methods for extracting the essential elements of articles in order to facilitate open innovation. To this end, this paper considers the automatic identification of dataset names in articles. Because a dictionary of datasets is necessary for evaluation, existing methods have focused on specific disciplines. To achieve applicability to any discipline, we adopt a machine learning approach using a large collection of scholarly papers. Because we treat papers from multiple disciplines, evaluating the experimental results is challenging. To address this, we quantitatively evaluate the results with precision@N, which does not require knowing all the dataset names in the papers we use, and qualitatively check whether candidate tokens are dataset names using a GUI tool we have developed. While about one third of the top 20 tokens output by our method were dataset names, the others were names of methods, models, or organizations. Removing such noisy results, for example by exploiting the additive compositionality of word vectors, is important future work.
AB - As the number of accessible scholarly articles grows, it becomes possible to read them freely. However, scholarly articles are difficult to understand because they are written primarily for experts. Our long-term goal is to develop methods for extracting the essential elements of articles in order to facilitate open innovation. To this end, this paper considers the automatic identification of dataset names in articles. Because a dictionary of datasets is necessary for evaluation, existing methods have focused on specific disciplines. To achieve applicability to any discipline, we adopt a machine learning approach using a large collection of scholarly papers. Because we treat papers from multiple disciplines, evaluating the experimental results is challenging. To address this, we quantitatively evaluate the results with precision@N, which does not require knowing all the dataset names in the papers we use, and qualitatively check whether candidate tokens are dataset names using a GUI tool we have developed. While about one third of the top 20 tokens output by our method were dataset names, the others were names of methods, models, or organizations. Removing such noisy results, for example by exploiting the additive compositionality of word vectors, is important future work.
UR - http://www.scopus.com/inward/record.url?scp=85080877034&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85080877034&partnerID=8YFLogxK
U2 - 10.1109/IIAI-AAI.2019.00083
DO - 10.1109/IIAI-AAI.2019.00083
M3 - Conference contribution
T3 - Proceedings - 2019 8th International Congress on Advanced Applied Informatics, IIAI-AAI 2019
SP - 379
EP - 382
BT - Proceedings - 2019 8th International Congress on Advanced Applied Informatics, IIAI-AAI 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 8th IIAI International Congress on Advanced Applied Informatics, IIAI-AAI 2019
Y2 - 7 July 2019 through 11 July 2019
ER -