TY - GEN
T1 - The Impact of Language Properties in Multilingual Datasets on Sarcasm Detection
AU - Yang, Linshuo
AU - Ikeda, Daisuke
N1 - Publisher Copyright: © 2023 IEEE.
PY - 2023
Y1 - 2023
AB - People now spend considerable time on social media expressing their opinions and emotions, making it one of the most important data sources for machine-learning-based sentiment analysis. However, sarcasm is an obstacle to tasks such as sentiment analysis that seek to determine people's true intentions. As a result, research on automatic sarcasm detection has attracted attention. In addition, with the globalization of social media, sarcasm detection models that can handle multiple languages have become crucial. Although research on multilingual sarcasm detection models has become popular in recent years, there has been little examination of how the types of languages included in the training dataset affect model performance. Sarcasm is highly dependent on culture, and language reflects culture, so differences between languages may lead to differences in sarcastic expressions. This study focused on the morphological-typological differences between Arabic and Chinese and trained the model on two datasets: an English-Arabic dataset, in which the two languages belong to the same category, and an English-Chinese dataset, in which they belong to different categories. The results were then compared on two English test datasets. The model trained on English and Arabic performed better than the one trained on English and Chinese, indicating that the morphological-typological classification of the languages in the dataset affects multilingual sarcasm detection. In other words, to improve detection for languages of a given category, it is better to use training data from languages of the same category. Additionally, a Multilingual BERT-LSTM model was constructed and compared with the BERT-only experiment. The LSTM structure was generally found to be effective for multilingual sarcasm detection.
UR - http://www.scopus.com/inward/record.url?scp=85183458939&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85183458939&partnerID=8YFLogxK
U2 - 10.1109/IIAI-AAI59060.2023.00011
DO - 10.1109/IIAI-AAI59060.2023.00011
M3 - Conference contribution
AN - SCOPUS:85183458939
T3 - Proceedings - 2023 14th IIAI International Congress on Advanced Applied Informatics, IIAI-AAI 2023
SP - 1
EP - 6
BT - Proceedings - 2023 14th IIAI International Congress on Advanced Applied Informatics, IIAI-AAI 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 14th IIAI International Congress on Advanced Applied Informatics, IIAI-AAI 2023
Y2 - 8 July 2023 through 13 July 2023
ER -