TY - JOUR
T1 - Semi-supervised learning with density-ratio estimation
AU - Kawakita, Masanori
AU - Kanamori, Takafumi
N1 - Funding Information:
Acknowledgements: The authors are grateful to Dr. Masayuki Henmi, Dr. Hironori Fujisawa and Prof. Shinto Eguchi of the Institute of Statistical Mathematics, as well as the anonymous reviewers, for their helpful comments. M.K. was partially supported by JSPS KAKENHI Grant Number 21700308. T.K. was partially supported by JSPS KAKENHI Grant Number 24500340.
PY - 2013/5
Y1 - 2013/5
N2 - In this paper we study the statistical properties of semi-supervised learning, which is considered an important problem in machine learning. In standard supervised learning, only labeled data is observed, and classification and regression problems are formalized in this setting. In semi-supervised learning, by contrast, unlabeled data is obtained in addition to labeled data. Hence, the ability to exploit unlabeled data is important for improving prediction accuracy in semi-supervised learning. This problem is regarded as a semiparametric estimation problem with missing data. Under discriminative probabilistic models, unlabeled data was considered useless for improving estimation accuracy. Recently, a weighted estimator using unlabeled data has been shown to achieve better prediction accuracy than learning methods using only labeled data, especially when the discriminative probabilistic model is misspecified. That is, improvement under the semiparametric model with missing data is possible when the semiparametric model is misspecified. In this paper, we apply a density-ratio estimator to obtain the weight function in semi-supervised learning. Our approach is advantageous because the proposed estimator does not require well-specified probabilistic models for the probability of the unlabeled data. Based on statistical asymptotic theory, we prove that the estimation accuracy of our method outperforms that of supervised learning using only labeled data. Numerical experiments demonstrate the usefulness of our methods.
AB - In this paper we study the statistical properties of semi-supervised learning, which is considered an important problem in machine learning. In standard supervised learning, only labeled data is observed, and classification and regression problems are formalized in this setting. In semi-supervised learning, by contrast, unlabeled data is obtained in addition to labeled data. Hence, the ability to exploit unlabeled data is important for improving prediction accuracy in semi-supervised learning. This problem is regarded as a semiparametric estimation problem with missing data. Under discriminative probabilistic models, unlabeled data was considered useless for improving estimation accuracy. Recently, a weighted estimator using unlabeled data has been shown to achieve better prediction accuracy than learning methods using only labeled data, especially when the discriminative probabilistic model is misspecified. That is, improvement under the semiparametric model with missing data is possible when the semiparametric model is misspecified. In this paper, we apply a density-ratio estimator to obtain the weight function in semi-supervised learning. Our approach is advantageous because the proposed estimator does not require well-specified probabilistic models for the probability of the unlabeled data. Based on statistical asymptotic theory, we prove that the estimation accuracy of our method outperforms that of supervised learning using only labeled data. Numerical experiments demonstrate the usefulness of our methods.
UR - http://www.scopus.com/inward/record.url?scp=84881249451&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84881249451&partnerID=8YFLogxK
U2 - 10.1007/s10994-013-5329-8
DO - 10.1007/s10994-013-5329-8
M3 - Article
AN - SCOPUS:84881249451
SN - 0885-6125
VL - 91
SP - 189
EP - 209
JO - Machine Learning
JF - Machine Learning
IS - 2
ER -