Automatic image annotation (AIA), in which a system assigns descriptive keywords to an input image, has long been studied as a shared task and remains important because annotation keywords enable users to access ever-growing image data efficiently. However, the performance of current AIA systems remains low. For many supervised methods, one difficulty of AIA stems from inconsistency of the annotation keywords in the training data, which naturally arises in manual annotation. For example, an image of a person may be annotated with “tourist” or “woman” depending on the scene. This inconsistency makes it difficult to annotate images that may take such similar keywords. To address this difficulty, we propose a modality-converting method that transforms an input image into an encyclopedic text describing the keywords assigned to the image. Through this modality conversion, similar keywords can share features derived from their texts. In the proposed method, we pair images with Wikipedia articles whose titles are the annotation keywords, and we train a modality converter from images to Wikipedia texts using a neural network on the paired data. The method then classifies the converted text into annotation keywords, in the same manner as text classification. Experimental results show that our method, based on the converted text, achieves relatively high performance compared with existing methods.
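The pipeline described above can be sketched in miniature as follows. This is an illustrative toy only, not the paper's implementation: a fixed linear map stands in for the learned neural converter, the "Wikipedia article" vectors are made-up bag-of-words stand-ins, and classification is reduced to cosine similarity against each keyword's article vector.

```python
# Toy sketch of the modality-converting pipeline (all data hypothetical):
# 1) convert an image feature vector into a text-space (bag-of-words) vector,
# 2) classify the converted text by cosine similarity to per-keyword
#    Wikipedia-article vectors, as in simple text classification.

import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0

def convert(image_vec, weights):
    # Image -> text-space vector; `weights` is a (text_dim x image_dim)
    # matrix standing in for the trained neural converter.
    return [dot(row, image_vec) for row in weights]

def annotate(image_vec, weights, articles):
    # `articles`: keyword -> bag-of-words vector of its Wikipedia text.
    text_vec = convert(image_vec, weights)
    return max(articles, key=lambda kw: cosine(text_vec, articles[kw]))

# Hypothetical 3-dim image features and a 4-word text vocabulary.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [0.5, 0.5, 0.0]]
articles = {
    "tourist": [1.0, 0.0, 0.0, 1.0],  # stand-in article vectors
    "woman":   [0.0, 1.0, 0.0, 1.0],
}
print(annotate([0.9, 0.1, 0.0], W, articles))  # -> tourist
```

In the paper's actual setting, the converter is trained on image/Wikipedia-text pairs and the classifier operates on the generated encyclopedic text; the sketch only shows how sharing a common text space lets similar keywords (here, "tourist" and "woman") be compared through their article representations.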