TY - GEN
T1 - Convolutional Recurrent Neural Networks for Better Image Understanding
AU - Vallet, Alexis
AU - Sakamoto, Hiroyasu
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2016/12/22
Y1 - 2016/12/22
AB - Although deep convolutional neural networks have brought basic computer vision tasks to unprecedented accuracy, the best models still struggle to produce higher-level image understanding. Indeed, current models for tasks such as visual question answering, often based on recurrent neural networks, have difficulty surpassing baseline methods. We suspect this is due in part to spatial information in the image not being properly leveraged. We attempt to address these difficulties by introducing a recurrent unit able to retain and process spatial information throughout the network. On a simple task, we show that our method is significantly more accurate than alternative baselines that discard spatial information. We also demonstrate that higher-resolution input outperforms lower-resolution input to a surprising degree, even when the input features are less discriminative. Notably, we show that our approach based on higher-resolution input is better able to detect image details, such as the precise number of objects and the presence of smaller objects, while being less sensitive to biases in the label distribution of the training set.
UR - http://www.scopus.com/inward/record.url?scp=85011018155&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85011018155&partnerID=8YFLogxK
DO - 10.1109/DICTA.2016.7797026
M3 - Conference contribution
AN - SCOPUS:85011018155
T3 - 2016 International Conference on Digital Image Computing: Techniques and Applications, DICTA 2016
BT - 2016 International Conference on Digital Image Computing: Techniques and Applications, DICTA 2016
A2 - Liew, Alan Wee-Chung
A2 - Zhou, Jun
A2 - Gao, Yongsheng
A2 - Wang, Zhiyong
A2 - Fookes, Clinton
A2 - Lovell, Brian
A2 - Blumenstein, Michael
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2016 International Conference on Digital Image Computing: Techniques and Applications, DICTA 2016
Y2 - 30 November 2016 through 2 December 2016
ER -