ABSTRACT
Automatic generation of natural language descriptions for images has recently become an important research topic. In this paper, we propose a frame-based algorithm that generates a composite natural language description for a given image. The goal of the algorithm is to describe not only the objects appearing in the image but also the main activities taking place in it and the objects participating in those activities. The algorithm builds on a pre-trained structured prediction model based on a Conditional Random Field (CRF), which generates a set of alternative frames for a given image. We use imSitu, a situation recognition dataset with 126,102 images, 504 activities, 11,538 objects, and 1,788 roles, as a test bed for our algorithm. We ask human evaluators to rate the quality of the descriptions for 20 images from the imSitu dataset. The results show that our composite description contains on average 16% more visual elements than the baseline method and receives a significantly higher accuracy score from the human evaluators.
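
To make the notion of a frame concrete, the following is a minimal illustrative sketch, not the algorithm proposed in the paper. It assumes a frame can be modeled as an activity plus a mapping from roles to objects, with a confidence score attached. The names `Frame`, `describe`, `visual_elements`, and `min_score`, the role names `agent`, `item`, and `place`, and all scores are hypothetical, and the CRF model that would produce the frames is not shown.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    """Hypothetical frame: one activity and its role -> object fillers."""
    activity: str
    roles: dict   # e.g. {"agent": "woman", "item": "bicycle"}
    score: float  # assumed confidence from the upstream model


def describe(frames, min_score=0.2):
    """Join one clause per sufficiently confident frame, best frame first."""
    kept = sorted(
        (f for f in frames if f.score >= min_score),
        key=lambda f: f.score,
        reverse=True,
    )
    clauses = []
    for f in kept:
        agent = f.roles.get("agent")
        subject = f"a {agent}" if agent else "someone"
        clause = f"{subject} is {f.activity}"
        if "item" in f.roles:
            clause += f" a {f.roles['item']}"
        if "place" in f.roles:
            clause += f" in the {f.roles['place']}"
        clauses.append(clause)
    if not clauses:
        return ""
    return "; ".join(clauses) + "."


def visual_elements(frames, min_score=0.2):
    """Collect the distinct activities and objects the description covers."""
    elements = set()
    for f in frames:
        if f.score >= min_score:
            elements.add(f.activity)
            elements.update(f.roles.values())
    return elements


frames = [
    Frame("riding", {"agent": "woman", "item": "bicycle", "place": "street"}, 0.9),
    Frame("smiling", {"agent": "woman"}, 0.5),
    Frame("jumping", {"agent": "dog"}, 0.1),  # below threshold, dropped
]
print(describe(frames))
print(len(visual_elements(frames)))
```

Counting distinct activities and objects, as `visual_elements` does here, is one simple way to compare how many visual elements a composite description covers relative to a single-frame baseline. The evaluation reported above may count them differently.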