ABSTRACT
Highlight detection in videos has been widely studied owing to the rapid growth of video content. However, most existing approaches to highlight detection, whether based on handcrafted features or on deep learning, rely heavily on human-curated training data, which is very expensive to obtain and thus hinders scalability to large datasets and unlabeled video categories. We observe that widely available Web images can serve as weak supervision for highlight detection. For example, the top-ranked images returned by a search engine for the query “skiing” may contain a considerable number of positive samples of “skiing” highlights. Motivated by this observation, we propose a novel triplet deep ranking approach to video highlight detection that uses Web images as weak supervision. The approach models the relative preference of highlight scores among highlight frames, non-highlight frames, and Web images through triplet ranking constraints. To deal with noisy Web images, our approach iteratively trains two interdependent deep models (i.e., a triplet highlight model and a pairwise noise model) in a single framework. We train the two models with relative preferences so that they generalize regardless of the categories of the training data. Therefore, our approach is fully category independent and exploits weakly supervised Web images. We evaluate our approach on two challenging datasets and achieve strong results compared with state-of-the-art pairwise ranking support vector machines, a robust recurrent autoencoder, and spatial deep convolutional neural networks. We also verify empirically, through cross-dataset evaluation, that our category-independent model generalizes fairly well even when the two datasets do not share exactly the same categories.
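To make the idea of a triplet ranking constraint concrete, the following is a minimal sketch and not the formulation used in this work. It assumes, purely for illustration, that a highlight frame should score above a Web image, which in turn should score above a non-highlight frame; the abstract does not state this ordering. The function names, the margin value, and the hinge form are all hypothetical.

```python
# Illustrative sketch only: the ordering highlight > Web image > non-highlight
# is an ASSUMPTION made for this example, as are all names and the margin.

def pairwise_hinge(preferred, other, margin=1.0):
    """Hinge ranking loss: zero once `preferred` exceeds `other` by `margin`."""
    return max(0.0, margin - (preferred - other))


def triplet_ranking_loss(s_highlight, s_web, s_non_highlight, margin=1.0):
    """Sum of two pairwise hinge terms enforcing the assumed score ordering."""
    return (pairwise_hinge(s_highlight, s_web, margin)
            + pairwise_hinge(s_web, s_non_highlight, margin))


if __name__ == "__main__":
    # Scores that respect the assumed ordering incur no penalty.
    print(triplet_ranking_loss(3.0, 1.5, 0.0))
    # Scores in the reverse order are penalized by both terms.
    print(triplet_ranking_loss(0.0, 1.5, 3.0))
```

In a deep ranking setting the three scores would come from a shared network applied to each input, and a loss of this kind would be minimized over many sampled triplets; because it depends only on score differences, it carries no category label.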