Substantial progress has been made in computer vision over the past decades, in particular as a result of the application of machine learning methods. Despite these developments, computer vision remains primitive compared to human vision. It still cannot consistently and robustly solve problems that humans solve with ease, such as recognizing a person from different orientations and under different illuminations, segmenting an object (e.g., a dog) from an image with a cluttered background, or recognizing human activities in low-resolution video. One factor that contributes to this glaring gap between human vision and machine vision is the human brain's ability to encode prior knowledge, to update that knowledge incrementally over a lifetime, and to integrate the prior knowledge with visual measurements for robust visual understanding and inference. Machine learning methods, in contrast, are mostly data-driven. While the transition away from hand-crafted, subjective, and unscalable AI models toward automatic data-driven methods has brought many welcome advantages to computer vision, data-driven learning methods do not generalize well beyond the data used to train them, and they become very brittle when the training data is inadequate.
Parallel to data, many domains possess prior knowledge that governs the target object, its context, and the computer vision task itself. This knowledge manifests itself in different ways. Some of it takes the form of qualitative statements about the properties of the target or its context, while other knowledge appears as rigorous theories and principles that govern the properties of the targets and underpin the computer vision tasks. Existing machine learning methods, unfortunately, cannot effectively exploit such prior knowledge, since they are inherently data-based and lack a mechanism to capture and encode it.
To address this problem, in this work we advocate that the long-term success of computer vision requires a union of prior knowledge and data, and that encoding prior knowledge is crucial for the development of robust and generalizable computer vision algorithms. To this end, we propose a Knowledge-Augmented Visual Learning (KAVL) approach that emulates humans' ability to encode related knowledge and to combine the encoded knowledge with visual measurements to achieve robust and generalizable visual understanding. Through this framework, prior knowledge from various sources is systematically exploited and captured, and is integrated in a principled manner, along with the image data, into the different stages of visual learning.
Xiaoyang Wang, Qiang Ji, "Incorporating Contextual Knowledge to Dynamic Bayesian Networks for Event Recognition", in Proceedings of the 21st International Conference on Pattern Recognition (ICPR), pp. 3378-3381, 2012, (Oral Presentation). [Piero Zamperoni Best Student Paper Award]
Xiaoyang Wang, Qiang Ji, "A Novel Probabilistic Approach Utilizing Clip Attributes as Hidden Knowledge for Event Recognition", in Proceedings of the 21st International Conference on Pattern Recognition (ICPR), pp. 3382-3385, 2012, (Oral Presentation).
Ziheng Wang, Yongqiang Li, Shangfei Wang, and Qiang Ji, "Capturing Global Semantic Relationships for Facial Action Unit Recognition", in Proceedings of the International Conference on Computer Vision (ICCV), 2013.
Ziheng Wang, Shangfei Wang, and Qiang Ji, "Capturing Complex Spatio-Temporal Relations among Facial Muscles for Facial Expression Recognition", in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.