In this project, we propose a system that brings together three properties of human intelligence: perception, natural language, and reasoning. We build a general, open-ended, and explainable system that incorporates both visual input and world knowledge. The system maps natural language utterances onto a semantic representation that is directly executable on images or knowledge bases. This representation is thus a program, composed of modular skills also called primitive operations, and the system actively recombines its acquired skills to solve a given task. We design the system according to a novel hybrid approach that combines symbolic and sub-symbolic computation, drawing on the strengths of each; this is realized through the implementation of the primitive operations. While sub-symbolic operations are well suited to handling complex data such as images, symbolic operations excel at higher-level reasoning tasks. We validate the system on several tasks, including visual question answering and grounded dialogue, and propose an innovative application in the form of intelligent safety assistants.
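To make the execution model concrete, the sketch below illustrates the idea of a semantic representation as a program composed of primitive operations. All names and the toy scene are hypothetical illustrations, not part of the proposal; in the actual system, perception primitives would be sub-symbolic (e.g., neural detectors), whereas here they are stubbed symbolically.

```python
# Toy stand-in for the output of a visual front end: a list of detected
# objects with symbolic attributes.
scene = [
    {"shape": "cube", "color": "red"},
    {"shape": "ball", "color": "blue"},
    {"shape": "cube", "color": "red"},
]

# Primitive operations (modular skills). In the proposed hybrid system,
# scan() would be a sub-symbolic perception module; filter_color() and
# count() are symbolic reasoning primitives.
def scan(objects):
    return list(objects)

def filter_color(objects, color):
    return [o for o in objects if o["color"] == color]

def count(objects):
    return len(objects)

# "How many red objects are there?" mapped onto a program: a composition
# of primitive operations executed left to right.
program = [scan, lambda objs: filter_color(objs, "red"), count]

def execute(program, scene):
    result = scene
    for op in program:
        result = op(result)
    return result

print(execute(program, scene))  # -> 2
```

Recombining the same primitives yields programs for new questions (e.g., swapping `count` for a primitive that queries a knowledge base), which is what is meant by the system recombining acquired skills.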