Learning to Build Multimodal Intelligence across Vision, Language and Speech
Artificial intelligence can already do many things that humans cannot. But how far away are we from building "human-like" AI? What are the key problems that we need to solve before we get there?

Humans gain and evolve intelligence by absorbing and accumulating knowledge from multiple sources, and the same principle can guide how we build machine intelligence. In this research, we seek to enable machines to possess multimodal intelligence like human beings. One key obstacle to building multimodal intelligence is that data from different modalities do not share the same representation. Since this dissertation focuses on Machine Learning and Artificial Intelligence research, we build the cornerstone of universal perceptive knowledge, which directly represents data from different modalities in a common space. In such a system, multimodal data can be perceived, understood and fused into the intelligence-building process. We accomplish the goal of building multimodal intelligence in three main steps: learning to perceive knowledge from a single modality, learning to align knowledge across modalities, and learning to fuse knowledge from multiple modalities. In this dissertation, we present novel techniques developed by learning from Vision, Language and Speech, and show how these techniques can effectively solve many real-world problems across different modalities.

Achieving these goals is very challenging. First, to give machines the ability to understand, especially for images, the performance of deep CNN methods is often compromised by the constraint that the neural network accepts only fixed-size input. To satisfy this requirement, input images must be transformed, which impairs the high-level information of the original images through potential loss of fine-grained details and of the holistic image layout.
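Why the fixed-size constraint arises can be sketched with a small calculation. In a VGG-style CNN (an illustrative assumption, not the dissertation's specific architecture), the first fully connected layer has a weight matrix of fixed dimensions, so the flattened feature map feeding into it must always have the same length:

```python
# Sketch: why a CNN with fully connected layers needs fixed-size input.
# Dimensions below assume a VGG-style network (224x224 input, five
# conv+pool blocks, 512 channels) -- illustrative, not the thesis model.

def conv_output_size(size, kernel=3, stride=1, pad=1):
    """Spatial size after one 3x3 conv with padding 1 (size-preserving)."""
    return (size + 2 * pad - kernel) // stride + 1

def pool_output_size(size, kernel=2, stride=2):
    """Spatial size after one 2x2 max-pool with stride 2 (halves it)."""
    return (size - kernel) // stride + 1

def flattened_features(input_size, num_blocks=5, channels=512):
    s = input_size
    for _ in range(num_blocks):
        s = conv_output_size(s)  # conv keeps the spatial size
        s = pool_output_size(s)  # pooling halves it
    return s * s * channels

# A 224x224 input yields a 7x7x512 map -> 25088 features, matching the
# fixed weight matrix of the first fully connected layer.
print(flattened_features(224))  # 25088
# Any other input size yields a different feature count that the fixed
# FC weights cannot accept -- hence the need to resize or crop images.
print(flattened_features(320))  # 51200
```

Resizing or cropping an arbitrary image down to the expected 224x224 is exactly the transformation that can discard fine-grained details and distort the holistic layout.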
Second, it is challenging to discover the appropriate correspondences across modalities during machine generation. Existing works build upon Generative Adversarial Networks (GANs) so that the distribution of the generated samples is indistinguishable from the distribution of the target set. However, such set-level constraints cannot learn instance-level correspondences (e.g., aligned semantic parts in an object configuration task). This limitation often results in false positives (e.g., geometric or semantic artifacts) and further leads to the mode collapse problem.

Third, it is even harder to generate samples in a more novel and controllable way, e.g., modeling style during synthesis. Some state-of-the-art approaches achieve style modeling with a reconstruction loss, but this is insufficient to disentangle style from other factors of variation. Furthermore, current techniques are designed for a specific task (e.g., within a single modality or across two modalities); a unified model that can be applied to a wide range of data types is highly needed.

The high-level contribution of this dissertation lies in building novel techniques that address these challenges. These technologies have been applied to a wide range of applications in a scalable way. We have explored the use of such primitives to support image translation, controllable Text-To-Speech synthesis (TTS), multimodal translation, an interactive photography assistant, image aesthetics assessment, and heterogeneous social network recommendation.
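The gap between set-level and instance-level constraints can be illustrated with a toy example (1-D values standing in for generated samples; the functions are hypothetical, not the dissertation's actual losses). A set-level objective that only matches distribution statistics is blind to wrong pairings, whereas an instance-level objective penalizes them:

```python
import numpy as np

# Toy contrast between set-level and instance-level objectives,
# assuming 1-D "samples"; names are illustrative only.

def set_level_gap(generated, target):
    """Set-level constraint: match distribution statistics only.
    Two sample sets can agree in mean and std even when every
    individual sample is paired with the wrong target."""
    return abs(generated.mean() - target.mean()) + abs(generated.std() - target.std())

def instance_level_gap(generated, target):
    """Instance-level constraint: each output must match its own target."""
    return float(np.mean(np.abs(generated - target)))

target = np.array([0.0, 1.0, 2.0, 3.0])
shuffled = np.array([3.0, 2.0, 1.0, 0.0])  # same set, wrong pairing

print(set_level_gap(shuffled, target))       # 0.0 -- distributions match
print(instance_level_gap(shuffled, target))  # 2.0 -- every pairing is off
```

A pure GAN discriminator sees only the distribution, like `set_level_gap`: a generator that reshuffles semantic parts (or collapses many inputs onto a few plausible outputs) can still score well, which is why instance-level correspondence needs a separate constraint.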