
    Learning to Build Multimodal Intelligence across Vision, Language and Speech

    File
    Ma_buffalo_0656A_16736.pdf (20.25Mb)
    Date
    2019
    Author
    Ma, Shuang
    Abstract
    Artificial intelligence can already do many things that humans cannot. But how far are we from building "human-like" AI, and what key problems must we solve before we get there? Humans gain and evolve intelligence by absorbing and accumulating knowledge from multiple sources, and the same principle can guide how we build machine intelligence. In this research, we seek to enable machines to have multimodal intelligence like human beings. One key obstacle to building multimodal intelligence is that data from different modalities do not share the same representation. As this dissertation focuses on Machine Learning and Artificial Intelligence research, we seek to build the cornerstone of universal perceptive knowledge, which represents data from different modalities directly in a common space. In such a system, multimodal data can be perceived, understood, and fused into the intelligence-building process. We accomplish this goal in three main steps: learning to perceive knowledge from a single modality, learning to align knowledge across modalities, and learning to fuse knowledge from multiple modalities. In this dissertation, we present novel techniques developed by learning from Vision, Language, and Speech, and we show how these techniques effectively resolve many real-world problems across different modalities.

    Achieving these goals is challenging. First, to give machines the ability to understand, especially images, the performance of deep CNN methods is often compromised by the constraint that the neural network only takes fixed-size input. To accommodate this requirement, input images must be transformed, which impairs the high-level information of the original images through potential loss of fine-grained detail and holistic image layout.
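    The fixed-size-input constraint above can be made concrete with a small sketch (a hypothetical illustration, not code from the dissertation): block-averaging a high-resolution image down to a notional network input size and then upsampling it back shows how much fine-grained detail the forced resize discards.

```python
import numpy as np

# Hypothetical illustration of the fixed-size-input problem: simulate a
# resize to a CNN's input resolution by block-averaging (downsample), then
# nearest-neighbor upsampling back to compare against the original.

def downsample(img, factor):
    # Average each factor x factor block into one pixel.
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upsample(img, factor):
    # Repeat each pixel into a factor x factor block.
    return np.kron(img, np.ones((factor, factor)))

rng = np.random.default_rng(0)
original = rng.random((224, 224))   # stand-in for a high-resolution image
small = downsample(original, 4)     # forced to a 56x56 "network input"
restored = upsample(small, 4)       # back at 224x224, but detail is gone

# Fine-grained information lost to the fixed-size constraint:
loss = np.abs(original - restored).mean()
print(f"mean absolute detail loss: {loss:.3f}")
```

    The holistic-layout point from the abstract is analogous: cropping or warping to a fixed aspect ratio distorts global composition in the same way that block-averaging destroys local detail.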
    Second, it is challenging to discover the appropriate correspondences across modalities during generation. Existing works build upon the Generative Adversarial Network (GAN) so that the distribution of the generated samples is indistinguishable from the distribution of the target set. However, such set-level constraints cannot learn instance-level correspondences (e.g., aligned semantic parts in an object-configuration task). This limitation often results in false positives (e.g., geometric or semantic artifacts) and further leads to the mode-collapse problem. Third, it is even harder to generate samples in a more novel and controllable way, e.g., modeling style during synthesis. Some state-of-the-art approaches model style with a reconstruction loss, but this is insufficient to disentangle style from other factors of variation. Furthermore, current techniques are designed for a specific task (e.g., within a single modality or across two modalities); a unified model that can be applied to a wide range of data types is highly needed.

    The high-level contribution of this dissertation lies in novel techniques that address these challenges. These techniques have been applied to a wide range of applications in a scalable way: image translation, controllable Text-To-Speech synthesis (TTS), multimodal translation, an interactive photography assistant, image aesthetics assessment, and heterogeneous social-network recommendation.
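    The gap between set-level and instance-level constraints described in the abstract can be shown with a toy sketch (all names and values are hypothetical, not from the dissertation): permuting the rows of a perfectly aligned output leaves distribution-level statistics unchanged, so a set-level test passes even though every pairwise correspondence is broken.

```python
import numpy as np

# Toy illustration: a set-level (distribution) match, like a plain GAN
# objective enforces, cannot detect broken instance-level correspondences.

rng = np.random.default_rng(1)
source = rng.random((8, 3))               # 8 source instances, 3 features each
aligned = source.copy()                   # translation with correct pairing
misaligned = np.roll(aligned, 1, axis=0)  # same set, correspondences broken

# Set-level statistic (what a discriminator comparing whole sets can see):
set_gap_aligned = np.abs(aligned.mean(axis=0) - source.mean(axis=0)).max()
set_gap_misaligned = np.abs(misaligned.mean(axis=0) - source.mean(axis=0)).max()

# Instance-level error (pairwise, sample-to-sample):
inst_err_aligned = np.abs(aligned - source).mean()
inst_err_misaligned = np.abs(misaligned - source).mean()

print(set_gap_aligned, set_gap_misaligned)    # both ~0: set-level test passes
print(inst_err_aligned, inst_err_misaligned)  # 0 vs. large: pairing is broken
```

    The false positives the abstract mentions (geometric or semantic artifacts) are exactly samples that satisfy the set-level statistic while violating instance-level alignment.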
    URI
    http://hdl.handle.net/10477/80964
    Collections
    • 2019-09-01 UB Theses and Dissertations (public)

    To add content to the repository or for technical support: Contact Us