
Large Multimodal Models: Another Step towards AGI

LMMs represent the next leap in AI, combining text, images, and audio into a single system that understands the world more like humans do. This advancement moves us closer to AI that can perform complex tasks across various domains, and a step nearer to Artificial General Intelligence.


The excitement surrounding large language models (LLMs) is rapidly increasing, with industries widely exploring diverse use cases. As a transformative technology, LLMs are being closely monitored for their potential to revolutionize and optimize everything from customer service to complex data analysis to advanced healthcare. Bill Gates recently wrote a blog on how agents will be the next big thing in software. He further claimed that within the next five years, anyone who's online will be able to have a personal assistant powered by artificial intelligence.


While industries and the user community are still embracing the euphoria of Large Language Models (LLMs), the hi-tech industry has already started working on the evolution of Large Multimodal Models (LMMs), a step towards extending the 'emergent' abilities of LLMs beyond text-only input/output.

Large Multimodal Models

We human beings are blessed with multiple sensory and cognitive capabilities, and our intelligence is a collective intelligence derived from multiple sources. As we grow, we learn to use one or more of these 'modes of interaction' to engage with the world around us. The future of AI will likely follow the same path, integrating multiple data modalities at the input and/or output of AI models and leading to the development of LMMs. The input or output modes of interest could be text/language, images, video, audio, sensor data, actuator data, etc. Until recently, the focus was on unimodal models, which could process only one data mode (such as text, speech, or images) at a time.


By combining these different types of data, LMMs can achieve a more holistic understanding of the world, enabling them to perform complex tasks. For instance, an LMM could analyze a video, recognize objects, understand spoken language, and generate descriptive text, all in one seamless pass.

The evolution from LLMs to LMMs builds on several technological innovations. The deep neural network architectures underlying LMMs, such as transformers, have been extended to handle different data modalities. Techniques like Vision Transformers (ViT) for image processing and cross-modal attention mechanisms enable these models to integrate and correlate diverse information from different sources.

LMM for Vision & Language


Since vision and language are dominant modes of human interaction, a key focus of multimodal AI is the 'fusion' of vision and language models. To give a simplified idea of how an LMM works, this article considers the case of an LMM supporting image and text.

The LMM extends the training and inferencing approach applied to large language models. In addition to word-token embeddings (numerical representations of word-tokens in the vector space of encoded words), patch embeddings (representations of image segments in the vector space of encoded images) are used for joint training. A patch is a small tile of the image, and its pixel attributes are enriched with spatial information (its position relative to other tiles in the image). Vision Transformers (ViT) are an adaptation of the traditional transformer architecture for image processing. Instead of processing sequences of words (as in text), ViTs process sequences of image patches. An image is divided into small patches, each of which is treated as a "token", similar to words in text processing. These patches are then embedded into a sequence, allowing the transformer to apply self-attention mechanisms to image data.
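To make the patch-embedding idea concrete, here is a minimal, illustrative sketch in PyTorch of how an image can be cut into patches, linearly projected, and enriched with positional information. The image size, patch size, and embedding width are assumptions chosen for the example, not values from any specific model.

```python
# A minimal ViT-style patch embedding sketch (illustrative only).
# Assumes a 224x224 RGB image, 16x16 patches, and an embedding size of 768.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution cuts the image into non-overlapping patches
        # and linearly projects each patch into the embedding space.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learnable positional embeddings carry each patch's spatial position.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, images):               # images: (batch, 3, 224, 224)
        x = self.proj(images)                # (batch, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (batch, 196, 768) patch "tokens"
        return x + self.pos_embed            # add spatial information

patches = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(patches.shape)   # torch.Size([1, 196, 768])
```

The resulting sequence of patch "tokens" can then be fed to a transformer in the same way as a sequence of word-token embeddings.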

At training time, a text annotation describing the image is fed in along with the image so that the model learns the nature and strength of the association between images and text. Cross-modal attention mechanisms extend the self-attention concept to integrate information from different data types. With many image-text pairs and iterative weight updates based on error feedback (the back-propagation principle), the deep neural network learns the most likely match for a given image or text.
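The sketch below illustrates cross-modal attention in its simplest form: text tokens act as queries that attend over the image-patch embeddings, so each word can gather the visual information most relevant to it. The tensor sizes here are illustrative assumptions, continuing from the patch-embedding example above.

```python
# A minimal sketch of cross-modal attention: text tokens (queries) attend to
# image-patch embeddings (keys/values) so the two modalities can be fused.
import torch
import torch.nn as nn

embed_dim = 768
cross_attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)

text_tokens   = torch.randn(1, 32, embed_dim)    # 32 word-token embeddings
image_patches = torch.randn(1, 196, embed_dim)   # 196 patch embeddings

# Each text token gathers the visual information most relevant to it.
fused, attn_weights = cross_attn(query=text_tokens,
                                 key=image_patches,
                                 value=image_patches)
print(fused.shape)          # torch.Size([1, 32, 768])
print(attn_weights.shape)   # torch.Size([1, 32, 196])
```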


Such pre-trained models are available as so-called vision-language foundation models, which have gained the capability of 'visual perception' and are able to generalize well. These models can output text related to a given image, or an image related to a given text.
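As a simple illustration, a pre-trained vision-language foundation model can be asked to caption an image in a few lines using the Hugging Face transformers library. The checkpoint named below is one publicly available option, and the image file name is a placeholder.

```python
# Sketch: image captioning with a pre-trained vision-language model.
# Assumes the transformers and Pillow packages are installed and the
# BLIP captioning checkpoint named below is available.
from transformers import pipeline

captioner = pipeline("image-to-text",
                     model="Salesforce/blip-image-captioning-base")

# Pass a local file path or URL; "factory_floor.jpg" is a hypothetical image.
result = captioner("factory_floor.jpg")
print(result[0]["generated_text"])   # a one-line description of the scene
```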

Needless to say, training such 'fusion' models requires an enormous amount of labelled data and large compute resources.

Use-cases & solutions


There are potentially many applications of LMMs in different fields such as manufacturing, healthcare, education, media & entertainment, advertising, forensics, etc. For example, in the healthcare domain, LMMs can integrate patient records, medical images, and clinical notes to provide comprehensive diagnostic insights. In the entertainment domain, they can create more immersive and interactive experiences. By understanding and generating multimodal inputs and outputs, these models can also enable more effective communication and experiences for people with disabilities, such as the visually impaired.

Vision foundation models could be used for use cases requiring visual question answering (VQA), visual insights, visual search, visual audit/inspection, image captioning, infographic creation, interactive learning, evidence analysis, and many more.
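For example, a VQA use case can be prototyped with a pre-trained checkpoint in much the same way as the captioning sketch above; the model name and image file below are again assumptions for the sketch.

```python
# Sketch: visual question answering (VQA) with a pre-trained ViLT checkpoint.
from transformers import pipeline

vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

# "assembly_line.jpg" is a hypothetical inspection image.
answers = vqa(image="assembly_line.jpg",
              question="Is the worker wearing a safety helmet?")
print(answers[0]["answer"], answers[0]["score"])
```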

Well-known multimodal AI solutions include Gemini (from Google), GPT-4V (from OpenAI), and the Ray-Ban smart glasses (from Meta).


Challenges

1) One of the significant challenges of large multimodal models is their need for enormous computing resources, which raises concerns about environmental sustainability. A potential approach is to compromise on numerical precision at intermediate stages while still maintaining output quality, thereby reducing energy usage and compute demands (a minimal sketch follows this list).
2) Current LMMs are known to perform well on multimedia content available on the Internet; however, they cannot be directly applied to domain-specific tasks (say, manufacturing or medical imaging).
3) The complexity of integrating multiple data modalities also increases the scope for model biases and requires careful consideration of ethical and privacy concerns.
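As a rough illustration of the precision-versus-compute trade-off mentioned in point 1, the sketch below runs a placeholder network under 16-bit autocast in PyTorch, so intermediate matrix multiplications use lower precision while the outputs remain usable. The network itself is a stand-in, not a real LMM.

```python
# Sketch: trading intermediate numerical precision for lower compute cost
# by running a placeholder model under bfloat16 autocast.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 10))
inputs = torch.randn(8, 768)

with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    # Matrix multiplications run in bfloat16 instead of full 32-bit precision.
    outputs = model(inputs)
print(outputs.dtype)   # typically torch.bfloat16
```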

Conclusion


The evolution of large multimodal models is a significant step in the journey towards more intelligent AI systems. The integration of a broader array of data modalities and cross-modal attention mechanisms paves the way for much more comprehensive and impactful adoption. This evolution highlights the continuous advancement of AI technology and its profound potential to transform the world.

Author: Amit Gupta leads the Cloud Computing, Artificial Intelligence and Machine Learning Practice at Hughes Systique

