Tutorials

1. Vision-Language Models for Multimedia Applications: From Foundations to State-of-the-Art
Website: https://csyanbin.github.io/MMAsian_tutorial.html

Vision-Language Models (VLMs) are revolutionizing the multimedia landscape by integrating visual and textual data for a wide range of applications, such as image captioning, visual question answering (VQA), and multimodal retrieval. This tutorial will explore both foundational and state-of-the-art VLMs, providing attendees with a deep understanding of how these models function and how they can be applied effectively.

Participants will explore the evolution of VLMs from classical architectures based on CNNs and RNNs to cutting-edge transformer-based models such as CLIP and BLIP. The tutorial will also focus on key challenges such as scaling these models, optimizing their performance, and improving their interpretability for real-world multimedia applications.
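
To give a concrete flavor of the models covered, the short sketch below performs zero-shot image-text matching with CLIP, the core operation behind multimodal retrieval. It is a minimal illustration only: the tutorial does not prescribe a library or checkpoint, so the Hugging Face transformers library, the openai/clip-vit-base-patch32 checkpoint, and the sample image URL are all assumptions made here.

    # Minimal sketch: zero-shot image-text matching with CLIP.
    # Library, checkpoint, and image URL are illustrative assumptions,
    # not choices made by the tutorial itself.
    from PIL import Image
    import requests
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # Load an example image (any RGB image works).
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)

    # Candidate captions to rank against the image.
    captions = ["a photo of two cats", "a photo of a dog", "a city skyline"]
    inputs = processor(text=captions, images=image,
                       return_tensors="pt", padding=True)

    outputs = model(**inputs)
    # logits_per_image holds one image-text similarity score per caption;
    # softmax turns them into a probability distribution over captions.
    probs = outputs.logits_per_image.softmax(dim=1)
    for caption, p in zip(captions, probs[0].tolist()):
        print(f"{p:.3f}  {caption}")

The same scored similarities that rank captions for one image can be computed over a gallery of images for one text query, which is how CLIP-style models are typically used for multimodal retrieval.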


Contact person:
Yanbin Liu (yanbin.liu@aut.ac.nz)
Speaker:
Yanbin Liu



2. Understanding Australian Sign Language
Website: https://uq-cvlab.github.io/MultiMediaAsia2024-Tutorial-Auslan/

This tutorial provides a comprehensive overview of Australian Sign Language (Auslan), examining its unique linguistic properties and current multimedia applications. Through a combination of theoretical lectures and hands-on sessions, participants will learn about the latest advances in sign language recognition, translation, and generation technologies. The tutorial addresses the societal impact of these innovations, focusing on the ways multimedia solutions can improve accessibility for the Deaf community. Attendees will also gain practical experience with Auslan interpretation tools, exploring how advances in machine learning and computer vision contribute to more inclusive communication technologies, particularly in the education and public service sectors.


Contact person:
Heming Du (heming.du@uq.edu.au)
Speakers:
Xin Yu
Peike Li
Heming Du
Xin Shen