About
MMGR Workshop
Welcome to the 2nd MMGR Workshop, co-located with ACM Multimedia 2024!
Information generation (IG) and information retrieval (IR) are two key approaches to information acquisition, i.e., producing content either via generation or via retrieval. While traditional IG and IR have achieved great success within the scope of language, the under-utilization of data sources in other modalities (i.e., text, images, audio, and video) hinders IG and IR techniques from reaching their full potential and thus limits their real-world applications. Given that our world is replete with multimedia information, this workshop encourages the development of deep multimodal learning for IG and IR research. Benefiting from a variety of data types and modalities, a number of recent techniques, such as DALL-E, Stable Diffusion, GPT-4, and Sora, have greatly advanced multimodal IG and IR. Despite the great potential shown by multimodal-empowered IG and IR, many challenges and open questions remain in these directions. With this workshop, we aim to encourage more exploration in Deep Multimodal Generation and Retrieval, providing a platform for researchers to share insights and advancements in this rapidly evolving domain.
Calls
Call for Papers
In this workshop, we welcome three types of submissions:
- Position or perspective papers (same format and template as the main conference; the manuscript length is limited to one of two options: a) 4 pages plus a 1-page reference section; or b) 8 pages plus up to 2 pages of references): original ideas, perspectives, research vision, and open challenges in the topics of the workshop;
- Featured papers (title and abstract of the paper, plus the original paper): already published papers, or papers summarizing existing publications in leading conferences and high-impact journals, that are relevant to the topics of the workshop;
- Demonstration papers (up to 2 pages in length, plus unlimited pages for references): original or already published prototypes and operational evaluation approaches in the topics of the workshop.
A Best Paper Award will be selected from the accepted papers and announced during the workshop.
Topics and Themes
Topics of interest include, but are not limited to:
- Multimodal Semantics Understanding, such as
  - Vision-Language Alignment Analysis
  - Multimodal Fusion and Embeddings
  - Large-scale Vision-Language Pre-training
  - Structured Vision-Language Learning
  - Visually Grounded Interaction of Language Modeling
  - Commonsense-aware Vision-Language Learning
  - Visually Grounded Language Parsing
  - Semantic-aware Vision-Language Discovery
  - Large Multimodal Models
- Generative Models for Image/Video Synthesis, such as
  - Text-free/conditioned Image Synthesis
  - Text-free/conditioned Video Synthesis
  - Temporal Coherence in Video Generation
  - Image/Video Editing/Inpainting
  - Visual Style Transfer
  - Image/Video Dialogue
  - Panoramic Scene Generation
  - Multimodal Dialogue Response Generation
  - LLM-empowered Multimodal Generation
- Multimodal Information Retrieval, such as
  - Image/Video-Text Compositional Retrieval
  - Image/Video Moment Retrieval
  - Image/Video Captioning
  - Image/Video Relation Detection
  - Image/Video Question Answering
  - Multimodal Retrieval with MLLMs
  - Hybrid Synthesis with Retrieval and Generation
- Explainable and Reliable Multimodal Learning, such as
  - Explainable Multimodal Retrieval
  - Relieving Hallucination of LLMs
  - Adversarial Attack and Defense
  - Multimodal Learning for Social Good
  - Multimodal-based Reasoning
  - Multimodal Instruction Tuning
  - Efficient Learning of MLLMs
Submission Instructions
Page limits include diagrams and appendices. Submissions should be written in English, and formatted according to the current ACM two-column conference format. Authors are responsible for anonymizing the submissions. Suitable LaTeX, Word, and Overleaf templates are available from the ACM Website (use “sigconf” proceedings template for LaTeX and the Interim Template for Word).
Review Process
All submissions will be peer-reviewed by at least two expert reviewers in the field. The review process will be two-way anonymized. Acceptance will depend on relevance to the workshop topics, scientific novelty, and technical quality. Accepted workshop papers will be published in the ACM Digital Library.
Important Dates
- Paper Submission: July 19, 2024 → Aug 7, 2024 (AoE)
- Notification of Acceptance: August 12, 2024 (AoE)
- Camera-ready Submission: August 19, 2024 (AoE) [Firm Deadline]
- Workshop Date: October 28, 2024 (AoE)
Papers
Accepted Papers
- Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization
  Richard Luo, Austin Peng, Adithya Vasudev, Rishabh Jain
- A Learnable Agent Collaboration Network Framework for Personalized Multimodal AI Search Engine
  Yunxiao Shi, Min Xu, Haimin Zhang, Xing Zi, Qiang Wu
- Meme Generation with Multi-modal Input and Planning
  Ashutosh Ranjan, Vivek Srivastava, Jyotsana Khatri, Savita Bhat, Shirish Karande
- Bridging the Lexical Gap: Generative Text-to-Image Retrieval for Parts-of-Speech Imbalance in Vision-Language Models
  Hyesu Hwang, Daeun Kim, Jaehui Park, Yongjin Kwon
- Staying in the Cat-and-Mouse Game: Towards Black-box Adversarial Example Detection
  Yifei Gao, Zhiyu Lin, Yunfan Yang, Jitao Sang, Xiaoshan Yang, Changsheng Xu
Workshop Schedule
Program
Date: October 28, 2024 (half day). Meeting Room: 216. Please note that the schedule is in the Melbourne time zone. The program at a glance can be downloaded here.
You can also join online via Zoom Meeting (click to enter), Meeting ID: 335 825 3206, Passcode: 118404
| Time | Session |
| --- | --- |
| 09:00 - 09:10 | Welcome Message from the Chairs |
| 09:10 - 09:50 | Keynote 1: Retrieval or Generation? A Perspective from Food Recognition, by Prof. Chong-Wah Ngo |
| 09:50 - 10:30 | Keynote 2: Enhancing Recommendations and Search with Brain Signals, by Prof. Min Zhang |
| 10:30 - 11:00 | Coffee Break |
| 11:00 - 11:12 | Presentation 1: Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization |
| 11:12 - 11:24 | Presentation 2: A Learnable Agent Collaboration Network Framework for Personalized Multimodal AI Search Engine |
| 11:24 - 11:36 | Presentation 3: Meme Generation with Multi-modal Input and Planning |
| 11:36 - 11:48 | Presentation 4: Bridging the Lexical Gap: Generative Text-to-Image Retrieval for Parts-of-Speech Imbalance in Vision-Language Models |
| 11:48 - 12:00 | Presentation 5: Perception-Driven Hand-Object Interaction Generation: Mimicking Human Sensory Understanding, by Yuze Hao |
| 12:00 - 12:12 | Presentation 6: SAMControl: Controlling Pose and Object for Image Editing with Soft Attention Mask, by Yue Zhang |
| 12:12 - 12:20 | Workshop Closing |
Speaker: Chong-Wah Ngo
Abstract: Recognizing food images in free-living environments is a highly challenging task due to the wildly different types of cuisines and dish presentation styles. The fundamental step towards the success of this task is to enable zero-shot prediction of ingredients, portion sizes, calories, and nutrition facts for food prepared by any method and photographed in any environment. The conventional approach is to retrieve the recipe associated with the query food image, where the recipe provides the necessary information (e.g., ingredients and their weights) for estimating food content and amount. Nevertheless, this line of approaches is practically limited by the number of recipes available for retrieval. Recently, owing to the strong reasoning capability of multimodal foundation models, direct prediction of food content and amount has become possible without referring to recipes. Nevertheless, these models generally suffer from hallucination and are difficult to apply to real-world applications. This talk will present research efforts in these two directions. In particular, it will introduce the Food Large Multi-modal Model (FoodLMM), a recent effort to unify different tasks into a foundation model for food recognition. The talk will share insights into the training of FoodLMM and its performance compared to conventional approaches.
Bio: Dr. Chong-Wah Ngo is a professor with the School of Computing and Information Systems, Singapore Management University (SMU). Before joining SMU, he was a professor with the Department of Computer Science, City University of Hong Kong. His main research interests include multimedia search, multi-modal fusion, and video content understanding. He has researched various issues in food computing, including ingredient recognition, open-vocabulary ingredient segmentation, food image generation from recipes, cross-domain food recognition, cross-modal and cross-lingual food image and recipe retrieval, and mobile food-logging apps. He leads the research and development of the FoodAI engine at SMU, which is deployed by the Singapore Health Promotion Board for citizen food logging.
Speaker: Min Zhang
Abstract: In recent years, leveraging the power of brain signals to improve recommendation and search has become a cutting-edge research direction. In this talk, I will first share our latest research advances in exploring the correlation between users' experiences and brain EEG signals, emotions, and user satisfaction in the short-video recommendation scenario. Subsequently, I will demonstrate how user immersion, a new factor that links psychology, brain science, and information retrieval, contributes to more effective short-video recommendation. Moreover, in the realm of search, I will briefly introduce Brain-Aug, a novel approach that decodes semantic information directly from users' brain fMRI signals to augment query representation. We believe there will be more valuable future directions in utilizing brain signals for sophisticated information generation, recommendation, and search.
Bio: Dr. Min Zhang is a full professor in the Dept. of Computer Sci. & Tech., Tsinghua University, where she is the chief director of the AI Lab. She specializes in Web search, personalized recommendation, and user modeling. She has been the Editor-in-Chief of ACM Transactions on Information Systems (TOIS) since 2020, and also serves as General co-Chair of ACM MM'25 and PC co-Chair of CHIIR'24, RecSys'23, CIKM'23, ICTIR'20, WSDM'17, etc. She won the Test-of-Time award at SIGIR'24, the WSDM'22 Best Paper award, an IBM Global Faculty Award, the Okawa Fund, the SIGIR'20 Best Full Paper Honorable Mention, etc.
Committee
Program Committee
- Xun Yang, University of Science and Technology of China
- Chenliang Li, Wuhan University
- Yixin Cao, Singapore Management University
- Peiguang Jing, Tianjin University
- Junyu Gao, Chinese Academy of Sciences
- Ping Liu, A*STAR
- Qingji Guan, Beijing Jiaotong University
- Weijian Deng, Australian National University
- Jieming Zhu, Huawei Noah's Ark Lab
- Jaiyi Ji, Xiamen University
- Yuan Yao, National University of Singapore
- Xiangtai Li, ByteDance/Tiktok
Organization
Workshop Organizers
- Wei Ji, National University of Singapore
- Hao Fei, National University of Singapore
- Yinwei Wei, Monash University
- Zhedong Zheng, University of Macau
- Juncheng Li, Zhejiang University
- Zhiqi Ge, Zhejiang University
- Long Chen, The Hong Kong University of Science and Technology
- Lizi Liao, Singapore Management University
- Yueting Zhuang, Zhejiang University
- Roger Zimmermann, National University of Singapore
Contact
Contact us