Advanced Seminar on Vision Language Models

Logistics

Instructor: Vishnu Lokhande
Piazza: Email me for the link
Lectures: Wed, 4:00PM - 6:40PM, Davis 113A

Description

This advanced seminar explores vision-language models, which have reshaped modern machine learning through training on massive datasets. It focuses on their reliability and trustworthiness and on the broader AI alignment problem: how to encode human values and goals into generative models so that they are safe, robust, and useful. Students will engage with recent research on vision-language, language, and vision-audio models and examine the pressing challenges of fine-grained controllability, robustness to domain shifts, and scalable continual adaptation. These challenges are especially critical in sensitive domains such as medical imaging, where diagnostic fidelity must be preserved despite variations in acquisition protocols, demographics, or sources.

The course is structured around critical paper readings and expert-led lectures. Each student presents one or two sessions (slides due 24 hours in advance) and completes a research project that culminates in a NeurIPS-style paper. Time permitting, we may also extend discussions to generative applications in visual, audio, and video content creation.

Papers List: Google Doc
Schedule: Wednesdays 4pm to 6pm

Grading

The seminar is graded on a Satisfactory/Unsatisfactory (S/U) basis. A score of 75% or higher is considered Satisfactory, while a score of less than 75% is considered Unsatisfactory. The grading breakdown is as follows: 40% of the grade is based on presentations, and 60% is based on the project.
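As a rough illustration of the arithmetic, the sketch below combines the two components with the stated 40/60 weights and applies the 75% cutoff. It assumes each component is scored on a 0-100 scale and that the overall score is a plain weighted average; the syllabus does not spell out either detail, and the function names are hypothetical.

```python
# Minimal sketch of the S/U computation. Assumptions (not stated in the
# syllabus): both components are scored 0-100 and combined as a simple
# weighted average.

PRESENTATION_WEIGHT = 40  # percent of the overall grade
PROJECT_WEIGHT = 60       # percent of the overall grade
SATISFACTORY_CUTOFF = 75  # overall percentage needed for "S"


def weighted_score(presentation: float, project: float) -> float:
    """Return the overall percentage from the two component scores."""
    return (PRESENTATION_WEIGHT * presentation + PROJECT_WEIGHT * project) / 100


def is_satisfactory(presentation: float, project: float) -> bool:
    """True when the overall score meets the 75% cutoff."""
    return weighted_score(presentation, project) >= SATISFACTORY_CUTOFF


if __name__ == "__main__":
    # 80 on presentations and 70 on the project gives 74.0: Unsatisfactory.
    print(weighted_score(80, 70), is_satisfactory(80, 70))
    # 90 on presentations and 65 on the project gives exactly 75.0: Satisfactory.
    print(weighted_score(90, 65), is_satisfactory(90, 65))
```

Under these assumptions, a strong presentation score cannot compensate for a weak project on its own: with 100 on presentations, the project still needs about 58.3 to reach the cutoff.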

Project

For the project, students work individually or in pairs and should aim to make a small but meaningful contribution to machine learning research. To generate ideas, explore recent conferences such as NeurIPS, ICML, CVPR, and ECCV/ICCV, and follow at least five recent papers within a specific thread. The final deliverable is a 5-page project report in the style of a NeurIPS paper, including an abstract, body, and references. Students can download the LaTeX style file for the report here.