Subject
Text-to-image generative models, which synthesize images from a text prompt, have advanced considerably in the last few years and can now generate realistic-looking, novel images. Two main families of models have emerged: diffusion-based models (e.g., Stable Diffusion, DiT, FluxDiT) and autoregressive models (e.g., DALL-E, Parti). While these models are very good at generating images, leveraging them for other tasks in a zero-shot fashion (i.e., without additional training) remains largely unexplored. Such tasks include classification, segmentation, object detection, image-text retrieval, and image-image retrieval.
This master's thesis aims to analyze the internal latent representations of text-to-image models and to study how well their image and text representations are aligned. Since a generative model must transform a text prompt into an image, we hypothesize that its feature space exhibits strong visual-textual alignment, which can be exploited to perform a range of zero-shot tasks with these powerful models.
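To make this concrete, the sketch below illustrates one possible zero-shot application: scoring candidate class prompts by how well they help a diffusion model denoise a given image, and predicting the class with the lowest denoising error. This is a minimal sketch assuming the Hugging Face diffusers library with Stable Diffusion v1.5 as the backbone; the prompt template, timestep, and class names are illustrative placeholders, not a fixed choice of the thesis.

    import torch
    import torch.nn.functional as F
    from diffusers import StableDiffusionPipeline

    device = "cuda" if torch.cuda.is_available() else "cpu"
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float32
    ).to(device)

    @torch.no_grad()
    def class_scores(image, class_names, timestep=500, n_trials=4):
        """Score each candidate class by how well its prompt helps the UNet denoise the image."""
        # Encode the image (a [1, 3, 512, 512] tensor in [-1, 1]) into the VAE latent space.
        latents = pipe.vae.encode(image.to(device)).latent_dist.sample()
        latents = latents * pipe.vae.config.scaling_factor

        scores = []
        for name in class_names:
            # Embed the class prompt with the frozen text encoder.
            tokens = pipe.tokenizer(
                f"a photo of a {name}", padding="max_length",
                max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
            ).input_ids.to(device)
            text_emb = pipe.text_encoder(tokens)[0]

            errs = []
            for _ in range(n_trials):
                # Noise the latent at a fixed timestep and let the UNet predict the noise,
                # conditioned on this class prompt; a lower prediction error suggests the
                # prompt is better aligned with the image content.
                noise = torch.randn_like(latents)
                t = torch.tensor([timestep], device=device)
                noisy = pipe.scheduler.add_noise(latents, noise, t)
                pred = pipe.unet(noisy, t, encoder_hidden_states=text_emb).sample
                errs.append(F.mse_loss(pred, noise).item())
            scores.append(sum(errs) / len(errs))
        return scores  # the class with the lowest score is the prediction

    # Example usage (img: preprocessed [1, 3, 512, 512] tensor in [-1, 1]):
    # classes = ["cat", "dog", "car"]
    # scores = class_scores(img, classes)
    # prediction = classes[scores.index(min(scores))]

The same denoising-error score can, in principle, also rank images for a query caption (image-text retrieval), which is why the thesis focuses on the shared visual-textual feature space rather than on a single task.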
Kind of work
The student will perform the following:
Analyze the alignment of visual and textual features in the shared latent space of text-to-image models
Leverage this feature space to perform zero-shot downstream tasks such as classification and retrieval
The project will employ real-world datasets such as ImageNet and COCO; a minimal evaluation sketch is given below.
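The following sketch shows how a zero-shot classifier could be evaluated on such a dataset. It assumes torchvision's ImageNet validation split is available locally; the dataset path and the predict callable (e.g., a wrapper around the diffusion-based scorer sketched above) are placeholders.

    import torch
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    preprocess = transforms.Compose([
        transforms.Resize(512),
        transforms.CenterCrop(512),
        transforms.ToTensor(),
        transforms.Normalize([0.5] * 3, [0.5] * 3),  # map pixels to [-1, 1] for the VAE
    ])

    def evaluate(predict, root="/path/to/imagenet", max_images=100):
        """Top-1 accuracy of a zero-shot predict(image) -> class index callable."""
        dataset = datasets.ImageNet(root, split="val", transform=preprocess)
        loader = DataLoader(dataset, batch_size=1, shuffle=True)
        correct = total = 0
        for image, label in loader:
            if total >= max_images:
                break
            pred = predict(image)  # image: [1, 3, 512, 512] tensor in [-1, 1]
            correct += int(pred == label.item())
            total += 1
        return correct / max(total, 1)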
Number of Students
1
Expected Student Profile
Strong knowledge of machine learning, AI, and deep learning
Good understanding of Transformer-based models
Strong experience in Python programming and the PyTorch deep learning framework