4M
4M: Massively Multimodal Masked Modeling
...The same model family can classify, segment, detect, caption, and even generate images, with a single interface for both discriminative and generative use. The repository releases code and models for multiple variants (e.g., 4M-7 and 4M-21), emphasizing transfer to unseen tasks and modalities. Training/inference configs and issues discuss things like depth tokenizers, input masks for generation, and CUDA build questions, signaling active research iteration. The design leans into flexibility and steerability, so prompts and masks can shape behavior without bespoke heads per task. ...