In the context of multimodal foundation models, can we create models that can process such composite entities as a single modality of input?
Not sure if this is too naive an idea or more of an engineering than a research problem 😅
In the context of multimodal foundation models, can we create models that can process such composite entities as a single modality of input?
Not sure if this is too naive an idea or more of an engineering than a research problem 😅