Will InternVL utilize LCL vision encoder?

Great work on visual pre-training!
However, I've noticed that InternVL has poor multi-image and multi-turn conversation ability; see [InternVL issue #223](https://github.com/OpenGVLab/InternVL/issues/223). I see the same behavior in my own use.
So, is there a possibility that LCL pre-trained models could be integrated into the InternVL series in the future?

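This is great work!
However, I've noticed that the multi-image, multi-turn conversation ability of your InternVL series models may still be lacking. This shows up both in my own practice (I tried an ICL-style setup so the model could learn from examples how to handle images, but it confuses the example images with the new image given afterwards) and in [issue 223](https://github.com/OpenGVLab/InternVL/issues/223).
So, is it possible that LCL models will be used in the InternVL series in the future? Personally, I think models trained on interleaved image-text data may be very well suited to multi-image, multi-turn conversation.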

Will InternVL utilize LCL vision encoder? #3