An open-source Large Language and Vision Assistant
LLaVA is a novel, end-to-end trained large multimodal model that combines a vision encoder with Vicuna for general-purpose visual and language understanding. It achieves impressive chat capabilities in the spirit of the multimodal GPT-4 and sets a new state-of-the-art accuracy on the ScienceQA benchmark.
The researchers used GPT-4 to generate an instruction-following dataset containing simulated conversations between a human user and an AI assistant about the content of images. This dataset was used to fine-tune LLaVA, which consists of two foundation models, CLIP for vision and LLaMA for language, joined by an additional network layer that ties the two together. The team also used GPT-4 to evaluate LLaVA's responses in their experiments, asking it to rate LLaVA's output on a scale of 1 to 10. When further fine-tuned on the ScienceQA training dataset, LLaVA achieved an accuracy of 92.53%, a new record for the benchmark. According to the researchers,
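The core of this design, a learned projection that maps the vision encoder's features into the language model's embedding space, can be sketched as follows. This is a minimal illustration with NumPy stand-ins, not the actual LLaVA implementation: the dimensions, patch count, and random placeholder features are all assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions for illustration only:
CLIP_DIM = 768    # hypothetical CLIP ViT feature size
LLM_DIM = 4096    # hypothetical LLaMA token-embedding size
N_PATCHES = 256   # hypothetical number of image patches

# Stand-in for the CLIP vision encoder's output:
# one feature vector per image patch.
image_features = rng.standard_normal((N_PATCHES, CLIP_DIM))

# The additional network layer tying the two foundation models together:
# a trainable linear projection from vision space to language space.
W_proj = rng.standard_normal((CLIP_DIM, LLM_DIM)) * 0.02

# Project image features so they live in the LLM's embedding space.
image_tokens = image_features @ W_proj  # shape: (256, 4096)

# Stand-in for the embedded text of the user's instruction.
text_tokens = rng.standard_normal((12, LLM_DIM))

# The language model then processes image tokens followed by text tokens
# as one sequence, letting it answer questions about the image.
input_sequence = np.concatenate([image_tokens, text_tokens], axis=0)
print(input_sequence.shape)  # (268, 4096)
```

During fine-tuning, gradients flow through this projection so the language model learns to treat the projected patches as ordinary context tokens.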
The technique of fine-tuning large language models (LLMs) on instruction-following datasets has led to performance gains, as demonstrated by ChatGPT, and has prompted researchers to apply it to smaller LLMs. InfoQ recently reported on LLaMA, which has only 7B parameters compared to GPT-3's 175B yet outperforms GPT-3 on many tasks. The next step in the development of AI assistants has been adding the ability to handle image data, as shown by the release of GPT-4 and Visual ChatGPT.