Number of images must correspond to the number of conversations #56

@padmalcom

Description

Hi, thanks for the library and the model, great work!

I tried to implement a chat-like application that allows a discussion about details of a given image.

I got it working, but the application gets really slow because I have to pass the image at every conversational step; otherwise I get "ValueError: Image features and image tokens do not match".

Do you know a way to ask multiple questions to one single image?

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_NAME = "fancyfeast/llama-joycaption-beta-one-hf-llava"
processor = AutoProcessor.from_pretrained(MODEL_NAME)
llava_model = LlavaForConditionalGeneration.from_pretrained(MODEL_NAME, torch_dtype="bfloat16", device_map=0)
llava_model.eval()

with torch.no_grad():
    image1 = Image.open("tank.jpg")
    images = [image1]

    convo = [
        {
            "role": "system",
            "content": "You are a sensor to detect vehicles.",
        }     
    ]
        
    while True:
        user_input = input("You: ")
        if user_input.lower() == "quit":
            break

        convo.append({"role":"user","content": user_input})
        
        convo_string = processor.apply_chat_template(convo, tokenize=False, add_generation_prompt=True)

        inputs = processor(text=[convo_string], images=images, return_tensors="pt").to('cuda')
        inputs['pixel_values'] = inputs['pixel_values'].to(torch.bfloat16)

        generate_ids = llava_model.generate(
            **inputs,
            max_new_tokens=512,
            do_sample=True,
            suppress_tokens=None,
            use_cache=True,
            temperature=0.6,
            top_k=None,
            top_p=0.9
        )[0]

        # Trim off the prompt
        generate_ids = generate_ids[inputs['input_ids'].shape[1]:]

        # Decode the caption
        caption = processor.tokenizer.decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
        caption = caption.strip()
        print(caption)
        convo.append({"role": "assistant", "content": caption})  # the model's reply is an assistant turn, not a system turn
        images.append(image1)  # workaround: re-append the same image every turn to avoid the token mismatch
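One possible direction (a sketch, not a confirmed fix for this model): instead of growing the `images` list each turn, attach the image to the first user turn only, so the rendered prompt contains exactly one image placeholder however many questions follow. This assumes a transformers version whose chat template accepts structured content parts like `{"type": "image"}` and `{"type": "text", ...}`; the helper `add_user_message` below is hypothetical, not part of the library.

```python
# Sketch: keep a single image and make sure the prompt carries exactly one
# image placeholder, so `processor(text=..., images=[image1])` can be called
# with the same single image every turn.

def add_user_message(convo, text):
    """Append a user turn, attaching the image part only to the first
    user turn so the template emits exactly one image token."""
    def has_image_part(message):
        content = message.get("content")
        return isinstance(content, list) and any(
            part.get("type") == "image" for part in content
        )

    content = [{"type": "text", "text": text}]
    if not any(has_image_part(m) for m in convo):
        # First user turn: this is where the single image placeholder goes.
        content.insert(0, {"type": "image"})
    convo.append({"role": "user", "content": content})
    return convo


convo = [{"role": "system", "content": "You are a sensor to detect vehicles."}]
add_user_message(convo, "What vehicle is shown?")
convo.append({"role": "assistant", "content": "A tank."})
add_user_message(convo, "What color is it?")

# Count the image parts across all turns: only the first user turn has one.
image_parts = sum(
    1
    for m in convo
    if isinstance(m.get("content"), list)
    for part in m["content"]
    if part.get("type") == "image"
)
print(image_parts)  # 1
```

With this structure, `processor.apply_chat_template(convo, tokenize=False, add_generation_prompt=True)` would render one image token, matching `images=[image1]` on every call. Whether the joycaption chat template actually supports structured content parts needs to be verified against the model's template.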
