Hi, thanks for the library and the model, great work!
I tried to implement a chat-like application that allows discussing the details of a given image.
I got it working, but the application gets really slow because I have to pass another copy of the image at every conversational step; otherwise I get `ValueError: Image features and image tokens do not match`.
Do you know a way to ask multiple questions about one single image?
Here is my current loop:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_NAME = "fancyfeast/llama-joycaption-beta-one-hf-llava"

processor = AutoProcessor.from_pretrained(MODEL_NAME)
llava_model = LlavaForConditionalGeneration.from_pretrained(MODEL_NAME, torch_dtype="bfloat16", device_map=0)
llava_model.eval()

with torch.no_grad():
    image1 = Image.open("tank.jpg")
    images = [image1]

    convo = [
        {
            "role": "system",
            "content": "You are a sensor to detect vehicles.",
        }
    ]

    while True:
        user_input = input("You: ")
        if user_input.lower() == "quit":
            break

        convo.append({"role": "user", "content": user_input})
        convo_string = processor.apply_chat_template(convo, tokenize=False, add_generation_prompt=True)
        inputs = processor(text=[convo_string], images=images, return_tensors="pt").to("cuda")
        inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

        generate_ids = llava_model.generate(
            **inputs,
            max_new_tokens=512,
            do_sample=True,
            suppress_tokens=None,
            use_cache=True,
            temperature=0.6,
            top_k=None,
            top_p=0.9,
        )[0]

        # Trim off the prompt
        generate_ids = generate_ids[inputs["input_ids"].shape[1]:]

        # Decode the reply
        caption = processor.tokenizer.decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
        caption = caption.strip()
        print(caption)
        convo.append({"role": "assistant", "content": caption})

        # Workaround: without this, the next turn fails with
        # "Image features and image tokens do not match".
        images.append(image1)
```
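
One idea I have been toying with, sketched below but not verified against this checkpoint: encode the image a single time with `get_image_features()` (available on `LlavaForConditionalGeneration` in recent transformers versions) and splice the cached features into the token embeddings via `inputs_embeds`, so the vision tower only runs once no matter how many questions are asked. The placeholder handling is all assumption on my side: that the chat template adds one image placeholder per user turn (which the error above suggests), that the placeholder id is `config.image_token_index`, and that `generate()` called with `inputs_embeds` returns only the newly generated tokens.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_NAME = "fancyfeast/llama-joycaption-beta-one-hf-llava"

processor = AutoProcessor.from_pretrained(MODEL_NAME)
llava_model = LlavaForConditionalGeneration.from_pretrained(MODEL_NAME, torch_dtype="bfloat16", device_map=0)
llava_model.eval()

image1 = Image.open("tank.jpg")
convo = [{"role": "system", "content": "You are a sensor to detect vehicles."}]

with torch.no_grad():
    # Encode the image exactly once. image_features should have shape
    # (num_images, num_patches, hidden_size) after the projector.
    pixel_values = processor.image_processor(images=[image1], return_tensors="pt")["pixel_values"].to("cuda", torch.bfloat16)
    image_features = llava_model.get_image_features(
        pixel_values=pixel_values,
        vision_feature_layer=llava_model.config.vision_feature_layer,
        vision_feature_select_strategy=llava_model.config.vision_feature_select_strategy,
    )

    while True:
        user_input = input("You: ")
        if user_input.lower() == "quit":
            break

        convo.append({"role": "user", "content": user_input})
        convo_string = processor.apply_chat_template(convo, tokenize=False, add_generation_prompt=True)

        # The processor still needs one image per placeholder to expand the
        # image tokens in input_ids; re-preprocessing the same PIL image is
        # cheap, and the resulting pixel_values are discarded below.
        num_user_turns = sum(m["role"] == "user" for m in convo)
        inputs = processor(text=[convo_string], images=[image1] * num_user_turns, return_tensors="pt").to("cuda")

        # Replace the embeddings at the image-token positions with the
        # features computed once above, repeated for every placeholder.
        input_ids = inputs["input_ids"]
        embeds = llava_model.get_input_embeddings()(input_ids)
        mask = input_ids == llava_model.config.image_token_index
        embeds[mask] = image_features.repeat(num_user_turns, 1, 1).reshape(-1, image_features.shape[-1]).to(embeds.dtype)

        # With inputs_embeds (and no input_ids), generate() returns only the
        # newly generated tokens, so no prompt trimming is needed.
        generate_ids = llava_model.generate(
            inputs_embeds=embeds,
            attention_mask=inputs["attention_mask"],
            max_new_tokens=512,
            do_sample=True,
            temperature=0.6,
            top_k=None,
            top_p=0.9,
        )[0]

        answer = processor.tokenizer.decode(generate_ids, skip_special_tokens=True).strip()
        print(answer)
        convo.append({"role": "assistant", "content": answer})
```

Even with this, the language model still re-processes the whole growing conversation on every call; reusing the KV cache across `generate()` calls (e.g. with `DynamicCache`) should cut that down as well, but I have not managed to line that up with the expanded image tokens yet. Is there a supported way to do either of these?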