An easy-to-use image captioning tool for dataset annotation. Need to caption your images? Inscriptor has you covered. It is built on Hugging Face's BLIP-2 models and uses their zero-shot, instructed vision-to-language generation. It runs best with the blip2-opt-6.7b-coco model, but you can swap in any other model in the BLIP-2 family with lower or higher hardware requirements.
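The captioning itself boils down to a single BLIP-2 generate call per image. Below is a minimal sketch, assuming the Hugging Face Transformers API; the file name example.jpg is a placeholder:

```python
# Minimal sketch of one BLIP-2 captioning call (assumes Hugging Face
# Transformers; the model name is the default mentioned in this README).
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-6.7b-coco")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-6.7b-coco",
    torch_dtype=torch.float16,  # half precision keeps the 6.7B model within ~24GB VRAM
).to(device)

image = Image.open("example.jpg").convert("RGB")  # placeholder path
inputs = processor(images=image, return_tensors="pt").to(device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)
```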
Hardware requirements: 24GB VRAM and 48GB RAM (for the default blip2-opt-6.7b-coco model).
Install PyTorch with CUDA 11.8 support, either via pip:

```bash
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

or via conda:

```bash
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
```

or follow the install selector at https://pytorch.org/get-started/locally/.
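After installing, a quick sanity check (a generic snippet, not part of this repo) confirms the CUDA build is active:

```python
import torch
print(torch.__version__)          # should report a +cu118 build
print(torch.cuda.is_available())  # True if the GPU is visible to PyTorch
```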
Coming soon
Inside inscriptor-mass-captioning.ipynb, point imagesDirectory to your local dataset directory. The dataset directory should contain images in any of these formats: .jpg, .jpeg, .png, .webp, or .gif, and may contain subfolders with their own images; Inscriptor searches the directory tree recursively. Subfolder names can be used as extra tokens in the captions: for example, images inside a folder named cat get the token cat, and images inside a folder named dog get the token dog. For each image, Inscriptor writes the generated caption to a .txt file with the same filename, in the same location as the original image.
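Here is a minimal sketch of that workflow, under stated assumptions: imagesDirectory matches the notebook variable, caption_image() is a hypothetical helper standing in for the BLIP-2 call shown above, and exactly how Inscriptor joins the folder token to the caption may differ:

```python
# Sketch of the recursive captioning loop described above.
from pathlib import Path

imagesDirectory = Path("/path/to/dataset")  # placeholder path
extensions = {".jpg", ".jpeg", ".png", ".webp", ".gif"}

for image_path in imagesDirectory.rglob("*"):
    if image_path.suffix.lower() not in extensions:
        continue
    caption = caption_image(image_path)  # hypothetical helper wrapping BLIP-2
    # Use the parent folder name (e.g. "cat", "dog") as an extra token.
    if image_path.parent != imagesDirectory:
        caption = f"{image_path.parent.name}, {caption}"
    # Write the caption next to the image: same filename, .txt extension.
    image_path.with_suffix(".txt").write_text(caption, encoding="utf-8")
```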