Datasets

Aidacalc: labeled pictures of math equations
crohme: handwritten mathematical equations
ImageNet
Flickr8k
landlord handwritten name recognition
Street View Text
- Text with bounding boxes from real images. Dictionary given so other words in the image can be parsed out
IAM handwriting
- Motion of handwriting. Each sample point is the pen's position, timestamp, and pressure value
NEOCR: Natural Environment OCR Dataset
KAIST Scene Text
MSRA Text Detection with bounding boxes
Stanford OCR clean subset of words and images, in a CSV file with the pixel values
Chars74k Each character is its own image. Masks for character location also provided
COCO Images with masks of objects to identify
EMNIST Handwritten letters (not just digits)
EgoBody Motion of interacting people from head-mounted devices
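
The IAM handwriting entry above describes each sample point as a position, a timestamp, and a pen pressure value. A minimal sketch of how such a point could be held in Python, assuming hypothetical field names (x, y, t, pressure) rather than the dataset's actual file schema, with made-up sample values:

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class PenSample:
    # Hypothetical field names; the dataset's own files may name these differently.
    x: float
    y: float
    t: float          # timestamp (units assumed to be seconds)
    pressure: float   # pen pressure value


def path_length(stroke):
    """Sum of straight-line distances between consecutive sample points."""
    return sum(hypot(b.x - a.x, b.y - a.y) for a, b in zip(stroke, stroke[1:]))


def duration(stroke):
    """Elapsed time from the first to the last sample point."""
    return stroke[-1].t - stroke[0].t if stroke else 0.0


# Made-up stroke for illustration only, not real dataset values.
stroke = [
    PenSample(0.0, 0.0, 0.00, 0.4),
    PenSample(3.0, 4.0, 0.01, 0.6),
    PenSample(3.0, 10.0, 0.02, 0.5),
]
print(path_length(stroke), duration(stroke))
```

Keeping the timestamp on every point is what separates this kind of data from a plain image: writing speed and stroke order can be recovered from it.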

ami: audio recordings of meetings
cmudict arctic voice: recordings with sentence labels
commonvoice: speech transcriptions
Speech commands: individual words
timit: audio transcription with labels at the sentence, word and phoneme level
CallHome talkbank: audio transcriptions of phone calls, starting mid conversation. Utterance-level labels and timing for the audio

commonsense dialogues: JSON of conversations (4-6 turns between speakers)
ConvoKit
- persuasionforgood-corpus: Opens with an introduction ("hello, how are you?")
- tennis-corpus: Reporter jumps right into conversation. Each question/answer is a conversation and they go in order
- iq2-corpus: Full-on debate with speaker introductions
- friends-corpus: Jumps in mid conversation
- gap-corpus: Mid conversation, talking about the 15 most important items in a hypothetical plane crash. $=laughter
- casino-corpus: Opens with an introduction ("hello, how are you?"). Campsite neighbors negotiate for food, water, firewood, etc.
Project Gutenberg plain text books
OpenWebText replication of OpenAI's WebText
- nanoGPT reproducing GPT-2
HotpotQA chain of thought question and answer using search engine
- ReAct combines chain of thought with actions
OpenAssistant ChatGPT replacement
https://github.com/czyssrs/FinQA Financial question and answer dataset from hired financial experts on company filings
https://huggingface.co/datasets/next-tat/TAT-QA/viewer/default/test Financial question and answer based on tables created by human financial experts

Reference

WordNet - how do words relate to each other in terms of hierarchy
ConceptNet - how do words relate to each other in terms of usage (ex: A person can make coffee)
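
The two kinds of relation above (hierarchy versus usage) can be contrasted with a small toy sketch. All entries below are invented for illustration and are not pulled from WordNet or ConceptNet. The relation labels are placeholders, and neither resource's real API is used.

```python
# Toy data only: invented for illustration, not taken from either resource.
HYPERNYM = {  # hierarchy: word -> its more general parent
    "espresso": "coffee",
    "coffee": "beverage",
    "beverage": "food",
}
USAGE = [  # usage: (subject, relation, object) triples
    ("person", "CapableOf", "make coffee"),
    ("coffee", "UsedFor", "staying awake"),
]


def hypernym_chain(word):
    """Walk up the hierarchy from a word to its most general ancestor."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def usage_facts(subject):
    """Return the (relation, object) pairs recorded for a subject."""
    return [(rel, obj) for subj, rel, obj in USAGE if subj == subject]


print(hypernym_chain("espresso"))
print(usage_facts("person"))
```

A hierarchy lookup answers "what kind of thing is this?", while a usage lookup answers "what is done with or by this?".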