39 changes: 39 additions & 0 deletions datasets/ivrit-ai-audio-v2.yaml
@@ -0,0 +1,39 @@
Name: ivrit-ai Hebrew Audio v2 Dataset
Description: >
The ivrit-ai audio-v2 dataset is a curated collection of Hebrew speech recordings and metadata designed to advance speech recognition and AI research using high-quality, crowd-sourced and/or institutional audio. Contact ivrit.ai for information about its composition and source domains.
Documentation: https://huggingface.co/datasets/ivrit-ai/audio-v2
Contact: info@ivrit.ai
ManagedBy: ivrit.ai
UpdateFrequency: Updated several times per year
Tags:
- natural language processing
- automatic speech recognition
- speech processing
License: >
ivrit.ai license (modified CC-BY, permitting use for training AI models only and prohibiting deepfake generation; see https://www.ivrit.ai/en/license-faqs/ for full terms)
Citation: >
If you use this dataset, cite:
Marmor, Yanir and Lifshitz, Yair and Snapir, Yoad and Misgav, Kinneret (2025). Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing. Proc. Interspeech 2025, pp. 723–727.
"[ivrit-ai Hebrew Audio v2 Dataset] was accessed on [DATE] at registry.opendata.aws/ivrit-ai-audio-v2"
Resources:
- Description: "Hebrew speech audio and aligned metadata in plain text and other formats. Data is available via Hugging Face Datasets. Contact ivrit.ai for bulk/alternative access methods."
ARN: ""
Region: ""
Type: "External Resource"
Explore:
- "https://huggingface.co/datasets/ivrit-ai/audio-v2"
DataAtWork:
Tutorials:
- Title: "Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing"
URL: https://www.isca-archive.org/interspeech_2025/marmor25_interspeech.pdf
AuthorName: Marmor, Yanir et al.
- Title: "Get to Know the ivrit-ai Hebrew Audio v2 Dataset"
URL: https://github.com/yanirmr/notebooks/blob/main/how_to_use_ivrit_ai_audio_v2.ipynb
AuthorName: ivrit.ai
Tools & Applications: []
Publications:
- Title: "Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing"
URL: https://www.isca-archive.org/interspeech_2025/marmor25_interspeech.pdf
AuthorName: Marmor, Yanir; Lifshitz, Yair; Snapir, Yoad; Misgav, Kinneret
ADXCategories:
- Language
- Speech
41 changes: 41 additions & 0 deletions datasets/ivrit-ai-crowdtranscribe.yaml
@@ -0,0 +1,41 @@
Name: ivrit-ai Crowd-Transcribe Hebrew Speech Dataset
Description: >
The ivrit-ai Crowd-Transcribe v5 dataset is a comprehensive Hebrew speech dataset contributed and vetted by a crowd of volunteers, designed to support the development of open-source Hebrew ASR systems and other language technologies. It is available for training AI models under the ivrit.ai license, which prohibits any use other than AI model training and prohibits deepfake creation. The dataset enables robust Hebrew speech-to-text and downstream research.
Documentation: https://huggingface.co/datasets/ivrit-ai/crowd-transcribe-v5
Contact: info@ivrit.ai
ManagedBy: ivrit.ai
UpdateFrequency: Updated several times per year
Tags:
- natural language processing
- automatic speech recognition
- speech processing
License: >
ivrit.ai license (modified CC-BY, permitting use for training AI models only and prohibiting deepfake generation; see https://www.ivrit.ai/en/license-faqs/ for full terms)
Citation: >
If you use this dataset, cite:
Marmor, Yanir and Lifshitz, Yair and Snapir, Yoad and Misgav, Kinneret (2025). Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing. Proc. Interspeech 2025, pp. 723–727.
"[ivrit-ai Crowd-Transcribe Hebrew Speech Dataset] was accessed on [DATE] at registry.opendata.aws/ivrit-ai-crowdtranscribe"
Resources:
- Description: "Hebrew crowd-sourced transcribed speech audio and aligned metadata in plain text and other formats. Data is available via Hugging Face Datasets. Contact ivrit.ai for bulk/alternative access methods."
ARN: ""
Region: ""
Type: "External Resource"
Explore:
- "https://huggingface.co/datasets/ivrit-ai/crowd-transcribe-v5"
DataAtWork:
Tutorials:
- Title: "Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing"
URL: https://www.isca-archive.org/interspeech_2025/marmor25_interspeech.pdf
AuthorName: Marmor, Yanir et al.
- Title: "Get to Know the ivrit-ai Crowd-Transcribe v5 Dataset"
URL: https://github.com/yanirmr/notebooks/blob/main/how_to_use_ivrit_ai_Crowd_Transcribe_v5.ipynb
AuthorName: ivrit.ai
Tools & Applications: []
Publications:
- Title: "Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing"
URL: https://www.isca-archive.org/interspeech_2025/marmor25_interspeech.pdf
AuthorName: Marmor, Yanir; Lifshitz, Yair; Snapir, Yoad; Misgav, Kinneret

ADXCategories:
- Language
- Speech
40 changes: 40 additions & 0 deletions datasets/ivrit-ai-knesset-plenums.yaml
@@ -0,0 +1,40 @@
Name: ivrit-ai Knesset Plenum Transcriptions Dataset
Description: >
The ivrit-ai Knesset Plenum Transcriptions dataset comprises aligned Hebrew speech and transcriptions from Israeli Knesset parliamentary plenary sessions. The dataset supports research on parliamentary speech, political discourse, and automatic speech recognition.
Documentation: https://huggingface.co/datasets/ivrit-ai/knesset-plenums
Contact: info@ivrit.ai
ManagedBy: ivrit.ai
UpdateFrequency: Updated several times per year
Tags:
- natural language processing
- automatic speech recognition
- speech processing
License: >
ivrit.ai license (modified CC-BY, permitting use for training AI models only and prohibiting deepfake generation; see https://www.ivrit.ai/en/license-faqs/ for full terms)
Citation: >
If you use this dataset, cite:
Marmor, Yanir and Lifshitz, Yair and Snapir, Yoad and Misgav, Kinneret (2025). Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing. Proc. Interspeech 2025, pp. 723–727.
"[ivrit-ai Knesset Plenum Transcriptions Dataset] was accessed on [DATE] at registry.opendata.aws/ivrit-ai-knesset-plenums"
Resources:
- Description: "Hebrew Knesset plenum audio and transcriptions, with aligned metadata. Access via Hugging Face Datasets, or contact ivrit.ai for bulk access."
ARN: ""
Region: ""
Type: "External Resource"
Explore:
- "https://huggingface.co/datasets/ivrit-ai/knesset-plenums"
DataAtWork:
Tutorials:
- Title: "Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing"
URL: https://www.isca-archive.org/interspeech_2025/marmor25_interspeech.pdf
AuthorName: Marmor, Yanir et al.
- Title: "Get to Know the ivrit-ai Knesset Plenums Dataset"
URL: https://github.com/yanirmr/notebooks/blob/main/how_to_use_ivrit_ai_Knesset_Plenums_Dataset.ipynb
AuthorName: ivrit.ai
Tools & Applications: []
Publications:
- Title: "Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing"
URL: https://www.isca-archive.org/interspeech_2025/marmor25_interspeech.pdf
AuthorName: Marmor, Yanir; Lifshitz, Yair; Snapir, Yoad; Misgav, Kinneret
ADXCategories:
- Language
- Speech
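
Review note: each entry's Citation line names a registry URL, and the three files share most of their text, so a copied URL is easy to miss. The sketch below is one way to check that every registry URL mentioned in an entry matches the entry's own file name. It assumes the registry page slug equals the YAML file name without its extension; the helper names (`expected_registry_url`, `registry_urls`, `mismatched_urls`) are hypothetical and not part of any registry tooling.

```python
import re
from pathlib import PurePosixPath

# Host used in the Citation lines of the entries above.
REGISTRY_HOST = "registry.opendata.aws"


def expected_registry_url(yaml_path: str) -> str:
    """Derive the registry URL from the entry's file name (assumed convention)."""
    return f"{REGISTRY_HOST}/{PurePosixPath(yaml_path).stem}"


def registry_urls(entry_text: str) -> list[str]:
    """Return every registry URL mentioned anywhere in the entry text."""
    return re.findall(rf"{re.escape(REGISTRY_HOST)}/[\w-]+", entry_text)


def mismatched_urls(yaml_path: str, entry_text: str) -> list[str]:
    """Return registry URLs in the entry that do not match its own file name."""
    expected = expected_registry_url(yaml_path)
    return [url for url in registry_urls(entry_text) if url != expected]
```

Calling `mismatched_urls(path, text)` on each changed file returns an empty list when the citation URL and file name agree, and the offending URLs otherwise.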