awslabs · yanirmr · Sep 19, 2025 · Sep 19, 2025 · Sep 19, 2025 · Sep 26, 2025
diff --git a/datasets/ivrit-ai-audio-v2.yaml b/datasets/ivrit-ai-audio-v2.yaml
@@ -0,0 +1,40 @@
+Name: ivrit-ai Hebrew Audio v2 Dataset
+Description: >
+  The ivrit-ai audio-v2 dataset is a curated collection of Hebrew speech recordings and metadata designed to advance speech recognition and AI research using high-quality, crowd-sourced and/or institutional audio. Contact ivrit.ai for information about its composition and source domains.
+Documentation: https://huggingface.co/datasets/ivrit-ai/audio-v2
+Contact: info@ivrit.ai
+ManagedBy: ivrit.ai
+UpdateFrequency: Updated several times per year
+Tags:
+  - natural language processing
+  - automatic speech recognition
+  - speech processing
+License: >
+  ivrit.ai license (modified CC-BY, permitting use for training AI models only and prohibiting deepfake generation; see https://www.ivrit.ai/en/license-faqs/ for full terms)
+Citation: >
+  If you use this dataset, cite:  
+  Marmor, Yanir and Lifshitz, Yair and Snapir, Yoad and Misgav, Kinneret (2025). Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing. Proc. Interspeech 2025, pp. 723–727.  
+  "[ivrit-ai Hebrew Audio v2 Dataset] was accessed on [DATE] at registry.opendata.aws/ivrit-ai-audio-v2"
+Resources:
+  - Description: "Hebrew speech audio and aligned metadata in plain text and other formats. Data is available via Hugging Face Datasets. Contact ivrit.ai for bulk/alternative access methods."
+    ARN: ""
+    Region: ""
+    Type: "External Resource"
+    Explore:
+      - "https://huggingface.co/datasets/ivrit-ai/audio-v2"
+DataAtWork:
+  Tutorials:
+    - Title: "Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing"
+      URL: https://www.isca-archive.org/interspeech_2025/marmor25_interspeech.pdf
+      AuthorName: Marmor, Yanir et al.
+    - Title: "Get to Know the ivrit-ai Hebrew Audio v2 Dataset"
+      URL: https://github.com/yanirmr/notebooks/blob/main/how_to_use_ivrit_ai_audio_v2.ipynb
+      AuthorName: ivrit.ai
+Tools & Applications: []
+Publications:
+  - Title: "Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing"
+    URL: https://www.isca-archive.org/interspeech_2025/marmor25_interspeech.pdf
+    AuthorName: Marmor, Yanir; Lifshitz, Yair; Snapir, Yoad; Misgav, Kinneret
+ADXCategories:
+  - Language
+  - Speech
diff --git a/datasets/ivrit-ai-crowdtranscribe.yaml b/datasets/ivrit-ai-crowdtranscribe.yaml
@@ -0,0 +1,41 @@
+Name: ivrit-ai Crowd-Transcribe Hebrew Speech Dataset
+Description: >
+  The ivrit-ai Crowd-Transcribe v5 dataset is a comprehensive Hebrew speech dataset contributed and vetted by a crowd of volunteers, designed to support the development of open-source Hebrew ASR systems and other language technologies. It is available for training AI models only, subject to the ivrit.ai license, which prohibits any other use as well as deepfake creation. The dataset enables robust Hebrew speech-to-text and downstream research.
+Documentation: https://huggingface.co/datasets/ivrit-ai/crowd-transcribe-v5
+Contact: info@ivrit.ai
+ManagedBy: ivrit.ai
+UpdateFrequency: Updated several times per year
+Tags:
+  - natural language processing
+  - automatic speech recognition
+  - speech processing
+License: >
+  ivrit.ai license (modified CC-BY, permitting use for training AI models only and prohibiting deepfake generation; see https://www.ivrit.ai/en/license-faqs/ for full terms)
+Citation: >
+  If you use this dataset, cite:  
+  Marmor, Yanir and Lifshitz, Yair and Snapir, Yoad and Misgav, Kinneret (2025). Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing. Proc. Interspeech 2025, pp. 723–727.  
+  "[ivrit-ai Crowd-Transcribe Hebrew Speech Dataset] was accessed on [DATE] at registry.opendata.aws/ivrit-ai-crowdtranscribe"
+Resources:
+  - Description: "Hebrew crowd-sourced transcribed speech audio and aligned metadata in plain text and other formats. Data is available via Hugging Face Datasets. Contact ivrit.ai for bulk/alternative access methods."
+    ARN: ""
+    Region: ""
+    Type: "External Resource"
+    Explore:
+      - "https://huggingface.co/datasets/ivrit-ai/crowd-transcribe-v5"
+DataAtWork:
+  Tutorials:
+    - Title: "Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing"
+      URL: https://www.isca-archive.org/interspeech_2025/marmor25_interspeech.pdf
+      AuthorName: Marmor, Yanir et al.
+    - Title: "Get to Know the ivrit-ai Crowd-Transcribe v5 Dataset"
+      URL: https://github.com/yanirmr/notebooks/blob/main/how_to_use_ivrit_ai_Crowd_Transcribe_v5.ipynb
+      AuthorName: ivrit.ai
+Tools & Applications: []
+Publications:
+  - Title: "Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing"
+    URL: https://www.isca-archive.org/interspeech_2025/marmor25_interspeech.pdf
+    AuthorName: Marmor, Yanir; Lifshitz, Yair; Snapir, Yoad; Misgav, Kinneret
+
+ADXCategories:
+  - Language
+  - Speech
diff --git a/datasets/ivrit-ai-knesset-plenums.yaml b/datasets/ivrit-ai-knesset-plenums.yaml
@@ -0,0 +1,40 @@
+Name: ivrit-ai Knesset Plenum Transcriptions Dataset
+Description: >
+  The ivrit-ai Knesset Plenum Transcriptions dataset comprises aligned Hebrew speech and transcriptions from Israeli Knesset parliamentary plenary sessions. The dataset supports research on parliamentary speech, political discourse, and automatic speech recognition.
+Documentation: https://huggingface.co/datasets/ivrit-ai/knesset-plenums
+Contact: info@ivrit.ai
+ManagedBy: ivrit.ai
+UpdateFrequency: Updated several times per year
+Tags:
+  - natural language processing
+  - automatic speech recognition
+  - speech processing
+License: >
+  ivrit.ai license (modified CC-BY, permitting use for training AI models only and prohibiting deepfake generation; see https://www.ivrit.ai/en/license-faqs/ for full terms)
+Citation: >
+  If you use this dataset, cite:  
+  Marmor, Yanir and Lifshitz, Yair and Snapir, Yoad and Misgav, Kinneret (2025). Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing. Proc. Interspeech 2025, pp. 723–727.  
+  "[ivrit-ai Knesset Plenum Transcriptions Dataset] was accessed on [DATE] at registry.opendata.aws/ivrit-ai-knesset-plenums"
+Resources:
+  - Description: "Hebrew Knesset plenum audio and transcriptions, with aligned metadata. Data is available via Hugging Face Datasets. Contact ivrit.ai for bulk/alternative access methods."
+    ARN: ""
+    Region: ""
+    Type: "External Resource"
+    Explore:
+      - "https://huggingface.co/datasets/ivrit-ai/knesset-plenums"
+DataAtWork:
+  Tutorials:
+    - Title: "Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing"
+      URL: https://www.isca-archive.org/interspeech_2025/marmor25_interspeech.pdf
+      AuthorName: Marmor, Yanir et al.
+    - Title: "Get to Know the ivrit-ai Knesset Plenums Dataset"
+      URL: https://github.com/yanirmr/notebooks/blob/main/how_to_use_ivrit_ai_Knesset_Plenums_Dataset.ipynb
+      AuthorName: ivrit.ai
+Tools & Applications: []
+Publications:
+  - Title: "Building an Accurate Open-Source Hebrew ASR System through Crowdsourcing"
+    URL: https://www.isca-archive.org/interspeech_2025/marmor25_interspeech.pdf
+    AuthorName: Marmor, Yanir; Lifshitz, Yair; Snapir, Yoad; Misgav, Kinneret
+ADXCategories:
+  - Language
+  - Speech