Releases · CopticScriptorium/corpora

12 Dec 22:37

v6.2.0

6c2acf0

Late 2025 Release

We are pleased to announce the release of version 6.2.0 of the Coptic Scriptorium data! Our corpora now total 2,375,875 words. This release provides significant new annotated data in both the Bohairic and Sahidic dialects:

New parts of works in Bohairic, including:
- The Life of Shenoute, Parts 2 & 3 (part 1 was released in September)
- The Lausiac History, Parts 2 & 3 (part 1 was released in September)
New corpora:
- The Gospel of Thomas, edited from the manuscript by Paul Dilley
- The Sahidic book of Jonah (with manual edits and corrections to NLP annotations by Stephan Claassen; the automatically processed Jonah is in the Coptic OT corpus)
New documents in the following existing corpora:
- Apophthegmata Patrum
- Shenoute’s work known as Acephalous Work 22
More Arabic and English translations for documents previously published

We are grateful to our collaborators and contributors who have made this release possible, particularly Caroline T. Schroeder, Amir Zeldes, Nicholas Wagner, and Paul Dilley as well as Nina Speranskaja, Rebecca Krawiec, Christine Luckritz Marquis, Stephan Claassen, Philippe Zaher, and Safaa Mahfouz. We also want to thank Hany Takla and the St. Shenouda the Archimandrite Coptic Society for their collaborations and support. Additionally, we thank our donors for contributions that made much of the work on this release possible. Please consider supporting Coptic Scriptorium as we navigate the new funding environment in the USA.

As with all our releases, the raw machine-readable data for all corpora—including morphological and syntactic annotations, as well as named entity recognition—are available in our GitHub repository. Data can be downloaded in a variety of popular formats to suit your research needs.

You can read and browse entire documents in an online portal. Our corpora are also linked in entries on the Coptic Dictionary Online.

For searching, including advanced linguistic queries, you can explore the data using our ANNIS server. To help you get started, check out our tutorial with query tips and a convenient cheat sheet. Currently, the Arabic translations are available only in ANNIS.

19 Sep 19:06

amir-zeldes

v6.1.0

4b73601

Summer 2025 Release

We are pleased to announce the release of version 6.1.0 of the Coptic Scriptorium data! Besides numerous corrections, this release focuses on two expansions:

A major update to the Sahidic Old Testament corpus derived from the CoptOT project's Old Testament base text. This data now includes 911 chapters from the Old Testament, covering over 625,000 tokens of Sahidic Coptic, up from the previous version's 729 chapters with only 459,000 tokens - an increase of over 180 chapters for which no material was previously available, and 166,000 tokens.
Parts of new works in Bohairic, including:
- The Life of Shenoute, Part 1
- Lausiac History, Part 1
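
The growth figures quoted above for the Sahidic Old Testament corpus can be checked with a few lines of arithmetic. This is only an illustrative sketch: the token counts are the rounded figures from the announcement, so the percentage is approximate, and the variable names are ours.

```python
# Figures as quoted in the release note (token counts are rounded there).
prev_chapters, new_chapters = 729, 911
prev_tokens, new_tokens = 459_000, 625_000

added_chapters = new_chapters - prev_chapters  # chapters with no earlier material
added_tokens = new_tokens - prev_tokens        # newly covered tokens

# Relative growth of the token count, in percent (approximate).
token_growth_pct = round(100 * added_tokens / prev_tokens, 1)

print(added_chapters, added_tokens, token_growth_pct)
```

With these inputs the differences come to 182 chapters and 166,000 tokens, which matches "over 180 chapters" and "166,000 tokens" in the text.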

We are grateful to our collaborators and contributors who have made this release possible, particularly Caroline T. Schroeder and Amir Zeldes, as well as Randy Komforty, Lawrence Rafferty, Nina Speranskaja, and Nicholas Wagner. We also want to thank the St. Shenouda the Archimandrite Coptic Society for their generous support.

For advanced linguistic queries, you can explore the data using our ANNIS server. To help you get started, check out our tutorial with query tips and a convenient cheat sheet.

21 Dec 11:57

amir-zeldes

v6.0.0

89cf7d7

Fall 2024 Release

We are pleased to announce the release of version 6.0.0 of the Coptic Scriptorium data! Our corpus has been dramatically expanded in this release, now exceeding 2.2 million tokens of searchable, linguistically annotated Coptic texts. Among the highlights of this update is the exponential growth of our Bohairic corpus, now comprising approximately 750,000 words and featuring translated texts such as the Bohairic Bible (Old and New Testament), as well as original works such as the Life of Isaac. This milestone brings substantial enhancements to our collections, including modern editions processed with Optical Character Recognition (OCR) technology alongside both new and updated Coptic texts.

New OCR Material and Automatic Tagging

This release includes the addition of OCR-based editions. For the first time, fully automated tagging has been applied to a selection of OCR datasets:

Version 6.0.0 also includes several newly curated corpora, reflecting a diversity of dialects, genres, and textual traditions:

More selections now with parallel Arabic translations:
- Apophthegmata Patrum (AP)
- Pseudo-Theophilus:
  - On Repentance and Continence
- Mercurius:
  - Martyrdom
  - Miracles Part 1 and Part 2
  - Encomium
Additions to Shenoute of Atripe’s Acephalous Work 22 (A22):
- YB 83-96
Bohairic texts:
- Old Testament (automatic processing)
- New Testament (automatic processing)
- Life of Isaac (with manual corrections)
- Bohairic Bible selection manually segmented and tagged:

We are grateful to our collaborators and contributors who have made this release possible, particularly Caroline T. Schroeder and Amir Zeldes, as well as Randy Komforty, Lydia Bremer-McCollum, Lawrence Rafferty, Nina Speranskaja, and Nicholas Wagner. We also want to thank the National Endowment for the Humanities for their ongoing support.

For advanced linguistic queries, you can explore the data using our ANNIS server. To help you get started, check out our tutorial with query tips and a convenient cheat sheet.

30 May 12:15

LCBM0828

v5.0.0

d4d786c

Spring 2024 Release

We are pleased to announce release 5.0.0 of Coptic Scriptorium! Our data now includes over 1,288,229 tokens of searchable, linguistically analyzed Coptic data from dozens of ancient Coptic works.

This release also marks the introduction of Bohairic Coptic data to our corpus holdings: the repository now contains Bohairic Bible materials, covering Mark 1-16 and 1 Cor. 1-16, with manually reviewed segmentation for the entire corpus, and manual tagging and treebanking for chapters 1-5 in each book. Segmentation and tagging were reviewed in collaboration with Nicholas Wagner, and treebanking was done in collaboration with Nina Speranskaja. As a result of this work, we are in the process of compiling new NLP tools and guidelines specifically for Bohairic.

In addition, the release includes corrections and updates to existing corpora as well as the addition of several new Sahidic works and documents:

A. Sections of five works by Shenoute of Atripe:

B. New documents were added to existing works:

C. Newly added translation spans for Pistis Sophia, aligned by Randy Komforty

These join the newly treebanked and tagged Bohairic data, which can be found here:

We are very grateful to all of our collaborators and contributors, without whom this project could not function. We welcome Nicholas Wagner to the team and warmly thank Randy Komforty for his work on Pistis Sophia, and Nina Speranskaja for her treebanking work.

As with all our releases, raw machine-readable data for all corpora, including morphological and syntactic analysis as well as named entity recognition and entity linking (currently only for Sahidic), can be found in this GitHub repository in a variety of popular formats:
https://github.com/CopticScriptorium/corpora

You can also search for complex linguistic annotations in the data using our ANNIS server - please see our tutorial here to get started with some query tips and a helpful cheat sheet:
https://copticscriptorium.org/ANNIS_tutorial

25 Oct 13:52

amir-zeldes

v4.5.0

2877117

Fall 2023 corpus release

We are pleased to announce release 4.5.0 of Coptic Scriptorium! Our data now includes over 1,278,500 tokens of searchable, linguistically analyzed Coptic data from dozens of ancient Coptic works (an increase of over 11,500 tokens from the previous release).

This release corrects a large number of consistency errors identified in our existing data, and also adds some new documents:

Revisions to five works of Besa:
Sections of three works by Shenoute of Atripe:
New documents were added to existing works:
- Acephalous Work 22
- Apophthegmata Patrum
Newly treebanked data with syntactic gold standard annotations for 1 Corinthians 7

We are very grateful to all of our collaborators and contributors, without whom this project could not function. We welcome Christine Ayad, Lydia Bremer-McCollum, Adeline Harrington, and Nina Speranskaja.

As with all releases, raw machine-readable data for all corpora, including morphological and syntactic analysis as well as named entity recognition and entity linking, can be found in this GitHub repository in a variety of popular formats.

You can also search for complex linguistic annotations in the data using our ANNIS server - please see our new tutorial here to get started with some query tips and a helpful cheat sheet:

https://copticscriptorium.org/ANNIS_tutorial

14 Oct 15:09

amir-zeldes

v4.4.0

78730ee

Fall 2022 corpus release

We are pleased to announce release 4.4.0 of Coptic Scriptorium. Our data now includes over 1,267,000 tokens of searchable, linguistically analyzed Coptic data from dozens of ancient Coptic works (an increase of almost 100,000 tokens from the previous release). We are very grateful to all of our collaborators and contributors, without whom this project could not function.

This release corrects a large number of consistency errors identified in our existing data, and also adds some new documents:

Sections of three works by Shenoute of Atripe:
New documents added to existing works:
- Acephalous Work 22
- Apophthegmata Patrum
The remaining books 2-4, as well as the postscript of Pistis Sophia, which are now added to the previously released book 1 in our online interfaces
Newly treebanked data with syntactic gold standard annotations for the Life of John the Kalybites, part 1

We would like to thank the Marcion Project for making the underlying digitized text of Pistis Sophia available, and all of the annotators for their hard work. Tamara Siuda, Rebecca Krawiec, Philippe Zaher, and Lance Martin contributed, in addition to Amir and Carrie. As our current DHAG grant ends, we would like to give special thanks to Lance, who has been working as our DH specialist on the project since 2019, for doing an amazing job of keeping track of all the data and the various tasks he’s been in charge of over the past three years!

All documents have metadata for word segmentation, tagging, parsing, entities and identities (Wikipedia identifiers for named entities) to indicate whether those annotations are machine annotations only (automatic), checked for accuracy by an expert in Coptic (checked), or closely reviewed for accuracy, usually as a result of treebanking (gold).
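
As an illustration of how these status values might be used when selecting documents, here is a minimal Python sketch. The metadata key names (`segmentation`, `tagging`, `parsing`, `entities`, `identities`), the dictionary layout, and the document names are assumptions made for the example and may not match how the released files store this information. Only the three status levels come from the description above.

```python
from enum import IntEnum


class Status(IntEnum):
    """Review levels, ordered from least to most closely reviewed."""
    AUTOMATIC = 0  # machine annotation only
    CHECKED = 1    # checked for accuracy by an expert in Coptic
    GOLD = 2       # closely reviewed, usually as a result of treebanking


def parse_status(value: str) -> Status:
    """Map a metadata string such as 'gold' to a Status member."""
    return Status[value.strip().upper()]


def meets(meta: dict[str, str], layer: str, minimum: Status) -> bool:
    """True if the given annotation layer is at least at the minimum level."""
    return parse_status(meta[layer]) >= minimum


# Hypothetical metadata for two documents (names and values are invented).
docs = {
    "doc_a": {"segmentation": "gold", "tagging": "gold", "parsing": "gold",
              "entities": "checked", "identities": "checked"},
    "doc_b": {"segmentation": "automatic", "tagging": "automatic",
              "parsing": "automatic", "entities": "automatic",
              "identities": "automatic"},
}

# Keep only documents whose syntactic parses were closely reviewed.
gold_parsed = [name for name, meta in docs.items()
               if meets(meta, "parsing", Status.GOLD)]
print(gold_parsed)
```

Ordering the levels as integers lets a single comparison express "checked or better", which is convenient when a study can tolerate expert-checked but not purely automatic annotations.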

12 May 13:55

amir-zeldes

v4.3.0

a120735

Spring 2022 corpus release

It is our pleasure to announce release 4.3.0 of Coptic Scriptorium corpora, which currently cover over 1,175,000 tokens of searchable, linguistically analyzed Coptic data from dozens of ancient Coptic works. New in this release:

The History of Eustathius and Theopiste (hagiography, annotations by Lance Martin)
Pistis Sophia, book 1 (Gnosticism, annotations by Lance Martin, Tamara Siuda, Caroline T. Schroeder and Amir Zeldes)
Life of Pisentius, part 3 (hagiography, annotations by Tamara Siuda, Lance Martin, Caroline T. Schroeder)

Corrections and additional annotations:

Pilot work adding partial Arabic translations (work by Philippe Zaher)
- Apophthegmata Patrum
- Abraham our Father by Shenoute
Improvements and error corrections to a variety of works (including Because of You Too O Prince of Evil, Dormition of John, Book of Ruth and Homilies of Proclus)

The newly released material encompasses over 57,000 tokens of semi-automatically annotated data. We would like to give special thanks to the Marcion Project for making much of the underlying digitized text available, and the annotators whose hard work has made this release possible.

13 Apr 17:09

amir-zeldes

v4.2.0

6829991

Fall 2021 corpus release

It is our pleasure to announce the latest data release from Coptic Scriptorium, version 4.2.0. This release contains both new Coptic material and additions to older datasets, and it expands our entity annotations and named-entity linking to all of our data, including the semi-automatically annotated Old Testament. This also means automatic updates to all of our interfaces, such as the recently added example usage functionality in the Coptic Dictionary Online, which is linked to the corpora.

The new material, including more digitized data courtesy of the Marcion project as well as manually digitized and corrected OCR data from out-of-print editions, includes:

Encomium of Pseudo-Celestinus on Victor (annotations by Mitchell Abrams and Lance Martin)
Encomium of Pseudo-Flavianus on Demetrius, Archbishop of Alexandria (annotations by Mitchell Abrams, Lance Martin and Amir Zeldes)
Added works by Shenoute of Atripe:
- In the Night (Canons 9, annotations by Lance Martin, Caroline T. Schroeder and Amir Zeldes)
- Because of You Too O Prince of Evil (Discourses 4, annotations by Tamara Siuda, Lance Martin and Caroline T. Schroeder)
Expansions and improvements of existing corpora:
- More Apophthegmata Patrum (work by Christine Luckritz Marquis, So Miyagawa, Caroline T. Schroeder and Amir Zeldes)
- Further material from Shenoute’s works:
  - God Says Through Those Who Are His (including parallel witnesses and new material, data courtesy of David Brakke, annotations by Rebecca Krawiec, Lance Martin, Dana Robinson, Caroline T. Schroeder)
  - Acephalous Work 22 (data courtesy of David Brakke, annotations by Elizabeth Davidson, Rebecca Krawiec, Elizabeth Platte, Caroline T. Schroeder, Amir Zeldes)
- More syntactically annotated gold treebanked data in the Coptic Treebank
- Completely re-annotated Old Testament corpus, based on the base text courtesy of the Digital Edition of the Coptic Old Testament (CoptOT) project – with improved segmentation and parsing, now complete with semi-automatic entity recognition and linking to Wikipedia entries for people and places

02 Apr 19:34

amir-zeldes

v4.1.0

ef141be

Spring 2021 corpus release

We are pleased to announce the following additions/updates, as well as the addition of entity annotations and named entity Wikipedia links to the automatically processed New Testament corpus (sahidica.nt):

Life of John the Kalybites, parts 1 and 2 (annotations by Lance Martin, Tamara Siuda, and Caroline T. Schroeder)
Mysteries of John the Evangelist, parts 1 and 2 (Mitchell Abrams, Lance Martin, Tamara Siuda, Caroline T. Schroeder)
Pseudo-Ephrem, The Asceticon of Apa Ephrem, parts 1 and 2 (Lance Martin and Caroline T. Schroeder)
Pseudo-Timothy of Alexandria Discourses, Discourse on Abbaton, parts 1 and 2 (Elizabeth Davidson, Lance Martin, Caroline T. Schroeder, and Amir Zeldes)
Magical Papyri (Korshi Dosoo, Edward O. D. Love, Markéta Preininger, Lance Martin, Caroline T. Schroeder, and Amir Zeldes)

Expansions and improvements of existing corpora:

Apa Johannes Canons (Diliana Atanassova, Caroline T. Schroeder, Lance Martin, and Amir Zeldes)
Apophthegmata Patrum (Marina Ghaly, Christine Luckritz Marquis, Caroline T. Schroeder)
New release of sahidica.nt, now with semi-automatically disambiguated named entity linking

01 Sep 19:43

amir-zeldes

v4.0.0

49d26d2

Summer 2020 corpus release

We are pleased to announce the following additions/updates, as well as the addition of entity annotations and named entity Wikipedia links:

John of Constantinople, on Penitence and Abstinence
Pseudo-Chrysostom:
- On the Canaanite Woman
- On Susanna
Pseudo-Basil of Caesarea, on the End of the World and the Temple of Solomon
Life of Pisentius, parts 1-2
Expansions and improvements of existing corpora:
- More Apophthegmata Patrum
- Further material from Shenoute’s works:
  - God Says Through Those Who Are His (including parallel witnesses and new material)
  - Some Kinds of People Sift Dirt
