Sarah Savant announced last week that scholars “working in Arabic can now download the entire corpus used by the KITAB team through Zenodo, an Open Science platform that supports Open Access”:
The corpus, available at doi.org/10.5281/zenodo.3082464, features 7,144 manuscripts, 4,288 unique titles, and 1,859 different authors. Within the 4,288 unique titles there are 755,689,541 words, while altogether the project has brought together a total of “1,520,667,360 words.”
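For a rough sense of scale, the figures quoted above can be turned into a few simple ratios. This is only a back-of-the-envelope sketch in Python: the constant and function names are made up for illustration and are not part of any KITAB or OpenITI tooling, and the inputs are just the numbers cited in the announcement.

```python
# Figures as quoted in the announcement (names here are illustrative only).
MANUSCRIPTS = 7_144          # items in the corpus, as quoted
UNIQUE_TITLES = 4_288
AUTHORS = 1_859
WORDS_UNIQUE = 755_689_541   # words within the unique titles
WORDS_TOTAL = 1_520_667_360  # words across the whole corpus


def corpus_ratios():
    """Return a few descriptive ratios derived from the quoted figures."""
    return {
        # Fraction of all words that belong to the unique titles.
        "share_unique": WORDS_UNIQUE / WORDS_TOTAL,
        # Average length of a unique title, in words.
        "words_per_title": WORDS_UNIQUE / UNIQUE_TITLES,
        # Average number of unique titles per author.
        "titles_per_author": UNIQUE_TITLES / AUTHORS,
        # Average number of corpus items per unique title.
        "versions_per_title": MANUSCRIPTS / UNIQUE_TITLES,
    }


if __name__ == "__main__":
    for name, value in corpus_ratios().items():
        print(f"{name}: {value:,.2f}")
```

On these numbers, the unique titles account for roughly half of the total word count, which is consistent with many works appearing in the corpus in more than one version.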
These texts sit within the Open Islamicate Texts Initiative (OpenITI) corpus, to which the KITAB project is a major contributor. Which Arabic texts are they? According to the KITAB FAQ, they focus “on the origins of the written Arabic tradition, in the eighth century, up to roughly the fifteenth century, but aim to include as many texts as possible, so you can also find texts written after 1500.”
Further on the chronological distribution of texts, from the release notes:
The goal of the OpenITI, Savant writes:
…is to build a machine-actionable corpus of premodern texts in Islamicate languages to encourage computational analysis of the Islamicate written tradition. Most of the Arabic texts have been collected from open-access online collections of premodern and modern Arabic texts such as http://shamela.ws/ and http://shiaonlinelibrary.com/
As for what a “first look” at KITAB’s data showed: the corpus is hugely intertextual.
Those using the corpus are asked to cite as: Maxim Romanov and Masoumeh Seydi. 2019. “OpenITI: A Machine-readable Corpus of Islamicate Texts”. Zenodo. doi:10.5281/zenodo.3082464.
You can find more on the KITAB website, kitab-project.org.
Dear Marcia, Thanks very much for this. I’m going to follow up on this in a piece on Arabic OCR software.
With good wishes, Edward
Fantastic. I was thinking of doing the same, I’m glad you will take care of it instead & look forward to reading.