Open Access Release of Arabic Corpus: 1.5 Billion Words

Sarah Savant announced last week that scholars “working in Arabic can now download the entire corpus used by the KITAB team through Zenodo, an Open Science platform that supports Open Access.”

The corpus, available at doi.org/10.5281/zenodo.3082464, features 7,144 manuscripts, 4,288 unique titles, and 1,859 different authors. Within the 4,288 unique titles there are 755,689,541 words, while altogether the project has brought together a total of “1,520,667,360 words.”

These texts sit within the Open Islamicate Texts Initiative (OpenITI) corpus, to which the KITAB project is a major contributor. Which Arabic texts are they? According to the KITAB FAQ, they focus “on the origins of the written Arabic tradition, in the eighth century, up to roughly the fifteenth century, but aim to include as many texts as possible, so you can also find texts written after 1500.”

The release notes give further detail on the chronological distribution of the texts.

The goal of the OpenITI, Savant writes:

…is to build a machine-actionable corpus of premodern texts in Islamicate languages to encourage computational analysis of the Islamicate written tradition. Most of the Arabic texts have been collected from open-access online collections of premodern and modern Arabic texts such as http://shamela.ws/ and http://shiaonlinelibrary.com/

As for what a “first look” at KITAB’s data showed: the corpus is hugely intertextual.

Those using the corpus are asked to cite as: Maxim Romanov and Masoumeh Seydi. 2019. “OpenITI: A Machine-readable Corpus of Islamicate Texts”. Zenodo. doi:10.5281/zenodo.3082464.

You can find more on the KITAB website, kitab-project.org.