About the Project

Because of the many varieties of Arabic, there can never be “one” authoritative corpus of the language. To achieve the best results for language-learning resources and natural language processing, corpora for both the standard language and the spoken varieties need to be available. To this end, Tunisiya.org is a project, led by Karen McNeil and Miled Faiza, seeking to build a four-million-word corpus of Tunisian Spoken Arabic.

To find out more about the Tunisian Arabic Corpus project, please read the attached summary paper.

Project Status

There are currently 2,005 texts in the corpus, comprising 859,814 words. The main categories currently included are displayed in the chart on the right. As the chart shows, internet sources are currently dominant. ("Web" is a category for materials that have been harvested from the internet but not yet put into more specific categories.)

Download Function Improved

April 29, 2016

The download function has been improved, so that the elements (before context, search term, and after context) appear in the correct order.

Corpus To Be Presented at University of Vienna: July 6, 2015

June 23, 2015

Karen will be giving the keynote address at the International Symposium on Tunisian and Libyan Arabic Dialects, at the University of Vienna on July 6, 2015. Her presentation is entitled "Tunisian Arabic Corpus: Creating a Written Corpus of an 'Unwritten' Language." She will also be presenting separately about her research on the use of ("in") as a marker of the progressive verbal aspect in Tunisian (and Libyan) Arabic. This work was informed by data from the corpus.

Problem with Search Function

March 23, 2015

Earlier today the corpus was returning erroneous empty results. It's working now, but if anyone experiences problems like this, please contact us. Thank you!

Corpus Presented at Brown University Digital Humanities Workshop

October 18, 2014

Karen had an opportunity to present a poster about the corpus at Brown's Digital Islamic Humanities Workshop. Here's the handout, which provides a brief overview of the project and its current status: TACHandout.pdf.

Search Tool Improved

May 30, 2014

There were several improvements made to the search tool:

  • A "category" field was added, so you can filter results by text category.
  • Bug Fix: Added validation to the form, so that users can no longer submit empty queries (which used to lead to errors).
  • Bug Fix: Added validation to check that any regular expression entered is valid.
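The two validation checks above can be sketched roughly as follows. This is only an illustration, not the site's actual code: the function name `validate_query` and the `is_regex` flag are hypothetical, and the sketch assumes Python-style regular expressions.

```python
import re


def validate_query(query: str, is_regex: bool = False) -> list[str]:
    """Return a list of error messages; an empty list means the query is acceptable.

    Hypothetical helper: the real search form may be structured differently.
    """
    errors = []

    # Reject empty or whitespace-only queries before doing anything else.
    if not query or not query.strip():
        errors.append("Query must not be empty.")
        return errors

    # If the user asked for a regular-expression search, check that it compiles.
    if is_regex:
        try:
            re.compile(query)
        except re.error as exc:
            errors.append(f"Invalid regular expression: {exc}")

    return errors
```

A form handler would run the search only when the returned list is empty, and otherwise show the messages to the user.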

Right now the new search tool is only here on the index page. There were some difficulties adding it to the corpus results page, but we'll try to straighten them out in the next update.

Google Chrome Problem Fixed

October 10, 2012

It was brought to our attention that the concordance results were not displaying correctly in Google Chrome. The issue has now been fixed.

Stability Improved

August 31, 2012

We've added a test server to validate changes before they go live. If you've visited the site and been greeted with an unpleasant error message, this should keep that from happening again.

Large Amount of Web Data Added

August 30, 2012

A large number of internet texts have been added to the corpus, using WebBootCaT (through Sketch Engine). These texts will need to be de-duped, and may contain non-Tunisian material, but at a first pass they seem to be largely Tunisian. They come from blogs, forum postings, YouTube comments, and other informal sources. There's also some erotic fiction (expanding the breadth of vocabulary represented into previously uncovered territory), and there may be other fiction as well. This would be a great addition to the corpus, since there is no prose fiction currently represented, with the exception of folktales. Besides being welcome in their own right, these new texts will also point to sites (especially blogs) where more Tunisian texts can be gathered.
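As an illustration of what de-duplication of harvested texts might involve, here is a minimal sketch that drops exact duplicates after light normalization. It is an assumption about one possible approach, not a description of how the corpus is actually cleaned; the helper names `normalize` and `dedupe` are hypothetical, and near-duplicates (for example, the same post quoted with small changes) would need a fuzzier method.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())


def dedupe(texts):
    """Keep the first occurrence of each text, comparing normalized content.

    Hypothetical helper: only catches duplicates that are identical
    after normalization.
    """
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```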

Results Now Downloadable

July 24, 2012

A link has been added to the concordance page which allows the search results to be downloaded as a .csv file. The CSV file can then be opened in Microsoft Excel or any text editor for further analysis.
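To give a sense of what such a download might contain, the sketch below builds simple concordance rows (before context, search term, after context) and writes them as comma-separated text. The column names, the `window` parameter, and both function names are assumptions made for illustration; the site's real export format may differ.

```python
import csv
import io


def concordance_rows(tokens, term, window=3):
    """Return (before, term, after) tuples for each exact match of `term`."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == term:
            before = " ".join(tokens[max(0, i - window):i])
            after = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((before, tok, after))
    return rows


def to_csv(rows):
    """Serialize rows as CSV text with a header row (hypothetical column names)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["before", "term", "after"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Python's `csv` module handles quoting, so context that itself contains commas or quotation marks still lands in the right column when the file is opened in a spreadsheet.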

Search Capability Upgraded

July 23, 2012

The search tool has been upgraded with a morphological parser, allowing users to search for words by the stem and get results for all inflected forms. The parser currently has an accuracy of 88% (recall: 0.868, precision: 0.970, F-score: 0.916). Future versions of the parser will attempt to improve this accuracy.
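For readers unfamiliar with these measures, the F-score is the harmonic mean of precision and recall. The short sketch below (the function name `f_score` is just for illustration) shows the standard formula and checks it against the figures reported above.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Figures reported for the parser: precision 0.970, recall 0.868.
print(round(f_score(0.970, 0.868), 3))  # prints 0.916
```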

The parser is a rule-based parser, with some additional statistical processing to improve results. For more details on the internal workings of the parser and how it was developed, an informal paper on the topic is available.

Tunisian Arabic Corpus presented at Arabic Corpus Linguistics workshop

April 12, 2011

Karen and Miled presented a paper on the corpus project at the Arabic Corpus Linguistics workshop at Lancaster University in England.

Tunisian Arabic Corpus presented at Jil Jadid conference

February 19, 2011

Karen gave a presentation about the Tunisian Arabic Corpus project at the Jil Jadid conference at the University of Texas at Austin. A video of the presentation is available here: