About the Project
Because of the many varieties of Arabic, there can never be “one” authoritative corpus of the language. To achieve the best results for language-learning resources and natural language processing, corpora for both the standard language and the spoken varieties need to be available. To this end, Tunisiya.org is a project, led by Karen McNeil and Miled Faiza, seeking to build a four-million-word corpus of Tunisian Spoken Arabic.
To find out more about the Tunisian Arabic Corpus project, please read the attached summary paper.
There are currently 2,000 texts in the corpus, comprising 818,310 words. The main categories currently included are displayed in the chart on the right. As you can see, internet sources are currently dominant. ("Web" is a category for materials that have been harvested from the internet but not yet sorted into more specific categories.)
Search Tool Improved
May 30, 2014
Several improvements were made to the search tool:
- A "category" field was added, so you can filter results by text category.
- Bug Fix: Added validation to the form, so that it no longer allows users to submit empty queries (which used to lead to errors).
- Bug Fix: Added validation to check that any regular expression entered is valid.
Right now the new search tool is only here on the index page; it will be added to the concordance page in the next few days.
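The two validation checks listed above (rejecting empty queries and rejecting invalid regular expressions) can be sketched roughly as follows. This is only an illustration in Python, not the site's actual code: the function name `validate_query` and the `is_regex` flag are hypothetical, and the real form may perform its checks differently.

```python
import re
from typing import List


def validate_query(query: str, is_regex: bool = False) -> List[str]:
    """Return a list of validation errors; an empty list means the query is acceptable.

    Hypothetical helper for illustration only.
    """
    errors = []
    # Reject empty or whitespace-only queries before doing anything else.
    if not query or not query.strip():
        errors.append("Query must not be empty.")
        return errors
    # If the user asked for a regular-expression search, make sure it compiles.
    if is_regex:
        try:
            re.compile(query)
        except re.error as exc:
            errors.append(f"Invalid regular expression: {exc}")
    return errors
```

Returning a list of messages rather than raising an exception makes it easy for a form to display every problem next to the input field.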
Google Chrome Problem Fixed
October 10, 2012
It was brought to our attention that the concordance results were not displaying correctly in Google Chrome. The issue has now been fixed.
August 31, 2012
We've added a test server to validate changes before they go live. If you've visited the site and been greeted with an unpleasant error message, this should keep that from happening again.
Large Amount of Web Data Added
August 30, 2012
A large number of internet texts have been added to the corpus, using WebBootCaT (through Sketch Engine). These texts will need to be de-duplicated, and may contain non-Tunisian material, but at a first pass they seem to be largely Tunisian. They come from blogs, forum postings, YouTube comments, and other informal sources. There's also some erotic fiction (expanding the breadth of vocabulary represented into previously uncovered territory), and there may be other fiction as well. Fiction would be a great addition to the corpus, since no prose fiction is currently represented, with the exception of folktales. Besides being welcome in their own right, these new texts will also point to sites (especially blogs) where more Tunisian texts can be gathered.
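As a rough idea of what a first de-duplication pass over harvested texts could look like, here is a minimal Python sketch. It is an assumption for illustration, not the procedure actually used on the corpus: the function names `normalize` and `dedupe` are hypothetical, and the approach only catches texts that are identical after whitespace normalization. Near-duplicates (for example, the same post quoted with small changes) would need a fuzzier comparison.

```python
import hashlib
import re


def normalize(text):
    """Collapse runs of whitespace so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip()


def dedupe(texts):
    """Return the texts with exact duplicates (after normalization) removed.

    The first occurrence of each text is kept, in its original order.
    """
    seen = set()
    unique = []
    for text in texts:
        # Hash the normalized text so the set stays small for long documents.
        digest = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```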
Results Now Downloadable
July 24, 2012
A link has been added to the concordance page which allows the search results to be downloaded as a .csv file. The .csv file can then be opened in Microsoft Excel or any text editor for further analysis.
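For those who would rather analyze the download in a script, the sketch below shows one way to load such a file with Python's standard `csv` module. The column names in the sample (`left`, `match`, `right`) are placeholders invented for the example; the actual layout of the downloaded file may differ, so check the file before relying on any particular column.

```python
import csv
import io


def read_results(csv_file):
    """Return all rows of a results file as lists of strings."""
    return list(csv.reader(csv_file))


# In-memory stand-in for a downloaded file; the header names are hypothetical.
sample = io.StringIO("left,match,right\nfoo,bar,baz\n")
rows = read_results(sample)
header, data = rows[0], rows[1:]
```

With a real download, the same function can be given an open file instead, for example `open("results.csv", newline="", encoding="utf-8")`, where the filename is again only an example.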
Search Capability Upgraded
July 23, 2012
The search tool has been upgraded with a morphological parser, allowing users to search for words by the stem and get results for all inflected forms. The parser currently has an accuracy of 88% (recall: 0.868, precision: 0.970, F-score: 0.916). Future versions of the parser will attempt to improve this accuracy.
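For readers unfamiliar with these measures, the F-score is the harmonic mean of precision and recall. The short Python sketch below recomputes it from the two figures quoted above; the function name `f_score` and its `beta` parameter are just illustrative choices, with `beta=1.0` giving the usual balanced F-score.

```python
def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (balanced F1 when beta is 1)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Figures reported for the parser: precision 0.970, recall 0.868.
f1 = f_score(0.970, 0.868)
print(round(f1, 3))  # prints 0.916
```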
The parser is rule-based, with some additional statistical processing to improve results. An informal paper describing its internal workings and how it was developed is available.
Tunisian Arabic Corpus presented at Arabic Corpus Linguistics workshop
April 12, 2011
Karen and Miled presented a paper on the corpus project at the Arabic Corpus Linguistics workshop at Lancaster University in England.
Tunisian Arabic Corpus presented at Jil Jadid conference
February 19, 2011
Karen gave a presentation about the Tunisian Arabic Corpus project at the Jil Jadid conference at the University of Texas at Austin. A video of the presentation is available here: