At the end of January, Miriam Butt, Melanie Herschel, and Christin Schätzle (members of projects D02 and D03 of the SFB/Transregio 161) organized a workshop on Data Provenance and Annotation in Computational Linguistics in Prague, co-located with the Treebanks and Linguistic Theories (TLT16) conference.
The workshop aimed to bring together researchers from the fields of provenance, data annotation, and data curation with computational linguists who annotate language data, in order to start a discussion between experts from the different fields. Provenance is concerned with understanding how to model, record, and share metadata about the origin of data and about the further sharing or processing that the data has undergone. While provenance has been studied in various domains (e.g., for business applications or in the life sciences), many of its central issues are also of vital interest for computational linguistics.
For example, issues of “data cleaning” and data curation both have serious repercussions for the reproducibility of analyses or experiments. In general, computational linguistic work with data tends to involve several pre-processing steps (stop-lists, data normalization, error correction, or filtering out information that is considered not at-issue). However, these steps are seldom documented or described in detail. Data sets may also undergo several rounds of pre-processing, with information about the successive changes again not well documented. Data may also be automatically or semi-automatically generated. In computational linguistics this often takes the form of automatic or semi-automatic data annotation. This, like manual annotation, is prone to errors and inter-annotator disagreement, leading to rounds of adjudication or correction. This work with data is also generally not documented in detail, so that annotation decisions may be hard to “undo”. Finally, once a data set is released, newer versions will inevitably also have to be released to deal with data expansion or correction. In this case, proper versioning and data curation are vital to ensure experimental and analytical reproducibility.
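To make the idea of documenting pre-processing concrete, here is a minimal Python sketch of a provenance log. It was not presented at the workshop, and it does not follow any established provenance standard. The helper names (`fingerprint`, `apply_step`, `lowercase`, `remove_stopwords`) and the record fields are hypothetical and chosen purely for illustration. Each step records its name, its parameters, and a hash of the data before and after, so that successive changes can be traced later.

```python
import hashlib
import json


def fingerprint(tokens):
    """Return a short SHA-256 digest identifying this exact token sequence."""
    payload = json.dumps(tokens, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def apply_step(tokens, log, name, func, **params):
    """Apply one pre-processing step and append a provenance record to log.

    The record layout is an assumption made for this sketch.
    """
    result = func(tokens, **params)
    log.append({
        "step": name,
        # Sets are stored as sorted lists so the record is JSON-serializable.
        "params": {
            key: sorted(value) if isinstance(value, (set, frozenset)) else value
            for key, value in params.items()
        },
        "input": fingerprint(tokens),
        "output": fingerprint(result),
        "n_in": len(tokens),
        "n_out": len(result),
    })
    return result


def lowercase(tokens):
    """Normalize all tokens to lower case."""
    return [token.lower() for token in tokens]


def remove_stopwords(tokens, stoplist):
    """Drop every token that appears in the stop-list."""
    return [token for token in tokens if token not in stoplist]


log = []
tokens = ["The", "treebank", "of", "Czech", "texts"]
tokens = apply_step(tokens, log, "lowercase", lowercase)
tokens = apply_step(tokens, log, "remove_stopwords", remove_stopwords,
                    stoplist={"the", "of"})

# The log can be stored alongside the released data set.
print(json.dumps(log, indent=2))
```

Because the output hash of one step equals the input hash of the next, the log forms a chain that shows which version of the data each step operated on. A real project would more likely rely on a version control system or a dedicated provenance model than on a hand-rolled log like this one.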
While computational linguists deal with these issues on a daily basis, there is little awareness of established methodology and best practices coming from the field of data provenance. Therefore, the aim of our workshop was to begin a dialog. On the one hand, we wanted to create awareness of the needs and challenges posed by linguistic data in the data provenance community. On the other hand, we aimed to import an understanding of the experiences and best practices established with respect to data provenance into the computational linguistics community.
During the workshop, we heard talks from researchers from each of the fields. The first talk was given by Nicoletta Calzolari, the president of ELRA, the European Language Resources Association, who discussed the importance of policy issues for language resources. Then, Adriane Boyd, a computational linguist from the University of Tübingen, presented her work on annotation alignment and error detection. Peter Buneman (University of Edinburgh), who is well known for his research on database systems and database theory, including work on data provenance, annotations, and digital curation, presented a talk titled ‘What is annotation, and why is it important?’. This was followed by a talk by Sarah Cohen Boulakia, a computer scientist from the Université Paris-Sud, on ‘Scientific Workflows for Computational Reproducibility: Experiences from the Bioinformatics domain, Status, Challenges, and Opportunities’. The final talk was given by Jan Hajič, the local organizer of the TLT16 conference, who reported on the lessons learned during the annotation of the Prague Dependency Treebank, a corpus of Czech texts with deep linguistic annotation.
The workshop also featured a poster session, where Jochen Görtler (project A01 of the SFB/Transregio 161) presented joint work on data uncertainty with Christoph Schulz, Daniel Weiskopf, and Oliver Deussen (all members of project A01) in the form of a poster on Bubble Treemaps for Visualization of Data Uncertainty. Aikaterini-Lida Kalouli, a member of Miriam Butt’s working group at the University of Konstanz, presented a poster together with Valeria de Paiva (Nuance Communications) and Livy Real (University of São Paulo) on Annotating Logic Inference Pitfalls. Moreover, Heike Zinsmeister, a computational linguist from the University of Hamburg, reported on her experiences under the title ‘Provenance of annotation: a survey on multiple annotations for applications in computational linguistics and the digital humanities’.
Overall, the workshop succeeded in bringing the fields together: it started a dialog between computational linguists and researchers from the fields of provenance, data annotation, and data curation, and laid the foundation for future collaborations between the fields.