The Denoised Web Treebank

We introduce the Denoised Web Treebank: a treebank including a normalization layer and a corresponding evaluation metric for dependency parsing of noisy text, such as Tweets. This benchmark enables the evaluation of text normalization methods, including normalization as machine translation and unsupervised lexical normalization, directly on syntactic trees.

Documentation

For more information and documentation, please see my 2013 Master thesis.

Data

Download the treebank here.

Source code

All source code used in this paper is available under Apache 2.0 license on Github: jodaiber/ithaka.

License

The treebank is licensed under Creative Commons (BY-NC-SA).

File contents

The treebank cosists of the following files:

  • dev/test.ids: IDs of the tweets that were the source of the sentence.
  • dev/test.tokens: The original noisy tokens.
  • dev/test.normalized: The manually normalized tokens including the alignments.
  • dev/test.conll: Manually annotated dependency trees for each sentence (CONLL X format).