Collection, Description, and Visualization of the German Reddit Corpus





1 OeAW - Austrian Academy of Sciences 2 Berlin-Brandenburg Academy of Sciences

Abstract: Reddit is a major social bookmarking and microblogging platform. An extensive dataset of Reddit comments has recently been made publicly available. I use a two-tiered filter to single out comments in German in order to build a linguistic corpus, which is then tokenized and annotated. This article offers first insights into both the nature and the quality of the data at the lexical level. Additionally, a visualization gives an impression of the possible geographical distribution of German users of the platform.

Keywords: Computer-mediated Communication, Web corpus construction, Information Visualization, Language Identification





Author: Adrien Barbaresi

Source: https://hal.archives-ouvertes.fr/






