|Title||Annotation and Classification of Toxicity for Thai Twitter|
|Publication Type||Conference Paper|
|Year of Publication||2018|
|Authors||Sirihattasak S, Komachi M, Ishikawa H|
|Conference Name||Second Workshop on Text Analytics for Cybersecurity and Online Safety (TA-COS 2018)|
|Publisher||European Language Resources Association (ELRA)|
|Conference Location||Miyazaki, Japan|
In this study, we present toxicity annotation for a Thai Twitter corpus as a preliminary exploration of toxicity analysis in the Thai language. We constructed a Thai toxic word dictionary and selected 3,300 tweets for annotation using the 44 keywords from our dictionary. Three annotators labeled the tweets, yielding 2,027 toxic tweets and 1,273 non-toxic tweets. Corpus analysis indicates that tweets containing toxic words are not always toxic. Further, a tweet is more likely to be toxic if it contains toxic words used in their original meaning. Disagreements in annotation are primarily due to sarcasm, unclear targets, and word sense ambiguity. Finally, we conducted supervised classification using our corpus as a dataset and obtained an accuracy of 0.80, which is comparable to the inter-annotator agreement of this dataset. Our dataset is available on GitHub.
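As a rough illustration of the pipeline the abstract outlines (keyword-based tweet selection, labels from three annotators, and accuracy as the evaluation measure), the following is a minimal sketch rather than the authors' implementation. The function names, the placeholder keyword, and the sample tweets are hypothetical, and combining the three annotations by majority vote is an assumption; the abstract does not say how the final labels were derived.

```python
from collections import Counter


def contains_keyword(tweet: str, keywords: set[str]) -> bool:
    """Return True if any dictionary keyword occurs in the tweet.

    Substring matching is used because Thai is written without spaces
    between words, so whitespace tokenization would miss most matches.
    """
    return any(keyword in tweet for keyword in keywords)


def majority_label(votes: list[str]) -> str:
    """Combine annotator votes by majority (an assumption, see lead-in).

    With three annotators and two labels, a strict majority always exists.
    """
    return Counter(votes).most_common(1)[0][0]


def accuracy(gold: list[str], predicted: list[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Hypothetical placeholder data; the real dictionary holds Thai toxic words.
keywords = {"badword"}
tweets = ["this tweet has badword in it", "a perfectly clean tweet"]

# Step 1: keep only tweets that contain at least one dictionary keyword.
selected = [tweet for tweet in tweets if contains_keyword(tweet, keywords)]

# Step 2: derive one label per selected tweet from three annotator votes.
votes_per_tweet = [["toxic", "non-toxic", "toxic"]]
labels = [majority_label(votes) for votes in votes_per_tweet]
```

The `accuracy` helper corresponds to the evaluation measure reported in the abstract; any classifier's predictions over a held-out portion of the labeled tweets could be passed to it alongside the gold labels.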