We are pleased to inform you that the second workshop on "Text Analytics for Cybersecurity and Online Safety (TA-COS)" will be held in conjunction with the 11th edition of the Language Resources and Evaluation Conference, 7-12 May 2018 in Miyazaki (Japan).
TA-COS 2018 will take place at the LREC Conference venue, the Phoenix Seagaia Resort, as an afternoon session on Saturday, 12 May 2018.
Please find the program below. We look forward to your participation!
|14:10||Data-driven models of reputation in cyber-security (invited talk)||Pierre Lison|
|15:00||Annotation and Classification of Toxicity for Thai Twitter||Sugan Sirihattasak, Mamoru Komachi, Hiroshi Ishikawa|
|15:30||Monitoring Targeted Hate in Online Environments||Tim Isbister, Magnus Sahlgren, Lisa Kaati, Milan Obaidi, Nazar Akrami|
|16:30||Think Before Your Click: Data and Models for Adult Content in Arabic Twitter||Ali Alshehri, El Moatez Billah Nagoudi, Hassan Alhuzali, Muhammad Abdul-Mageed|
|17:00||Automated email Generation for Targeted Attacks using Natural Language||Avisha Das, Rakesh Verma|
14:10 – 15:00
Data-driven models of reputation in cyber-security - Pierre Lison
In this talk, I will present our work on developing data-driven, predictive models of reputation (such as benign or malicious) for end-point hosts. I'll focus on two particular questions:
1) Malware often relies on so-called domain-generation algorithms (DGAs) to produce "fake" domain names that are used to connect compromised hosts with a command-and-control server. Many types of DGAs have been developed, from simple hashing techniques to more sophisticated approaches based on wordlists. I will show that these malware-generated domain names can be detected through recurrent neural networks such as LSTMs or GRUs.
2) The second part of the talk will focus on neural models of traffic reputation learned from passive DNS data. Passive DNS data are collections of inter-server DNS queries captured by sensors distributed on the network. These data are a goldmine for predicting whether a given domain name or IP address is likely to be benign or malicious. I will describe a deep neural architecture that predicts the reputation of end-point hosts with high accuracy. The neural model is trained on a large passive DNS dataset (745 million entries) and relies on a broad range of features extracted from the DNS graph.
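As a rough illustration of the first point, a recurrent model such as an LSTM or GRU typically consumes a domain name as a sequence of character indices. The sketch below shows only that preprocessing step. The alphabet, the padding length, and the function name `encode_domain` are assumptions made for illustration, not details taken from the talk.

```python
import string
from typing import List

# Assumed alphabet: lowercase letters, digits, hyphen and dot.
# Index 0 is reserved for padding, so real characters start at 1.
ALPHABET = string.ascii_lowercase + string.digits + "-."
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
UNKNOWN_INDEX = len(ALPHABET) + 1  # any character outside the alphabet
MAX_LEN = 20  # hypothetical fixed sequence length


def encode_domain(domain: str, max_len: int = MAX_LEN) -> List[int]:
    """Map a domain name to a fixed-length list of character indices."""
    truncated = domain.lower()[:max_len]
    indices = [CHAR_TO_INDEX.get(ch, UNKNOWN_INDEX) for ch in truncated]
    # Right-pad with zeros so every domain has the same length.
    return indices + [0] * (max_len - len(indices))


if __name__ == "__main__":
    print(encode_domain("example.com"))
```

A sequence classifier would then be trained on such index lists paired with benign or malicious labels.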
Workshop Papers I
Annotation and Classification of Toxicity for Thai Twitter - Sugan Sirihattasak, Mamoru Komachi, Hiroshi Ishikawa
In this study, we present toxicity annotation for a Thai Twitter Corpus as a preliminary exploration for toxicity analysis in the Thai language. We construct a Thai toxic word dictionary and select 3,300 tweets for annotation using the 44 keywords from our dictionary. We obtained 2,027 and 1,273 toxic and non-toxic tweets, respectively; these were labeled by three annotators. The result of corpus analysis indicates that tweets that include toxic words are not always toxic. Further, a tweet is more likely to be toxic if it contains toxic words used in their original meaning. Moreover, disagreements in annotation arise primarily from sarcasm, an unclear target, and word sense ambiguity. Finally, we conducted supervised classification using our corpus as a dataset and obtained an accuracy of 0.80, which is comparable with the inter-annotator agreement of this dataset. Our dataset is available on GitHub.
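The abstract does not say how the three annotators' labels were combined or how accuracy was computed. The sketch below shows one common option, assumed here purely for illustration: a majority vote over the three labels, and accuracy as the fraction of matching predictions. The label strings and function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def majority_label(labels: Sequence[str]) -> str:
    """Return the most frequent label among the annotators.

    With three annotators and two classes there is always a strict majority.
    """
    return Counter(labels).most_common(1)[0][0]


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


if __name__ == "__main__":
    # Toy annotations: one tuple of three labels per tweet.
    annotations = [
        ("toxic", "toxic", "non-toxic"),
        ("non-toxic", "non-toxic", "non-toxic"),
    ]
    gold = [majority_label(a) for a in annotations]
    print(gold, accuracy(["toxic", "toxic"], gold))
```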
Monitoring Targeted Hate in Online Environments - Tim Isbister, Magnus Sahlgren, Lisa Kaati, Milan Obaidi, Nazar Akrami
Hateful comments, swearwords and sometimes even death threats are becoming a reality for many people today in online environments. This is especially true for journalists, politicians, artists, and other public figures. This paper describes how hate directed towards individuals can be measured in online environments using a simple dictionary-based approach. We present a case study on Swedish politicians, and use examples from this study to discuss shortcomings of the proposed dictionary-based approach. We also outline possibilities for potential refinements of the proposed approach.
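A dictionary-based measurement of this kind can be sketched as follows. This is a minimal illustration, not the authors' implementation: the word list, the sample comments, and the function name `hate_rate` are hypothetical, and the actual case study concerns Swedish-language data.

```python
import re
from typing import Iterable, Set


def hate_rate(comments: Iterable[str], target: str, dictionary: Set[str]) -> float:
    """Fraction of comments mentioning `target` that contain a dictionary term."""
    target_token = target.lower()
    mentions = 0
    flagged = 0
    for comment in comments:
        tokens = re.findall(r"\w+", comment.lower())
        if target_token not in tokens:
            continue  # the comment does not mention the target at all
        mentions += 1
        if any(token in dictionary for token in tokens):
            flagged += 1
    return flagged / mentions if mentions else 0.0


if __name__ == "__main__":
    sample = [
        "Smith is an idiot",
        "I met Smith today",
        "Jones is great",
        "smith, you liar!",
    ]
    print(hate_rate(sample, "Smith", {"idiot", "liar"}))
```

Exact token matching like this misses inflected or misspelled forms and cannot tell whether a matched term is actually directed at the target.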
Think Before Your Click: Data and Models for Adult Content in Arabic Twitter - Ali Alshehri, El Moatez Billah Nagoudi, Hassan Alhuzali, Muhammad Abdul-Mageed
Given the widespread use of social media and their increasingly impactful role in our lives today, there is a pressing need to ensure their safety of use. In particular, various social groups view the spread of adult content in social networks as undesirable. This content may even pose a serious threat to other vulnerable groups (e.g. children). In this work, we develop a unique, large-scale dataset of adult content in Arabic Twitter and provide in-depth analyses of the data. The dataset enables us to study the scope and distribution of adult content in the Arabic version of the network, thus possibly uncovering target geographic locales. In addition, we computationally exploit the data to learn a large lexicon specific to the topic and detect spreaders of adult content on the microblogging platform. Our models achieve promising results, reaching 79% accuracy on the task (24% higher than a competitive baseline).
Automated email Generation for Targeted Attacks using Natural Language - Avisha Das, Rakesh Verma
With an increasing number of malicious attacks, the number of people and organizations falling prey to social engineering attacks is proliferating. Despite considerable research in mitigation systems, attackers continually improve their modus operandi by using sophisticated machine learning and natural language processing techniques, with the intent to launch successful targeted attacks aimed at deceiving detection mechanisms as well as the victims. We propose a system for advanced email masquerading attacks using Natural Language Generation (NLG) techniques. Using legitimate as well as an influx of varying malicious content, the proposed deep learning system generates fake emails with malicious content, customized depending on the attacker’s intent. The system leverages Recurrent Neural Networks (RNNs) for automated text generation. We also focus on the performance of the generated emails in defeating statistical detectors, and compare and analyze the emails using a proposed baseline.