IDMC collaborated with the Hack4Good programme of ETH Zurich to explore how machine learning workflows could be used to obtain near real-time information from Twitter. This information could be particularly useful for monitoring small-scale displacement events that do not reach major news headlines. During a hackathon, a team of four students tackled this challenge. This guest blog presents an overview of the solution they developed.
WHY ARE WE EXPLORING THE USE OF SOCIAL MEDIA?
IDMC aggregates data on internal displacement collected by governments, United Nations agencies, and other international and national relief and emergency response actors. However, when information from these primary data collectors is not available, IDMC monitors the world's news media (e.g. using IDETECT). This means that information about some small-scale events that do not appear in traditional media headlines may be missed. Potentially, such events could be identified using social media platforms such as Twitter, where directly affected people and local institutions give updates on the current situation on-site. Additionally, this platform gives us the opportunity to monitor events in near real-time.
The idea behind this project is to access and analyse a large number of tweets and extract useful information from them. However, the enormous amount of information available poses a major problem: how to separate relevant content from irrelevant information.
WHERE DID IT ALL START?
A group of four students from ETH Zurich tackled this challenge as part of the Hack4Good 2020 Fall Edition. Hack4Good is an eight-week, pro-bono, student-run programme organised by the Analytics Club at ETH Zurich. It matches data science talent from ETH Zurich with non-governmental organisations (NGOs) that promote social causes. In close collaboration with IDMC, the team developed a machine learning (ML) workflow to filter relevant tweets and extract information on internal displacement.
BUILDING AN NLP MODEL TO FILTER RELEVANT CONTENT
We started by extracting tweets from the Twitter API, pre-filtering the content using a list of keywords. We then classified the tweets as relevant or not relevant using a set of hard classification rules. This process is known as data labelling, and it produces the training data for the model. Why do we need a training dataset? To help a program learn to predict a given outcome. In our case, we wanted the model to identify tweets that contain relevant information describing situations of internal displacement.
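As an illustration only, the keyword pre-filter and rule-based labelling could be sketched as follows. The keyword list and the labelling rule (a displacement keyword plus a number) are hypothetical stand-ins, since the post does not give the ones used in the project, and the tweets are invented examples rather than real data.

```python
import re

# Hypothetical keyword list; the project's actual list is not given in the post.
KEYWORDS = {"displaced", "evacuated", "evacuees", "shelter", "fled"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def passes_prefilter(text):
    """Keep a tweet only if it contains at least one keyword."""
    return any(token in KEYWORDS for token in tokenize(text))


def label_by_rule(text):
    """Hypothetical hard rule: a keyword plus a number means 'relevant'."""
    has_number = re.search(r"\d", text) is not None
    return "relevant" if passes_prefilter(text) and has_number else "irrelevant"


# Invented example tweets, for illustration only.
tweets = [
    "Floods in the valley: 200 families evacuated overnight",
    "I feel so displaced at this party",
    "Great weather for a picnic today",
]

prefiltered = [t for t in tweets if passes_prefilter(t)]
labelled = [(t, label_by_rule(t)) for t in prefiltered]

for text, label in labelled:
    print(f"{label:>10}  {text}")
```

The second tweet shows why keywords alone are not enough: it passes the pre-filter but has nothing to do with internal displacement, which is the kind of case the labelling rules and the trained model need to catch.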
The next step consisted of cleaning noise from the content using natural language processing (NLP) techniques commonly applied when classifying text data. These techniques were used to simplify words and to group similar words together. Once the text was simplified, it was transformed into a format that computers can easily ingest and analyse, and that allows us to analyse each word's context and how words relate to each other.
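A minimal sketch of this kind of cleaning and transformation is shown below. It assumes a toy stop-word list and a crude suffix-stripping stemmer in place of a real NLP library, and it uses word and word-pair counts as one simple machine-readable format. The post does not say which techniques or representation the team actually chose.

```python
import re
from collections import Counter

# Toy stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "and", "are", "is", "by", "from"}


def crude_stem(word):
    """Strip a few common suffixes so that similar words share one form."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean(text):
    """Lowercase, drop URLs and @mentions, remove stop words, stem the rest."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    tokens = re.findall(r"[a-z]+", text)
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def vectorize(tokens):
    """Count single words and adjacent word pairs (a rough proxy for context)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


tokens = clean("Families evacuated by floods https://t.co/abc @user")
unigrams, bigrams = vectorize(tokens)
print(tokens)
print(bigrams)
```

With this toy stemmer, "evacuated" and "evacuating" both reduce to the same token, which is the sense in which similar words are associated with each other.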
Then it was time to implement a machine learning model to automatically classify relevant tweets. The classification process consists of programming a training task to measure the probability of observing words in tweets that have the same context as our labelled dataset. Once an optimal model was identified to classify tweets, some key information and metadata (e.g. the name of the individual Twitter user or the organisation posting information) were extracted and organised in tabular form. The final output of the model is a table of relevant tweets that can be analysed and verified by IDMC’s monitoring experts.
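The idea of measuring the probability of observing words in each class resembles a Naive Bayes text classifier. The post does not name the model the team selected, so the sketch below is only an illustration under that assumption, trained on four invented, already-cleaned examples.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, tokens):
        """Return the unnormalised log-probability of each class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # class prior
            for token in tokens:
                if token not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Invented training examples, already cleaned and tokenised.
docs = [
    ["flood", "evacuat", "famili"],
    ["storm", "displac", "people"],
    ["feel", "displac", "party"],
    ["nice", "weather", "picnic"],
]
labels = ["relevant", "relevant", "irrelevant", "irrelevant"]

model = NaiveBayes().fit(docs, labels)
print(model.predict(["flood", "displac", "famili"]))
print(model.predict(["nice", "party"]))
```

In practice a library implementation would normally be preferred over hand-written code like this, and several model types would be compared before settling on one.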
As a result of this pilot project, an NLP workflow was implemented to automatically download, classify, and filter a large volume of tweets and to extract a useful summary of relevant information likely to describe internal displacement events.
WHAT ARE THE CHALLENGES OF USING SOCIAL MEDIA DATA AND NLP MODELS?
While working on this project we encountered some challenges and limitations, as outlined below:
- Limited size of the labelled dataset: Generally, a larger training dataset increases the performance or accuracy of an ML algorithm. Ideally, a training dataset should contain several hundred to several thousand examples per class. This ensures that the algorithm is trained on a broad spectrum of possibilities. For this project, we worked with just 631 labelled tweets.
- Subjective labelling: The quality of the training dataset can also suffer during the manual labelling process. As with any human activity, labelling is not completely objective, and different people may label the same tweet differently. It is therefore crucial to understand the labelling criteria and to make sure that the rules used during the process are specific enough for each tweet to be classified unambiguously.
- Location and language biases: So far, the tool has been applied and validated only on tweets in English. However, not everybody tweets in English, and Twitter is not equally popular around the world. The training data therefore suffers from restricted geographical representation as well as a language bias, which could limit the usefulness of the tool for monitoring internal displacement in different regions around the globe.
HOW GOOD IS OUR MODEL?
The ML classifier was validated on a set of tweets that were manually labelled by IDMC experts. The whole labelled dataset contains 631 tweets, of which 231 are labelled as relevant and 400 as irrelevant. We used 90% of the labelled tweets to train the ML classifier. The resulting tool correctly predicted whether a tweet was relevant or not in 76% of cases (47 tweets). In addition, the workflow can successfully extract displacement-related information from tweets: the displacement term (the words used to describe internal displacement, e.g. evacuees, displaced or sheltered people), the displacement trigger (e.g. storms, floods, hurricanes) and the displacement unit (which gives an overview of the magnitude of people affected, e.g. households).
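To put these figures in context, the short calculation below uses only the numbers quoted above. It assumes that the 76% was measured on the tweets held out from training, which the text implies but does not state explicitly.

```python
# Figures quoted in the text.
n_total = 631
n_relevant = 231
n_irrelevant = 400
reported_accuracy = 0.76
reported_correct = 47

# Sanity check: the two classes add up to the full labelled dataset.
assert n_relevant + n_irrelevant == n_total

# Accuracy obtained by always predicting the majority class ("irrelevant").
majority_baseline = max(n_relevant, n_irrelevant) / n_total

# Test-set size implied by 47 correct predictions at 76% accuracy
# (assumption: the accuracy was measured on held-out tweets).
implied_test_size = round(reported_correct / reported_accuracy)
implied_test_share = implied_test_size / n_total

print(f"majority-class baseline: {majority_baseline:.1%}")
print(f"implied test set: {implied_test_size} tweets "
      f"({implied_test_share:.1%} of the labelled data)")
```

Under that assumption, the implied test set of 62 tweets is roughly the 10% of the data not used for training, and the reported 76% compares with a 63.4% baseline for a model that labels every tweet as irrelevant.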
The tool spares IDMC experts from handling an enormous number of tweets by providing them with a list of relevant content in near real-time. The workflow not only classifies the majority of tweets correctly as relevant or irrelevant, but also extracts and organises key information. This can be seen as a “pre-processing” step for IDMC, saving monitoring experts from manually searching through thousands of tweets.
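The extraction and tabulation step could look roughly like the sketch below. The word lists are hypothetical and seeded only with the examples mentioned in this post, the field names are illustrative, and simple keyword matching stands in for whatever method the team actually used.

```python
import csv
import io
import re

# Hypothetical vocabularies, seeded with the examples given in the post.
TERMS = ["evacuees", "evacuated", "displaced", "sheltered"]
TRIGGERS = ["storm", "flood", "hurricane"]
UNITS = ["households", "families", "people"]

# Illustrative column names for the output table.
FIELDS = ["user", "term", "trigger", "unit", "text"]


def first_match(words, text):
    """Return the first listed word that starts a word in the text, else ''."""
    for word in words:
        if re.search(r"\b" + re.escape(word), text):
            return word
    return ""


def extract(tweet):
    """Build one table row from a tweet given as a dict with 'user' and 'text'."""
    text = tweet["text"].lower()
    return {
        "user": tweet["user"],
        "term": first_match(TERMS, text),
        "trigger": first_match(TRIGGERS, text),
        "unit": first_match(UNITS, text),
        "text": tweet["text"],
    }


def to_table(tweets):
    """Render the extracted rows as CSV text for manual review."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(extract(t) for t in tweets)
    return buffer.getvalue()


# Invented example tweet, for illustration only.
example = {"user": "local_ngo",
           "text": "1,200 households displaced after floods hit the coast"}
print(to_table([example]))
```

A table like this is only a starting point: each row would still need to be verified by monitoring experts before being used.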
However, an ML algorithm is only as powerful as the data with which it is provided. More work is therefore needed, and the IDMC team will continue to invest in improving and further exploring innovative solutions and tools to reduce information gaps on internal displacement.
Guest authors: Gokberk Ozsoy, Katharina Boersig, Michaela Wenner, Tabea Donauer. For more information on the ETH Zurich Hack4Good programme, contact the ETH Analytics Club or visit the webpage.