Social Media Monitoring: Big Data Analytics - New Arrival Predictions: DRC Situation, Angola (update 3)

from UN High Commissioner for Refugees
Published on 11 Jul 2017


UNHCR recognizes that big data can help improve its understanding of the protection environment. For this reason, the UNHCR Innovation Service started a sentiment analysis exercise together with UN Global Pulse to support the response to the Europe Refugee Crisis, with the aim of providing decision makers with additional context on the situation as it unfolded. Part of the exercise was to determine how big data, in this case unstructured social media data, could be used to improve UNHCR’s understanding of a complicated and unique situation by providing structured insights. More information is available in previous updates (1 and 2).


A complex emergency is unfolding in the Kasaï region in the Democratic Republic of the Congo (DRC). Individuals continue to seek safety in Northern Angola from inter-communal clashes, generalized violence, disorder and a shortage of basic items. Protection concerns and human rights violations have been reported. Most civilians in affected areas are at risk of serious human rights violations, including physical mutilation, killing, sexual violence, arbitrary arrest and detention.

The violence in the Kasaï region in the Democratic Republic of the Congo (DRC), which erupted in August 2016, has resulted in the displacement of large numbers of Congolese. It is estimated that 1.27 million people have been internally displaced. In addition, as of today, over 30,000 people are newly registered in Angola, having escaped violence in Kasaï. Initially, the daily arrival figure into Angola fluctuated significantly, with spikes in arrivals difficult to predict; the Angola operation began exploring solutions to this challenge, including manual social media monitoring (Twitter) for reports of violent incidents in the Kasaï region. On this occasion, the machine-led monitor was set up to improve the efficiency of this monitoring and make it more systematic, so that the operation can better predict and prepare for new arrivals. More information is available on the machine-led data query in previous updates (1 and 2).

Due to the evolving dynamics of the conflict in Kasaï, access to ‘real-time’ information on violent incidents, considered the key driver of displacement in this context, is critical. Access to the Kasaï region by humanitarian actors is challenging; identifying, testing and validating remote information sources (including social media) is therefore a potential way forward towards a better understanding of the context.

The monitor was initially set up in French and English; initial tests have begun in Lingala and Swahili (please see further details below).


• Originally the monitor was established with relatively large geographical searches (including DRC / #DRC); this proved to be too wide in scope, returning many posts linked to violence in the Kivus. To refine the monitor, the taxonomy was revised to a smaller geographic scope based on current arrivals data from the Angola Operation. This immediately filtered out the ‘noise’ of posts not related to violence in the Kasaïs.

• Because place names are commonly written with varying spellings and accents, there was a risk of missing some posts. To address this, the taxonomy was adjusted to include spelling variants for each location.

• Based on an initial review of the identified posts, it was clear that retweets were ‘creating noise’: the timing and scale of the incident could not be determined. To address this, retweets were filtered out, so the monitor now returns only original tweets; however, this does not filter out directly copied tweets (those without RT @). As this was the first time a retweet filter had been applied, several attempts to remove ‘RTs’ were required before it was successfully established.

• From 21st June onwards an in-depth analysis of the filtered ‘specific’ category was undertaken; this has improved the accuracy of the machine-led categorization.
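
The refinements described above (a narrower geographic taxonomy, spelling and accent variants for place names, and a retweet filter) can be illustrated with a minimal Python sketch. This is not the monitor's actual configuration, which the update does not publish: the place names and variants in `PLACE_VARIANTS`, the function names, and the sample posts are all hypothetical and chosen only for illustration.

```python
import re
import unicodedata

# Hypothetical taxonomy: canonical place name -> known spelling variants.
# The real taxonomy terms are not listed in this update.
PLACE_VARIANTS = {
    "Kasaï": ["Kasaï", "Kasai"],
    "Kananga": ["Kananga"],
    "Tshikapa": ["Tshikapa", "Tchikapa"],
}

# A retweet is assumed to start with "RT @username".
RT_PREFIX = re.compile(r"^\s*RT\s+@\w+", re.IGNORECASE)


def normalize(text):
    """Strip accents and case so that 'Kasaï' and 'kasai' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def matched_places(text):
    """Return the canonical places whose variants appear as whole words."""
    norm = normalize(text)
    found = []
    for place, variants in PLACE_VARIANTS.items():
        for variant in variants:
            pattern = r"\b" + re.escape(normalize(variant)) + r"\b"
            if re.search(pattern, norm):
                found.append(place)
                break  # one matching variant per place is enough
    return found


def is_retweet(text):
    """True for posts marked with a leading 'RT @'."""
    return bool(RT_PREFIX.match(text))


def filter_posts(posts):
    """Keep original posts that mention a place in the taxonomy.

    Directly copied posts (same text without 'RT @') still pass through,
    matching the limitation noted in the bullet on retweets.
    """
    return [p for p in posts if not is_retweet(p) and matched_places(p)]


if __name__ == "__main__":
    sample = [
        "Violence reported near Kasai today",
        "RT @someone: Violence reported near Kasaï",
        "Clashes in the Kivus",
        "Arrivées depuis Tchikapa",
    ]
    for post in filter_posts(sample):
        print(post)
```

In this sketch, accent stripping is applied to both the post and the taxonomy terms, so a single accented entry also matches its unaccented form; spellings that differ by more than accents (such as the hypothetical "Tchikapa") still need to be listed explicitly.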