Introduction

The recent Global Compact for Refugees has acknowledged that the increasing number of forcibly displaced persons and the difficulty of predicting mass movements of people have created an urgent need for data-driven early warning systems that allow governments and humanitarian organisations to use their limited resources most efficiently [1, 2]. Efforts to install early warning systems have led to advances in predicting and forecasting global migration flows; however, forced displacement remains the most elusive and challenging form of migration to predict [3, 2, 4].

People base their decision to migrate on a complex set of factors. In addition to the complexity of this individual decision-making process, predicting forced displacement flows is further complicated by the need to detect or predict trigger events early. Trigger events, which are often the last link in a long chain of other events that tilt the individual’s decision towards flight, often happen randomly and abruptly [5]. Both processes are difficult to model in and of themselves, and the lack of timely and accurate data at the micro-, meso-, and macro-level further aggravates this problem.

Data needed to model these processes are often either unavailable (micro-level) or outdated (meso- and macro-level).

Furthermore, whilst existing refugee flows between countries are often self-perpetuating and can be predicted based on historical data, refugee flows from new events pose a challenge because (i) the event might not yet be known to the modeler; (ii) the effect of a new event on future refugee flows is unknown; and (iii) no historical data from a recent event exist, and it is uncertain to what degree historical data from other events can be applied to predict refugee flows from the new event. The prediction of forced displacement flows therefore requires, first and foremost, a thorough understanding of the mechanisms of forced displacement. In particular:

• The factors that lead to an event that can potentially trigger forced displacement.

• The characteristics of an event that have the potential to create sizable forced displacement.

• The factors that impact the magnitude, demographics, and direction of forced displacement.

Furthermore, a reliable prediction of forced displacement flows requires timely data, preferably at the micro-, meso-, and macro-level. Novel data sources, like Big (Crisis) Data, can provide such timely data and supplement or substitute more traditional data sources [1, 2].

Big (Crisis) Data¹ is an umbrella term for data sources characterised by volume, velocity, and variety, such as satellite imagery, data from social network sites, or exhaust data such as call detail records (CDRs), data from search engines, or log-ins [1]. Digital devices of various kinds generate a volume of data estimated to have surpassed 2.5 quintillion bytes per day. “According to Statista (2018), social networks users in the world were 2.46 billion in 2017. According to Internet World Stats (2018), at the beginning of the year 2018, the Internet penetration rate ranged from 95.0% in North America and 85.7% in the European Union to 48.1% in Asia and 35.2% in Africa.” [6] However, the vastness and high granularity of Big (Crisis) Data are both a boon and a bane. On the one hand, these data allow timely access to information that is unavailable through traditional survey methods. On the other hand, the same vastness means that finding valuable and applicable information can resemble searching for a needle in a haystack [7].

To assess the information in Big (Crisis) Data that is valuable for predictive models of forced displacement, we evaluate each data source by using three criteria: (A)ccuracy, (B)ias, and (S)calability (ABS).

• Accuracy: Big Data generally suffer from a low signal-to-noise ratio [8, 3]. Content from social network sites is especially prone to containing false, misleading, or irrelevant information: bots posting sales ads that use trending hashtags to gain traction, as well as actors with political agendas and trolls who post deceptive or false information, all contribute to the noise on social network sites. Likewise, satellite images require intensive training of advanced deep learning algorithms to extract usable information from pixels, and advanced natural language processing algorithms are needed to extract relevant information from text sources in various languages and dialects, which often contain spelling and grammatical errors.

In short, the effort of filtering the signal from the noise can, in some instances, become substantial and can thereby impact the scalability of the data source.

• Bias: Producing user-generated content, exhaust data on the internet, or CDRs requires some form of access to electronic devices. However, although the global penetration rate for cell phones and the internet increases every year, it has not yet reached full saturation. Studies have shown that this lack of saturation is not equally distributed across all demographics but leads to a user demographic that is more Western, more urban, more educated, and more male [8, 9, 10]. Although cell phone penetration rates at the household level are high in developing countries, male household members tend to have immediate access to the device, and women and minors are often excluded. Younger and less educated people prefer channels for direct communication, like Instagram, Pinterest, and Facebook, whilst Twitter and LinkedIn are preferred for more professional messages or social and political activism [6].

These factors create an inherent bias in many Big Data sources, which is difficult to correct, as detailed demographics of user groups are often unavailable or are inferred by the platform using unreliable imputation methods [3, 11].²

• Scalability: Using Big Data in analyses with a broad context (ideally a global context) requires that the data source scale easily. However, proprietary rights often inhibit the scalability of a data source. CDRs, for example, are owned by the carrier network and become accessible to the analyst only through a bilateral agreement between the two. The need for multiple bilateral contracts, due to the various carrier networks within and across countries, challenges CDR usage in studies with an international focus. Furthermore, setting up these agreements takes considerable time and resources, and CDRs are therefore most accessible if pre-crisis agreements already exist. Likewise, social network providers like Facebook, which do not provide real-time access to their data, can compromise the timeliness and flexibility of the data source.

In the following sections, we will evaluate different sources of Big (Crisis) Data within the context of a ‘system of forced displacement’ using the ‘ABS’ criteria. Based on these three criteria, we will discuss the advantages and disadvantages of different Big Data sources within various contexts and give suggestions for their usage.