Monday 2 June 2014

Disaster Data Analytics - Learning how to use it

Compiling metadata onto a map can produce powerful analysis of data: a story, an event, or an outcome.
Photo Credit: Author produced Google Analytics image of DDRS data.

Collecting information to solve problems has logical and obvious outcomes. But how much is enough, and how are the right elements compiled and sampled? Where is the data from, and how is it interpreted? The list of questions and issues soon unravels into a series of complex questions that are, at times, a source of uncertainty. For many it fuels hesitation at the worst possible moment. Implementing decisions based on information has immense potential to misguide, direct, or impact a scenario with irreversible outcomes. Data is no longer a simple table of words in an archive, but one of near real-time emotions, facts, history, numbers, and potential distortions. Data analytics is now the number one source used to shape doctrine, policy, regulations, research, and the social structures surrounding our laws at the local and national level. It is shaping how the sciences, engineering, and every region of the world will move forward. It is also becoming entrenched in how we respond to crisis and disaster events.

The internet's connectivity has allowed the world to collect and analyze data with methods unheard of prior to 2010. In the tech world, four years is an eternity. Software development moves in life cycles that are reviewed and updated in 90 days, not 12 months. New analytics tools to interpret and define data are faster and more powerful with every rollout, and new research is unfolding at breakneck speed, with no limit in sight on what can be collected and analyzed to build the strategic and tactical decision models that can be presented. Every level of government, the private sector, and interest groups now use some form of analytics. It is shaping our world in ways many do not notice, let alone comprehend. We can measure the health of the world in any way we see fit by asking one question or a series of them. We can react to any government policy change or decision. There are no limits on what can be reviewed and analyzed: from monetary markets to production levels of raw materials, from where ice and snow storms will impact infrastructure to precise maps of floodplains near vulnerable population centers. Even hypothetical political decisions and policy issues can be tested with analytics and accepted or rejected in very specific detail, all in a matter of minutes. Systems available today can be programmed with structured data to feed results over any period of time and geographic location. For every question that demands an answer, chances are it has already been run through an analytics platform so answers and scenarios can be reviewed. Sometimes the results are not the ones anyone wants to see. If not, there is probably a way to shape and analyze the data until they are.

Analytics in its earliest forms existed long before the invention of the computer and database software. Opinion polling started in the early 1800s, with the first recorded poll commissioned by the Harrisburg Pennsylvanian newspaper in 1824, covering the U.S. Presidential election between Andrew Jackson and John Quincy Adams. It was a non-binding straw poll of 504 individuals. Jackson, despite being more popular in the straw vote (335-169), lost to Adams. But advance knowledge of the results did not sway those ultimately asked to make the decision: the House of Representatives decided the outcome after no clear victor emerged in the Electoral College process. It could easily be argued that little has changed in the 190 years since in terms of the accuracy of polls applied to political elections, including the recent U.S. Presidential election of 2012. The same dangers and problems can arise when using today's analytics tools.

The level of sophistication and potential for accuracy has improved immensely in two centuries. But ever since, there has been a catch phrase every polling organization now puts in place: "accurate 19 times out of 20," or "margin of error of +/- 2 or 3 percent." In the world of crisis and disaster management, the margin of error is currently higher. For response agencies, this potential for errors and mistakes can have serious impacts and is on the mind of any leader in command or authority. There is also the potential to waste resources and analyst cycles formulating answers that are not required or needed. Just because you can, does not mean you should.
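Those two caveats are really the same statement. For a simple random sample, the margin of error at 95 percent confidence ("19 times out of 20") follows directly from the sample size. A minimal sketch, assuming the standard textbook formula rather than any pollster's proprietary model:

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error ("accurate 19 times out of 20") for a
    simple random sample of size n with observed proportion p."""
    return z * math.sqrt(p * (1 - p) / n)

print(f"{margin_of_error(1000):.1%}")  # ~3.1% -- the familiar "+/- 3 percent"
print(f"{margin_of_error(504):.1%}")   # ~4.4% -- the 1824 straw poll, had it
                                       # been a random sample (it was not)
```

The formula assumes respondents were sampled at random, which is precisely what a straw poll, or a stream of disaster tweets, cannot guarantee.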

What are the real requirements and needs? How do we solve problems that can be answered with the use of open and social media data, in all its forms, in terms of disaster resilience, response, and recovery? Can we deliver a series of tools that achieve new capabilities and delivery mechanisms where they are needed most? Can past lessons learned be accurately organized and filtered?

On the surface, the answers appear obvious and logical. In general terms, the response from most experts is yes to all of the above. Yet our world does not work in obvious patterns and, on occasion, definitely not in logical ones. Historically, humans are not as predictable as many might want to suggest, and history proves this point time and time again; otherwise we would not have the margin-of-error caveat. Practitioners, scientists, and sociologists should debate how efficient and effective we currently are in solving these kinds of problems and issues, which do not have globally accepted standards. Using commercial analytic tools is one of many options. In some areas, we do not even know how to ask the right questions or understand what the data is really telling us. In many respects we are on the ground floor in the use of analytics, while in others we have made significant improvements in interpreting data during a disaster. As mentioned earlier, we often fail to ask the right question or do not understand how best to respond to the results. As with polling, the question is often part of the problem when using analytics. Are you looking for support of your analysis (or conclusions), or attempting to illuminate its weaknesses?

As an example, in the map shown above, the data overlaid on the global map is a census of DDRS readers of our Crisis and Disaster Management Magazine editorial blog. What it could tell us is not necessarily what it actually does. For example, we could state that it illustrates DDRS readers in high-risk disaster areas, that the largest concentrations mark where major disasters have affected them, or that the map shows where professionals in the field of disaster management live. But is this really true? What other datasets should we use to confirm this analysis? And does it matter and reflect what we need to know? The answer lies in the quality of the data and whether the right underlying information was collected to support the question(s) being asked. In many cases, analysis tools cannot populate these answers unless additional background data is integrated and all the answers submitted are cross-referenced and verified. This is not to suggest that answers without them are extracted inaccurately - but they might be. FUD (fear, uncertainty, and doubt). Taking into account the margin-of-error rule, the results could wind up being as accurate as the same straw poll taken in 1824 if vetting, rules, and process are not enforced uniformly. Disaster tweets are just as vulnerable to biased sampling as polls of commercial consumer product popularity. The social media world has another dark side too: the injection of false data with cause and intent, often difficult to detect until it is too late, as was illustrated in a World Economic Forum influenced research report, "Collective attention in the age of (mis)information," authored in March of this year by Delia Mocanu, Luca Rossi, and Qian Zhang (Laboratory for the Modeling of Biological and Socio-technical Systems, Northeastern University, Boston), Marton Karsai (Northeastern University, Boston, and Laboratoire de l'Informatique du Parallélisme, Lyon, France), and Walter Quattrociocchi (Northeastern University, Boston, and Laboratory of Computational Social Science, IMT Lucca, Italy).
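One practical way to move from "could tell us" to "does tell us" is to cross-reference the map against an independent dataset. A minimal sketch follows; the reader coordinates and hazard-zone bounding boxes are hypothetical stand-ins, not real DDRS data:

```python
# Hypothetical sketch: test the claim "our readers cluster in high-risk
# disaster areas" by cross-referencing reader coordinates against an
# independent hazard dataset, instead of assuming the map speaks for itself.

# Assumed inputs -- neither dataset exists in this form; both are illustrative.
readers = [("Manila", 14.6, 121.0), ("Oslo", 59.9, 10.8), ("Miami", 25.8, -80.2)]
hazard_zones = [  # (name, lat_min, lat_max, lon_min, lon_max) bounding boxes
    ("Western Pacific typhoon belt", 5.0, 25.0, 115.0, 135.0),
    ("Atlantic hurricane coast", 24.0, 36.0, -98.0, -75.0),
]

def in_zone(lat, lon, zone):
    _, lat_min, lat_max, lon_min, lon_max = zone
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max

at_risk = [r for r in readers
           if any(in_zone(r[1], r[2], z) for z in hazard_zones)]
print(f"{len(at_risk)} of {len(readers)} readers fall inside a hazard zone")
```

Even then, the conclusion is only as good as the hazard dataset and the geolocation quality feeding it.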

There are cases where information can fit multiple categories describing different levels of the same query, creating multiple answers. This can easily mislead the user's expectations and interpretation. During Hurricane Sandy, several organizations, including FEMA's Innovation team, believed many of the residents of communities on Staten Island were likely to be under-served or completely ignored during the rescue and recovery phase because of ethnicity and immigration status. These were important concerns and needed to be addressed. Research was carried out using census and community data available online to see which local community groups could be leveraged in the affected areas. The pool of data sources available to create a 'sense' of which groups were most likely to be affected or at high risk was immense. It was important to determine quality, accuracy, and most importantly, usability to create an action plan in a very short time frame. These questions were repeatedly asked as part of the vetting process.

The level of risk underwritten by some organizations in collecting data through a process known as sampling, to create an analysis, has been alarmingly high in past disasters. These risks include making assumptions about demographics and cultural and social structures in data sets that do not have real-time native metadata embedded in the original source information stream. Researchers often inject static data from other sources and then extrapolate the answers they desire. Tweets sourced from Twitter are a frequent sample target. Just because we can get high-level answers does not mean they are of value. Tweets cannot tell the entire story, nor should they; they are an enabler to support conclusions. It is when we integrate them with other data sets, such as census data (private and public), supplementary social media, and other databases, that we can get into trouble. Media organizations have been remarkably successful in using commercial software to analyze Twitter, enhancing tweets with information such as location and user demographics by archiving large volumes of a user's (or group's, region's, or market segment's) data with continuous updates to determine the partial or complete validity of information. Such enhancements are not legally available for use by all government agencies or officials. The scientific community is often shackled by a lack of funding, by regulations, or by both.
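The mechanics of that enrichment are trivial; the danger is in what it silently assumes. A hedged sketch of the pattern, where the county lookup table and tweet fields are invented for illustration:

```python
# Hypothetical sketch of the enrichment pattern described above: joining a
# tweet's coarse location to static census data from another source.

census_by_county = {  # static, non-real-time data injected from elsewhere
    "Richmond County, NY": {"median_age": 38, "pct_limited_english": 0.12},
}

tweet = {"text": "water rising fast near the shore",
         "county": "Richmond County, NY"}

# The join itself is one line...
enriched = {**tweet, **census_by_county.get(tweet["county"], {})}
print(enriched)

# ...the risk is in the inference. Nothing here tells us the author is 38
# or has limited English; we have attached an area average to an individual.
# That assumption inflates the effective margin of error without appearing
# anywhere in the output.
```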

Critics of Twitter suggest that the volume of data mining by third parties using Twitter bots (computer-operated Twitter accounts), and the purchase of large sets of tweets by commercial vendors, are diluting quality and validity. Screening and filtering tweets to eliminate duplicates and retweets is a common request. How the remaining data is then scrubbed and interpreted varies between vendors and is considered proprietary property, which can (and does) raise valid concerns. This analysis step is critical if the data is to be used in crisis and disaster management in countries where regulations permit such queries to be carried out; the allowable margin of error is far less than 2 or 3 percent. Even the time of day when a Facebook post or tweet is published can skew the accuracy of a report. If a community has been, or is, repeatedly hit by disasters, such as Florida in hurricane season or Texas, Oklahoma, and Arkansas in tornado season, it is likely the information will be skewed.
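What that screening step looks like is rarely published, since each vendor's pipeline is proprietary. A minimal sketch of the idea, assuming tweet fields loosely shaped like Twitter's public API of the era:

```python
# Minimal sketch of the retweet and duplicate screening described above.
# Field names are assumptions; real vendor pipelines are proprietary.

def screen(tweets):
    seen_texts = set()
    kept = []
    for t in tweets:
        normalized = t["text"].strip().lower()
        if t["text"].startswith("RT @") or t.get("retweeted_status"):
            continue                 # drop retweets
        if normalized in seen_texts:
            continue                 # drop verbatim duplicates
        seen_texts.add(normalized)
        kept.append(t)
    return kept

sample = [
    {"text": "Bridge out on Route 9, avoid the area"},
    {"text": "RT @user: Bridge out on Route 9, avoid the area"},
    {"text": "bridge out on route 9, avoid the area"},
]
print(len(screen(sample)))  # 1 -- two of the three add no new information
```

Everything after this point, including how near-duplicates and bot traffic are handled, is where vendors diverge and where the concerns raised above begin.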

In some cases additional validation points and references are not required. A recent investigation (which you can read about in our DDRS Magazine under the title Tweets Can Guide Emergency Responders) by researchers Mahalia Miller, Lynne Burks, and Reza Zadeh at Stanford University, in cooperation with the United States Geological Survey (USGS), suggested that Twitter is a faster method of detecting when a damaging earthquake has occurred, and the associated level of damage, than using scientific sensors alone as standalone alarm triggers that simply deliver an announcement to the public that an event has occurred. In many parts of the world with multiple points of connectivity, this could direct a faster civil protection response. It could be especially valuable in situations where local infrastructure will soon be unavailable or quickly saturated.
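The intuition is that a sudden burst of quake-related tweets arrives ahead of confirmed sensor analysis. As a simplified illustration only, and not the researchers' actual method (their work used far more careful statistical classification; the keywords, window, and threshold here are hypothetical):

```python
# Simplified illustration of tweet-based earthquake burst detection.
# Keyword list, window, and threshold are invented for this sketch.

from collections import deque

KEYWORDS = ("earthquake", "quake", "temblor", "shaking")
WINDOW_SECONDS = 60
BASELINE_PER_MINUTE = 2      # assumed normal chatter level
TRIGGER_MULTIPLIER = 10      # alarm when rate is 10x baseline

recent = deque()             # timestamps of matching tweets in the window

def on_tweet(timestamp, text):
    """Return True if this tweet pushes the rate past the alarm threshold."""
    if not any(k in text.lower() for k in KEYWORDS):
        return False
    recent.append(timestamp)
    while recent and recent[0] < timestamp - WINDOW_SECONDS:
        recent.popleft()     # slide the window forward
    return len(recent) >= BASELINE_PER_MINUTE * TRIGGER_MULTIPLIER

# Twenty matching tweets inside one minute trips the alarm:
alarms = [on_tweet(t, "did anyone else feel that earthquake?") for t in range(20)]
print(alarms[-1])  # True
```

A production system would, of course, also need the duplicate, bot, and false-data screening discussed above before any alarm reached a responder.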

There is little consensus on how emergency management organizations should use open data, social media, and other information sources, with analytics applied as an Intelligence, Surveillance, and Reconnaissance (ISR) tool in support of Command & Control (C2) or the Incident Command System (ICS), to support government decision making. Early lessons learned are not yet conclusive. Many opinions have been published that are difficult to verify, or whose reasoning is difficult to even trace. The Tōhoku earthquake and tsunami, Hurricane Sandy, and Super Typhoon Haiyan offer a glimpse of the potential, suggesting data analytics applied to social media and SMS data has value in real time, post-event, and in crisis and disaster management research (lessons learned). Agencies are still in the early stages of its development, trust, and acceptance. As we move forward, analytics is not likely to converge on a single set of universally accepted standards or procedures. Put another way, one system will not fit all.

And while there are significant differences in how polling analysis and analytics of real-time (and archived) data are processed, compiled, and published, there are some interesting areas of overlap and common ground in understanding how data can be interpreted and leveraged.

In the coming months DDRS will explore these issues with editorials, guest-written articles, and curated news stories, both from the past and as they unfold in real time. Let us know your opinions by leaving a comment below.