GeoDa: An Introduction to Spatial Data Analysis

Luc Anselin, Ibnu Syabri and Youngihn Kho
Spatial Analysis Laboratory
Department of Agricultural and Consumer Economics
University of Illinois, Urbana-Champaign
Urbana, IL USA

May 5, 2004

Abstract

This paper presents an overview of GeoDa™, a free software program intended to serve as a user-friendly and graphical introduction to spatial analysis for non-GIS specialists. It includes functionality ranging from simple mapping to exploratory data analysis, the visualization of global and local spatial autocorrelation, and spatial regression. A key feature of GeoDa is an interactive environment that combines maps with statistical graphics, using the technology of dynamically linked windows. A brief review of the software design is given, as well as some illustrative examples that highlight distinctive features of the program in applications dealing with public health, economic development, real estate analysis and criminology.

Key Words: geovisualization, exploratory spatial data analysis, spatial outliers, smoothing, spatial autocorrelation, spatial regression.

This research was supported in part by US National Science Foundation Grant BCS [...], to the Center for Spatially Integrated Social Science (CSISS), and by grant RO1 CA [...] from the National Cancer Institute. In addition, this research was made possible in part through a Cooperative Agreement between the Centers for Disease Control and Prevention (CDC) and the Association of Teachers of Preventive Medicine (ATPM), award number TS [...]. The contents of the paper are the responsibility of the authors and do not necessarily reflect the official views of NSF, NCI, the CDC or ATPM. Special thanks go to Oleg Smirnov for his assistance with the implementation of the spatial regression routines, and to Julie Le Gallo and Julia Koschinsky for preparing, respectively, the data set for the European convergence study and for the Seattle house prices. GeoDa™ is a trademark of Luc Anselin.

1 Introduction

The development of specialized software for spatial data analysis has seen rapid growth since the lack of such tools was lamented in the late 1980s by Haining (1989) and cited as a major impediment to the adoption and use of spatial statistics by GIS researchers. Initially, attention tended to focus on conceptual issues, such as how to integrate spatial statistical methods and a GIS environment (loosely vs. tightly coupled, embedded vs. modular, etc.), and which techniques would be most fruitfully included in such a framework. Familiar reviews of these issues are represented in, among others, Anselin and Getis (1992), Goodchild et al. (1992), Fischer and Nijkamp (1993), Fotheringham and Rogerson (1993, 1994), Fischer et al. (1996), and Fischer and Getis (1997).

Today, the situation is quite different, and a fairly substantial collection of spatial data analysis software is readily available, ranging from niche programs, customized scripts and extensions for commercial statistical and GIS packages, to a burgeoning open source effort using software environments such as R, Java and Python. This is exemplified by the growing contents of the software tools clearing house maintained by the U.S.-based Center for Spatially Integrated Social Science (CSISS). [1]

CSISS was established in 1999 as a research infrastructure project funded by the U.S. National Science Foundation in order to promote a spatial analytical perspective in the social sciences (Goodchild et al. 2000). It was readily recognized that a major instrument in disseminating and facilitating spatial data analysis would be an easy-to-use, visual and interactive software package, aimed at the non-GIS user and requiring as little as possible in terms of other software (such as GIS or statistical packages). GeoDa is the outcome of this effort.
It is envisaged as an introduction to spatial data analysis where the latter is taken to consist of visualization, exploration and explanation of interesting patterns in geographic data.

The main objective of the software is to provide the user with a natural path through an empirical spatial data analysis exercise, starting with simple mapping and geovisualization, moving on to exploration, spatial autocorrelation analysis, and ending up with spatial regression. In many respects, GeoDa is a reinvention of the original SpaceStat package (Anselin 1992), which by now has become quite dated, with only a rudimentary user interface, an antiquated architecture and performance constraints for medium and large data sets. The software was redesigned and rewritten from scratch, around the central concept of dynamically linked graphics. This means that different views of the data are represented as graphs, maps or tables, with observations selected in one view highlighted in all the others. In that respect, GeoDa is similar to a number of other modern spatial data analysis software tools, although it is quite distinct in its combination of user friendliness with an extensive range of incorporated methods. A few illustrative comparisons will help clarify its position in the current spatial analysis software landscape.

[1] See [...]

In terms of the range of spatial statistical techniques included, GeoDa is most similar to the collection of functions developed in the open source R environment. For example, descriptive spatial autocorrelation measures, rate smoothing and spatial regression are included in the spdep package, as described by Bivand and Gebhardt (2000), Bivand (2002a,b), and Bivand and Portnov (2004). In contrast to R, GeoDa is completely driven by a point and click interface and does not require any programming.
It also has more extensive mapping capability (still somewhat experimental in R) and full linking and brushing in dynamic graphics, which is currently not possible in R due to limitations in its architecture. On the other hand, GeoDa is not (yet) customizable or extensible by the user, which is one of the strengths of the R environment. In that sense, the two are seen as highly complementary, ideally with more sophisticated users graduating to R after being introduced to the techniques in GeoDa. [2]

The use of dynamic linking and brushing as a central organizing technique for data visualization has a strong tradition in exploratory data analysis (EDA), going back to the notion of linked scatterplot brushing (Stuetzle 1987), and various methods for dynamic graphics outlined in Cleveland and McGill (1988). In geographical analysis, the concept of geographic brushing was introduced by Monmonier (1989) and made operational in the Spider/Regard toolboxes of Haslett, Unwin and associates (Haslett et al. 1990, Unwin 1994). Several modern toolkits for exploratory spatial data analysis (ESDA) also incorporate dynamic linking, and, to a lesser extent, brushing. Some of these rely on interaction with a GIS for the map component, such as the linked frameworks combining XGobi or XploRe with ArcView (Cook et al. 1996, 1997, Symanzik et al. 2000), the SAGE toolbox, which uses ArcInfo (Wise et al. 2001), and the DynESDA extension for ArcView (Anselin 2000), GeoDa's immediate predecessor. Linking in these implementations is constrained by the architecture of the GIS, which limits the linking process to a single map (in GeoDa, there is no limit on the number of linked maps).

In this respect, GeoDa is similar to other freestanding modern implementations of ESDA, such as the cartographic data visualizer, or cdv (Dykes 1997), GeoVISTA Studio (Takatsuka and Gahegan 2002) and STARS (Rey and Janikas 2004).
These all include functionality for dynamic linking, and to a lesser extent, brushing. They are built in open source programming environments, such as Tcl/Tk (cdv), Java (GeoVISTA Studio) or Python (STARS), and are thus easily extensible and customizable. In contrast, GeoDa is (still) a closed box, but of these packages it provides the most extensive and flexible form of dynamic linking and brushing for both graphs and maps.

Common spatial autocorrelation statistics, such as Moran's I and even the Local Moran, are increasingly part of spatial analysis software, ranging from CrimeStat (Levine 2004), to the spdep and DCluster packages available on the open source Comprehensive R Archive Network (CRAN), [3] as well as commercial packages, such as the spatial statistics toolbox of the forthcoming release of ArcGIS 9.0 (ESRI 2004). However, at present none of these includes the range and ease of construction of spatial weights, or the capacity to carry out sensitivity analysis and visualization of these statistics, contained in GeoDa. Apart from the R spdep package, GeoDa is the only one to contain functionality for spatial regression modeling among the software mentioned here.

[2] Note that the CSISS spatial tools project is an active participant in the development of spatial data analysis methods in R, see, e.g., [...]
[3] [...]

A prototype version of the software (known as DynESDA) has been in limited circulation since early 2001 (Anselin et al. 2002a,b), but the first official release of a beta version of GeoDa occurred on February 5, [...]. The program is available for free and can be downloaded from the CSISS software tools web site (http://sal.agecon.uiuc.edu/geoda_main.php). The most recent version, [...]i, was released in January [...]. The software has been well received for both teaching and research use and has a rapidly growing body of users.
For example, after slightly more than a year since the initial release (i.e., as of the end of April 2004), the number of registered users exceeds 1,800, while increasing at a rate of about 150 new users per month.

In the remainder of the paper, we first outline the design and briefly review the overall functionality of GeoDa. This is followed by a series of illustrative examples, highlighting features of the mapping and geovisualization capabilities, exploration in multivariate EDA, spatial autocorrelation analysis, and spatial regression. The paper closes with some comments regarding future directions in the development of the software.

2 Design and Functionality

The design of GeoDa consists of an interactive environment that combines maps with statistical graphs, using the technology of dynamically linked windows. It is geared to the analysis of discrete geospatial data, i.e., objects characterized by their location in space either as points (point coordinates) or polygons (polygon boundary coordinates). The current version adheres to ESRI's shape file as the standard for storing spatial information. It contains functionality to read and write such files, as well as to convert ASCII text input files for point coordinates or boundary file coordinates to the shape file format. It uses ESRI's MapObjects LT2 technology for spatial data access, mapping and querying. The analytical functionality is implemented in a modular fashion, as a collection of C++ classes with associated methods.
In broad terms, the functionality can be classified into six categories:

- spatial data manipulation and utilities: data input, output, and conversion
- data transformation: variable transformations and creation of new variables
- mapping: choropleth maps, cartogram and map animation
- EDA: statistical graphics
- spatial autocorrelation: global and local spatial autocorrelation statistics, with inference and visualization
- spatial regression: diagnostics and maximum likelihood estimation of linear spatial regression models

The full set of functions is listed in Table 1 and is documented in detail in the GeoDa User's Guides (Anselin 2003, 2004). [4]

The software implementation consists of two important components: the user interface and graphics windows on the one hand, and the computational engine on the other hand. In the current version, all graphic windows are based on Microsoft Foundation Classes (MFC) and thus are limited to MS Windows platforms. [5] In contrast, the computational engine (including statistical operations, randomization, and spatial regression) is pure C++ code and largely cross platform.

The bulk of the graphical interface implements five basic classes of windows: histogram, box plot, scatter plot (including the Moran scatter plot), map and grid (for the table selection and calculations). The choropleth maps, including the significance and cluster maps for the local indicators of spatial autocorrelation (LISA), are derived from MapObjects classes. Three additional types of maps were developed from scratch and do not use MapObjects: the map movie (map animation), the cartogram, and the conditional maps. The three-dimensional scatter plot is implemented with the OpenGL library.

The functionality of GeoDa is invoked either through menu items or directly by clicking toolbar buttons, as illustrated in Figure 1. A number of specific applications are highlighted in the following sections, focusing on some distinctive features of the software.
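
The idea of dynamically linked windows can be illustrated with a small sketch. It is written in Python for brevity and is not GeoDa's C++/MFC implementation; the names `View`, `SelectionHub` and `brush` are hypothetical. Each window registers with a shared selection object, and a brush (a rectangle dragged over one view) updates the selection, which is then pushed to every registered view.

```python
from dataclasses import dataclass, field


@dataclass
class View:
    """Hypothetical stand-in for one window (map, histogram, scatter plot, ...)."""
    name: str
    highlighted: set = field(default_factory=set)

    def on_selection(self, ids):
        # A real window would redraw; here we only record what to highlight.
        self.highlighted = set(ids)


class SelectionHub:
    """Shared selection state that every registered view mirrors."""

    def __init__(self):
        self._views = []
        self.selected = set()

    def register(self, view):
        # New views immediately receive the current selection.
        self._views.append(view)
        view.on_selection(self.selected)

    def select(self, ids):
        # Replace the selection and notify all linked views.
        self.selected = set(ids)
        for view in self._views:
            view.on_selection(self.selected)


def brush(points, xmin, ymin, xmax, ymax):
    """Return the ids of the points that fall inside the brush rectangle."""
    return {
        obs_id
        for obs_id, (x, y) in points.items()
        if xmin <= x <= xmax and ymin <= y <= ymax
    }


if __name__ == "__main__":
    hub = SelectionHub()
    map_view, box_plot = View("map"), View("box plot")
    hub.register(map_view)
    hub.register(box_plot)

    # Made-up coordinates keyed by observation id.
    points = {0: (1.0, 1.0), 1: (2.0, 3.0), 2: (5.0, 5.0)}
    hub.select(brush(points, 0.0, 0.0, 3.0, 3.0))
    print(map_view.highlighted, box_plot.highlighted)
```

Keeping the selection in one place, rather than letting windows talk to each other directly, is one simple way to support an arbitrary number of linked maps and graphs.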

Figure 1: The opening screen with menu items and toolbar buttons

[4] A Quicktime movie with a demonstration of the main features can be found at [...]
[5] Ongoing development concerns the porting of all MFC based classes to a cross-platform architecture, using wxWindows. See also Section 7.

Table 1: GeoDa Functionality Overview

Spatial Data:
- data input from shape file (point, polygon)
- data input from text (to point or polygon shape)
- data output to text (data or shape file)
- create grid polygon shape file from text input
- centroid computation
- Thiessen polygons

Data Transformation:
- variable transformation (log, exp, etc.)
- queries, dummy variables (regime variables)
- variable algebra (addition, multiplication, etc.)
- spatial lag variable construction
- rate calculation and rate smoothing
- data table join

Mapping:
- generic quantile choropleth map
- standard deviational map
- percentile map
- outlier map (box map)
- circular cartogram
- map movie
- conditional maps
- smoothed rate map (EB, spatial smoother)
- excess rate map (standardized mortality rate, SMR)

EDA:
- histogram
- box plot
- scatter plot
- parallel coordinate plot
- three-dimensional scatter plot
- conditional plot (histogram, box plot, scatter plot)

Spatial Autocorrelation:
- spatial weights creation (rook, queen, distance, k-nearest)
- higher order spatial weights
- spatial weights characteristics (connectedness histogram)
- Moran scatterplot with inference
- bivariate Moran scatterplot with inference
- Moran scatterplot for rates (EB standardization)
- Local Moran significance map
- Local Moran cluster map
- bivariate Local Moran
- Local Moran for rates (EB standardization)

Spatial Regression:
- OLS with diagnostics (e.g., LM test, Moran's I)
- Maximum Likelihood spatial lag model
- Maximum Likelihood spatial error model
- predicted value map
- residual map

3 Mapping and Geovisualization

The bulk of the mapping and geovisualization functionality consists of a collection of specialized choropleth maps, focused on highlighting outliers in the data, so-called box maps (Anselin 1999). In addition, considerable capability is included to deal with the intrinsic variance instability of rates, in the form of empirical Bayes (EB) or spatial smoothers. [6] As mentioned in Section 2, the mapping operations use the classes contained in ESRI's MapObjects, extended with the capability for linking and brushing. GeoDa also includes a circular cartogram, [7] map animation in the form of a map movie, and conditional maps. The latter are nine micro choropleth maps constructed by conditioning on three intervals for two conditioning variables, using the principles outlined in Becker et al. (1996) and Carr et al. (2002). [8] In contrast to the traditional choropleth maps, the cartogram, map movie and conditional maps do not use MapObjects classes, and were developed from scratch.

We illustrate the rate smoothing procedure, outlier maps and linking operations. The objective in this analysis is to identify locations that have elevated mortality rates and to assess how sensitive the designation as an outlier is to the effect of rate smoothing. Using data on prostate cancer mortality in 156 counties contained in the Appalachian Cancer Network (ACN), for the period [...], we construct a box map by specifying the number of deaths as the numerator and the population as the denominator. [9] The resulting map for the crude rates (i.e., without any adjustments for differing age distributions or other relevant factors) is shown as the upper-left panel in Figure 2. Three counties are identified as outliers and shown in dark red. [10] These match the outliers selected in the box plot in the lower-left panel of the figure. The linking of all maps and graphs results in those counties also being cross-hatched on the maps. The upper-right panel in the figure represents a smoothed rate map, where the rates were transformed by means of an Empirical Bayes procedure to remove the effect of the varying population at risk.
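
The two operations just described, flagging box map outliers and EB smoothing of rates, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not GeoDa's own code: the smoother uses a global method-of-moments estimator in the spirit of Marshall (1991), the outlier rule is the usual box plot fence at 1.5 times the interquartile range, and the function names `crude_rates`, `eb_smooth` and `box_map_outliers` are hypothetical. Each smoothed rate is a weighted average of the crude rate and the overall rate, with less weight on the crude rate when the population at risk is small.

```python
from statistics import quantiles


def crude_rates(events, population):
    """Raw rate per area: events divided by population at risk."""
    return [e / p for e, p in zip(events, population)]


def eb_smooth(events, population):
    """Global empirical Bayes smoothing (method-of-moments sketch).

    Assumption: prior mean and variance are estimated from all areas,
    then each crude rate is shrunk toward the prior mean.
    """
    total_pop = sum(population)
    prior_mean = sum(events) / total_pop
    rates = crude_rates(events, population)
    avg_pop = total_pop / len(population)

    # Population-weighted variance of the crude rates around the prior mean.
    weighted_var = sum(
        p * (r - prior_mean) ** 2 for p, r in zip(population, rates)
    ) / total_pop
    # Moment estimate of the prior variance, truncated at zero.
    prior_var = max(weighted_var - prior_mean / avg_pop, 0.0)

    smoothed = []
    for p, r in zip(population, rates):
        denom = prior_var + prior_mean / p
        weight = prior_var / denom if denom > 0 else 0.0
        smoothed.append(weight * r + (1.0 - weight) * prior_mean)
    return smoothed


def box_map_outliers(values, hinge=1.5):
    """Label each value 'lower', 'upper' or None using box plot fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - hinge * iqr, q3 + hinge * iqr
    return ["lower" if v < low else "upper" if v > high else None for v in values]


if __name__ == "__main__":
    # Made-up counts for three areas, not the ACN data.
    deaths = [5, 50, 400]
    pop = [100, 10000, 10000]
    print(crude_rates(deaths, pop))
    print(eb_smooth(deaths, pop))
```

Running `box_map_outliers` on the crude rates and again on the smoothed rates gives a simple way to check whether an area's outlier status survives smoothing.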
As a result, the original outliers are no longer flagged, but a different county is identified as having elevated risk. A lower outlier is also found, shown as dark blue in the box map. [11] Note that the upper outlier is barely distinguishable, due to the small area of the county in question. This is a common problem when working with administrative units. In order to remove the potentially misleading effect of area on the perception of interesting patterns, a circular cartogram is shown in the lower-right panel of Figure 2, where the area of the circles is proportional to the value of the EB smoothed rate. The upper outlier is shown as a red circle, the lower outlier as a blue circle. The yellow circles are the counties that were outliers in the crude rate map, highlighted here as a result of linking with the other maps and graphs. [12]

[6] The EB procedure is due to Clayton and Kaldor (1987), see also Marshall (1991) and Bailey and Gatrell (1995), pp. [...]. For an alternative recent software implementation, see Anselin et al. (2004). Spatial smoothing is discussed at length in Kafadar (1996).
[7] The cartogram is constructed using the non-linear cellular automata algorithm due to Dorling (1996).
[8] The conditional maps are part of a larger set of conditional plots, which includes histograms, box plots and scatter plots.
[9] Data obtained from the National Cancer Institute SEER site (Surveillance, Epidemiology and End Results), [...]
[10] The respective counties are Cumberland, KY; Pocahontas, WV; and Forest, PA.
[11] The new upper outlier is Ohio county, WV; the lower outlier is Centre county, PA.

Figure 2: Linked box maps, box plot and cartogram, raw and smoothed prostate cancer mortality rates.

4 Multivariate EDA

Multivariate exploratory data analysis is implemented in GeoDa through linking and brushing between a collection of statistical graphs.
These include the usual histogram, box plot and scatter plot, but also a parallel coordinate plot (PCP) and three-dimensional scatter plot, as well as conditional plots (conditional histogram, box plot and scatter plot).

We illustrate some of this functionality with an exploration of the relationships between economic growth and initial development, typical of the recent spatial regional convergence literature (for an overview, see Rey 2004). We use economic data over the period [...] for 145 European regions, most of them at the NUTS II level of spatial aggregation, except for a few at the NUTS I level (for Luxembourg and the United Kingdom). [13]

[12] Note that the outliers identified may be misleading since the rate analyzed is not adjusted for differences in age distribution. In other words, the outliers shown may simply be counties with a larger proportion of older males. A much more detailed analysis is necessary before any policy conclusions may be drawn.

Figure 3: Multivariate exploratory data analysis with linking and brushing.

Figure 3 illustrates the various linked plots and map. The left-hand panel contains a simple percentile map (GDP per capita in 1989), and a three-dimensional scatter plot (for the percent