-
Notifications
You must be signed in to change notification settings - Fork 0
Correlogram Tutorial
The purpose of this tutorial is to walk a user through the basic steps necessary to calculate a correlogram using pyssage. There are a number of steps involved in going from raw data to final analysis, in order to give the end user a lot more control over what happens at each step, rather than a more simple automation where the decisions are made by the programmer.
For input, we will be starting with a data set where we have measured values located in space (i.e., each data point has a pair of coordinates associated with it). For this example we're going to use an old data set consisting of cancer mortality rates for registration districts in parts of Western Europe for the 1970's. There are 355 locations, each of which has a mortality rate for 40 different sex-specific tumor types. For the purposes of this exercise we'll use just the first of these, which was oral cancer in men.
There are any number of ways you might choose to load this data into Python; for simplicity we will assume you have loaded each of the three values into a separate list of numbers (x, y, and data), although many other forms will work just as well.
In order to begin our analysis, let's start by working backwards from where we want to begin. In order to perform a correlogram analysis we'll need our data and a set of distance class connections. The first of these is already stored in our list, but we'll need to determine the distance class connections. To determine those, we'll need to define a set of distance classes, which to be useful, requires us to have a set of pairwise distance among our points. Let's now move forward through each of these. The first thing we'll need to do is remember to import both pyssage and numpy.
import numpy
import pyssageWe need a matrix containing the pairwise geographic distances among our points. As our coordinates are stored as longitudes (x) and latitudes (y), it would make sense to calculate these as spherical distances across the surface of the globe. We can do this with the spherical_distance_matrix function of the distances unit. The only required input is our coordinates, entered into the function as numpy arrays, so
distances = pyssage.distances.spherical_distance_matrix(numpy.array(x), numpy.array(y))This function returns an n × n (355 × 355 in our example) numpy array containing the pairwise distances in km (by default) between our cancer registration locations.
The second step is to define what the distance classes are going to be for the correlogram analysis. One could easily create this manually as the requirements are only a two-column matrix where each row represents a distance class with the columns contains the minimum (inclusive) and maximum (exclusive) distances for that class, but there is also a built-in function to do this using a variety of methods for partitioning the observed distances into classes. The function create_distance_classes in the distances unit allows the user to specify if they want to fix the number of pairs per class or the width of each class or the number of classes with an equal width per class or the number of classes with an equal number of pairs per class, etc. For this example, let's have the function create 15 classes, where each class will have approximately the same number of pairs: this is the "determine pair count" option. Our previously calculated distance matrix is the input
dist_classes = pyssage.distances.create_distance_classes(distances, "determine pair count", 15)The output of this is a 15 row, 2 column numpy array.
If we were to print it we'd get the following:
>>> print(dist_classes)[[ 0. 220.41551818]
[ 220.41551818 341.13633316]
[ 341.13633316 437.56497963]
[ 437.56497963 523.78634494]
[ 523.78634494 611.28138124]
[ 611.28138124 700.56073194]
[ 700.56073194 786.57501166]
[ 786.57501166 870.80703837]
[ 870.80703837 958.21985005]
[ 958.21985005 1060.20126612]
[1060.20126612 1174.76172377]
[1174.76172377 1306.18694287]
[1306.18694287 1483.14899073]
[1483.14899073 1745.13563903]
[1745.13563903 2862.11560677]]Although not required for determining the correlogram, there is a built-in function that creates a visualization of the distribution of distances across a set of distance classes. The function draw_distance_class_distribution in the graph unit only requires the distance matrix and the distance classes to create a figure.
pyssage.graph.draw_distance_class_distribution(distances, dist_classes, figoutput=pyssage.graph.FigOutput(figshow=True))Pairwise distances are plotted on the figure in order from shortest to longest (the blue curve), with distance on the x-axis and cumulative count (or order) on the y-axis. The red lines indicate the boundaries of the distance classes. In our case, we requested that the 15 classes have the same number of pairs in each class, thus the red horizontal red lines are spaced more or less evenly along the y-axis. The actual geographic distances represented by the classes vary a lot from class to class, with smaller interval classes for shorter distances and longer interval classes for larger distances; this is quite common as with any standard distribution of points there tend to be fewer long-range distances among pairs. If we had chosen different options for our distance class creation, the blue line would remain the same, but the positions and numbers of the red lines would shift depending on our choices.
Our final step before the actual correlogram is to create a list of Connections using the distance_classes_to_connections function in the connections unit. The idea here is that the list will contain one Connections object for each distance class. The input into this function is simply our distance classes and distances matrix
dc_cons = pyssage.connections.distance_classes_to_connections(dist_classes, distances)With that step completed we can calculate our correlogram using the correlogram function of the correlograms unit. The only required inputs are the actual data (which we haven't used yet) as a numpy.array and our list of connections objects from the previous step. By default the function will calculate a correlogram using Moran's I with a random permutation assumption for variance determination, so we'll stick with these.
output, output_text, _ = pyssage.correlogram.correlogram(numpy.array(data), dc_cons)This function produces three outputs, but the third is only meaningful if we did permutation tests. The second output is a list of strings representing a textual output of the analysis. If we print these strings we get:
for line in output_text:
print(line)Moran's I Correlogram
# of data points = 355
Distribution assumption = random
Min dist Max dist # pairs Expected Moran's I SD Z Prob
---------------------------------------------------------------------------------------
0.00000 220.41552 4188 -0.00282 0.68555 0.01474 46.70430 0.00000
220.41552 341.13633 4189 -0.00282 0.44218 0.01468 30.30310 0.00000
341.13633 437.56498 4189 -0.00282 0.29106 0.01473 19.95164 0.00000
437.56498 523.78634 4189 -0.00282 0.17043 0.01473 11.76020 0.00000
523.78634 611.28138 4189 -0.00282 0.05065 0.01473 3.63120 0.00028
611.28138 700.56073 4189 -0.00282 -0.07625 0.01470 4.99334 0.00000
700.56073 786.57501 4189 -0.00282 -0.18170 0.01475 12.12636 0.00000
786.57501 870.80704 4189 -0.00282 -0.24044 0.01478 16.07953 0.00000
870.80704 958.21985 4189 -0.00282 -0.28434 0.01481 19.00977 0.00000
958.21985 1060.20127 4189 -0.00282 -0.29076 0.01481 19.44528 0.00000
1060.20127 1174.76172 4189 -0.00282 -0.32524 0.01478 21.81535 0.00000
1174.76172 1306.18694 4189 -0.00282 -0.30147 0.01472 20.29152 0.00000
1306.18694 1483.14899 4189 -0.00282 -0.23113 0.01467 15.56793 0.00000
1483.14899 1745.13564 4189 -0.00282 -0.07521 0.01445 5.01045 0.00000
1745.13564 2862.11561 4190 -0.00282 0.32439 0.01290 25.36083 0.00000The first output of the function is a list of lists which contains all of the numbers in the data table shown in the textual output above. We can use this if you want direct access to the numbers, for example if you wanted to graph the correlogram using the draw_correlogram function in the graph unit.
pyssage.graph.draw_correlogram(numpy.array(output), metric_title="Moran's I", figoutput=pyssage.graph.FigOutput(figshow=True))Put it all together and leaving out the optional steps we have:
import numpy
import pyssage
# we are assuming that you have imported the coordinates and data as individual lists of numbers, called
# x, y, and data
distances = pyssage.distances.spherical_distance_matrix(numpy.array(x), numpy.array(y))
dist_classes = pyssage.distances.create_distance_classes(distances, "determine pair count", 15)
dc_cons = pyssage.connections.distance_classes_to_connections(dist_classes, distances)
output, output_text, _ = pyssage.correlogram.correlogram(numpy.array(data), dc_cons)
# to show the text output
for line in output_text:
print(line)
# to show the output as a figure on the screen
pyssage.graph.draw_correlogram(numpy.array(output), metric_title="Moran's I", figoutput=pyssage.graph.FigOutput(figshow=True))That is it! There are obviously a lot of options along the way allowing you to tailor your analysis to your own needs, but this describes the basic parts of determining and displaying a correlogram in pyssage.