Skip to content

Correlogram Tutorial

Michael Rosenberg edited this page Mar 24, 2021 · 1 revision

Table of Contents

Correlogram Tutorial

The purpose of this tutorial is to walk a user through the basic steps necessary to calculate a correlogram using pyssage. There are a number of steps involved in going from raw data to final analysis, in order to give the end user a lot more control over what happens at each step, rather than a more simple automation where the decisions are made by the programmer.

For input, we will be starting with a data set where we have measured values located in space (i.e., each data point has a pair of coordinates associated with it). For this example we're going to use an old data set consisting of cancer mortality rates for registration districts in parts of Western Europe for the 1970's. There are 355 locations, each of which has a mortality rate for 40 different sex-specific tumor types. For the purposes of this exercise we'll use just the first of these, which was oral cancer in men.

There are any number of ways you might choose to load this data into Python; for simplicity we will assume you have loaded each of the three values into a separate list of numbers (x, y, and data), although many other forms will work just as well.

In order to begin our analysis, let's start by working backwards from where we want to begin. In order to perform a correlogram analysis we'll need our data and a set of distance class connections. The first of these is already stored in our list, but we'll need to determine the distance class connections. To determine those, we'll need to define a set of distance classes, which to be useful, requires us to have a set of pairwise distance among our points. Let's now move forward through each of these. The first thing we'll need to do is remember to import both pyssage and numpy.

import numpy
import pyssage

Determine a distance matrix

We need a matrix containing the pairwise geographic distances among our points. As our coordinates are stored as longitudes (x) and latitudes (y), it would make sense to calculate these as spherical distances across the surface of the globe. We can do this with the spherical_distance_matrix function of the distances unit. The only required input is our coordinates, entered into the function as numpy arrays, so

distances = pyssage.distances.spherical_distance_matrix(numpy.array(x), numpy.array(y))

This function returns an n × n (355 × 355 in our example) numpy array containing the pairwise distances in km (by default) between our cancer registration locations.

Create Distance Classes

The second step is to define what the distance classes are going to be for the correlogram analysis. One could easily create this manually as the requirements are only a two-column matrix where each row represents a distance class with the columns contains the minimum (inclusive) and maximum (exclusive) distances for that class, but there is also a built-in function to do this using a variety of methods for partitioning the observed distances into classes. The function create_distance_classes in the distances unit allows the user to specify if they want to fix the number of pairs per class or the width of each class or the number of classes with an equal width per class or the number of classes with an equal number of pairs per class, etc. For this example, let's have the function create 15 classes, where each class will have approximately the same number of pairs: this is the "determine pair count" option. Our previously calculated distance matrix is the input

dist_classes = pyssage.distances.create_distance_classes(distances, "determine pair count", 15)

The output of this is a 15 row, 2 column numpy array.

If we were to print it we'd get the following:

>>> print(dist_classes)
[[   0.          220.41551818]
 [ 220.41551818  341.13633316]
 [ 341.13633316  437.56497963]
 [ 437.56497963  523.78634494]
 [ 523.78634494  611.28138124]
 [ 611.28138124  700.56073194]
 [ 700.56073194  786.57501166]
 [ 786.57501166  870.80703837]
 [ 870.80703837  958.21985005]
 [ 958.21985005 1060.20126612]
 [1060.20126612 1174.76172377]
 [1174.76172377 1306.18694287]
 [1306.18694287 1483.14899073]
 [1483.14899073 1745.13563903]
 [1745.13563903 2862.11560677]]
The first class goes from 0 to 220.4 km, the second from 220.4 to 341.1 km, etc.

(Optional) Visualizing the Distance Classes

Although not required for determining the correlogram, there is a built-in function that creates a visualization of the distribution of distances across a set of distance classes. The function draw_distance_class_distribution in the graph unit only requires the distance matrix and the distance classes to create a figure.

pyssage.graph.draw_distance_class_distribution(distances, dist_classes, figoutput=pyssage.graph.FigOutput(figshow=True))

Pairwise distances are plotted on the figure in order from shortest to longest (the blue curve), with distance on the x-axis and cumulative count (or order) on the y-axis. The red lines indicate the boundaries of the distance classes. In our case, we requested that the 15 classes have the same number of pairs in each class, thus the red horizontal red lines are spaced more or less evenly along the y-axis. The actual geographic distances represented by the classes vary a lot from class to class, with smaller interval classes for shorter distances and longer interval classes for larger distances; this is quite common as with any standard distribution of points there tend to be fewer long-range distances among pairs. If we had chosen different options for our distance class creation, the blue line would remain the same, but the positions and numbers of the red lines would shift depending on our choices.

Convert our Distance Classes into Connections

Our final step before the actual correlogram is to create a list of Connections using the distance_classes_to_connections function in the connections unit. The idea here is that the list will contain one Connections object for each distance class. The input into this function is simply our distance classes and distances matrix

dc_cons = pyssage.connections.distance_classes_to_connections(dist_classes, distances)

Correlogram

With that step completed we can calculate our correlogram using the correlogram function of the correlograms unit. The only required inputs are the actual data (which we haven't used yet) as a numpy.array and our list of connections objects from the previous step. By default the function will calculate a correlogram using Moran's I with a random permutation assumption for variance determination, so we'll stick with these.

output, output_text, _ = pyssage.correlogram.correlogram(numpy.array(data), dc_cons)

This function produces three outputs, but the third is only meaningful if we did permutation tests. The second output is a list of strings representing a textual output of the analysis. If we print these strings we get:

for line in output_text:
    print(line)
Moran's I Correlogram

# of data points = 355
Distribution assumption = random

  Min dist     Max dist   # pairs   Expected   Moran's I        SD          Z      Prob
---------------------------------------------------------------------------------------
   0.00000    220.41552      4188   -0.00282     0.68555   0.01474   46.70430   0.00000
 220.41552    341.13633      4189   -0.00282     0.44218   0.01468   30.30310   0.00000
 341.13633    437.56498      4189   -0.00282     0.29106   0.01473   19.95164   0.00000
 437.56498    523.78634      4189   -0.00282     0.17043   0.01473   11.76020   0.00000
 523.78634    611.28138      4189   -0.00282     0.05065   0.01473    3.63120   0.00028
 611.28138    700.56073      4189   -0.00282    -0.07625   0.01470    4.99334   0.00000
 700.56073    786.57501      4189   -0.00282    -0.18170   0.01475   12.12636   0.00000
 786.57501    870.80704      4189   -0.00282    -0.24044   0.01478   16.07953   0.00000
 870.80704    958.21985      4189   -0.00282    -0.28434   0.01481   19.00977   0.00000
 958.21985   1060.20127      4189   -0.00282    -0.29076   0.01481   19.44528   0.00000
1060.20127   1174.76172      4189   -0.00282    -0.32524   0.01478   21.81535   0.00000
1174.76172   1306.18694      4189   -0.00282    -0.30147   0.01472   20.29152   0.00000
1306.18694   1483.14899      4189   -0.00282    -0.23113   0.01467   15.56793   0.00000
1483.14899   1745.13564      4189   -0.00282    -0.07521   0.01445    5.01045   0.00000
1745.13564   2862.11561      4190   -0.00282     0.32439   0.01290   25.36083   0.00000

The first output of the function is a list of lists which contains all of the numbers in the data table shown in the textual output above. We can use this if you want direct access to the numbers, for example if you wanted to graph the correlogram using the draw_correlogram function in the graph unit.

pyssage.graph.draw_correlogram(numpy.array(output), metric_title="Moran's I", figoutput=pyssage.graph.FigOutput(figshow=True))

Summary

Put it all together and leaving out the optional steps we have:

import numpy
import pyssage

# we are assuming that you have imported the coordinates and data as individual lists of numbers, called
# x, y, and data

distances = pyssage.distances.spherical_distance_matrix(numpy.array(x), numpy.array(y))
dist_classes = pyssage.distances.create_distance_classes(distances, "determine pair count", 15)
dc_cons = pyssage.connections.distance_classes_to_connections(dist_classes, distances)
output, output_text, _ = pyssage.correlogram.correlogram(numpy.array(data), dc_cons)

# to show the text output
for line in output_text:
    print(line)

# to show the output as a figure on the screen
pyssage.graph.draw_correlogram(numpy.array(output), metric_title="Moran's I", figoutput=pyssage.graph.FigOutput(figshow=True))

That is it! There are obviously a lot of options along the way allowing you to tailor your analysis to your own needs, but this describes the basic parts of determining and displaying a correlogram in pyssage.

Clone this wiki locally