For this week's assignment, we were tasked with exploring our dataset using some basic exploratory spatial statistics tools in ArcGIS (Average Nearest Neighbor and/or Spatial Autocorrelation, plus Hot Spot Analysis). Since my underlying problem is to interpret subsurface geologic characteristics throughout the Northern Gulf of Mexico to fill spatial gaps, I need to understand both the distribution of my sampling points (n=13625) and the spatial distribution of the subsurface geologic characteristics associated with each sampling point, such as average porosity, initial temperature (°F), and initial pressure (psi). Therefore, to get at the initial distribution of my sampling points, I used Average Nearest Neighbor to identify whether the distribution of my sampling points tended to be clustered, random, or dispersed. Results (Table 1) showed that my sampling points were significantly clustered, which agrees with the patterns observed visually (Figure 1).
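Since the Average Nearest Neighbor tool can also be scripted, here is a minimal sketch of that first check; the workspace path and the "boreholes.shp" layer name are hypothetical stand-ins, and the derived-output index order is assumed from the tool's documentation:

```python
# A minimal sketch of the Average Nearest Neighbor check, assuming a
# hypothetical projected point feature class "boreholes.shp".
import arcpy

arcpy.env.workspace = r"C:\data\gom"  # hypothetical workspace

# The tool reports its statistics as derived outputs; the index order
# (ratio, z-score, p-value, expected, observed) is assumed per Esri's docs.
ann = arcpy.AverageNearestNeighbor_stats("boreholes.shp",
                                         "EUCLIDEAN_DISTANCE",
                                         "NO_REPORT")
print("NN ratio:", ann.getOutput(0))
print("z-score:", ann.getOutput(1))   # large negative z => significant clustering
print("p-value:", ann.getOutput(2))
```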

Table 1. Resulting z-scores and p-values from the Average Nearest Neighbor and Spatial Autocorrelation tests for the full sampling dataset and the subsampled datasets.
Figure 1. Locations of sampling points (boreholes; n=13625) throughout the Northern Gulf of Mexico.

However, now that I know my sampling points are clustered, how could I be sure that the spatial pattern of the subsurface geologic characteristics wouldn't just reflect the clustered sampling distribution? To address this, I subsampled my data points, first to a smaller geographic area, the Mississippi Canyon Outer Continental Shelf (OCS) lease block (n=367; Figure 2A), and then further subsampled those points (using the Create Random Points tool in ArcGIS) to select sets of data points (n=50) with clustered, random, and dispersed spatial distributions (verified using Average Nearest Neighbor; Figure 2B). Then, I ran the Spatial Autocorrelation tool for each subsample (all of Mississippi Canyon, plus the clustered, random, and dispersed samples within Mississippi Canyon), which showed that regardless of the distribution of my sampling points, the values for temperature, pressure, and porosity are significantly clustered (Table 1).

The next spatial statistics tool we were asked to test with our dataset was the Hot Spot Analysis tool. I ran this tool on the temperature values for the Mississippi Canyon (n=367) subsample to identify whether there are significant spatial clusters of high and low temperature values. The results (Figure 2C) show that there are significant clusters of both high temperatures (red dots and triangles) and low temperatures (blue dots and triangles). Now, the next step is to begin exploring the relationships between different subsurface geologic characteristics and different environmental conditions, such as water depth, subsurface depth, and geologic age, to identify any correlations that can be used to help fill in spatial gaps of subsurface geologic characteristics throughout the Northern Gulf of Mexico.

Figure 2. Locations of the subsampled data points in Mississippi Canyon (n=367; A), the subselected data samples with clustered, random, and dispersed distributions (n=50; B), and the results of the Hot Spot Analysis tool for temperature from the Mississippi Canyon subsample dataset (n=367; C).
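For reference, here is a hedged sketch of the subsampling and hot spot steps described above; the layer and field names (e.g., "miss_canyon_boreholes.shp", "TEMP_F") are hypothetical stand-ins for my actual data:

```python
# A sketch of the subsampling and hot spot workflow, with hypothetical names.
import arcpy

arcpy.env.workspace = r"C:\data\gom"  # hypothetical workspace

# Create Random Points constrained to existing borehole locations randomly
# selects 50 of the Mississippi Canyon points for a subsample.
arcpy.CreateRandomPoints_management(arcpy.env.workspace, "mc_random_50",
                                    "miss_canyon_boreholes.shp",
                                    number_of_points_or_field=50)

# Hot Spot Analysis (Getis-Ord Gi*) on bottom-hole temperature; the output
# layer carries a GiZScore field (positive = hot spot, negative = cold spot).
arcpy.HotSpots_stats("miss_canyon_boreholes.shp", "TEMP_F",
                     "mc_temp_hotspots.shp",
                     "FIXED_DISTANCE_BAND", "EUCLIDEAN_DISTANCE", "NONE")
```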

6 thoughts on “Discerning a variable’s spatial pattern within a clustered dataset”

  1. Hi Jen,

    First off, I wanted to say I really like your approach of experimenting with different subset distribution patterns and testing for effects. It’s a great technique for getting a feel for how datasets behave (as you probably already know). It always takes more work to do that type of analysis, so bravo for taking the time to play with it.

    Regarding your project itself, one thing I found myself wondering is how the results of spatial dependence might inform your overall study. I’m still struggling to grasp the idea of Moran’s I results, but in your context I believe it’s really telling you: “Is my given parameter (temp, etc.) fairly homogeneous over the extent of the distance between the closest x number of sampling points?”

    With your already clustered dataset, specifying a fixed distance window for what defines a “neighbor” may help alleviate concerns about clumping, though some areas will simply have better data resolution than others and may still appear clumped at that scale (say, over 1 km).

    Overall, enjoyed the read!

    • Thanks, Max, for the comment. The results of this study will be used to create a model; knowing how the subsurface characteristics vary over space, and whether there are any key relationships that drive those spatial patterns, will improve the ‘fit’ of our model and help reduce its associated uncertainty. I can understand that Moran’s I can be a difficult concept to grasp, as the implication of the results varies by study. The way you described it is essentially correct, but remember that when running a tool that outputs a z-score, it’s actually testing the null hypothesis that your data are randomly distributed over the spatial extent of your data layer.

      As for your suggestion of setting a minimum nearest neighbor distance to avoid sampling clustered datasets, I agree; it is a very good suggestion. I’ll have to see which tools allow you to do that, as some don’t have that functionality built in. Otherwise, I might need to modify the tools in Python.
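      For example, the Spatial Autocorrelation (Global Moran’s I) tool already accepts a fixed distance band, so a minimal sketch of Max’s idea (with hypothetical layer and field names, and derived-output order assumed from the tool docs) might look like this:

      ```python
      import arcpy

      # Hypothetical inputs: a projected borehole layer and a temperature field.
      # "1000" is a 1 km neighborhood in the coordinate system's linear units.
      moran = arcpy.SpatialAutocorrelation_stats(
          r"C:\data\gom\boreholes.shp", "TEMP_F", "NO_REPORT",
          "FIXED_DISTANCE_BAND", "EUCLIDEAN_DISTANCE", "ROW", "1000")

      # Derived-output order (index, z-score, p-value) assumed per the tool docs.
      print("Moran's I:", moran.getOutput(0))
      print("z-score:", moran.getOutput(1))
      print("p-value:", moran.getOutput(2))
      ```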

      Thanks again for the suggestions!

  2. Hi Jen,

    This is really nice work. I’m intrigued by Figure 2C, the hot spot analysis, in which points are coded by z-score (blue is negative, red is positive). Is it correct that these z-scores represent lower than average (blue) and higher than average (red) temperatures?

    If so, it would be very interesting to look more closely at the bathymetry and make some interpretations about geological processes. Specifically, it looks like the area in the top right where there is a cluster of blue points has an “oatmeal running down a slanting table” appearance typical of mass movements, i.e. submarine landslides. In contrast, the ocean floor in the area with the red points (higher than average temperatures?) appears to be smooth and relatively flat. It makes sense to me that mass movement deposits (submarine landslides) would be relatively cool, because they may be less cohesive and permit water to circulate, whereas the rocks below the red points might be more cohesive rocks connected more directly to the mantle, hence with more rapidly warming temperatures.

    Could you create some plots of temperature with depth for these two areas? What sort of temperature profile would you expect in these two settings?

    Do you have information on substrate and porosity? What porosity would you expect in landslide deposits compared to coherent rock? Can you make some interpretations about the causes of the spatial pattern of temperature?

    I look forward to seeing your next steps and your conjectures!

    Julia

    • Thanks for the comment, Julia. As far as I’ve been able to tell from all the information on hot spot analysis, you are correct: these z-scores represent lower than average (blue) and higher than average (red) temperatures. I also agree with you that the next step would be to begin comparing the spatial distribution of temperature (and eventually other parameters) with bathymetry, porosity, and subsurface depth to get at the underlying causal factors. I’m not sure if I have any spatial information on the substrates, but I’ll check. I’ll also work on deriving some layers and graphs to get at the temperature profile and look into the spatial patterns between temperature, bathymetry, subsurface depth, and porosity.
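      As a starting point, a rough sketch of the temperature-depth profiles you suggest might look like the following; the CSV export, column names, and cluster labels are all hypothetical:

      ```python
      import pandas as pd
      import matplotlib.pyplot as plt

      df = pd.read_csv("mc_boreholes.csv")  # hypothetical export from ArcGIS

      fig, ax = plt.subplots()
      # Hypothetical "area" labels separating the cool (blue) and warm (red) clusters.
      for area, color in [("cool_cluster", "blue"), ("warm_cluster", "red")]:
          sub = df[df["area"] == area]
          ax.scatter(sub["temp_f"], sub["depth_ft"], color=color, label=area, s=10)

      ax.invert_yaxis()  # plot depth increasing downward, as in a well profile
      ax.set_xlabel("Bottom-hole temperature (°F)")
      ax.set_ylabel("Subsurface depth (ft)")
      ax.legend()
      plt.show()
      ```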

      Thanks for the suggestions!

  3. Hi Jen,
    Nice work! One thing to keep in mind whenever using Average Nearest Neighbor is that the statistical significance it reports is strongly impacted by the size of your study area. Imagine a handful of points randomly deposited in the middle of a large piece of white paper. Now draw a very, very tight convex hull boundary around those points. When you consider the spatial pattern of the points within the very tight boundary, you will likely conclude they appear dispersed or random. Next, for those same points, create a very, very large boundary to enclose the points (so there is a ton of white space all around the edges of the points). Now when you observe the spatial pattern of those same points within the large boundary, you will likely conclude that the points seem to be clustered in the center of the boundary/study area. In both cases you are correct. Because the Average Nearest Neighbor tool uses the AREA of the study area in the mathematics for computing statistical significance, it is important to understand that study area size *will* impact the result you get. If you take the defaults for this tool, the area of a minimum enclosing rectangle around your points will be used. If the distribution of your points is not very rectangular, you will likely get a lot of white space along the edges of your points and the tool will very likely report a clustered result.

    The Observed Nearest Neighbor Distance reported is absolutely reliable and will not change at all as a result of the study area used. Consequently, this is the result I find most valuable. This tool is especially effective when you want to compare two distributions within the same fixed study area to understand which distribution (two species, or one species for two time periods, for example) is more clustered/dispersed.

    Here is an example: I want to understand if dengue fever cases are clustered within the village. If the spatial distribution of the disease is random, whether someone gets sick or not is pretty much a matter of bad luck (random chance). If, however, the cases are clustered, I have evidence that *something* (perhaps something I can remediate or modify) is impacting who does and doesn’t get the disease.
    Step 1: Determine the area of the village, because I want to use this same value in every analysis I do (fixing the study area size so I can make valid comparisons).
    Step 2: Since I collect case data by household, if the households themselves are clustered this will impact any evaluation I make about the clustering of disease cases. Consequently, I measure the Observed Average Nearest Neighbor distance for households to provide a baseline.
    Step 3: I measure the observed nearest neighbor distance of positive dengue cases and compare this to the baseline. I find that the cases are MUCH more clustered than the households. I can conclude, therefore, that dengue cases are clustered and the clustering cannot be explained by the distribution of households alone. The next steps are to determine WHERE the clustering occurs (hot spot or cluster/outlier analysis), and then to determine what factors (variables) might explain the observed (hot spot map) spatial distribution of cases (Exploratory Regression). Once I know the factors promoting the disease, I am in a good position to remediate.
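    To make the fixed-study-area idea concrete, here is a minimal sketch of Steps 1-3 in arcpy; the layer names and the village area value are hypothetical, and the derived-output index for the observed distance is assumed from the tool's documentation:

    ```python
    import arcpy

    VILLAGE_AREA = "2500000"  # hypothetical fixed study area, in square meters

    # Run Average Nearest Neighbor twice with the SAME fixed Area so the two
    # distributions are compared on equal footing.
    households = arcpy.AverageNearestNeighbor_stats(
        "households.shp", "EUCLIDEAN_DISTANCE", "NO_REPORT", VILLAGE_AREA)
    cases = arcpy.AverageNearestNeighbor_stats(
        "dengue_cases.shp", "EUCLIDEAN_DISTANCE", "NO_REPORT", VILLAGE_AREA)

    # Observed mean nearest neighbor distance (assumed derived output index 4)
    # does not depend on the study area; compare cases against the baseline.
    print("Households observed NN distance:", households.getOutput(4))
    print("Cases observed NN distance:", cases.getOutput(4))
    ```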

    Also (as you say) a large positive z-score represents a statistically significant cluster of high values. A large but negative z-score represents a statistically significant cluster of low values. It is important to remember that a feature might be bright red on the map not because its value is particularly large but because it is a part of a spatial cluster of high values. The Hot Spot tool works by looking at each feature value within the context of neighboring feature values. If the value for a feature is super high, that may be interesting. But if the value is super high and its neighboring values are also super high, then you very likely have a statistically significant hot spot! Even if a feature has a somewhat low value, if it is surrounded by neighbors with super high values, it will likely appear bright red to indicate it is part of a hot spot cluster. A nice thing about the hot spot z-scores is the larger the value, the hotter the hot spot (the more negative a z-score the colder a cold spot is).

    One more thing… if the samples are uneven (some places you managed to take LOTS of samples, other places you could only get a few samples), this will have only a mild impact on results. The hot spot tool will have more information to compute a result in places with lots of samples (it will compare the local mean based on lots of samples to the global mean based on ALL samples for the entire study area and decide if the difference is statistically significant or not). Where there are few samples the local mean will be computed from only a few observations/samples… the tool will compare the local mean (based on only a few pieces of information) to the global mean (based on ALL samples) and determine if the difference is significant.

    Best wishes with your research!
    Lauren

  4. Hey Jen,

    From discussions with industry scientists and other geophysicists, here are a few items to consider if using this data to assess subsurface temperature trends going forward: i) the bottom-hole temperatures (BHTs) reported by BSEE/BOEM that are used in this analysis are not consistently acquired or well documented with regard to uncertainty and approach. Apparently, there is no standard protocol for the acquisition of BHTs by the rig operators, so the values can vary depending on how long the borehole was left exposed before the temperature was noted, and at times may just reflect a gross estimate by the driller or other rig personnel. ii) The general spatial pattern associated with the data is consistent with what basin modelers for the GOM would expect and is driven largely by sedimentation rate and thickness, but also by composition: shale and salt bodies act as insulators and thus can have a significant impact on the thermal gradient regardless of location (shelf, slope, or basin).
