Tag Archives: Confusion Matrix

Using LiDAR data to assess forest aesthetics in McDonald-Dunn Forest

Bryan Begay



  1. Question asked?

How do forest aesthetics vary depending on forest structure as a result of active management that retains vegetation and landforms?

To answer my question, I looked at two stands in the McDonald-Dunn Forest and analyzed how their forest structure relates to forest aesthetics. The first stand is Saddleback, which was logged in 1999 and shows signs of management. The second stand is near Baker Creek, 1 mile west of Saddleback. Baker Creek was chosen for its riparian characteristics and because it shows no signs of management activities.

  1. A description of the data set.

The LiDAR data that I used in my initial analysis was the 2008 DOGAMI LiDAR flown over the McDonald-Dunn Forest. The RMSE for this data set was 1.5 cm. The DEM data set I used was from 2009 and has an RMSE of 0.3 m. A Canopy Height Model (CHM) was made with the lidR package in RStudio from a digital surface model with a 1 meter resolution. The CHM was used to create an individual tree segmentation, and the segmented trees were then converted to point data.


A link to a visualization of the raw point cloud that is georeferenced to its terrain.

  1. Hypotheses: predictions of patterns and processes you looked for.

I initially suspected that the Baker Creek stand would have higher forest aesthetics, which would be reflected in the stand’s unmanaged vegetation structure.


Since the Saddleback stand had been managed and cut, I figured the more natural structure of the riparian stand would generally have higher forest aesthetics than a stand that has been altered by anthropogenic factors. One process that I hypothesized relates to forest aesthetics in these stands was the spatial point pattern of trees.

  1. Approaches: analysis approaches you used.

Exercise 1: Ripley’s K point pattern analysis

To create a point pattern analysis, I identified individual trees and converted them to point data. The lidR package in RStudio was used to create a Canopy Height Model and then an individual tree segmentation. Rasters and shapefiles were created to export the data so that I could use the tree polygons to identify tree points. The spatstat package was also used in RStudio to perform a Ripley’s K analysis on the point data.
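To make the Ripley’s K step concrete, here is a minimal Python sketch of one common form of the estimator. It is not the spatstat or ArcMap implementation: it applies no edge correction (spatstat’s `Kest` does), and the tree coordinates, plot size, and radii are hypothetical values chosen for illustration.

```python
import math

def ripley_k(points, radii, area):
    """Naive Ripley's K estimate (no edge correction) for 2-D points."""
    n = len(points)
    intensity = n / area  # trees per unit area
    k_values = []
    for r in radii:
        pairs = 0
        # Count ordered pairs (i, j), i != j, that lie within distance r
        for i, (xi, yi) in enumerate(points):
            for j, (xj, yj) in enumerate(points):
                if i != j and math.hypot(xi - xj, yi - yj) <= r:
                    pairs += 1
        k_values.append(pairs / (n * intensity))
    return k_values

def l_minus_r(k_values, radii):
    """Besag's L(r) - r: above 0 suggests clustering, below 0 dispersion."""
    return [math.sqrt(k / math.pi) - r for k, r in zip(k_values, radii)]

# Hypothetical tree tops on a regular 5 m grid inside a 50 m x 50 m plot
grid = [(x + 2.5, y + 2.5) for x in range(0, 50, 5) for y in range(0, 50, 5)]
radii = [3.0, 6.0]
k_grid = ripley_k(grid, radii, area=50 * 50)
l_grid = l_minus_r(k_grid, radii)  # both values are negative for this regular grid
```

A regularly spaced stand gives values below zero at short distances, which is the "dispersed" signal. Differences in edge correction and study-area definition are one plausible reason two software packages can disagree on the same points.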

Figure 1. Individual Tree Segmentation using watershed algorithm on Saddleback stand.

Exercise 2: Geographically Weighted Regression

For the geographically weighted regression (GWR), I used the polygons created from the individual tree segmentation to delineate tree centers. Tree points were created from the centroids of the polygons and served as inputs for the GWR in ArcMap. Values from a density raster and the CHM raster were extracted to the point data so that density and tree height could be the variables used in the regression. Tree height was the explanatory variable and density was the dependent variable.
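The idea behind GWR can be sketched in a few lines of Python: fit a separate weighted regression at each location, giving nearby trees more weight. This is a simplified illustration, not ArcMap’s GWR tool. The fixed Gaussian kernel, the bandwidth, and the tree coordinates, heights, and densities are all assumptions made up for the example.

```python
import math

def local_fit(focal, coords, x, y, bandwidth):
    """Weighted least-squares fit of y ~ x centred on one focal location."""
    # Gaussian distance-decay weights: nearby trees count more
    weights = [
        math.exp(-0.5 * (math.dist(focal, c) / bandwidth) ** 2) for c in coords
    ]
    w_sum = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / w_sum
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / w_sum
    sxy = sum(w * (xi - x_bar) * (yi - y_bar) for w, xi, yi in zip(weights, x, y))
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical trees along an east-west transect (coordinates in metres)
coords = [(float(e), 0.0) for e in range(0, 100, 10)]
height = [20, 25, 30, 35, 40, 20, 25, 30, 35, 40]    # explanatory variable
density = [40, 50, 60, 70, 80, 80, 75, 70, 65, 60]   # dependent variable

_, west_slope = local_fit((20.0, 0.0), coords, height, density, bandwidth=10.0)
_, east_slope = local_fit((70.0, 0.0), coords, height, density, bandwidth=10.0)
```

In this toy data the local slope is positive in the west and negative in the east, which is the kind of spatially varying relationship that a single global regression would average away.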

Figure 2. Polygon output from Individual tree segmentation using the lidR package. The Watershed algorithm was the means of segmentation, and points were created from polygon centroids in ArcMap.

Exercise 3: Supervised Classification

This analysis involved creating a supervised classification using training data from NAIP imagery and a maximum likelihood classification algorithm. It used the NIR band to create a false color image that shows the difference in spectral reflectance between conifers and deciduous trees. I used a histogram stretch to visualize the imagery better and spent time gathering quality training data. I then created a confusion matrix by using accuracy points on the training data. Finally, I clipped the thematic map outputs with my individual tree segmentation polygons to show how the pixels in each tree were assigned.

  1. Results

The Ripley’s K analysis in ArcMap showed that the Saddleback stand’s trees are dispersed and the Baker Creek stand’s trees are spatially clustered. The GWR model for the Saddleback stand produced a map in which tree height and density were positively related. Its adjusted R² was 0.69, and the output showed that the tallest and densest trees were on the edges of the Saddleback stand. The Baker Creek stand’s model performed poorly on the point data, with an adjusted R² of 0.5, and relationships could only be modeled in the upper left of the stand. The classified image worked well on the Saddleback stand because the NAIP imagery had less distortion there. The Baker Creek stand’s classification was not useful because of significant distortion in the NAIP imagery.

Exercise 1:

Figure 3. ArcMap Ripley’s K function output for Saddleback stand assessing tree points.

Exercise 2:

Figure 4. Geographically weighted regression of the Baker Creek and Saddleback stands. Hotter colors indicate positive relationships between tree density and tree height.

Exercise 3:

Figure 5. Supervised image classification using a maximum likelihood algorithm on Saddleback stand.

  1. What did you learn from your results? How are these results important to science? to resource managers?

I learned that Ripley’s K outputs can differ depending on which software is used. The RStudio Ripley’s K outputs told me that both of my stands had clustered tree patterns. The ArcMap outputs, which made more sense, told me that my Saddleback stand was actually dispersed. Outputs can vary if inputs are not explicitly understood or modeled with enough care. I also learned that a very heterogeneous riparian stand is more difficult to model because of its variability. This is important for researchers who are interested in riparian areas like Baker Creek, since they might need more variables to adequately model those stands.

  1. Your learning: what did you learn about software?

I became very familiar with processing and modeling LiDAR point clouds. I also became familiar with ModelBuilder and learned how to use R packages like spatstat. I also found a new method for making a confusion matrix in ArcMap.

  1. What did you learn about statistics or other techniques?

I learned how to do point pattern analysis with Ripley’s K on tree points, both in R and in ArcMap. I also used the spatial statistics tools in ArcMap and plan to keep using them. With GWR, I learned what it does, what its outputs mean, and how to properly interpret the results. I also became more concerned with issues of scale and networks that might affect my areas of interest.

Supervised Image classification on forested stands

Question that I asked?

Could I identify functional tree species with supervised image-classification in my stands?

I asked this question so that, if I had to do a geographically weighted regression again, I would have a deciduous or coniferous tree type in my point data as an added variable.

Name of the tool or approach that you used.

The main tool that I used for image classification was the Maximum Likelihood Classification tool in ArcMap. I also used the Create Accuracy Assessment Points tool to help create a confusion matrix in Excel.

Brief description of steps you followed to complete the analysis.

I downloaded 2016 NAIP imagery for my area of interest and used the high resolution imagery to create a false color image with the bands arranged as NIR, Red, and Green. To help delineate broad-leaf vegetation from coniferous vegetation, I applied a histogram equalize stretch that enhanced my ability to identify conifers in the landscape. From there I ran a maximum likelihood classification by drawing training data polygons on the false color imagery, using the Esri digital imagery base map as a reference image.


Once the image classification was complete, I used the Create Accuracy Assessment Points tool on my stand and then extracted the raster values from the thematic map output to those points to create a confusion matrix in Excel. I clipped the thematic map raster to the watershed polygons from my individual tree segmentation to show which pixel classifications were assigned in my tree tops.
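The confusion matrix step was done in Excel, but the same tabulation can be sketched in Python. In this minimal example the class names and the reference and classified labels at each accuracy point are hypothetical, not the values from my stands.

```python
from collections import Counter

def confusion_matrix(reference, predicted, classes):
    """Rows are reference (ground truth) classes, columns are classified classes."""
    counts = Counter(zip(reference, predicted))
    return [[counts[(ref, pred)] for pred in classes] for ref in classes]

def overall_accuracy(matrix):
    """Share of accuracy points on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

classes = ["conifer", "deciduous", "grass"]
# Hypothetical labels at eight accuracy assessment points
reference = ["conifer", "conifer", "conifer", "deciduous",
             "deciduous", "grass", "grass", "grass"]
predicted = ["conifer", "conifer", "grass", "deciduous",
             "conifer", "grass", "grass", "deciduous"]

matrix = confusion_matrix(reference, predicted, classes)
accuracy = overall_accuracy(matrix)  # 5 of 8 points agree
```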


Brief description of results you obtained.

The thematic map output was 83% accurate, with the conifer and developed land covers performing the worst in the model. Developed land cover is generally difficult to model in a landscape, and the variability in urban spectral reflectance leads to errors in modeling. The conifer land cover performed poorly because I had trouble collecting accurate training data at the imagery resolution and because the model had trouble separating conifers from grass and deciduous vegetation. Errors of commission (65% accuracy) and errors of omission (75% accuracy) led to the lower accuracy of the conifer land cover (Table 1). Despite these errors, the thematic map output performed well, and the land cover pixels in my stands showed that conifer trees were accurately assigned in the Saddleback stand (Figure 2). For the Baker Creek stand, the large amount of shadow, sun glare on canopies, and the classification cutoff led to a poor classification of that stand.
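Per-class accuracy with respect to commission and omission can be read off a confusion matrix as user's accuracy (1 minus commission error) and producer's accuracy (1 minus omission error). The sketch below shows the arithmetic on a hypothetical matrix. The counts are invented for illustration and are not the values in Table 1.

```python
def per_class_accuracy(matrix, classes):
    """Return (user's, producer's) accuracy per class.

    Assumes rows are reference classes and columns are classified classes.
    """
    users, producers = {}, {}
    for i, name in enumerate(classes):
        row_total = sum(matrix[i])                   # reference points of this class
        col_total = sum(row[i] for row in matrix)    # points classified as this class
        producers[name] = matrix[i][i] / row_total   # 1 - omission error
        users[name] = matrix[i][i] / col_total       # 1 - commission error
    return users, producers

classes = ["conifer", "deciduous", "grass"]
# Hypothetical counts: rows = reference, columns = classified
matrix = [
    [12, 2, 2],
    [6, 20, 0],
    [2, 1, 25],
]
users, producers = per_class_accuracy(matrix, classes)
# Conifer: 12 of 20 classified points are correct (user's accuracy 0.60),
# and 12 of 16 reference points were found (producer's accuracy 0.75).
```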

Figure 1. The land cover thematic map for the entire NAIP image. The cyan blue color indicates the locations of Saddleback and Baker Creek stands.

Figure 2. The land cover classification output for the Saddleback stand.

Figure 3. The land cover classified output for Baker Creek Stand. Note that the NAIP imagery that was classified did not extend to cover the entire stand. The tree crown polygons were laid below the output to show where the land cover cuts off.

Table 1. Confusion matrix for the thematic map output.


Critique of the method – what was useful, what was not?

One critique of this process is that it was time consuming to create training data detailed enough to capture the variation in the scene for my desired accuracy. Sources of error in the thematic map include shadows, resolution, and variable spectral response signatures in the remotely sensed vegetation. Shadows occluded trees that would otherwise stand out and distorted the classification enough that I had to add a land cover class for shadows to mask them out of the scene. The resolution issue means that NAIP imagery was not detailed enough for the application I had in mind. Imagery taken from drones may be a potential avenue for acquiring a higher resolution data set. The confusion matrix highlights this issue: for conifers, accuracy was 65% with respect to errors of commission and 75% with respect to errors of omission. It was difficult to identify conifer trees accurately in the training data because of the variability of the spectral reflectance and the blurred crowns at the 1 meter resolution.


Since I only did a classification, I didn’t attempt to assign functional tree species to my tree polygons. The process that comes to mind is to visually determine which classification color is more pronounced in a tree top, and then to add that species to the point data as an attribute. This process would be highly time consuming, and developing a methodology to streamline the assignment of functional tree species to my tree points is potential future work.

Overall, the thematic map outputs are useful for areas like the Saddleback stand that have fewer shadows and less distortion. The map is less useful for areas with high distortion like my Baker Creek stand.
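One way the per-tree assignment could be streamlined is a majority vote over the classified pixels that fall inside each crown polygon. The Python sketch below assumes the pixels have already been matched to tree IDs (for example by a zonal overlay). The tree IDs, labels, and the choice to ignore a "shadow" class are hypothetical.

```python
from collections import Counter, defaultdict

def majority_class_per_tree(pixel_records, ignore=("shadow",)):
    """Assign each tree the most frequent class among its crown pixels.

    pixel_records: iterable of (tree_id, class_label) pairs, one per pixel.
    Pixels whose label is in `ignore` do not vote.
    """
    votes = defaultdict(Counter)
    for tree_id, label in pixel_records:
        if label not in ignore:
            votes[tree_id][label] += 1
    return {tree_id: counts.most_common(1)[0][0] for tree_id, counts in votes.items()}

# Hypothetical classified pixels inside three crown polygons
pixels = [
    (1, "conifer"), (1, "conifer"), (1, "deciduous"),
    (2, "deciduous"), (2, "deciduous"), (2, "conifer"),
    (3, "shadow"), (3, "shadow"), (3, "shadow"), (3, "conifer"), (3, "conifer"),
]
tree_types = majority_class_per_tree(pixels)
```

A tree whose crown contains only ignored pixels gets no entry in the result, so heavily shadowed crowns would still need manual review.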

Landscape Patterns as Predictors of Tree Height

Question: Which landscape features correspond to clusters of taller than expected trees?

Methods: I performed two Hot Spot Analyses in ArcMap: one on tree height and another on distance between trees. Both were constrained to the reserved control areas of the HJ Andrews Forest. Hot spots in tree height are regions of greater than expected tree height, while hot spots in distance between trees are regions of greater than expected distance between individual trees (more dispersed trees). Hot spots and cold spots of each analysis generally overlapped. However, hot spots of tree height and spacing did not overlap in all cases, so I wanted to know what landscape features might explain this difference. Covariates I explored included slope, aspect, elevation, and landform. Since the end goal is to find landscape features that may correlate with the amount of soil carbon, I conducted this analysis with the assumption that taller trees may correlate with regions of greater soil carbon. I used the package ‘caret’ in R to calculate a confusion matrix between the Z-scores of height and distance for all the hot spot bins (-3,-2,-1,0,1,2,3), then further constrained the analysis to only the most extreme hot and cold spots (-3 and 3). I then compared mean height, distance, slope and elevation between the four combinations of the extreme hot and cold spots (Table 1).
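The cross-tabulation of hot spot bins can be illustrated with a short Python stand-in for the caret step. It assumes the bins follow the usual two-sided 90%, 95%, and 99% confidence cut-offs (|z| of 1.65, 1.96, and 2.58), which is an assumption about how the bins were defined. The Z-scores are hypothetical.

```python
from collections import Counter

def confidence_bin(z):
    """Map a Z-score to a signed bin in -3..3 (assumed 90/95/99% cut-offs)."""
    sign = 1 if z > 0 else -1
    magnitude = abs(z)
    if magnitude >= 2.58:
        return 3 * sign
    if magnitude >= 1.96:
        return 2 * sign
    if magnitude >= 1.65:
        return 1 * sign
    return 0  # not statistically significant

def cross_tabulate(height_z, distance_z):
    """Count how often each (height bin, distance bin) pair occurs."""
    return Counter(
        (confidence_bin(h), confidence_bin(d)) for h, d in zip(height_z, distance_z)
    )

# Hypothetical Z-scores for five trees from the two hot spot analyses
height_z = [3.1, 2.0, -2.7, 0.4, 2.9]
distance_z = [2.8, 0.3, -3.0, -0.2, -2.6]

table = cross_tabulate(height_z, distance_z)
agreement = sum(n for (h, d), n in table.items() if h == d) / len(height_z)
```

Cells off the diagonal, such as a height bin of 3 paired with a distance bin of -3, are the mismatches that the comparison of extreme hot and cold spots focuses on.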

Results: Regions of taller than expected trees often correspond to regions of greater than expected distances between trees, which agrees with current forest growth models (Fig. 1). Hot spots of tall trees are typically in valleys and cold spots are commonly on ridges (Fig 3 & 4). When we zoom in to the Lookout Mountain area of HJ Andrews, we see that hot spots of tall trees are more concentrated in valleys and on footslopes, and cold spots are closer to mountain ridges (Fig 3). When compared with the distance hot spot map of the same area, we see that cold spots go much further down the footslopes and even into the valleys in some cases (Fig 4). So although we have evidence for a strongly linear relationship between height and distance between trees, we also have evidence that they do not fully explain each other and other landscape features are likely at play.

Fig 1. Distance Z-scores vs. Height Z-scores from hot spot analyses show a linear relationship.

Fig 2. HJ Andrews elevation with reserved control areas in orange and inset area of Lookout Mountain hot spot maps (below)


Fig 3. Hot Spot Analysis showing hot spots of tree heights (tallest trees) in the Lookout Mountain area


Fig 4. Hot Spot Analysis showing the greatest distance between trees in the Lookout Mountain area


An elevation band that correlates with occurrences of tall trees exists up to around 1100 m, after which the number of tall trees drops off substantially (Fig. 5). Certain aspects seem to correlate with taller trees, but those relationships are harder to tease apart and I have yet to fully explore them. Greater slopes tend to correlate with shorter trees, but this relationship is not linear. There is an interesting upward trend at slopes between 30 and 50 degrees that seems to correlate with slightly taller trees, then a big drop in mean height Z-score at slopes of 60 degrees.

Fig 5. Aspect, elevation and slope compared with Z-scores of mean height.

A comparison of Z-scores from hot spot analyses of height and distances shows that although hot spots of height and distance tightly correlate, covariates that explain them are different (mean slope and elevation). When we compare the most extreme Z-scores to one another, slope, height and distance between trees are not particularly different. Mean elevation in three categories of Z-score is similar, but mean elevation in the fourth group (>3,>3) is significantly lower. A next step is to map out these

Table 1. Comparison between the most extreme Z-scores of tree height and tree spacing.

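| Height Z-Score | Distance Z-Score | Mean Height (m) | Height SD | Mean Distance (m) | Distance SD | Mean Slope (degrees) | Slope SD | Mean Elevation (m) | Elevation SD |
|---|---|---|---|---|---|---|---|---|---|
| <-3 | <-3 | 22.8 | 10.5 | 5.1 | 2.4 | 27.9 | 10.5 | 1285 | 294 |
| <-3 | >3 | 24.5 | 11.2 | 5.6 | 2 | 26.9 | 4.5 | 1377 | 153 |
| >3 | <-3 | 34.8 | 7.1 | 5 | 2 | 31.8 | 5.9 | 1310 | 44 |
| >3 | >3 | 39.5 | 16.7 | 4.7 | 2.6 | 26.2 | 11 | 934 | 188 |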

Critique: These analyses are still based on Hot Spot Analyses, so they still come with the same criticisms as previous Hot Spot Analyses. One of these criticisms is that a Hot Spot Analysis is basically a smoothing function. Since the LiDAR dataset I’m using is basically a census of tree heights, running hot spot analyses reduces the information in that dataset unnecessarily. I have yet to map out regions that were well-predicted and poorly predicted spatially, so I cannot fully discuss the merits of the confusion matrix method.
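The smoothing criticism is easy to see in a small Python sketch of the Getis-Ord Gi* statistic that Hot Spot Analysis is based on. The formula follows the commonly documented form of Gi*. The binary fixed distance band, the coordinates, and the tree heights are assumptions chosen for illustration, not the settings or data used in ArcMap.

```python
import math

def getis_ord_gi_star(values, coords, band):
    """Gi* Z-score per point, using binary weights within a fixed distance band."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    scores = []
    for ci in coords:
        # Weight 1 for every point within the band, including the point itself
        w = [1.0 if math.dist(ci, cj) <= band else 0.0 for cj in coords]
        w_sum = sum(w)
        w_sq_sum = sum(wj * wj for wj in w)
        numerator = sum(wj * v for wj, v in zip(w, values)) - mean * w_sum
        denominator = s * math.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
        scores.append(numerator / denominator)
    return scores

# Hypothetical transect: three tall trees followed by six short ones (heights in m)
coords = [(float(i), 0.0) for i in range(9)]
heights = [40, 42, 41, 20, 21, 19, 20, 22, 21]
scores = getis_ord_gi_star(heights, coords, band=1.0)
```

Each score depends on the sum of a tree's neighborhood rather than on the tree itself. In this toy transect the short tree at index 3 scores higher than the equally short tree at index 5 only because it sits next to a tall neighbor.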