Eating locally grown food can reduce the carbon footprint of that food by reducing fuel-miles. Spending locally on food also supports the local economy. Currently, the dominance of the trucking and shipping freight industries allows large agricultural areas to centralize food production and deliver that food to a wide network of markets. Local food networks have declined precipitously, to the point where a supermarket is more likely to carry tropical fruit out of season than local produce in season.
Local food networks have shifted in shape and character as they persist in an increasingly global marketplace. Farmers’ markets allow small local farms to sell their produce at retail, direct to the consumer. Grocers’ co-ops and high-end grocery stores continue to stock local produce. Organic wholesale distributors sell it to restaurants. Community-supported agriculture programs (CSAs) also meet growing demand by allowing consumers to buy shares of a farm’s produce, delivered directly to the consumer on a weekly basis.
There is a lack of scholarly data and research on the character and function of local food distribution networks. The most challenging part of the analysis is gathering the data. This project examines the spatial distribution of farms serving the Corvallis Farmer’s Market. For each vendor in the Corvallis Farmer’s Market list of vendors, I searched the farm name and address and used Google Maps to estimate the location of each farm. Google Maps does not identify rural addresses precisely enough to associate with specific plots of farmland, so estimates of farm size come from survey data.
Survey Depth vs. Breadth
I wanted covariate information on these farms that a web search would not provide for every farm, so I knew early in the analysis that I would need a survey. To get a decent response rate, I wanted a month to gather responses so that I could send out a few reminder waves. To meet this deadline, the list of farms to survey had to be complete within a month. The result is a dataset that is small, but that contains sufficient covariate data to fit a generalized linear model in R.
There are several good options for improving the breadth of the survey: examining vendor lists for the Salem Saturday Market and the Portland PSU Farmer’s Market, contacting local groceries for a list of their local vendors, and contacting organic wholesale providers for a list of local vendors. Among the farms in the dataset, many more have phone numbers than emails, so a phone survey could improve the number of responses from the current list.
Farms Surveyed vs. Responding Farms
Key Variables
For each farm, I gathered data on the following variables:
- Miles – Distance from Corvallis Farmer’s Market in Miles
- Acres – Size of Farm in Acres
- Age – Years Under Current Management
- Age2 – Squared value of Age term
- Local – Proportion of Sales to Local Markets
Miles – Distance from Corvallis Farmer’s Market
The location of each farm is an estimate based on the address marker placement in Google Maps. In rural neighborhoods, the marker frequently falls in the middle of the road and does not clearly correspond to a specific plot. Some farms also list as their business address an urban location that clearly is not where the food is grown.
I estimated distance from the farms to the Corvallis Farmer’s Market in ArcGIS using the Analysis > Proximity > Point Distance tool in the Toolbox.
In retrospect, I am not convinced that distance “as the crow flies” is a good representation of spatial auto-correlation between producers and markets. Fish do not swim as the crow flies from spawning grounds to the ocean, but rather travel along stream networks. Likewise, traffic follows street and highway networks. Estimating distance from farm to market by measuring likely traffic routes may be a better way to measure auto-correlation across spatial distances.
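For reference, one common way to compute a straight-line distance between two coordinate pairs is the great-circle (haversine) formula. The sketch below is a minimal illustration in Python, not the ArcGIS Point Distance tool used in this project. The function name, the Earth radius constant, and both coordinate pairs are assumptions made for illustration; the coordinates are placeholders, not surveyed farm locations.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Placeholder coordinates (hypothetical, for illustration only).
market = (44.56, -123.26)
farm = (44.70, -123.10)

print(round(haversine_miles(*market, *farm), 1))
```

A route-based measure cannot be derived from the two coordinates alone; it would need a road network and a path-length calculation in place of this function.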
Surveying Farms
I used an email survey to gather covariate information on farms, including farm age, size in acres, and percent of local sales. Of the 47 vendors on the Corvallis Farmer’s Market list with an active email address, 30 responded to the survey, a 64% response rate. An additional 63 farms have telephone contact information but no email address, so the 30 responses represent only 27% of the 110 vendors with any contact information. A round of phone surveys would help the sample better represent the target population.
The Corvallis Farmer’s Market lists 128 vendors. The Salem Saturday Market lists 284 vendors, and the Portland PSU Farmer’s Market lists 163 vendors. In addition, the Natural Choice Directory of Willamette Valley lists 12 wholesale distributors likely to carry produce from local farms. At similar response rates, surveying these farms would produce a sizeable dataset and would be a fruitful subject for future study. Surveying the two additional Farmer’s Markets would yield about 120 responses at the same 27% response rate, not counting the wholesale distributors, although some farms would likely appear on more than one vendor list.
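The response-rate arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation using the counts reported in this section; the variable names are mine, and the projection assumes that the other two markets would respond at the same overall rate and that no farm appears on more than one list.

```python
# Counts reported in the text above.
email_contacts = 47      # vendors with an active email address
phone_only = 63          # vendors with a phone number but no email
responses = 30           # completed surveys
salem_vendors = 284
portland_vendors = 163

email_rate = responses / email_contacts
overall_rate = responses / (email_contacts + phone_only)

# Assumption: same overall rate at the other markets, no overlap between lists.
projected = (salem_vendors + portland_vendors) * overall_rate

print(f"email response rate:   {email_rate:.0%}")
print(f"overall response rate: {overall_rate:.0%}")
print(f"projected responses:   {projected:.0f}")
```

The projection lands near the figure of about 120 quoted above; the exact value depends on whether the rate is rounded to 27% before multiplying.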
The variable for proportion of food going to local markets was problematic. Defining which markets qualify as local is difficult, because small differences in distance can be significant to smaller farms, while a larger farm may consider all in-state sales to be local. There is no single threshold for local sales that suits the variety of farms in the survey, so respondents applied different definitions of local and their answers are not comparable to each other. Therefore, I removed this variable from the model formula and do not use it in the analysis below.
Analyzing Spatial Data in R
There are multiple reasons to prefer analyzing spatial data in R. Perhaps the researcher is more familiar with R, as I am. Perhaps the researcher needs a generalized linear or additive model (GLM, GAM, or GAMM), or a mixed effects model, for data that are not normally distributed. While ArcGIS has several useful tools for spatial statistics, they cannot match the variety and flexibility of R and its growing collection of user-developed packages. My particular interest in handling GIS data in R is to use R’s social network statistical packages to analyze food networks. This dataset on farms does not include network data, but this project brings me one step closer to that kind of analysis, and future surveys could expand the current farm dataset to include network elements important to my research.
The following diagram shows a generalized workflow for transferring spatial data between ArcGIS and R:
While several R packages allow the user to manipulate spatial data, the most popular is ‘maptools’. I recommend using this package to open the spatial data file in R, which imports the file in a format similar to a dataframe. The spatial data object differs from a dataframe in several important ways, however, and most statistical operations are designed to work on dataframes. Therefore, I recommend converting the imported object to a dataframe using the R function as.data.frame(). Perform all statistical analyses on the new dataframe, then save significant results to the initial spatial data object, and export the appended spatial data object to a new shapefile using the maptools function writeSpatialShape().
For farms serving the Corvallis Farmer’s Market, I suspected there was a relationship between key variables in the survey and distance from the farmer’s market, but I did not have a specific causal theory, so I relied on basic data analysis to inform my model. The relationship between the key variables was uncertain, as shown in this diagram:
First, I fit three different models, each using a different variable as the dependent variable. Then I examined the fit of each model to determine whether one model best explained the data. The following three charts show the tests for normality of the residual errors for each model. With Acres or Age as the dependent variable, the Q-Q plots showed heavy tails, whereas with Miles as the dependent variable the errors were relatively normally distributed.
The plots of residuals vs. fitted values showed a possibly curvilinear relationship with Acres as the dependent variable, with residual error seeming to increase as acreage increases, then decrease as acreage increases further. This curvilinear relationship was even more pronounced with Age as the dependent variable, meaning this model clearly violates the assumption that errors are independent of the values of the dependent variable. With Miles as the dependent variable, the residual errors appear to satisfy the assumption of independence, as they are distributed around zero regardless of how far the farm was from the market.
In retrospect, these results make sense. This analysis assumes there is spatial auto-correlation between farms: farms nearer the Corvallis Farmer’s Market will be more similar to each other than to farms farther from the market. Conceptually, the best way to fit this auto-correlation to a model is to use Miles from the Farmer’s Market as the outcome variable. This is sufficient a priori reason to select this model, but the model with Miles as the dependent variable also best meets the model assumptions of independence and normality of errors. Because I dropped Local from the variable list, in the selected model Miles depends on farm size in acres, age in years under current management, and the squared age term.
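Written out, the selected specification has the following linear predictor. This is a sketch under the assumption that the model includes an intercept and no interaction terms. The coefficient symbols and the farm index i are notation introduced here for illustration. This write-up does not state which link function was used, so both the identity link of an ordinary linear model and the log link commonly used with negative binomial regression are shown.

```latex
\eta_i = \beta_0 + \beta_1\,\mathrm{Acres}_i + \beta_2\,\mathrm{Age}_i + \beta_3\,\mathrm{Age}_i^{2}

% Identity link (ordinary linear model):
\mathrm{E}[\mathrm{Miles}_i] = \eta_i

% Log link (commonly used with negative binomial regression):
\log \mathrm{E}[\mathrm{Miles}_i] = \eta_i
```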
One can save the residuals and fitted values for each observation directly to a new shapefile, but the coefficient results for the variable terms require a separate file. To save a table that imports cleanly into ArcGIS, use the ‘foreign’ package in R and call the write.dbf() function, which saves a dataframe in DBF format. The following table shows the coefficient results for the farms in the dataset, using a negative binomial regression:
After performing the analysis in R, I used the maptools function writeSpatialShape() to create a new shapefile containing the results of the analysis in its data fields. This shapefile does not contain any significant results or insights that I am aware of. The color of each diamond signifies whether the farm is closer to or farther from the market based on the covariate data, but because the sample size is so small, the only outliers in the set are either very close to or very far from the market relative to the other samples. This is an indication of data deficiency. Here is the final map of results:
However, in spite of failure to obtain initial significant results, I believe this research has real potential to shed light on the nature of local food networks in the Willamette Valley. The farm data gathered for this research is original data, not extant data from a government survey. This branch of inquiry can shed light on a part of the food system that is poorly studied and data deficient. Now that I have a data management framework for moving data from ArcGIS to R for analysis and back, I can tackle more complicated analyses, and spending additional time collecting farm observations for the dataset becomes a worthwhile endeavor.
If I sample data from additional Farmer’s Markets in Salem and Portland, I can determine whether the spatial distribution of farms differs significantly between cities, or whether the farms serving each city share common attributes. More importantly, sampling additional Farmer’s Markets and other distribution networks, such as grocers and wholesale distributors, means that some farms in the dataset will serve multiple markets: several Farmer’s Markets, or possibly grocers and restaurants as well. The number of different markets a farm serves could be a significant variable in regression analysis. Connectivity to additional local food markets is network data, so I could use network analysis to determine whether connectivity is a significant factor in relation to farm size or age.
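As a rough illustration of counting how many markets each farm serves, the Python sketch below tallies appearances across vendor lists. The farm names and list contents are hypothetical placeholders, not data from this project, and the dictionary layout is only one assumed way to store the lists.

```python
from collections import Counter

# Hypothetical vendor lists; farm names are placeholders, not survey data.
vendor_lists = {
    "Corvallis": {"Farm A", "Farm B", "Farm C"},
    "Salem": {"Farm B", "Farm D"},
    "Portland": {"Farm B", "Farm C", "Farm E"},
}

# Number of markets each farm appears in (its degree in a farm-market network).
markets_served = Counter(
    farm for farms in vendor_lists.values() for farm in farms
)

# Farms on more than one list would only need to be surveyed once.
repeat_farms = sorted(farm for farm, n in markets_served.items() if n > 1)

print(dict(sorted(markets_served.items())))
print(repeat_farms)
```

The resulting count per farm could enter a regression as an ordinary covariate, while the underlying farm-to-market pairs are the edge list that a network analysis would start from.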
There is a natural tension as I gather data between questions that are useful for analysis and questions that are realistic to answer within the scope of a survey. I achieved a 64% response rate by limiting the survey to only three questions. The fewer and less invasive the questions, the higher the response rate. If I had asked for nearly as much information as I wanted, I expect my response rate would have dropped to less than 1%. Every variable I add to the survey reduces the odds of a response. Developing this dataset therefore requires careful consideration of each additional variable, to determine whether the benefit to the analysis justifies the imposition on the survey participant. While I feel that there remains room for additional variables in the analysis, I am still looking for candidate variables that justify the increased transaction costs of their inclusion.
This project has been an enjoyable analysis of a topic in which I am very interested. I hope I can get the chance to continue pursuing this research.