Clearly, there is a significant difference between the most common clustering and the best clustering in almost every case. Another possibility is that the rare clusterings having the best BIC scores may not correspond to geographically meaningful EC maps. That is, the best BIC may correspond to a statistically meaningful solution that does not provide insight for soil zone management. Figure 3.9 shows the geographic mappings of the best BIC clusterings from Table 3.8. We assign a different color to each EC data point based on the cluster to which it has been assigned and then graph the data points by latitude and longitude. From the figures, it is clear that the clusterings correspond to feasible zone management maps. That is, points belonging to the same cluster are often adjacent in geographic space, indicating a strong EC mapping relationship. To illustrate in greater detail, consider the CAP dataset results. CAP is a lemon field with soil consisting of clay, sandy-loam, and sandy-clay-loam. Figure 3.8a shows that jobs from Job-5 onward have very stable results with similar BIC scores. Figure 3.9a shows the best clustering from Job-2048 with 4 clusters having cardinality [2103, 500, 473, 156] and a BIC score of -8918.35. We compare this result with the most common clustering for Job-2048, which occurred 1445 times with a BIC score of -10169.7 and two clusters having 2169 and 1063 elements, respectively. The visual difference between these two clusterings shows that most of the “disagreement” appears along cluster boundaries. In addition, we consider how the clustering results compare to the soil samples taken at the CAP field. Figure 3.10c shows the core samples taken at five different locations and their soil types. Of the five core samples available, the top two in the figure belong to the same cluster in both the best and the most common clustering.
The other three core samples report clay in the lower left corner, followed by sandy-loam and sandy-clay-loam. In the best clustering, they all belong to different clusters, while the most common clustering puts all three core samples in the same cluster. Thus, the best clustering corresponds more closely to a core-sample analysis than the most common clustering does. Note that for each fixed set of parameters, we ran 1,228,800 different experiments. This is the maximum frequency that can occur for a particular value of k resulting in the best or most common BIC. Table 3.8 shows that the most common clustering usually has fewer clusters, while the best clustering provides higher resolution and therefore additional information that a farmer may find useful for management. The analysis of the other datasets is similar. In each case, the best BIC score is rare, requiring a large number of repeated trials, each with a different initialization, to determine. In all but one case, the best clustering differs substantially from the most frequently occurring clustering. The best clusterings correspond to meaningful EC soil maps, and those maps correctly register with soil core samples. We start by comparing Centaurus against MZA for the synthetic datasets. We use the number of clusters that both the FPI and NCE scores report for MZA as the optimal number of clusters. We then use the respective cluster assignments to compute the error rates. Figure 3.11 shows the best assignments produced by Centaurus and MZA, and Table 3.9 shows the percentage of incorrectly classified points in each dataset for the same assignments. For MZA, the best assignment is achieved with Mahalanobis distance, and for Centaurus the best assignment is achieved with Full-Untied. MZA clusters Dataset-1 correctly and reports K = 3 as the ideal number of clusters.
For Dataset-2, MZA correctly identifies K = 3 but has a higher error rate of 13.8%. A possible reason for this is that MZA considers only a single initial assignment of cluster centers, which in this case converges to a local minimum that is different from the global minimum. Centaurus avoids this kind of error by performing several runs of the k-means algorithm before suggesting the optimal cluster assignment. Dataset-3 consists of clusters with correlation across features.
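Correlated features are precisely where the treatment of covariance matters. The following sketch is our own illustration (not code from Centaurus or MZA, and with synthetic data of our choosing); it contrasts squared Mahalanobis distances computed with a single global covariance, in the spirit of MZA, against a per-cluster covariance, in the spirit of the untied variants:

```python
import numpy as np

def mahalanobis_sq(X, center, cov):
    # Squared Mahalanobis distance from each row of X to a center.
    diff = X - center
    return np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)

rng = np.random.default_rng(0)
# Two clusters whose features are correlated in opposite directions.
c0 = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], 300)
c1 = rng.multivariate_normal([4, 4], [[1.0, -0.9], [-0.9, 1.0]], 300)
X = np.vstack([c0, c1])

# Global covariance, computed once from all points (MZA-style).
d_global = mahalanobis_sq(c0, c0.mean(0), np.cov(X.T))
# Per-cluster covariance, one per cluster ("untied"-style).
d_local = mahalanobis_sq(c0, c0.mean(0), np.cov(c0.T))
```

The global covariance mixes the two clusters' opposite correlation structures (plus the between-cluster spread), so distances computed with it no longer reflect each cluster's own shape, which is one intuition for why per-cluster covariance can yield better label assignments on such data.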
Centaurus provides better results than MZA for this dataset, achieving a percentage error of only 0.1%. A possible reason for this is that MZA employs a global covariance matrix and does not consider the Tied and Untied options as Centaurus does, which result in better label assignments. Another limitation of MZA is that it uses a free variable, called the fuzziness parameter, together with multiple scoring techniques. It is challenging to determine how to set the fuzziness value, even though the results are highly sensitive to it. For the results in this section, we chose the default fuzziness parameter of m = 1.3, as suggested by Odeh et al. Furthermore, for the farm datasets, the MZA scoring metrics do not always agree, providing conflicting recommendations and forcing the user to choose the best clustering. In combination, these limitations make MZA hard to use as a recommendation service for growers who lack the data science background necessary to interpret its results. Centaurus addresses these limitations by providing a sufficiently large number of k-means runs, no free parameters, and more sophisticated ways of computing the covariance matrix in each iteration of its clustering algorithm. It uses a single scoring method to decide which clustering is presented as the best to a novice user, while providing the diagnostic capabilities needed by more advanced users. Moreover, FPI and NCE disagree more often than they agree for these datasets. For the Cal Poly dataset, both scores agree only when m = 1.5, suggesting that k = 4 is the best clustering. For other values of m, MZA recommends cluster sizes that range from k = 2 to k = 5. For Sedgwick and m = 2.0, FPI selects k = 3 and NCE selects k = 2. For UNL, no FPI-NCE pairs agree on the best clustering, with MZA recommending all values of k for different m.
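To make the two MZA metrics concrete, the sketch below computes them from a fuzzy membership matrix using their commonly cited formulations (FPI from the partition coefficient, NCE as classification entropy normalized by log of the cluster count); the example membership matrix is ours, for illustration only:

```python
import numpy as np

def fpi(U):
    # Fuzziness performance index from an n-by-c membership matrix U
    # (rows sum to 1); 0 for a crisp partition, 1 for a maximally fuzzy one.
    n, c = U.shape
    F = (U ** 2).sum() / n            # partition coefficient
    return 1 - (c * F - 1) / (c - 1)

def nce(U):
    # Normalized classification entropy; lower values indicate
    # less ambiguity in the memberships.
    n, c = U.shape
    H = -(U * np.log(U + 1e-12)).sum() / n
    return H / np.log(c)

# A nearly crisp two-cluster partition scores low on both metrics.
U = np.array([[0.95, 0.05], [0.9, 0.1], [0.1, 0.9], [0.05, 0.95]])
print(fpi(U), nce(U))
```

Because U itself depends on the fuzziness parameter m, both scores shift as m changes, which is one way the conflicting recommendations described above can arise.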
Because fine-grained EC measurements are not available for the Cal Poly, Sedgwick, and UNL farm plots, it is not possible to compare MZA and Centaurus in terms of which produces more accurate spatial maps from the Veris data. Even with expert interpretation of the conflicting MZA results for Cal Poly and UNL, we do not have access to “ground truth” for the fields.
However, it is possible to compare the two methods with the synthetic datasets shown in Figure 3.1. Note that this evidence suggests Centaurus is more effective for some clustering problems but is not conclusive for the empirical data. Instead, from the empirical data we claim that Centaurus is more utilitarian than MZA because disagreement between FPI and NCE, and the differing possible best clusterings based on user-selected values of m, can make MZA results difficult and/or error-prone to interpret for non-expert users. MZA recommendations may be useful in providing an overall high-level “picture” of the Veris data clustering, but its varying recommendations are challenging to use for making “hard” decisions by experts and non-experts alike. In contrast, Centaurus uses its variants of k-means, a BIC-based scoring metric, and large state-space exploration to determine a single “best” clustering, providing both a single “hard” spatial clustering assignment and a way to explain why one clustering should be preferred over another when ground truth is not available. The only free parameter the user must set is the size of the state-space exploration. As the work in this study illustrates, Centaurus can find rare and relatively unique high-quality clusterings when the state space it explores is large. MZA is a stand-alone software package that runs on a laptop or desktop computer. In contrast, Centaurus is designed to run as a highly concurrent and scalable cloud service and uses a single processor per k-means run. As such, it automatically harnesses multiple computational resources on behalf of its users. Centaurus can be configured to constrain the number of resources it uses; doing so proportionately increases the time required to complete a job. For this work, we host Centaurus on two large private cloud systems: Aristotle Aristotle and Jetstream Stewart et al., Towns et al.
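The fan-out pattern described above (many independently seeded k-means runs executed in parallel, with the single best result kept) can be approximated on one machine with a worker pool. This is a simplified stand-in for Centaurus's cloud deployment, not its actual code: the data are synthetic, and runs are scored here by within-cluster SSE rather than Centaurus's BIC-based metric:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def one_run(seed, X, k=2, iters=50):
    # One seeded k-means run; returns (within-cluster SSE, labels).
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    sse = float(((X - centers[labels]) ** 2).sum())
    return sse, labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])

# Fan out 64 seeded runs across workers; keep the lowest-SSE result.
with ThreadPoolExecutor(max_workers=8) as pool:
    runs = list(pool.map(lambda s: one_run(s, X), range(64)))
best_sse, best_labels = min(runs, key=lambda r: r[0])
```

Because each run is seeded independently, the runs can be scattered across processors or cloud workers without coordination, which is the property Centaurus exploits to explore very large numbers of initializations.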
Extensive studies of k-means demonstrate its popularity for data processing, and many surveys are available to interested readers Jain et al., Berkhin. In this section, we focus on k-means clustering for multivariate correlated data. We also discuss the application and need for such systems in the context of farm analytics when analyzing soil electrical conductivity. To integrate k-means into Centaurus, we leverage Murphy's Murphy work in the domain of Gaussian Mixture Models. This work identifies multiple ways of computing the covariance matrices and using them to determine distances and log-likelihoods. To the best of our knowledge, there is no prior work on using all six variants of cluster covariance computation within a k-means system. We also use k-means++ Arthur & Vassilvitskii for cluster center initialization. The research and system most closely related to Centaurus is MZA Fridgen et al. —a computer program widely used by farmers to identify clusters in soil electro-conductivity data to aid farm zone identification and to optimize management.
MZA uses fuzzy k-means Dunn, Bezdek, computes a global covariance, and employs either Euclidean Heath et al., diagonal, or Mahalanobis distance to compute the distance between points. MZA computes the covariance matrix once from all data points and uses this same matrix in each iteration. MZA compares clusters using two different scoring metrics: the fuzziness performance index Odeh et al. and normalized classification entropy Bezdek. Centaurus attempts to address some of the limitations of MZA. We also show that although MZA provides multiple scoring metrics to compare cluster quality, the MZA metrics commonly produce different “recommended” clusterings. The authors of x-means Pelleg et al. use the Bayesian Information Criterion Schwarz as a score for the univariate normal distribution. Our work differs in that we extend the algorithm and scoring to multivariate distributions and account for different ways of computing the covariance matrix in the clustering algorithm. We provide six different ways of computing the covariance matrix for k-means on multivariate data and examples that illustrate the differences. Different parallel computational models have been used in other work to speed up k-means cluster initialization Bahmani et al., or its overall runtime. Our work differs in that we provide not only a scalable system but also k-means variants, the flexibility for a user to select any one or all of the variants, and a scoring and recommendation system. Finally, Centaurus is pluggable, enabling other algorithms to be added and compared. The Internet of Things is quickly expanding to include every “thing,” from simple Internet-connected objects to collections of intelligent devices capable of everything from the acquisition, processing, and analysis of data to data-driven actuation, automation, and control. Since these devices are located “in the wild,” they are typically small, resource-constrained, and battery-powered.
At the same time, the low-latency requirements of many applications mean that processing and analysis must be performed near where data is collected. This tension requires new techniques that equip IoT devices with more capabilities. One way to enable IoT devices to do more is to use integrated sensors to estimate the measurements of other sensors, a technique that we call sensor synthesis. Since the number of sensors per device is generally bounded by design constraints, sensor synthesis makes it possible to free up resources in IoT devices for other sensors. We focus on estimating values of measurements where estimation error is low, freeing up space for sensors with measurements that are harder to estimate. Many, if not most, IoT systems for precision agriculture depend on and integrate measurements of real-time, atmospheric temperature. Temperature is used to inform and actuate irrigation scheduling, frost damage mitigation, greenhouse management, plant growth modulation, yield estimation, post-harvest monitoring, crop selection, and disease and pest management, among other farm operations Ghaemi et al., Stombaugh et al., Ioslovich et al., Roberts et al., Gonzalez-Dugoa et al.
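As a concrete illustration of the sensor-synthesis idea, a device might fit a simple model that estimates one sensor's reading from other onboard sensors. The sketch below is entirely hypothetical (synthetic data, an assumed linear relationship, and a least-squares fit); it is meant only to show the estimate-one-measurement-from-others pattern, not the method used in this work:

```python
import numpy as np

rng = np.random.default_rng(2)
# Pretend readings from two other onboard sensors (e.g., humidity
# and solar radiation) and a temperature that depends on them.
other = rng.normal(size=(500, 2))
temp = 3.0 * other[:, 0] - 1.5 * other[:, 1] + 20 + rng.normal(0, 0.1, 500)

# Fit a linear model with an intercept on the first 400 samples,
# then "synthesize" temperature for the held-out 100 samples.
A = np.c_[other, np.ones(len(other))]
coef, *_ = np.linalg.lstsq(A[:400], temp[:400], rcond=None)
pred = A[400:] @ coef
rmse = float(np.sqrt(np.mean((pred - temp[400:]) ** 2)))
```

When the held-out error of such a model is low, the physical temperature sensor becomes a candidate for replacement, freeing board space and power for measurements that are harder to estimate.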