Title: | Hierarchical Clustering with Spatial Constraints |
---|---|
Description: | Implements a Ward-like hierarchical clustering algorithm including soft spatial/geographical constraints. |
Authors: | Marie Chavent [aut, cre], Vanessa Kuentz [aut], Amaury Labenne [aut], Jerome Saracco [aut] |
Maintainer: | Marie Chavent <[email protected]> |
License: | GPL (>=2.0) |
Version: | 2.1 |
Built: | 2025-03-01 03:35:27 UTC |
Source: | https://github.com/chavent/clustgeo |
This function calculates the proportion of inertia explained by the partitions in K clusters
for a range of mixing parameters alpha. When the proportion of explained inertia calculated
with D0 decreases, the proportion of explained inertia calculated with D1 increases. The plot
of the two curves of explained inertia (one for D0 and one for D1) helps the user to choose
the mixing parameter alpha.
choicealpha(D0, D1, range.alpha, K, wt = NULL, scale = TRUE, graph = TRUE)
D0 |
a dissimilarity matrix of class "dist". |
D1 |
another dissimilarity matrix of class "dist". |
range.alpha |
a vector of real values between 0 and 1. |
K |
the number of clusters. |
wt |
vector with the weights of the observations. By default, wt=NULL corresponds to the case where all observations are weighted by 1/n. |
scale |
if TRUE the two dissimilarity matrices are scaled i.e. divided by their max. |
graph |
if TRUE, two graphics (proportion and normalized proportion of explained inertia) are drawn. |
An object with S3 class "choicealpha" and the following components:
Q |
a matrix of dimension |
Qnorm |
a matrix of dimension |
M. Chavent, V. Kuentz-Simonet, A. Labenne, J. Saracco. ClustGeo: an R package for hierarchical clustering with spatial constraints. Comput Stat (2018) 33: 1799-1822.
library(ClustGeo)
data(estuary)
D0 <- dist(estuary$dat) # the socio-demographic distances
D1 <- as.dist(estuary$D.geo) # the geographic distances between the cities
range.alpha <- seq(0, 1, 0.1)
K <- 5
cr <- choicealpha(D0, D1, range.alpha, K, graph = TRUE)
cr$Q # proportion of explained pseudo inertia
cr$Qnorm # normalized proportion of explained pseudo inertia
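The proportion of explained pseudo inertia can be illustrated in base R without the package. The sketch below is an assumption about how such a quantity can be computed (one minus the within-cluster pseudo inertia divided by the total pseudo inertia), not the package's own code. All helper names (ward_sketch, pseudo_inertia, explained, explained_by_alpha) are hypothetical, the weights are taken as uniform, and the data are simulated.

```r
# Illustrative helpers; the names are hypothetical and not part of ClustGeo.
ward_sketch <- function(D, wt) {
  # Ward aggregation measure between singletons: (wi*wj)/(wi+wj) * dij^2
  as.dist(outer(wt, wt) / outer(wt, wt, "+") * as.matrix(D)^2)
}
pseudo_inertia <- function(D, idx, wt) {
  # Pseudo inertia of the observations in idx, computed from dissimilarities
  d2 <- as.matrix(D)[idx, idx, drop = FALSE]^2
  sum(outer(wt[idx], wt[idx]) * d2) / (2 * sum(wt[idx]))
}
explained <- function(D, part, wt) {
  # 1 - (within-cluster pseudo inertia) / (total pseudo inertia)
  n <- attr(D, "Size")
  within <- sum(vapply(split(seq_len(n), part),
                       function(idx) pseudo_inertia(D, idx, wt), numeric(1)))
  1 - within / pseudo_inertia(D, seq_len(n), wt)
}
explained_by_alpha <- function(D0, D1, range.alpha, K) {
  n <- attr(D0, "Size")
  wt <- rep(1 / n, n)                       # assumption: uniform weights
  D0 <- D0 / max(D0)                        # same idea as scale = TRUE
  D1 <- D1 / max(D1)
  Q <- t(vapply(range.alpha, function(alpha) {
    delta <- (1 - alpha) * ward_sketch(D0, wt) + alpha * ward_sketch(D1, wt)
    part <- cutree(hclust(delta, method = "ward.D", members = wt), k = K)
    c(Q0 = explained(D0, part, wt), Q1 = explained(D1, part, wt))
  }, c(Q0 = 0, Q1 = 0)))
  rownames(Q) <- paste0("alpha=", range.alpha)
  Q
}

set.seed(4)
D0 <- dist(matrix(rnorm(24), nrow = 12))    # simulated "feature" distances
D1 <- dist(matrix(runif(24), nrow = 12))    # simulated "geographic" distances
Q <- explained_by_alpha(D0, D1, range.alpha = c(0, 0.5, 1), K = 3)
print(Q)
```

Each row of Q corresponds to one value of alpha, with one column per dissimilarity matrix, which is the kind of table the two curves are drawn from.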
Data referring to n=303 French municipalities of the Gironde estuary (a south-west French county).
The data come from the French population census conducted by the National Institute
of Statistics and Economic Studies. The dataset is an extraction of four quantitative
socio-economic variables for a subsample of 303 French municipalities located on the
Atlantic coast between Royan and Mimizan. employ.rate.city is the employment rate
of the municipality, that is, the ratio of the number of individuals who have a job to
the population of working age (generally defined, for the purposes of international
comparison, as persons between 15 and 64 years of age). graduate.rate refers
to the level of education of the population, that is, the highest degree declared by the
individual. It is defined here as the ratio of the whole population having completed
a diploma equivalent to or higher than two years of higher education
(DUT, BTS, DEUG, nursing and social training courses, license, maitrise, master, DEA, DESS, doctorate, or Grande Ecole diploma).
housing.appart is the ratio of apartment housing. agri.land is the share of
agricultural area in the municipality.
The R dataset estuary is a list of three objects:
dat: a data frame with the description of the n=303 municipalities on p=4 socio-demographic variables.
D.geo: a matrix with the geographical distances between the town halls of the n=303 municipalities.
map: an object of class SpatialPolygonsDataFrame with the map of the Gironde estuary.
The original data come from the French population census of the National Institute of Statistics and Economic Studies for the year 2009. The agricultural surface was calculated from data provided by the French National Institute of Geographical and Forestry Information. The calculation of the ratios and the recoding of categories were done by Irstea Bordeaux.
M. Chavent, V. Kuentz-Simonet, A. Labenne, J. Saracco. ClustGeo: an R package for hierarchical clustering with spatial constraints. Comput Stat (2018) 33: 1799-1822.
library(ClustGeo)
data(estuary)
names(estuary)
head(estuary$dat)
Implements a Ward-like hierarchical clustering
algorithm including soft contiguity constraints. The algorithm takes as
input two dissimilarity matrices D0 and D1 and a mixing
parameter alpha between 0 and 1. The dissimilarities can be non-Euclidean
and the weights of the observations can be non-uniform. The first matrix
gives the dissimilarities in the "feature space". The second matrix gives
the dissimilarities in the "constraint" space. For instance, D1
can be a matrix of geographical distances or a matrix built from
a contiguity matrix. The mixing parameter alpha sets the importance
of the constraint in the clustering process.
hclustgeo(D0, D1 = NULL, alpha = 0, scale = TRUE, wt = NULL)
D0 |
an object of class "dist" with the dissimilarities between the n observations. |
D1 |
an object of class "dist" with other dissimilarities between the same n observations. |
alpha |
a real value between 0 and 1. This mixing parameter gives the
relative importance of |
scale |
if TRUE the two dissimilarity matrices are scaled i.e. divided by their max. |
wt |
vector with the weights of the observations. By default, wt=NULL corresponds to the case where all observations are weighted by 1/n. |
The criterion minimized at each stage is a convex combination of
the homogeneity criterion calculated with D0 and the homogeneity
criterion calculated with D1. The parameter alpha (the weight
of this convex combination) controls the importance of the constraint
in the quality of the solutions. When alpha increases,
the homogeneity calculated with D0 decreases whereas the
homogeneity calculated with D1 increases.
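The convex combination described above can be sketched in base R. This is an illustration under stated assumptions, not the package source: it assumes the mixing is applied to the Ward aggregation measures between singletons and that the resulting object is passed to stats::hclust with method "ward.D". The helper names ward_sketch and mixed_tree are hypothetical, and the data are simulated.

```r
# Ward aggregation measure between singletons: (wi*wj)/(wi+wj) * dij^2
ward_sketch <- function(D, wt) {
  W <- outer(wt, wt) / outer(wt, wt, "+")
  as.dist(W * as.matrix(D)^2)
}

# Hypothetical helper: tree built from a convex combination of two criteria
mixed_tree <- function(D0, D1, alpha, wt = NULL, scale = TRUE) {
  n <- attr(D0, "Size")
  if (is.null(wt)) wt <- rep(1 / n, n)      # default: uniform weights 1/n
  if (scale) {                              # put both matrices on [0, 1]
    D0 <- D0 / max(D0)
    D1 <- D1 / max(D1)
  }
  # alpha = 0 uses only D0, alpha = 1 uses only D1
  delta <- (1 - alpha) * ward_sketch(D0, wt) + alpha * ward_sketch(D1, wt)
  hclust(delta, method = "ward.D", members = wt)
}

set.seed(3)
D0 <- dist(matrix(rnorm(24), nrow = 12))    # simulated feature distances
D1 <- dist(matrix(runif(24), nrow = 12))    # simulated constraint distances
tree0 <- mixed_tree(D0, D1, alpha = 0)
tree_mix <- mixed_tree(D0, D1, alpha = 0.3)
part <- cutree(tree_mix, k = 3)
table(part)
```

With alpha = 0 the sum of the heights of the tree equals the total pseudo inertia computed from the scaled D0, which is a convenient sanity check for this kind of construction.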
Returns an object of class hclust.
M. Chavent, V. Kuentz-Simonet, A. Labenne, J. Saracco. ClustGeo: an R package for hierarchical clustering with spatial constraints. Comput Stat (2018) 33: 1799-1822.
library(ClustGeo)
data(estuary)
# with one dissimilarity matrix
w <- estuary$map@data$POPULATION # non-uniform weights
D <- dist(estuary$dat)
tree <- hclustgeo(D, wt = w)
sum(tree$height)
inertdiss(D, wt = w)
inert(estuary$dat, wt = w)
plot(tree, labels = FALSE)
part <- cutree(tree, k = 5)
sp::plot(estuary$map, border = "grey", col = part)
# with two dissimilarity matrices
D0 <- dist(estuary$dat) # the socio-demographic distances
D1 <- as.dist(estuary$D.geo) # the geographical distances
alpha <- 0.2 # the mixing parameter
tree <- hclustgeo(D0, D1, alpha = alpha, wt = w)
plot(tree, labels = FALSE)
part <- cutree(tree, k = 5)
sp::plot(estuary$map, border = "grey", col = part)
Computes the inertia of a cluster, i.e. of a subset of rows of a data matrix.
inert(Z, indices = 1:nrow(Z), wt = rep(1/nrow(Z), nrow(Z)), M = rep(1, ncol(Z)))
Z |
data matrix |
indices |
vector with the indices of the subset of rows |
wt |
weight vector |
M |
diagonal distance matrix |
library(ClustGeo)
data(estuary)
n <- nrow(estuary$dat)
Z <- scale(estuary$dat) * sqrt(n / (n - 1))
inert(Z) # number of variables
w <- estuary$map@data$POPULATION # non-uniform weights
inert(Z, wt = w)
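As a rough illustration of what the inertia of a set of rows means, the base R sketch below computes the weighted sum of squared distances to the weighted centre of gravity, with M used as per-variable weights. It is an assumption about the definition, written for this page, and not the package code; the helper name inertia_sketch and the simulated data are hypothetical.

```r
# Weighted inertia of a subset of rows (illustrative helper, not the package code)
inertia_sketch <- function(Z, indices = seq_len(nrow(Z)),
                           wt = rep(1 / nrow(Z), nrow(Z)),
                           M = rep(1, ncol(Z))) {
  Z <- as.matrix(Z)[indices, , drop = FALSE]
  w <- wt[indices]
  g <- colSums(w * Z) / sum(w)          # weighted centre of gravity
  dev2 <- sweep(Z, 2, g)^2              # squared deviations from the centre
  sum(w * (dev2 %*% M))                 # rows weighted by w, columns by M
}

set.seed(1)
X <- matrix(rnorm(60), nrow = 20, ncol = 3)
n <- nrow(X)
Z <- scale(X) * sqrt(n / (n - 1))       # columns with population variance 1
total <- inertia_sketch(Z)              # uniform weights 1/n
total
```

With standardized columns and uniform weights, the total inertia equals the number of variables, which matches the comment in the example above.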
The pseudo inertia of a cluster is calculated from a dissimilarity matrix and not from a data matrix.
inertdiss(D, indices = NULL, wt = NULL)
D |
an object of class "dist" with the dissimilarities between the n observations.
The function |
indices |
a vector with the indices of the subset of observations. |
wt |
vector with the weights of the n observations |
M. Chavent, V. Kuentz-Simonet, A. Labenne, J. Saracco. ClustGeo: an R package for hierarchical clustering with spatial constraints. Comput Stat (2018) 33: 1799-1822.
library(ClustGeo)
data(estuary)
n <- nrow(estuary$dat)
Z <- scale(estuary$dat) * sqrt(n / (n - 1))
inertdiss(dist(Z)) # pseudo inertia
inert(Z) # equal for the Euclidean distance
w <- estuary$map@data$POPULATION # non-uniform weights
inertdiss(dist(Z), wt = w)
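The equality between pseudo inertia and inertia for Euclidean distances can be checked on simulated data in base R. The sketch below assumes the pseudo inertia of a set of observations is the sum over all pairs of wi*wj*dij^2 divided by twice the total weight of the set; this is an assumption written for illustration, and the helper name pseudo_inertia is hypothetical.

```r
# Pseudo inertia from a dissimilarity object (illustrative, not the package code)
pseudo_inertia <- function(D, indices = NULL, wt = NULL) {
  D <- as.matrix(D)
  n <- nrow(D)
  if (is.null(indices)) indices <- seq_len(n)
  if (is.null(wt)) wt <- rep(1 / n, n)  # default: uniform weights 1/n
  w <- wt[indices]
  d2 <- D[indices, indices, drop = FALSE]^2
  sum(outer(w, w) * d2) / (2 * sum(w))
}

set.seed(2)
Z <- matrix(rnorm(40), nrow = 10, ncol = 4)
w <- runif(10)                          # non-uniform weights

# Inertia computed directly from the data matrix
g <- colSums(w * Z) / sum(w)
euclid_inertia <- sum(w * rowSums(sweep(Z, 2, g)^2))

# Pseudo inertia computed from the Euclidean distances only
pseudo <- pseudo_inertia(dist(Z), wt = w)
c(euclid_inertia, pseudo)
```

The two values agree because, for Euclidean distances, the weighted sum of squared pairwise distances can be rewritten as a weighted sum of squared distances to the centre of gravity.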
Plots the two curves of explained inertia (one for D0 and one for D1) calculated with choicealpha.
## S3 method for class 'choicealpha'
plot(x, norm = FALSE, lty = 1:2, pch = c(8, 16), type = c("b", "b"), col = 1:2,
     xlab = "alpha", ylab = NULL, legend = NULL, cex = 1, ...)
x |
an object of class "choicealpha". |
norm |
if TRUE, the normalized proportions of explained inertia are plotted. Otherwise, the proportions of explained inertia are plotted. |
lty |
a vector of size 2 with the line types of the two curves. See par |
pch |
a vector of size 2 specifying the symbol for the points of the two curves. See par |
type |
a vector of size 2 specifying the type of lines of the two curves. See par |
col |
a vector of size 2 specifying the colors of the two curves. See par |
xlab |
the title for the x axis. |
ylab |
the title for the y axis. |
legend |
a vector of size 2 with the text for the legend of the two curves. |
cex |
text size in the legend. |
... |
further arguments passed to or from other methods. |
M. Chavent, V. Kuentz-Simonet, A. Labenne, J. Saracco. ClustGeo: an R package for hierarchical clustering with spatial constraints. Comput Stat (2018) 33: 1799-1822.
library(ClustGeo)
data(estuary)
D0 <- dist(estuary$dat)
D1 <- as.dist(estuary$D.geo) # the geographic distances between the cities
range.alpha <- seq(0, 1, 0.1)
K <- 5
cr <- choicealpha(D0, D1, range.alpha, K, graph = FALSE)
plot(cr, cex = 0.8, norm = FALSE, cex.lab = 0.8, ylab = "pev",
     col = 3:4, legend = c("socio-demo", "geo"), xlab = "mixing parameter")
plot(cr, cex = 0.8, norm = TRUE, cex.lab = 0.8, ylab = "pev",
     col = 5:6, pch = 5:6, legend = c("socio-demo", "geo"), xlab = "mixing parameter")
This function calculates the Ward aggregation measures between pairs of singletons.
wardinit(D, wt = NULL)
D |
an object of class "dist" with the dissimilarities between the n observations.
The function |
wt |
vector with the weights of the observations. By default, wt=NULL corresponds to the case where all observations are weighted by 1/n. |
The Ward aggregation measure between two singletons i and j weighted by wi and wj is: (wi*wj)/(wi+wj) * dij^2, where dij is the dissimilarity between i and j.
Returns an object of class dist with the Ward aggregation measures between the n singletons.
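The formula above can be worked through by hand on three points with base R. The helper below is a minimal sketch written for illustration (its name ward_sketch is hypothetical and it is not the package code); it applies (wi*wj)/(wi+wj) * dij^2 to every pair and returns a dist object.

```r
# Ward aggregation measure between all pairs of singletons (illustrative)
ward_sketch <- function(D, wt = NULL) {
  n <- attr(D, "Size")
  if (is.null(wt)) wt <- rep(1 / n, n)          # default: uniform weights 1/n
  W <- outer(wt, wt) / outer(wt, wt, "+")       # (wi*wj)/(wi+wj) for each pair
  as.dist(W * as.matrix(D)^2)
}

x <- c(0, 1, 3)                 # three points on a line
D <- dist(x)                    # d12 = 1, d13 = 3, d23 = 2
delta <- ward_sketch(D)
delta
```

With uniform weights 1/3 the factor (wi*wj)/(wi+wj) is 1/6 for every pair, so the three measures are 1/6, 9/6 and 4/6.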
M. Chavent, V. Kuentz-Simonet, A. Labenne, J. Saracco. ClustGeo: an R package for hierarchical clustering with spatial constraints. Comput Stat (2018) 33: 1799-1822.
This function computes the pseudo within-cluster inertia of a partition from a dissimilarity matrix.
withindiss(D, part, wt = NULL)
D |
an object of class "dist" with the dissimilarities between the n observations.
The function |
part |
a vector with group membership. |
wt |
vector with the weights of the observations |
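A within-cluster pseudo inertia can be sketched in base R as the sum of the pseudo inertias of the clusters of the partition. The code below is an illustration under that assumption, not the package source; the helper names pseudo_inertia and within_sketch are hypothetical and the data are simulated.

```r
# Pseudo inertia of the observations in idx (illustrative helper)
pseudo_inertia <- function(D, idx, wt) {
  d2 <- as.matrix(D)[idx, idx, drop = FALSE]^2
  sum(outer(wt[idx], wt[idx]) * d2) / (2 * sum(wt[idx]))
}

# Within-cluster pseudo inertia: sum over the clusters of the partition
within_sketch <- function(D, part, wt = NULL) {
  n <- attr(D, "Size")
  if (is.null(wt)) wt <- rep(1 / n, n)          # default: uniform weights 1/n
  clusters <- split(seq_len(n), part)           # indices of each cluster
  sum(vapply(clusters, function(idx) pseudo_inertia(D, idx, wt), numeric(1)))
}

set.seed(5)
D <- dist(matrix(rnorm(30), nrow = 10))
n <- attr(D, "Size")
wt <- rep(1 / n, n)

total <- pseudo_inertia(D, seq_len(n), wt)
w_one <- within_sketch(D, part = rep(1, n))     # a single cluster
w_all <- within_sketch(D, part = seq_len(n))    # every observation alone
w_two <- within_sketch(D, part = rep(1:2, each = 5))
c(total = total, one = w_one, two = w_two, singletons = w_all)
```

The two extreme partitions bracket the result: a single cluster gives the total pseudo inertia, and a partition into singletons gives zero.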
M. Chavent, V. Kuentz-Simonet, A. Labenne, J. Saracco. ClustGeo: an R package for hierarchical clustering with spatial constraints. Comput Stat (2018) 33: 1799-1822.