cluster, data, centers=centers [, index=index,
size=size, sample=sample, phantom=phantom,
maxit=maxit, rms=rms, second=second, /update,
/iterate, /vocal, /quick, /record, /ordered]
This routine divides vectors into clusters, based on their proximity to
cluster centers. data, centers, index, sample, phantom, and maxit are
input variables; centers, index, size, rms, and second are output
variables and must therefore be named variables. The arguments are:
data
The data points. The first dimension of data selects the components of
the vectors; i.e., the first vector is data(*,0) if data has two
dimensions. Each data point is assigned to the cluster whose center is
closest of all cluster centers to the data point at that time.
centers
The cluster centers. If centers is a scalar on entry, then it denotes
the number of clusters to divide the data into, and a random sample (of
appropriate size) of the data vectors is taken as initial guess for the
cluster center positions. If centers is an array on input, then each
centers(*,k…) is taken to contain the (initial) position of a cluster
center. The final (possibly updated) cluster centers are returned in
centers on exit.
index
The cluster assignment of the data points. If index is an array of
appropriate size on entry, and if phantom is not specified, then the
elements of index denote the initial cluster membership of the data
points. On exit, index contains the cluster numbers that the data points
have been assigned to. An omitted or undefined index implies /phantom.
size
The cluster sizes. If size is specified, then the number of elements in
each cluster is returned in it upon exit from the routine.
sample
The data point sample size. If sample is an integer larger than one,
then it indicates the size of a random sample of data points that should
be treated. If sample equals 1, then a sample ten times bigger than the
number of clusters is used. If sample is not specified or falls outside
the range described above, then all data points are treated.
phantom
If phantom is specified, then it indicates that the clusters should be
pre-stocked with phantom members, to partially suppress movement of the
cluster centers during clustering. The value assigned to phantom, if an
integer larger than 1, indicates how many phantom members to assign to
each cluster prior to treatment of the data points. If phantom equals 1,
then 10 phantom members are assigned to each cluster prior to
clustering. If phantom is not specified, and index is an array of
appropriate size, then the clusters contain the members indicated by
index prior to clustering. Any phantom members are removed after
clustering, before exiting the routine.
maxit
maxit specifies the maximum number of iterations allowed for this call
to cluster. If maxit is not specified, then the number of iterations is
unlimited.
rms
If rms is specified, then the average root-mean-square distance of the
members of each cluster to its center is returned in it.
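The assignment rule and the size and rms outputs described above can be
illustrated with a minimal Python sketch. It is not the routine's actual
implementation. The function names are hypothetical, the distance is
assumed to be Euclidean (the text only says "closest"), and rms is
assumed to hold one value per cluster. Each data vector is a tuple in a
Python list, which differs from the data(*,0) layout used by cluster.

```python
import math

def assign_to_nearest(data, centers):
    """Return, for each data vector, the number of the closest center."""
    index = []
    for point in data:
        # Squared Euclidean distance to every center (assumed metric).
        d2 = [sum((p - c) ** 2 for p, c in zip(point, center))
              for center in centers]
        index.append(d2.index(min(d2)))
    return index

def cluster_sizes(index, n_clusters):
    """Count the members of each cluster."""
    size = [0] * n_clusters
    for k in index:
        size[k] += 1
    return size

def cluster_rms(data, centers, index):
    """Root-mean-square distance of each cluster's members to its center."""
    sums = [0.0] * len(centers)
    counts = [0] * len(centers)
    for point, k in zip(data, index):
        sums[k] += sum((p - c) ** 2 for p, c in zip(point, centers[k]))
        counts[k] += 1
    # An empty cluster gets 0.0 here; that choice is an assumption.
    return [math.sqrt(s / n) if n else 0.0 for s, n in zip(sums, counts)]

data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
centers = [(0.0, 1.0), (10.0, 2.0)]
index = assign_to_nearest(data, centers)
size = cluster_sizes(index, len(centers))
rms = cluster_rms(data, centers, index)
```

Here the first two vectors fall in cluster 0 and the last two in
cluster 1, because the centers are fixed and well separated.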
The keywords are:
/update
signals that the cluster center positions must be updated during the
clustering, so that each cluster center at any time during the
clustering equals the average position of all members in the cluster at
that time (including any phantom members). If /update is not specified,
then the cluster centers do not move during clustering.
/iterate
specifies that updating must be iterated until the cluster centers are
stable and upon exit all data points are members of the cluster whose
center is the closest one of all cluster centers. /iterate implies
/update.
/vocal
specifies that the number of reclustered data points (and the number of
changed clusters, if /iterate is also specified) are printed.
/quick
specifies that only data points in clusters that were changed during the
last iteration or so far during the current one should be treated during
the current iteration. The other data points are left unchanged. This
keyword implies /iterate and /update.
/order
specifies that there is some degree of order in data, so that there is
more than a random chance that the current data point and the previous
data point belong in the same cluster. The time penalty of this option
is small, so it is selected by default. Specify /noorder to deselect it.
This option only affects the first iteration.
/record
specifies that the cluster positions and sizes should be written to file
cluster.out after each iteration. This option has effect only if
/iterate was also selected.
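The effect of /update together with phantom members can be sketched in
Python as a running mean. This is only an illustration under stated
assumptions, not the routine's implementation: the name update_pass is
hypothetical, the distance is assumed to be Euclidean, and the phantom
members are assumed to sit at the initial center position. The removal
of phantom members on exit is not modelled.

```python
def update_pass(data, centers, phantom=0):
    """One pass over data, moving each center as members are added."""
    centers = [list(c) for c in centers]
    # Assumption: each cluster starts with `phantom` members located at
    # its initial center, so they pull the running mean back toward it.
    counts = [phantom] * len(centers)
    index = []
    for point in data:
        d2 = [sum((p - c) ** 2 for p, c in zip(point, center))
              for center in centers]
        k = d2.index(min(d2))
        counts[k] += 1
        # Running mean: the center stays equal to the average position
        # of all current members, phantom members included.
        centers[k] = [c + (p - c) / counts[k]
                      for p, c in zip(point, centers[k])]
        index.append(k)
    return centers, index

data = [(2.0,), (4.0,)]
free, _ = update_pass(data, [(0.0,)])               # no phantom members
damped, _ = update_pass(data, [(0.0,)], phantom=2)  # two phantom members
```

Without phantom members the single center ends at 3.0, the mean of the
two data points. With two phantom members at 0.0 it ends at 1.5, the
mean of 0, 0, 2, and 4, which shows how phantom members partially
suppress movement of the center.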
cluster,data,c,i with undefined i yields cluster centers which are not
distributed evenly. cluster,data,c,i,/iterate yields a more even
distribution. For large data sets it is advised to first use
cluster,data,c,i,/sample,/iterate (or with sample=sample_size) to get
fairly evenly distributed cluster centers, and then cluster,data,c,i to
assign all data points to the clusters.
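The two-stage approach advised above can be sketched in Python as
follows. This is a simplified illustration, not the routine itself. All
function names are hypothetical. The iteration shown is a batch
(Lloyd-style) k-means loop standing in for /iterate, whereas the routine
updates centers while data points are being treated. Clamping the sample
size to the number of data points is an assumption.

```python
import random

def nearest(point, centers):
    """Number of the center closest to point (Euclidean, assumed)."""
    d2 = [sum((p - c) ** 2 for p, c in zip(point, center))
          for center in centers]
    return d2.index(min(d2))

def sample_size(sample, n_data, n_clusters):
    """Resolve the sample argument following the rules in the text."""
    if sample is None:
        return n_data                        # not specified: use all data
    if sample == 1:
        return min(10 * n_clusters, n_data)  # clamp is an assumption
    if sample > 1:
        return min(sample, n_data)           # clamp is an assumption
    return n_data                            # out of range: use all data

def iterate_on_sample(data, n_clusters, sample=None, maxit=None, seed=0):
    """Stage 1: find centers by iterating on a random sample of data."""
    rng = random.Random(seed)
    subset = rng.sample(data, sample_size(sample, len(data), n_clusters))
    # Scalar centers on entry: random data vectors are the initial guess.
    centers = [list(p) for p in rng.sample(subset, n_clusters)]
    index, iterations = None, 0
    while maxit is None or iterations < maxit:
        new_index = [nearest(p, centers) for p in subset]
        if new_index == index:
            break                            # memberships are stable
        index = new_index
        for k in range(n_clusters):
            members = [p for p, i in zip(subset, index) if i == k]
            if members:                      # empty cluster keeps its center
                centers[k] = [sum(col) / len(members)
                              for col in zip(*members)]
        iterations += 1
    return centers

data = [(float(x),) for x in range(10)] + [(100.0 + x,) for x in range(10)]
centers = iterate_on_sample(data, 2, sample=8)
# Stage 2: assign every data point to the nearest fixed center.
index = [nearest(p, centers) for p in data]
size = [index.count(k) for k in range(len(centers))]
```

Stage 1 touches only the sample, so its cost does not grow with the full
data set. Stage 2 is a single pass with fixed centers, as in a plain
cluster,data,c,i call.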
When random samples of data points are treated, the clustering algorithm is known as the Continuous k-Means Algorithm.
See also: Topology