

15.5.97 cluster

cluster, data, centers=centers [, index=index, size=size, sample=sample, phantom=phantom, maxit=maxit, rms=rms, second=second, /update, /iterate, /vocal, /quick, /record, /ordered]

This routine divides vectors into clusters, based on their proximity to cluster centers. data, centers, index, sample, phantom, and maxit are input variables; centers, index, size, rms, and second are output variables and must therefore be named variables. The arguments are:

data

The data points. The first dimension of data selects the components of the vectors; i.e., the first vector is data(*,0) if data has two dimensions. Each data point is assigned to the cluster whose center is, at that time, the closest of all cluster centers to that data point.

centers

The cluster centers. If centers is a scalar on entry, then it denotes the number of clusters to divide the data into, and a random sample (of appropriate size) of the data vectors is taken as the initial guess for the cluster center positions. If centers is an array on entry, then each centers(*,k…) is taken to contain the (initial) position of a cluster center. The final (possibly updated) cluster centers are returned in centers on exit.

index

The cluster assignment of the data points. If index is an array of appropriate size on entry, and if phantom is not specified, then the elements of index denote the initial cluster membership of the data points. On exit, index contains the cluster numbers that the data points have been assigned to. An absent or undefined index implies /phantom.

size

The cluster sizes. If size is specified, then the number of elements in each cluster is returned in it upon exit from the routine.

sample

The data point sample size. If sample is an integer larger than one, then it indicates the size of a random sample of data points that should be treated. If sample equals 1, then a sample ten times bigger than the number of clusters is used. If sample is not specified or falls outside the range mentioned before, then all data points are treated.

phantom

If phantom is specified, then it indicates that the clusters should be pre-stocked with phantom members, to partially suppress movement of the cluster centers during clustering. The value assigned to phantom, if an integer larger than 1, indicates how many phantom members to assign to each cluster prior to treatment of the data points. If phantom is not specified, and index is an array of appropriate size, then the clusters contain the members indicated by index prior to clustering. If phantom equals 1, then 10 phantom members are assigned to each cluster prior to clustering. Any phantom members are removed after clustering, before exiting the routine.

maxit

maxit specifies the maximum number of iterations allowed for this call to cluster. If maxit is not specified, then the number of iterations is unlimited.

rms

If rms is specified, then the average root-mean-square distance of the members of each cluster to its center is returned in it.
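The interplay of data, centers, index, size, and phantom described above can be illustrated with a minimal sketch. It is written in Python rather than in the language of this manual, and it is not the routine's actual implementation: the function name assign_with_update is hypothetical, and the sketch assumes that phantom members sit exactly at the initial center position and that centers are updated as a running mean (the behavior described under /update).

```python
import math

def assign_with_update(data, centers, phantom=0):
    """One pass of nearest-center assignment with running-mean updates.

    data:    list of points, each a list of floats
    centers: list of initial center positions
    phantom: assumed number of phantom members pre-stocked at each
             initial center position (illustrative assumption)
    """
    centers = [list(c) for c in centers]   # work on a copy
    counts = [phantom] * len(centers)      # phantoms count as members
    index = []
    for x in data:
        # Nearest center at this moment; earlier points may have moved it.
        k = min(range(len(centers)), key=lambda j: math.dist(x, centers[j]))
        index.append(k)
        counts[k] += 1
        # Running mean: the center stays the average of all members so far.
        centers[k] = [c + (xi - c) / counts[k] for c, xi in zip(centers[k], x)]
    sizes = [n - phantom for n in counts]  # phantoms are dropped on exit
    return index, centers, sizes
```

In this sketch a larger phantom value gives each new point less weight, so the centers move less: with initial centers at 0 and 10 and data points at 2 and 8, one phantom member per cluster moves the centers to 1 and 9, whereas three phantom members move them only to 0.5 and 9.5.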

The keywords are:

/update

signals that the cluster center positions must be updated during the clustering, so that each cluster center at any time during the clustering equals the average position of all members in the cluster at that time (including any phantom members). If /update is not specified, then the cluster centers do not move during clustering.

/iterate

specifies that updating must be iterated until the cluster centers are stable and upon exit all data points are members of the cluster whose center is the closest one of all cluster centers. /iterate implies /update.

/vocal

specifies that the number of reclustered data points (and the number of changed clusters, if /iterate is also specified) are printed.

/quick

specifies that only data points in clusters that were changed during the last iteration or so far during the current one should be treated during the current iteration. The other data points are left unchanged. This keyword implies /iterate and /update.

/order

specifies that there is some degree of order in data, so that there is more than a random chance that the current data point and the previous data point belong in the same cluster. The time penalty of this option is small, so it is selected by default. Specify /noorder to deselect it. This option only affects the first iteration.

/record

specifies that the cluster positions and sizes should be written to file cluster.out after each iteration. This option has effect only if /iterate was also selected.
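The stopping condition described under /iterate, together with the maxit limit and the per-iteration count that /vocal reports, can be sketched as follows. This is a Python illustration under stated assumptions, not the routine's own code: the name iterate_clusters is hypothetical, and the sketch uses a plain batch iteration (reassign all points, then recompute each center as the mean of its members) rather than the point-by-point updating described under /update.

```python
import math

def iterate_clusters(data, centers, maxit=None):
    """Repeat assignment and center recomputation until nothing changes.

    Returns the final index, the final centers, and a list holding the
    number of reassigned points in each iteration.
    """
    centers = [list(c) for c in centers]   # work on a copy
    index = [None] * len(data)             # no memberships yet
    history = []
    iterations = 0
    while maxit is None or iterations < maxit:
        iterations += 1
        moved = 0
        for i, x in enumerate(data):
            k = min(range(len(centers)), key=lambda j: math.dist(x, centers[j]))
            if k != index[i]:
                index[i] = k
                moved += 1
        history.append(moved)
        if moved == 0:
            break                          # memberships and centers are stable
        for k in range(len(centers)):
            members = [x for x, m in zip(data, index) if m == k]
            if members:                    # leave an empty cluster where it is
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return index, centers, history
```

When the loop ends with a zero count, every point belongs to the cluster whose center is closest, which is the guarantee /iterate gives. If maxit cuts the loop short, that guarantee no longer holds in this sketch.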

cluster,data,c,i with undefined i yields cluster centers that are not distributed evenly. cluster,data,c,i,/iterate yields a more even distribution. For large data sets it is advised to first use cluster,data,c,i,/sample,/iterate (or with sample=sample_size) to get fairly evenly distributed cluster centers, and then cluster,data,c,i to assign all data points to the clusters.

When random samples of data points are treated, the clustering algorithm used here is known as the Continuous k-Means Algorithm.
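The two-stage procedure advised above for large data sets (iterate on a random sample, then assign all data points with fixed centers) might look like the following Python sketch. All names here (resolve_sample_size, nearest, two_stage, seed) are hypothetical. The sample-size rule follows the description of the sample argument; clamping the sample to the number of data points is an added assumption, and the sketch requires the sample to contain at least as many points as there are clusters.

```python
import math
import random

def resolve_sample_size(sample, n_clusters, n_data):
    """Sample-size rule from the description of the sample argument."""
    if sample is None:
        return n_data                        # not specified: treat everything
    if sample == 1:
        return min(10 * n_clusters, n_data)  # ten times the cluster count
    if sample > 1:
        return min(int(sample), n_data)      # explicit size (clamped: assumption)
    return n_data                            # any other value: treat everything

def nearest(x, centers):
    """Number of the center closest to point x."""
    return min(range(len(centers)), key=lambda j: math.dist(x, centers[j]))

def two_stage(data, n_clusters, sample=1, seed=0):
    rng = random.Random(seed)
    size = resolve_sample_size(sample, n_clusters, len(data))
    subset = rng.sample(data, size)
    # Initial centers: a random selection of the sampled points.
    centers = [list(p) for p in rng.sample(subset, n_clusters)]
    index = None
    while True:                              # stage 1: iterate on the sample
        new_index = [nearest(x, centers) for x in subset]
        if new_index == index:
            break                            # memberships are stable
        index = new_index
        for k in range(n_clusters):
            members = [x for x, m in zip(subset, index) if m == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    # Stage 2: assign every data point; the centers no longer move.
    return centers, [nearest(x, centers) for x in data]
```

Only the first stage pays the cost of repeated passes, and it does so on the sample alone; the full data set is visited once.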

See also: Topology

