When does the size of the dataset change your analysis strategy? There are four options
in PATN that may not be appropriate for large datasets:

  - Association (when the generation of a lower symmetric matrix is involved), and three
    options that are based on this association matrix:
  - Hierarchical classification (such as flexible UPGMA)
  - Ordination (SSH in PATN)
  - ANOSIM

Once you have more than around 100 objects, traditional hierarchical clustering
is less appropriate than a non-hierarchical strategy. Examining a dendrogram of more
than 100 objects is overwhelming unless you have an intimate knowledge of the data
and the processes that are generating the variation.

Ordination of more than 100 objects can pose greater problems. The Ordination Plot
will display only those objects that are visible on the outer parts of clusters. By
omission, or by examining the coordinates, you can infer the location on the plot of
unseen objects. It may still be useful to run an ordination for a few hundred objects
just to examine the overall distribution of objects and to get the PCC orientations
(the directions of the variables in the ordination space).

When you get to thousands of objects, the time and memory required to generate the
association values start to be a consideration. PATN will happily try to produce the
association matrix, but it may not be wise for you to ask for it! Hierarchical
classification of thousands of objects is rarely sensible. Ordination will be even
worse, because it requires the most computer resources of any analysis in PATN.
Waiting a few hours for a useful outcome may be appropriate, but when the outcome is
of marginal value, think again.
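
A rough calculation shows why the pair-wise matrix becomes a burden. The sketch below compares the number of values in a lower symmetric matrix, n(n-1)/2, with the n x k object-centroid values that a non-hierarchical classification works from. The 8 bytes per value and the choice of 20 groups are assumptions made for illustration only; they do not describe how PATN actually stores its data.

```python
def pairwise_values(n_objects):
    """Number of values in a lower symmetric association matrix (no diagonal)."""
    return n_objects * (n_objects - 1) // 2


def centroid_values(n_objects, n_groups):
    """Number of object-to-centroid association values."""
    return n_objects * n_groups


BYTES_PER_VALUE = 8  # assumption: one double-precision number per value
N_GROUPS = 20        # assumption: an arbitrary number of groups

for n in (100, 1000, 5000):
    pairs = pairwise_values(n)
    cents = centroid_values(n, N_GROUPS)
    print(f"{n:>5} objects: {pairs:>10,} pair-wise values "
          f"(~{pairs * BYTES_PER_VALUE / 1e6:.1f} MB), "
          f"{cents:>7,} object-centroid values")
```

The pair-wise count grows with the square of the number of objects, while the object-centroid count grows only linearly.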

A Suggested Strategy for Larger Datasets

  - Don't generate a (pair-wise) association matrix between objects
  - Do non-hierarchical classification (which uses object-centroid association values)
  - Group statistics and box and whisker plots as usual
  - Export row group centroids
  - Import centroids as a new PATN dataset
  - Run association and SSH
  - Use PCC, MCAO and ANOSIM as required
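
The idea behind this strategy can be sketched outside PATN. The code below is a minimal illustration, not PATN's implementation: `allocate` is a hypothetical nearest-centroid reallocation used as a stand-in for PATN's non-hierarchical classification, and `gower_metric` assumes the Gower metric takes the form of a range-standardised Manhattan distance averaged over variables. The data are random numbers. The point is that only object-centroid values are computed for the full dataset, and the pair-wise matrix is built for the k centroids alone.

```python
import random


def gower_metric(a, b, ranges):
    """Assumed form: range-standardised Manhattan distance, averaged over variables."""
    return sum(abs(x - y) / r for x, y, r in zip(a, b, ranges)) / len(ranges)


def variable_ranges(rows):
    """Range of each variable; 1.0 is substituted for constant variables."""
    return [(max(col) - min(col)) or 1.0 for col in zip(*rows)]


def centroid(rows):
    """Mean of each variable over the rows in a group."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def allocate(data, k, iterations=10, seed=0):
    """Hypothetical stand-in for a non-hierarchical classification."""
    ranges = variable_ranges(data)
    centroids = random.Random(seed).sample(data, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for row in data:
            # Only object-centroid values are computed: n * k per pass.
            nearest = min(range(k),
                          key=lambda g: gower_metric(row, centroids[g], ranges))
            groups[nearest].append(row)
        # Keep the old centroid if a group ends up empty.
        centroids = [centroid(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


def centroid_matrix(centroids):
    """Lower symmetric association matrix between the k centroids only."""
    ranges = variable_ranges(centroids)  # centroids treated as a new dataset
    return [[gower_metric(centroids[i], centroids[j], ranges) for j in range(i)]
            for i in range(len(centroids))]


rng = random.Random(1)
data = [[rng.random() for _ in range(5)] for _ in range(2000)]
centroids = allocate(data, k=10)
matrix = centroid_matrix(centroids)
print(sum(len(row) for row in matrix))  # 45 values instead of 1,999,000
```

The small centroid matrix is what would then feed ordination and the other matrix-based options.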

NOTE: If there are more than 100 objects in the Data Table, PATN will by default:

  - un-select the generation of pair-wise association values
  - select non-hierarchical classification with an appropriate measure
  - un-select ordination

Run Times for PATN Datasets (3GHz, Windows XP, 512MB memory)
 

# Objects   # Variables   Analysis                               Time (seconds)
100         100           Full                                   2
200         200           Full                                   4
300         300           Full                                   12
500         500           Full                                   50
1000        1000          Full                                   33
1000        50            Full                                   17
2000        50            Non-hierarchical classification only   54
5000        50            Non-hierarchical classification only   210

Full = Gower Metric on rows & columns, UPGMA (defaults), SSH (defaults) and all
variables applied to ANOSIM & MCAO (100 iterations) and PCC.
Non-hierarchical classification only = non-hierarchical classification using the
Gower metric.

Large Datasets and Dendrograms

The non-hierarchical classification algorithm in PATN will not generate a dendrogram. The algorithm will create a set of k groups. Often what is wanted is a dendrogram of these groups. In PATN V3, this can be achieved fairly easily.

When any classification is run, PATN automatically produces group statistics. For each variable in each group, PATN reports the following values:

  1. Minimum
  2. First quartile
  3. Median
  4. Mean
  5. Third quartile
  6. Maximum

[Figure: an example of the group statistics file]

To produce a group dendrogram, do the following:

  1. Select File | Export Evaluation Data | Row Group Statistics
  2. Edit the file (or write a program or script) to produce an Excel file in the form of row groups by column means or medians. The first row of the table will be group 1, and the first column will be the mean or median of variable 1. The second column will be the mean or median of variable 2, and so on for each of the k groups. Make sure you have both row and column labels, and save the file in Excel format.
  3. Import the Excel file into PATN
  4. Select the same association measure that you used for the non-hierarchical classification
  5. Select hierarchical clustering with defaults, and select the number of final groups that you think you may want
  6. Run the classification and display the dendrogram
  7. If needed, alter the number of groups and re-run.
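
The "program or script" mentioned in step 2 could look something like the sketch below. The layout of the exported statistics file is not shown here, so the input format is an assumption: a comma-separated file with one line per group and variable, and hypothetical column names `Group`, `Variable`, `Mean`, `Median` and so on. The sample values are made up. The script writes a groups-by-variables table as CSV, which would still need to be opened in Excel and saved in Excel format before import.

```python
import csv
import io

# Hypothetical layout of the exported statistics: one line per group/variable.
# Column names and values are invented for illustration.
EXPORTED = """Group,Variable,Minimum,Q1,Median,Mean,Q3,Maximum
1,pH,5.1,5.6,6.0,6.1,6.5,7.2
1,Depth,2.0,4.0,5.0,5.5,7.0,9.0
2,pH,6.8,7.0,7.4,7.3,7.6,8.1
2,Depth,10.0,12.0,15.0,14.5,17.0,20.0
"""


def pivot(lines, statistic="Mean"):
    """Return (variables, table) where table maps group -> {variable: value}."""
    variables, table = [], {}
    for rec in csv.DictReader(lines):
        if rec["Variable"] not in variables:
            variables.append(rec["Variable"])
        table.setdefault(rec["Group"], {})[rec["Variable"]] = float(rec[statistic])
    return variables, table


def write_table(variables, table, out):
    """Write row groups by variables, with row and column labels."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Group"] + variables)
    for group, values in table.items():
        writer.writerow([f"Group {group}"] + [values[v] for v in variables])


variables, table = pivot(io.StringIO(EXPORTED), statistic="Median")
out = io.StringIO()
write_table(variables, table, out)
print(out.getvalue())
```

Switching `statistic` between `"Mean"` and `"Median"` selects which summary fills the table; to read a real export, the column names would have to be adjusted to match the file.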

 

PATN was developed by Lee Belbin and CSIRO and subsequently by Lee Belbin (Blatant Fabrications Pty Ltd, ABN: 96 106 672 379) with V1 coding by students at Griffith University (Queensland). Issues relating to this web site should be directed to patninfo@patn.com.au. PATN and this Web site are Copyright © 2004 Blatant Fabrications Pty Ltd. All rights reserved.

Last modified: Sunday April 11, 2010.