When does the size of the dataset change your analysis strategy? There are four
options in PATN that may not be appropriate for large datasets:
- Association (when the generation of a lower symmetric matrix is involved),
and three options that are based on this association matrix:
- Hierarchical classification (such as flexible UPGMA)
- Ordination (SSH in PATN)
- ANOSIM
Once you have more than around 100 objects, traditional hierarchical clustering
is less appropriate than a non-hierarchical strategy. Examining a dendrogram of
more than 100 objects is overwhelming, unless you have an intimate knowledge of
the data and the processes that are generating the variation.
Ordination of more than 100 objects can pose greater problems. The Ordination
Plot will display the objects that are visible on the external parts of
clusters. By omission, or by examining the coordinates, you can infer the
location on the plot of unseen objects. It may still be useful to run an
ordination for a few hundred objects just to examine the overall distribution
of objects and to get the PCC orientations (the directions of the variables in
the ordination space).
When you get to thousands of objects, the generation time and memory
requirements for the association values start to be a consideration. PATN will
happily try to produce the association matrix, but it may not be wise for you
to ask for it! Hierarchical classification of thousands of objects really is
getting crazy. Ordination will be even worse, in that the computer resources
required for it are the greatest in PATN. Waiting a few hours for a useful
outcome may be appropriate, but when the outcome is of marginal value, think
again.
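To give a feel for how quickly a pair-wise association matrix grows, the short Python sketch below counts the values in the lower triangle for a given number of objects. It is only an illustration: the function names are made up for this example, the diagonal is assumed to be excluded, and the 4 bytes per value is an assumption, not a statement about how PATN stores the matrix.

```python
def lower_triangle_size(n_objects: int) -> int:
    """Number of pair-wise values for n objects (lower triangle, diagonal excluded)."""
    return n_objects * (n_objects - 1) // 2


def estimated_megabytes(n_objects: int, bytes_per_value: int = 4) -> float:
    """Rough storage estimate; bytes_per_value is an assumed figure."""
    return lower_triangle_size(n_objects) * bytes_per_value / 1_000_000


if __name__ == "__main__":
    for n in (100, 1000, 5000):
        print(n, lower_triangle_size(n), round(estimated_megabytes(n), 2))
```

The count grows with the square of the number of objects, so going from 100 to 1000 objects multiplies the number of association values by roughly 100, not by 10.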
A Suggested Strategy for Larger Datasets
- Don't generate a (pair-wise) association matrix between objects
- Do non-hierarchical classification (which uses object-centroid association
values)
- Group statistics and box and whisker plots as usual
- Export row group centroids
- Import centroids as a new PATN dataset
- Run association and SSH
- Use PCC, MCAO and ANOSIM as required
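The idea behind this strategy can be sketched outside PATN in a few lines of Python. The sketch below is not PATN's non-hierarchical algorithm: it is a simple allocate-and-update loop with a naive choice of seeds, and the range-standardised form of the Gower metric used here is the common textbook form, assumed for illustration. All function names are hypothetical. It shows that objects are only ever compared with the k centroids, and that the pair-wise association matrix is then built for the k centroids alone.

```python
from statistics import mean


def variable_ranges(rows):
    """Range (max - min) of each variable; 1.0 stands in for a zero range."""
    return [(max(col) - min(col)) or 1.0 for col in zip(*rows)]


def gower(a, b, ranges):
    """Range-standardised mean absolute difference between two rows."""
    return sum(abs(x - y) / r for x, y, r in zip(a, b, ranges)) / len(ranges)


def non_hierarchical(rows, k, iterations=10):
    """Allocate each object to its nearest centroid, then update the centroids."""
    ranges = variable_ranges(rows)
    # Naive seeding (an assumption): k objects spaced evenly through the table.
    centroids = [list(rows[i * len(rows) // k]) for i in range(k)]
    groups = []
    for _ in range(iterations):
        # Object-centroid comparisons only: n * k values per pass.
        groups = [
            min(range(k), key=lambda g: gower(row, centroids[g], ranges))
            for row in rows
        ]
        for g in range(k):
            members = [row for row, grp in zip(rows, groups) if grp == g]
            if members:  # keep the old centroid if a group ends up empty
                centroids[g] = [mean(col) for col in zip(*members)]
    return groups, centroids


def centroid_association(centroids):
    """Lower-triangle association matrix for the centroids as a new, small dataset."""
    ranges = variable_ranges(centroids)
    return [
        [gower(centroids[i], centroids[j], ranges) for j in range(i)]
        for i in range(len(centroids))
    ]


if __name__ == "__main__":
    rows = [[1.0, 10.0], [1.2, 11.0], [0.9, 9.5],
            [8.0, 50.0], [8.5, 52.0], [7.9, 49.0]]
    groups, centroids = non_hierarchical(rows, k=2)
    print(groups)
    print(centroid_association(centroids))
```

With n objects and k groups, each pass computes n × k object-centroid values, and the final matrix holds only k(k-1)/2 values, which is why the centroids can be ordinated comfortably even when the original objects cannot.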
NOTE: If there are more than 100 objects in the Data Table, PATN will by
default:
- un-select the generation of pair-wise association values
- select non-hierarchical classification with an appropriate measure
- un-select ordination
Run Times for PATN Datasets (3GHz, Windows XP, 512MB memory)
| # Objects | # Variables | Analysis | Time (seconds) |
| 100 | 100 | Gower Metric on rows & columns, UPGMA (defaults), SSH (defaults) and all variables applied to ANOSIM & MCAO (100 iterations) and PCC | 2 |
| 200 | 200 | As above | 4 |
| 300 | 300 | As above | 12 |
| 500 | 500 | As above | 50 |
| 1000 | 1000 | As above | 33 |
| 1000 | 50 | As above | 17 |
| 2000 | 50 | Non-hierarchical classification only (Gower metric) | 54 |
| 5000 | 50 | Non-hierarchical classification only (Gower metric) | 210 |
Large Datasets and Dendrograms
The non-hierarchical classification algorithm in PATN will not generate a
dendrogram. The algorithm will create a set of k groups. Often what is wanted is
a dendrogram of these groups. In PATN V3, this can be achieved fairly easily.
When any classification is run, PATN automatically produces group statistics.
For each variable in each group, PATN reports the following values:
- Minimum
- First quartile
- Median
- Mean
- Third quartile
- Maximum
(Figure: an example of this file.)
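The six values listed above can also be reproduced outside PATN. The Python sketch below is not PATN output and is only an illustration: the function names are hypothetical, and the "inclusive" quartile method is an assumption, since different quartile conventions give slightly different first and third quartiles. Each group needs at least two objects for the quartiles to be defined here.

```python
from statistics import mean, median, quantiles


def six_number_summary(values):
    """Minimum, quartiles, median, mean and maximum for one variable in one group."""
    # The quartile convention is an assumption; PATN's may differ.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median(values),
        "mean": mean(values),
        "q3": q3,
        "max": max(values),
    }


def group_statistics(rows, groups):
    """Summarise every variable (column) within every group of objects (rows)."""
    by_group = {}
    for row, g in zip(rows, groups):
        by_group.setdefault(g, []).append(row)
    return {
        g: [six_number_summary(list(col)) for col in zip(*members)]
        for g, members in sorted(by_group.items())
    }


if __name__ == "__main__":
    rows = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50], [6, 60]]
    groups = [1, 1, 1, 2, 2, 2]
    for g, stats in group_statistics(rows, groups).items():
        print(g, stats)
```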

To produce a group dendrogram, do the following:
- Select File | Export Evaluation Data | Row Group Statistics
- Edit the file (or write a program or script) to produce an Excel file in the
form of row groups by column means or medians. The first row of the table
will be group 1, and the first column will be the variable 1 mean or median.
The second column will be the variable 2 mean or median, and so on, with one
row for each of the k groups. Make sure you have both row and column labels,
and save the file in Excel format.
- Import the Excel file into PATN
- Select the same association measure that you used for the
non-hierarchical classification
- Select hierarchical clustering with defaults, and select the number of
final groups that you think you may want
- Run the classification and display the dendrogram
- If needed, alter the number of groups and re-run.
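The editing step above could be scripted along the following lines. This is a sketch under a loud assumption: the layout of PATN's exported Row Group Statistics file is not described here, so the script assumes a hypothetical comma-separated file with one record per group and variable and with columns named `group`, `variable`, `mean` and `median`. The column names and the function names are made up for illustration, and a real export would need to be brought into this shape first.

```python
import csv
import io


def pivot_group_statistics(records, statistic="mean"):
    """Reshape (group, variable, statistic) records into a groups-by-variables table."""
    groups, variables, cells = [], [], {}
    for rec in records:
        g, v = rec["group"], rec["variable"]
        if g not in groups:
            groups.append(g)
        if v not in variables:
            variables.append(v)
        cells[(g, v)] = rec[statistic]
    header = ["group"] + variables  # column labels
    body = [[g] + [cells.get((g, v), "") for v in variables] for g in groups]
    return [header] + body  # each body row starts with its row label


def convert(src, dst, statistic="mean"):
    """Read the assumed long-format CSV from src and write the wide table to dst."""
    table = pivot_group_statistics(csv.DictReader(src), statistic)
    csv.writer(dst, lineterminator="\n").writerows(table)


if __name__ == "__main__":
    src = io.StringIO(
        "group,variable,mean,median\n"
        "1,v1,2.5,2\n1,v2,7.0,6\n"
        "2,v1,4.0,4\n2,v2,9.5,9\n"
    )
    dst = io.StringIO()
    convert(src, dst, statistic="mean")
    print(dst.getvalue())
```

The result is written as comma-separated text, which can be opened in Excel and saved in Excel format before importing it into PATN. Passing `statistic="median"` builds the table from medians instead of means.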