|
The data
Dataset: Medals.xls
This dataset is essentially socio-economic in nature. The data comes from a range of institutions that collect international data, such as the United Nations. This dataset was generated and distributed with PATN because most people will find it easy to understand the variables and their significance to national differences. This does not necessarily mean that the data is 'simple'. Far from it. There are 18 variables and 191 countries. I selected the 18 variables on the basis of interest and their potential to profile nation states. Many more variables could have been chosen.
The goal
The dataset was collected to pursue my interest in Australia’s unhealthy over-interest in elite sport. Divide medals by population and you will see what I mean. The dataset raises questions about approaches to national 'development'. Primarily, it is a useful basic dataset to profile nations and learn from the experience of others. Sadly, few national governments appear capable of seeing how others have approached development, and are therefore destined to re-invent the wheel and often experience the negative outcomes. Pity that the public rather than politicians always seem to be the 'guinea pigs'. I would encourage you to post any insights into what the dataset may be hinting at to the discussion group on the PATN Web site http://www.patn.com.au/phpBB2. Any ideas for other useful national statistics would also be appreciated.
The analysis
1. Import the data
While the Medals dataset is included with PATN as a PATN-formatted file, it may be helpful to demonstrate a simple import from Excel™. Here is the original data-
Note that the dataset in Excel includes "#N/A" as 'not-applicable' or missing data. This code is automatically detected by PATN on import. Missing data is typical in many datasets and some thought is required about how it is handled. Run PATN and select the bottom button (Import data from an external file)-
and point PATN at the Excel file containing the data-
In this case, there are a number of worksheets in the spreadsheet, so select the worksheet that is to be imported, here the one named 'ALL'-
and here is the resulting Data Table in PATN-
Note: 191 rows have been imported, the labels are detected and used by PATN, missing data is identified by '..' and PATN has automatically generated some basic statistics (called 'Visible Statistics' in PATN).
2. Examine the data
It is wise to scan the Visible Statistics of an imported file to see if the process worked well, and to detect issues that will need to be addressed as a part of the analysis. In this case, I'd like to use the minimum, maximum and the number of missing values as visible stats. To select these, use Tools | Options and select the required summary statistics. You can also set the number of decimals to be displayed here (this does not change the data). Given the size of some of the variables, 0 or 1 decimal place is appropriate-
After examining the stats, it appears that some countries and variables have a level of missing data that could make for a less than robust analysis. So, my first step is to eliminate those countries that have more than 5 of the 18 variables missing. To do this, hold CTRL and use the left mouse button to select the labels of the countries to be eliminated. Once they are all highlighted, you could either make them extrinsic or simply delete them. I'll delete these countries (click the right mouse button on any of the country labels and select delete)-
Andora (12)
We now have 176 countries left, none with more than 2 missing values. What about the variables? After examining the stats, I'll set as extrinsic the following variables with more than 10 missing values. Select the column labels as done with the countries, but now press the make extrinsic button on the PATN toolbar-
Medals (99)
We now have 13 variables in the analysis (intrinsic variables). Why eliminate countries with more than 5 missing values or variables with more than 10? It is a best-guess threshold after a careful examination of the data. One extreme, eliminating all countries or variables with any missing data, is too severe, as PATN handles 'fair' levels of missing data. The other extreme would be to leave the data 'as is'. This would, in some cases, generate estimates of resemblance or relationship that were based on too few values, leading to potentially unreliable outcomes. Somewhere in between seems fair. A lot of pattern analysis is like this.
3. Preliminary analysis
I'll now do a quick analysis (which is very easy with PATN) to see what the structure is like and whether there are other issues to address. I never do just one analysis, as you will see! First, select the analysis button or, from the menu, Data | Analysis-
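For readers who want to try the missing-data screening from step 2 outside PATN, here is a minimal Python sketch. It is not PATN's own code. The tiny table, the country labels and the function names `parse_cell` and `screen` are made up for illustration; only the "#N/A" code and the idea of a per-country and a per-variable limit come from the text above.

```python
# Sketch of the step-2 screening: drop rows (countries) with too many
# missing values, then flag columns (variables) that still have too many
# missing values as candidates for extrinsic status.
# The table below is invented; it is NOT the Medals data.

MISSING = "#N/A"  # missing-data code used in the Excel sheet


def parse_cell(cell):
    """Return None for the missing code, otherwise the cell as a float."""
    return None if cell == MISSING else float(cell)


def screen(table, max_missing_per_row, max_missing_per_col):
    """table maps a row label to a list of values (None = missing).

    Returns (kept_rows, extrinsic_columns): the rows with at most
    max_missing_per_row missing values, and the indices of columns with
    more than max_missing_per_col missing values among the kept rows.
    """
    kept = {
        label: values
        for label, values in table.items()
        if sum(v is None for v in values) <= max_missing_per_row
    }
    n_cols = len(next(iter(table.values())))
    extrinsic = [
        col
        for col in range(n_cols)
        if sum(values[col] is None for values in kept.values()) > max_missing_per_col
    ]
    return kept, extrinsic


raw = {
    "Country A": ["71.2", "1200", "#N/A"],
    "Country B": ["#N/A", "#N/A", "#N/A"],
    "Country C": ["65.0", "#N/A", "#N/A"],
    "Country D": ["80.1", "30500", "4"],
}
table = {label: [parse_cell(c) for c in cells] for label, cells in raw.items()}

# The tutorial uses limits of 5 (countries) and 10 (variables);
# smaller limits are used here because the toy table is so small.
kept, extrinsic = screen(table, max_missing_per_row=2, max_missing_per_col=1)
print(sorted(kept))
print(extrinsic)
```

Rows are screened first and columns second, mirroring the order used in the tutorial, so the column counts are taken over the countries that survive.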
Association
The Association tab is selected and PATN is 'greying-out' association to suggest it probably shouldn't happen. Why? There are more than 100 objects and PATN is suggesting that non-hierarchical classification is preferable. Read what PATN says about this in the box below. But if we want an ordination, we need the association matrix. So, we check "Generate a lower-symmetric matrix of associations" and select the Gower Metric. PATN would have selected the Gower Metric as a default on examining the imported data. Why? I won't go into details here, except to say that the variables have widely different ranges and that the Gower metric (range-standardized Manhattan distance) is appropriate.
Classification
We then select the Classification tab, then hierarchical classification and the Flexible UPGMA option. In tests, this strategy seems the most robust. In summary, flexible UPGMA with a beta value of -0.1 seems to be able to classify realistic artificial data better than other methods. You can only evaluate such algorithms by the analysis of many datasets where 'truth' is known. This, of course, is no easy matter when realistic data are complex. This is not the place to dive into too much theory. The beta value controls what is referred to as space-dilation and space-contraction. It is known from artificial and real datasets that, as differences between objects become greater, measures of association tend to under-estimate the true difference. Setting beta to -0.1 dilates the space defined by the variables so as to recover a better estimate of true difference. A value of zero would neither contract nor dilate the space, while a positive value (such as 0.1) would contract the space.
Select 6 groups. Hierarchical classification produces everything from 1 group to, in our case, 176 groups (one per remaining country). Selecting 6 here simply tells PATN how many groups you would like to use as the level of summary. Why 6? Seems like a good number to me. It is more a matter of what your brain can handle easily than anything else, such as the number of real groups in the data, if there are any. 6 is a good number of groups of things that can be easily enough understood. We can change this later if we wish, but for the moment, let's try 6. By the way, if we chose 30 here, for example, I'll bet (if the data is 'good') that PATN would summarise 30 meaningful groups. Classification reduces a larger number of objects to a smaller number of groups. 6 is easier than 176. Helpful. |
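The two ideas above, a range-standardized Manhattan distance and a beta-controlled fusion rule, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not PATN's implementation. The function names are hypothetical, skipping variables that are missing in either object is an assumption about missing-data handling, and the fusion rule is the Lance-Williams form usually quoted for flexible UPGMA.

```python
def gower_distance(a, b, ranges):
    """Range-standardized Manhattan distance between two objects.

    a, b   : lists of values, with None marking a missing value
    ranges : per-variable range (max - min) over the whole dataset
    Variables missing in either object are skipped (an assumption).
    """
    total, used = 0.0, 0
    for x, y, r in zip(a, b, ranges):
        if x is None or y is None or r == 0:
            continue
        total += abs(x - y) / r
        used += 1
    if used == 0:
        raise ValueError("no variables in common")
    return total / used


def flexible_upgma(d_ki, d_kj, d_ij, n_i, n_j, beta=-0.1):
    """Distance from group k to the group formed by fusing i and j.

    A size-weighted average of the two old distances, scaled by
    (1 - beta), plus beta times the distance between i and j.
    """
    weighted = (n_i * d_ki + n_j * d_kj) / (n_i + n_j)
    return (1.0 - beta) * weighted + beta * d_ij


# Two made-up objects with three variables; the third is missing in `a`.
a = [70.0, 1000.0, None]
b = [80.0, 3000.0, 5.0]
print(gower_distance(a, b, ranges=[40.0, 20000.0, 10.0]))

# The same fusion under three beta values.
for beta in (-0.1, 0.0, 0.1):
    print(beta, flexible_upgma(0.4, 0.6, 0.2, n_i=1, n_j=1, beta=beta))
```

With beta at zero the rule reduces to ordinary UPGMA. A negative beta gives a larger fused distance than UPGMA (dilation) and a positive beta a smaller one (contraction), which matches the behaviour described above.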
| Stress | Interpretation |
| > 0.3 | try again! |
| 0.2 to 0.3 | not great |
| 0.15 to 0.2 | lower would be better |
| 0.1 to 0.15 | possibly ok |
| < 0.1 | not bad at all |
| < 0.05 | very good (is something wrong!!) |
So, a stress of 0.0994 is not too bad for complex data, but we may be able to improve on this.
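If you are comparing several runs, the rough guide in the table above is easy to encode. The sketch below is only a convenience. The function name `stress_band` is hypothetical, and how values exactly on a boundary are treated is an assumption, since the table does not say.

```python
def stress_band(stress):
    """Map an ordination stress value to the rough guide in the table."""
    if stress > 0.3:
        return "try again!"
    if stress >= 0.2:
        return "not great"
    if stress >= 0.15:
        return "lower would be better"
    if stress >= 0.1:
        return "possibly ok"
    if stress >= 0.05:
        return "not bad at all"
    return "very good (is something wrong!!)"


print(stress_band(0.0994))
```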
We have a result. That is the easy part. Now we have to analyse what PATN is suggesting. That's the hard part. While we could have selected 'All evaluations' on the first analysis dialog box, this would have only evaluated the extrinsic variables. I'd prefer the lot. So, either click the evaluation button or select Data | Evaluation and select the 'Box Whisker' tab and press the 'Add All>>' button. This will run all variables (extrinsic and intrinsic) across the 6 groups seeking to find how they are distributed using box and whisker plots and associated Kruskal-Wallis (KW) values. More on KW later.

Next, select the PCC / MCAO tab and also 'Add All' variables for PCC. Don't bother with MCAO. What is PCC? It is an abbreviation for Principal Coordinate Correlation. It's a hangover term from the old days of DOS PATN where I correlated a set of variables with the axes of another ordination technique called principal coordinate analysis. Basically, what PCC does is to use multiple linear regression to fit a set of variables into an ordination space. In our case, PCC will take each of the 18 variables and a) give us the best fit direction and b) correlation. This will show us diagrammatically, how the variables can help define the directions in the ordination we have produced.

While it is only the intrinsic variables that generated the 3-dimensional ordination, there is no harm in seeing what this means to all variables. It won't change the ordination, it is just an evaluation.
Next, as in the analysis, PATN displays the evaluation recipe for you to check-

We press OK and the evaluation is done in a fraction of a second. We can now look at the analysis with the help of the evaluations. Note that the PATN menu bar is useful for this phase. The buttons that are greyed out refer to steps that have not been run as yet (in this case a two-way table and MCAO). The buttons (from left to right) are (display) association matrix, dendrogram, two-way table, ordination, ANOSIM, box and whisker plots, PCC, MCAO and recipe.

First up, click the box and whisker button and the following window will be displayed (only the first two variables are shown here)-
This is a very useful set of graphs to examine how the variables are distributed across the 6 groups. By default, the graphs are sorted by decreasing Kruskal-Wallis value. The higher the KW value, the better the variable is at discriminating between the 6 groups. In this case, Life Expectancy is best (KW=141.4) and Electricity/capita is next best (KW=128.32). Here is the complete table of KW values (easily generated from PATN's Export | Evaluation Data menu).
| Variable | KW Value | Extrinsic |
| Life Exp | 141.4 | |
| Electricity/Capita | 128.3 | |
| CO2/cap | 115.5 | |
| GNP/Cap | 114.9 | |
| Cars /1000 | 113.4 | x |
| Univ% | 109.2 | x |
| People/Doctor | 107.7 | |
| Literacy | 89.5 | |
| Deaths/1000 | 87.8 | |
| Deforestation Rate | 52.8 | |
| Coal/Capita | 37.2 | ->x |
| Ed /GNP% | 26.9 | ->x |
| Population Density | 18.7 | ->x |
| Prot Land% | 10.7 | x |
| Arable land/1K | 10.0 | ->x |
| Pop | 8.6 | ->x |
| Medals | 5.6 | x |
| Mil%GNP | 5.6 | x |
x - extrinsic at time of analysis
->x - made extrinsic for re-analysis
From this table we can see that two extrinsic variables (Cars/1000 and University%) show good discrimination between groups even though they were extrinsic. Looking at the values, it would appear that the variables below Deforestation Rate are nowhere near as good as those above. Look at the plots. The left bar is the minimum, the left edge of the box is the 1st quartile (25% of values are below this line), the vertical line in the box is the median (50% below), the circle is the mean, the right box edge is the 3rd quartile (25% of the values are above this line) and the right bar is the maximum. Deforestation rate does show fairly good discrimination across the groups while Coal/capita doesn't.
So what?
Remember that variables can be considered as comprising two parts: signal (what we are seeking) and noise (what we would like to remove). Some variables, at least as far as this dataset goes, have a high level of noise. If we eliminate them from the analysis (as intrinsics), we may be able to increase the signal. We could use the same argument to suggest that Cars/1000 and Univ% could be made intrinsic. Probably, but for this demo, I'll leave them out (as extrinsics) and see what they do next.
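To make the box and whisker summaries and the KW values more concrete, here is a minimal Python sketch of both calculations. It is not PATN's code. The function names are hypothetical, the quartile method is one of several common conventions, and the Kruskal-Wallis statistic is computed without a tie correction, which PATN may or may not apply.

```python
import statistics


def box_summary(values):
    """Minimum, quartiles, mean and maximum, as drawn in a box and whisker plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values), "q1": q1, "median": median,
        "mean": statistics.mean(values), "q3": q3, "max": max(values),
    }


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H for a list of groups (no tie correction)."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n = len(pooled)
    h, offset = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[offset:offset + len(g)])
        h += rank_sum ** 2 / len(g)
        offset += len(g)
    return 12.0 / (n * (n + 1)) * h - 3.0 * (n + 1)


# Made-up values: one variable that separates three groups cleanly,
# and one whose values are spread evenly across the groups.
separating = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
overlapping = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
print(box_summary(separating[0]))
print(kruskal_wallis_h(separating), kruskal_wallis_h(overlapping))
```

The variable that separates the groups cleanly gets a much higher H than the one that overlaps, which is the sense in which a high KW value marks a good discriminator.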
I'll now set Coal/cap, Ed/GNP%, Population Density, Arable land/1000 and Population as extrinsic, re-do the analysis and see what happens. I did take a quick look at the ordination plot and it did suggest we had a rational pattern, but to save space, on with the re-analysis (with the same parameters).
The stress has now dropped from 0.0994 to 0.0646. That's a good sign, but remember we now have 8 variables, not 13, and it is easier to squeeze 8 variables into 3 dimensions than 13.

When the analysis is done, re-do the evaluation exactly as before, including all 18 variables. I'll examine the box and whisker plots and associated KW values carefully, but this time I'll just show the PCC values to demonstrate how they work. For this, press the PCC button and note that PATN highlights the PCC tab in the Data Table. This tab has (in this case) 4 values: x, y, z and r-squared. The x, y and z values are the coordinates of the variable in the ordination space. For example, Life Expectancy is -0.1, -0.3, -1.0 and 0.9 (rounded to 1 decimal place). This means that a vector in the ordination space will have its tip at these coordinates, and the r-squared value is excellent (this vector would account for 90% of the variation in the life expectancy values). The table of PCC values can be viewed from the right-mouse button menu on the ordination plot, or by exporting the PCC values from the Data menu.

Let's see the main PATN display, the ordination plot. This is the display where you will spend most of your time analysing the results. Believe me! For the display below, I have rotated the plot manually (SHIFT + left mouse button) to a position where it is easy to see the overall structure. There is nothing like rotating the objects (the countries) every which way. Note also that I have identified Group 5 (the affluent countries, as it turns out) by clicking in the Group 5 area, not on a country. This has in turn automatically displayed, on the left side of the plot, the best overall KW values across the 6 groups. I have also pressed 'G' for PATN to display the group colours rather than the individual object colours. This makes it easier to see the groups generated by the hierarchical classification.
There is a basic affluence trend from the SW (group 1, the poorest) to the NE (group 5, the richest). As 'poor' and 'rich' are not variables, this hints at the type of evaluation that needs to be done.

Clicking on countries in the ordination plot will quickly give you an idea of what the overall structure is. It is efficient to identify the outliers, like Yemen in the SE corner of the plot above. As can be seen from the group colour, Yemen is a single member group. Note: When you click on an object (a country here) in the ordination plot, the object is also highlighted in the data table, and vice versa.

An important point about outliers is that they have a strong influence on ordinations. Ordination is based on regression, so it shares regression's sensitivity to outliers. There is therefore a strong argument for eliminating serious outliers, as they will influence the ordination well beyond their single-object status.
So, I'll make Yemen extrinsic and run the analysis again. Highlight Yemen in the Data Table by clicking on its label (left mouse button) and then either right-click on the label and select Make Extrinsic, or click on the make extrinsic button on the PATN toolbar. This will place Yemen out of the analysis but keep it available for any interpretation you may wish to perform.
No change in parameters for this re-analysis. This time, the SSH stress is down to 0.0632. We should be more than happy about this given the number of objects and the complexity of the dataset.

Now we can get to work on the evaluation proper. Select the Evaluation button and run all evaluation options (ANOSIM, Box and Whisker, PCC and MCAO) on all variables; intrinsics and extrinsics.
First, let's look at how all the variables (intrinsic and extrinsic) are distributed across the new 6 groups (extrinsic variables are marked with an "x")-
| Variable | KW Value | Extrinsic? |
| Life Exp | 135.0 | |
| Electricity/Capita | 129.4 | |
| Literacy | 122.8 | |
| CO2/cap | 119.4 | |
| People/Doctor | 115.4 | |
| Univ% | 110.8 | x |
| GNP/Cap | 109.2 | |
| Cars /1000 | 107.2 | x |
| Deaths / 1000 | 97.3 | |
| Coal/Capita | 60.3 | x |
| Deforestation Rate | 35.9 | |
| Ed /GNP% | 24.8 | x |
| Mil%GNP | 23.7 | x |
| Prot Land% | 17.0 | x |
| Medals | 13.7 | x |
| Population Density | 10.9 | x |
| Arable land/1K | 10.6 | x |
| Pop | 6.9 | x |
Things to note:
1. Most of the intrinsic variables are effective discriminators across the 6 groups, except maybe 'Deforestation rate'. This is strongly backed up by the box and whisker plots. You can judge the effectiveness of the box and whisker plots by their ability to create an effective decision tree that could be used to discriminate the objects between groups. In this case, it is easy to build up such a decision tree starting from the best discriminating variable, then the second best, down to (if necessary) 'Deaths/1000'.
2. 'Deforestation rate' (an intrinsic) doesn't seem to be that useful as a group discriminator. We could cull it as an extrinsic and go back and re-analyse, but that is probably not necessary as the stress is so good, and the classification also makes a lot of sense.
3. Some extrinsic variables show good discrimination. This implies that although some extrinsic variables had an uncomfortable number of missing values, the values that are there do seem to display a signal that aligns well with the classification. I'd include cars/1000 and Univ% in this category. It means that we can use these two variables to help interpret the groups and to correlate with the intrinsic variables.
ANOSIM will tell us how effective (different) the 6 groups are on the basis of the 'within-group' and 'between-group' values of the Gower metric. ANOSIM is a type of F-test using association rather than variables. Click the 'A-button' on the PATN toolbar and you will see this window-

This tells us that none of the 100 randomised solutions (swapping objects between the 6 groups) is better than the grouping that PATN generated. With a standard analysis in PATN, this is far from surprising. A value > 5% would suggest a poor classification which may indicate a lot of noise or poor variables or poor sampling.
As we have not yet run an analysis on the variables, ANOSIM on variables is not available.
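The logic of comparing the real grouping with randomised ones can be sketched as a simple permutation test. This is an illustration, not PATN's ANOSIM. The statistic used here (mean between-group distance minus mean within-group distance) is a stand-in chosen for clarity, and the function names, the seed and the toy points are all made up.

```python
import random


def separation(dist, labels):
    """Mean between-group distance minus mean within-group distance."""
    within, between = [], []
    n = len(labels)
    for i in range(n):
        for j in range(i):
            (within if labels[i] == labels[j] else between).append(dist[i][j])
    return sum(between) / len(between) - sum(within) / len(within)


def permutation_test(dist, labels, n_random=100, seed=0):
    """Percentage of random relabellings that separate groups at least as well."""
    rng = random.Random(seed)
    observed = separation(dist, labels)
    shuffled = list(labels)
    as_good = 0
    for _ in range(n_random):
        rng.shuffle(shuffled)  # swaps objects between groups, keeping group sizes
        if separation(dist, shuffled) >= observed:
            as_good += 1
    return 100.0 * as_good / n_random


# Toy data: 16 points on a line in four tight, well-separated clusters.
points = [base + k for base in (0, 100, 200, 300) for k in range(4)]
labels = [g for g in range(4) for _ in range(4)]
dist = [[abs(p - q) for q in points] for p in points]
print(permutation_test(dist, labels))
```

A result near 0% says that essentially no random regrouping does as well as the grouping being tested, which is the reading given to the ANOSIM window above.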
PATN's PCC routine uses multiple linear regression to fit each selected variable into the ordination space (1, 2 or 3-dimensions). The result is a set of coordinates that represents the tip of the vector of the variable. SSH centres the coordinates of the objects so this vector represents the best fit direction of the variable. An r-squared value provides some estimate on how good the fit was. For example, an r-squared value of 0.7 means that 70% of the variation of the variable is accounted for by the vector.
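The fitting step described above can be sketched in plain Python as follows. This is an outline of the general technique (least-squares regression of one variable on the ordination coordinates), not PATN's routine. The function names are hypothetical, and scaling the vector to unit length is an assumption, although the tabulated coordinates in this tutorial are close to unit length.

```python
import math


def solve(a, b):
    """Solve the small linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_vector(coords, values):
    """Regress one variable on ordination coordinates.

    Returns (unit direction vector, r-squared).
    """
    n, dims = len(coords), len(coords[0])
    centre = [sum(p[k] for p in coords) / n for k in range(dims)]
    xs = [[p[k] - centre[k] for k in range(dims)] for p in coords]
    mean_y = sum(values) / n
    ys = [v - mean_y for v in values]
    # Normal equations for the centred data: (X'X) coef = X'y
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(dims)] for i in range(dims)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(dims)]
    coef = solve(xtx, xty)
    fitted = [sum(c * xk for c, xk in zip(coef, x)) for x in xs]
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum(y ** 2 for y in ys)
    length = math.sqrt(sum(c ** 2 for c in coef))
    return [c / length for c in coef], 1.0 - ss_res / ss_tot


# Made-up 2-dimensional ordination and a variable that is an exact
# linear function of the coordinates, so the fit should be perfect.
coords = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
values = [3 * x + 4 * y + 7 for x, y in coords]
direction, r_squared = fit_vector(coords, values)
print(direction, r_squared)
```

The direction is the line through the ordination along which the variable increases fastest, and r-squared says how much of the variable's variation that single direction explains.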
Pressing the PCC button on the PATN toolbar will highlight the PCC TAB in the Data Table.

Alternatively, you could export the PCC values from the Data Menu. If we do that and then sort on the r-squared values we get the following table. I've added an extra column to designate the extrinsic variables (marked with an "x").
How does the 'utility' of the variables as measured by the r-squared value differ from that highlighted by the Kruskal-Wallis value, and why? First, why? Remember that the KW values are based on the 6 groups, while the r-squared value is based on the coordinates of the countries in the ordination plot. KW is 'clumped' and r-squared isn't.
I have tabulated both the KW and PCC values below and included the difference in the ranks as the last column. For example, the difference in the rank of Deaths/1000 for KW and r-squared is 7, a substantial difference.
| Variable | X | Y | Z | rSquared | Extrinsic? | KW | Rank Difference |
| Life Exp | 0.12 | -0.40 | -0.91 | 0.90 | | 135.0 | 0 |
| Deaths/1000 | -0.12 | 0.81 | 0.57 | 0.85 | | 97.3 | 7 |
| People/Doctor | -0.97 | 0.16 | 0.15 | 0.72 | | 115.4 | 2 |
| GNP/Cap | -0.31 | 0.70 | -0.64 | 0.58 | | 109.2 | 3 |
| Electricity/Capita | -0.24 | 0.68 | -0.70 | 0.58 | | 129.4 | 3 |
| Medals | -0.21 | 0.91 | -0.35 | 0.35 | x | 13.7 | 9 |
| Coal/Capita | -0.18 | 0.94 | -0.30 | 0.18 | | 60.3 | 3 |
| Deforestation Rate | -0.78 | -0.42 | 0.46 | 0.07 | | 35.9 | 3 |
| Prot Land% | -0.42 | 0.91 | -0.02 | 0.05 | x | 17.0 | 5 |
| Mil%GNP | -0.77 | 0.59 | -0.22 | 0.05 | x | 23.7 | 3 |
| CO2/cap | 0.69 | -0.67 | -0.30 | 0.04 | x | 119.4 | 7 |
| Univ% | -0.24 | 0.96 | 0.14 | 0.04 | x | 110.8 | 5 |
| Population Density | 0.18 | -0.78 |