The data
Dataset: Medals.xls
This dataset is essentially socio-economic in nature. The data comes from a range of institutions that collect international data, such as the United Nations. This dataset was generated and distributed with PATN because most will find it easy to understand the variables and their significance to national differences. This does not necessarily mean that the data is 'simple'. Far from it. There are 18 variables and 191 countries. I selected the 18 variables on the basis of interest and their potential to profile nation states. Many more variables could have been chosen.
The goal
The dataset was collected to pursue my interest in Australia’s unhealthy over-interest in elite sport. Divide medals by population and you will see what I mean. The dataset raises questions about approaches to national 'development'. Primarily, it is a useful basic dataset to profile nations, and learn from the experience of others. Sadly, few national governments appear capable of seeing how others have approached development, and are therefore destined to re-invent the wheel, and often experience the negative outcomes. Pity that the public rather than politicians always seem to be the 'guinea pigs'. I would encourage you to post any insights into what the dataset may be hinting at to the discussion group on the PATN Web site http://www.patn.com.au/phpBB2. Any ideas of other useful national statistics would also be appreciated.
The analysis
1. Import the data
While the Medals dataset is included with PATN as a PATN-formatted file, it may be helpful to demonstrate a simple import from Excel™. Here is the original data-
Note that the dataset in Excel includes "#N/A" as 'not-applicable' or missing data. This code is automatically detected by PATN on import. Missing data is typical in many datasets and some thought is required about how it is handled. Run PATN and select the bottom button (Import data from an external file)-
and point PATN at the Excel file containing the data- in this case, there are a number of Worksheets in the spreadsheet, so select the worksheet that is to be imported, in this case, the one named 'ALL'- and here is the resulting Data Table in PATN-
Note: 191 rows have been imported, the labels are detected and used by PATN, missing data is identified by '..' and PATN has automatically generated some basic statistics (called 'Visible Statistics' in PATN).
2. Examine the data
It is wise to scan the Visible Statistics of an imported file to see if the process worked well, and to detect issues that will need to be addressed as a part of the analysis. In this case, I'd like to use the minimum, maximum and the number of missing values as visible stats. To select these, use Tools | Options and select the required summary statistics. You can also set the number of decimals to be displayed here (it does not change the data). Given the size of some of the variables, 0 or 1 decimal place is appropriate-
After examining the stats, it appears that there are some countries and variables that have a level of missing data that could make for a less than robust analysis. So, my first step is to eliminate those countries that have more than 5 out of the 18 variables missing. To do this, hold CTRL and use the left mouse button to select the labels of those countries to be eliminated. Once they are all highlighted, you could either make them extrinsic, or simply delete them. I'll delete these countries (click the right mouse button on any of the country labels and select delete)-
Andora (12)
We now have 176 countries left, none with more than 2 missing values. What
about the variables? After examining the stats, I'll set as extrinsic the following variables with greater than 10 missing values. Select the column labels as done with the countries, but now press the make extrinsic button on the PATN toolbar-
Medals (99)
We now have 13 variables in the analysis (intrinsic variables).
Why eliminate countries with more than 5 missing values or variables with more than 10? It is a best-guess threshold after a careful examination of the data. Eliminating all countries or variables with missing data is too extreme, as PATN handles 'fair' levels of missing data. The other extreme would be to leave the data 'as is'. This would, in some cases, generate estimates of resemblance or relationship that were based on too few values, leading to potential unreliability of the outcomes. Somewhere in between seems fair. A lot of pattern analysis is like this.
3. Preliminary analysis
I'll now do a quick analysis (which is very easy with PATN) to see what the structure is like and if there are other issues to address. I never do just one analysis, as you will see! First, select the analysis button or, from the menu, Data | Analysis-
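The missing-value screening described in step 2 can also be sketched outside PATN. The following is only an illustration in Python, not PATN's own code: the function names and the toy table are hypothetical, and the thresholds are passed in as parameters so you can experiment with them.

```python
MISSING_CODES = {"#N/A", "..", ""}  # "#N/A" in Excel, ".." in the PATN Data Table

def parse_cell(text):
    """Return a float, or None when the cell holds a missing-data code."""
    text = text.strip()
    return None if text in MISSING_CODES else float(text)

def count_missing(values):
    """Number of missing (None) entries in a row or column."""
    return sum(v is None for v in values)

def screen(table, max_missing_per_row, max_missing_per_col):
    """table: dict of label -> list of values (None = missing).

    Rows with too many missing values are dropped; columns with too many
    missing values (among the kept rows) are reported as extrinsic.
    """
    kept = {label: row for label, row in table.items()
            if count_missing(row) <= max_missing_per_row}
    if not kept:
        return {}, []
    n_cols = len(next(iter(kept.values())))
    extrinsic = [j for j in range(n_cols)
                 if count_missing([row[j] for row in kept.values()])
                 > max_missing_per_col]
    return kept, extrinsic

# Toy example (made-up numbers, not the Medals data)
raw = {"A": ["1", "2", "#N/A"],
       "B": ["#N/A", "#N/A", "#N/A"],
       "C": ["3", "4", "#N/A"]}
table = {label: [parse_cell(c) for c in cells] for label, cells in raw.items()}
kept, extrinsic = screen(table, max_missing_per_row=2, max_missing_per_col=1)
print(sorted(kept), extrinsic)
```

For the walkthrough above, the equivalent call would use `max_missing_per_row=5` and `max_missing_per_col=10`.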
Association
The Association tab is selected and PATN is 'greying-out' association to suggest it probably shouldn't happen. Why? There are more than 100 objects and PATN is suggesting that non-hierarchical classification is preferable. Read what PATN says about this in the box below. But, if we want an ordination, we need the association matrix. So, we check "Generate a lower-symmetric matrix of associations" and select the Gower Metric. PATN would have selected the Gower Metric as a default on examining the imported data. Why? I won't go into details here, except to say that the variables have widely different ranges and that the Gower metric (range standardized Manhattan distance) is appropriate.
Classification
We then select the Classification tab and then select hierarchical classification and the Flexible UPGMA option. In tests, this strategy seems the most robust. In summary, flexible UPGMA with a beta value of -0.1 seems to be able to classify realistic artificial data better than other methods. You can only evaluate such algorithms by the analysis of many datasets where 'truth' is known. This, of course, is no easy matter when realistic data are complex. This is not the place to dive into too much theory. The beta value controls what is referred to as space-dilation and space-contraction. It is known from artificial and real datasets that, as differences become greater between objects, measures of association tend to under-estimate the true difference. Setting beta to -0.1 dilates the space defined by the variables so as to recover a better estimate of true difference. A value of zero would not contract or dilate the space while a positive value (such as 0.1) would contract the space.
Select 6 groups. Hierarchical classification produces everything from 1 group to, in our case, 191 groups. Selecting 6 here simply tells PATN how many groups you would like to use as the level of summary. Why 6? Seems like a good number to me. 
It is more a matter of what your brain can handle easily than anything else, such as the number of real groups in the data, if there are any. 6 is a good number of groups of things that can be easily enough understood. We can change this later if we wish, but for the moment, let's try 6. By the way, if we chose 30 here for example, I'll bet (if the data is 'good') that PATN would summarise 30 meaningful groups. Classification reduces a larger number of objects to a smaller number of groups. 6 is easier than 191. Helpful.
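To make the role of beta a little more concrete, here is a minimal sketch of agglomerative clustering with a flexible UPGMA update. It assumes the commonly published Lance-Williams form, in which the two size-weighted coefficients are scaled by (1 - beta) and beta multiplies the distance between the two groups being fused. I am not claiming this is PATN's internal code, and the function name is hypothetical.

```python
def flexible_upgma(dist, n_groups, beta=-0.1):
    """Fuse the closest pair of groups until n_groups remain.

    dist: symmetric list-of-lists of dissimilarities between objects.
    Returns the groups as sorted lists of object indices.
    """
    n = len(dist)
    # Pairwise distances keyed by (smaller id, larger id)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    members = {i: [i] for i in range(n)}
    next_id = n
    while len(members) > n_groups:
        (i, j), d_ij = min(d.items(), key=lambda kv: kv[1])
        ni, nj = len(members[i]), len(members[j])
        # Assumed flexible UPGMA coefficients; beta = 0 gives plain UPGMA
        a_i = (1 - beta) * ni / (ni + nj)
        a_j = (1 - beta) * nj / (ni + nj)
        new = next_id
        next_id += 1
        for k in members:
            if k in (i, j):
                continue
            d_ki = d[(min(k, i), max(k, i))]
            d_kj = d[(min(k, j), max(k, j))]
            d[(k, new)] = a_i * d_ki + a_j * d_kj + beta * d_ij
        # Drop every distance that involves the two fused groups
        d = {pair: v for pair, v in d.items()
             if i not in pair and j not in pair}
        members[new] = members.pop(i) + members.pop(j)
    return sorted(sorted(g) for g in members.values())

# Toy example: five objects on a line at 0, 1, 10, 11 and 50
points = [0, 1, 10, 11, 50]
toy = [[abs(a - b) for b in points] for a in points]
groups = flexible_upgma(toy, n_groups=3)
print(groups)
```

With a negative beta, distances from a newly fused group to everything else are stretched slightly, which is the space-dilation described above.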
| Stress | Interpretation |
| > 0.3 | try again! |
| 0.2 to 0.3 | not great |
| 0.15 to 0.2 | lower would be better |
| 0.1 to 0.15 | possibly ok |
| < 0.1 | not bad at all |
| < 0.05 | very good (is something wrong!!) |
So, 0.0994 is not too bad for complex data, but we may be able to improve on this.
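As a rough illustration of what a stress value measures, the sketch below computes Kruskal's stress formula 1 from two lists: the distances between objects in the ordination and the target values they are meant to match. Whether SSH in PATN uses exactly this formula is an assumption on my part. The second function simply encodes the rule-of-thumb table above; both function names are hypothetical.

```python
import math

def stress1(ordination_dists, target_dists):
    """Kruskal's stress formula 1 (assumed here, not confirmed for SSH)."""
    num = sum((d - t) ** 2 for d, t in zip(ordination_dists, target_dists))
    den = sum(d * d for d in ordination_dists)
    return math.sqrt(num / den)

def describe_stress(stress):
    """Map a stress value to the rule-of-thumb labels in the table above."""
    if stress > 0.3:
        return "try again!"
    if stress >= 0.2:
        return "not great"
    if stress >= 0.15:
        return "lower would be better"
    if stress >= 0.1:
        return "possibly ok"
    if stress >= 0.05:
        return "not bad at all"
    return "very good (is something wrong!!)"

print(describe_stress(0.0994))
```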
We have a result. That is the easy part. Now we have to analyse what PATN is suggesting. That's the hard part. While we could have selected 'All evaluations' on the first analysis dialog box, this would have only evaluated the extrinsic variables. I'd prefer the lot. So, either click the evaluation button or select Data | Evaluation and select the 'Box Whisker' tab and press the 'Add All>>' button. This will run all variables (extrinsic and intrinsic) across the 6 groups seeking to find how they are distributed using box and whisker plots and associated Kruskal-Wallis (KW) values. More on KW later.

Next, select the PCC / MCAO tab and also 'Add All' variables for PCC. Don't bother with MCAO. What is PCC? It is an abbreviation for Principal Coordinate Correlation. It's a hangover term from the old days of DOS PATN where I correlated a set of variables with the axes of another ordination technique called principal coordinate analysis. Basically, what PCC does is to use multiple linear regression to fit a set of variables into an ordination space. In our case, PCC will take each of the 18 variables and a) give us the best fit direction and b) correlation. This will show us diagrammatically, how the variables can help define the directions in the ordination we have produced.

While it is only the intrinsic variables that generated the 3-dimensional ordination, there is no harm in seeing what this means to all variables. It won't change the ordination, it is just an evaluation.
Next, as in the analysis, PATN displays the evaluation recipe for you to check-

We press OK and the evaluation is done in a fraction of a second. We can now look at the analysis with the help of the evaluations. Note that the PATN menu bar is useful for this phase. The buttons that are greyed out refer to steps that have not been run as yet (in this case a two-way table and MCAO). The buttons (from left to right) are (display) association matrix, dendrogram, two-way table, ordination, ANOSIM, box and whisker plots, PCC, MCAO and recipe.

First up, click the box and whisker button and the following window will be displayed (only the first two variables are shown here)-
This is a very useful set of graphs to examine how the variables are distributed across the 6 groups. By default, the graphs are sorted by decreasing Kruskal-Wallis value. The higher the KW value, the better the variable is at discriminating between the 6 groups. In this case, Life Expectancy is best (KW=141.4) and Electricity/capita is next best (KW= 128.32). Here is the complete table of KW values (easily generated from PATN's Export | Evaluation Data menu).
x - extrinsic at time of analysis
->x - made extrinsic for re-analysis
From this table we can see that two extrinsic variables (Cars/1000 and University%) show good discrimination between groups even though they were extrinsic. Looking at the values, it would appear that variables below deforestation rate are nowhere near as good as those above. Look at the plots. The left bar is the minimum, the left edge of the box is the 1st quartile (25% of values are below this line), the vertical line in the box is the median (50% below), the circle is the mean, the right box edge is the 3rd quartile (25% of the values are above this line) and the right bar is the maximum. Deforestation rate does show fairly good discrimination across the groups while Coal/capita doesn't.
So what?
| Variable | KW Value | Extrinsic |
| Life Exp | 141.4 | |
| Electricity/Capita | 128.3 | |
| CO2/cap | 115.5 | |
| GNP/Cap | 114.9 | |
| Cars /1000 | 113.4 | x |
| Univ% | 109.2 | x |
| People/Doctor | 107.7 | |
| Literacy | 89.5 | |
| Deaths/1000 | 87.8 | |
| Deforestation Rate | 52.8 | |
| Coal/Capita | 37.2 | ->x |
| Ed /GNP% | 26.9 | ->x |
| Population Density | 18.7 | ->x |
| Prot Land% | 10.7 | x |
| Arable land/1K | 10.0 | ->x |
| Pop | 8.6 | ->x |
| Medals | 5.6 | x |
| Mil%GNP | 5.6 | x |
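For readers who want to see where a KW value comes from, here is a minimal sketch of the standard Kruskal-Wallis H statistic with average ranks and the usual tie correction. It assumes missing values have already been dropped from each group; I have not checked that PATN computes its KW values in exactly this way, and the function names are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

def kruskal_wallis(groups):
    """groups: list of non-empty lists holding one variable's values per group."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n)
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    ties = sum(c ** 3 - c for c in counts.values())
    correction = 1 - ties / (n ** 3 - n)
    return h / correction if correction > 0 else 0.0

# Toy example: three perfectly separated groups
h = kruskal_wallis([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(round(h, 2))
```

The more cleanly a variable separates the groups, the larger the value, which is why the table is read from the top down.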
Remember that variables can be considered as comprising two parts, signal (what we are seeking) and noise (what we would like to remove). Some variables, at least as far as this dataset goes, have a high-level of noise. If we eliminate them from the analysis (as intrinsics), we may be able to increase the signal. We could use the same argument to suggest that Cars/1000 and Univ% could be made intrinsic. Probably, but for this demo, I'll leave them out (as extrinsics) and see what they do next.
I'll now set Coal/cap, Ed/GNP%, Population Density, Arable land/1000 and population as extrinsic and re-do the analysis and see what happens. I did take a quick look at the ordination plot and it did suggest we had a rational pattern, but to save space, on with the re-analysis (with the same parameters)
The stress has now dropped from 0.0994 to 0.0646. That's a good sign, but remember we now have 8 variables, not 13, and it will be easier to squeeze 8 into 3 than it is 13 into 3.

When the analysis is done, re-do the evaluation exactly the same as before - include all 18 variables. I'll examine the box and whisker plots and associated KW values carefully, but this time, I'll just show the PCC values to demonstrate how they work. For this, press the PCC button and note that PATN highlights the PCC tab in the Data Table. This tab has (in this case) 4 values - x, y, z and r-squared. The x, y and z are the coordinates in the ordination space of the variable. For example, Life Expectancy is -0.1, -0.3, -1.0 and 0.9 (rounded to 1 decimal place). This means that a vector in the ordination space will have the tip at these coordinates and the r-squared value is excellent (this vector would account for 90% of the variation in the life expectancy values). The table of PCC values can be viewed from the right mouse button menu on the ordination plot, or by exporting the PCC values from the Data menu.

Let's see the main PATN display, the ordination plot. This is the display where you will spend most of your time analysing the results. Believe me! For the display below, I have rotated the plot manually (SHIFT + left mouse button) to a position where it is easy to see the overall structure. There is nothing like rotating the objects (the countries) every which way. Note also that I have identified Group 5 (the affluent countries, as it turns out) by clicking in the group 5 area, not on a country. This has in turn automatically displayed, on the left side of the plot, the best overall KW values across the 6 groups. I have also pressed 'G' for PATN to display the group colours, rather than the individual object colours. This makes it easier to see the groups generated by the hierarchical classification.
There is a basic affluence trend from the SW (group 1, the poorest) to the NE (group 5, the richest). As 'poor' and 'rich' are not variables, this hints at the type of evaluation that needs to be done.

Clicking on countries in the ordination plot will quickly give you an idea of what the overall structure is. It is efficient to identify the outliers, like Yemen in the SE corner of the plot above. As can be seen from the group colour, Yemen is a single member group. Note: When you click on an object (a country here) in the ordination plot, the object is also highlighted in the data table, and vice versa.

An important point about outliers is that they have a strong influence on ordinations. In this respect ordination behaves like regression, as ordination is based on regression. There is therefore a strong argument to eliminate serious outliers as they will influence the ordination well beyond their single-object status.
So, I'll make Yemen extrinsic and go again on the analysis. Highlight Yemen in the Data Table by clicking on its label (left mouse button) and then either right-click on the label and select Make Extrinsic, or click on the make extrinsic button on the PATN toolbar. This will place Yemen out of the analysis but keep it available for any interpretation we may wish to perform.
No change in parameters for this re-analysis. This time, the SSH stress is down to 0.0632. We should be more than happy about this given the number of objects and the complexity of the dataset.

Now we can get to work on the evaluation proper. Select the Evaluation button and run all evaluation options (ANOSIM, Box and Whisker, PCC and MCAO) on all variables; intrinsics and extrinsics.
First, let's look at how all the variables (intrinsic and extrinsic) are distributed across the new 6 groups (extrinsic variables are marked with an "x")-
| Variable | KW Value | Extrinsic? |
| Life Exp | 135.0 | |
| Electricity/Capita | 129.4 | |
| Literacy | 122.8 | |
| CO2/cap | 119.4 | |
| People/Doctor | 115.4 | |
| Univ% | 110.8 | x |
| GNP/Cap | 109.2 | |
| Cars /1000 | 107.2 | x |
| Deaths / 1000 | 97.3 | |
| Coal/Capita | 60.3 | x |
| Deforestation Rate | 35.9 | |
| Ed /GNP% | 24.8 | x |
| Mil%GNP | 23.7 | x |
| Prot Land% | 17.0 | x |
| Medals | 13.7 | x |
| Population Density | 10.9 | x |
| Arable land/1K | 10.6 | x |
| Pop | 6.9 | x |
Things to note:
1. Most of the intrinsic values are effective discriminators across the 6 groups, except maybe 'Deforestation rate'. This is strongly backed up by the box and whisker plots. You can judge the effectiveness of the box and whisker plots by their ability to create an effective decision tree that could be used to discriminate the objects between groups. In this case, it is easy to build up such a decision tree starting from the best discriminating variable first, then second best, down to (if necessary) 'deaths /1000' .
2. 'Deforestation rate' (an intrinsic) doesn't seem to be that useful as a group discriminator. We could cull it as an extrinsic and go back and re-analyse, but that is probably not necessary as the stress is so good, and the classification also makes a lot of sense.
3. Some extrinsic variables show good discrimination. This implies that although some extrinsic variables had an uncomfortable number of missing values, the values that are there do seem to display a signal that aligns well with the classification. I'd include cars/1000 and Univ% in this category. It means that we can use these two variables to help interpret the groups and to correlate with the intrinsic variables.
ANOSIM will tell us how effective (different) the 6 groups are on the basis of the 'within-group' and 'between-group' values of the Gower metric. ANOSIM is a type of F-test using association rather than variables. Click the 'A-button' on the PATN toolbar and you will see this window-

This tells us that none of the 100 randomised solutions (swapping objects between the 6 groups) is better than the grouping that PATN generated. With a standard analysis in PATN, this is far from surprising. A value > 5% would suggest a poor classification which may indicate a lot of noise or poor variables or poor sampling.
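The idea behind this randomisation test can be sketched in a few lines. The code below assumes the standard rank-based ANOSIM R statistic (mean between-group rank minus mean within-group rank, divided by half the number of pairs) and reshuffles the group labels to build the random solutions. PATN's own implementation may differ in detail, and the function names are hypothetical.

```python
import random

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in range(pos, end + 1):
            ranks[order[k]] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks

def anosim_r(assoc, labels):
    """R near 1: groups well separated; R near 0: no better than random."""
    n = len(labels)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    ranks = average_ranks([assoc[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if labels[i] == labels[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if labels[i] != labels[j]]
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return (mean_between - mean_within) / (len(pairs) / 2)

def anosim(assoc, labels, n_random=100, seed=0):
    """Return (observed R, % of random relabellings with a better R)."""
    rng = random.Random(seed)
    observed = anosim_r(assoc, labels)
    shuffled = list(labels)
    better = 0
    for _ in range(n_random):
        rng.shuffle(shuffled)
        if anosim_r(assoc, shuffled) > observed:
            better += 1
    return observed, 100.0 * better / n_random

# Toy example: two tight, well-separated groups on a line
points = [0, 1, 2, 20, 21, 22]
toy = [[abs(a - b) for b in points] for a in points]
r_value, percent_better = anosim(toy, [1, 1, 1, 2, 2, 2])
print(r_value, percent_better)
```

A percentage of zero corresponds to the situation described above, where none of the randomised solutions beats the classification.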
As we have not yet run an analysis on the variables, ANOSIM on variables is not available.
PATN's PCC routine uses multiple linear regression to fit each selected variable into the ordination space (1, 2 or 3-dimensions). The result is a set of coordinates that represents the tip of the vector of the variable. SSH centres the coordinates of the objects so this vector represents the best fit direction of the variable. An r-squared value provides some estimate on how good the fit was. For example, an r-squared value of 0.7 means that 70% of the variation of the variable is accounted for by the vector.
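A bare-bones version of this fit looks like the sketch below: regress one variable on the ordination coordinates (plus an intercept), scale the slope coefficients to unit length to get the vector tip, and report r-squared. Scaling to unit length and skipping objects with missing values are my assumptions about how the PCC table is produced, and the function names are hypothetical.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def pcc(coords, values):
    """coords: ordination coordinates per object; values: one variable
    (None = missing). Returns (unit direction vector, r-squared)."""
    rows = [(c, v) for c, v in zip(coords, values) if v is not None]
    x = [[1.0] + list(c) for c, _ in rows]
    y = [v for _, v in rows]
    p = len(x[0])
    # Normal equations for ordinary least squares: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * v for r, v in zip(x, y)) for i in range(p)]
    coef = solve(xtx, xty)
    fitted = [sum(b * xi for b, xi in zip(coef, r)) for r in x]
    mean_y = sum(y) / len(y)
    sse = sum((v - f) ** 2 for v, f in zip(y, fitted))
    sst = sum((v - mean_y) ** 2 for v in y)
    length = math.sqrt(sum(b * b for b in coef[1:]))
    direction = [b / length for b in coef[1:]]
    return direction, 1 - sse / sst

# Toy example: the variable increases only along the first axis
toy_coords = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1),
              (1, 1, 0), (1, 0, 1), (0, 1, 1)]
toy_values = [5, 7, 5, 5, 7, 7, None]
direction, r_squared = pcc(toy_coords, toy_values)
print([round(d, 2) for d in direction], round(r_squared, 2))
```

The sketch does not guard against degenerate cases such as a constant variable or too few objects.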
Pressing the PCC button on the PATN toolbar will highlight the PCC TAB in the Data Table.

Alternatively, you could export the PCC values from the Data Menu. If we do that and then sort on the r-squared values we get the following table. I've added an extra column to designate the extrinsic variables (marked with an "x")
How does the 'utility' of the variables as measured by the r-squared value differ from those highlighted by the Kruskal-Wallis value and why? First, why? Remember that the KW values are based on the 6 groups while the r-squared value is based on the coordinates of the countries in the ordination plot. KW is 'clumped' and r-squared isn't.
I have tabulated both the KW and PCC values below and included a difference in the ranks as the last column. For example, the difference in the rank of Deaths/1000 for KW and r-squared is 7, a substantial difference.
| Variable | X | Y | Z | rSquared | Extrinsic? | KW | Rank Difference |
| Life Exp | 0.12 | -0.40 | -0.91 | 0.90 | | 135.0 | 0 |
| Deaths/1000 | -0.12 | 0.81 | 0.57 | 0.85 | | 97.3 | 7 |
| People/Doctor | -0.97 | 0.16 | 0.15 | 0.72 | | 115.4 | 2 |
| GNP/Cap | -0.31 | 0.70 | -0.64 | 0.58 | | 109.2 | 3 |
| Electricity/Capita | -0.24 | 0.68 | -0.70 | 0.58 | | 129.4 | 3 |
| Medals | -0.21 | 0.91 | -0.35 | 0.35 | x | 13.7 | 9 |
| Coal/Capita | -0.18 | 0.94 | -0.30 | 0.18 | | 60.3 | 3 |
| Deforestation Rate | -0.78 | -0.42 | 0.46 | 0.07 | | 35.9 | 3 |
| Prot Land% | -0.42 | 0.91 | -0.02 | 0.05 | x | 17.0 | 5 |
| Mil%GNP | -0.77 | 0.59 | -0.22 | 0.05 | x | 23.7 | 3 |
| CO2/cap | 0.69 | -0.67 | -0.30 | 0.04 | x | 119.4 | 7 |
| Univ% | -0.24 | 0.96 | 0.14 | 0.04 | x | 110.8 | 5 |
| Population Density | 0.18 | -0.78 | -0.60 | 0.03 | x | 10.9 | 3 |
| Cars /1000 | -0.79 | 0.54 | -0.30 | 0.03 | x | 107.2 | 6 |
| Pop | -0.45 | -0.76 | -0.46 | 0.01 | x | 6.9 | 3 |
| Ed /GNP% | -0.78 | 0.50 | -0.37 | 0.01 | x | 24.8 | 4 |
| Arable land/1K | -0.23 | 0.93 | 0.30 | 0.00 | x | 10.6 | 0 |
| Literacy | 0.82 | 0.09 | 0.57 | 0.00 | | 122.8 | 15 |
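The rank difference column can be reproduced with a few lines of Python. The sketch below takes the variables in the order they appear in the table (which is sorted by r-squared) together with their KW values, ranks them by KW, and takes the absolute difference. Because several r-squared values are tied after rounding, I break ties by table order, which is an assumption; a tied variable may therefore differ by one from the published column.

```python
# (variable, KW value), listed in the table's r-squared order
table = [
    ("Life Exp", 135.0), ("Deaths/1000", 97.3), ("People/Doctor", 115.4),
    ("GNP/Cap", 109.2), ("Electricity/Capita", 129.4), ("Medals", 13.7),
    ("Coal/Capita", 60.3), ("Deforestation Rate", 35.9), ("Prot Land%", 17.0),
    ("Mil%GNP", 23.7), ("CO2/cap", 119.4), ("Univ%", 110.8),
    ("Population Density", 10.9), ("Cars /1000", 107.2), ("Pop", 6.9),
    ("Ed /GNP%", 24.8), ("Arable land/1K", 10.6), ("Literacy", 122.8),
]

# Rank 1 = highest r-squared (table order) and highest KW respectively
r2_rank = {name: i + 1 for i, (name, _) in enumerate(table)}
by_kw = sorted(table, key=lambda row: row[1], reverse=True)
kw_rank = {name: i + 1 for i, (name, _) in enumerate(by_kw)}

rank_diff = {name: abs(kw_rank[name] - r2_rank[name]) for name, _ in table}

for name, _ in table:
    print(f"{name:20s} KW rank {kw_rank[name]:2d}  "
          f"r2 rank {r2_rank[name]:2d}  diff {rank_diff[name]:2d}")
```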
Literacy, with a difference of 15, has the 3rd highest KW value and the equal lowest r-squared value. What does this imply? It could imply problems in the ordination. There may be a few problems (which we will discuss later) but overall, this is unlikely. Looking at the Box and Whisker plot for Literacy, you can see why it has a high KW value; the only overlap between 1st and 3rd quartiles is between groups 2 and 6.

The order in the B&W plots is group 1(lowest)-3-2-6-4-5 (highest). You can order the group centroids in the ordination plot so they almost form a straight line from group 1 (left)-3-2-4-6-5 (right). The only difference in order is a swap between groups 4 and 6. One would think a vector for Literacy could therefore go from lower left to upper right, but this isn't the case.

Instead, it goes at right angles to the main distribution, which accounts for the low r-squared value.

The conclusion is that the direction of variation within groups is driving the positioning of the vector. More could be said about differences but I'll leave that to you and for brevity, move on.
Looking at the overall distribution of countries in the ordination (above), it makes a lot of sense. The 'poorest' countries (dark blue) are down one end while the 'richest' countries (yellow) are up the other end, and those in between from an economic perspective, are in between in the plot.
The orange group (group 6) is interesting. They seem to be the oil states. Some are close to the rich group 5 and some are closer to group 4. Exporting the group compositions and sorting on group gives us the following (the ID is the country sequence number from the Data Table: handy if we want to re-sort by Data Table order). Note that, in the SSH plot, these countries are rather scattered. Does this suggest that they have been more difficult for the ordination than the classification?
| Label | ID | Group |
| Bahrain | 12 | 6 |
| Brunei | 24 | 6 |
| Kuwait | 84 | 6 |
| Libya | 91 | 6 |
| Oman | 117 | 6 |
| Qatar | 126 | 6 |
| Saudi Arabia | 133 | 6 |
| United Arab Emirates | 166 | 6 |
Let's display all the major determining 'orientations' in the ordination plot from the PCC and see what we can learn.

'GNP/Cap' and 'Electricity/Cap' are co-linear suggesting that these variables are highly correlated. These two variables align with Group 5 (see below), the most affluent countries. The opposite direction, as expected identifies the 'poorest' countries (Group 1). 'Deaths per 1000' and 'People per doctor' may at first appear co-linear, but are not as closely related. These relationships make logical sense, but we could run a classification of variables to quantify this relationship (all variables should be normalised before doing this - and don't forget to save the normalised dataset to another PTN file).
The orientation of 'Deaths per 1000' aligns as expected with Group 2, with Botswana right at the tip. 'Life expectancy' should align roughly with group 5 (positive relationship - see the box and whisker plot), but appears oriented to align with the negatively related Group 2.
While extrinsic due to the large number of missing values, I've elected to display the 'Medals' variable for two reasons - I'm interested in it, and the r-squared value is acceptable at 0.345 (a correlation of about 0.59). This means that this vector is accounting for about 35% of the variation of the non-missing values. Not bad. As expected, it aligns positively with the most affluent countries in Group 5, and inversely with Group 1, and is near co-linear with 'GNP/Cap' and 'Electricity/Cap'. This makes sense. None of the other variables have a significant enough r-squared value to bother with.
Qatar is an outlier. It was also an outlier on the previous ordination, but I decided to leave it in and see what happens. Maybe I should have eliminated it! To get an idea of what problems SSH had, impose the Minimum Spanning Tree (MST: use the right mouse button in the ordination plot window and select Display MST or simply press "M" on the keyboard)-

Here Qatar is annotated (by clicking the left mouse button on it) in the ordination plot. Note that Qatar is connected by the MST to another group 6 member; it is the United Arab Emirates. Makes eminent sense. This shows us that MST is a powerful tool to identify countries that SSH had problems placing. The classification probably got it right, but SSH may need 4 dimensions to place Qatar 'correctly'. Note that all other group 6 members are 'well-connected' by the MST (and so are most other groups!).
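If you want to experiment with this idea outside PATN, the sketch below builds a minimum spanning tree with Prim's algorithm and then lists the MST links that look longest in the ordination. I am assuming the tree is built from the association matrix and then drawn over the ordination coordinates, so a long link in the plot flags an object the ordination struggled to place. The function names and toy data are hypothetical.

```python
import math

def minimum_spanning_tree(assoc):
    """Prim's algorithm on a symmetric association (dissimilarity) matrix.

    Returns the tree as a list of (i, j) links between object indices.
    """
    n = len(assoc)
    in_tree = [0]
    links = []
    while len(in_tree) < n:
        # Cheapest link from any object in the tree to any object outside it
        i, j = min(((i, j) for i in in_tree for j in range(n)
                    if j not in in_tree),
                   key=lambda link: assoc[link[0]][link[1]])
        links.append((i, j))
        in_tree.append(j)
    return links

def longest_links(assoc, coords, top=3):
    """MST links sorted by their length in the ordination, longest first."""
    links = [(math.dist(coords[i], coords[j]), i, j)
             for i, j in minimum_spanning_tree(assoc)]
    return sorted(links, reverse=True)[:top]

# Toy example: object 3 is closest to object 2 in the association matrix,
# but has been placed far from it in the 2-dimensional "ordination".
positions = [0, 1, 3, 7]
toy_assoc = [[abs(a - b) for b in positions] for a in positions]
toy_coords = [(0, 0), (1, 0), (3, 0), (0, 5)]
tree = minimum_spanning_tree(toy_assoc)
worst = longest_links(toy_assoc, toy_coords, top=1)[0]
print(tree, worst)
```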
| Group 1: Mean (closest country to the centroid) - Chad, Extreme - Niger, Marginals - Bhutan (group 3), Mali (group 3) |
| lowest electricity usage/capita |
| lowest CO2/capita |
| lowest GNP/capita |
| lowest literacy |
| lowest university % |
| lowest cars/1000 |
| 2nd lowest life expectancy |
| 2nd highest deaths/1000 |
| highest number of people/doctor |
| highest deforestation |
| Group 2: Mean - Swaziland, Extreme - Botswana, Marginals - Tanzania (group 3), Rwanda (group 1) |
| lowest life expectancy |
| low CO2/capita |
| low electricity usage/capita |
| 2nd lowest university % |
| 2nd lowest GNP/capita |
| 2nd highest people/doctor |
| high literacy |
| high deforestation |
| highest deaths/1000 |
| Group 3: Mean - Papua New Guinea, Extreme - Cape Verde, Marginals - Togo (group 1), Republic of Congo (group 2), Seychelles (group 4) |
| 2nd lowest CO2/capita |
| low electricity usage/capita |
| intermediate literacy |
| low university % |
| low GNP/capita |
| low cars/1000 |
| intermediate life expectancy |
| intermediate deaths/1000 |
| Group 4: Mean - Suriname, Extreme - none, Marginals - Italy (group 5), Nicaragua (group 3), Jordan (group 6) |
| low CO2/capita |
| low electricity usage/capita |
| low GNP/capita |
| low cars/1000 |
| 2nd lowest deaths/1000 |
| 2nd lowest people/doctor |
| intermediate university % |
| high life expectancy |
| 2nd highest literacy |
| Group 5: Mean - Finland, Extreme - Norway, Marginals - Israel (group 4) |
| lowest people/doctor |
| intermediate deaths/1000 |
| 2nd highest CO2/capita |
| highest life expectancy |
| highest electricity/capita |
| highest coal/capita |
| highest literacy |
| highest university % |
| highest GNP/capita |
| highest cars/1000 |
| Group 6: Mean - Bahrain, Extreme - Qatar, Marginals - Saudi Arabia (group 4), Oman (group 3) |
| lowest coal/capita |
| low people/doctor |
| 2nd highest life expectancy |
| 2nd highest electricity usage/capita |
| 2nd highest CO2/capita |
| 2nd highest GNP/capita |
| 2nd highest cars/1000 |
If we were to reduce the number of groups, how would our 6 groups merge? To see what happens, display the dendrogram and use the right mouse button to display it at a group level like so-

PATN labels the groups by the first (in terms of sequence number in the data Table) country in the group. Why? PATN always generates the dendrogram in a way that has group 1 at the top, group 2 next down, then group 3 ... group k. Therefore you always know that the kth group is k from the top. For example, Australia is the 'label' for group 5 because it is the first country in group 5 because the Data Table is in alphabetic order. Therefore, PATN uses a label to help you to quickly identify what that group may be. That's my logic!
If we want to simplify the groups (reduce to 5, 4 or 3 groups) we would in order-
Join groups 1 & 2 (giving 5 groups)
Join groups 5 & 6 (giving 4 groups)
Join group 3 with 1 & 2 (giving 3 groups)
Join group 4 with 5 & 6 (giving 2 groups)
More groups could be defined by re-running the classification and asking for whatever number of groups you want. Re-analysis is so easy in PATN, there is little need for a dendrogram slicing routine such as the old GDEF in DOS PATN to re-define the number of groups.
We could analyse any or all of the original 18 variables. For brevity, I'll analyse the 8 intrinsic variables we have at this point. More of the 18 variables could be used with the usual caveats. The level of missing data is less of an issue for the variables as there are up to 176 values (countries) per variable for the current dataset. The only variable you may think twice about is 'Medals' with 99 missing values out of the 176.
In PATN, the variables could be analysed in parallel with the countries, but the values of the variables are so disparate, some form of standardisation is required first. The countries were analysed with the Gower Metric, which has in-built range standardisation. This approach enabled each variable to contribute equally to country differences, and also enable the evaluations to display raw data values.
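For reference, here is a minimal sketch of the Gower metric as described above: a Manhattan distance in which each variable's difference is divided by that variable's range. Averaging over only those variables where both objects have a value is my assumption about how missing data is handled, and the function names are hypothetical.

```python
def column_ranges(rows):
    """Range (max - min) of each variable, ignoring missing values (None)."""
    ranges = []
    for j in range(len(rows[0])):
        present = [row[j] for row in rows if row[j] is not None]
        ranges.append(max(present) - min(present))
    return ranges

def gower(a, b, ranges):
    """Range-standardised Manhattan distance between two objects.

    Variables missing in either object (or with zero range) are skipped.
    Returns None when no variable can be compared.
    """
    total, used = 0.0, 0
    for x, y, r in zip(a, b, ranges):
        if x is None or y is None or r == 0:
            continue
        total += abs(x - y) / r
        used += 1
    return total / used if used else None

# Toy example: two variables on very different scales (made-up numbers)
rows = [[0, 10], [1, 30], [2, 50], [None, 20]]
ranges = column_ranges(rows)
print(gower(rows[0], rows[1], ranges), gower(rows[3], rows[1], ranges))
```

Each variable contributes between 0 and 1 to the total, which is why variables with huge raw values do not swamp the others.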
What option should be used? Take the simplest option: range standardisation.
First, save the current project by pressing CTRL S, then do a Save As either from the File menu or using ALT F A and select a different project name. I'm up to 'Medals 5' at this stage as I have saved all of the intermediate datasets.
Next, press the transformation/standardisation button and select range standardisation on all of the intrinsic variables (columns)-

Press 'Run' and you will end up with all the intrinsic variables ranging from zero to one inclusive. This will enable us to compare the variables because they are now on the same scale. Take a look at the Visible Stats and you will see that this is so.
We may as well run both the countries (use the same parameters as before) and the variables analysis. This will enable us to swap back and forth if needed. The standardisation will have no effect on the analysis of countries, as the previous use of the Gower Metric effectively range standardised the data anyway. So why didn't we do this from the start? Simple, I prefer to see the raw values on the evaluation, not standardised values.
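The claim that range standardisation leaves the country analysis unchanged is easy to check on a toy table. The sketch below range standardises each column to run from zero to one and then confirms that a Gower-style distance (range-standardised Manhattan, averaged over variables) is the same before and after. The helper names and numbers are hypothetical, and missing values are left out to keep it short.

```python
def range_standardise(column):
    """Rescale a column so its minimum becomes 0 and its maximum 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def gower(a, b, ranges):
    """Mean range-standardised absolute difference between two rows."""
    return sum(abs(x - y) / r for x, y, r in zip(a, b, ranges)) / len(ranges)

def ranges_of(rows):
    """Range (max - min) of each column of a table given as rows."""
    return [max(col) - min(col) for col in zip(*rows)]

# Toy table: 3 objects by 2 variables on very different scales
raw = [[1, 100], [3, 300], [5, 200]]
std_columns = [range_standardise(list(col)) for col in zip(*raw)]
std = [list(row) for row in zip(*std_columns)]

before = gower(raw[0], raw[1], ranges_of(raw))
after = gower(std[0], std[1], ranges_of(std))
print(std, before, after)
```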
Select Gower Metric for the variables, flexible UPGMA with beta=-0.1 and choose 3 groups; more than enough for 8 variables. Run this. With only 8 'objects' (the variables), the main interest this time is the association matrix (press the association matrix button)-
| | GNP/Cap | Life Exp | CO2/Cap | Literacy | Deforestation Rate | Deaths/1000 | People/doctor |
| Life Exp | 0.62 | | | | | | |
| CO2/Cap | 0.10 | 0.71 | | | | | |
| Literacy | 0.76 | 0.22 | 0.84 | | | | |
| Deforestation Rate | 0.57 | 0.34 | 0.63 | 0.38 | | | |
| Deaths/1000 | 0.34 | 0.61 | 0.32 | 0.64 | 0.35 | | |
| People/doctor | 0.17 | 0.73 | 0.09 | 0.86 | 0.63 | 0.31 | |
| Electricity/Cap | 0.07 | 0.65 | 0.06 | 0.78 | 0.59 | 0.32 | 0.14 |
and the dendrogram-

The three 'per capita' variables are closely related as expected. These three variables are also closely related to 'People per doctor'. In other words, affluence (high GNP, CO2 and Electricity use) equates roughly with a low number of people per doctor. Interestingly 'Life expectancy' and 'Literacy' are related, but not quite as closely (from the association matrix, we see a value of 0.22 which is a fair relationship). Deforestation rate is related to Deaths per 1000 but again, not highly (0.35).
I hope that this example has been useful.
Lee Belbin
PATN was developed by Lee Belbin
and CSIRO and subsequently by Lee Belbin (Blatant Fabrications Pty Ltd, ABN: 96
106 672 379) with V1 coding by students at Griffith University
(Queensland). Issues relating to this web site
should be directed to
patninfo@patn.com.au.
PATN and this Web site are Copyright © 2004
Blatant Fabrications Pty Ltd. All rights reserved.