|
Ecological DatasetDataset: Use_Marine_Data.xlsThis case-study is typical of those seen in community ecology; species abundances measured at a set of sites and with associated 'environmental' data. The dataset was collected to test a specific hypothesis - that run-off from a rubbish site near Australia's Casey Station in Antarctica influenced marine benthic species composition and abundance. The study work was part of a PhD thesis by Dr Jonny Stark of the Australian Antarctic Division, Hobart, Tasmania (see references below). The DataThe dataset contains species abundances, physical and chemical environmental variables. The design of the experiment was to sample marine benthic species and associated environmental properties at sites close to, and far away from the influence of a rubbish site. While it would have been nice to have been able to setup a BACI design (before-after-control-impact), no systematic samples were ever taken of the variables before the establishment of the rubbish site. Therefore, a ‘control’ site was selected at a location that was hopefully out of the area of potential influence of the rubbish tip, but with (also hopefully) other superficially similar physical properties to the putatively impacted sites. The sites marked as ‘BB’ are from Brown’s Bay near to the rubbish site (putatively impacted). Those marked “OB”, the control sites, are from O’Brian’s Bay, the next bay to the south of Brown’s Bay. In each area, the samples are from two locations. In the case of Brown’s Bay, closer to, and further from the shore (the rubbish site). Further details of the survey can be obtained from the references. There are 16 sites from Brown Bay (2 lots of 8) and 16 from O'Brien Bay (also 2 lots of 8). There are 78 variables, that I have broken into 5 classes; biota, biotic summary statistics, chemistry, sediment grainsize and one a-priori variable (location). For 'location', I have simply labelled each of the four areas 1, 2, 3 and 4. I have done this as an 'a-priori' grouping. Think of it as a type of extrinsic variable; one that will not be involved in the analysis but may be handy in the evaluation. Variables
ReferencesAustin, M.P. & Belbin, L.(1982). A new approach to the species classification problem in floristic analysis. Australian Journal of Ecology 7, 75-89. Belbin, L., Faith, D.P. & Minchin, P.R.(1984). Some Computer Algorithms Contained in the Numerical Taxonomy Package NTP. CSIRO Division of Land Use Research Technical Memorandum 84/23. October 1984, Canberra. 31p. Belbin, L. (1991). Semi-strong hybrid scaling, a new ordination algorithm. Journal of Vegetation Science, 2, 491-496. Belbin, L., Faith, D.P. & Milligan, G.M. (1992). A comparison of two approaches to beta flexible clustering. Multivariate Behavioural Research, 27, 417-433. Clarke, K. R. (1993). Non-parametric multivariate analysis of changes in community structure. Australian Journal of Ecology 18, 117-143. Stark J. S. (1998) Heavy metal pollution and macrobenthic assemblages in soft-sediments in two Sydney estuaries. Aust. J. Mar. Freshwat. Res. 49, 533–40. Stark J. S. (2000) The distribution and abundance of softsediment macrobenthos around Casey Station, East Antarctica. Polar Biol. 23, 840–50. Stark, J. S. (2000). The distribution and abundance of soft-sediment macrobenthos around Casey Station, East Antarctica. Polar Biology 23, 840–850. Stark J. S. (2001) Human impacts and assemblages in marine soft-sediments at Casey Station, Antarctica. PhD Thesis, University of New England, Armidale, NSW. Stark J. S., Riddle M. J., Snape I. & Scouller R. C. (in press) Human impacts in Antarctic marine soft-sediment assemblages: Correlations between multivariate biological patterns and environmental variables. Est. Coast. Shelf Science. The GoalTo see if PATN can be used to understand the nature of the variation of marine benthic communities and to identify what if any environmental properties seem to be controlling the distribution of species. Do we have a difference in species communities from putatively impacted to ‘non-impacted’ similar sites? We will see. The AnalysisStep 1: Import the dataThe original data is in a Microsoft Excel™ file with the samples forming the rows of the data matrix. Here is what a component of the original data looks like in Excel™-
In this dataset, all the biotic variables happen to be on the left of the Excel file and all the 'environmental' variables are on the right-hand side. To PATN, it would not matter at all what order the variables or the samples were in; the results would remain the same. The order of the variables has no influence on the analysis. So, to PATN. We start PATN and select Import from the '"What would you like to do?" opening screen. As there are in this case, a number of worksheets in the original Excel file (but not the one on the Web site), PATN will ask which worksheet is to be imported. In this case, I select one called 'PATN', that has both the biotic and environmental variables together in the one worksheet. PATN will scan through the worksheet on entry, checking values for setting up defaults, and creating the 'visible stats' values to the right and to the bottom of the Data Table. PATN will also display the record number during the import and then, on completion, display the total number of rows and columns in the Data Table. PATN has automatically picked up the first row as the column labels and the first column as the row labels. Here is the resulting Data Table. As you can see below, PATN is reporting 32 rows (samples) and 78 variables. Note that the visible stats are currently displaying (by default with PATN V3.1), the minimum and maximum values and the number of values > 0. Because we have a mixture of intrinsic (the variable that will be used in the analysis phase) and extrinsic (the variables that will be used only as an aid to evaluating the analysis), the row stats will be 'messy'. There are some environmental variables with very high values while many of the abundances are zero. The row statistics contain biotic and environmental values. The column stats are however very useful.
Step 2: Identifying Extrinsic VariablesThe next step is to identify to PATN, those variables that we only want involved in the interpretation of the analysis; the extrinsics. In this case, all the extrinsic variables are on the right-hand side of the Data table so it makes it easier. Click the left mouse button on the column marked 'individuals', scroll across to the right-hand side of the Data Table using the scroll bar, hold down the SHIFT-key and click the column, marked 'Location'. All the extrinsic variables are now highlighted. Next click the 'Extrinsic' button
'Intrinsic stats', 'Groups' and 'Ordination'. The intrinsic stats tab applied only the intrinsic variables. 'Groups' will list the group number for each of the rows. As now, these numbers will be 'missing' as no groups have been created as yet. If there were a-priori groups, they would also be listed on this tab. The ordination tab is used to display the ordination coordinates. As with the groups, they are currently set to missing as an ordination has not yet been created. If you now click the Intrinsic Stats TAB, you will get a summary of only the
intrinsic (biotic) variables. We can now see that sample BB2S1P13 ranges from 0 to 128 individual counts of species and that there are 15 non-zero values. Scanning down the sites, we can see that there is a site with a maximum of 6 and one with a maximum of 200. Quite a range, but expected for this type of data. Is it ok to leave the coding as is? We will address that below. Step 3: Save the Data as a PATN Project FileAt the moment, the imported data, with our newly created extrinsic variables has not been saved. If we were to try and exit PATN, it would warn you to save the file. But, it is a good idea to save the data at regular intervals so that you can, if required, back-up and take a different analysis route. Now is a good time to save the data. Just press CTRL-S or select File | Save, or File | Save As and call it Casey-1; our first version of the data. Step 4: Examine the Values of the Intrinsic VariablesExamining the intrinsic values is always a wise move, regardless of whether you think the data is just fine. If you have collected the data, it is hard to stand back and make sure that it is appropriate. It is far easier to do a few quick checks before you get into serious analysis than trying to 'back-peddle' when you find some strange outlier caused by a slip of the fingers, or a stray cosmic ray, or even some digits falling down the cracks between various bits of software. The intrinsic data is species abundance values. Typical ecological data. The way that you must start to think in Pattern Analysis, is "what is PATN going to do with these values?" This, of course takes a little practice, and this is why the use-case. The default 'visual stats' are minimum, maximum and number of values > 0. They are the defaults as these parameters are the most useful for data such as this. Scanning the columns, the minimum values should all be 0, and this seems to be the case here. The maximum values vary considerably, from 1 to 200. This is an issue that will need to be addressed. Why? PATN has already figured out that the Bray and Curtis association measure will be best for these data. How has it done this? PATN has examined the Data Table and noted that the number of zeros in the table suggests that the data seem to be 'ecological'; counts of species. This time, it got it right. Even if it wasn't species data, the conclusion as to an appropriate measure of association would likely have been the same. The Bray and Curtis association measure is effectively 'the sum of the differences between two objects (sites/samples in this case) over the sum of the sums of the two objects (see the PATN Help). Therefore, if nothing is done to recode the values, the species with the higher abundances will be more influential than those species with low abundances. Is this what is wanted? Usually the answer is "no". What should be done then? Step 5: Transforming the Abundance ValuesWe probably need to down-weight the species with high abundance values. This must, by the way, up-weight species with lower abundance values! Very few analysts seem to recognise this fact. OK, how should be down-weight & up-weight? The easiest method is to use some form of transformation. [A transformation in PATN means that a value in the Data Table can be changed without any need to reference other values]. Common choices of transformation include (in order of increasing intensity) square root, cube root and log. How intense? This is a hard question to answer. The more intense the transformation, the less influence abundant species will have and the more influence less abundant species will have. So, if you think there is some systematic signal in the less abundant species, up-weighting should do little harm.
Next, simply click the histogram button, just to the right of the Data Table button on the PATN button bar; the one that looks like a histogram. The histogram and the summary statistics will now be displayed. Sure enough, quite skewed. The histogram also confirms that the minimum value is 0 and the maximum value is 200. So, back to transformations. What is the square root, cube root and log of 200? Answer: ~14, ~6 and ~2 (base 10) respectively. If I have the opportunity, I usually asked the person who collected the data how they would discriminate between various values (in this case abundances). Most of the time, I found that around 5 values were sufficient to provide a level of discrimination that made sense ecologically. For example, 0 mapped to 0, 1-3 mapped to 1, 4-10 mapped to 2, 11-30 mapped to 3, 31-100 mapped to 4 and 101+ mapped to 5. Coming from an analysts perspective, 5 or so categories seems to reduce the noise with little loss of signal.
Click on the transformation/ standardization button on the top-right of the
PATN Toolbar,
That's it. All the abundance values will now be transformed such that abundant species will not swamp the analysis, and hopefully, the rarer, but meaningful species will play a more important role. Fingers crossed. Just to be safe, do a File | Save As, and save the transformed data to a new PATN Project file - call it something different to the unstandardized file. I'm going to call this project file "Casey_2". Remember: If you think you can avoid this transformation decision, you ARE making a decision anyway - you have chosen to have an analysis strongly influenced by species with higher abundances! Step 6: First AnalysisIt's time for a preliminary analysis. There could be some really noisy
species in this dataset, so we may need some refinement. That's what PATN is
designed for; interaction. Click on the analysis button on the PATN Toolbar.
Next, click in the "All Evaluation" box as this will automatically apply all the evaluation options (box and whisker plots, ANOSIM, PCC and MCAO) to all extrinsic variables. This will save us having to manually run the third phase (prepare, analyse and evaluate), although we can easily run any part of the process in seconds if we wish to. At this stage, we will not run any analysis of the variables (the columns of the Data Table). Again, we can do it later if we want to.
6.1 AssociationAssociation is the first an most important step in an analysis. The subsequent classification, ordination or network will depend totally on the measure of association chosen. Get is wrong here and recovery of real patterns may be an isue.
Why Bray and Curtis? This measure has been shown to best relate to what we may term 'ecological distance'. How do we know this? More on this below, but for the moment, we also know that even this measure tends to underestimate true distance. Simply put, once two objects (sites in this case) fail to have many or any species in common, it is difficult to estimate the true association. Bray and Curtis will go to a value of '1', when we know that some association values must exceed '1'. PATN will produce a lower-symmetric matrix of Bray and Curtis values ranging from 0-1 where 0 means that the two objects (sites here) are identical and 1 means that the two sites share no species in common and are maximally different. Bray and Curtis is therefore a distance-type measure, not like the (Pearson's Product Moment) correlation coefficient. All association measures in PATN are distance-type. What's a lower symmetric matrix? Think of a spreadsheet where the row and column are both labelled with our 32 sites. We can enter the value of the Bray and Curtis association between site 1 and site 1 (itself) as 0. Ditto with all self comparisons, so we really don't need to store those values as we know what the result will be. Comparing site 1 with site 2 will give the same result as comparing site 2 with site 1. The association is therefore symmetric (two-step is not totally but we won't go there for the moment). Therefore we also don't have to store one side of the comparisons. 'Lower symmetric' means we just store the lower-left part. You will see this structure if, after running the analysis, you click the association matrix button. 6.2 ClassificationNext, click on the classification tab and let's see what we may want to change from PATN's defaults.
Flexible UPGMA weights objects equally (Flexible WPGMA weights groups equally), and is the technique that also best correlates ultrametric distance with our association matrix. That is, if you were to measure the implied associations from the resulting dendrogram, those would most closely match the input association matrix. The beta value controls the distortion of the 'space' defined by the species; hopefully some form of 'ecological space'. Negative beta values 'dilate' the space (make objects and groups appear to move away from each other as fusion progresses, just like stars in our galaxy and galaxies in our universe). Positive values do the opposite, this is called 'space contracting'. A beta value of '0' is termed space conserving. I use a default beta value of -0.1 (slightly dilating) as it counteracts the underestimation of larger Bray and Curtis values. Lance and Williams (the inventors of the flexible formula which gave rise to the 'flexible' strategy) used to use Flexible WPGMA with a beta value of -0.25, but the beta values are not totally equivalent in terms of the degree of dilation/contraction. Lance and Williams didn't like dendrograms that were 'chained' (where a joins b, then c joins a and b, then d joins a, b, and c etc). While chaining does tend to make the dendrogram difficult to interpret (you often have a few large groups and all the rest are single- membered groups) , such as structure may be reflecting reality. A degree of chaining often reflects species partitioning environment gradients. If you decrease the value of beta toward -0.5, it will force each group in the dendrogram (regardless of where you slice it) to have an equal number of members. This is a rarity in reality, unless you know what is driving the variation and sampled carefully to partition it! So, I suggest you stick with beta = -0.1 until you know more than I do. The number of groups? This is a "how long is a piece of string" question. The dendrogram presents everything from n groups (where n=number of objects) to 1 group, but the structure of the dendrogram will give you some idea if 'natural' (read well-defined') groups exist, and maybe a hint on how many groups may be appropriate. From my experience, a quality dataset of 400 objects can lend itself to interpreting over 200 meaningful groups. But, try communicating 200 groups in a seminar or a journal paper - not likely! Many years ago, Bill Williams suggested that the a rule of thumb of the square root of the number of objects seemed about right. He was implying that information increased with the square root of the number of objects. This is the starting point in PATN. Personally, I like 5 for just about everything. Why? Most people can hold the concept of 5 different things in their mind concurrently. This means, if your going to communicate an outcome, 5 groups maybe a very effective number. I'm going to use 4 here as I know there are four sample locations. Simple. Think of classification as reducing the n rows by m columns in the Data Table to g (where g=number of groups) rows by m variables. 6.3 OrdinationThe term ordination is derived from the German word for axes. In Pattern Analysis, ordination refers to a suite of techniques that attempt to reduce the dimensionality of a dataset. The Casey data has 42 intrinsic variables (the species), that could be considered as axes (using abundance as the scale) - all at right angles to each other. Tough to visualize, but if we could view the 42-dimensional space. we would see a set of points representing the sites. Ordination attempts to squeeze the 42 dimensions down to a few that we can visualize; 2 or 3 if we are lucky. The measure of success is summarised by stress. In PATN, the ordination method is called semi-strong hybrid multidimensional scaling, or SSH for short. Multidimensional scaling (MDS) is probably the simplest, robust and most effective ordination technique currently around. What MDS does is to first ask you how many dimensions you want to try for. Same issue here as the number of groups, but you know that it will have to be 1, 2 or 3 in PATN, and 3 is the default. MDS then will randomly allocate some x, y and z coordinates to each site. It will then measure the Euclidean distance between each pair of sites. This will generate another lower-symmetric matrix. MDS then compares (using regression) the Bray and Curtis association values with their corresponding Euclidean distance values. As you may expect, the result will almost certainly show a poor relationship, but the MDS algorithm will know which way to move the sites in the Euclidean space to improve the fit. The points are moved and then the Euclidean distance is re-measured, followed by another regression. This process repeats (iterates) until any movement of the sites will decrease the fit. The degree of fit is the stress. SSH has two features that are in improvement over basic MDS. First, unlike all other MDS programs, SSH can mix linear (metric) regression with ordinal (non-metric) regression. From above, we know that association measures such as Bray and Curtis will be accurate for smaller association values but will underestimate larger association values. What this means is that smaller association values will be linearly related to true distance and, at best, larger association values will be ordinally related to true distance. Therefore SSH asks for a cut-point association value. Linear regression will be used on all association values below this value and ordinal regression will be used on association values above this. Simulation studies suggest that a value of around 0.9 produces good results. That's the 'hybrid' part of SSH. So what is 'semi-strong'? Most non-metric (ordinal regression) MDS programs will not break tied values of association. This means that 1's will remain 1's. We know however that many 1's are greater than 1, in some cases, much greater. If there are a lot of zero values in your Data Table, chances are that there will be lots of 1's in your Bray and Curtis association matrix. It is by the way, a good idea to view the histogram of association values, once the analysis has been run. Semi-strong ordinal regression will allow tied values of association to be broken. Doesn't this seem like a logical strategy?
What else is on the ordination tab? Number of random starts. Why? Well, SSH, like most Pattern Analysis algorithms (including classification) can rarely if ever guarantee to find an optimal solution. PATN's algorithms are heuristic. They use simple rules of thumb to try and find the best solution. By default, SSH tries 10 different random starts (randomly allocated coordinates) and picks the one that gives the best result. Ten is usually enough to be reasonably confidant that you have found the best, but never completely. If you want more, try more, but be warned, SSH run-time is proportional to the cube of the number of objects! 200 objects may be fine, but for 300, you make to take an extended coffee break. Think of the random starts like this: The 'solution surface' of SSH is like a landscape, where your looking for the absolute lowest point, the lowest stress value. Each random start is like a parachutist who is blindfolded and dropped from an airplane and given the instructions "Land and then proceed downhill till you can't go downwards any further. Stop and report your altitude". As you can see from this analogy, having 10 parachutists is a lot better than having one and relying on the fact that they have found the overall lowest point in the landscape! Random number seed? PATN has an inbuilt random number generator that is used to generate a sequence of random values. Entering the same seed value as a previous run will (hopefully) result in the same values being produced. Number of iterations? This is how many times SSH will move the points and do the regressions. Fifty is usually more than enough. If the movement is too small, or the stress goes up, SSH will stop at that iteration anyway.
Now you can press Run, and the Analysis Recipe window will be displayed. It lists the options selected for the analysis. Just run your eye over it an check that it is indeed what you want to do. Note that by selecting 'All Evaluations', PATN has automatically added in all the extrinsic variables to all standard evaluation modules (box and whisker, ANOSIM, PCC and MCAO). More on those later. At this point, you can save the recipe to a text file by pressing Save and/or just press OK and stand back! PATN will do all the analysis and evaluations, then it is just a matter of examining all the results at your leisure. The analysis and evaluations should only take a few seconds on the Casey dataset. SSH will report the stress automatically. As PATN uses the ordination as the basis of its display, this is one of the more important values to watch.
Stress!
SSH uses Kruskal's stress formula √ (Σ (Dout-Dest)2 / Σ Dout) where 'Dout' is the Euclidean distance as measured in the reduced dimensional space and 'Dest' is the estimate of that distance from the regression. Stress varies from 0 (perfect fit) to asymptotically approaching 1. Here is my take on stress-
|