Exercise 3
The aim of this interactive exercise is to illustrate how useful some statistical software packages can be when you have a large data set. Therefore as far as possible the calculations in this exercise are only carried out using Excel, SPSS and Minitab. This example is also used in the interactive exercise 4 from Chapter 7. Therefore you may wish to save your spread sheet with the data set to be used in this additional exercise. The exercise relates to 3.8, 6.3, 6.6, 6.7.and appendix b. In addition we introduce you to the reduced principal axis regression.
EXAMPLE W6.1. Gender and the girth of yew trees
As part of a larger investigation of seedling recruitment in Taxus baccata (yew) trees in the Wyre Forest, Shropshire, an undergraduate measured the height (m) and DBH (diameter at breast height, m) of all yew trees within the Wyre Forest. Yews are dioecious and found as separate male and female trees. Table W6.2. includes the full data set given in relation to the area within the forest indicated by the stand code, the gender of the tree, height (m) and DBH (m).
Table W6.2: DBH (m), height (m) and gender of Taxus baccata in the Wyre Forest, Shropshire, 2005.
Stand code |
DBH (m) of male trees |
Height (m) |
DBH (m) of female trees |
Height (m) |
|---|---|---|---|---|
8065 |
0.43 |
11.8 |
0.47 |
11.3 |
0.23 |
5.2 |
- |
- |
|
0.25 |
6.7 |
- |
- |
|
Total number of trees |
3 |
1 |
||
8066 |
0.72 |
13.2 |
0.42 |
4.7 |
0.51 |
14.8 |
0.35 |
9.3 |
|
0.68 |
12.6 |
0.34 |
5.2 |
|
0.59 |
14.5 |
0.66 |
10.4 |
|
0.49 |
9.7 |
0.40 |
8.5 |
|
0.57 |
9.9 |
0.33 |
9.0 |
|
0.38 |
7.6 |
0.37 |
7.9 |
|
0.79 |
7.4 |
0.66 |
8.4 |
|
0.72 |
8.0 |
0.77 |
8.8 |
|
0.61 |
6.4 |
0.26 |
8.9 |
|
0.79 |
9.3 |
0.41 |
8.2 |
|
0.82 |
10.0 |
0.43 |
8.3 |
|
0.48 |
8.9 |
0.15 |
1.8 |
|
0.48 |
14 |
0.30 |
7.9 |
|
0.50 |
10.4 |
0.54 |
15.0 |
|
0.53 |
10.2 |
0.39 |
9.8 |
|
0.45 |
8.5 |
0.63 |
9.7 |
|
0.72 |
9.3 |
0.37 |
9.7 |
|
0.60 |
11.6 |
0.64 |
9.9 |
|
0.64 |
13.5 |
0.64 |
9.0 |
|
0.69 |
6.6 |
0.53 |
9.0 |
|
0.20 |
7.4 |
- |
- |
|
0.45 |
10.0 |
- |
- |
|
0.41 |
9.7 |
- |
- |
|
Total number of trees |
24 |
21 |
||
8067 |
0.55 |
16.8 |
0.73 |
13.5 |
0.58 |
13.0 |
0.83 |
16.0 |
|
0.39 |
17.0 |
0.39 |
6.5 |
|
0.41 |
8.8 |
0.29 |
8.6 |
|
0.30 |
5.7 |
0.95 |
9.7 |
|
0.45 |
7.5 |
0.24 |
4.7 |
|
0.46 |
5.8 |
0.53 |
7.2 |
|
0.32 |
4.8 |
0.69 |
8.4 |
|
0.35 |
7.0 |
0.49 |
7.7 |
|
0.32 |
6.7 |
0.35 |
8.2 |
|
0.37 |
11.2 |
0.28 |
8.8 |
|
- |
- |
0.42 |
9.8 |
|
Total number of trees |
11 |
12 |
||
8068 |
0.72 |
13.4 |
0.90 |
15.5 |
0.49 |
8.5 |
0.67 |
8.2 |
|
0.45 |
8.5 |
0.74 |
12.1 |
|
0.49 |
16.5 |
0.73 |
11.0 |
|
0.52 |
11.2 |
0.58 |
8.7 |
|
0.51 |
9.6 |
0.96 |
14.4 |
|
0.56 |
12.2 |
0.90 |
11.6 |
|
0.22 |
8.5 |
0.28 |
11.6 |
|
0.35 |
10.0 |
0.60 |
11.8 |
|
0.36 |
11.8 |
0.52 |
12.0 |
|
0.48 |
10.4 |
0.41 |
9.8 |
|
0.31 |
10.0 |
0.57 |
14.5 |
|
0.38 |
12.5 |
0.58 |
13.5 |
|
0.26 |
10.6 |
0.31 |
10.1 |
|
0.41 |
10.7 |
0.62 |
10.5 |
|
0.33 |
11.5 |
0.27 |
7.3 |
|
0.17 |
6.3 |
- |
- |
|
0.30 |
13.5 |
- |
- |
|
0.48 |
7.0 |
- |
- |
|
0.67 |
15.0 |
- |
- |
|
0.56 |
13.9 |
- |
- |
|
0.64 |
9.6 |
- |
- |
|
0.40 |
10.5 |
- |
- |
|
0.35 |
11.5 |
- |
- |
|
0.29 |
7.0 |
- |
- |
|
0.54 |
12.4 |
- |
- |
|
0.36 |
11.0 |
|||
Total number of trees |
27 |
16 |
||
8069 |
1.02 |
13.0 |
0.21 |
6.0 |
0.55 |
6.5 |
0.41 |
6.0 |
|
0.28 |
5.5 |
0.55 |
6.9 |
|
0.46 |
8.5 |
0.62 |
10.1 |
|
0.53 |
7.6 |
0.89 |
9.5 |
|
0.50 |
10.0 |
0.53 |
7.6 |
|
0.78 |
8.5 |
0.57 |
10.7 |
|
0.53 |
11.5 |
0.59 |
10.9 |
|
0.43 |
11.5 |
0.58 |
12.5 |
|
0.59 |
12.4 |
0.76 |
12.4 |
|
0.19 |
10.4 |
0.59 |
8.0 |
|
0.62 |
11.4 |
0.67 |
13.0 |
|
0.42 |
16.8 |
0.70 |
12.2 |
|
0.35 |
8.1 |
0.61 |
14.6 |
|
0.73 |
10.8 |
0.58 |
13.0 |
|
0.74 |
10.5 |
0.71 |
9.6 |
|
0.40 |
10.0 |
0.60 |
6.7 |
|
0.78 |
11.0 |
- |
- |
|
0.44 |
8.6 |
- |
- |
|
0.40 |
9.2 |
- |
- |
|
0.66 |
12.7 |
- |
- |
|
0.37 |
13.0 |
- |
- |
|
0.10 |
9.6 |
- |
- |
|
0.60 |
16.2 |
- |
- |
|
0.97 |
11.2 |
- |
- |
|
0.60 |
11.4 |
- |
- |
|
0.81 |
11.0 |
- |
- |
|
0.76 |
9.7 |
- |
- |
|
0.42 |
11.0 |
- |
- |
|
0.50 |
6.2 |
- |
- |
|
0.39 |
10.0 |
- |
- |
|
0.50 |
11.8 |
- |
- |
|
0.29 |
8.2 |
- |
- |
|
0.79 |
14.0 |
- |
- |
|
Total number of trees |
34 |
0.17 |
||
8070 |
0.60 |
10.9 |
0.26 |
3.6 |
0.46 |
11.9 |
0.47 |
6.5 |
|
0.69 |
12.8 |
0.62 |
8.5 |
|
0.53 |
12.5 |
0.59 |
12.5 |
|
0.75 |
17.0 |
0.44 |
9.0 |
|
0.85 |
18.3 |
0.28 |
6.8 |
|
0.51 |
8.5 |
0.54 |
8.0 |
|
0.33 |
10.5 |
0.28 |
9.9 |
|
0.82 |
8.0 |
0.56 |
14.0 |
|
0.52 |
12.0 |
0.63 |
16.0 |
|
0.63 |
10.5 |
- |
- |
|
0.74 |
7.6 |
- |
- |
|
0.27 |
7.5 |
- |
- |
|
0.21 |
6.5 |
- |
- |
|
0.66 |
11.0 |
- |
- |
|
0.12 |
0.7 |
- |
- |
|
0.62 |
8.7 |
- |
- |
|
0.73 |
10.0 |
- |
- |
|
Total number of trees |
18 |
10 |
||
8071 |
0.19 |
15.2 |
0.48 |
10.5 |
0.49 |
13.1 |
0.49 |
13.0 |
|
0.78 |
14.9 |
- |
- |
|
Total no. of trees |
3 |
2 |
||
8072 |
0.70 |
12.4 |
0.73 |
10.0 |
0.36 |
8.5 |
0.41 |
9.7 |
|
0.16 |
4.5 |
0.33 |
15.0 |
|
0.46 |
8.5 |
0.22 |
4.5 |
|
0.48 |
9.5 |
0.28 |
9.0 |
|
0.88 |
12.0 |
0.55 |
12.7 |
|
0.65 |
13.0 |
- |
- |
|
0.43 |
15.0 |
- |
- |
|
0.35 |
11.0 |
- |
- |
|
0.35 |
12.3 |
- |
- |
|
0.75 |
12.0 |
- |
- |
|
Total number of trees |
11 |
6 |
||
8073 |
0.57 |
7.0 |
0.61 |
12.5 |
0.73 |
12.0 |
0.43 |
5.8 |
|
0.44 |
14.0 |
0.34 |
11.0 |
|
0.39 |
11.8 |
0.56 |
7.0 |
|
Total number of trees |
4 |
4 |
||
8074 |
- |
- |
0.70 |
10.6 |
- |
- |
0.90 |
11.8 |
|
- |
- |
1.01 |
10.9 |
|
- |
- |
0.58 |
6.9 |
|
Total number of trees |
0 |
4 |
||
8075 |
1.15 |
16.0 |
0.82 |
13.8 |
- |
- |
0.53 |
13.8 |
|
Total number of trees |
1 |
2 |
||
8076 |
0.44 |
12.0 |
0.68 |
11.4 |
Total number of trees |
1 |
1 |
||
8077 |
1.00 |
13.5 |
0.60 |
5.5 |
0.62 |
6.0 |
- |
- |
|
0.94 |
15.5 |
- |
- |
|
Total number of trees |
3 |
1 |
||
8080 |
0.63 |
12.5 |
0.52 |
5.5 |
0.42 |
10.5 |
0.88 |
15.5 |
|
- |
- |
0.77 |
13.0 |
|
Total number of trees |
2 |
3 |
1 |
Full calculation for Q W3.1
We can refer to appendix b.We know that the student wishes to test for an association between two variables (DBH and height).
B2.2: Is there an association between two or more variables?
In this example there are two treatment variables (DBH and height), each measured in metres which is an interval scale of measurement. Neither variable is under the control of the investigator. From the table it is clear that if the association appears to be linear then a principal axis regression may be used to model the association and if the data are parametric then a Pearson's correlation may be used to test the association.
Experimental design |
Test |
|---|---|
You have two variables. The data are counts arranged in discrete categories (nominal or ordinal). At least one of these variables has more than two categories. You wish to test for an association between the variables. |
Chi-squared test for association (5.3.) or G test for association (5.5.2.) |
You have two variables. The data are counts arranged in discrete categories (nominal or ordinal). Both variables have only two categories. The association does not need to be linear. You wish to test for an association between the variables. |
Chi-squared test for association with Yates correction (5.4.2.) or G test for association (5.5.2.) |
You have two treatment variables. The data are non- parametric* and can be ranked (ordinal or interval). The association appears to be linear. You wish to test for an association. |
Spearman's rank correlation (6.2.) |
You have two treatment variables. The data are parametric. The association appears to be linear. You wish to test for an association. |
Pearson's product moment correlation (6.3.) |
You have two treatment variables, one of which is under the control of the investigator. You wish to test for an association or you wish to model the association and/or predict y values for given x values within the range of your observations. The distribution appears to be linear. |
Simple linear regression (6.5.) |
You have two treatment variables and neither is under the control of the investigator. The two variables are measured on the same scale e.g. mm. You wish to model the association and/or predict y values for given x values within the range of your observations. The distribution appears to be linear. |
Principal axis regression (6.6.) |
You have two treatment variables and neither is under the control of the investigator. The two variables are measured on different scales e.g. grams and arbitrary units. You wish to model the association and/or predict y values for given x values within the range of your observations. The distribution appears to be linear. |
Ranged principal axis regression (6.7.). |
None of these tests seems to be right for your data. For example you have three or more treatment variables; you have more than one y value for each x value; you do not have a linear distribution. |
Grafen & Hails, 2002. Legendre & Legendre, 1998. Sokal & Rohlf, 1981. |
The criteria for using a principal axis regression are that you:
- Need two treatment variables recorded for each item.
- Have an association that appears to be linear and the points reasonably scattered.
- Have confirmed using a correlation analysis that the association is significant.
- Have neither treatment variable under the investigators control. Therefore both treatment variables are subject to sampling error.
- Have two treatment variables that are measured on the same scale.
- Have three or more pairs of measurements.
It is clear that criteria 1, 4, 5 and 6 are met. Confirmation that the association appears to be linear and the points reasonably scattered comes from examination of Fig W6.2.

Figure W6.2: Height (m) against DBH (m) for male yew trees
We also need to confirm that the association is significant and have suggested using a Pearson's product moment correlation for this.
To use a Pearson's product moment correlation you:
- Wish to test for an association between treatment variables.
- Have two treatment variables measured for each item.
- Have parametric data. Both variables must be approximately Normally distributed (BOX 3.2).
- Have a distribution that appears to be linear when plotted and the points are reasonably scattered.
- There is no requirement for there to be an independent or a dependent variable.
It is clear that all these for the Pearson's product moment correlation are met other than criterion 3. We need to confirm that the data are parametric.
Therefore, at this point we know we can use Pearson's product moment correlation if the data are parametric. If they are we can carry out the correlation analysis and if significant we can then use the principal axis regression.
2 |
Full calculation of Q W3.2 in Excel
Step 1: Enter the data on a spreadsheet as shown below entering column labels in the first row and ensuring that your data for each sample is in a single column.

Step 2: Generate descriptive statistics in order to obtain values for mean, median, mode and standard deviation. From the top toolbar select 'Tools', then 'Data Analysis' and then 'Descriptive Statistics'.

Click on 'OK'. You will get a dialogue window into which you enter the input data range for DBH male trees. Enter the data cell locations be clicking on the top cell B1 and, holding down the mouse button, drag down to the bottom of the data set. Remember to click in the box for labels in the first row.
Select 'Output Range', and enter the location where you want the output. Select 'Summary Statistics' by clicking in the box.

Click on 'OK'.
Note the values returned for:
Mean = 0.5319
Median = 0.5
Mode = 0.35
Standard deviation = 0.2018

Repeat this step for the Height of the male trees.

We next need to construct a frequency chart for both DBH and height.
First, decide on the frequency classes. For DBH, we shall group the data in 0.1 metre intervals. For this, Excel requires a 'Bin Array'. This is the reference to intervals into which you want to group the data values. If we use the values 0.1, 0.2, 0.3 etc, Excel will count the number of all the values starting from zero up to and including 0.1. For the next grouping, it will count the number of values greater than 0.1 up to and including 0.2, and so on.
Enter the bin array in a column on the Excel spread sheet.
Then, highlight a column of equal length and place '=' in it. This is where the frequency values for each frequency class will be returned.

To obtain frequencies, click on the fx paste function on the top tool bar. From the dialogue box, select the' Function Category: Statistical' and the 'Function name: Frequency'.
Click on OK.

In the dialogue box which opens, first enter the cell locations for the DBH data (A2:A149).
Then click in the box for Bins array and enter the cell location of the frequency classes (D2:D14). Note that each set of values is listed after the respective box and the actual frequencies are given below the boxes.

In order to return the complete set of frequencies, hold down 'Control' and 'Shift' (() and then 'Return'. (If you only click on OK, only the first frequency value will be entered.)

A frequency histogram can now be drawn using the 'Chart Wizard'.
Select 'Chart Wizard' from the top tool bar.
- Step 1 of 4 - select the 'Chart type', click on the column chart to select it.
- Step 2 of 4 - enter the cell locations on the data range tab. Then click on the Series tab. Here ensure that only series 2 (the frequencies) is present in the Series box. The cell locations will appear in the 'Values' box. In the 'Category (X) axis labels' box place the sell locations for the frequency classes. Check that the chart appears as expected. Click on OK.


- Step 3 of 4 - here you can add the chart title and axes labels. Click on the tabs to customise the chart. (The legend can be removed by selecting the 'Legend' tab and then deselecting the 'Show legend' box).
- Step 4 of 4 - Select the chart location and click OK.


The completed chart can then be edited further if required.
Click on one of the columns to select columns. Click the right mouse button and select 'Format Data Series'. Under the 'Options' tab, reduce the 'Gap width' to '0'. This will expand the columns so that they meet.

Click on OK to obtain the completed chart.
Repeat the process for the frequency chart for height.

The DBH data can be sorted into ascending order by using the 'Sort Ascending' tool on the top tool bar.
. This will help when determining % of trees falling within mean ± 1 standard deviation.

Step 3: Go through the individual tests for parametric data.
DBH |
Height |
|
|---|---|---|
Is the data on an interval scale? |
Yes - it is measured in m. |
Yes - it is measured in m. |
Does the distribution appear to be a 'bell' shaped curve? |
No - it seems to have a longer tail for large values. |
Yes - if the small outlier is ignored - but this is probably an immature tree. |
Do about 68% of your observations fall within the range |
Mean = Standard deviation = s = 0.2018 m The measurements between these are 0.34 m to 0.74 m. Looking at the sorted data for DBH, the 22nd to 122nd trees fall into this range - a total of 101 trees. This is 100 x (101/142) = 71%, which is fairly close to 68%. |
Mean = Standard deviation = s = 2.999 m The measurements between these are 7.6 m to 13.5 m. Looking at the sorted data for height, the 25th to 123rd trees fall into this range - a total of 99 trees. This is 100 x (99/142) = 70%, which is fairly close to 68%. |
Does median = mode = mean? |
Median = 0.500 m Mean = 0.5319 m Mode = 0.35 The mode seems to be a somewhat smaller than the median and the mean, which are relatively close together. |
Median = 10.500 m Mean = 10.567 m Mode = 8.5 m The mode seems to be a somewhat smaller than the median and the mean, which are relatively close together. |
For the goodness of fit chi-squared test, see BOX 5.2.
From this information, it looks as if there may be some problems with assuming that the DBH is parametric, but height is closer to being Normally distributed.
Full calculation of Q W3.2 in SPSS
Step 1: Set up the variables.
When SPSS starts up, select 'variable view' using the tabs at bottom-left. You should get something like this:

For the first variable name, type in 'DBH' (SPSS won't accept spaces or brackets in variable names), and for the second 'height'. Default properties are set for each variable.

Use the label field to give labels to the variables that are a little more informative that their SPSS names.

The DBH data is to the nearest 0.01 m, so the 'decimals' field for DBH is OK. The heights are to the nearest 0.1 m, so the decimals field for height can be changed to 1 by clicking in it, and using the 'up-and-down' arrows that appear to make the adjustment.

Transfer to data view using the tabs at bottom-left, and enter the data.

Step 2: Perform the analysis.
[Slight digression: For part of the analysis later, we will need to sort the data. To do this. Go to 'Data', 'sort cases'.

Click on dbh to highlight it, then click on the arrow to transfer dbh across into the 'sort by' window.

Click on 'OK', and the data will be sorted in ascending order of dbh.

The same procedure can be used to sort the data in order of height.
This process will be needed later when we consider the criteria for a Normal distribution.
End of digression.]
Go to 'Analyze,' 'Descriptive Statistics', 'Frequencies'.

Click on 'DBH' to highlight it, then click on the arrow to transfer it into the 'variables' window. Repeat for 'height'.

Click on 'Statistics', and make sure that 'mean', 'median', and 'std. deviation' (standard deviation) are selected.

Click on 'continue'
Click on 'charts', and select 'histograms'.

Click on 'continue', and then on 'OK'. The results will appear in a separate window.
Frequencies

Frequency Table
(Two long frequency tables appears here)
Histogram


Step 3: Go through the individual tests for parametric data.
DBH |
Height |
|
|---|---|---|
Is the data on an interval scale? |
Yes - it is measured in m. |
Yes - it is measured in m. |
Does the distribution appear to be a 'bell' shaped curve? |
No - it seems to have a longer tail for large values. |
Yes - if the small outlier is ignored - but this is probably an immature tree. |
Do about 68% of your observations fall within the range |
Mean = Standard deviation = s = 0.2018 m The measurements between these are 0.34 m to 0.74 m. Sort the data by dbh (see the digression above). Looking at the sorted data, the 22nd to 122nd trees fall into this range - a total of 101 trees. This is 100 x (101/142) = 71%, which is fairly close to 68%. |
Mean = Standard deviation = s = 2.999 m The measurements between these are 7.6 m to 13.5 m. Sort the data by height (see the digression above). Looking at the sorted data, the 25th to 123rd trees fall into this range - a total of 99 trees. This is 100 x (99/142) = 70%, which is fairly close to 68%. |
Does median = mode = mean? |
Median = 0.500 m Mean = 0.5319 m Modal class is centred on 0.5 m. While median and mean are both within the modal classes, they are quite a bit different from each other. |
Median = 10.500 m Mean = 10.567 m Modal classes (from the histogram) are centred on 10, 11 and 12 m. The mode seems to be a little larger than the median and the mean, which are relatively close together. |
For the goodness of fit chi-squared test, see BOX 5.2.
From this information, it looks as if there may be some problems with assuming that the DBH is parametric, but height is closer to being Normally distributed.
Full calculation of Q W3.2 in Minitab
Step 1: Enter the data into the worksheet window, using sensible headings for the columns.

Step 2: Sort the data and calculate some descriptive statistics.
To sort the data, make sure that commands are enabled by clicking in the session window and going to 'Editor', 'Enable Commands'.

At the MTB > prompt, type 'sort c1 c3'. This will sort the elements in column 1, and place them in ascending order in column 3.

Give column 3 a name - 'sorted DBH' will do.
Repeat the process to sort the heights.

Generate some descriptive statistics - we will need the median, mean and standard deviation. Go to 'Stat', 'Basic Statistics', 'Display descriptive statistics'. Select column 1 by clicking on it to highlight it, then clicking on 'select' to transfer it to the 'Variables' window.

Repeat for 'height'.
Click on 'Statistics', and make sure that median, mean and standard deviation are selected.

Click on 'OK'.
Click on 'Graphs' and select 'Histogram of data'.

Click on 'OK', and on 'OK' again. The descriptive statistics appear in the session window, and the histograms appear in two new windows.
Descriptive Statistics: DBH (m), Height (m)



Step 3: Go through the individual tests for parametric data.
DBH |
Height |
|
|---|---|---|
Is the data on an interval scale? |
Yes - it is measured in m. |
Yes - it is measured in m. |
Does the distribution appear to be a 'bell' shaped curve? |
No - it seems to have a longer tail for large values. |
Yes - if the small outlier is ignored - but this is probably an immature tree. |
Do about 68% of your observations fall within the range |
Mean = Standard deviation = s = 0.2018 m The measurements between these are 0.34 m to 0.74 m. Looking at the sorted data for DBH, the 22nd to 122nd trees fall into this range - a total of 101 trees. This is 100 x (101/142) = 71%, which is fairly close to 68%. |
Mean = Standard deviation = s = 2.999 m The measurements between these are 7.6 m to 13.5 m. Looking at the sorted data for height, the 25th to 123rd trees fall into this range - a total of 99 trees. This is 100 x (99/142) = 70%, which is fairly close to 68%. |
Does median = mode = mean? |
Median = 0.500 m Mean = 0.5319 m Modal classes (from the histogram) are centred on 0.4 m and 0.5 m. While median and mean are both within the modal classes, they are quite a bit different from each other. |
Median = 10.500 m Mean = 10.567 m Modal classes (from the histogram) are centred on 10, 11 and 12 m. The mode seems to be a little larger than the median and the mean, which are relatively close together. |
For the goodness of fit chi-squared test, see BOX 5.2.
From this information, it looks as if there may be some problems with assuming that the DBH is parametric, but height is closer to being Normally distributed.
For the remainder of this exercise we will assume that both data sets are parametric. To use a more rigorous assessment you could compare the observed values to those expected from a Gaussian equation as outline in BOX 5.2. If you do this let us know what you find. You could also try transforming the data (3.9) to improve its normality. If you try this let us know the outcome. We will add your results to our website, fully acknowledged. We have not included these steps here as it will make this exercise too long.
3 |
Q W3.3Assuming that these data are parametric carry out a correlation analysis to determine if you may or may not reject the null hypothesis. H0: There is no association between the DBH (m) and height (m) in male Taxus baccata trees at the Wyre forest, Shropshire. H1: There an association between the DBH (m) and height (m) in male Taxus baccata trees at the Wyre forest, Shropshire. Can you reject the null hypothesis? [If you would like to save a record of your answer, please type it into this Word document] |
Full calculation of Q W3.3 in Excel
Step 1. Enter the data set into a new Excel spreadsheet.

Step 2: From the top tool bar, select Tools, Data Analysis and then Correlation from the tools list. The Correlation dialogue box will open.
Step 3: In the Correlation dialogue box, enter the Input range. Ensure that the Input range box is highlighted, then click on the top left hand cell of the data array and drag down and across the entire data set. The cell locations will be entered into the box.
Click in 'Columns' to show that the data is grouped by columns and click in the box 'Labels in the First Row'. This tells Excel that there is non-numerical data in these cells.
Next, select the output options. To return the output data below the input data, select 'Output Range:' and then click in the box (the cursor will now flash in the box). Scroll over an area where you want the results to be displayed. Note that you could just select a couple of cells - Excel will determine the actual size that it requires for the results table. Note too, that it is essential to click in the box as well as selecting the 'Output Range' button. If this is not done the location is entered into the 'Input Range' box and the analysis cannot be completed.

Step 4: Click on OK.
The correlation value is given in the table (0.416)

This represents a modest correlation and must be checked for significance in the statistical tables for Pearson's product moment correlation. With n= 142 and degrees of freedom = 141, the critical value is 0.168. So the correlation value is significant.
Full calculation of Q W3.3 in SPSS
Set up the variables.
When SPSS starts up, select 'variable view' using the tabs at bottom-left. You should get something like this:

For the first variable name, type in 'DBH' (SPSS won't accept spaces or brackets in variable names), and for the second 'height'. Default properties are set for each variable.

Use the label field to give labels to the variables that are a little more informative that their SPSS names.

The DBH data is to the nearest 0.01 m, so the 'decimals' field for DBH is OK. The heights are to the nearest 0.1 m, so the decimals field for height can be changed to 1 by clicking in it, and using the 'up-and-down' arrows that appear to make the adjustment..

Transfer to data view using the tabs at bottom-left, and enter the data.

Step 2: Perform the test.
Go to 'Analyze', 'Correlate', 'Bivariate'.

Click on 'dbh' to highlight it, then click on the arrow to transfer it across to the 'variables' box. Repeat for 'weight'. Make sure that 'Pearson' is selected.

Click on 'OK'. The results will appear in a new window.
Correlations

Step 3: Decide what the results mean.
SPSS computes a complete correlation matrix, so it includes rather meaningless things like the correlation of 'dbh' with itself, and performs the correlations both ways ('dbh' correlated with 'height, and 'height' correlated with 'dbh') which is a bit of a waste of effort for a Pearson test.
The important numbers here are the correlation coefficient of 0.416 for dbh and height (or height and dbh), which is large and positive (as dbh increases, so height increases); and the 'Sig. (2-tailed)', which is less than 0.001: this is the p-value, and gives the probability that there is no correlation between the test variables. We therefore conclude that there is a positive correlation between 'dbh' and 'height' at better than p = 0.001.
Full calculation of Q W3.3 in Minitab
Step 1: Enter the data into the worksheet window, using sensible headings for the columns.

Step 2: Analyse the data.
Go to 'Stat', 'Basic Statistics', 'Correlation'.

Click on 'C1 DBH (m)' to select it, then click on the arrow to transfer it into the 'variables' window. Repeat for 'C2 Height (m)'.

Click on 'OK'. The result will appear in the session window.
Correlations: DBH (m), Height (m)
![]()
Step 3: Decide what the result means
The correlation coefficient is 0.416, and the p-value is less than 0.001. This means that we have a correlation with a significance of better than p = 0.001.
We have now confirmed all the criteria for using a principal axis regression analysis are met. Excel, SPSS and Minitab will not directly calculate a principal axis regression. However, you have already calculated the mean and variance for DBH and height. We will now show you how to calculate the sums of products. With these values you can then complete the regression calculation simply using a calculator (BOX 6.5).
How to calculate the sums of products [sp (xy)] using Excel
Sum of products : sp(xy) = ∑xy - ∑x∑y/n
can also be written as sp(xy) = ∑(x-μx)(y-μy)
In Excel, we can use the Covariance Data Analysis tool which calculates the value of the formula :
The value of Covariance given by Excel can be multiplied by n to obtain the value of sp(xy).
Step 1: Continuing to use the same spreadsheet as above, on the top tool bar, click on 'Tools', then 'Data Analysis' and select 'Covariance' from the list in the box which opens. Click on OK.
Step 2: In the Covariance dialogue box, enter the Input range. Ensure that the Input range box is highlighted, then click on the top left hand cell (for DBH) of the data array and drag down and across the entire data set (for DBH and Height of male trees). The cell locations will be entered into the box.
Click in 'Columns' to show that the data is grouped by columns and click in the box 'Labels in the First Row'. This tells Excel that there is non-numerical data in these cells.

Next, select the output options. To return the output data below the input data, select 'Output Range:' and then click in the box (the cursor will now flash in the box). Scroll over an area where you want the results to be displayed. Note that you could just select a couple of cells - Excel will determine the actual size that it requires for the results table. Note too, that it is essential to click in the box as well as selecting the 'Output Range' button. If this is not done the location is entered into the 'Input Range' box and the analysis cannot be completed. Click on OK.
Step 3: The Covariance value is returned in the Output cell DBH/Height and the value is 0.250.

Step 4: The Covariance value must now be multiplied by n (n=142). Select an empty cell by clicking on it and type'='. Click on the cell (I 162) containing the Covariance value (the cell location will appear in the selected cell). Type '*' for multiplication and then enter the value of n (n=142) = I162*142 and press return. The value of sp(xy) will be returned to the cell.
sp(xy) = 35.51293662
How to calculate the sums of products [sp (xy)] using SPSS
Step 1: Set up the variables.
When SPSS starts up, select 'variable view' using the tabs at bottom-left. You should get something like this:

For the first variable name, type in 'DBH' (SPSS won't accept spaces or brackets in variable names), and for the second 'height'. Default properties are set for each variable.

Use the label field to give labels to the variables that are a little more informative than their SPSS names.

The DBH data is to the nearest 0.01 m, so the 'decimals' field for DBH is OK. The heights are to the nearest 0.1 m, so the decimals field for height can be changed to 1 by clicking in it, and using the 'up-and-down' arrows that appear to make the adjustment..

Transfer to data view using the tabs at bottom-left, and enter the data.

Step 2: Perform the calculation.
The first step is to calculate the product of DBH and height for each individual tree. Go to 'Transform', 'Compute'.

Type 'product' in the 'Target variable' window.
Click on 'DBH' to highlight it, then click on the arrow to transfer it to the 'expression' window. Type '*' (or click on the '*' button), and then transfer 'height' to the expression window.

Click on 'OK', and the new column will appear in the data view.

We next need to find the sums of these three columns, and this is best done using the descriptive statistics option. Go to 'Analyze', 'Descriptive Statistics', 'Descriptives'.

Click on 'DBH' to highlight it, then click on the arrow to transfer it to the 'variable(s)' window. Use the same method to transfer 'height' and 'product'.

Click on 'options', and make sure than 'sum' is selected.

Click on 'Continue', and then on 'OK'. The results will appear in a separate window.
Descriptives

We now have all the bits we need to find the sum of the products. The formula is
Sum of (product) - [(sum of DBH) x (sum of height)]/(number of trees).
This can be done easily on a calculator, or using the 'Transform', 'Compute' facility of SPSS (although this presents the result in a rather silly way - giving the same value to every element in a new column).

And the results gives:

Hence the sum of the products of DBH and Height is 35.51 m2.
How to calculate the sums of products [sp (xy)] using Minitab
Step 1: Enter the data into the worksheet window, using sensible headings for the columns.

Step 2: Analyse the data.
The first thing we need to do is to work out the product of the DBH and the height for every tree. Go to 'Calc', 'Calculator'.

In the 'store result in variable' window, type 'DBH x Height'.
Click in the 'expression window to make it active. Click on 'C1 DBH' in the left-hand window to highlight it, then click on 'select' to transfer it into the 'expression' window. Type '*' (which means 'multiply') (or click on the '*' button), and then transfer 'C2 height' into the expression window.

Click on 'OK', and the results will appear as a new column in the worksheet window.

There are 142 trees (scroll down to the bottom of the worksheet window to confirm this).
We now have all the bits we need to find the sum of the products. The formula is
Sum of (DBH x Height) - [(sum of DBH) x (sum of height)]/(number of trees).
Go to 'Calc', 'Calculator' again, and clear the 'store result in variable' and 'expression' windows.
Type 'SP' in the 'store result in variable' window.

Click in the expression window to activate it. Scroll down the function list until you get to 'sum'. Click on 'sum' to highlight it, then click on 'select' to enter it into the expression window.

Click on 'C3 DBH x Height' in the left-hand window to highlight it, then click on 'select' to enter it into the 'sum' function.

Move the cursor outside the 'sum' function's brackets, and type '-' (or press the '-' button).
Construct the remainder of the expression using the 'sum' function and the columns in the left-hand window.

Click on 'OK', and the result will appear in a new column in the worksheet window.

Step 3: Decide what the result means.
The sum of the products of DBH and height is 35.51 m2.
4 |
Full calculation for Q W3.4
- Calculation of terms
- First decide which of the variables to call x and which to call y.
Let DBH be the x variable and height be the y variable.
- Calculate basic terms
From the previous calculations we know that:
- The mean of the DBH (x) is 0.5319 m, and its variance is 0.0407 m2.
- The mean of the height (y) is 10.567 m, and its variance is 8.995 m2.
- The total number of pairs of observations is 142.
- Calculate the variance for products xy (s2xy)
The sum of products sp (xy) = 35.51294
- The slope of the principal axis (b1) is calculated as
Taking this in small steps:
- What is the regression equation?
y = 10.567 + 35.5808 (x - 0.5319)
= 10.567 + 35.5808x - 18.9254
= - 8.35843 + 35.5808x
- How to draw the line
In the same way as other regressions you need to calculate the x and y values for 3 points using x values that lie within your data set, and your regression equation. Your line then passes through these points within the range of your data set. This is the regression line and if you add it to your figure it should be labelled with the regression equation.
In this example if
x = 0.2, y = - 8.35843 + (35.5808 × 0.2) = - 8.35843 + 7.11616 = 1.24227
x = 0.5, y = - 8.35843 + (35.5808 × 0.5) = - 8.35843 + 17.7904 = 9.43197
x = 1.0, y = - 8.35843 + (35.5808 × 1.0) = - 8.35843 + 35.5808 = 27.22237
The regression line is shown on fig W6.3.

- Testing the significance of the line
The principal axis regression allows you to identify a regression line that reflects the linear nature of the association between your two variables. However, unlike the simple linear regression there is no simple associated process for testing a hypothesis relating to this association. Therefore, you should also use a correlation analysis to confirm the strength and significance of the association between the two variables.
We carried this out earlier in this exercise. Therefore, from our analysis we may conclude that there is a highly significant association (r = 0.416, p < 0.001) between the DBH (m, variable x) and height (m, variable y) in male Taxus baccata trees at the Wyre forest, Shropshire such that y = - 8.35843 + 35.5808x (Fig W6.3).
- Validity of the process
A cursory inspection of Fig. W6.3 indicates that the line we have obtained is not a very good fit to the data. This is because, although the measurements are both in the same units (metres), they are of rather different magnitudes (the heights are about 10 m, while the values of DBH are about 0.5 m). This means that the shape of the ellipse whose major axis we are trying to find is distorted by the unequal scale on the two axes, and the final result is not a very good fit.
One way of overcoming this problem is to use ranged principal axis regression, in which the data are transformed so that the range is from 0 to 1; the regression line is found for the transformed data, and then the equation of the line is transformed back to the original scales. The range of height is from 0.7 m to 18.3 m, and the range of DBH is from 0.12 m to 1.15 m. If we use a subscript T to denote the transformed data, then for height we have HT = (H - 0.7)/(18.3 - 0.7) ---------------(1)
and for DBH we have DBHT = (DBH - 0.12)/(1.15 - 0.12) --------------(2)
The transformations can easily be performed using a spreadsheet such as Excel. Once this has been done, the principal axis regression line can be found in the normal way, giving HT = interceptT + (slopeT x DBHT). The reverse transformation is performed by substituting (1) and (2) into this equation and rearranging to make H the subject, and this gives y = 4.035 + 12.281x. This actually looks a lot better that the principal axis regression (see Fig. W6.4).

Figure W6.4: Scatterplot of height (m) against DBH (m) for male yew trees
Another method (not mentioned in the book) is reduced principal axis regression. In this, the data is transformed so that the mean is zero and the standard deviation is 1. This has the advantage over ranged principal axis regression in that the transformation is not sensitive to outliers (data points well away from the bulk of the measurements). The regression line of the transformed data is found, and then transformed back to the original variables, as for ranged principal axis regression. At this point, because of a quirk of the mathematics, we find that the slope of the regression line is simply the ratio of the standard deviations of our original data. This means that reduced principal axis regression is very easy to do, and this may be one reason why it is so often used in research papers. For the yew trees, the standard deviations of the height and the DBH are 2.999213 m and 0.201800 m respectively, giving a slope of 2.999213/0.201800 = 14.862289. The reduced principal axis regression line is therefore:
y = 10.5669 + 14.862289(x - 0.5319)
which gives y = 2.662 + 14.862x
See Fig. W6.5 for a comparison of various regression lines.

Fig. W6.5: Scatterplot of heights of male yew trees (m) against DBH (m) showing regression lines.
Line 1: Simple regression of height on DBH: y = 7.30 + 6.01x
Line 2: Principal axis regression: y = -8.36 + 35.58x
Line 3: Ranged principal axis regression: y = 4.03 + 12.28x
Line 4: Reduced principal axis regression: y = 2.66 + 14.86x