A Compendium of Reproducible Research about Descriptive Statistics and Linear Regression
Abstract This document can be used as an illustrated tutorial about Wessa.net and FreeStatistics.org. It illustrates how Reproducible Research can be integrated in Statistics Education based on a Compendium of Reproducible and Reusable Research. 1
1
Introduction
This case study illustrates how free online statistical software (R modules) can be used in data analysis. This tutorial can be used at an undergraduate level the prerequisites are as follows: • a basic understanding of Descriptive Statistics • an interest in scientific data analysis • basic ICT skills All statistical techniques are based on the online “e-Handbook of Statistical Methods” [16] that is available online as HTML and PDF documents. The structure of the e-Handbook is like a cookbook: it contains a list of statistical methods in alphabetical order. There is also an overview of all methods that is ordered by type of analysis. In my experience with students it is best to use the electronic PDF version of the e-Handbook because it can be easily browsed and searched. Each method is explained in simple words and with a minimum of mathematical background. In addition, the documents contains numerous illustrations and examples which may also be beneficial. The case-based approach of this tutorial and the technology that allows to reproduce/reuse every aspect of the research, is consistent with the pedagogical paradigm of (social) constructivism. The fact that this tutorial has been designed as a compendium makes every aspect of the portayed research reproducible which empowers anyone with a healthy interest in statistics to interact (and conduct experiments) with the underlying statistical science. From the statistical point of view, the following concepts are used in this tutorial:
The Essay on Statistical Research in Psychology
Many students, including myself, are unaware how important statistics can be to the research process in Psychology. In this report, I will discuss the method used to perform researches as well as the forms of data used through statistic in Psychology. This will include the advantages and disadvantages of each form used. Introduction Many studying Psychology might be shocked that statistics is a ...
1 Acknowledgements: The Compendium Platform for Reproducible Research was partially funded by the OOF 2007/13 grant of the K.U.Leuven Association.
1
• Measures of Central Tendency (arithmetic mean and median) • Explorative Data Analysis (such as Notched Boxplots) • The importance of outliers • Statistical Hypothesis Testing (one-sided and two-sided) • Simple and multiple regression Analysis A final word of caution: The models that are presented here are illustrative and simplistic in nature. Nevertheless the document should contain some interesting and challenging concepts for students at an undergraduate level.
2
2.1
How much should we pay for coffee?
Problem
We want to import Arabica coffee from Colombia and sell it in the USA. Therefore we would like to explore and estimate the relationship between the monthly time series Yt (the US retail price in US cents per lb) and Xt (the price paid to growers in Colombia in US cents per lb).
We suspect that the increase of prices paid to growers (on the long run) is substantially lower than the rise in retail prices in the USA. 2.1.1 Hypotheses
Hypothesis 1 Arabica coffee prices paid to growers in Colombia are constant. We define H0 : αX = c and HA : αX = c where c is any “positive constant number” for the following equation: Xt = αX + et for t = 1, 2, …T . Note: if the coffee prices are constant this implies that the constant αX can be used to make predictions about Xt on the long run. Hypothesis 2 Retail prices of Arabica coffee in the USA are constant. We define H0 : αY = c and HA : αY = c where c is any “positive constant number” for the following equation: Yt = αY + et for t = 1, 2, …T . Note: if the coffee prices are constant this implies that the constant αY can be used to make predictions about Yt on the long run. Hypothesis 3 The retail prices of Arabica coffee (in the USA) can be (partially) explained by the price paid to growers in Colombia. We define H0 : β = 0 and HA : β > 0 for the following equation: Yt = α + βXt + … + et for t = 1, 2, …T . Note: we use a one-sided hypothesis test because we expect β to be positive. Hypothesis 4 The long run growth of retail prices of Arabica coffee (in the USA) cannot be explained by the prices paid to growers in Colombia. In other words, an additional trend variable is needed to account for the long run increase in US retail prices. We define H0 : γ = 0 and HA : γ > 0 for the following equation: Yt = α + βXt + γt + … + et for t = 1, 2, …T . Note: we use a one-sided hypothesis test because we expect γ to be positive. 2
The Essay on The Rising Price of Food
Recent years have seen dramatic increases in the world prices for food commodities. The first half of the year 2008 saw the price of rice go up by 50% and generally speaking, similar increases in other food commodities such as maize, soybeans and wheat have been seen across the world, resulting in various forms of panic. In the Philippines, farmers have begun hoarding supplies of rice, while ...
Arabica Price in Colombia
value
40 0
60
80
100
120
140
160
50
100
150 time
200
250
300
350
Figure 1: Run Sequence Plot – Arabica Coffee Price in Colombia link of reproducible computation
2.2
Data
The following time series were obtained from the International Coffee Organization (ICO): • The univariate time series of prices paid to growers in Colombia in US cents/lb (Figure 1 ) [1]. This time series is denoted Xt for t = 1, 2, …T where T is the number of observations. • The univariate time series of retail prices in the USA in US cents/lb (Figure 2) [2]. This time series is denoted Yt for t = 1, 2, …T where T is the number of observations. Both time series range from January 1977 – December 2006 and can be viewed in a Google Spreadsheet (click to open Google Spreadsheet) [4]. As an alternative, the bivariate time series of Arabica Coffee in Colombia and the USA in US cents/lb (Figure 3) can be retrieved from the archive at FreeStatistics.org [3].
2.3
Analysis
Each hypothesis is investigated in turn, based on statistical models that are consistent with the theoretical concepts that are outlined in Chapter 1 (Explorative Data Analysis) [16]. 2.3.1 Univariate Analysis of Xt
There are several ways to estimate αX in Xt = αX +et for t = 1, 2, …T . First we investigate if the arithmetic mean could be an appropriate choice for estimating αX . We know that the arithmetic mean can be very sensitive with respect to outliers. Therefore, we compute the winsorized and trimmed (arithmetic) mean of Xt (click to reproduce).
The Essay on Slaughterhouse five Realitivity Of Time
Many writers in history have written science fiction novels and had great success with them, but only a few have been as enduring over time as Kurt Vonneguts Slaughterhouse-Five. Slaughterhouse-Five is a personal novel which draws upon Vonneguts experiences as a scout in World War Two, his capture and becoming a prisoner of war, and his witnessing of the fire bombing of Dresden in February of 1945 ...
The arithmetic mean of Xt is 77.54 US cents/lb [5] when all observations are considered. However, if we “winsorize” the observations by making extreme
3
Arabica Price in the USA
value
250 0
300
350
400
450
50
100
150 time
200
250
300
350
Figure 2: Run Sequence Plot – Arabica Coffee Price in the USA link of reproducible computation
Plot of x
120 160 450
Plot of y
cents/lb
cents/lb 0 50 100 150 200 250 300 350
80
40
250 0
350
50
100
150
200
250
300
350
time
time
Histogram of x
Histogram of y
Frequency
Frequency 40 60 80 100 x 120 140 160 180
60
0 20
0
20
40
60
250
300
350 y
400
450
Figure 3: Run Sequence Plots and Histograms – Arabica Coffee Price in Colombia (x) and the USA (y) link of reproducible computation
4
Robustness of Central Tendency
Winsorized Mean(j/n)
75 0
76
77
78
79
20
40
60 j
80
100
120
Figure 4: Robustness of Central Tendency – Arabica Coffee Price in Colombia link of reproducible computation values less extreme [10] then the result in Figure 4 is obtained. As we move along the x-axis (from left to right) we observe how the winsorized mean decreases as more extreme values are made less extreme. If we “winsorize” about 60 observations on both sides of the distribution (j = 60), the arithmetic mean is approximately 76 cents/lb. If we consider the hypothesis H0 : αX = 77.54 and HA : αX = 77.54 then we observe that 77.54 falls outside of the 95% confidence interval when j = 60. Hence, we reject the nullhypothesis that the arithmetic mean is equal to 77.54 at the 5% type I error level. In other words, the extreme values in Xt cause the arithmetic mean to be biased and therefore the arithmetic mean is probably not a good choice for αX . How about other measures of central tendency? It is often suggested that the median should be used instead of the arithmetic mean. The reseaon is that the median is not sensitive to outliers. Computation [5] shows that the median is 76.78 cents/lb which is smaller than the original arithmetic mean of 77.54 cents/lb. When the median and the arithmetic mean are unequal then we may conclude that the distribution is skewed (not symmetric).
The Essay on Cobweb Theorem Price Figure Supply
Cobweb model (know as Hog Cycle) is a dynamic analysis which provides an explanation for certain types of cyclical behaviours due to regular fluctuation in price and output. This model, as most other economic models, is based on some assumptions. It assumes that prices are determined by current prices i. e. farmers expect that the future price will be the same as the present ones. This is known as ...
We now have to determine if the median and arithmetic mean are equal or significantly different. Unfortunately, hypothesis tests about the median are not easily derived. However, we could use notched boxplots (a tool from Explorative Data Analysis) combined with the Blocked Bootstrap method to further investigate this question. The wessa.net website hosts an R module called “Blocked Bootstrap Plot Central Tendency” which compares three measures of central tendency (arithmetic mean, median, and midrange) by means of simulation for any univariate time series (see also 1.3.3.4. Bootstrap Plot [16]).
The idea is to simulate a (large) number of instances of the original time series by blocked bootstrapping. Then the three measures of central tendency are computed for each simulated series. The results can be summarized by the use of notched boxplots that allow us to compare them. Instead of using the default R module we make some changes by “editing” the underlying R source code (click to reproduce) [6]: • we eliminate the midrange computation (because it is of no interest for 5
Bootstrap Simulation − Central Tendency
simulated values
70
75
80
85
mean
median
Figure 5: Notched Boxplots of Blocked Bootstrap – Arabica Coffee Price in Colombia link of reproducible computation our investigation) • we add a table that displays the confidence intervals of the notched boxplot procedure as implemented in the R language • we deleted all pictures with the exception of the notched boxplots (this is the only picture of interest for our investigation) The result in Figure 5 clearly shows that the notches of both boxplots do not overlap (this can be verified in the table with confidence bounds [6]).
The Essay on Application Of The Vrine Model In The Analysis Of Cisco Systems, Inc.
Cisco’s acquisition of more than 115 companies since 1993 could have meant a significant challenge for integrating networks and other IT elements. Instead, Cisco IT has developed a standard set of principles and processes to help accomplish these integrations rapidly, consistently, and with minor disruption. Cisco IT continuously improves its integration expertise by applying the standards to each ...
The conclusion is that the median of simulated arithmetic means is different from the median of simulated medians. If the median and the mean of Xt are unequal then we may fairly assume the distribution of Xt to be skewed (compare this result with the histogram of Figure 3).
Let us continue the analysis with the median as estimator for αX . We are interested to know if we can make fairly appropriate predictions of Xt for t = T + 1, T + 2, …, T + K where K is the forecast horizon. More formally the forecast is defined by Ft ≡ Xt − et = αX for t = T + 1, T + 2, …T + K (where αX = median(Xt ) for t = 1, 2, …T ).
To test the hypothesis that the median is constant we need to address two questions: 1. is the median idependent of the month (seasonality)? 2. is the median constant on the long run? The R module called “Mean Plot” allows us to investigate the median of periodic and sequential subseries as described in 1.3.3.20 Mean Plot [16]. The mean plot of Xt (click to reproduce) [7] clearly answers both questions: 1. there is no indication of seasonality in the time series (see Figure 6).
Hence, the median does not depend on the month. 2. there is clear evidence of a trend-like behaviour in the time series under investigation (see Figure 7).
Hence, the median is not constant on the long run. 6
Notched Box Plots − Periodic Subseries
Value
40
60
80
100
120
140
160
1
2
3
4
5
6
7
8
9
10
11
12
Periodic Index
Figure 6: Notched Boxplots – Periodic Subseries – Arabica Coffee Price in Colombia link of reproducible computation
Notched Box Plots − Sequential Blocks
Value
40
60
80
100
120
140
160
1
3
5
7
9
11
13
15
17
19
21
23
25
27
29
Block Index
Figure 7: Notched Boxplots – Sequential Subseries – Arabica Coffee Price in Colombia link of reproducible computation Figure 7 is of particular interest in our analysis. It is clear from the analysis that the medians in sequential years are significantly different. Therefore the use of the median as a predictor (Ft = αX ) is problematic when it is computed over the entire observation period. It looks like the recent past deserves more weight in computing αX than the distant past. The answer to Hypothesis 1 is clear. The null hypothesis H0 : αX = c is rejected and the alternative hypothesis HA : αX = c accepted. The model Xt = αX + et for t = 1, 2, …T and the forecasts generated by Ft = αX = c for t = T + 1, T + 2, …T + K are unacceptable. Another way to investigate the proposed model is by means of the so-called 4-Plot as described in 1.3.3.32 4-Plot [16]. The implementation of the 4 Plot in wessa.net has 6 charts and is called “Univariate Explorative Data Analysis” (the Densityplot and Autocorrelation Function have been added) [13]. The underlying assumptions about the model can be investigated [8] with this R 7
The Term Paper on Painting Analysis in Jane Eyre
From the opening chapter of Charlotte Brontë’s Jane Eyre the reader becomes aware of the powerful role that art plays. There is something extraordinary about the pictures Jane admires from other artists, as well as the work she creates herself. Her solitary pastime often operates as an outlet of pain, either past or present, and offers her the opportunity to deal with unpleasant emotions and ...
module and reveals that (click to reproduce): • the time series does not behave like a constant (c.q. Xt = αX + et is not appropriate) • the variance is not constant • the time series is not normally distributed (but skewed to the right) • the time series is autocorrelated (and even contains high order autocorrelation) 2.3.2 Univariate Analysis of Yt
To repeat the analysis that was performed in the previous section we simply need to substitute Xt for Yt . There are several ways to estimate αY in Yt = αY +et for t = 1, 2, …T . First we investigate if the arithmetic mean could be an appropriate choice for estimating αY . We know that the arithmetic mean can be very sensitive with respect to outliers. Therefore, we compute the winsorized and trimmed (arithmetic) mean of Yt (click to reproduce).
Assignment Fill in the blanks in the following text. The arithmetic mean of Yt is US cents/lb [9] when all observations are considered. If we “winsorize” more than 50 observations on both sides of the distribution, the arithmetic mean is significantly lower. In other words, the extreme values in Yt cause the arithmetic mean to be biased and therefore the arithmetic mean is probably not a good choice for αY . As we move along the x-axis of the winsorized or trimmed mean chart, the computed mean (slowly) converges towards US cents/lb which is the (select from this list: midrange, midmean, harmonic mean, median, mode, geometric mean).
Assignment Investigate Yt and answer the following questions: 1. Is the distribution of Yt skewed? Use at least two different techniques to answer this question. 2. Does Yt contain seasonality? 3. Are the forecasts generated by Ft = αY = c for t = T + 1, T + 2, …T + K acceptable? Hint: make sure to blog (archive) your computations and include the hyperlinks of the archived computations in your document. 2.3.3 regression model 1
The third hypothesis under investigation focuses on the question if US retail prices of Arabica coffee can be explained by the price paid to growers in Colombia. We defined H0 : β = 0 and HA : β > 0 for the following equation: Yt = α + βXt + … + et for t = 1, 2, …T . Logically we expect that retail prices will reflect production costs which leads to the conclusion that the parameter β should be positive.
8
The hypothesis can be investigated by the use of the so-called 6-Plot (1.3.3.33 6-Plot [16]) which is basically an EDA-based graphical representation of the simple linear regression model. This analysis has been implemented in wessa.net with 9 charts (instead of 6) and is named “Linear Regression Graphical Model Validation” [14]. The result of this analysis [11] clearly shows that the null hypothesis H0 : β = 0 is rejected in favor of the alternative HA : β > 0 (click to reproduce).
The economic interpretation is that the cost of production is reflected in the ˆ retail prices. The estimated value of the regression parameter β ≃ 1.84 implies that an increase of 1 US cent/lb of Xt is multiplied by a factor 1.84 to get the ˆ estimated USA retail price Yt . The result of this regression estimation should be subjected to careful “model validation” (testing the underlying model assumptions).
Within the framework of EDA this is preferably done with graphical analysis. Assignment Answer the following questions based on the archived computation: 1. Does the scatterplot between Xt and Yt reveal/suggest a linear relationship? 2. Is there any evidence for autocorrelation?
2
3. Are the estimated residuals normally distributed? 4. Is there any evidence for the presence of outliers? 5. Overall, do you think the assumptions of the model are satisfied? 2.3.4 Regression Model 2
Let us suppose that the simple regression model of the previous section was misspecified (or incomplete) – in other words, let us suppose that a linear trend should have been added to the equation as defined in Hypothesis 4 of section 2.1.1. We can’t use the 6-Plot (1.3.3.33 6-Plot [16]) from the previous section because we now have three variables instead of two. There are two exogenous (c.q. explanatory) variables (Xt and t) and one endogenous variable (Yt ).
The newly added variable can be interpreted as a linear trend (or a long term effect of time).
One might wonder if the addition of this new variable does make sense at all? Would it affect the estimated relationship between Xt and Yt (represented ˆ by the parameter β)? We can find out the answer by making use of the multiple regression R module [15] which allows us to specify “multiple” (c.q. more than one) exogenous variables (hence the term “multiple regression”).
In this case the data have to be entered as a multivariate dataset. The easiest way to do this is to copy the multivariate dataset from a spreadsheet and paste it into the Data X textbox
2 You may not be familiar with the term autocorrelation. Therefore we could rephrase the question as follows: “Do prediction errors from the past contain any information that might help us to improve future predictions?” or “Is there any evidence that ρ(et , et−k ) = 0 for k = 1, 2, 3, …?”
9
Figure 8: Data Entry – Multiple Regression R module link of R module
Figure 9: Parameters – Multiple Regression R module link of R module
10
of the R module. You can use the Google Spreadsheet (click to open) that is available online. The data should be pasted all at once: select the data range of all values in de spreadsheet and copy to the clipboard (with Ctrl-Insert).
Then paste the contents of the clipboard into the “Data X” box of the R module. As a second step, we need to identiy the names of each column. We can do this by simply copying the contents of the cells (C1:D1) to the clipboard and pasting it into the “Names of X columns” box3 . Figure 8 is an illustration (partial screendump) of the R module after the multivariate dataset was pasted into the application. The model-specifying parameters of the multiple regression should be set correctly. First, we have to make sure that the “Column Number of Endogenous Series” field contains the number 2 because we want to define the second column as the endogenous series (= Yt ).
Second, we select the option “Linear Trend” from the “Type of Equation” drop down list in order to make sure the software automatically generates an exogenous variable that represents time. An illustration of both model-specification parameters can be found in Figure 9. The Multiple Regression R module produces a substantial amount of output. For the purpose of this tutorial however we focus exclusively on the aspects that are directly related to Hypothesis 4: is it necessary to include an additional trend variable to account for the long run increase in US retail prices? Based on the multiple regression analysis [12], the answer is affirmative4 . The null hypothesis H0 : γ = 0 is rejected in favor of the alternative HA : γ > 0 as can be seen from the table with estimated regression parameters and respective p-values (click to reproduce).
ˆ The estimated parameters α ≃ 135.7, β ≃ 1.88 (slightly higher than in ˆ regression 1) and γ ≃ 0.15 can be substituted into the regression equation ˆ ˆ which yields: Yt = 135.7 + 1.88Xt + 0.15t for t = 1, 2, …T .
2.4
Important remark
As indicated in the introduction the analysis portrayed in this document is illustrative and simplistic in nature. More importantly, the underlying assumption of the multiple linear regression model are not satisfied. The multiple regression equation suffers from autocorrelation and heteroskedasticity which leads to inefficient estimation. It is highly likely that both problems are caused by a misspecification of the regression model (unobserved variables) which may lead to biased parameter estimates. As a consequence our conclusions about Hypothesis 3 and 4 should be revisited in the context of a more adequate (complete) model specification. Assignment If you are familiar with autocorrelation and heteroskedasticity you may try to answer the following questions based on the archived computation:
that the column names must not be too long nor contain any spaces! with any regression model the validity of the hypothesis test depends on underlying assumptions. The answer that is provided here is conditional (c.q. given that the underlying assumputions are satisfied).
4 As 3 Note
11
• Are the residuals of the regression model autocorrelated? How do you detect this? • Is there any evidence for the presence of high-order autocorrelation? • Does the model really suffer from heteroskedasticity? Use two different diagnostic tests to confirm your answer. • Could the model be improved by taking into account seasonal effects? Hint: reproduce the model with monthly seasonal dummies.
References
[1] Statistical fice for FreeStatistics.org, Ofand Education, URL http://www.freestatistics.org/blog/date/2008/Jan/06/t1199574888xvrybp4pl7fyg1j.htm , Retrieved Sun, 06 Jan 2008 00:15:06 +0100 FreeStatistics.org, Ofand Education, URL http://www.freestatistics.org/blog/date/2008/Jan/06/t11995765663rnq0f6tqd9tvuc.htm , Retrieved Sun, 06 Jan 2008 00:42:50 +0100 FreeStatistics.org, Ofand Education, URL http://www.freestatistics.org/blog/date/2008/Jan/06/t11995770459lp663o7ht2jxj1.htm , Retrieved Sun, 06 Jan 2008 00:50:48 +0100 Computations at Research Development Computations at Research Development Computations at Research Development
[2] Statistical fice for
[3] Statistical fice for
[4] http://spreadsheets.google.com/ccc?key=pV35do1d8bhGWVkmHcU2xRg&hl=en [5] Statistical fice for FreeStatistics.org, Ofand Education, URL http://www.freestatistics.org/blog/date/2008/Jan/06/t1199655142l1mf45ba5ydsjhm.htm , Retrieved Sun, 06 Jan 2008 22:32:30 +0100 FreeStatistics.org, Ofand Education, URL http://www.freestatistics.org/blog/date/2008/Jan/07/t1199701558w9l4qhatxnfi326.htm , Retrieved Mon, 07 Jan 2008 11:26:04 +0100 FreeStatistics.org, Ofand Education, URL http://www.freestatistics.org/blog/date/2008/Jan/07/t119971350673encrjvyxgcqw7.htm , Retrieved Mon, 07 Jan 2008 14:45:24 +0100 FreeStatistics.org, Ofand Education, URL http://www.freestatistics.org/blog/date/2008/Jan/07/t1199715933ou67sm0izgd06ae.htm , Retrieved Mon, 07 Jan 2008 15:25:36 +0100 FreeStatistics.org, Ofand Education, URL http://www.freestatistics.org/blog/date/2008/Jan/19/t12007439926t0l73gl5ozrq14.htm , Retrieved Sat, 19 Jan 2008 12:59:59 +0100 12 Computations at Research Development Computations at Research Development Computations at Research Development Computations at Research Development Computations at Research Development
[6] Statistical fice for
[7] Statistical fice for
[8] Statistical fice for
[9] Statistical fice for
[10] Borghers E. and P. Wessa, Descriptive Statistics – Central Tendency Winsorized Mean, Office for Research Development and Education, URL mean.htm, Retrieved Mon, 07 Jan 2008 10:40 +0100 [11] Statistical fice for FreeStatistics.org, Ofand Education, URL http://www.freestatistics.org/blog/date/2008/Feb/26/t12040214435my4uqfrs09mgrk.htm , Retrieved Tue, 26 Feb 2008 11:24:09 +0100 FreeStatistics.org, Ofand Education, URL http://www.freestatistics.org/blog/date/2008/Feb/26/t1204025214shjxx5mlpuxjln9.htm , Retrieved Tue, 26 Feb 2008 12:26:59 +0100 Computations at Research Development Computations at Research Development
[12] Statistical fice for
[13] Wessa P., (2007), Univariate Explorative Data Analysis (v1.0.5) in Free Statistics Software (v1.1.22-r4), Office for Research Development and Education, URL edauni.wasp [14] Wessa P., (2007), Linear Regression Graphical Model Validation (v1.0.2) in Free Statistics Software (v1.1.22-r4), Office for Research Development and Education, URL linear regression.wasp [15] Wessa P., (2008), Multiple Regression (v1.0.25) in Free Statistics Software (v1.1.22-r4), Office for Research Development and Education, URL multipleregression.wasp/ [16] NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/, Retrieved Mon, 07 Jan 2008 02:01 +0100.
13