Descriptive Statistics and Linear Regression, Sample of Research papers

8 pages, 3735 words

A Compendium of Reproducible Research about Descriptive Statistics and Linear Regression

Abstract This document can be used as an illustrated tutorial about Wessa.net and FreeStatistics.org. It illustrates how Reproducible Research can be integrated in Statistics Education based on a Compendium of Reproducible and Reusable Research. 1

Introduction

This case study illustrates how free online statistical software (R modules) can be used in data analysis. This tutorial can be used at an undergraduate level the prerequisites are as follows: • a basic understanding of Descriptive Statistics • an interest in scientiﬁc data analysis • basic ICT skills All statistical techniques are based on the online “e-Handbook of Statistical Methods” [16] that is available online as HTML and PDF documents. The structure of the e-Handbook is like a cookbook: it contains a list of statistical methods in alphabetical order. There is also an overview of all methods that is ordered by type of analysis. In my experience with students it is best to use the electronic PDF version of the e-Handbook because it can be easily browsed and searched. Each method is explained in simple words and with a minimum of mathematical background. In addition, the documents contains numerous illustrations and examples which may also be beneﬁcial. The case-based approach of this tutorial and the technology that allows to reproduce/reuse every aspect of the research, is consistent with the pedagogical paradigm of (social) constructivism. The fact that this tutorial has been designed as a compendium makes every aspect of the portayed research reproducible which empowers anyone with a healthy interest in statistics to interact (and conduct experiments) with the underlying statistical science. From the statistical point of view, the following concepts are used in this tutorial:

2 pages, 930 words

The Essay on Statistical Research in Psychology

Many students, including myself, are unaware how important statistics can be to the research process in Psychology. In this report, I will discuss the method used to perform researches as well as the forms of data used through statistic in Psychology. This will include the advantages and disadvantages of each form used. Introduction Many studying Psychology might be shocked that statistics is a ...

1 Acknowledgements: The Compendium Platform for Reproducible Research was partially funded by the OOF 2007/13 grant of the K.U.Leuven Association.

• Measures of Central Tendency (arithmetic mean and median) • Explorative Data Analysis (such as Notched Boxplots) • The importance of outliers • Statistical Hypothesis Testing (one-sided and two-sided) • Simple and multiple regression Analysis A ﬁnal word of caution: The models that are presented here are illustrative and simplistic in nature. Nevertheless the document should contain some interesting and challenging concepts for students at an undergraduate level.

2.1

How much should we pay for coﬀee?

Problem

We want to import Arabica coﬀee from Colombia and sell it in the USA. Therefore we would like to explore and estimate the relationship between the monthly time series Yt (the US retail price in US cents per lb) and Xt (the price paid to growers in Colombia in US cents per lb).

We suspect that the increase of prices paid to growers (on the long run) is substantially lower than the rise in retail prices in the USA. 2.1.1 Hypotheses

Hypothesis 1 Arabica coﬀee prices paid to growers in Colombia are constant. We deﬁne H0 : αX = c and HA : αX = c where c is any “positive constant number” for the following equation: Xt = αX + et for t = 1, 2, …T . Note: if the coﬀee prices are constant this implies that the constant αX can be used to make predictions about Xt on the long run. Hypothesis 2 Retail prices of Arabica coﬀee in the USA are constant. We deﬁne H0 : αY = c and HA : αY = c where c is any “positive constant number” for the following equation: Yt = αY + et for t = 1, 2, …T . Note: if the coﬀee prices are constant this implies that the constant αY can be used to make predictions about Yt on the long run. Hypothesis 3 The retail prices of Arabica coﬀee (in the USA) can be (partially) explained by the price paid to growers in Colombia. We deﬁne H0 : β = 0 and HA : β > 0 for the following equation: Yt = α + βXt + … + et for t = 1, 2, …T . Note: we use a one-sided hypothesis test because we expect β to be positive. Hypothesis 4 The long run growth of retail prices of Arabica coﬀee (in the USA) cannot be explained by the prices paid to growers in Colombia. In other words, an additional trend variable is needed to account for the long run increase in US retail prices. We deﬁne H0 : γ = 0 and HA : γ > 0 for the following equation: Yt = α + βXt + γt + … + et for t = 1, 2, …T . Note: we use a one-sided hypothesis test because we expect γ to be positive. 2

3 pages, 1305 words

The Essay on The Rising Price of Food

Recent years have seen dramatic increases in the world prices for food commodities. The first half of the year 2008 saw the price of rice go up by 50% and generally speaking, similar increases in other food commodities such as maize, soybeans and wheat have been seen across the world, resulting in various forms of panic. In the Philippines, farmers have begun hoarding supplies of rice, while ...

Arabica Price in Colombia

value

40 0

100

120

140

160

100

150 time

200

250

300

350

Figure 1: Run Sequence Plot – Arabica Coﬀee Price in Colombia link of reproducible computation

2.2

Data

The following time series were obtained from the International Coﬀee Organization (ICO): • The univariate time series of prices paid to growers in Colombia in US cents/lb (Figure 1 ) [1]. This time series is denoted Xt for t = 1, 2, …T where T is the number of observations. • The univariate time series of retail prices in the USA in US cents/lb (Figure 2) [2]. This time series is denoted Yt for t = 1, 2, …T where T is the number of observations. Both time series range from January 1977 – December 2006 and can be viewed in a Google Spreadsheet (click to open Google Spreadsheet) [4]. As an alternative, the bivariate time series of Arabica Coﬀee in Colombia and the USA in US cents/lb (Figure 3) can be retrieved from the archive at FreeStatistics.org [3].

2.3

Analysis

Each hypothesis is investigated in turn, based on statistical models that are consistent with the theoretical concepts that are outlined in Chapter 1 (Explorative Data Analysis) [16]. 2.3.1 Univariate Analysis of Xt

There are several ways to estimate αX in Xt = αX +et for t = 1, 2, …T . First we investigate if the arithmetic mean could be an appropriate choice for estimating αX . We know that the arithmetic mean can be very sensitive with respect to outliers. Therefore, we compute the winsorized and trimmed (arithmetic) mean of Xt (click to reproduce).

3 pages, 1047 words

The Essay on Slaughterhouse five Realitivity Of Time

Many writers in history have written science fiction novels and had great success with them, but only a few have been as enduring over time as Kurt Vonneguts Slaughterhouse-Five. Slaughterhouse-Five is a personal novel which draws upon Vonneguts experiences as a scout in World War Two, his capture and becoming a prisoner of war, and his witnessing of the fire bombing of Dresden in February of 1945 ...

The arithmetic mean of Xt is 77.54 US cents/lb [5] when all observations are considered. However, if we “winsorize” the observations by making extreme

Arabica Price in the USA

value

250 0

300

350

400

450

100

150 time

200

250

300

350

Figure 2: Run Sequence Plot – Arabica Coﬀee Price in the USA link of reproducible computation

Plot of x

120 160 450

Plot of y

cents/lb

cents/lb 0 50 100 150 200 250 300 350

250 0

350

100

150

200

250

300

350

time

Histogram of x

Histogram of y

Frequency

Frequency 40 60 80 100 x 120 140 160 180

0 20

250

300

350 y

400

450

Figure 3: Run Sequence Plots and Histograms – Arabica Coﬀee Price in Colombia (x) and the USA (y) link of reproducible computation

Robustness of Central Tendency

Winsorized Mean(j/n)

75 0

60 j

100

120

Figure 4: Robustness of Central Tendency – Arabica Coﬀee Price in Colombia link of reproducible computation values less extreme [10] then the result in Figure 4 is obtained. As we move along the x-axis (from left to right) we observe how the winsorized mean decreases as more extreme values are made less extreme. If we “winsorize” about 60 observations on both sides of the distribution (j = 60), the arithmetic mean is approximately 76 cents/lb. If we consider the hypothesis H0 : αX = 77.54 and HA : αX = 77.54 then we observe that 77.54 falls outside of the 95% conﬁdence interval when j = 60. Hence, we reject the nullhypothesis that the arithmetic mean is equal to 77.54 at the 5% type I error level. In other words, the extreme values in Xt cause the arithmetic mean to be biased and therefore the arithmetic mean is probably not a good choice for αX . How about other measures of central tendency? It is often suggested that the median should be used instead of the arithmetic mean. The reseaon is that the median is not sensitive to outliers. Computation [5] shows that the median is 76.78 cents/lb which is smaller than the original arithmetic mean of 77.54 cents/lb. When the median and the arithmetic mean are unequal then we may conclude that the distribution is skewed (not symmetric).

2 pages, 797 words

The Essay on Cobweb Theorem Price Figure Supply

Cobweb model (know as Hog Cycle) is a dynamic analysis which provides an explanation for certain types of cyclical behaviours due to regular fluctuation in price and output. This model, as most other economic models, is based on some assumptions. It assumes that prices are determined by current prices i. e. farmers expect that the future price will be the same as the present ones. This is known as ...

We now have to determine if the median and arithmetic mean are equal or signiﬁcantly diﬀerent. Unfortunately, hypothesis tests about the median are not easily derived. However, we could use notched boxplots (a tool from Explorative Data Analysis) combined with the Blocked Bootstrap method to further investigate this question. The wessa.net website hosts an R module called “Blocked Bootstrap Plot Central Tendency” which compares three measures of central tendency (arithmetic mean, median, and midrange) by means of simulation for any univariate time series (see also 1.3.3.4. Bootstrap Plot [16]).

The idea is to simulate a (large) number of instances of the original time series by blocked bootstrapping. Then the three measures of central tendency are computed for each simulated series. The results can be summarized by the use of notched boxplots that allow us to compare them. Instead of using the default R module we make some changes by “editing” the underlying R source code (click to reproduce) [6]: • we eliminate the midrange computation (because it is of no interest for 5

Bootstrap Simulation − Central Tendency

simulated values

mean

median

Figure 5: Notched Boxplots of Blocked Bootstrap – Arabica Coﬀee Price in Colombia link of reproducible computation our investigation) • we add a table that displays the conﬁdence intervals of the notched boxplot procedure as implemented in the R language • we deleted all pictures with the exception of the notched boxplots (this is the only picture of interest for our investigation) The result in Figure 5 clearly shows that the notches of both boxplots do not overlap (this can be veriﬁed in the table with conﬁdence bounds [6]).

2 pages, 623 words

The Essay on Application Of The Vrine Model In The Analysis Of Cisco Systems, Inc.

Cisco’s acquisition of more than 115 companies since 1993 could have meant a significant challenge for integrating networks and other IT elements. Instead, Cisco IT has developed a standard set of principles and processes to help accomplish these integrations rapidly, consistently, and with minor disruption. Cisco IT continuously improves its integration expertise by applying the standards to each ...

The conclusion is that the median of simulated arithmetic means is diﬀerent from the median of simulated medians. If the median and the mean of Xt are unequal then we may fairly assume the distribution of Xt to be skewed (compare this result with the histogram of Figure 3).

Let us continue the analysis with the median as estimator for αX . We are interested to know if we can make fairly appropriate predictions of Xt for t = T + 1, T + 2, …, T + K where K is the forecast horizon. More formally the forecast is deﬁned by Ft ≡ Xt − et = αX for t = T + 1, T + 2, …T + K (where αX = median(Xt ) for t = 1, 2, …T ).

To test the hypothesis that the median is constant we need to address two questions: 1. is the median idependent of the month (seasonality)? 2. is the median constant on the long run? The R module called “Mean Plot” allows us to investigate the median of periodic and sequential subseries as described in 1.3.3.20 Mean Plot [16]. The mean plot of Xt (click to reproduce) [7] clearly answers both questions: 1. there is no indication of seasonality in the time series (see Figure 6).

Hence, the median does not depend on the month. 2. there is clear evidence of a trend-like behaviour in the time series under investigation (see Figure 7).

Hence, the median is not constant on the long run. 6

Notched Box Plots − Periodic Subseries

Value

100

120

140

160

Periodic Index

Figure 6: Notched Boxplots – Periodic Subseries – Arabica Coﬀee Price in Colombia link of reproducible computation

Notched Box Plots − Sequential Blocks

Value

100

120

140

160

Block Index

Figure 7: Notched Boxplots – Sequential Subseries – Arabica Coﬀee Price in Colombia link of reproducible computation Figure 7 is of particular interest in our analysis. It is clear from the analysis that the medians in sequential years are signiﬁcantly diﬀerent. Therefore the use of the median as a predictor (Ft = αX ) is problematic when it is computed over the entire observation period. It looks like the recent past deserves more weight in computing αX than the distant past. The answer to Hypothesis 1 is clear. The null hypothesis H0 : αX = c is rejected and the alternative hypothesis HA : αX = c accepted. The model Xt = αX + et for t = 1, 2, …T and the forecasts generated by Ft = αX = c for t = T + 1, T + 2, …T + K are unacceptable. Another way to investigate the proposed model is by means of the so-called 4-Plot as described in 1.3.3.32 4-Plot [16]. The implementation of the 4 Plot in wessa.net has 6 charts and is called “Univariate Explorative Data Analysis” (the Densityplot and Autocorrelation Function have been added) [13]. The underlying assumptions about the model can be investigated [8] with this R 7

7 pages, 3404 words

The Term Paper on Painting Analysis in Jane Eyre

From the opening chapter of Charlotte Brontë’s Jane Eyre the reader becomes aware of the powerful role that art plays. There is something extraordinary about the pictures Jane admires from other artists, as well as the work she creates herself. Her solitary pastime often operates as an outlet of pain, either past or present, and offers her the opportunity to deal with unpleasant emotions and ...

module and reveals that (click to reproduce): • the time series does not behave like a constant (c.q. Xt = αX + et is not appropriate) • the variance is not constant • the time series is not normally distributed (but skewed to the right) • the time series is autocorrelated (and even contains high order autocorrelation) 2.3.2 Univariate Analysis of Yt

To repeat the analysis that was performed in the previous section we simply need to substitute Xt for Yt . There are several ways to estimate αY in Yt = αY +et for t = 1, 2, …T . First we investigate if the arithmetic mean could be an appropriate choice for estimating αY . We know that the arithmetic mean can be very sensitive with respect to outliers. Therefore, we compute the winsorized and trimmed (arithmetic) mean of Yt (click to reproduce).

Assignment Fill in the blanks in the following text. The arithmetic mean of Yt is US cents/lb [9] when all observations are considered. If we “winsorize” more than 50 observations on both sides of the distribution, the arithmetic mean is signiﬁcantly lower. In other words, the extreme values in Yt cause the arithmetic mean to be biased and therefore the arithmetic mean is probably not a good choice for αY . As we move along the x-axis of the winsorized or trimmed mean chart, the computed mean (slowly) converges towards US cents/lb which is the (select from this list: midrange, midmean, harmonic mean, median, mode, geometric mean).

Assignment Investigate Yt and answer the following questions: 1. Is the distribution of Yt skewed? Use at least two diﬀerent techniques to answer this question. 2. Does Yt contain seasonality? 3. Are the forecasts generated by Ft = αY = c for t = T + 1, T + 2, …T + K acceptable? Hint: make sure to blog (archive) your computations and include the hyperlinks of the archived computations in your document. 2.3.3 regression model 1

The third hypothesis under investigation focuses on the question if US retail prices of Arabica coﬀee can be explained by the price paid to growers in Colombia. We deﬁned H0 : β = 0 and HA : β > 0 for the following equation: Yt = α + βXt + … + et for t = 1, 2, …T . Logically we expect that retail prices will reﬂect production costs which leads to the conclusion that the parameter β should be positive.

The hypothesis can be investigated by the use of the so-called 6-Plot (1.3.3.33 6-Plot [16]) which is basically an EDA-based graphical representation of the simple linear regression model. This analysis has been implemented in wessa.net with 9 charts (instead of 6) and is named “Linear Regression Graphical Model Validation” [14]. The result of this analysis [11] clearly shows that the null hypothesis H0 : β = 0 is rejected in favor of the alternative HA : β > 0 (click to reproduce).

The economic interpretation is that the cost of production is reﬂected in the ˆ retail prices. The estimated value of the regression parameter β ≃ 1.84 implies that an increase of 1 US cent/lb of Xt is multiplied by a factor 1.84 to get the ˆ estimated USA retail price Yt . The result of this regression estimation should be subjected to careful “model validation” (testing the underlying model assumptions).

Within the framework of EDA this is preferably done with graphical analysis. Assignment Answer the following questions based on the archived computation: 1. Does the scatterplot between Xt and Yt reveal/suggest a linear relationship? 2. Is there any evidence for autocorrelation?

3. Are the estimated residuals normally distributed? 4. Is there any evidence for the presence of outliers? 5. Overall, do you think the assumptions of the model are satisﬁed? 2.3.4 Regression Model 2

Let us suppose that the simple regression model of the previous section was misspeciﬁed (or incomplete) – in other words, let us suppose that a linear trend should have been added to the equation as deﬁned in Hypothesis 4 of section 2.1.1. We can’t use the 6-Plot (1.3.3.33 6-Plot [16]) from the previous section because we now have three variables instead of two. There are two exogenous (c.q. explanatory) variables (Xt and t) and one endogenous variable (Yt ).

The newly added variable can be interpreted as a linear trend (or a long term eﬀect of time).

One might wonder if the addition of this new variable does make sense at all? Would it aﬀect the estimated relationship between Xt and Yt (represented ˆ by the parameter β)? We can ﬁnd out the answer by making use of the multiple regression R module [15] which allows us to specify “multiple” (c.q. more than one) exogenous variables (hence the term “multiple regression”).

In this case the data have to be entered as a multivariate dataset. The easiest way to do this is to copy the multivariate dataset from a spreadsheet and paste it into the Data X textbox

2 You may not be familiar with the term autocorrelation. Therefore we could rephrase the question as follows: “Do prediction errors from the past contain any information that might help us to improve future predictions?” or “Is there any evidence that ρ(et , et−k ) = 0 for k = 1, 2, 3, …?”

Figure 8: Data Entry – Multiple Regression R module link of R module

Figure 9: Parameters – Multiple Regression R module link of R module

of the R module. You can use the Google Spreadsheet (click to open) that is available online. The data should be pasted all at once: select the data range of all values in de spreadsheet and copy to the clipboard (with Ctrl-Insert).

Then paste the contents of the clipboard into the “Data X” box of the R module. As a second step, we need to identiy the names of each column. We can do this by simply copying the contents of the cells (C1:D1) to the clipboard and pasting it into the “Names of X columns” box3 . Figure 8 is an illustration (partial screendump) of the R module after the multivariate dataset was pasted into the application. The model-specifying parameters of the multiple regression should be set correctly. First, we have to make sure that the “Column Number of Endogenous Series” ﬁeld contains the number 2 because we want to deﬁne the second column as the endogenous series (= Yt ).

Second, we select the option “Linear Trend” from the “Type of Equation” drop down list in order to make sure the software automatically generates an exogenous variable that represents time. An illustration of both model-speciﬁcation parameters can be found in Figure 9. The Multiple Regression R module produces a substantial amount of output. For the purpose of this tutorial however we focus exclusively on the aspects that are directly related to Hypothesis 4: is it necessary to include an additional trend variable to account for the long run increase in US retail prices? Based on the multiple regression analysis [12], the answer is aﬃrmative4 . The null hypothesis H0 : γ = 0 is rejected in favor of the alternative HA : γ > 0 as can be seen from the table with estimated regression parameters and respective p-values (click to reproduce).

ˆ The estimated parameters α ≃ 135.7, β ≃ 1.88 (slightly higher than in ˆ regression 1) and γ ≃ 0.15 can be substituted into the regression equation ˆ ˆ which yields: Yt = 135.7 + 1.88Xt + 0.15t for t = 1, 2, …T .

2.4

Important remark

As indicated in the introduction the analysis portrayed in this document is illustrative and simplistic in nature. More importantly, the underlying assumption of the multiple linear regression model are not satisﬁed. The multiple regression equation suﬀers from autocorrelation and heteroskedasticity which leads to ineﬃcient estimation. It is highly likely that both problems are caused by a misspeciﬁcation of the regression model (unobserved variables) which may lead to biased parameter estimates. As a consequence our conclusions about Hypothesis 3 and 4 should be revisited in the context of a more adequate (complete) model speciﬁcation. Assignment If you are familiar with autocorrelation and heteroskedasticity you may try to answer the following questions based on the archived computation:

that the column names must not be too long nor contain any spaces! with any regression model the validity of the hypothesis test depends on underlying assumptions. The answer that is provided here is conditional (c.q. given that the underlying assumputions are satisﬁed).

4 As 3 Note

• Are the residuals of the regression model autocorrelated? How do you detect this? • Is there any evidence for the presence of high-order autocorrelation? • Does the model really suﬀer from heteroskedasticity? Use two diﬀerent diagnostic tests to conﬁrm your answer. • Could the model be improved by taking into account seasonal eﬀects? Hint: reproduce the model with monthly seasonal dummies.

References

[1] Statistical ﬁce for FreeStatistics.org, Ofand Education, URL http://www.freestatistics.org/blog/date/2008/Jan/06/t1199574888xvrybp4pl7fyg1j.htm , Retrieved Sun, 06 Jan 2008 00:15:06 +0100 FreeStatistics.org, Ofand Education, URL http://www.freestatistics.org/blog/date/2008/Jan/06/t11995765663rnq0f6tqd9tvuc.htm , Retrieved Sun, 06 Jan 2008 00:42:50 +0100 FreeStatistics.org, Ofand Education, URL http://www.freestatistics.org/blog/date/2008/Jan/06/t11995770459lp663o7ht2jxj1.htm , Retrieved Sun, 06 Jan 2008 00:50:48 +0100 Computations at Research Development Computations at Research Development Computations at Research Development

[2] Statistical ﬁce for

[3] Statistical ﬁce for

[4] http://spreadsheets.google.com/ccc?key=pV35do1d8bhGWVkmHcU2xRg&hl=en [5] Statistical ﬁce for FreeStatistics.org, Ofand Education, URL http://www.freestatistics.org/blog/date/2008/Jan/06/t1199655142l1mf45ba5ydsjhm.htm , Retrieved Sun, 06 Jan 2008 22:32:30 +0100 FreeStatistics.org, Ofand Education, URL http://www.freestatistics.org/blog/date/2008/Jan/07/t1199701558w9l4qhatxnfi326.htm , Retrieved Mon, 07 Jan 2008 11:26:04 +0100 FreeStatistics.org, Ofand Education, URL http://www.freestatistics.org/blog/date/2008/Jan/07/t119971350673encrjvyxgcqw7.htm , Retrieved Mon, 07 Jan 2008 14:45:24 +0100 FreeStatistics.org, Ofand Education, URL http://www.freestatistics.org/blog/date/2008/Jan/07/t1199715933ou67sm0izgd06ae.htm , Retrieved Mon, 07 Jan 2008 15:25:36 +0100 FreeStatistics.org, Ofand Education, URL http://www.freestatistics.org/blog/date/2008/Jan/19/t12007439926t0l73gl5ozrq14.htm , Retrieved Sat, 19 Jan 2008 12:59:59 +0100 12 Computations at Research Development Computations at Research Development Computations at Research Development Computations at Research Development Computations at Research Development

[6] Statistical ﬁce for

[7] Statistical ﬁce for

[8] Statistical ﬁce for

[9] Statistical ﬁce for

[10] Borghers E. and P. Wessa, Descriptive Statistics – Central Tendency Winsorized Mean, Oﬃce for Research Development and Education, URL mean.htm, Retrieved Mon, 07 Jan 2008 10:40 +0100 [11] Statistical ﬁce for FreeStatistics.org, Ofand Education, URL http://www.freestatistics.org/blog/date/2008/Feb/26/t12040214435my4uqfrs09mgrk.htm , Retrieved Tue, 26 Feb 2008 11:24:09 +0100 FreeStatistics.org, Ofand Education, URL http://www.freestatistics.org/blog/date/2008/Feb/26/t1204025214shjxx5mlpuxjln9.htm , Retrieved Tue, 26 Feb 2008 12:26:59 +0100 Computations at Research Development Computations at Research Development

[12] Statistical ﬁce for

[13] Wessa P., (2007), Univariate Explorative Data Analysis (v1.0.5) in Free Statistics Software (v1.1.22-r4), Oﬃce for Research Development and Education, URL edauni.wasp [14] Wessa P., (2007), Linear Regression Graphical Model Validation (v1.0.2) in Free Statistics Software (v1.1.22-r4), Oﬃce for Research Development and Education, URL linear regression.wasp [15] Wessa P., (2008), Multiple Regression (v1.0.25) in Free Statistics Software (v1.1.22-r4), Oﬃce for Research Development and Education, URL multipleregression.wasp/ [16] NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/, Retrieved Mon, 07 Jan 2008 02:01 +0100.

Introduction

References

Similar Papers