Statistical Computing with R. Chapman & Hall/CRC Computer Science and Data Analysis Series. This book uses the R computing environment for practical examples: R serves both as a statistical package and as a general programming environment.

Statistical Computing with R


New York: Chapman & Hall/CRC. This book is an introduction to statistical computing and computational statistics. You should read this chapter while sitting at the computer with an R window open; we do our computing in the open-source package R, a command-based statistical software environment. The R manuals are available at CRAN/doc/manuals/r-release/, along with other contributed manuals.

This saves a matrix multiplication. In this section each method of generating MVN random samples is illustrated with examples. Also note that there are functions provided in R packages for generating multivariate normal samples; see the mvrnorm function in the MASS package and rmvnorm in the mvtnorm package.

In all of the examples below, the rnorm function is used to generate standard normal random variates. This method can also be called the eigen-decomposition method. The scatter plot of the sample data is shown in Figure 3. The Choleski factorization is implemented in the R function chol. (The sample correlation matrix of the sepal and petal length and width variables is omitted here.) After generating the sample with rmvn.Choleski(n, mu, Sigma) and plotting with pairs(X), the pairs plot of the data in Figure 3. shows each pair of marginal distributions; the joint distribution of each pair is theoretically bivariate normal.
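As a sketch of the Choleski approach described above (the function name rmvn.Choleski follows the text; the body and the parameter values are a reconstruction):

```r
# Choleski factorization method for generating multivariate normal samples.
rmvn.Choleski <- function(n, mu, Sigma) {
  d <- length(mu)
  Q <- chol(Sigma)                      # upper triangular, Sigma = t(Q) %*% Q
  Z <- matrix(rnorm(n * d), nrow = n, ncol = d)  # rows are iid N(0, I_d)
  Z %*% Q + matrix(mu, n, d, byrow = TRUE)
}

mu <- c(0, 0)
Sigma <- matrix(c(1, 0.9, 0.9, 1), 2, 2)
X <- rmvn.Choleski(200, mu, Sigma)
pairs(X)    # scatterplot matrix of the generated sample
```

If Z has iid N(0, 1) entries, then ZQ has covariance t(Q)Q = Sigma, so adding the mean vector row-wise gives the desired distribution.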

The iris virginica data are not multivariate normal, but the means and correlation for each pair of variables should be similar to those of the simulated data. (The table of sample means of the four variables is omitted here.) Figure caption: Pairs plot of the bivariate marginal distributions of a simulated multivariate normal random sample in Example 3. The parameters match the mean and covariance of the iris virginica data.

Remark 3. The transformed d-dimensional sample then has zero mean vector and covariance Id. This is not the same as scaling the columns of the data matrix. When several methods are available, which method is preferred? One consideration may be the computational time required (the time complexity). Another important consideration, if the purpose of the simulation is to estimate one or more parameters, is the variance of the estimator.
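A minimal sketch of this standardizing transformation (the helper name standardize.mv is ours, not the text's):

```r
# Transform a sample to zero mean vector and identity sample covariance.
# Note this is NOT the same as scale(X): scaling columns separately
# leaves the correlations between columns intact.
standardize.mv <- function(X) {
  Xc <- scale(X, center = TRUE, scale = FALSE)  # subtract column means
  Xc %*% solve(chol(cov(X)))                    # decorrelate via the Choleski factor
}

X <- matrix(rnorm(2000), ncol = 2) %*% chol(matrix(c(1, 0.8, 0.8, 1), 2, 2))
Y <- standardize.mv(X)
round(cov(Y), 6)    # identity matrix, up to rounding
```

Since cov(Xc) = t(U) %*% U with U = chol(cov(X)), the transformed sample Xc %*% solve(U) has sample covariance exactly the identity.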

The latter topic is considered in Chapter 5. To compare the empirical performance with respect to computing time, we can time each procedure.

R provides the system.time function to measure the running time of an expression. In the next example, system.time is applied to compare the generation methods. This example uses the function rmvnorm in the package mvtnorm. The covariances used for this example are actually the sample covariances of standard multivariate normal samples. In order to time each method on the same covariance matrices, the random number seed is restored before each run.
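A sketch of such a timing comparison with system.time; we assume the MASS and mvtnorm packages are installed, and the dimension and replication counts are illustrative:

```r
library(MASS)       # provides mvrnorm
library(mvtnorm)    # provides rmvnorm
n <- 100; d <- 30
mu <- numeric(d)

# Restore the seed before each run so that each method is timed
# on the same sequence of sample covariance matrices.
set.seed(100)
system.time(for (i in 1:100) {
  S <- cov(matrix(rnorm(n * d), n, d))
  mvrnorm(n, mu, S)
})

set.seed(100)
system.time(for (i in 1:100) {
  S <- cov(matrix(rnorm(n * d), n, d))
  rmvnorm(n, mu, S)
})

# The last run generates only the covariances, for comparison with the totals.
set.seed(100)
system.time(for (i in 1:100) S <- cov(matrix(rnorm(n * d), n, d)))
```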

The last run simply generates the covariances, for comparison with the total time. The Choleski method is somewhat faster, while the remaining methods have similar performance, which is expected since the code (not shown) is similar to the examples above. As the mixing parameter p and other parameters are varied, multivariate normal mixtures exhibit a wide variety of types of departures from normality. Parameters can be varied to generate a wide variety of distributional shapes. Johnson gives many examples for bivariate normal mixtures.

Many commonly applied statistical procedures do not perform well under this type of departure from normality, so normal mixtures are often chosen to compare the properties of competing robust methods of analysis. If X has the distribution (3.), the following procedure is equivalent. Use the mvrnorm function in the MASS package to generate the multivariate normal observations.

We will eliminate the loop later. Generate a random permutation of the indices 1:n (see Appendix B). (A more efficient, vectorized version of the location-mixture generator is omitted here.) All of the one-dimensional marginal distributions are univariate normal location mixtures. Methods for visualization of multivariate data are covered in Chapter 4.
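The loop-free location-mixture generator can be sketched as follows (the mixing probability and mean vectors are illustrative; mvrnorm is from MASS):

```r
library(MASS)                        # for mvrnorm
n <- 1000; p <- 0.75                 # mixing probability (illustrative)
mu1 <- c(0, 0); mu2 <- c(3, 3)

k <- rbinom(n, size = 1, prob = p)   # component labels, no loop needed
X <- mvrnorm(n, mu2, diag(2))        # start with component 2 everywhere
X[k == 1, ] <- mvrnorm(sum(k), mu1, diag(2))  # overwrite the component-1 rows

hist(X[, 1], breaks = 30)            # each marginal is a univariate location mixture
```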


Also, an interesting view of a bivariate normal mixture with three components is shown in Figure (caption: Histograms of the marginal distributions of multivariate normal location mixture data generated in Example 3.). Implementation is left as an exercise. Random vectors uniformly distributed on the d-sphere have equally likely directions. A method of generating this distribution uses a property of the multivariate normal distribution. The ith row of M corresponds to the ith random vector ui.

Compute the denominator of (3.). Deliver the matrix U containing the n random observations in its rows. See the corresponding help topic. Figure caption: A random sample of points from the bivariate distribution (X1, X2) that is uniformly distributed on the unit circle in Example 3. Uniformly distributed points on a hyperellipsoid can be generated by applying a suitable linear transformation to a uniform sample on the d-sphere.
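The algorithm can be sketched in a few lines (runif.sphere is an illustrative name): generate n rows of iid standard normals and divide each row by its Euclidean norm, the denominator referred to above.

```r
# Uniform distribution of directions on the unit d-sphere.
runif.sphere <- function(n, d) {
  M <- matrix(rnorm(n * d), nrow = n, ncol = d)  # rows ~ N(0, I_d)
  M / sqrt(rowSums(M^2))                         # normalize each row to length 1
}

U <- runif.sphere(200, 2)
plot(U, asp = 1)    # the points fall on the unit circle
```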

Fishman [94, 3.] treats these methods. The index set T could be discrete or continuous. The set of possible values X(t) can take is the state space, which also can be discrete or continuous. Ross is an excellent introduction to stochastic processes, and includes a chapter on simulation. A counting process records the number of events or arrivals that occur by time t. A counting process has independent increments if the numbers of arrivals in disjoint time intervals are independent.

A counting process has stationary increments if the number of events occurring in an interval depends only on the length of the interval. An example of a counting process is a Poisson process. The set of times of consecutive arrivals records the outcome and determines the state X(t) at any time t. The interarrival times T1, T2, ... are iid exponential variables. One method of simulating a Poisson process is to generate the interarrival times. The state of the process at a given time t is equal to the number of arrivals in [0, t], which is min{k : Sk > t} − 1, where Sk is the time of the kth arrival. A loop-based implementation should be translated into vectorized operations, as shown in the next example.

Suppose we need N(3), the number of arrivals in [0, 3]. Given that the number of arrivals in (0, t] is n, the arrival times S1, ..., Sn are jointly distributed as the order statistics of n iid Uniform(0, t) variables. Returning to Example 3., as a check we estimate the mean and variance of N(3) over replications. Here the sample mean and sample variance of the generated values N(3) are indeed very close to 6. In this case, the process needs to be simulated for a longer time than the value in upper.
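The vectorized check can be sketched as follows; lambda = 2 and t0 = 3 give E[N(3)] = Var(N(3)) = 6, and upper = 100 is a generous bound on the number of arrivals:

```r
lambda <- 2; t0 <- 3; upper <- 100
set.seed(1)
N3 <- replicate(10000, {
  Sn <- cumsum(rexp(upper, rate = lambda))  # arrival times S_k
  min(which(Sn > t0)) - 1                   # N(t0) = min{k : S_k > t0} - 1
})
c(mean = mean(N3), var = var(N3))           # both close to lambda * t0 = 6
```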

For example, if we need N(t0), one approach is to wrap the min(which(...)) step with try and check that the result of try is an integer (see the corresponding help topics for details). Actually, the second method is considerably slower (by a factor of 4 or 5) than the previous method of Example 3.

The rexp generator is almost as fast as runif, while the sort operation adds O(n log n) time. Some performance improvement might be gained if this algorithm is coded in C and a faster sorting algorithm designed for uniform numbers is used.

A nonhomogeneous Poisson process has independent increments but does not have stationary increments. Every nonhomogeneous Poisson process with a bounded intensity function can be obtained by time sampling (thinning) a homogeneous Poisson process. To see this, let N(t) be the number of accepted events in [0, t]. The steps to simulate the process on an interval [0, t0] are as follows. Algorithm for simulating a nonhomogeneous Poisson process on an interval [0, t0] by sampling from a homogeneous Poisson process.
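A sketch of this thinning algorithm, with an illustrative bounded intensity lambda(t) = 3 cos²(t) dominated by lambda0 = 3:

```r
lambda0 <- 3; t0 <- 10
S <- cumsum(rexp(1000, rate = lambda0))   # homogeneous Poisson(lambda0) arrivals
S <- S[S <= t0]                           # keep arrivals in [0, t0]
accept <- runif(length(S)) <= (3 * cos(S)^2) / lambda0  # accept w.p. lambda(t)/lambda0
arrivals <- S[accept]                     # arrival times of the thinned process
length(arrivals)                          # N(t0) for the nonhomogeneous process
```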

This is shown in the next example, discussed in [, Sec.]. The process can be simulated by generating geometric interarrival times and computing the consecutive arrival times by the cumulative sum of the interarrival times. The plot is shown in Figure 3. (caption: Sequence of sample means of a simulated renewal process in Example 3.). The process has returned to 0 several times within the observed time interval. If the process has returned to the origin before time n, then to generate Sn we can ignore the past history up until the time the process most recently hit 0.

Then starting from the last return to the origin before time n, generate the increments Xi and sum them. Figure caption: Partial realization of a symmetric random walk in Example 3.
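A partial realization like the one in the figure can be generated directly (the horizon n = 400 is illustrative):

```r
set.seed(5)
n <- 400
increments <- sample(c(-1, 1), size = n, replace = TRUE)  # +/-1 with prob 1/2
S <- cumsum(increments)                                   # random walk states
plot(S, type = "l", xlab = "time", ylab = "state")
which(S == 0)    # times at which the walk returns to the origin
```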

Algorithm to simulate the state Sn of a symmetric random walk. The following algorithm is adapted from [69, XIV]. Let Wj be the waiting time until the jth return to the origin. Deliver Sn. The probability distribution of T is given in [69, Thm.]. The following methods are equivalent. Therefore, a generator can be written for values of T up to a given bound using the probability vector computed above. Suppose now that n is given and we need to compute the time of the last return to 0 in (0, n].

Here, instead of issuing a warning, one could append to the vector and return a valid T; we leave that as an exercise. A better algorithm is suggested by Devroye [69, p.]. One run of the simulation above generates the times at which the process visits 0 (uncomment the print statement to print the times). Algorithms for generating random tours in general are discussed by Fishman [94, Ch.]. For a more theoretical treatment see Durrett [77, Ch.].

See Franklin [98] for simulation of Gaussian processes. Functions to simulate long-memory time series processes, including fractional Brownian motion, are available in the R package fSeries. See Examples 2. Use the inverse transform method to generate a random sample from this distribution. Use one of the methods shown in this chapter to compare the generated sample to the target distribution.

Graph the density histogram of the sample with the Pareto(2, 2) density superimposed for comparison. Construct a relative frequency table and compare the empirical with the theoretical probabilities. Repeat using the R sample function.

Generate a random sample from the Beta(3, 2) distribution. Graph the histogram of the sample with the theoretical Beta(3, 2) density superimposed. Compare the histogram with the lognormal density curve given by the dlnorm function in R. Write a function to generate random variates from fe, and construct the histogram density estimate of a large simulated random sample. Make a conjecture about the values of p1 that produce bimodal mixtures.

Compare the empirical and theoretical Pareto distributions by graphing the density histogram of the sample and superimposing the Pareto density curve. Use the R pairs plot to graph an array of scatter plots for each pair of variables. That is, transform the sample so that the sample mean vector is zero and the sample covariance is the identity matrix.

To check your results, generate multivariate normal samples and print the sample mean vector and covariance matrix before and after standardization.

Each row of the data frame is a set of scores xi1, .... Standardize the scores by type of exam; that is, standardize the bivariate samples (X1, X2) (closed book) and the trivariate samples (X3, X4, X5) (open book). Compute the covariance matrix of the transformed sample of test scores. See Example 3. The game ends when either one of the players has all the money.

Let Sn be the fortune of player A at time n. Estimate the mean and the variance of X(10) for several choices of the parameters and compare with the theoretical values.

Chapter 4: Visualization of Multivariate Data. Tukey believed that it was important to do the exploratory work before hypothesis testing, to learn what the appropriate questions are and the most appropriate methods to answer them. Here we restrict attention to methods for visualizing multivariate data. In this chapter several graphics functions are used.

In addition to the R graphics package, which loads when R is started, other packages discussed in this chapter are lattice and MASS. Also see the rggobi interface to GGobi and the rgl [2] package for interactive 3D visualization.

Table 4. Chapter 1 gives a brief summary of options for colors, plotting symbols, and line types. For example, a scatterplot matrix displays the scatterplots for all pairs of variables in an array.

The pairs function in the graphics package produces a scatterplot matrix, as shown in Figures 4. An example of a panel display of three-dimensional plots is Figure 4. The pairs function takes an optional argument diag.panel, a function that determines what is plotted along the diagonal. For example, to obtain a graph with estimated density curves along the diagonal, supply the name of a function that plots the densities.
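A sketch of such a diagonal panel function (close to the text's approach; the name panel.d is illustrative):

```r
# Scatterplot matrix with estimated density curves along the diagonal.
panel.d <- function(x, ...) {
  opar <- par(usr = c(par("usr")[1:2], 0, 0.5))  # rescale the panel's y-axis
  on.exit(par(opar))                             # restore the user coordinates
  lines(density(x))                              # kernel density estimate
}

x <- scale(iris[101:150, 1:4])   # standardized iris virginica measurements
pairs(x, diag.panel = panel.d)
```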


Before plotting, we apply the scale function to standardize each of the one-dimensional samples. From the plot we can observe that the length variables are positively correlated, and the width variables appear to be positively correlated. Other structure could be present in the data that is not revealed by the bivariate marginal distributions.

The lattice package provides functions to construct panel displays. Here we illustrate the scatterplot matrix function splom in lattice. It is displayed here in black and white, but on screen the panel display is easier to interpret when displayed in color. Also see the 3D scatterplot of the iris data in Figure 4. Figure caption: Scatterplot matrix (pairs) comparing four measurements of the iris virginica species in Example 4. For other types of panel displays, see the conditioning plots [42, 48, 49] implemented in coplot.

Figure caption: Scatterplot matrix comparing four measurements of iris data. The persp graphics function draws perspective plots of surfaces over the plane. Try running the demo examples for persp to see many interesting graphs.

The command is simply demo(persp). We will also look at 3D methods in the lattice graphics package and the rgl package [2].

The command for this is expand.grid. Example 4. Most of the parameters are optional; x, y, z are required. For this function we need the complete grid of z values, but only one vector of x and one vector of y values.

The returned value is a matrix of function values for every point (xi, yj) in the grid. Storing the grid was not necessary. This transformation can be used to add elements to the plot. Example 4. (Figure caption: Perspective plot of the standard bivariate normal density in Example 4.) R provides a function trans3d to compute the coordinates above; here we have shown the calculations. (Figure caption: Perspective plot of the standard bivariate normal density with elements added using the viewing transformation returned by persp in Example 4.)
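The plot and the annotation step can be sketched as follows; the viewing angles are illustrative, and trans3d performs the projection described above:

```r
# Perspective plot of the standard bivariate normal density.
x <- y <- seq(-3, 3, length.out = 51)
z <- outer(x, y, function(x, y) exp(-(x^2 + y^2) / 2) / (2 * pi))
M <- persp(x, y, z, theta = 45, phi = 30, expand = 0.6,
           xlab = "x", ylab = "y", zlab = "f(x, y)")
# Add the mode of the density using the viewing transformation returned by persp.
points(trans3d(0, 0, 1 / (2 * pi), pmat = M), col = 2, pch = 16)
```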

Other functions for graphing surfaces: surfaces can also be graphed using the wireframe function in lattice. The syntax for wireframe requires that x, y, and z have the same number of rows. We can generate the matrix of (x, y) coordinates using expand.grid. If the rgl package is installed, run the demo; one of the examples shows a bivariate normal density. Actually, the data used to plot the surface in this demo is generated by smoothing simulated bivariate normal data.

Chapter 10 gives examples of methods to construct and plot density estimates for bivariate data. A possible application of this type of plot is to explore whether there are groups or clusters in the data.

The second part of the example illustrates several options. There are three species of iris and each is measured on four variables. The plot produced is similar to 3 in Figure 4. To see all four plots on the screen, use the more and split options. The split arguments determine the location of the plot within the panel display. The plots show that the three species of iris are separated into groups or clusters in the three dimensional subspaces spanned by any three of the four variables.

There is some structure evident in these plots. One might follow up with cluster analysis or principal components analysis to analyze the apparent structure in the data. The screen option sets the orientation of the axes. Syntax for print(cloud): to split the screen into n rows and m columns and put the plot into position (r, c), set split equal to the vector (r, c, n, m).

See the corresponding print method for details. The functions contour (graphics) and contourplot (lattice) produce contour plots. The function filled.contour (graphics) produces filled contour plots. A variation of this type of plot is image (graphics), which uses color to identify contour levels.

The data is an 87 by 61 matrix containing topographic information for the Maunga Whau volcano. It may also be interesting to see the 3D surface of the volcano for comparison with the contour plots. A 3D view of the volcano surface is provided in the examples of the persp function; the R code for the example is in the persp help page. To run the example, type example(persp). If the rgl package is installed, an interactive 3D view of the volcano appears in the examples.

The image function in the graphics package provides the color background for the plot. The plot produced below is similar to Figure 4. Contour plot and levelplot of volcano data in Examples 4.

Using image without contour produces essentially the same type of plot as filled.contour. Compare the plot produced by image with the following two plots; the display on the screen will be in color.

In this case, the 2D scatterplot does not reveal much information about the bivariate density. The hexbin function in package hexbin [38] (available from the Bioconductor repository) produces a basic version of this plot in grayscale, shown in Figure 4. Note that the darker colors correspond to the regions where the density is highest, and colors are increasingly lighter along radial lines extending from the mode near the origin. The plot exhibits approximately circular symmetry, consistent with the standard bivariate normal density.

The bivariate histogram can also be displayed in 2D using a color palette, such as heat.colors. A similar type of plot is implemented in the gplots package. The plot (not shown) resulting from the following code is similar to Figure 4. Figure caption: Flat density histogram of bivariate normal data with hexagonal bins produced by hexbin in Example 4. Other multivariate displays include, among others, Andrews curves, parallel coordinate plots, and various iconographic displays such as segment plots and star plots.

Leaf measurements from Queensland, Australia, for two types of leaf architecture are represented by Andrews curves. The data set is leafshape17 in the DAAG package. Three measurements (leaf length, petiole, and leaf width) correspond to points in R3. In general, this type of plot may reveal possible clustering of data. By default, the statistic is subtracted, but other operations are possible. Then the ranges of each of the three columns in r are swept out; that is, each column is divided by its range.

Figure caption: Andrews curves for the leafshape17 (DAAG) data. R note 4. The representation of vectors by parallel coordinates was introduced by Inselberg and applied for data analysis by Wegman.

Rather than represent axes as orthogonal, the parallel coordinate system represents axes as equidistant parallel lines. Usually these lines are horizontal with common origin, scale, and orientation. Then to represent vectors in Rd, the parallel coordinates are simply the coordinates along the d copies of the real line. Each coordinate of a vector is plotted along its corresponding axis, and the points are joined together with line segments. Parallel coordinate plots are implemented by the parcoord function in the MASS package and the parallel function in the lattice package.

The parcoord function displays the axes as vertical lines. The panel function parallel displays the axes as horizontal lines. The crabs data frame has 5 measurements on each crab, in four groups. The graph is best viewed in color.

Much of the variability between groups is in overall size. Adjusting the measurements of individual crabs for size may produce more interesting plots. Following the suggestion in Venables and Ripley, we adjust the measurements by the area of the carapace. The Andrews curves in Example 4. were displayed superimposed on the same coordinate system. Other representations as icons are best displayed in a table, so that features of observations can be compared.

A tabular display does not have much practical value for high-dimensional or large data sets, but can be useful for some small data sets. Some examples include star plots and segment plots. This type of plot is easily obtained in R using the stars graphics function. Figure caption: Parallel coordinate plots in Example 4. As in Example 4., the observations are labeled by species. The plot suggests, for example, that orange crabs have greater body depth relative to carapace width than blue crabs. The measurements have been adjusted by the overall size of the individual crab.

The two species are blue (B) and orange (O). Principal components analysis similarly uses projections. Dimension is reduced by projecting onto a small number of the principal components that collectively explain most of the variation. Pattern recognition and data mining are two broad areas of research that use some visualization methods.

See Ripley or Duda and Hart [75]. An interesting collection of topics on data mining and data visualization is found in Rao, Wegman, and Solka. For an excellent resource on visualization of categorical data, see Friendly and the accompanying web site. In addition to the R functions and packages mentioned in this chapter, several methods are available in other packages.

Again, here we only name a few. Mosaic plots for visualization of categorical data are available in mosaicplot. Also see the package vcd for visualization of categorical data. The functions prcomp and princomp provide principal components analysis.

Many packages for R fall under the data mining or machine learning umbrella; for a start, see nnet, rpart, and randomForest. Also see the R graph gallery. The rggobi package provides a command-line interface to GGobi, which is an open-source visualization program for exploring high-dimensional data. GGobi has a graphical user interface, providing dynamic and interactive graphics.

The GGobi software can be obtained from the GGobi web site; readers are referred to the documentation and examples there. Exercises 4. Generate a bivariate random sample from the joint distribution of (X, Y) and construct a contour plot.

Adjust the levels of the contours so that the contours of the second mode are visible. Compare the plots before and after adjusting the measurements by the size of the crab. Interpret the resulting plots. Set line type to identify leaf architecture as in Example 4., and compare with the plot in Figure 4. Produce Andrews curves for each of the six locations; split the screen into six plotting areas, and display all six plots on one screen. Set line type or color to identify leaf architecture.

Display a segment-style stars plot for leaf measurements at latitude 42 (Tasmania); repeat using the logarithms of the measurements. Another well-known example is that of W. Teams of scientists at the Los Alamos National Laboratory and many other researchers contributed to the early development of Monte Carlo methods, including Ulam, Richtmyer, and von Neumann.

The simple Monte Carlo estimator of the integral of g(x) over [0, 1] is the sample mean (1/m) * (g(X1) + ... + g(Xm)), where X1, ..., Xm are iid Uniform(0,1). Example 5. Generate X1, ....


However, we can break this problem into two cases. Suppose that we prefer an algorithm that always samples from Uniform(0,1); this can be accomplished by a change of variables. Generate iid Uniform(0,1) random numbers u1, .... The Monte Carlo estimates appear to be very close to the pnorm values. The estimates will be worse in the extreme upper tail of the distribution; this is left as an exercise.
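The change of variables t = x·u turns the integral over (0, x) into an expectation with respect to Uniform(0,1), which gives the following sketch of the estimator:

```r
# MC estimate of Phi(x) for x > 0:
#   Phi(x) = 1/2 + (1 / sqrt(2*pi)) * integral_0^x exp(-t^2 / 2) dt
#          = 1/2 + (x / sqrt(2*pi)) * E[ exp(-(x*U)^2 / 2) ],  U ~ Uniform(0,1)
set.seed(123)
x <- 1.5
m <- 100000
u <- runif(m)
cdf <- 0.5 + x * mean(exp(-(x * u)^2 / 2)) / sqrt(2 * pi)
c(estimate = cdf, pnorm = pnorm(x))   # the two values agree closely
```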


In fact, the integrand of the previous example is itself a density function, and we can generate random variables from this density. This provides a more direct approach to estimating the integral. When the distribution of X is unknown, we substitute for FX the empirical distribution Fm of the sample x1, .... That is, (5.).

The Monte Carlo estimate (5.) is computed as follows. Generate a random sample X1, .... The transformed sample Y1, ... then yields the estimate. Unless it is unclear in context, for simplicity we use g(X).

However, a large increase in m is needed to get even a small improvement in standard error; to reduce the standard error by a factor of 10, the number of replicates must be increased 100-fold. Thus, although variance can always be reduced by increasing the number of Monte Carlo replicates, the computational cost is high. Other methods for reducing the variance can be applied that are less computationally expensive than simply increasing the number of replicates.

In the following sections some approaches to reducing the variance of this type of estimator are introduced. Several approaches have been covered in the literature; readers are referred to [69] and the references therein for more examples. This fact leads us to consider negatively correlated variables as a possible method for reducing variance.

For example, suppose that X1 ,.

Then in (5.), g is increasing if it is increasing in each of its coordinates; similarly, g is decreasing if it is decreasing in each of its coordinates. Then g is monotone if it is increasing or decreasing. Assume that f and g are increasing functions. The proof is by induction on n. Suppose that the statement (5.) holds. Without loss of generality we can suppose that g is increasing.

By restricting the simulation to the upper tail (see Example 5.), generate random numbers u1, .... The function MC.Phi below implements the estimator; optionally, MC.Phi will compute the estimate with or without antithetic sampling. There are some important differences between R and S, but much of the code written for S runs unaltered. R is easily extensible through functions and extensions, and the R community is noted for its active contributions in terms of packages.
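A sketch of MC.Phi consistent with its description in the text, with the antithetic option implemented by pairing each u with 1 − u:

```r
MC.Phi <- function(x, R = 10000, antithetic = TRUE) {
  u <- runif(R / 2)
  v <- if (antithetic) 1 - u else runif(R / 2)  # antithetic pairs (u, 1 - u)
  u <- c(u, v)
  g <- x * exp(-(u * x)^2 / 2)   # integrand after the change of variables t = x*u
  0.5 + mean(g) / sqrt(2 * pi)
}

# Compare the sampling variance with and without antithetic variates.
set.seed(42)
m <- 1000
est.simple <- replicate(m, MC.Phi(1.95, antithetic = FALSE))
est.anti   <- replicate(m, MC.Phi(1.95))
var(est.anti) / var(est.simple)   # substantially less than 1
```

Because g is monotone in u, g(u) and g(1 − u) are negatively correlated, which is what drives the variance reduction.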

Many of R's standard functions are written in R itself, which makes it easy for users to follow the algorithmic choices made.

.NET [25] or Python code can manipulate R objects directly. Due to its S heritage, R has stronger object-oriented programming facilities than most statistical computing languages.

Extending R is also eased by its lexical scoping rules. Dynamic and interactive graphics are available through additional packages. The prefix [1] indicates that the list of elements following it on the same line starts with the first element of the vector, a feature that is useful when the output extends over multiple lines. R's data structures include vectors, matrices, arrays, data frames (similar to tables in a relational database), and lists. The scalar data type was never a data structure of R.
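A small illustration of the [1] prefix: narrowing the print width forces the vector to wrap, and each new line starts with the index of its first element.

```r
old <- options(width = 40)   # narrow the console so the printed vector wraps
print(1:30)                  # each line is prefixed by the index in brackets
options(old)                 # restore the previous width
```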

Writing Sweave documents takes much longer than writing R scripts, but it leads to self-documenting work that is likely to be understandable by many researchers long after it has been written. Here I give a short, non-exhaustive list of books that I recommend to students to complement lecture notes and to show applications of R in computational biology. Some of the books are quite advanced and are likely to be useful for students only after they have gained sufficient experience. I also take these books to lab sessions so that students can see which book would be most useful for them.

For a general introduction to R, Introductory Statistics with R [10] provides a nice balance of introducing R and showing its application to classical statistical testing; Introduction to Probability with R [11] goes further into aspects of probability.

A First Course in Statistical Programming with R [12] introduces R as a programming language; those already familiar with programming may wish to consult S Programming [13]. Finally, for students wishing to explore the graphing facilities of R, R Graphics [14] is recommended. Several texts focus on aspects of computational biology.

First, the introductory text on Computational Genome Analysis [3] provides worked examples in R throughout the book. Stochastic Modelling for Systems Biology [15] uses R to demonstrate modelling in systems biology. An advanced book for those already familiar with R is R Programming for Bioinformatics [16]. Finally, a general text for biological modelling is Dynamical Models in Biology [4].

Useful web sites: R has numerous online resources that students should be encouraged to explore. One search site, powered by Google, searches numerous online R resources, including documentation, source code, and books. It also searches the numerous email lists hosted by the R project; R-help in particular is a useful list for people learning about R.

Another resource is a very useful guide for students who know Matlab; it provides a comprehensive list of Matlab functions and the corresponding functions in R. A further site provides a gallery of advanced graphics examples, along with the R code used to generate those plots. Common problems encountered when learning R: students with previous programming experience usually find learning R quite straightforward.

It has a rich set of online documentation for each function, complete with examples, to help learn the language. However, there are some common problems that occur when learning R, described briefly below, along with suggestions for helping students.

Syntax errors and getting started The syntax of R can be difficult for students to acquire, and students often report that they spend many hours debugging simple problems. We encourage students to ask a colleague for help, as often these errors are simple, yet frustrating to spot.

We use a wiki to allow students to post questions or exchange tips and example code. Compare your estimates with the normal cdf function pnorm. Use the print command within a script to display the value of an expression. The first chapter provides an overview of computational statistics and a brief introduction to the R statistical computing environment. Lists are frequently used to return several results of a function in a single object.

For the Monte Carlo experiment above, the parameter estimated is a probability, and the estimate, the observed Type I error rate, is a sample proportion.

Then the following properties hold, provided the moments exist. A complete list of all available packages is provided on the CRAN web site. Use the bootstrap to estimate the standard error of the correlation statistic computed from the sample of scores in law.
