Simulating Data.

Introduction.

    Most of the material that we have covered until now dealt with statistical and econometric models applied to real data and real economic issues.  We have also analyzed theoretically some of the nice properties of these econometric models.  However, econometric models with nice theoretical properties may not behave nicely in real applications. For this reason, researchers often study the robustness of econometric models using data that they create artificially for that purpose.  It is therefore very important for the researcher to be able to create data satisfying any desired set of properties.  Using such data, the researcher can find out whether the nice theoretical properties of a certain model are empirically relevant.

Very generally, an experiment of the type described above proceeds as follows:

1.- Specify a "true" econometric model. For example, a standard linear regression model.
2.- Generate a data set that satisfies the restrictions specified by the "true" model.
3.- Use the data generated in 2 to evaluate empirically the "true" econometric model.

    In order to fulfill step 2 we need to learn how to generate data using SAS. We will proceed by first creating a randomly generated data set; after this we will impose the restrictions of the model on this data.  We can define a random data set by using special functions available in SAS called RANDOM NUMBER GENERATORS.

In the next section we learn how to generate random numbers using SAS. Next we use this information in two simple applications. The first application uses random numbers to construct a "lottery". The second application deals with finding an approximation to the number pi.  In the next set of notes we analyze the properties of the regression model using the techniques that we have learned here.
 

Random Numbers

The Merriam-Webster online dictionary defines random as "without definite aim, direction, rule, or method."  Random numbers for use in computer programs can be classified into 3 different categories. A variable that generates random values is usually called a random variable. The UNIFORM RANDOM VARIABLE takes any value in the interval [0,1] with equal probability (this can be generalized to any interval [a,b]). Another common random variable is the STANDARD NORMAL, usually represented as N(0,1). A standard normal random variable generates sequences of random numbers with mean zero and variance one. In contrast with the uniform, values generated with the standard normal random variable are more likely to occur the closer they are to the mean.  The normal random variable has a bell-shaped probability density curve.

The area between the bell-shaped curve and the x-axis is equal to one, and the area under the curve over the interval [0,1/2] represents the probability that a value generated from a standard normal random variable falls within [0,1/2]. In particular, because the bell has a symmetric shape, we can infer that the probability of obtaining a value greater than or equal to zero is equal to 1/2 (equal to the probability of obtaining heads if we toss a coin). If we consider a standard normal random variable X and apply a transformation of the form

Y  =  A  +  BX

with B>0, we obtain a new normal random variable with mean A and variance B^2; it is denoted Normal(A,B^2).  Compared with the standard normal, this new random variable has a probability density curve centered around A that is more spread out if B>1 and more concentrated if B<1. Random numbers are available for a wide variety of random variables. Here are some of the most useful random number generators available in SAS:

   x = ranuni(seed);            /* uniform between 0 and 1 */
   x = a+(b-a)*ranuni(seed);    /* uniform between a and b */
   x = ranbin(seed,n,p);        /* binomial size n prob p */
   x = rancau(seed);            /* cauchy with loc 0 and scale 1 */
   x = a+b*rancau(seed);        /* cauchy with loc a and scale b */
   x = ranexp(seed);            /* exponential with scale 1 */
   x = ranexp(seed) / a;        /* exponential with rate a (mean 1/a) */
   x = a-b*log(ranexp(seed));   /* extreme value loc a and scale b */
   x = rangam(seed,a);          /* gamma with shape a */
   x = b*rangam(seed,a);        /* gamma with shape a and scale b */
   x = 2*rangam(seed,a);        /* chi-square with d.f. = 2*a */
   x = rannor(seed);            /* normal with mean 0 and SD 1 */
   x = a+b*rannor(seed);        /* normal with mean a and SD b */
   x = ranpoi(seed,a);          /* poisson with mean a */
   x = rantri(seed,a);          /* triangular with peak at a */
   x = rantbl(seed,p1,p2,p3);   /* random from (1,2,3) with probs */
                                /* p1,p2,p3 */
The Normal and the Uniform random number generators are the ones that we will use most often. To understand how random number generators operate, imagine that we have a data set containing a very large number of random numbers. A random number generator extracts numbers from that list. The SEED should be specified as an integer; it determines the position in the list of the first random number generated. If the seed is negative or zero, the computer clock is used to determine the position of the first random number in the sequence. If the seed is positive (it should be less than 2**31-1), the seed itself determines that position, so the same seed always reproduces the same sequence. The seed is only examined on the first encounter with a random number generator in a DATA step, so you cannot change the process once you begin. Remember, computer-generated random numbers are never truly random.
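
The behavior of the seed can be illustrated with a minimal sketch. The data set name seed_demo and the two seed values are hypothetical, chosen only for illustration. Following the rule stated above, the second call should continue the sequence started by the first call instead of starting a new one.

```sas
data seed_demo;
   do i = 1 to 5;
      u1 = ranuni(12345);   /* first call: the seed 12345 initializes the sequence */
      u2 = ranuni(67890);   /* later call: this seed is not examined, the same sequence continues */
      output;
   end;
run;

proc print data=seed_demo;  /* rerunning the step should print the same 10 numbers */
run;
```

Replacing 12345 with 0 would make the step use the computer clock, so the printed numbers would change from run to run.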

The following simple program generates three 10-dimensional vectors of random numbers drawn from the standard normal distribution.

PROGRAM 1   ===============================================

data a;
array srn(3);
                /* The fixed positive seed 111111 makes the run reproducible. A seed of 0 would use the clock instead */
 do j=1 to 10;
   do i=1 to 3;
      srn(i)=rannor(111111);  /* generate standard normal random numbers */
   end;
   output;
 end;
run;

proc print;      /* print the result  */
var srn1-srn3;
run;
 

===========================================================
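
The transformation Y = A + BX discussed earlier can be checked in the same way. The following is a minimal sketch, assuming the hypothetical values A = 5 and B = 2 and a hypothetical data set name normal_ab. With 10000 draws, the sample mean and standard deviation of y should be close to A and B, and those of x close to 0 and 1.

```sas
data normal_ab;
   a = 5;                      /* hypothetical mean A */
   b = 2;                      /* hypothetical standard deviation B */
   do i = 1 to 10000;
      x = rannor(111111);      /* standard normal draw */
      y = a + b*x;             /* normal with mean a and SD b */
      output;
   end;
   keep x y;
run;

proc means data=normal_ab n mean std var;
   var x y;
run;
```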

Class Exercise: Simulating tossing a coin.

The goal of this assignment is to use randomly generated numbers to design a computer-based game that replicates the tossing of a coin.
 

Application one: Designing a computer generated lottery.

Lotteries, such as lotto or raffles, used to be drawn using a physical device, such as a container of numbered balls from which balls are drawn (hopefully) at random. However, some lottery commissions are moving towards computer-based systems that simulate the container of numbered balls.

The Colorado Lotto is an On-Line "jackpot" game offering the largest prize of any Lottery game in Colorado. The size of the jackpot is determined by Lotto ticket sales. Lotto involves selecting six numbers from a field of 42 numbers.

How Lotto Works

Players select 6 numbers from a field of 42 possible numbers. Then, the Lottery chooses 6 winning numbers at random in a live drawing. If a player matches 3, 4, 5, or 6 winning numbers, they win a prize. Players may choose their own numbers or use the Quick Pick method, in which numbers are chosen randomly by a computer. The following table indicates the odds of winning:
 
 

Odds of Winning

Match             Odds
6 of 6 numbers    1 in 5,245,786
5 of 6 numbers    1 in 24,287
4 of 6 numbers    1 in 556
3 of 6 numbers    1 in 37

The Colorado Lottery publishes the most and least frequently drawn Lotto numbers since Lotto began on January 24, 1989.  The lotto has been played approximately 13*52 = 676 times since then.

The goal of this application is to use randomly generated numbers to design a computer-based lottery game that replicates the current Colorado Lotto. After this, generate 676 draws of this lottery and obtain the number of times each number appears. Compare these results with the frequencies observed in the actual Lotto. Finally, repeat this exercise many times and show that the empirical probability of drawing a certain number converges to the theoretical probability (1/42 = 0.0238095).

A computer program designed to replicate the Colorado Lotto should draw six distinct integers between 1 and 42, each equally likely, and should produce a different draw every time it is run.

Consider the following simple program:

PROGRAM   ===============================================

data a;
array lotto(6);
   /* initialize the random generator randomly: an argument of 0 uses the clock to generate a seed */
   seed = int(1111111*ranuni(0) + 1);

   do i = 1 to 6;                            /* generate 6 lotto numbers */
     lotto(i) = int(ranuni(seed)*42 + 1);    /* generate an integer between 1 and 42 randomly */
   end;
run;

proc print;                                  /* print the result */
title 'results of the lotto';
id  lotto1;
var lotto2-lotto6;
run;

===========================================================

After running this program twice we obtained
 
 

13 3 9 12 11 4

as a result of the first run, and
 

4 38 41 19 21 37

after the second run.  The draw changes from run to run because of the statement " seed = int(1111111*ranuni(0) + 1); ". The argument 0 makes SAS initialize the random number generator from the computer clock, so it is very unlikely that two runs start from the same point.

    In the previous program we cannot rule out the possibility of obtaining repeated numbers. The probability of that event is not small: with six independent draws from 42 numbers, at least two of them coincide with probability 1 - (42*41*40*39*38*37)/42^6, or about 0.31. The next program incorporates some additional lines of code to avoid repetition of numbers. Basically, in the added code we require that the experiment of drawing 6 numbers between 1 and 42 be repeated if two numbers are the same.  This can be easily accomplished using the "do while" statement.
 

PROGRAM   ===============================================

data a;
array lotto(6);
   /* initialize the random generator randomly: an argument of 0 uses the clock as a seed */
   seed = int(1111111*ranuni(0) + 1);
   c = 0;
   do while (c = 0);

      do i = 1 to 6;                             /* generate 6 lotto numbers */
        lotto(i) = int(ranuni(seed)*42 + 1);     /* generate an integer between 1 and 42 randomly */
      end;

      c = 1;                                     /* this part checks for duplicates among the lotto numbers */

      do i = 1 to 5;
        do j = (i+1) to 6;
          if (lotto(i) = lotto(j)) then c = 0;   /* a repeated number forces a new draw */
        end;
      end;
   end;
run;

proc print;                                      /* print the result */
title 'results of the lotto';
id  lotto1;
var lotto2-lotto6;
run;

===========================================================
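
How often the "do while" loop has to redraw can be worked out directly. The following minimal sketch (the variable names p_distinct and attempts are hypothetical) multiplies the probabilities that each successive draw differs from the earlier ones and writes the results to the SAS log. The expected number of passes through the loop is 1/p_distinct, since each pass succeeds independently with probability p_distinct.

```sas
data _null_;
   p_distinct = 1;
   do k = 0 to 5;
      p_distinct = p_distinct*(42 - k)/42;   /* draw k+1 must avoid the k earlier numbers */
   end;
   p_repeat = 1 - p_distinct;                /* probability of at least one repeated number */
   attempts = 1/p_distinct;                  /* expected number of passes through the loop */
   put p_distinct= p_repeat= attempts=;
run;
```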

Finally, we modify the lottery program with the duplicate check slightly to obtain 1000 repetitions of the computer LOTTO.

PROGRAM   ===============================================
data a;
array lotto(6);
   /* initialize the random generator randomly: an argument of 0 uses the clock as a seed */
   seed = int(1111111*ranuni(0) + 1);

do k=1 to 1000;                                  /* repeat LOTTO 1000 times */

  c = 0;
  do while (c = 0);

     do i = 1 to 6;                              /* generate 6 lotto numbers */
       lotto(i) = int(ranuni(seed)*42 + 1);      /* generate an integer between 1 and 42 randomly */
     end;

     c = 1;                                      /* this part checks for duplicates among the lotto numbers */

     do i = 1 to 5;
       do j = (i+1) to 6;
         if (lotto(i) = lotto(j)) then c = 0;
       end;
     end;
  end;

  output;                                        /* write one observation per lotto draw */

end;
run;

proc freq;
TITLE 'lotto frequencies';
TABLES lotto1-lotto6 / nocum;
run;

===========================================================

  As we know, the theoretical probability of drawing a certain number is 1/42 = 0.0238095. After obtaining frequencies for each lotto number, we observe that the empirical probability of obtaining a number (as defined by its relative frequency) is close to the theoretical probability.
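
The comparison with the actual Lotto asks for the number of times each number appears across 676 draws, regardless of its position in the draw. The following is one possible sketch of that step, not the only way to do it. The data set name draws and the variable name number are hypothetical. It stacks the six numbers of every draw into a single column so that one frequency table covers all of them. Each percentage should be close to 100/42, about 2.38.

```sas
data draws;
   array lotto(6);
   do k = 1 to 676;                        /* 676 draws, as in the actual Lotto history */
      c = 0;
      do while (c = 0);
         do i = 1 to 6;
            lotto(i) = int(ranuni(0)*42 + 1);
         end;
         c = 1;
         do i = 1 to 5;
            do j = i + 1 to 6;
               if lotto(i) = lotto(j) then c = 0;   /* redraw on a repeated number */
            end;
         end;
      end;
      do i = 1 to 6;
         number = lotto(i);                /* one observation per drawn number */
         output;
      end;
   end;
   keep k number;
run;

proc freq data=draws;
   title 'pooled lotto frequencies';
   tables number / nocum;
run;
```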

Application two: How to Approximate Pi Using Random Numbers.

By: Alexandru Csete, Institute of Physics and Astronomy, University of Aarhus

(Note: I have modified his notes slightly to fit the purpose of this class)

Introduction

The number pi is a well-known irrational number. It is irrational because it cannot be written as a ratio of two integers, so its decimal expansion never ends and never repeats. Thus any finite representation of pi is only an approximation to its real value. One of the reasons why pi is well known is that it appears in the formulas for the circumference and the area of a circle, among others. The following number represents a good approximation to pi:

PI=3.141592653589793238462643383279502884197

In the following I will describe how to use random number generators to approximate pi. It is a simple method and easy to implement on a computer, as you will see.

How to calculate pi.

We have to look at the unit circle (radius=1) within a square with sides equal to 2 (see figure 1). Now if we pick a random point (x,y) where both x and y are between -1 and 1, the probability that this random point lies inside the unit circle is given by the ratio of the area of the unit circle to the area of the square:

(area of circle) / (area of square) = (pi * 1^2) / (2 * 2) = pi / 4

If we pick a random point N times and M of those times the point lies inside the unit circle, the empirical probability that a random point lies inside the unit circle is equal to

M/N

Consider now the following figure

where d represents the distance from the origin to the border of the circle and is also the radius of the circle. Therefore d = 1 and (x,y) represents a point on the border of the circle. Using Pythagoras' Theorem we obtain

x^2 + y^2 = 1

and we deduce that any point inside the circle should satisfy

x^2 + y^2 < 1

Thus, the probability that a random point (x,y) lies inside the unit circle can be represented as P(x^2 + y^2 < 1) and is equal to

P(x^2 + y^2 < 1) = pi / 4

If N becomes very large (theoretically infinite), the empirical probability M/N and this theoretical probability become equal, and we can write

pi = 4 * M / N

approximately, for N large. This strategy can be used to approximate pi using random numbers. A point (x,y) is a point in the circle of radius one centered at the origin if the distance of this point from the origin is not larger than one.

Program

As you can see in the final formula for pi, the precision (number of correct digits) depends on how many times you pick a point: the greater N is, the more digits you get. The gain is slow, since the typical error of the estimate shrinks in proportion to 1/sqrt(N). We can use the following program:

PROGRAM   ===============================================

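data xycoor;

   do i=1 to 1000000;
      x = ranuni(111111);       /* x is uniform between 0 and 1 */
      y = ranuni(1111111);      /* y is uniform between 0 and 1; only the first seed is examined */
      output;
   end;
run;

data random;
 set xycoor;

   /* x and y lie in [0,1], so we sample one quarter of the square; the quarter circle has area pi/4, the same ratio as before */
   if (x*x + y*y) lt 1 then z = 1;
   else z = 0;

   z = 4*z;                     /* the mean of z equals 4*M/N */
run;

 proc means n mean; title ' Approximation to pi '; var z;

run;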

===========================================================
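
To see how the approximation improves with N, the estimate can be recorded at several sample sizes within a single run. The following is a minimal sketch, assuming the hypothetical checkpoints N = 100, 10000 and 1000000 and a hypothetical data set name pi_path. It keeps a running count M of the points inside the circle and compares 4*M/N with the value of pi returned by the SAS function constant('pi').

```sas
data pi_path;
   m = 0;                                   /* number of points inside the circle so far */
   do n = 1 to 1000000;
      x = ranuni(111111);
      y = ranuni(111111);
      if x*x + y*y < 1 then m = m + 1;
      if n in (100, 10000, 1000000) then do;
         pi_hat  = 4*m/n;                   /* estimate of pi after n points */
         abs_err = abs(pi_hat - constant('pi'));
         output;                            /* one observation per checkpoint */
      end;
   end;
   keep n pi_hat abs_err;
run;

proc print data=pi_path noobs;
   title 'approximation to pi at several sample sizes';
run;
```

Because the typical error shrinks like 1/sqrt(N), multiplying N by 100 should gain roughly one additional correct digit, although any single run can deviate from this pattern.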