From bobduff Wed Dec 16 15:56:12 1992 From: bobduff (Bob Duff) To: ada9x-mrt To: iso@ajpo.sei.cmu.edu Subject: LSN-1052 on random number generation for Ada 9X !topic LSN on random number generation for Ada 9X !key LSN-1052 on random number generation for Ada 9X !reference MS-L;4.6 !from Kenneth W. Dritz $Date: 92/12/16 15:48:53 $ $Revision: 1.3 $ !discussion This LSN addresses resolution number 10-4 from the November ISO/WG9 in Salem. 1. Introduction In designing a facility for the generation of pseudo-random numbers (hereafter called simply "random numbers") in Ada 9X, we have sought to balance utility and simplicity. We do not offer all the bells and whistles someone might want or need, but we do offer the capabilities that are difficult to program correctly from scratch; other capabilities can be composed on top of these rather easily by the user, if required. We present in section 2 the specification of a package providing the desired capabilities. Section 3 discusses the design decisions leading to this specification, while section 4 discusses some other alternatives that might still be considered. Section 5 discusses properties that the implementation must satisfy. We conclude by inviting comments from reviewers. 2. Specification The specification of the random number facility for Ada 9X is as follows: package Random_Numbers is type Seed is ; -- but not private Seed_Error : exception; -- see Set_Seed, below type Reset_Integer is range ; -- see Reset, below generic type Float_Type is digits <>; package Generic_Generator is -- Instantiation of this generic package provides a random number -- generator in each task; multiple instantiations provide multiple -- generators. function Random return Float_Type; -- Get the "next" random number, uniformly distributed on -- [0.0, 1.0), from the current task's generator. procedure Reset (I : in Reset_Integer); -- Set the state of the current task's generator to an -- implementation-defined function of the integer I. procedure Reset; -- Set the state of the current task's generator to an -- implementation-defined function of the current date and time. procedure Get_Seed (S : out Seed); -- Get the state of the current task's generator. procedure Set_Seed (S : in Seed); -- Set the state of the current task's generator to an explicit, -- user-provided value; raises Seed_Error if S is unacceptable to -- the implementation. end Generic_Generator; end Random_Numbers; 3. Design decisions As in many other languages, only one kind of distribution is provided, namely the uniform distribution. More specifically, the random numbers generated by the facility are uniformly distributed on the semi-open interval [0.0, 1.0), and are delivered as values of a floating-point type chosen by the user. Other intervals are easily obtained by scaling and translation: if X is uniformly distributed on [0.0, 1.0), then a*X+b is uniformly distributed on [a, b). Integer values uniformly distributed on some interval are also easily obtained by appropriate scaling and conversion to integer, with post-conversion adjustment if rounding went "the wrong way." Other distributions, such as the normal (Gaussian), Poisson, binomial, or exponential distributions, can be obtained from the uniform distribution by various standard techniques. We focus our efforts on the capability most difficult to provide if it is not predefined, namely a reliable uniform random number generator. Some algorithms naturally produce 0.0 among the numbers generated, and it is normal to include this value in the range of the random number generator. Mathematically, good algorithms exclude 1.0, but care must sometimes be exercised to avoid rounding a result close to 1.0 up to 1.0. We could make the job simpler for implementors by including 1.0 in the specified range of random numbers but choose not to on account of the benefit to some applications of knowing that at least one of these extremes is never generated. For example, if X is uniformly distributed on [0.0, 1.0), then -log(1.0 - X) is exponentially distributed with mean 1.0 and standard deviation 1.0, and it never receives an invalid argument. A primary goal of the design of the random number facility for Ada 9X is the preservation of opportunities for innovation on the part of implementors, as the state of the art advances, while providing for a certain amount of portability of applications among implementations. Thus, an algorithm is not prescribed, nor is the nature of the seed specified. (The seed, which may be scalar or composite, embodies the state of a generator. Technically, we should reserve "seed" to mean the initial state of a generator, but since a generator's current state may be though of as its initial state relative to the future sequence of outputs, we use seed to mean its current state.) It is our intention to allow any of the following as implementations of the random number generator: o good instances of a classical linear congruential generator, whose scalar seed is an integer or, in some realizations, a floating-point number; o combination generators, such as those of Wichmann and Hill or L'Ecuyer, whose composite seed is a small vector; o the more recent "subtract-with-borrow" or "add-with-carry" Fibonacci generators of Marsaglia and Zaman that have phenomenally long periods, but whose composite seed is a vector, often of twenty to a hundred integer or floating-point components representing the lagged previous outputs used in computing future outputs, plus another component representing the next borrow or carry. Some assurance of quality is provided by specifying minimal properties that the implementation must exhibit (see section 5). Portable means of obtaining reproducible sequences of random numbers within a given implementation are provided, as are portable means of obtaining unique sequences of random numbers from one run to the next in a given implementation. (Reproducible sequences are generally desired for application development and testing, while unique sequences are often wanted for production use of applications.) The user must not expect to be able to obtain the same "reproducible" sequence of random numbers he or she has been accustomed to in one implementation upon switching to a different implementation, or when an implementation is upgraded by the vendor. Generators have a default (implementation-defined) initial state, which provides for reproducibility of sequences across runs in simple applications. Unique sequences in different runs are obtained by replacing the default initial state with a time-dependent state (by calling the parameterless version of Reset before generating random numbers). Normally, one does not need to manipulate the state of a generator (i.e., its seed) directly, but Get_Seed and Set_Seed are provided for those occasions when, in more advanced applications, it is desirable to do so. For example, the current state of a generator can be obtained by a call to Get_Seed and saved in a file; the value retrieved from the file can be passed to Set_Seed in a subsequent run to resume the sequence from the previous stopping point. This particular use does not require knowledge of the seed type, but other uses of Get_Seed and Set_Seed do and are inherently implementation dependent. One example of such a non-portable use of Set_Seed is to obtain a reproducible sequence other than the one provided by default. Another arises in applications in which different tasks have separate random number generators that must produce different, but reproducible, sequences. The only general solution to this problem is to allow the user to compute an initial seed for the generator in each task, perhaps as a function of the current seed in the task's parent. An example of a non-portable use of Get_Seed is its use in the simple printing of the current seed for debugging purposes. The version of Reset with an integer parameter provides an alternative way to obtain a reproducible sequence other than the default sequence. In contrast to using Set_Seed for that, Reset with an integer parameter is somewhat more portable though less general. It is more portable only in the sense that the source text of some applications, for example those that prompt the user for a seed value, need not depend on the implementation. It is less general in the sense that the possible values of a composite seed are far more numerous than those of a single integer, so only a subset of the possible states of a generator may be produced by calling Reset. Every integer in the range of Reset_Integer should be acceptable to Reset. The seed type is a visible type in order to support the convenient computation of seed values, when required. Implementations are allowed to define other types in terms of which Seed is defined, as may be necessary when Seed is composite; they may also define Seed by a subtype declaration instead of a type declaration. There are several different ways to enable separate tasks to have their own random number generators. The method we have chosen has some limitations, discussed along with the alternatives in section 4, but provides within those limitations for the greatest simplicity of use. Each instantiation of Generic_Generator provides each task with its own generator of random numbers, which is implicitly accessed by the operations Random, Reset, Get_Seed, and Set_Seed. (They access the generator of the current task.) This behavior can be implemented by appropriate instantiation of the predefined generic package Task_Identification.Per_Task_Attributes in the body of Generic_Generator (the seed becomes a per-task attribute). Those single-task applications requiring multiple generators and those multitasking applications requiring multiple generators in some or all tasks can obtain them by instantiating Random_Numbers.Generic_Generator multiple times. Some values of type Seed may be inappropriate for certain generators. In some cases, inappropriate seed values cannot be excluded simply by the use of range constraints in the definition of Seed or its components. The exception Seed_Error is provided for Set_Seed to use in signaling an attempt to set the state of a generator to an inappropriate value. Implementations may prefer to avoid or reduce the possibility of inappropriate seed values, which arise largely through the use of floating-point seeds, by defining the seed type to be an integer or a composite of integer components, even when the internal state contains floating-point components (having integer or scaled integer values). We do not provide a subprogram to fill a vector with random numbers in a single call. On scalar computers, such a capability offers a reduction in the subprogram call overhead. On vector computers, additional benefits might accrue by vectorizing the implementation, which is possible for some algorithms; the primary benefit on such computers, however, is the additional vectorization of the user's code that might be possible when, in a loop, random numbers are consumed from a vector rather than by calling a subprogram to generate each one. In omitting this capability, we emphasize the minimal essential functionality of our random number facility. Omitting it also keeps the generic formal part simple. 4. Alternatives An alternative solution to the problem of obtaining different but reproducible sequences in different tasks, one that does not require knowledge of the seed type, is a procedure to "advance" a generator over some presumably large portion of its full period without generating the intervening elements. This is relatively straightforward for linear congruential generators, but difficult or impossible for some of the other kinds of generators we wish to allow. Furthermore, obtaining many "different" sequences by systematically advancing their sequences by given amounts runs the risk of obtaining identical sequences offset in time, or, in the case of linear congruential generators, more weakly correlated sequences. Thus, it seems desirable to give the user full latitude in deciding how to obtain reproducibly different sequences in different tasks, even though it means requiring the use of implementation-dependent procedures for doing so. We have found in the literature a reference to a library at Queen's University, Belfast, whose subtract-with-borrow Fibonacci generator has a procedure for resetting the state of the generator from an integer, each value of which initiates a very long subsequence of the full period not overlapping with any other (in any practical sense). We have not yet seen the theoretical basis for such a claim, and we do not know how to implement it ourselves. Our Reset procedure (with an integer parameter) can provide exactly this functionality for appropriate generators; for such generators, at least, it provides a simple method of giving each task its own sequence of random numbers. The simple printing of the current seed for debugging purposes could be made easier by something like an Image function that returns the state of a generator as a string according to some implementation-defined coding scheme. Indeed, initially we made the seed type private and contemplated Image and Value functions on seeds. It does not seem likely that Image and Value would be suitable replacements for Get_Seed and Set_Seed; using them would, for many purposes, be inconvenient because of the need to code or decode the strings (in the case of a composite seed), and they would offer no real escape from the inherently implementation-defined nature of seeds. We fear that they would give a false sense of being able to manipulate seeds portably, given that the implementation-defined characteristics of the interface would disappear from the specification and remain hidden away in the semantics. We therefore omit Image and Value as not offering sufficient added benefit, given Get_Seed, Set_Seed, and a visible seed type. The most difficult problem we faced in designing these facilities was their "packaging." We could easily have done something different, and we are still not completely sure that we shouldn't. The main attraction of our packaging method is the painless way in which each task is given its own generator of random numbers. The user need not create objects in each task for the storage of the task's generator's state, and no extra parameters are needed to identify the generator being addressed in the various operations on generators; in particular, no parameters are passed to Random, which is therefore as easy to use in separate tasks (where separate generators are wanted) as in non-tasking applications. It is anticipated that implementations will instantiate Task_Identification.Per_Task_Attributes internally to create a "slot" in each task for the task's random number generator's seed; this use of Per_Task_Attributes has been mentioned repeatedly during the design of that facility. Some small number of applications might require multiple generators per task, or multiple generators in a single task. They can be obtained by instantiating Random_Numbers.Generic_Generator multiple times. This suffices if a fixed number of generators is needed, so that they can be addressed without using indexed names. (If there are two generators per task, and the two instantiations of Generic_Generator are called, say, Velocity_Generator and Mass_Generator, then the random numbers from the two generators could be obtained by calling Velocity_Generator.Random and Mass_Generator.Random.) Our scheme does not support applications requiring an arbitrary number of generators in a single task or in multiple tasks, where indexing is required. We do not view this as a serious limitation. (There may be no overhead for random number generators in tasks that never invoke a random number operation; it depends on the implementation of Per_Task_Attributes. That is, an instantiation of Random_Numbers.Generic_Generator does not necessarily cause the acquisition of space for the generator's seed; that happens, at least with the canonical implementation model of Per_Task_Attributes, only if and when an operation is first performed on a generator. There is, however, a little extra overhead for using random number generators in our proposed scheme than in the alternatives discussed below.) The most obvious way to allow the creation of arbitrary numbers of random number generators without the use of Per_Task_Attributes is to identify a generator with an object of type Seed. A generator would be created by elaborating an object declaration for such an object. (The declaration of the type Seed would be changed to specify the initial state of the generator as the initial value of objects of the type.) The user would have the responsibility for creating generators in the quantity required, and in the just the places, i.e. tasks, required. Operations on a particular generator would be designated by passing the seed as an "in" parameter to the operation ("in out" in the case of Random, which would have to become a procedure, since Random updates the state). There would be no need for Get_Seed and Set_Seed operations. We never contemplated actually making a seed the embodiment of a generator. Were we to follow something like the approach discussed immediately above, we would, instead, embody a generator in a separate type. Our specification might be as follows: with System; package Random_Numbers is type Generator is limited private; type Seed is ; -- but not private Seed_Error : exception; type Reset_Integer is range ; generic type Float_Type is digits <>; function Generic_Random (G : Generator) return Float_Type; procedure Get_Seed (G : in Generator; S : out Seed); procedure Set_Seed (G : in Generator; S : in Seed); procedure Reset (G : in Generator; I : in Reset_Integer); procedure Reset (G : in Generator); private type State is ; -- typically a record containing -- a Seed component and maybe -- other stuff type Access_State is access State; type Generator is new System.Controlled with record S : Access_State := new State'(); end record; procedure Finalize (G : in out Generator); end Random_Numbers; (Deriving Generator from System.Controlled allows finalization of generators and, thereby, the reclaiming of the storage allocated separately to their associated states.) In this alternative, as in the proposal of sections 2 and 3, the state of a generator is not directly accessible. That operations are performed on generators, not seeds, seems like the proper abstraction. Get_Seed and Set_Seed play the same role as before in obtaining or changing the internal state of a generator (which now becomes a parameter to them). Objects of type Seed would be declared, as in the original scheme, only if the user needs to obtain or save the state of a generator. The validity check performed on a seed being provided to a generator remains part of the semantics of Set_Seed, where it is relevant. If, in contrast, generators were directly associated with seeds, and created by creating objects of type Seed, and operated upon by operating on such objects, then Generic_Random would presumably have to validate the current seed each time a random number is requested (if seed validity needs to be checked), since the seed might have been changed by direct manipulation since the instantiation of Generic_Random was last called. That would be intolerably expensive. We would have to state that the generators provided by this alternative scheme are single-threaded; it would be an error to share them among (call them from) multiple tasks. In the scheme proposed in sections 2 and 3, there is no need to make such a statement, since each task's generator can be called only by its own task. In addition to being slightly more general than our original scheme, albeit slightly harder to use in simple applications, the alternative has two additional advantages. One is that it can be implemented in Ada 83 (except for the reclamation of the storage occupied by a generator's state). The other is that it can serve as the foundation of a higher-level abstraction that some users may wish to implement, using protected types -- that of a multi-threaded random number generator. We do not consider the need for random number generators that can be called by multiple tasks (with appropriate serialization, of course) important enough to be predefined. The nondeterministic nature of (the sequence of) calls to such generators would, in most cases, eliminate any possibility of reproducible behavior. Nevertheless, it would be good to allow the advanced programmer the opportunity, at least, to create such a capability, if it is needed, out of more basic predefined components. With the facility proposed in sections 2 and 3, it appears that providing such a capability requires the creation of a server task in which random numbers are generated, and with which other tasks must explicitly interact. If control over the identity of generators were provided instead by the Generator type, the user could embed a Generator object in a protected object or protected type, as needed. Are these advantages of the alternative scheme significant enough to warrant adopting it instead? Reviewer comment is solicited on this question. It has been suggested that some combination of both schemes be provided, so that simple applications can be served simply while not precluding more advanced applications. That is, one could provide BOTH the Generator type (and operations on it) AND the inner generic package Generic_Generator. (Text_IO has been cited as a model; there, I/O operations can be performed either on the "current default input or output file" or on arbitrary files.) We have not actually proposed to do so because the resulting facility would have overlapping or redundant capabilities; most of the random number generators that would be needed, especially those in simple applications, could be created in either of two ways, which is confusing and unattractive. It has been assumed that the need exists for the user to be able to specify the floating-point subtype of the random numbers that are delivered. This is the reason why Random has become a generic function in the alternative proposal. The analogue of this in section 2 is the inner generic package, Generic_Generator, that contains the function Random. There, however, genericity plays a further role: it is the means by which multiple generators are obtained in a task (more precisely, in each task). Of course, the generic actual subtype used in the instantiation should have a range of at least 0.0 .. 1.0 and should provide "adequate" precision; too little could introduce repeated values into the output. It may actually be preferable to standardize the result subtype of Generic_Random as Float, or let it be an implementation-defined floating-point subtype. If this were to be done, then Generic_Random could revert to an ordinary function in the alternative proposal, and Generic_Generator could become a parameterless generic package in the original proposal; the user would generally have to convert the random numbers obtained to some application-dependent type. (Evidence that this may be appropriate comes from our experimentation with different generator implementations; we observed that the conversion from implementation-dependent types appropriate to the algorithm to the user's type, denoted by Float_Type, occurs as the final step.) 5. Properties required of implementations Random number generators should have long periods. It will be specified that implementations must exhibit a period of at least 2**31-2. (Unfortunately, it is not practical for this property to be verified by a validation suite.) An ultra-fast computer may be able to process applications for which this period is inadequate, in which case the vendor should consider using a subtract-with-borrow or add-with-carry Fibonacci generator. There needs to be adequate assurance that the version of Reset that uses the current date and time will provide a unique sequence on two widely separated runs. It will be specified that this procedure must reset the generator to different states when the interval between calls is at least one second and not greater than fifty years; a failure of this uniqueness property is allowed only when two calls are separated by exactly one hour and a return from daylight savings time to standard time intervenes. Furthermore, the function that maps the current date and time into generator states should be a wildly varying function (i.e., one whose sensitivity to the components of the current date and time is greatest with respect to the second and least with respect to the year). By the same token, the version of Reset that takes a user-supplied integer should map consecutive integers into generator states that initiate significantly different, and ideally uncorrelated, sequences. The most important properties that implementations must exhibit are those related to the "randomness" (unpredictability to one not familiar with the algorithm) and uniformity of the generated numbers. The general approach to uniformity testing is to formulate an experiment based on random sampling for which the expected outcome for a truly uniform random sample can be calculated, perform the experiment using the random number generator, calculate the chi-square value for the experimental observations, and then check that it is less than the chi-square value expected from theory (with some confidence like 95%). Exact criteria are still being investigated, as are appropriate experiments. The latter can be as simple as distributing the random numbers (or tuples of them) into cells of even width, which would be filled in equal number by truly uniform random numbers, or they can be more sophisticated. One experiment that looks promising is the playing of a large number of "craps" games, with rolls of the dice simulated by the random number generator, for which measurements of wins and losses, game length, and the length of "passes" (runs of consecutive wins) can be tested against theoretical data for real dice. This experiment is useful because correlations many dice rolls apart can affect the duration of a game; it can be combined with simple equidistribution tests on the outcomes of the dice rolls. The literature describes a variety of experiments that have been used with some success in the past. Request for Review Comments are solicited on any part of this proposal, but especially on the following questions: o Should the approach be like that outlined in sections 2 and 3 (in which the implementation is expected to use Per_Task_Attributes to give each task a random number generator for each instantiation of Generic_Generator), or should it be more like the alternative discussed in section 4 (in which the user creates objects of type Generator in the numbers and places desired), or should it be a combination of the two approaches offering both simplified use and general use? o Was the decision against providing vector output from the random number generator correctly made and justified? o Was the decision against making the seed type private, and against supporting a private type with Image and Value functions, correctly made and justified? o Should the floating-point type of the random numbers be specified as Float, or specified by the implementation, instead of being selected by the user? References Publications that have proven useful during the preparation of this LSN include the following: D. E. Knuth, "Seminumerical Algorithms" (vol. 2 of The Art of Computer Programming), Addison-Wesley, 1989. F. James, "A Review of Pseudorandom Number Generators," Computer Physics Communications 60:329-344, North-Holland, 1990. S. L. Anderson, "Random Number Generators on Vector Supercomputers and Other Advanced Architectures," SIAM Review 32(2):221-251, June 1990. G. Marsaglia and A. Zaman, "A New Class of Random Number Generators," Annals of Applied Probability 1(3):462-480, August 1991. Ada BCSLIB (Mathematical/Statistical Library), Report 20462-0519-R2, Boeing Computer Services, Seattle, 1990. B. A. Wichmann and I. D. Hill, "A Pseudo-Random Number Generator," Report DITC 6/82, National Physical Laboratory, Teddington, UK, June 1982. B. A. Wichmann and J. G. J. Meijerink, Report DITC 39/84, National Physical Laboratory, Teddington, UK, March 1984. P. Lecuyer, "Efficient and Portable Combined Random Number Generators," CACM 31(6):742-749,774, June 1988. ------ Output from automsg.perl ------ Mail received from bobduff *** Key LSN-1052 given to topic 'LSN on random number generation for Ada 9X' ----------- End of output ------------