Ada Compiler Evaluation System
Reader's Guide
for
Version 2.1
FINAL
Contract Number F33600-92-D-0125
CDRL A0014
Prepared for:
High Order Language Control Facility
Business Software Support Branch
88 CG/SCTL
Wright-Patterson AFB OH 45433-5707
Prepared by:
CTA INCORPORATED
5100 Springfield Pike, Suite 100
Dayton, OH 45431
Availability of the Ada Compiler Evaluation System
The ACES software and documentation are available by anonymous FTP from the host "sw-eng.falls-church.va.us" in the directory "public/AdaIC/testing/aces/v2.0" and from other Ada-related hosts. Document files are included in PostScript format and as ASCII text. The ACES files are also accessible via the World-Wide Web. The home page URL is "http://sw-eng.falls-church.va.us/AdaIC/testing/aces/".
For further information about the ACES, contact the High Order Language Control Facility. As of 1 March 1995, the appropriate contact is:
Mr. Brian Andrews
88 CG/SCTL
3810 Communications, Suite 1
Wright-Patterson AFB, OH 45433-5707
(513) 255-4472
andrewbp@email.wpafb.mil
This Reader's Guide for the Ada Compiler Evaluation System describes how end users can interpret the results of executing the ACES performance tests, the statistical significance of the numbers produced, the organization of the test suite, how to find particular language features and/or specific optimizations, and how to submit error reports and change requests.
The following documents are referenced in this guide.
ANSI/MIL-STD-1815A Reference Manual for the Ada Programming Language (LRM)
User's Guide for Ada Compiler Evaluation System (ACES), Version 2.0
High Order Language Control Facility
Business Software Support Branch
88 CG/SCTL
Wright-Patterson AFB OH
Version Description Document (VDD) for
Ada Compiler Evaluation System (ACES), Version 2.0
High Order Language Control Facility
Business Software Support Branch
88 CG/SCTL
Wright-Patterson AFB OH
Primer for Ada Compiler Evaluation System (ACES),
Version 2.0
High Order Language Control Facility
Business Software Support Branch
88 CG/SCTL
Wright-Patterson AFB OH
ISO/IEC 8652 (1995) Programming Language Ada, Language and Standard Libraries (RM 95)
"A Performance Evaluation of the Intel iAPX 432" by Hansen, Linton, Mayo, Murphy and Patterson in Computer Architecture News, Volume 10, Number 4, June 1982.
Ada Evaluation System AES/1
User Introduction to the Ada Evaluation System
Release 2, Version 1, Issue 1,
Crown Copyright, 1990.
Ada Evaluation System AES/2
Reference Manual for the Ada Evaluation Compiler Tests
Release 2, Version 1, Issue 1,
Crown Copyright, 1990.
Ada Evaluation System AES/3
System User Manual Parts 0 and 1
Introduction and General Information
Release 1, Version 1, Issue 2,
Crown Copyright, 1990.
ALGOL60 Compilation and Assessment by B. Wichmann, Academic Press, 1973.
"An Empirical Study of FORTRAN Programs" by D. Knuth in Software: Practice and Experience, Volume 1, Number 2, 1971.
Analysis of Messy Data by George A. Milliken and Dallas E. Johnson, Van Nostrand Reinhold Company, New York, 1984.
Applications, Basics, and Computing of Exploratory Data Analysis by Velleman and Hoaglin, Duxbury Press, 1981.
"Are SPECmarks as Useless as Other Benchmarks?" by
O. Selin, UnixWorld, August 1991.
Biostatistics: A Foundation for Analysis in the Health Sciences by Wayne W. Daniel, John Wiley and Sons, 1978.
"Characterizing Computer Performance with a Single Number" by James E. Smith in CACM, Volume 31, Number 10, October 1988, pp. 1202-1206.
Compilers: Principles, Techniques, and Tools by A. Aho, R. Sethi, and J. Ullman, Addison-Wesley, 1986.
Computer Architecture: A Quantitative Approach by David A. Patterson and John L. Hennessy, Morgan Kaufmann Publishers, Inc., San Mateo, CA, 1990.
"How Not to Lie with Statistics: The Correct Way to Summarize Benchmark Results" by P. Fleming and J. Wallace in CACM, Volume 29, Number 3.
Introduction to the Theory of Statistics by A. Mood and F. Graybill, McGraw-Hill, 1963.
Low Level Self-Measurement in Computers by R. Koster, School of Engineering and Applied Science, University of California, Los Angeles, Report Number 69-57, 1969.
Optimizing Supercompilers for Supercomputers by M. Wolfe, University of Illinois at Urbana-Champaign, 1982.
Performance and Evaluation of LISP Systems by R. Gabriel, MIT Press, Cambridge, Mass., 1986.
"Performance Variation Across Benchmark Suites" by
C. Ponder, Computer Architecture News, Vol 19 No 4, June 1991.
"Re-evaluation of RISC I" by J. L. Heath in Computer Architecture News, Volume 12, Number 1, March 1984.
"Task Management in Ada - A Critical Evaluation for Real-Time Multiprocessors," by E. Roberts, A. Evans, C. Morgan, and E. Clarke, Software: Practice and Experience, Volume 11, Number 10, October 1981.
"The Keynote Address for the 15th Annual International Symposium on Computer Architecture" by D. Kuck, May 30 - June 3, 1988. Computer Architecture News,
Vol 17 No 1, March 1989.
Understanding Robust and Exploratory Data Analysis by D. Hoaglin, F. Mosteller, and J. Tukey, John Wiley & Sons, 1983.
This section identifies the Ada Compiler Evaluation System (ACES) product, states its purpose, and summarizes the purpose and contents of this Reader's Guide.
This document is the Reader's Guide for the ACES Software Product. This guide explains the rationale behind the ACES product, objectives of particular tests and groups of tests, and objectives of the assessors and analysis tools. The User's Guide explains how to adapt the product to a particular Ada environment and how to run the tests, the assessors, and the analysis tools. The files and tests are named and described individually in the Version Description Document (VDD). A simplified set of instructions for using the ACES, suitable for use when no unexpected problems arise, is given in the Primer.
This document defines what the ACES test suite and associated tools are to accomplish, their intended users, and why the various parts exist as they do. It also presents the background and approach of the ACES: the rationale behind the design and development of the various test groups, the patterns of Ada usage that influenced the development, and a general perspective on the approaches taken. Finally, this document describes the benefits of different approaches to results analysis and the capabilities of the analysis tools to assist in those approaches.
The ACES is a tool to assist in the evaluation of an Ada compilation system. The basis of the ACES is a series of performance tests and assessors that are designed and built in Ada, to be adapted and executed on an Ada system. The performance tests are designed to measure execution timings and memory storage requirements. Additional tests are included to acquire compilation speed, system capacities, symbolic debugger capabilities, Ada program library functions, and diagnostic error messages. Tools to help with adaptation of the ACES product, as well as to assist in the analysis of the resulting data, are included.
The ACES makes it possible to:
* Compare the performance of several implementations. The Operational Software permits users to determine which system performs better for their expected Ada usage.
* Isolate the strong and weak points of a specific system, relative to others which have been tested. Weak points, once isolated, can be enhanced by implementors or avoided by programmers.
* Determine what significant changes were made between releases of a compilation system.
* Predict performance of alternate coding styles. For example, the performance of rendezvous may be such that designers will avoid tasking in their applications. The ACES will provide information to permit users to make such decisions in an informed manner.
* Determine whether the functional capabilities of a symbolic debugger are sufficient to accomplish a set of predefined scenarios. There is a wide variance among compilation systems in types of user interfaces and functional capabilities of symbolic debuggers.
* Determine whether the functional capabilities of a program library management system are sufficient to accomplish a set of predefined scenarios. There is a wide variance among compilation systems in types of user interfaces and functional capabilities for a program library management system.
* Evaluate the clarity and accuracy of a system's diagnostic messages. There are no standards for the format or content of diagnostic messages. The interpretation of the system response to these compilation units will require manual inspection and evaluation.
* Evaluate the system capacity limitations involving the compiler, linker and run-time system.
Different types of users might be interested in the ACES.
* Compiler implementors will want to know the strong and weak points of their system(s). They will also want to be able to assess improvements in their system.
* Compiler selectors are interested in comparing performance across systems to choose the best system for their project.
* Compiler users will want to be able to predict the performance of design approaches. They will also want to be able to isolate the strong and weak points of the compilation system they are using and to monitor performance differences between releases of a compilation system.
The ACES user is expected to already know how to use the Ada compilation system being tested. While this is not always a realistic assumption, it is not feasible to explain how every Ada compilation system which an ACES user might encounter will operate. For details on how to use any particular Ada compilation system, the ACES user is referred to the documentation for the compilation system. Similarly, for details of how a user can perform operations in the host operating system, the user must consult system documentation. In particular, an ACES user is expected to know (or be able to find out from sources other than ACES documentation) how to: use the text editor; construct a command file (script); compile, link and execute an Ada program; delete Ada compilation units from a program library; and in general, how to use the tools provided by the Ada system and the host operating system. Before starting with the ACES proper, users need to perform any Ada program library creation required by their compilation system.
Readers with a background in statistics may find the ACES output easier to interpret, since it deals with confidence intervals, standard deviations, modeling, residuals, robust estimators, significant differences, and similar concepts. However, this document explains the relevant concepts as they arise and contains citations to other works providing more background for interested readers.
The Ada programs provided are generally portable, consistent with the requirement to test all major features of the Ada language. In some cases, a feature is inherently implementation dependent and will have to be adapted to operate on each new system.
For a complete list of system-dependent tests, refer to the VDD Appendix D, "System Dependent Test Problems". The ACES does not restrict itself to a subset of the language to try to increase its portability. For example, there are some test problems which use floating point types declared with 9 digits of precision, although there are several Ada implementations where this size exceeds SYSTEM.MAX_DIGITS. Also, there are some test programs which will not terminate when executed on systems which do not support pre-emptive priority scheduling.
Some test problems may fail to execute on a validated implementation because they exceed system capabilities, such as a capacity limit, or rely on features the system does not support, such as PRAGMA PACK. Each project must decide for itself how serious it considers the failure of any test problem.
The ACES contains a large number of test problems. Most individual problems are fairly small. Many address one language feature or present an example which is particularly well suited to the application of a specific optimization technique.
The focus of the Comparative Analysis (CA) tool is on comparing performance data (execution time, compilation time, link time, compilation plus link time, and code expansion size) from different Ada compilation systems. The measurements reflect the influences of both the compiler (software) and the processor (hardware), and no attempt is made to isolate the relative contribution of the compiler or processor to the overall compilation system performance. It is quite possible that a highly optimizing compiler for a slow target processor will generate code which executes slower than that of a non-optimizing compiler generating code which runs on a fast target processor; in such a case the ACES will report that the second compilation system is faster. This does not imply that the second COMPILER is "inherently better" than the first; running the first compiler on a faster target processor could reverse the conclusion about which compilation system is faster. The CA tool computes "average" performance factors of different compilation systems over all the test problems. CA identifies test problems where any individual system is much slower (or faster) than expected, relative to the average performance of that system over all problems and the average performance of that problem over all compilation systems. ACES users can review the CA report to isolate the weak and strong points of an implementation, by looking for common threads among the test problems which report exceptional performance data. ACES users can also examine the results on a specific system in more detail by running the Single System Analysis (SSA) program.
To be successful, any suite of tests must be satisfactory at each of three independent levels.
* The organization of the test suite and supporting tools.
* The relationships between sets of individual test problems in a test suite.
* The properties of individual test problems.
These are discussed below.
* The organization of the test suite and supporting tools.
Topics of interest at this level are the reporting facilities provided, and identification of the test problems in the suite which address areas of interest.
The ACES Harness program will allow the user to easily identify tests to run and build scripts to run those tests. If the user has a set of tests that he wants to run for a number of systems, he can identify the group of tests in the performance issue file and easily select the entire set of tests at once.
The ACES Comparative Analysis program will compare performance data between systems. It will identify the test problems which show statistically unusual results.
The ACES Single System Analysis program will look at related test results and help isolate the strong and weak points of a system's performance.
An extensive system of indexes and cross reference lists is provided in the VDD appendices that accompany each release. Using these, the test problems which share a particular characteristic can easily be found and their results examined. The VDD appendices are listed in Table 1-2 "Appendices Description".
* The relationships between sets of individual test problems in a test suite.
The primary concern at this level is the breadth and depth of coverage that the test problems in the suite provide. The ACES test suite is designed to provide extensive coverage of language features and common constructions. Version 2.1 provides significant, but not extensive coverage of language features new to Ada 95.
The functional requirements on the ACES for breadth and depth of the coverage are discussed in Section 3 "ORGANIZATION OF THE TEST SUITE", and summarized here.
+ The test suite should contain problems which:
- Address all major syntactic language features.
- Demonstrate the presence (or absence) of particular compiler optimization techniques by comparing results among related test problems. For example, there are several sets of problems where one version is a "hand-optimized" variation of another. If the system executes both versions in the same time, it has presumably recognized and simplified the "more general" example. Test problem op_al_alge_simp_04 is the statement "II := LL * 0;" and op_mi_mach_idiom_03 is the statement "II := 0;". If both statements take the same time to execute, it is fair to assume that the compiler has recognized the algebraic simplification possible in op_al_alge_simp_04 and exploited it.
- Represent problems from Ada practice. It is important to include examples of how Ada is actually being used. Practical problems are not designed to be optimized (or to be unoptimizable); they are simply built to get the work done. Some examples are fairly large programs, which give an estimate of the effects of program locality on cache memory usage that very small programs cannot provide. An instance of this type of problem is the KALMAN program in ap_kfm01, which performs a digital space state filter operation.
- Represent classic benchmark problems used in the comparison of other languages. These include programs such as Ackermann's function in programs cl_acm01 - cl_acm02, Whetstone in programs cl_whm01 - cl_whm04, Dhrystone in programs cl_dhm01 - cl_dhm03, and sort programs in cl_som01 - cl_som03. Results from these programs may be available for other languages. (The classical benchmark tests make up the Classical (Cl) group of ACES tests. See Section 3.2.4.1.)
+ Test problems should not generally duplicate other test problems. For consistency checking, some duplication is desirable.
+ Test problems should differentiate between systems. A good test problem will run well on some target systems and poorly on others. Executing a test problem which all systems treat the same does not provide useful information about whether one system is better or worse than another.
Because a limited number of systems were tested prior to this release, a test problem which all these systems treated similarly may show differences when run on additional targets. When there is a "reasonable" expectation that a problem might show differences, it is prudent to retain it in the test suite.
The results of some test problems are of independent interest; e.g., those tests associated with LRM features such as rendezvous, exception propagation, or procedure calls. Relative performance data is not always sufficient. A system may have a relatively fast rendezvous, compared to other Ada implementations, but the absolute time is also important. For example, a real-time system may need to cycle every 20 milliseconds to satisfy the application's requirements. An implementation of that application which requires 100 rendezvous per cycle will not satisfy its performance requirements when the fastest entry call takes two milliseconds. Even if all other computations were instantaneous, the program would take 180 more milliseconds to perform the indicated synchronization than are available.
A particular test problem may be related to other problems to expose the presence of specific optimizations. One may be an unoptimizable version of another. If the first were removed from the suite, it would be difficult to compare the tests and reveal the presence of the optimization, even though the first problem may not differentiate between systems on its own.
If all the trial systems perform the specific optimization in a comparable manner, then the two test problems will be serially correlated on all systems tested. However, adding the performance measurement results from an implementation which does not perform the optimization will weaken the correlation between the two test problems, making the two test problems useful and non-redundant.
* The properties of individual test problems.
At this level, the relevant question is whether or not the particular test problem is "well written". The answer to this question depends on the intended purpose of the problem, in addition to the characteristics of the actual code comprising it.
A test problem which can be optimized into a NULL statement is a poor problem if it was intended to expose the performance of addition operators, but may be a well written problem if the intention was to test for potential optimizations. For example, "XX := XX + 0;" would be a poor test for general addition of literal values, but is appropriate to test for algebraic simplification.
Test problems which exhibit the characteristics described below are defined as "poorly written". They are further discussed in Section 6.8 "CORRECTNESS OF TEST PROBLEMS".
+ The problem could be incorrect Ada.
This condition is sufficient to disqualify a problem. Erroneous test problems will be withdrawn.
+ The performance of the test problem could be nonrepeatable.
It might take a different computation path when it is repeatedly executed, falsifying the assumption that the timing loop can execute the problem many times, divide by the number of executions, and obtain a valid time estimate for one execution. If the test problem takes a different amount of execution time on each iteration, a sequence of timing estimates may not converge.
+ A problem intended to test one feature may actually test another.
For example, a problem designed to test passing literal values to a simple function might be expanded inline and folded, making the test problem more a test of possible compile-time optimization than a test of passing literal parameters. While it is important to test for the folding of inlined subprograms, it is also important to test for specifying literal actual parameters when inlining is neither requested nor possible.
Each test problem in the ACES has been compared against these criteria. Also of concern at this level is the accuracy of the timing measurements themselves. The steps the ACES takes to ensure that the timing loop code is accurate are discussed in depth in Section 6 "DETAILS OF TIMING MEASUREMENTS".
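The sketch below illustrates the basic timing-loop idea only; it is not the actual ACES measurement template, and the unit and object names (TIMING_SKETCH, ITERATIONS, XX) are illustrative. The real template additionally handles clock resolution, null-loop overhead, convergence checking, and protection against the measured code being optimized away.

    WITH CALENDAR;
    WITH TEXT_IO;
    PROCEDURE TIMING_SKETCH IS
       PACKAGE DUR_IO IS NEW TEXT_IO.FIXED_IO (DURATION);
       ITERATIONS : CONSTANT := 1_000;
       XX         : INTEGER  := 0;
       START_TIME : DURATION;
       STOP_TIME  : DURATION;
    BEGIN
       START_TIME := CALENDAR.SECONDS (CALENDAR.CLOCK);
       FOR I IN 1 .. ITERATIONS LOOP
          XX := XX + 1;                  -- the construct being measured
       END LOOP;
       STOP_TIME := CALENDAR.SECONDS (CALENDAR.CLOCK);
       -- Report the estimated time for one execution, in seconds.
       DUR_IO.PUT ((STOP_TIME - START_TIME) / ITERATIONS);
       TEXT_IO.NEW_LINE;
    END TIMING_SKETCH;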
For the purpose of running the CA tool when a user has performance data from only one compilation system, the ACES is distributing sample data derived from running CA on the performance data from the trial systems. This represents the "average" performance of the systems tested, and can be useful in detecting differences between the one system the ACES user has data on and a hypothetical "typical" system.
There are three different but complementary ways an ACES user can examine test results. These are discussed in Section 5 "HOW TO INTERPRET THE OUTPUT OF THE ACES", and reviewed here.
* After running all the test programs on several systems, use CA to identify the test problems which have statistically unusual behavior on each system.
* Select a set of test problems which use a set of features of special interest and examine the results of these problems.
* Run the Single System Analysis program.
The VDD indexes and the source text itself provide a rich field for exploration. Depending on their interest, their time, and the use to which they intend to put the results, users may form very detailed hypotheses about combinations of language features which are responsible for particularly slow performance, and they may construct unique test problems to verify those hypotheses. Many ACES users will not be that interested or have that much time to spend. If they are running the ACES to choose between two or three compilation systems for one target processor, and they find that there are no large differences in performance between the systems, the details of the differences may not be important to them.
The important point of this discussion on examining test results is that ACES users will not have to understand the details of a thousand different test problems to obtain useful information from the results.
The Ada Compiler Evaluation System will not answer every question a user might have about Ada compilers; for example, dollar cost is not addressed. The ACES is concerned with evaluating systems, not just compilers. No attempt is made to factor out the contribution of hardware to overall performance. Some ACES users will want to compare different target architectures, in addition to comparing different compilers for the same target. This taxonomy will help to place the ACES tests in perspective and answer user questions about what assistance the ACES will and will not provide.
* Covered
+ Execution-time Efficiency (see Section 3.2 "EXECUTION-TIME EFFICIENCY").
This is the major emphasis of the ACES. Users will want to be able to examine the results of the ACES to study aspects of Ada performance. The ACES helps the user in isolating the particular tests that may be needed by providing indexes which list tests by various criteria.
+ Code-size Efficiency (see Section 3.3 "CODE-SIZE EFFICIENCY").
Code expansion size is an important area of interest for the ACES. This is measured by using the label'ADDRESS attribute. On systems which do not support this feature, users can select the "Get Address" technique, described in Section 5.1.1 of the ACES User's Guide, to measure code expansion size in other ways, as discussed in Section 6.7 "CODE EXPANSION MEASUREMENT".
Users should be aware that measuring code size by calculating differences of label addresses may produce misleading results. Highly optimizing compilers for pipelined processors may reorder statements in unexpected ways, rendering address differences meaningless. Unexpectedly small or large code sizes should lead the user to examine the object code to determine whether code reordering is responsible.
The ACES gathers size measurements for all the tests along with execution time. While this is not a major thrust, some tests are included which measure data space allocated to objects by using the X'SIZE attribute and comparing the size of the packed objects to the minimum size possible. Sequences of allocation and deallocation in a collection (LRM Chapter 13) are included which will fail if space is not reclaimed. The tests are designed as performance tests but will demonstrate, as a side effect, whether storage reclamation takes place. These results, along with others, are reported in the ancillary data section of the SSA report.
+ Compile-time Efficiency (see Section 3.4 "COMPILE-TIME EFFICIENCY").
One area emphasized in the ACES is compile and link speed. The ACES provides tests that explore the effect of program size and various constructs on the speed.
+ Symbolic Debugger Assessor (see Section 3.6.1 "Symbolic Debugger Assessor").
The ACES provides a set of Debugger Assessor scenarios (programs and sequences of operations to perform) to enable users to evaluate a compilation system's debugger.
+ Program Library Assessor (see Section 3.6.2 "Program Library System Assessor").
The ACES provides a set of Library Assessor scenarios (programs and sequences of operations to perform) to enable users to evaluate a compilation system's library system.
+ Diagnostic Message Assessor (see Section 3.6.3 "Diagnostic Assessor").
The ACES Diagnostic Assessor tests will determine whether a system's diagnostic messages clearly identify an error condition and provide information to correct it, and whether warning messages are generated for various conditions.
+ Capacity Assessor (see Section 3.6.4 "Capacity Assessor").
The ACES provides a set of program generators that, when executed, generate programs to assess the compile-time and run-time capacity limits of an Ada system.
* Not Explicitly Covered
+ Test for Existence of Language Features.
The presence of language features is the charter of the Ada Compiler Validation Capability (ACVC). The ACES test suite assumes that the full Ada language is supported and correctly implemented. The ACES contains tests for the performance of representation clauses and implementation dependent features (LRM Chapter 13 features). (Note that for this manual, LRM will typically represent both the Ada 83 Language Reference Manual and the Ada 95 Reference Manual. When specifically distinguishing between the two, the terms Ada 83 LRM (for Ada 83) and RM (for Ada 95) will be used.) Some test problems will require modification to run on different systems (such as using the PRAGMA INTERFACE and calling on a procedure written in assembler language) and may fail on some Ada implementations which do not support the full language.
The ACES test suite contains a large number of test problems, which are collected (at link time) into a smaller number of test programs, as illustrated in Figure 3-1. A test program may contain more than one test problem. This is helpful in executing the ACES on an embedded target where downloading from the host can take a large amount of time. The time to run the whole test suite can be significantly reduced if more than one problem is downloaded at a time. This can also reduce the total compilation time, since several programs are combined into one which can share various pieces of logic (such as instantiations of TEXT_IO).
Run-time measurements are reported by test problem. These measurements include the time required to execute the problem and the number of 8-bit bytes occupied by the problem's compiled code. There are 1937 test problems that report these measurements when they are executed. (Another 25 test problems exist solely for the purpose of providing compile-time measurements; they are not executed.) Each test problem's code is embedded in measurement code that is part of a procedure. These procedures may be bound singly or in groups to form executable programs of reasonable size. This reduces load time, which is especially important for embedded systems. There is a default binding scheme, or the user may select collections of test problems to be bound together to form executable programs. (See Figure 3-1.) In any case, the measurements for each test problem are calculated and reported individually. Thus, if all test problems execute successfully, there will be 1937 execution time measurements and 1937 code size measurements.
Compile-time measurements include the time required for compilation and the time required for linking. Since the test problems are typically very small, the enclosing measurement and reporting code would be the dominant factor in compile-time measurements. Therefore, compilation time is reported as the total time required to compile all the units forming an executable "main" program, and linking time is reported as the time required to link the units into the executable. (See Figure 3-1.) In order for compile-time measurements from different evaluations to be comparable, these measurements are taken only if the default groupings of test problems into executables are used. The default grouping results in 725 executable programs; thus, if all the programs compile and link successfully, there will be 725 compilation time measurements and 725 linking time measurements. Since the distinction between compilation and linking is not clear for some systems, the Analysis programs also report the total time for compiling and linking each main program.
Each test problem is embedded in a template which, when executed, will report the execution time and the code expansion space for the problem. The template is discussed in Section 6.5 "HOW TEST PROBLEMS ARE MEASURED".
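As an illustration only, the label-address idea behind the template looks roughly like the sketch below. The procedure and label names are hypothetical, the actual ACES template is considerably more elaborate (for example, it excludes its own bookkeeping code from the measured span), and arithmetic on type SYSTEM.ADDRESS is implementation dependent; systems that do not support label'ADDRESS use the alternate techniques noted earlier.

    WITH SYSTEM;
    PROCEDURE SIZE_SKETCH IS
       START_ADDR : SYSTEM.ADDRESS;
       STOP_ADDR  : SYSTEM.ADDRESS;
       II, JJ     : INTEGER := 0;
    BEGIN
       <<BEFORE>> START_ADDR := BEFORE'ADDRESS;
       II := JJ;                          -- the test problem being measured
       <<AFTER>>  STOP_ADDR  := AFTER'ADDRESS;
       -- The code expansion size is approximately the difference between
       -- STOP_ADDR and START_ADDR, expressed in storage units.  As noted
       -- earlier, a reordering compiler can make this difference meaningless.
    END SIZE_SKETCH;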
The emphasis of the Performance Test Suite is on determining the execution performance on a target system, both for speed and memory space. To a lesser degree, the ACES will also test for compilation and link speed. The tests are designed to:
* Produce quantitative results, rather than subjective evaluations.
* Be as portable as possible, consistent with the need to test all Ada language features of interest to Mission Critical Computer Resource (MCCR) applications, which are the class of applications of special interest to the ACES.
* Require minimal operator interaction.
* Be comparable between systems, so that a problem run on one system can be directly compared with that problem run on another target.
The best performance predictor for any particular program is the program itself. Only the final program will enable users to determine whether their implementation will satisfy their performance requirements. That being said, it must be recognized that most projects will want to use the ACES before their final program is available, so they can select an appropriate compilation system (compiler and target processor).
Even when a working version of the final program is available, users may find that its performance is not acceptable and want information to assist them in enhancing it. Performance analyzers (also called program profilers) are helpful in isolating the areas of a program which account for most of its execution time. However, this does not tell a programmer whether the constructions used within the program's "hot spots" might be faster if coded using different language constructions. If a project must select which of several implementations to use, this approach requires that the final program be ready for testing before the compilation system is selected. And even if a "very similar" application is available, the effort required to adapt it to each candidate system may be large, involving learning about system dependencies. A compiler implementor would not receive any information about which language features of their system are particularly slow (or fast) relative to other implementations, information which could help determine areas where performance improvements should be possible. Also, with only a set of final application programs, programmers have no initial guidance as to which language features or coding techniques they should try to avoid on a particular implementation. The ACES intends to provide such information. To do so, the ACES includes a set of tests which provides a broad scope of coverage of various features of the language.
The ACES is organized as a set of essentially independent test problems, assessors, and analysis tools. It is possible for users to insert additional test problems that they find of particular interest. In fact, by using ACES-defined names, users can easily add up to ten user-defined benchmarks for an Ada 83 system or up to twenty for an Ada 95 system. See Section 4. The analysis tools will process user-defined test problems, as well as ACES-developed problems. Users should consult Section 6 "DETAILS OF TIMING MEASUREMENTS" and the User's Guide for advice on test construction.
The issues addressed and not addressed by the ACES are indicated below.
* Addressed
+ Execution time
+ Code expansion size
+ Compile time
+ Link time
+ Diagnostics
+ Ada program library management systems
+ Symbolic debuggers
+ Capacity limits
* Not addressed
+ Cost
+ Tests for existence of language features
+ Adaptability to a special environment
This includes the ability to modify the Run-Time System (RTS) to fit a special target configuration, which is important to many MCCR projects. Modification to an RTS brings up validation and revalidation questions, and extensive modifications may require revalidation.
+ Presence of support tools
Cross reference listings, variable set/use lists, management support tools, compilation order dependency tools, configuration management applications, and other tools can make a system easier to work with, speed up the program development process, ease the maintenance task, and generally support the program life cycle.
Some target processors have not been explicitly anticipated in the suite of test problems, and a reader may ask how well unconventional target processors will be evaluated, or what extensions to the ACES would be needed to cover them adequately.
* Vector processors - There are test problems which can be vectorized. However, the machine idioms do not completely cover all the cases of interest. In particular, each vector processor architecture places particular constraints on what sequence of operations can and can not be processed with vector instructions. For example: vector registers may be limited in size; the memory span between elements which can be processed in vector mode may be restricted (to 1 in some designs); the efficiency of vector mode instructions on short vectors may be such that scalar processing is more efficient; some conditional processing in vector mode may be supported on some targets; special advice may have to be given to Ada compilers to permit vectorizing (particularly with respect to numeric exception processing in vector mode); and special vectorized library functions may be available (and necessary) to exploit the vector processing capabilities.
* Very Long Instruction Word (VLIW) machine architectures - Machines in this class have different machine idioms than others. It is possible that some of the specific optimization tests designed for other target machines will be falsely reported as being performed. For example, there are several common subexpression tests where an optimizing compiler may reduce the operation count, generating fewer instructions and taking less time than a version of the problem written as two statements with the common subexpression merged "by hand". A VLIW processor may use several parallel instruction units to perform more operations for the "unoptimized" test problem, but still complete it in the same time.
Some care should be used in interpreting results on such a target. There are few Ada compilers available for such targets, so experience is limited.
* Reduced Instruction Set Computer (RISC) - On these processors, the only real difficulty is interpreting the test problems which probe for machine idioms. If there is no Increment Memory instruction on a target, test problems which permit its use are not directly applicable to the target machine. However, the time to increment a counter, or zero memory, or perform other operations which some machines have developed special idioms to support efficiently, are still of interest. Adding to or zeroing a counter is something which many programs need to do, whether it can be done in one machine instruction or not. The time (and space) it takes to perform the operation is the topic of real interest, not the machine instructions used to implement the operation. Special, idiomatic instructions were generated for machines because the designers believed that the instructions would be useful. However, for any particular target Instruction Set Architecture (ISA), a user may be especially interested in how well the target machine instructions are utilized: Are available idioms exploited? To answer this question, specific test problems need to be executed, and if not present, may need to be developed.
* Multicomputer - On a network of independent processors, with or without shared memory, it may be possible to process tasks in parallel for faster execution. In particular, it may be possible for a system to allocate separate Ada tasks to run in parallel on different processors. This may give interesting results, but when comparing performance between a uniprocessor system and a multicomputer, readers will need to understand the target configuration and the number of processors.
* 1750A Extended Memory - Access to MIL-STD-1750A extended memory is being provided by different implementors in different, incompatible ways. It is not the intention of the ACES to explore the performance of the different implementation-dependent extensions to provide access to extended memory on the 1750A. If the ACES tests are run as written, a baseline will be developed which can be used to compare systems. Projects requiring extended memory will generally need to modify the tests to match the extension needed for each implementation they test.
Determining the execution speed of various problems is the primary objective of the ACES.
The ACES includes test problems for all major Ada language features. One test problem will rarely be sufficient to accurately characterize the performance of a language feature. Optimizing compilers can generate different machine code for the same source text statement, depending on the context in which it occurs. The test suite contains sets of test problems which present constructions in different contexts. The results will demonstrate the range of performance associated with a language feature.
For example, consider the simple assignment statement "I := J;" where "I" and "J" are both integers.
* If "J" is a loop invariant in a loop surrounding the assignment, the statement may be moved out of the loop.
* If "I" is not referenced after being assigned to before being reassigned or deallocated, dead code elimination may eliminate the statement.
* If the value of "J" is known to be some constant based on statements executed before this assignment, it may be possible to use a special store constant or clear (if the value is known to be zero) instruction on the target machine.
* If an assignment was made to "J" earlier in the basic block leading to this assignment, it may be available in a register, permitting the compiler code generator to avoid a redundant load from memory of "J".
* If the value of "I" is needed later in the sequence of statements following the assignment, it may be possible to defer the store into "I".
* Establishing addressability to "I" and/or "J" may require the setup of a base register, or it may be possible that the required addressability has been established by other statements. This may be a common requirement if the variable is declared with intermediate scope, neither local nor library scope.
* If "I" has range constraints, various amounts of data flow analysis may be able to determine if "J" must satisfy the constraint or if explicit tests need to be generated. The performance in general will depend on whether "PRAGMA SUPPRESS" has been specified or not.
* If the assignment statement is unreachable, occurring inside the THEN clause of an IF statement determinable at compile time to be false, or directly after a GOTO, RAISE, EXIT, or RETURN statement without a label on the statement, or within a FOR loop whose range is determinable to be NULL at compile time, or within a WHILE loop determinable at compile time to have a false condition, or in a CASE alternative which is determinable at compile time not to be selectable, no code need be generated for it at all.
* If "I" has range constraints and "J" is known at compile time by data flow analysis to be outside the permissible range, the compiler may simply raise an unconditional CONSTRAINT_ERROR.
Some language features are tested with a corresponding systematic variation of contexts to expose the range of performance. Examples include: subprogram calling; task rendezvous; exception processing; arithmetic expressions; and Input/Output (I/O) operations. Although many embedded targets will not support file systems, many Command and Control MCCR applications make intensive use of file systems and the performance of I/O operations is critical to their application performance.
A set of test problems for "PRAGMA PACK" measures both time and space. Some packing methods do not allocate a component so that it will span a storage unit boundary, while some pack as densely as possible. The time to access a component which spans a storage unit is usually greater than when the component does not span a boundary. There are test problems which access a packed component of a record which would span a storage unit boundary (if densely packed), and others that access components which are left and right justified in a storage unit.
For justified components, machine idioms may be able to access the components more efficiently than for general alignments. These tests do not require tailoring to the storage unit sizes on the system under test; however, a test problem which forces a field to span a boundary on one system may not do so on all target systems. Although which component spans a boundary is dependent on the implementation storage unit size, the computation to identify the component can be performed in an implementation-independent manner using the named number SYSTEM.STORAGE_UNIT. There are test problems for the storage unit sizes in common usage: 8, 12, 16, 24, 32, 48, 60, and 64 bits. In addition to measuring the time to perform the test problems accessing packed objects, these test problems use the representation attribute, X'SIZE (LRM Chapter 13, Representation Attributes) to determine the actual bit size of the objects and compare this with the predetermined minimum possible bit size for the object. These sizes will be printed for information; they show the degree of packing the system under test performs.
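A much simplified, hypothetical illustration of the approach (the ACES tests cover many more component layouts and storage unit sizes) is:

    WITH SYSTEM;
    WITH TEXT_IO;
    PROCEDURE PACK_SKETCH IS
       TYPE FLAGS IS ARRAY (1 .. 9) OF BOOLEAN;
       PRAGMA PACK (FLAGS);
       F            : FLAGS := (OTHERS => FALSE);
       MINIMUM_BITS : CONSTANT := 9;              -- one bit per component
    BEGIN
       TEXT_IO.PUT_LINE ("Storage unit, in bits:"
                         & INTEGER'IMAGE (SYSTEM.STORAGE_UNIT));
       TEXT_IO.PUT_LINE ("Packed object size, in bits:"
                         & INTEGER'IMAGE (F'SIZE)
                         & "  (minimum possible:"
                         & INTEGER'IMAGE (MINIMUM_BITS) & ")");
    END PACK_SKETCH;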
Major language features are defined to be the features which are expected to require execution time and to have a reasonable expectation of portability. This includes all syntactic constructions except those listed below by LRM chapter number:
* Based Literals, LRM/RM Chapter 2
The ACES assumes that programs will be compiled, and literal values represented in an internal form at execution time. The format of the external representation of a literal value should not make any difference in execution time or space in nonsource-interpretive systems.
* Comments, LRM/RM Chapter 2
The presence of comments in test problems should not make any difference in execution time or space in nonsource-interpretive systems.
* Replacements of Characters, LRM/RM Chapter 2
The use of alternate character forms should not make any difference in the execution time or space in nonsource-interpretive systems.
* Address Clauses, LRM/RM Chapter 13
The ACES will not force users to make system-dependent modifications to test references to variables that are bound to fixed locations by ADDRESS clauses. Tying tasks to interrupts is a special case which is tested, because the performance characteristics are important and likely to be highly variable. For portability, the ACES does not include tests which use an ADDRESS clause to map a variable to a fixed location.
* Machine Code Insertions, LRM/RM Chapter 13
Test problems for machine code insertion are fundamentally tests of the underlying target hardware, rather than of any property of the Ada system. Designing a portable test problem for different target machines would be awkward, and impossible to verify since each user would have to perform the adaptation to the target machine. Even for the same target machine, different compilers may use different subprogram linkage and register usage conventions making a valid code insertion for one system erroneous on another. Some examples:
+ One system may expect that all machine registers are preserved across machine code insertions, while on another it may be permissible to modify various register values. If one compiler leaves a FOR loop index in a register and a piece of inserted machine code modifies the register, results could be peculiar.
+ It may be possible to write a piece of machine code which, when placed within an exception handler for OTHERS, examines the run-time environment provided for debug support and determines the identification of the exception which is being processed and where it was originally raised (line number or subprogram name). This could be a very useful piece of code, but is not likely to be portable.
+ The machine code to establish addressability to library units can differ between implementations; one system may allocate an object to a static (bound at link time to a fixed address) location, where another does not bind it until the package is elaborated. A piece of machine code which assumes the first alternative is performed and loads the address of the object in one instruction into a register to manipulate it will fail totally if the actual address needs to be computed by adding the base address of a package or by following a level of indirection.
+ The layout of records, including the degree of packing of fields, placement of discriminants, and the format and interpretation of descriptors for unconstrained type objects, may differ between implementations. Machine code written assuming one layout may be completely inappropriate for another.
It is possible that the text formats are incompatible between implementations (e.g., different mnemonics for the machine instructions, different placement of instruction fields).
* Interface to Other Languages, LRM Chapter 13 (RM Annex B)
Calls to other languages are implementation dependent and must be adapted to each system. There is one test problem, ms_il_interfac_lang_assem_01, which calls a NULL procedure coded in assembler and which users must adapt to each target. As such, it is basically a test of interlanguage linkage conventions, and is by no means exhaustive.
This simple control linkage test does not answer many of the interesting questions users have about the performance of a multi-language program. Data structure layouts often differ between implementations, and special access functions may be necessary to reference data structures defined in one language when referenced from the other, or shared data may be restricted to scalar items. Measuring the performance of a simple control transfer will not expose the costs of such data access functions.
Ensuring comparability in interfacing with a NULL assembler procedure is not easy. On many target machines, there are different sets of linkage conventions, of decreasing generality and increasing efficiency. For example, on the DEC VAX the CALLS mechanism is more general, but slower than the JSB mechanism. In "real" applications which call assembler coded subprograms, the linkage convention used will depend on the details of the program design, and represents a major design decision.
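For an Ada 83 system, the shape of such a test is roughly as sketched below; the package and procedure names are illustrative (not the actual ACES test problem), the language names recognized by PRAGMA INTERFACE are implementation defined, and the assembler body and any link-name pragma must be supplied per target.

    PACKAGE ASSEM_LINK IS
       PROCEDURE NULL_ASM;                     -- body to be coded in assembler
       PRAGMA INTERFACE (ASSEMBLER, NULL_ASM);
       -- Many implementations also require an implementation-defined pragma
       -- to give the external symbol name.  Ada 95 systems instead use
       -- PRAGMA IMPORT (Assembler, NULL_ASM), as defined in RM Annex B.
    END ASSEM_LINK;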
Certain predefined pragmas are expected to have an impact on the execution time and space of a program. These include: CONTROLLED, INLINE, OPTIMIZE, PACK, PRIORITY (Ada 83; Ada 95 implementations supporting Annex D), SHARED (Ada 83), ELABORATE, and SUPPRESS. Others are concerned with presentation information (LIST and PAGE). There are test problems which explore the performance effects of specifying the pragmas in the first list.
The ACES requires the availability of certain math functions. Implementations vary considerably in their approach to math libraries, and modifications may be required before using the ACES. It is important that ACES users record all the modifications made to adapt the math library. Implementations may provide:
* The math library defined in Ada 95 (RM Annex A). All Ada 95 systems are expected to provide the packages specified in this Annex.
* A math library with all needed math functions implemented according to the Numerics Working Group (NUMWG) specifications, which is the ideal case. In general, the ACES testing should use a math library similar to that which will be used by projects. If an implementation-provided math library is inaccurate (even if it is fast) an ACES evaluator must judge whether the inaccuracy is sufficiently gross that projects using the compilation system will (or should) replace the provided math library.
Vendor math libraries may provide more precision than the ACES types ZG_GLOB1.FLOAT6 and ZG_GLOB6.FLOAT9 (digits 6 and digits 9, respectively) require. This is not a problem; the accuracy of the library is tested during Pretest and if observed errors exceed NUMWG recommendations a comment is made.
* No math library. In this case, evaluators may use the ACES portable math library "zm_genma.ada", provided as a generic package.
* A math library with limited support. In this case, the adaptation may be more complex because the user may want to use the existing vendor functions, and use the ACES portable version only for the missing functions. This can be done by writing a "pass-through" package which calls the implementation's math functions where available, and the portable ACES math library functions for the others.
In addition, there is the special case of an implementation which requires the ACES portable math library, but has difficulty in processing generic packages the size of zm_genma. See the discussion below for two potential workarounds.
The ACES portable math package, zm_genma, is a generic package that is instantiated in package zm_math for type ZG_GLOB1.FLOAT6 and in package zm_dblma for type ZG_GLOB6.FLOAT9.
If the generic package is too large for a system to handle, the compilation system might accept the package if it were "de-genericized". Edit the files as specified in the User's Guide Section 5.1.6.1.4 "Making a Non-generic Math Package". It should be noted that comparing the performance of such a "de-genericized" version with a system which used the provided generic system may provide an unfavorable bias, in that the system supporting the generic version without requiring adaptation may run more slowly than it would have if it had been adapted. This is an example of the importance of recording all modifications.
Another approach to working around size problems in processing the generic package zm_genma is to define subprograms in zm_genma as SEPARATE units, although there is no guarantee that a system which could not process the original units of zm_genma will have any better luck in processing a version with subunits.
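A pass-through package for the limited-support case might look like the following sketch. VENDOR_MATH and its SQRT are hypothetical names standing for whatever the implementation actually provides; the function names, and whether conversions to and from the vendor's floating point type are needed, will vary between systems.

    WITH ZG_GLOB1;
    WITH VENDOR_MATH;       -- hypothetical vendor library providing SQRT
    WITH ZM_MATH;           -- ACES portable instantiation for FLOAT6
    PACKAGE PASS_THROUGH_MATH IS
       FUNCTION SQRT (X : ZG_GLOB1.FLOAT6) RETURN ZG_GLOB1.FLOAT6;
       FUNCTION LOG  (X : ZG_GLOB1.FLOAT6) RETURN ZG_GLOB1.FLOAT6;
    END PASS_THROUGH_MATH;

    PACKAGE BODY PASS_THROUGH_MATH IS
       FUNCTION SQRT (X : ZG_GLOB1.FLOAT6) RETURN ZG_GLOB1.FLOAT6 IS
       BEGIN
          RETURN VENDOR_MATH.SQRT (X);    -- function the vendor does provide
       END SQRT;
       FUNCTION LOG (X : ZG_GLOB1.FLOAT6) RETURN ZG_GLOB1.FLOAT6 IS
       BEGIN
          RETURN ZM_MATH.LOG (X);         -- missing from the vendor library,
       END LOG;                           -- so taken from the ACES version
    END PASS_THROUGH_MATH;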
Specific optimization test problems include examples where it is easy and where it is more difficult to determine that the optimization is applicable. There are some test problems which perform the same basic operations, but have a modification which either performs the intended optimization in the source text, or precludes the application of the optimization. These examples permit a reader to determine if the optimization is ever performed. The system comparisons performed by CA will not distinguish between the case where all tested systems perform the optimization and the case where none did. The SSA report on optimizations should show which optimizations are performed. These test problems should not be amenable to optimizations other than the one being studied.
The set of Ada language features which, when used in various contexts, might affect the performance (time and space) of the generated code is much too large to consider exhaustively enumerating all permutations of basic features. It is necessary to be selective in the construction of the test problem suite. Sets of related problems have been constructed with variations based on the use of language features expected to demonstrate the presence of specific optimizations.
There are test problems which check for the presence of specific compiler optimizations. More detailed discussions can be found in texts on compiler writing, such as Compilers: Principles, Techniques, and Tools, by A. Aho, R. Sethi, and J. Ullman. The test suite contains problems for the listed techniques. The following subsections describe individual techniques.
This is the recognition that an expression previously evaluated need not be evaluated a second time as long as the results of the first calculation are still available.
Although many compilers will recognize common subexpressions in some contexts, the better optimizing compilers do it in more contexts.
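For example, in the illustrative fragment below (not an ACES test problem), "A + B" is a common subexpression if A and B are not modified between the two statements:

    PROCEDURE CSE_SKETCH IS
       A, B, C : INTEGER := 3;
       X, Y    : INTEGER;
    BEGIN
       X := (A + B) * C;
       -- The value of "A + B" computed above can be reused here:
       Y := (A + B) / 2;
    END CSE_SKETCH;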
Folding is the performing of operations at compile time, including evaluation of arithmetic expressions with static operands (or operands whose values can be determined), and the folding of control structures (for example, no code needs to be generated for a statement of the form IF FALSE THEN RAISE PROGRAM_ERROR; END IF;).
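In the illustrative fragment below (not an ACES test problem), both the arithmetic expression and the IF condition can be resolved at compile time:

    PROCEDURE FOLD_SKETCH IS
       N : INTEGER;
    BEGIN
       N := 2 + 3;               -- static operands: may be folded to N := 5;
       IF FALSE THEN             -- condition known at compile time:
          RAISE PROGRAM_ERROR;   -- no code need be generated for this branch
       END IF;
    END FOLD_SKETCH;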
This is the moving of a statement (or a part of a statement, such as evaluating some expression) which is written within a loop, but which does not vary between iterations of the loop, to a place before the loop.
Greater than linear performance gains may be achieved through this technique, since instead of executing the moved code once per loop iteration, it will be executed only once. Some care should be exercised in applying this technique, since a loop which is normally not executed at all (for example, a WHILE loop which is typically initially false) might run slower with loop invariant motion than without it. This problem can usually be minimized by checking the condition that the loop may not be executed at all before evaluating the "moved" code.
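In the illustrative fragment below (not an ACES test problem), the product X * Y is a candidate for loop invariant motion:

    PROCEDURE INVARIANT_SKETCH IS
       A    : ARRAY (1 .. 100) OF INTEGER;
       X, Y : INTEGER := 7;
    BEGIN
       FOR I IN A'RANGE LOOP
          A (I) := X * Y + I;   -- "X * Y" does not change between iterations
       END LOOP;                -- and may be evaluated once, before the loop
    END INVARIANT_SKETCH;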
This is the replacement of a strong operator with a weaker but potentially faster one. For example, "multiply" can be replaced by an "add" in the expression "I * 2", giving the expression "I + I". A frequent and profitable application is replacing the multiplications implied by FOR loop indexes used as array subscripts (the index times a constant reflecting the span between logically consecutive elements of one dimension of the array) with additions.
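In the illustrative fragment below (not an ACES test problem), the multiplication by 8 is a candidate for strength reduction:

    PROCEDURE STRENGTH_SKETCH IS
       A : ARRAY (1 .. 100) OF INTEGER;
    BEGIN
       FOR I IN A'RANGE LOOP
          A (I) := I * 8;   -- the multiply may be replaced by maintaining a
       END LOOP;            -- running value incremented by 8 each iteration
    END STRENGTH_SKETCH;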
An assignment to a variable which is not referenced again before being either reassigned or deallocated is dead and need not be actually performed. While there are instances in which data is legitimately written to a memory location and never read (such as memory mapped I/O), these cannot happen inside Ada unless the compiler knows that a variable is tied to a particular memory location by an ADDRESS clause (see LRM 13.5; RM 95 13.1).
On machines with multiple general purpose registers, the registers serve as temporary storage which can save the values of expressions between the execution of the simple statements. If used efficiently, significant reductions in the number of instructions required to execute code are possible.
This is the process of switching inner and outer loops. It has a profound effect on the order of execution of statements, and can produce large performance gains.
The code can be transformed to contain more unit-stride array references; that is, the array elements referenced on consecutive iterations are adjacent in memory. This can permit, or greatly enhance, operations on vector processors.
It can also change the number of instructions executed on non-vector processors. For example, when an outer loop is performed 1000 times and an inner loop 5 times, there will be 1001 loop initiations (1000 for the inner loop and 1 for the outer loop). If the loops were switched, only 6 loop initiations would be required (5 for the inner loop and 1 for the outer loop).
Loop interchange may permit a reduction in paging on virtual memory machines by reducing the size of the working set.
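A hypothetical sketch of the interchange described above follows (the declarations are illustrative only); the first nest performs 1001 loop initiations, the second only 6, and both visit the same element pairs:

   procedure INTERCHANGE_EXAMPLE is
      type MATRIX is array (1 .. 1000, 1 .. 5) of INTEGER;
      A, B : MATRIX := (others => (others => 0));
   begin
      -- Original nest: 1001 loop initiations.
      for I in 1 .. 1000 loop
         for J in 1 .. 5 loop
            A (I, J) := A (I, J) + B (I, J);
         end loop;
      end loop;
      -- Interchanged nest: 6 loop initiations for the same work.
      for J in 1 .. 5 loop
         for I in 1 .. 1000 loop
            A (I, J) := A (I, J) + B (I, J);
         end loop;
      end loop;
   end INTERCHANGE_EXAMPLE;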
Refer to Optimizing Supercompilers for Supercomputers, by M. Wolfe, for more discussion.
This is the merging of loops with equivalent bounds. This can reduce the amount of loop overhead in a program.
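A hypothetical sketch follows; the two loops have identical bounds and no interfering data dependence, so they may be fused into a single loop, halving the loop overhead:

   procedure FUSION_EXAMPLE is
      type VECTOR is array (1 .. 100) of INTEGER;
      A, B : VECTOR := (others => 0);
   begin
      for I in 1 .. 100 loop
         A (I) := A (I) + 1;
      end loop;
      for I in 1 .. 100 loop
         B (I) := B (I) + 1;
      end loop;
      -- Fused form: one loop performing both assignments.
      for I in 1 .. 100 loop
         A (I) := A (I) + 1;
         B (I) := B (I) + 1;
      end loop;
   end FUSION_EXAMPLE;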
This is the combination of tests, best demonstrated by example. Consider the code:
IF I > 0 THEN ... ELSIF I = 0 THEN ... ELSE ... END IF;
On most machines, the comparison for "I > 0" will set condition codes which can be retested for the condition "I = 0"; that is, the generated code could be: "..., compare, branch_less_equal to $1, ... $1: branch_not_equal $2, ... $2: ..".
A boolean expression which has a compile-time determinable operand can be simplified. An expression which simplifies to "FALSE AND I=J" can be further simplified to "FALSE". DeMorgan's laws can be applied to help simplify boolean expressions.
Various algebraic identities can be used to simplify expressions. For example, it is not necessary to actually multiply by one or add zero.
Evaluating expressions in a non-canonical order often makes it possible to compute the results faster. This can reduce the number of temporaries, permit special immediate-operand forms, or permit register-memory operands rather than a load of a value from memory followed by a register-register operation.
This is also known as "branch tracing".
A jump instruction whose target is another jump can often be simplified into a single jump. For example, when a conditional jump skips over an unconditional jump, the pair may be rewritten as one conditional jump using the inverse condition.
Programmers will not often write a GOTO which branches to another GOTO. However, a straightforward translation of conditional control structures often produces such sequences of branches.
Code which is not reachable can be eliminated. Statements following an unconditional EXIT, RAISE, GOTO, or RETURN cannot be reached and need not have any code included in the memory load. When the compiler can determine that an IF statement condition is always false, the THEN alternative need not have any code generated for it. Some may argue that since well written programs should not have this type of unreachable code, having a compiler which optimizes it is not very important. However, with optimizing compilers performing both inline expansions and generic instantiations, there are more conditional expressions which can be resolved at compile time than is immediately apparent.
A procedure in a library package which is not reachable from the main program can be eliminated from the load. The reuse of common packages is expected to make this fairly common, since not every program will use every subprogram in a reusable package. For example, few programs will use all the subprograms defined in TEXT_IO.
The performance of some test problems can be enhanced by the good use of special properties of target machines. Examples of such idioms include: special instructions to clear memory; store of small constants; reuse of condition code settings to avoid retesting; use of special "loop" instructions; compare between limit instructions; register usage; add-to-memory instructions; loop instructions which update a counter and test against some limit; memory-to-memory block moves; and memory increment and/or decrement instructions.
If an array will fit within a single target computer word, the compiler may use fast and small inline code to perform various logical operations. For larger and/or dynamically determined sizes of packed arrays, a run-time support library routine will usually be called at execution time to perform the logical operation. This can be considered a type of machine idiom, using the capabilities of the machine.
The LRM defines several pragmas which can affect the performance (time or space) of the generated code. These include CONTROLLED, INLINE, OPTIMIZE, PACK, PRIORITY (Ada 83; Ada 9X implementations supporting Annex C), SHARED (Ada 83), SUPPRESS, and ELABORATE.
The LRM also permits implementations to define additional pragmas, some of which may affect performance. The ACES is designed to study the performance of Ada implementations as defined by the LRM; it does not contain test problems for all possible implementation-defined pragmas. However, individual users are free to modify the test problems to incorporate additional pragmas and observe their effects. Users should execute the tests as distributed so that results can be compared between systems, but they may still wish to explore the effects of implementation-defined pragmas. For a selection process, it would be reasonable to experiment with a few problems to find the pragma settings giving the best performance, or better still, to observe the effect of the pragma settings planned for use on the project.
It is sometimes possible and profitable to elaborate an object earlier than the "canonical" order requires. For example, a library package containing a constant array with static bounds initialized with literal values might be elaborated and initialized before the program is loaded. That array would then not require any overhead when elaborating the package. Depending on the other contents of the package, it may not be necessary to execute any code to elaborate the package. In general, if the bounds are not static, it will be necessary to execute the expressions providing the bounds at some time (not completely specified by the LRM) after the program is loaded but before any objects in the package are used (subprograms called, variables referenced, or types used).
Objects initialized with a static aggregate in the declarative region of a library package will, by definition, only be elaborated one time. This makes it impossible to use the normal ACES timing loop to measure the speed of elaborating library units. The special technique used to measure library package elaboration is discussed in Section 3.2.7.5 "Elaboration".
Frequently, arrays (or records) are initialized with an aggregate which can be evaluated at compile time, often with literal values specified. This permits several optimization techniques to be applied.
One approach to aggregates is to consider them as a sequence of individual assignments. However, for a static aggregate, it may be better to define a copy in memory and do a block move into the object to be initialized; block moves are often much faster and shorter than a sequence of load and store instructions. If a static aggregate is used to initialize a CONSTANT object (or if the compiler verifies that no assignments are made to the object), an optimizing compiler may make all references to the object refer to the pre-allocated static block, and save both the space for the object on the stack and the time to copy the initial values into it.
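A hypothetical sketch of such a declaration follows; an optimizing compiler may place the aggregate values in a pre-initialized data area and make all references use that copy, avoiding both the space for a separate object and the code to copy the initial values:

   package AGGREGATE_EXAMPLE is
      type CODE_TABLE is array (1 .. 5) of INTEGER;
      TABLE : constant CODE_TABLE := (2, 3, 5, 7, 11);   -- static aggregate
   end AGGREGATE_EXAMPLE;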
Many application designs which use tasking can use a static task structure, creating all tasks when the program is initiated and keeping them active until the program terminates. If an optimizing compilation system, and particularly an RTS, recognizes this, it may be able to save a significant amount of space and time. Space can be saved in the RTS because the internal routines associated with task creation and termination will not be necessary. Time can be saved in two areas. First, the time to create the tasks can be saved. This would be done once on program initiation and is not likely to be significant. Second, the RTS routines associated with rendezvous could be replaced with versions which do not test to verify that a task exists. If the number of rendezvous is large and the RTS is organized so that replacement is feasible, the time savings may be significant.
An optimizing compiler may be able to determine that a task is static either through:
* User declaration, via an implementation-defined pragma; or
* Analysis, by observing during compilation and linking that all tasks will be elaborated at the library level.
The definition of Ada tasking provides scope for some optimizations which have no analogue in more traditional languages. This section is entitled Language-Specific Optimizations, although the techniques described here may be applicable to other languages which support facilities similar to Ada.
The Habermann-Nassi transformation for tasking is a technique to reduce the number of task switches required to execute a rendezvous. It does this by executing the code of the rendezvous in the stack frame of the calling task, rather than in the frame of the entered task, when the entered task is already waiting to accept the rendezvous at the time the entry call is made.
Habermann observed that many of the tasks that arise in practical applications are of the "server" type, consisting of one or more SELECT statements enclosed in a loop. This structure permits an optimizing compiler to implement the rendezvous by replacing the ACCEPT statement linkage with a less general but more efficient subroutine which implements the required mutual exclusion and synchronization. A more detailed discussion of the approach is contained in the article "Task Management in Ada - A Critical Evaluation for Real-Time Multiprocessors", by E. Roberts, A. Evans, C. Morgan, and E. Clarke, Software: Practice and Experience, Volume 11, Number 10, October 1981.
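A hypothetical sketch of the "server" structure described above follows (one SELECT statement enclosed in a loop); an optimizing implementation may replace the general rendezvous mechanism for such a task with a simpler mutual exclusion and synchronization routine:

   package SERVER_EXAMPLE is
      task BUFFER is
         entry PUT (X : in  INTEGER);
         entry GET (X : out INTEGER);
      end BUFFER;
   end SERVER_EXAMPLE;

   package body SERVER_EXAMPLE is
      task body BUFFER is
         VALUE : INTEGER := 0;
      begin
         loop
            select
               accept PUT (X : in INTEGER) do
                  VALUE := X;
               end PUT;
            or
               accept GET (X : out INTEGER) do
                  X := VALUE;
               end GET;
            or
               terminate;
            end select;
         end loop;
      end BUFFER;
   end SERVER_EXAMPLE;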
Elapsed time measurements of DELAY (including DELAY UNTIL) statements requesting positive delay values will not follow the ACES product model for analysis (refer to Section 7.1 for details). A system which is typically twice as fast as another should not be expected to execute a DELAY statement twice as fast; doing so could require an execution time shorter than the requested DELAY value. Comparing execution times of DELAY statements is also complicated by system dependencies because the requested DELAY value must be expressed in terms of the system-dependent type DURATION; a request for a 100 millisecond DELAY can be interpreted very differently by different systems because of the truncation and rounding required to convert the value into the proper type.
All DELAY test problems are given a special error code to prevent them from being processed as "normal" test problems by CA. The measured execution times for test problems executing DELAY statements are printed for examination. Because the actual elapsed time to complete a DELAY statement can be much larger than the requested value due to quantization and scheduling overheads, this information can be of interest to application programmers.
Measurements of statements with a zero delay value are of particular interest because they can provide insight into system task scheduling and could be given special treatment. These tests are expected to highlight differences between compilation systems.
It is permissible for a compilation system to optimize a literal DELAY 0.0 into a NULL statement. The Ada Uniformity Rapporteur Group (URG) has recommended that implementations consider a "DELAY 0.0;" statement as a scheduling point. In particular, this would require an implementation to determine whether a task has been made abnormal (that is, aborted by another task) and if so, to terminate it. For Ada 95 implementations, such a statement is an abort completion point; for implementations supporting Annex D, it is also a scheduling point.
Individual ACES test problems are designed to determine whether:
* The system treats a "DELAY 0.0;" statement as a NULL statement.
Because it is a permissible interpretation, the ACES should not treat a system which does this as erroneous. Systems which do not treat this as a NULL statement could still provide special handling for a delay statement with a literal zero, which is faster than when the delay value is not determinable at compile time.
* The performance of a "DELAY 0.0;" statement is different depending on whether there are multiple tasks or there is only one active task in the system.
* The system recognizes a "DELAY 0.0;" as a synchronization point for abort statements.
A system which translates this into a NULL statement at compile time might "miss" some synchronization points a programmer thought were present in a task. For a task with a DELAY zero in a loop, it could mean that after the task was made abnormal (that is, aborted by another task) the time until it was terminated could be indefinitely postponed.
* A zero DELAY will force a task switch between equal priority tasks.
By comparing the number of task switches in a series of long running problems, the ACES can determine:
+ Whether the system is using a run-till-blocked or a time-slicing task scheduler. This is tested in problem dt_dp_delay_zero_06x.
+ The time quantum on systems which use a time-slicing scheduler. This is the time interval after which the task scheduler will switch between equal priority tasks. Many implementations provide system-dependent features which permit a user to specify the task scheduling algorithm:
- A pragma, as in DEC Ada.
- A linker directive, as in ALSYS on the Apollo.
- A system-provided subprogram which can be called at execution time, as in the TLD compiler.
The task-switch time can be inferred by running the same problem with different task scheduler directives for equal priority tasks, such as run-till-blocked or time-sliced with a short quantum. It will be the difference in measured times divided by the difference in the number of task switches.
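A hypothetical sketch of the kind of construction these test problems exercise follows; whether the "delay 0.0;" statement forces a task switch, acts as an abort completion point, or is removed entirely is implementation dependent:

   procedure DELAY_ZERO_EXAMPLE is
      COUNT : INTEGER := 0;
      task POLLER;
      task body POLLER is
      begin
         for I in 1 .. 1_000 loop
            COUNT := COUNT + 1;   -- stand-in for one unit of work
            delay 0.0;            -- possible scheduling and abort completion point
         end loop;
      end POLLER;
   begin
      null;   -- the main program waits here for POLLER to complete
   end DELAY_ZERO_EXAMPLE;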
A NULL statement or a sequence of NULL statements might generate some code. There are test problems to explore this.
st_nu_null_01 contains a single NULL statement.
st_nu_label contains a sequence of labeled NULL statements. It would be interesting to know whether a system generates code for each occurrence of a label.
There are some language constructions which display non-uniform performance: the more of them in a program, the slower the average performance. The simplest example is program size. Some compilers will not optimize the same arithmetic expression as well when it is the last statement of a ten thousand statement procedure as when it is the last statement of a ten statement procedure. Fixed-size internal compiler tables used for optimization can overflow, at which point the compiler simply stops trying to generate optimized code.
Other examples are in the following sections.
The time to perform a rendezvous might degrade as the number of tasks in the system increases.
The time needed to access a variable can vary with the lexical level at which it is declared. If a "static-link" approach to accessing objects in an intermediate (non-local, non-global) scope is used, the time to reference variables will vary based on the number of links which must be followed (and on whether registers have been set up by prior statements containing the values of these links). A "display" will provide essentially constant access times, but the overhead to maintain it on block entry/exit can be higher than with a static-link approach. For more discussion on displays versus static-links, refer to any of several textbooks on compiler construction, for example, Compilers: Principles, Techniques, and Tools, by A. Aho, R. Sethi, and J. Ullman.
The performance of control structures, particularly FOR loops, can vary with the level of nesting. Some compilers try to dedicate registers to hold FOR loop indexes, and as the level of loop nesting increases, the number of available registers decreases, and the time to load and restore environments increases on subprogram calls.
Some implementations try to make access to simple scalar formal parameters quick when there are few of them, passing them in registers, for example. Therefore, access to the first (or last) few formal scalar parameters may be significantly faster than access to other parameters. Also, calling subprograms with only a few parameters may be much faster than calling subprograms with many parameters. Input parameters with default values also need to be tested.
Many target machines have different formats of instructions to be used with different displacements. Code to access a variable which can be reached with a short displacement from a base address can be shorter and faster than code to access a variable requiring a long displacement. When a compiler allocates variables in canonical order, the time to access a variable declared at the beginning of a declarative region might be faster than the time to access a variable declared at the end of the declarative region. An optimizing compiler might allocate variables to memory so that variables frequently referred to are accessible with a short displacement instruction.
Similar effects might be present with respect to access to fields of records.
In many areas of the language, it is possible to speed up the performance of one feature at the cost of slowing another down. One classical example common to block-structured languages is the trade-off between a display and a static chain for access to intermediate lexical-scoped variables. Here a static chain approach trades off faster subprogram linkages for slower access to intermediate scoped variables. Other examples are discussed in the following paragraphs.
There are times when a programmer must make design decisions but the information to determine the best choice for a particular program is not available. Alternately, there are design issues where a programmer may want to know how a system implements various constructions so that a decision to use or avoid the language construction can be made on the basis of quantitative information.
Order of evaluation is discussed under classical optimizations. See Section 3.2.4.1.12 "Order Of Expression Evaluation".
The Ada language provides for records with default component initializations. If an explicit initialization is specified for every object of the record type, a good compiler should not generate any extra code for the default initialization. Where extra code is generated, users may wish to avoid giving defaults.
In a SELECT statement with several open alternatives, the LRM does not specify which is accepted (unless, in an Ada 95 implementation, a non-default entry querying policy is specified). Several test problems are presented which have multiple open alternatives to see if some implementations have adapted particularly fast (or slow) algorithms to perform the arbitrary selection. They also report on the method used to select open alternatives (lexical, priority based, a "roving" order giving priority to a different alternative each time the SELECT statement is entered, First-In-First-Out, Last-In-First-Out, or some other unanticipated order).
The relative time to access variables declared at library level, local, and intermediate lexical levels can vary between implementations. For intermediate lexical levels, the use of a display, as discussed in Section 3.2.5.2 "Levels of Nesting", can make access times roughly constant; however, this can slow down subprogram linkages. Access to variables declared in the parent unit of a separate subunit may be slower than what would be observed if the source text were textually included in the parent.
Looping constructions can be coded in alternate ways. A WHILE loop can be rewritten as an equivalent simple LOOP with an EXIT WHEN or IF ... THEN EXIT, or as IF ... THEN GOTO ... statements. There are test problems that explore the performance of these variations.
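A hypothetical sketch of two of the equivalent codings follows:

   procedure LOOP_STYLE_EXAMPLE (LIMIT : in INTEGER) is
      I : INTEGER := 0;
      J : INTEGER := 0;
   begin
      while I < LIMIT loop      -- WHILE form
         I := I + 1;
      end loop;

      loop                      -- simple LOOP with EXIT WHEN
         exit when J >= LIMIT;
         J := J + 1;
      end loop;
   end LOOP_STYLE_EXAMPLE;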
A CASE statement with a dense range of alternatives can be simply and efficiently implemented with a jump table on most current machine architectures. When the range between the first and last alternatives is large, a jump table approach can result in very large memory usage, perhaps exhausting the addressable memory of the target machine in one statement. Therefore, "sparse" CASE statements need to be translated as a sequence of tests. An implementation may choose to perform all CASE statements with a sequence of tests. Users will want to know the performance of both dense and sparse CASE statements. The ACES contains examples of each.
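A hypothetical sketch follows; the first CASE statement has dense alternatives and is a natural candidate for a jump table, while the second is sparse and is more likely to be translated as a sequence of comparisons:

   procedure CASE_STYLE_EXAMPLE (N : in INTEGER; R : out INTEGER) is
   begin
      case N is                  -- dense alternatives
         when 1      => R := 10;
         when 2      => R := 20;
         when 3      => R := 30;
         when 4      => R := 40;
         when others => R := 0;
      end case;
      case N is                  -- sparse alternatives
         when 1      => R := 1;
         when 1_000  => R := 2;
         when 30_000 => R := 3;
         when others => R := 4;
      end case;
   end CASE_STYLE_EXAMPLE;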
A packed array of a subtype might be stored as an array of the base type, or with a size which permits tighter packing. A subtype may require range checking which would not be necessary for objects of the base type (because all possible values are valid). An optimizing compiler may be able to avoid some of the range checks which would appear to be required.
Generic units can either be shared or treated as templates for macro expansion. When a generic unit is instantiated several times, a macro expansion approach can result in more space than a shared code approach. For macro expansion, depending on the actual generic parameters, it may be possible to reduce the time and space of the generated code (for example, by folding or eliminating unreachable code, which is only unreachable with the particular actual parameters specified).
The objective for this set of problems is to examine the performance of tests which elaborate generic packages defined by TEXT_IO. Because of differences in approaches to processing generic instantiations, these test problems are expected to highlight differences between implementations.
TEXT_IO is a predefined library package. It is not generally possible for a user (including the ACES test suite) to modify TEXT_IO to insert code to force time stamping to be recorded; neither is it feasible to produce multiple versions of the package so that each can be elaborated.
It is feasible to measure the time to elaborate instantiations of the generic packages defined in TEXT_IO such as FLOAT_IO, INTEGER_IO, or ENUMERATION_IO.
Because they might be treated differently, there are test problems which contain:
* Sharable and non-sharable instantiations.
* Instantiations performed in library units and in nested units.
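A hypothetical sketch of the kind of instantiations measured follows (the unit names are illustrative only): one instantiation of TEXT_IO.INTEGER_IO at the library level and one inside a nested unit:

   with TEXT_IO;
   package LIBRARY_LEVEL_INSTANTIATION is
      package INT_IO is new TEXT_IO.INTEGER_IO (INTEGER);
   end LIBRARY_LEVEL_INSTANTIATION;

   with TEXT_IO;
   procedure NESTED_INSTANTIATION is
      -- Elaborating this declarative region includes elaborating the
      -- instantiation.
      package LOCAL_INT_IO is new TEXT_IO.INTEGER_IO (INTEGER);
   begin
      LOCAL_INT_IO.PUT (42);
      TEXT_IO.NEW_LINE;
   end NESTED_INSTANTIATION;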
It is possible that a compiler will implement an option to perform INLINE substitution and simplification on subunits. The presence of subunits can interfere with optimizations of the parent unit because the compiler must assume that the subunit may modify any or all objects to which it has visibility.
One common approach to handling exceptions for faults detected by hardware, such as divide by zero or numeric overflow, will execute instructions on the entry and exit of each frame to track the exception handler which should be called if an exception is raised. Another approach builds a table of ranges so that when a machine fault is detected, the table can be searched to find the appropriate exception handler.
In one case, there will be additional overhead on each frame entry, but comparatively quick selection of the appropriate handler when a fault is raised. In the other, there will be no overhead on frame entry, but slower processing to find the correct handler after a fault is detected. It is expected that raising an exception will be a relatively rare occurrence, compared to frame entry, and that the trade-off will usually favor the second approach. Users may want to know which approach has been used by an implementation.
The test suite contains test problems constructed to reflect the difference that coding styles (computing the same results through different methods) can make. Examples include the Polynomial Coding Style subgroup of the Applications group (ap_pc) and the If Coding Style subgroup of the Statements group (st_is).
The test suite contains problems addressing those aspects of an Ada RTS which traditionally have been the province of operating systems. Examples include task processing (scheduling, DELAY handling, task creation and termination, aborting) and interrupt processing (obtainable by tying task entries to interrupt addresses); the ability to process hardware exceptions (such as dividing by zero); low level I/O and representation clauses tying objects to specific addresses (for memory mapped I/O or for hardware diagnostics); memory allocation and deallocation (using "new" operators and UNCHECKED_DEALLOCATION); elaboration order (concept not commonly made visible in languages other than Ada); and run-time checking (where Ada provides facilities for checking to be controlled and suppressed, if requested).
The test suite contains problems that explore task processing. This is discussed in Section 3.2.4.3.2 "Tasks".
The test suite contains problems addressing exception handling. This is discussed in Section 3.2.6.1.11 "Exceptions".
I/O tests are described below.
* Tests for asynchronous I/O
On some systems, any I/O operation will halt the program until the I/O completes. The Ada Uniformity Rapporteur Group has recommended that "A TEXT_IO.GET operation from an interactive input device, such as a keyboard, not be permitted to block other tasks from proceeding while waiting for input. Other types of input/output operations should allow the maximum feasible level of overlap, but it is recognized that in some systems, a general implementation of input/output overlap may be unfeasible". This issue is not one of correctness; the LRM does not require, or even discuss, the question of I/O operations blocking other tasks. Developing a portable test problem to determine this question is complicated because if the system halts waiting for I/O to complete, the test problem may not terminate.
The ACES contains test problems which measure the performance of operations in one task while another task is waiting on console input. When run on a target which blocks a program whenever a task waits for I/O completion, this test problem will not terminate until the user enters characters on the console. It is possible that having one task waiting for I/O will slow down the execution of other tasks (without necessarily halting them). This is revealed by comparing the results of executing the same code with and without another task in the system waiting for I/O.
* Tests for console I/O
Performance of console I/O is not intrinsically correlated with file I/O performance. In conventional operating systems, the two will be routed to very different device drivers. The speed of console I/O is important to some applications and needs to be independently tested. It is not feasible to test console input without using special test equipment, since those tests would essentially measure operator typing time. These test problems need to be executed interactively, not as a background batch job, because they are intended to measure the performance of console output. The problems are implementation dependent. It is possible that not all target systems will support console output, or that not all ACES users will be interested in console output performance. To work as intended, the target system must interpret an ASCII carriage return control character in an output string as affecting the cursor. The LRM does not require this; therefore, these test problems are not necessarily portable.
On many terminals, the time to display a string is a function of the characters to be displayed. The screen management routines of some target systems maintain the current contents of the display and compute the minimal set of terminal commands to modify it into the desired display. For example, a PUT of a string to a line already displaying the same string could have the result that no physical I/O commands are passed to the terminal. Many terminals have special commands to insert characters (sliding the old contents); delete characters; write blanks from the current cursor position to the end of the line; and overwrite characters. Since the time to transmit a character (either data or control) to a terminal can exceed a millisecond, reducing the number of characters transmitted can be a profitable optimization.
Some systems have bit-mapped terminals where the time to display a line is not correlated to the characters in the line or the old contents of the line.
These test problems reveal whether the time to display lines varies greatly with the lines being displayed or the prior contents of the line. To enhance repeatability of measurements, the contents of a line will be the same each time the string is displayed. The test problems first write a line of blanks, then write two strings, in order to observe the time to change between the three display strings. The blanks are written first because on some systems the time to blank a line will be constant and in many of the test problems, the first and second strings are the same, permitting users to attribute differences in test problems to changing between the first and second string.
If the console test problems only evaluated examples where the same string was being redisplayed, it would give an overly-optimistic impression of the performance of systems which omit physical I/O when re-displaying the same string. It would give an overly pessimistic impression of such systems if all the test problems displayed different graphic characters in every position.
* Tests for I/O patterns
Many classes of applications are I/O intensive. Their overall performance is determined primarily by the performance of the file system. The hardware characteristics of the file devices and the efficiency of the target operating system's file processing routines are the primary determinants of the speed of these problems. Disk caching is critical to file I/O performance. Projects which develop I/O-intensive applications will be concerned with performance of I/O primitives, and will not be particularly interested in knowing how the execution time is partitioned between the Ada run-time system, the operating system, and the physical device.
The LRM, by the FORM parameter on OPEN and CREATE procedures, provides a way for a program to specify implementation-dependent options for an external file. A null value is portable and requests the system defaults. On many target systems, specifying non-null values can greatly enhance performance. To ease portability, the ACES generally uses null values for the FORM parameter; however, the defaults will not necessarily result in the best performance. Programs concerned with file I/O performance on a particular target are advised to explore the performance differences resulting from specifying non-null values. Examples of such options which may have performance impacts include requests for multiple buffering, read-after-write checking, read-ahead, contiguous allocation on creation, a file's device name, shared vs. exclusive access, and so on. Time reduction factors on the order of one hundred are commonly achieved by using optimal FORM strings versus the default. The default settings for different compilation systems for the same target may not be comparable.
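A hypothetical sketch follows; the non-null FORM string shown is invented, since the available options and their syntax are defined by each implementation:

   with SEQUENTIAL_IO;
   procedure FORM_EXAMPLE is
      package BYTE_IO is new SEQUENTIAL_IO (CHARACTER);
      F : BYTE_IO.FILE_TYPE;
   begin
      -- Portable: a null FORM string accepts the system defaults.
      BYTE_IO.CREATE (F, BYTE_IO.OUT_FILE, "test.dat", "");
      BYTE_IO.DELETE (F);
      -- Implementation dependent: the option text below is invented.
      BYTE_IO.CREATE (F, BYTE_IO.OUT_FILE, "test.dat",
                      "CONTIGUOUS=YES,BUFFERS=4");
      BYTE_IO.DELETE (F);
   end FORM_EXAMPLE;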
All the following cases consider records of 100 bytes (or variant records with a maximum size of 100 bytes). Resulting file sizes range from 100_000 to 10_000_000 bytes. On many systems, the speed of operations on small files is significantly different from that on large files, and it is necessary to include some test problems on large files. By Management Information System (MIS) standards a 10 megabyte file is not very large, but it is not trivially small either, and it is not realistic to expect all installations interested in running the ACES to have much more free disk space than this.
The ACES contains a set of test problems which produce disk access patterns either typical of important classes of applications or potentially optimizable by common techniques. The set will demonstrate both "typical" behavior and the presence of optimizations.
Typical access patterns include:
+ Sequential
Processing a file in sequential (ascending or descending) order is a common operation in many applications.
+ Random
Some processing can be characterized as stochastic, following different distributions. The file size makes a major difference in these patterns.
- A uniform distribution may be a reasonable approximation to typical access patterns for some applications.
- Several empirical studies have observed that the frequency of access to records in a file corresponds to the 80-20 rule: 20% of the records account for 80% of the activity, and that smaller partitions of the file follow the 80-20 rule recursively. A pattern of references which correspond to this observation will exercise a system in a typical fashion. The direct application of this pattern corresponds to a hashing on the primary key reference, or if records were accessed through a B-tree, the actual pattern would contain frequent references to the directory pages. Because files are rarely organized precisely in frequency order, a model access pattern based on this distribution should map the frequency-based order to a random permutation of the physical order.
+ Cyclic
Many actual reference patterns display a cyclic pattern. For example, an indexed file being used for keyed access will show a typical pattern: Read the root; Read one of the first level directory pages; Read one of the second level directory pages; ...; Read a leaf page. The number of different pages at each directory level is fixed. The pages in the lower levels are referred to frequently. For a four level tree, the root page may be referenced every fourth logical I/O operation, and one of the first level directory pages (of which there will be a few dozen different pages) will also be referenced every fourth operation. This pattern can be exploited by locking the root and perhaps the first level directory pages in memory, or more flexibly by using a disk cache.
+ Combinations
The sequential, random, and cyclic patterns can be combined in different ways. For example, a program may be reading and comparing two sequential files; or a batch update program will process two different files, reading one (the update file) sequentially and making a random access to the second (assuming the master file is hashed).
If disk I/O systems had timing behavior like Random Access Memory (RAM), it would not be necessary to consider patterns of references. Disks have seek times, rotational latencies, and generally much slower access (compared to RAM). There are software techniques to compensate for the disk performance which are effective for some types of access patterns. Common optimization techniques are:
+ Allocating logically contiguous pages of a file to physically contiguous sectors of a disk. This is particularly effective for sequential access patterns because it permits a system to: perform multisector I/Os, transferring a full track at a time and minimizing delays for seeks and rotational latencies; minimize disk head movement when accessing "close by" logical pages of the file; and simplify and speed the mapping of logical pages to disk sectors. On some systems, including many UNIX implementations, the file system maintains a directory mapping logical pages of a file to physical disk sectors; for large files this mapping information is large and forces a (virtual) disk access to the extended file map in order to refer to logical pages with "large" page numbers. A disk cache will often permit the system to refer to mapping pages without a physical I/O. While this enhances performance, it occupies cache space which would otherwise be available for pages of user files. Other file system designs based on extents can provide a faster mapping between logical file pages and physical disk sectors by allocating blocks of contiguous sectors.
+ A disk cache in memory can speed access by replacing physical disk I/O operations with references to the copy in memory. This can be very effective for access patterns with small "working sets". Examples include directory pages of the file system and high level pages of an indexed file.
+ Specifying multiple buffers on sequential disk files in the same way as is done on tape files, can overlap I/O and processing and reduce the elapsed time required for problem execution. Multiple buffering is applicable independently of contiguous physical allocation; a system can initiate the reading of the next logical page of a sequential file without this page being the next physical sector. When used in combination with contiguous allocation, both techniques will work better.
+ Specifying large blocks on sequential disk files, in the same way as is done on tape files, speeds processing by reducing the number of physical I/O operations required to process a file. Reducing the number of physical I/O operations saves time in the Central Processing Unit (CPU) (the device drivers are executed less often) and a multisector read will not insert a complete disk revolution between sectors as might well happen if two consecutive single sector read requests were issued.
+ It is well known in the business data processing community that batching together a set of updates and sorting them in the same order as the master file can improve performance. Physical I/O operations might be avoided by processing updates to records on the same disk sector with one operation. Checking the contents of a single buffer is sufficient to ensure this. The head movement associated with processing the master file can be minimized by sorting the updates. Instead of each read referring to a random location in the master file, the sorted list of updates would specify a monotonic increasing sequence with the average distance between specified records being approximately the master file size divided by the size of the update batch. The optimizations appropriate to sequential files, such as multiple buffering, may also be applicable here. For large update files, update processing can be viewed as a merge-like process of reading the update file and the master file and searching for matches.
+ The setting of implementation-dependent options can impact performance. For example:
- Access control - Shared access to a file by concurrent users will involve overhead to ensure consistent usage. When a particular file is opened with exclusive use reserved, processing should be faster because the file system will not need to manipulate record locks.
- Allocating files which will be accessed concurrently to different physical drives and/or onto different disk controllers can minimize head movement and channel contention. Not allocating high frequency access data files to the same device where the operating system files are loaded can also reduce contention. Although the best performance for a program may result if every file is allocated to a different physical disk, many installations will not have enough different devices to do this, or may determine that the best installation-wide performance will be achieved when every program does not allocate files on every disk (file backup is simplified if one project's files are allocated to one device).
- Journaling is the recording of file activity to keep audit trails. It is necessary in some applications that such trails be maintained, and some operating systems may provide for automatic journaling, but it does have a performance cost which can be large.
- Read-after-write checking is often a user-selectable option, permitting the trade-off between performance and the surety resulting from verifying that a disk write was correctly completed.
- Some systems provide an option to assign a file to memory. When available, this can eliminate all physical I/O operations except for initial load and final save, although if loaded into a virtual memory system, there may be paging I/O operations associated with the "RAM FILE" depending on the available physical memory and the load on the system. But the overhead for this I/O is (one hopes) smaller than for disk resident files.
- There may be an option to selectively enable or disable a "write-through" option on a disk cache. When enabled, a physical write would be performed on the disk (and the cache) and the users would be assured that the disk file had been updated in the event of a system crash.
FORM strings are implementation dependent. Programs using default FORM strings will be portable, but are dependent on the implementation (and perhaps the installation and/or the characteristic of the account the program is executed under). Two different compilers which generate code runnable under the same operating system may select different defaults. It is important that users understand that specifying explicit FORM strings on some systems can have very large performance effects (orders of magnitude difference). A project which is seriously concerned with file system performance might establish coding conventions which require all programs to specify FORM strings and may be completely uninterested in the performance obtained by using default values.
These test problems construct and use several test files. These are instantiations of packages SEQUENTIAL_IO and DIRECT_IO for a 100 byte record type. Several sizes of files are used, ranging from 100 records (a 10 KB file) to 100_000 records. The test program generates these files during its initialization. The size is an adaptation parameter so that interested users may explore the performance of different sized files. For many small configurations, finding this much free space will be awkward. Relatively few configurations have ten megabytes of buffer space or disk cache, so the test problems should measure disk performance. Organizations developing applications for large systems should explore performance on files as large as they plan to use.
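A hypothetical sketch of the kind of instantiation used by these tests follows; the record layout and names are illustrative, not the ACES declarations:

   with DIRECT_IO;
   package FILE_TEST_TYPES is
      type TEST_RECORD is
         record
            KEY  : INTEGER;
            DATA : STRING (1 .. 96);
         end record;
      package RECORD_IO is new DIRECT_IO (TEST_RECORD);
   end FILE_TEST_TYPES;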
There are performance tests which focus on memory management issues. Some test the objects explicitly allocated and deallocated by a program via NEW and the instantiation of UNCHECKED_DEALLOCATION. A large set of tests focuses on the management of implicitly allocated objects which are discussed in more detail in the following paragraphs.
Many systems create temporary objects at execution time for Ada statements which do not explicitly contain an allocator. For example, functions returning an unconstrained type will typically allocate space on the heap to contain the function value. Two issues follow from this observation:
* First
Because the performance of storage allocators can vary greatly between systems, test problems which invoke allocators can produce results which vary greatly between systems.
* Second
If the memory space for these objects is not reclaimed for later use, available memory will "leak" as the program runs, and eventually space will be exhausted and the program will crash. This is a particularly nasty problem because:
+ The program source looks correct, and may work flawlessly on other systems, including earlier releases of the same compilation system. Software that was designed for reusability and portability is especially sensitive to this problem.
+ The existence of a problem may not be discovered until late in a project life cycle, perhaps only after the system is made operational. On target systems with gigabytes of virtual memory, it may take a long time to exhaust available memory; not so long that it will not eventually crash, but long enough that the application can pass most testing.
+ If users have suppressed checking, very strange behavior, such as an operating system crash, may result when the system eventually exhausts free space.
An implementation which can execute an Ada statement once but fails when it is executed repeatedly due to failure of implicit storage reclamation is not very robust. A compiler that does this can be validated, but the ACES will downgrade it during evaluation.
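A hypothetical sketch of a statement with no explicit allocator that may nonetheless allocate temporary heap storage follows: a function returning the unconstrained type STRING. If the implementation does not reclaim the temporary, repeated execution will leak storage:

   procedure IMPLICIT_ALLOCATION_EXAMPLE is
      function MAKE_LABEL (N : INTEGER) return STRING is
      begin
         return "item" & INTEGER'IMAGE (N);   -- result size not known statically
      end MAKE_LABEL;
      LONGEST : NATURAL := 0;
   begin
      for I in 1 .. 10_000 loop
         declare
            LABEL : constant STRING := MAKE_LABEL (I);
         begin
            if LABEL'LENGTH > LONGEST then
               LONGEST := LABEL'LENGTH;
            end if;
         end;
      end loop;
   end IMPLICIT_ALLOCATION_EXAMPLE;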
There are ACES test problems which:
* Detect if a system allocates and does not reclaim storage. In these cases, error reports against the system should be generated.
* Demonstrate the efficiency of a system in executing statements which (might) implicitly allocate and should deallocate storage. Some systems may be able to perform some of the test problems without allocating from temporary memory. That does not invalidate the test problems because it is desirable to have test problems which demonstrate performance differences between systems.
Because of differences between systems in their efficiency in manipulating temporary memory objects, and the possibility that some systems will be able to support some of the test problems without requiring dynamically allocated temporary memory, these test problems are expected to highlight differences between implementations.
Test problems use a simple strategy to test for storage reclamation. Each suspected language feature is executed 100_000 times in a test problem. If each execution allocates and does not reclaim as much as a few bytes, the space usage will quickly grow and exhaust space on systems without a large amount of usable free space. This will be effective at finding faults on the non-virtual memory systems typical of embedded applications. If the execution of these test problems exhausts space and raises STORAGE_ERROR, then the test problem will process the exception and report the STORAGE_ERROR (and not give an execution-time measurement).
The elapsed time to execute 100_000 Ada statements for some of these tests could be excessive (that is, many hours to run one problem). The problems make a preliminary estimate of the execution time for 100_000 statements and do not attempt it when the estimate is larger than a user modifiable time limit (ZG_GLOB3.EXCESSIVE_TIME).
Any other ACES test problem might also have the side effect of exhausting space when it is executed repetitively inside the timing loop, if it contains a language feature for which the compilation system allocates and does not reclaim space. Such behavior would be reported as a run-time error.
Elaboration in Ada occurs in several contexts which are significantly different with respect to performance, although not with respect to syntax.
For subprograms and blocks, entry implicitly invokes the elaboration of their declarative regions. Testing for this is extensive.
The test suite contains problems which elaborate nested (that is, non-library) packages and which perform sequences of statements similar to those which would be performed by a library package elaboration (that is, calling on explicit allocators to get dynamically determined space).
Several problems use a modified version of the timing loop to measure library package elaboration time. These versions declare multiple (25) library packages with an elaboration order defined by USE clauses and PRAGMA ELABORATE which forces a linear order of elaboration. Using this order, it is possible to place code in the initialization block of the package bodies to perform timing measurements. Although the error bounds obtainable in this way are not nearly as tight as those obtained from the normal timing loop, they can inform the reader of the order of magnitude of the time necessary to elaborate library packages. The elaboration is done so that it is possible to perform some consistency checks.
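A hypothetical two-package sketch of the technique follows (the ACES uses 25 such packages, and the names here are illustrative only); PRAGMA ELABORATE forces the body of TIME_STAMP_1 to be elaborated before TIME_STAMP_2, so the difference of the two recorded clock values bounds the time spent elaborating whatever lies between them:

   with CALENDAR;
   package TIME_STAMP_1 is
      ELABORATED_AT : CALENDAR.TIME;
      procedure FORCE_BODY;                  -- ensures a package body is required
   end TIME_STAMP_1;

   package body TIME_STAMP_1 is
      procedure FORCE_BODY is
      begin
         null;
      end FORCE_BODY;
   begin
      ELABORATED_AT := CALENDAR.CLOCK;       -- time stamp taken at body elaboration
   end TIME_STAMP_1;

   with CALENDAR;
   with TIME_STAMP_1;
   pragma ELABORATE (TIME_STAMP_1);          -- TIME_STAMP_1 body elaborated first
   package TIME_STAMP_2 is
      ELABORATED_AT : CALENDAR.TIME;
      procedure FORCE_BODY;
   end TIME_STAMP_2;

   package body TIME_STAMP_2 is
      procedure FORCE_BODY is
      begin
         null;
      end FORCE_BODY;
   begin
      ELABORATED_AT := CALENDAR.CLOCK;
   end TIME_STAMP_2;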
There are test problems to measure the cost of verifying that the predefined constraints are satisfied.
Some problems contain the same source text where the only difference between the problems is the presence (or absence) of suppression pragmas. Comparison of these problems would reveal possible performance savings by specifying suppression of checking.
Other sets of test problems are designed to test specific aspects of constraint checking code. Examples include: special cases of range checking, such as assigning a literal (whose range is verified at compile time) to a subtype variable; assigning an expression to a variable with a subrange (where data flow analysis can determine that some run-time bounds checking can be suppressed); and suppressing an access check (when the prior statement performed the same check).
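A hypothetical sketch of the paired-problem structure follows: the same assignment compiled with and without suppression of range checking:

   procedure CHECK_EXAMPLE (X : in INTEGER; Y : out NATURAL) is
   begin
      Y := X;   -- a range check is required: X may be negative
   end CHECK_EXAMPLE;

   procedure CHECK_EXAMPLE_SUPPRESSED (X : in INTEGER; Y : out NATURAL) is
      pragma SUPPRESS (RANGE_CHECK);
   begin
      Y := X;   -- the check may be omitted; execution is erroneous if
                -- X is in fact out of range
   end CHECK_EXAMPLE_SUPPRESSED;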
Certain test problems are designed to highlight implementation differences in constraint checking and control flow analysis. The test problems which observe optimization of elaboration checks contain several calls on subprograms defined in an external package. An optimizing compiler should be able to combine some of the checking code which verifies that the package body has been elaborated before calls on subprograms defined in the package are performed. These test problems must be compiled without suppressing checking for predefined constraints because they are intended to test the quality of the code generated to perform the checks.
A system must verify that a package body has been elaborated before it is proper to call on a subprogram defined in that package. By using the predefined PRAGMA SUPPRESS (ELABORATION_CHECK), it is possible to avoid all checking code when the system honors the request. However, when no suppression pragma (or comparable compiler option) is specified, an optimizing compiler still has available more efficient options than generating testing code before every call on a subprogram defined in an external package. For example, it might apply some data flow analysis and only generate one test for pre-elaboration of a package in a region independent of how many different calls on subprograms in the package are contained in the region.
The test suite contains test problems representative of how Ada is being used in practice. In most programs, a small fraction of source text accounts for a large fraction of the execution time of the program. The test suite contains examples of such time-critical sections of code extracted from MCCR applications. The performance of an Ada compiler on these examples will be a good estimator of the expected performance on a similar program. These examples were selected to represent typical Ada usage. Neither code complexity nor use of specific language features was a selection criterion. Using such examples could result in test problems which look like "Fortran with semicolons", but if that is the way the language is being used in practice, then the ACES should contain test problems representative of this usage.
The following subsections discuss, in turn, classical benchmarks, Ada in practice, and "ideal" Ada.
The test suite contains classical benchmark programs coded in Ada. Examples include Whetstone, Dhrystone, Livermore Loops, Ackermann's function, GAMM, sieve, puzzle, several sort routines, the eight queens problem, problems from the Computer Family Architecture study (LU, BMT, TARGET, HEAPIFY, and AUTO), and Ada versions of the inner loops discussed in the paper by D. Knuth, "An Empirical Study of FORTRAN Programs" in Software: Practice and Experience, Volume 1, Number 2, 1971. These tests are in the Classical (Cl) group in the ACES test suite.
Typical usage of Ada in actual practice is represented by test problems based on several development projects. The examples below, which are all found in the Application group of test problems, are drawn from:
* A simulator for the E-3A. This is a set of problems drawn from a flight simulator. There are eight test problems from navigation, avionics, and communications. These are in the Simulation subgroup.
* The Advanced Rotorcraft Technology Insertion program, from Navigation and Inflight Performance Monitoring modules. These are in the Avionics subgroup.
* Radar tracking algorithms. These are in the Avionics subgroup.
* Test problems which manipulate a balanced tree (insert, delete, search) are provided in the subgroup AVL.
* Test problems which manipulate a trie (insert, delete, search) are provided in the subgroup TRIE.
* Test problems which perform an A* search over a graph are provided in the subgroup Artificial Intelligence.
* An inference engine example is in the Artificial Intelligence subgroup.
* Test problems which exercise a neural network are provided in the Artificial Intelligence subgroup.
* Test problems which perform Cyclic Redundancy Check (CRC) operations are provided in the subgroup, Cyclic Redundancy Check. These are Ada implementations of a common application in communication programs.
* Test problems to explore different code-style approaches relevant to object-oriented design are provided in the Filter subgroup.
The ACES contains a lag filter application which has been coded in different ways, consistent with object-oriented coding styles. The significant point about the variations is the way state information and input/output are handled. The four implementations of the lag filter are discussed below:
+ One style uses procedure parameters, where the calling procedure retains the state information.
+ The second style uses a generic unit and includes the state information as part of the generic parameters. That is, each instantiation retains its own state and a procedure to advance the state of the system one time interval does not have any explicit parameters, because the identification of the input and output variables, the filter coefficient, and the state variables are all encoded into the generic unit.
+ The third style uses a generic unit where the instantiated procedure to advance time retains explicit parameters for input and output and perhaps state.
+ The fourth style defines a library package (containing the state variables and the access procedures) for each filter object in the design. This involves duplicating source code for each filter object being modeled.
These coding styles represent different design approaches. Users may be interested to know if there are significant performance differences associated with the different approaches. Inlining could be specified for each approach.
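A hypothetical sketch of the first style follows (a first-order lag filter; the names and update formula are illustrative, not the ACES source); the caller owns the state and passes it to the filter on every step:

   procedure LAG_FILTER (INPUT       : in     FLOAT;
                         COEFFICIENT : in     FLOAT;
                         STATE       : in out FLOAT;
                         OUTPUT      :    out FLOAT) is
   begin
      STATE  := STATE + COEFFICIENT * (INPUT - STATE);
      OUTPUT := STATE;
   end LAG_FILTER;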
Ada in practice may be criticized because current practices represent techniques adapted from experience in other languages, and may be just "Fortran with semicolons". Such programs may ignore Ada language features not present in prior languages, such as exception processing or tasking. The ACES contains problems initially designed with Ada capabilities in mind, which do not artificially constrain themselves to a subset of the language.
The memory size of programs is an important attribute to many MCCR applications. On embedded systems, memory is a limiting resource. On some target processors, such as the MIL-STD-1750A, while physical memory may be available, maintaining addressability is critical and a small code expansion rate can help system design by reducing the need to switch memory states. There are two size measurements of most interest to Ada projects: the amount of code generated inline to translate each statement (Code Expansion Size); and the amount of space occupied by the Run-time System (RTS).
The code expansion size is measured in the timing loop. (See Section 6.7 "CODE EXPANSION MEASUREMENT".) It is the space, in bits, between the beginning and end of each test problem. This is an important metric to many users.
The size of the RTS is an important parameter to many projects. Space taken by the RTS is not available for use by application code, so a small RTS will permit larger applications to be developed.
It is not possible to measure RTS size in a portable manner, therefore the ACES does not attempt it.
The time to compile compilation units can be important to projects which have large volumes of code or where compilation is expensive. Most ACES programs were developed to measure execution-time performance aspects, and do not necessarily represent a set of compilation units which will expose all the relevant compilation time variables. However, they do represent a set of programs which will exercise a compiler, and observing the compile time of these programs can give insight to the overall compilation rates.
Compiler speed is sensitive to various system tuning parameters, such as the amount of main memory, working set sizes, contention for CPU and secondary storage, and the placement of files on disk. File placement for the compiler executable, compiler temporaries, the Ada program library, the Ada source program, and operating system files can affect performance. Sometimes users may be able to control the location of these files and sometimes (on some systems) these are determined by how the compiler was initially installed. In a clustering environment, users should be aware that a reference to a file on a remote node will typically require MUCH more time than access to a local file. The use of disk caching software can greatly enhance compile speeds. The extent of file fragmentation can also influence I/O performance (access to contiguous files will typically be faster than to files which are fragmented all across the disk). Having "identical" hardware and the same version of the compiler may not be sufficient to permit users to replicate compilation speed results. It is recommended that users configure the compiler and associated files as they expect to use them. It would not be unusual for "traditional" system tuning methods to be able to enhance compilation speeds significantly over an "untuned" installation and use.
The Systematic Compile Speed group is designed to explore the effects on compile time of different language feature usages and coding styles, such as the size of a program library (number of units), the size of the compilation units, the number of WITHed units, the use of generics, and whether a system closure facility is provided to bring all obsolete library units up-to-date.
CA can analyze compile-plus-link time, or compile time and link time in isolation. The performance tests gather compilation times and link times separately.
The ACES measures the time to compile files and to link programs. CA does not compute compilation rates in lines per minute; it deals directly with these measured times. The SSA program, however, does report lines per minute.
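For example, 2,500 source lines compiled in 30 seconds correspond to a rate of 5,000 lines per minute; the line count and time used here are purely illustrative.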
Several features may impact compile speeds:
* Size - Several code generator paradigms build a representation of a program (or a piece of a program such as a subprogram, basic block, linear sequence of statements) and perform various manipulations on the structure to try to produce good code. Some of the algorithms used are of more than linear time complexity; they run much more slowly on large units than on short ones. The hope is that better code will be generated in exchange for the longer compilation time.
The fixed overhead associated with each phase of a compiler may mean that every compilation takes some minimum time; a short program may not take much less time to compile than a slightly longer one.
* Generic instantiation - The amount of time associated with instantiating a generic unit can be substantial. The compiler must check that type signatures match, and if it is treating instantiations as a form of "macro expansion" it may try to optimize the generated code. When an actual generic parameter is a literal, the compiler may use this value to fold expressions in the generic body (see the sketch following this list).
* WITH clauses - Each library unit referenced in a WITH clause can require considerable processing time, including the time required to search the program library for the definition. The state of the program library can greatly influence compile-time performance. Searching a program library which contains several hundred units can be much slower than searching a library which is nearly empty; some implementations may use a serial search, others a hashed lookup. Allocating space for new objects in the library may also depend on the state of the disk. If the available disk space is fragmented, finding a large block of contiguous storage may be time consuming, and linking together noncontiguous storage may slow down all later accesses to the library object.
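As a small illustration of the generic instantiation point above, consider a generic unit whose formal object parameter is supplied with a literal actual. The unit and parameter names below are hypothetical, and whether the folding actually occurs depends on the implementation.

   generic
      SCALE : in INTEGER;                      -- generic formal object
   package SCALED_OPS is
      function SCALE_UP (X : INTEGER) return INTEGER;
   end SCALED_OPS;

   package body SCALED_OPS is
      function SCALE_UP (X : INTEGER) return INTEGER is
      begin
         return X * (SCALE * 10);              -- with a literal actual, a compiler that
      end SCALE_UP;                            -- expands instantiations may fold
   end SCALED_OPS;                             -- "SCALE * 10" to a single constant

   with SCALED_OPS;
   package SCALED_BY_4 is new SCALED_OPS (SCALE => 4);   -- literal actual parameter

A compiler that shares code among instantiations cannot specialize the body in this way, so the relative cost (and benefit) of instantiation may differ noticeably between implementations.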
Note that error messages and user friendliness (creation of cross reference listings, set/use lists, text formatting) are important aspects of compilers which can influence speed of compilation, but whose utility may well be worth a serious degradation in raw speed.
Ada's rules for obsolete units require the recompilation of some units when changes are made to the units on which they depend. Several Ada compilation systems provide user support to identify dependent units, and in some cases to automatically recompile all affected units to bring a program library "up-to-date". Such facilities can be helpful and productive; however, some recompilation orders, although valid, will recompile the same unit more than once. This is an obvious performance problem, and if a system supports an automatic recompilation facility, users will want to know how efficient it is. Automatic recompilation and other facilities provided by the Ada program library management system are also tested by the ACES Library Assessor.
The emphasis of the ACES is on performance. Testing correctness of implementations is the charter of the Ada Compiler Validation Capability (ACVC).
Some test problems are constructed to measure the performance of language features which may not be supported on all target systems. If these test problems fail, then a user will know that the target system does not support the feature. Test problems for some implementation-dependent language features may need to be adapted for individual targets. This includes tests for tying tasks to interrupts, tests requiring support for pre-emptive priority interrupts, tests using UNCHECKED_CONVERSION, tests specifying time slicing, a test using PRAGMA INTERFACE to assembler, and tests packing boolean arrays (and requiring that they are "really" packed by performing UNCHECKED_CONVERSION on them).
Implicitly, by compiling and running a substantial volume of Ada code through a system, users will get some idea of its usability: they will learn something about how the system works and how convenient it is to use. Although this is not a quantitative assessment, it should not be dismissed lightly.
There are explicit tests for the usability of several aspects of a system, including symbolic debuggers, Ada program library managers, diagnostic messages, and capacity limits.
The result of executing the ACES Debugger Assessor scenarios will be a summary report of findings, completed by the user. In reviewing the report, a reader must keep several points in mind, and these are detailed in this section.
Some systems may provide a machine-level debugger which is ignorant of Ada symbolic names and linguistic constructions. On such a system, to display the value of an Ada variable, the programmer would map the Ada variable name to a memory address and examine that address; the compiler/linker may print this mapping information in listing files. As a general rule, such machine-level debuggers are not as convenient to use as symbolic debuggers, although they typically impose few restrictions on the programs to be executed under the debugger. The ACES Debugger Assessor is not intended to evaluate a machine-level debugger.
The LRM does not levy any requirement that an Ada compilation system provide a symbolic debugger, nor does it define a standard set of capabilities or an operational interface for a debugger that is provided. The ACES Debugger scenarios are designed to emphasize the determination of functional capabilities. They specifically do not consider the elegance or efficiency of the user interface, although this may be important to users in addition to the functional capabilities of a debugger.
Some organizations may want to evaluate the efficiency of a debugger interface by measuring elapsed time or by counting either keystrokes or commands. They must be careful to ensure that similar evaluation techniques are used on the different systems under test. Measuring elapsed time may result in evaluating the typing speed of the operator more than any inherent properties of the debugger or target system, and may be particularly misleading when a programmer has to stop and read a manual for an unfamiliar debugger in the process of performing the scenarios. Counting keystrokes or commands can also be misleading. A system which permits macros or user-defined function keys can arrange to execute an entire scenario with one keystroke. This may be very non-representative of typical performance; although users could define a macro that would save more than one keystroke every time, very few users actually will do this unless they anticipate using the macro multiple times.
Organizations evaluating several systems must decide what "comparable" usage on the target systems would be. This will generally require a case-by-case comparison of each system's facilities and the macro/function-key usage the organization anticipates will be typical in their projects.
The functional capabilities to be tested by the ACES Debugger Assessor were selected after a review of the capabilities of existing debuggers (for Ada and other languages) and of capabilities whose absence has hampered the debugging of programs in previous systems. No priorities were assigned to the individual capabilities; each organization may have its own priority ranking of debugger capabilities. The template lists each capability separately so that users can easily see how the systems differ. Not all the scenarios are of equal importance. For example, a debugger which can perform every scenario except examining the value of a variable would probably be of little practical value. The ACES Debugger Summary Report form provides a place for users to record subjective assessments and general comments. While these are not easily compared between different systems, or between different individuals using the same system, the information they provide can be valuable.
The ACES Debugger Assessor scenarios are designed to ask specific questions in the context of specific programs; the results should not depend strongly on the experience of the evaluator with the specific debugger. However, the completed ACES Debugger Summary Report will reflect judgments made by the evaluator. The debugger under test may provide a simple and elegant way to perform a particular Debugger Assessor scenario, but if the system documentation is unclear, the evaluator may conclude that the scenario cannot be performed. Therefore, the ACES Debugger Summary Report may also reflect the quality of the system documentation as much as the quality of the debugger. If the debugger can single-step through a program and if the programmer is very patient, almost any scenario can be performed. For example, on a debugger which does not support watchpoints on variables, a programmer could stop after every statement and examine the variables to be monitored.
The Library Assessor Summary Report should be filled out in order to reflect the capabilities discovered in running the library scenarios. The primary purpose of the Program Library System Assessor is to determine the functional capabilities of an Ada program library manager, although it also collects some performance data (elapsed time and disk space size) and determines whether the capacity of a system is large enough to accommodate the provided scenarios.
The LRM levies only minimal requirements on a program library system; after a unit is compiled into a library it shall be possible to subsequently compile units which reference it. The LRM (Chapter 10) suggests that a programming environment provide commands for creating the program library of a given program or of a given family of programs and commands for interrogating the status of the units of a program library. The form of these commands is not specified by the LRM.
Ada compilation systems provide program library systems with varying user interfaces, functional capabilities, and efficiencies. The ACES provides information for users to assess the functional capabilities of systems (and to a lesser extent, their efficiency). The different design approaches determine the framework in which the operations are performed; they are not directly evaluated. The ACES Library Assessor approaches are based on providing a set of scenarios consisting of compilation units, operations to perform using them, and instructions for evaluating the system responses.
An ACES user will have to adapt each scenario to the target system. On reviewing the completed summary report, an ACES report reader must be aware that the failure to find a capability does not necessarily reflect a failure in the system; the tester could have overlooked a supported capability, or the execution of the scenario which would determine the capability might have exceeded the capacity of the configuration. For example, there might not have been enough free disk space on the test system to enter all the compilation units of a scenario into a program library. The user should draw very different conclusions from this situation than from one in which the system would not accommodate the scenario even if sufficient resources were available.
Different projects will assign different priorities to different capabilities. Support for concurrent library access will be critical to projects which will have cooperating multiprogrammer teams; it may be unimportant to single-user standalone systems.
The ACES scenarios emphasize determining functional capabilities. They do not attempt to evaluate the elegance or efficiency of the user interface; the scenarios are equally applicable to a "point and shoot" graphic-based user interface and to a command-line based user interface. They do not try to count the keystrokes or mouse clicks necessary to perform an operation (for any system supporting a macro capability this would be awkward; an evaluator wanting to make a system look good would define an entire scenario as a one character command). Comparing the sequences of keystrokes (or mouse clicks) required to perform the different scenarios on different systems can provide insights into the efficiencies of the user interface.
Library systems provide a structure in which Ada programs for the compilation system will be developed. Until users understand the structure that a library management system is designed to provide, they are likely to judge all of its operations to be awkward. They may still consider the operations awkward after they understand the compilation system's library design, but such an assessment is properly made only after learning the system. The ACES scenarios are designed assuming a "typical" library system design, which has mapped fairly easily to the sample systems tested during the development of the ACES, but that is no guarantee that it will map easily to all implementations.
A library system may have features not exercised by the set of scenarios. The Library Assessor Report form provides a space for user comments which can be used to report additional capabilities. Readers should review any comments in this area and decide how important they consider the additional capabilities.
The ACES Diagnostic Assessor Summary Report should be completed in order to reflect the discoveries made in running the diagnostic tests. In reviewing the report, a reader must keep several points in mind, and these are detailed in this section.
The LRM requires that a compilation system reject illegal programs, but it neither specifies the form/contents of diagnostic messages, nor exhaustively lists the conditions which should generate warning messages. The ACES Diagnostic Assessor tests include examples of illegal programs where the intent is to determine whether specific points are mentioned in the diagnostic message which would help explain and isolate the problem. The ACES Diagnostic Assessor tests also include examples of programs where a helpful compilation system would generate a warning message stating that the code, while not illegal, contains "suspicious" constructions and may contain logic errors or inefficiencies.
On reviewing the completed summary report, an ACES reader must be aware that the user has made a judgment about whether the generated message contained the anticipated information.
Each organization may have its own priority ranking of classes of diagnostics. The ACES report presents the categories separately so that users can easily see how the systems differ. Validated systems will reject illegal programs, though how clearly the rejection is explained varies. Large differences between systems occur with respect to the processing of warnings, and to a lesser extent in the presentation of non-local information.
The ACES Diagnostic Summary Report provides a place for users to record subjective assessments and general comments. While these are not easily compared between different systems, or between different individuals using the same system, the information they provide can be valuable.
The number of Diagnostic Assessor tests is fairly small. There may be some compilation systems with generally good diagnostic messages which happen to do poorly on the particular examples included in the ACES Diagnostic Assessor tests; or a system with generally poor diagnostics may do well on the ACES examples. If implementors start to "tune" their systems to do well on the ACES Diagnostic Assessor examples, the ACES results may not reflect a good sample of test cases. This is a larger risk for the diagnostic tests than for the performance tests because the relatively small number of examples makes it easier to modify the ACES results by "small" changes in the compilation system.
The ACES Capacity Assessor Summary Report should be completed in order to reflect the discoveries made in running the capacity tests. In reviewing the report, a reader must keep the following points in mind.
The LRM generally does not specify capacity limits. It is not required for validation that a compilation system accept programs which are "large".
The ACES Capacity tests provide for user-specified upper and lower limits, and for a time limit for each feature test. The ACES provides a set of default values for these parameters, which can be modified by a user. The ACES summary report will include the limits tested against, along with the test results.
Capacity limitations may result from configuration limitations rather than from "hard" limits coded into the compilation system. Limitations may be due to available resources, such as the amount of main memory (and the size of the swap file), or the amount of available disk space (for temporary files), or on some systems the disk space allocated to the Ada program library. Increasing the resources available to the system may increase the size of a program the system will accept. Some experimentation may be required to determine whether a capacity test is revealing a "hard" limit or not; whether this is done will depend on the effort the ACES user is willing to invest in the Capacity Assessor.
The seriousness of a system rejecting a test program with feature size "N" will vary between projects. It is of no importance to a project which would never try to compile and/or execute programs close to that limit. On the other hand, if the project's anticipated usage of the feature is large enough that avoiding "tripping over" system limits would affect the way the project develops code (or force it to modify pre-existing code), then the problem is more serious. The ACES can offer little advice about anticipated project usage except for one general principle: it is important to allow for a safety margin because programs often end up being larger than initially anticipated.
The fact that the test program for feature "X" accepts size "N" does not guarantee that the system will accept all user programs of size less than "N". Capacity limits often interact, particularly when the root cause of the capacity limitation is the amount of system resources available to the compiler, linker, or run-time system.
It is NOT recommended that the ACES user try to find the maximum size accepted by the system for every feature. When a capacity test shows that the system accepts the largest feature size tested, it is true that the user does not know how much larger a program could be and still be accepted. However, if the upper limits were chosen to reflect expected usage, then passing this size should be sufficient information. After all, if testing demonstrates that a compiler can accept an expression with a million nested parentheses, it is of little practical interest to discover whether it might accept expressions with twice as many!
The ACES Capacity Summary Report provides a place for users to record subjective assessments and general comments. In particular, the testers should comment on capacity tests which provoke program (or system) crashes. A system which generates an understandable diagnostic message when a capacity limitation is exceeded can be more valuable than a system which accepts slightly larger programs, but which operates erratically when its limits are exceeded.
The number of Capacity Assessor tests is fairly small. There may be systems which limit language features that are not tested in the Assessor but are important to a particular project. If a project knows that it will stress the capacity of a system with respect to a particular feature (perhaps from experience on another compilation system), it may be worthwhile for the project to develop a test program for that feature using the existing ACES Capacity tests as a model (or to simply use an existing program which was "too large" on the other system).
It should be clear from the discussions in Section 3 that the ACES is not a simple collection of problems with a hierarchical structure.
The philosophy of the ACES is that end users will not have to examine in detail each individual test problem. Rather, they should run the test suite and let the CA tool isolate problems where a system does either unusually well or unusually poorly. These problems can then be examined in more detail to try to determine what characteristics of the problems were responsible for the unusual behavior. Of course, measures of overall performance are also collected and will be useful in comparing systems.
Each test problem is measured by inserting it into a template which will, when executed, measure and report on the execution time and code expansion size of the test problem contained within it, as discussed in Section 6.5 "HOW TEST PROBLEMS ARE MEASURED".
There is an extensive discussion in Section 6.8 "CORRECTNESS OF TEST PROBLEMS", on issues which might make a potential test problem invalid. The basic point to remember is that the test problem must be constructed to meet the following guidelines:
* The problem can be repetitively executed, and will follow the same path on each execution. In particular, the same control paths should be taken, and the repeated execution of arithmetic assignments should not produce numeric overflow (a test problem which increments an integer variable on each repetition is a mistake since it will eventually raise a CONSTRAINT_ERROR).
* An optimizing compiler must not be able to "unduly" optimize the problem. In particular, it should not be able to detect that the test problem is invariant with respect to the timing loop code and only execute it once. Most test problems will need to have variables initialized to ensure proper execution (for example, to prevent numeric overflow or other constraint errors), and if this initialization code is incorporated into the test problem using literal assignments, an optimizing compiler may be able to fold successive statements, essentially performing the intended test problem at compile time. While tests for folding are important, such folding should be avoided when it is not the purpose of the test problem being developed.
* The test problem should be valid Ada. For example, values of variables should be defined before being referenced, even though an ACES user may be developing potential test problems on a system which happens to initialize otherwise uninitialized variables to zero, as is permitted by the LRM. Ada programs can be written which depend on the order of elaboration of library packages. A good test problem will work with any valid (as defined by the LRM) order of elaboration, and not just the one adopted by the system used to originally develop the test problem.
* Test problems should avoid implementation dependencies except where the purpose of the problem is to test a feature which is inherently implementation dependent. For portability, test problems should not use the predefined types INTEGER and FLOAT, since their range is implementation dependent. In particular, Chapter 3 of the LRM states that a discrete range where both bounds are of type UNIVERSAL_INTEGER will be implicitly converted to the predefined, and implementation-dependent, type INTEGER. This should be avoided, since INTEGER'SIZE may vary on the same target based on implementation decisions in the compiler. In particular, a test problem should not contain a code fragment such as:
FOR I in 1..10 LOOP ... END LOOP;
where the type of the FOR loop index "I" will then be the predefined implementation dependent type INTEGER. It is preferred in such cases to use a code fragment such as:
FOR I in ZG_GLOB1.INTEGER16(1)..ZG_GLOB1.INTEGER16(10) LOOP...END LOOP;
or
FOR I in ZG_GLOB5.INTEGER32(1)..ZG_GLOB5.INTEGER32(10) LOOP...END LOOP;
which will effectively request a specific size. Even when no syntax errors are introduced by the use of one type or the other, there may be considerably different timings and implications for register usage between the two versions. When such a discrete range is used in an array declaration, the implicit use of the predefined type could force an unnecessarily large integer type for all index computations.
The simplest way to comply with this guideline is to use the types defined in the zg_glob packages, but user programs which declare their own appropriately sized integer types are also acceptable.
* Naming conflicts with the timing loop variables have to be avoided. A user can ensure this by studying the zg_glob packages. The simplest way is to define the user's test problem as a procedure with no parameters and compile it into the Ada program library. A standardized driver program which includes the timing loop code can then call on this procedure. See Figure 6-2, Timing Loop Template, for an example of what such a driver should look like. The user supplied name is a string parameter passed to "zg_glob2.put_test_name". The call on the user program (test problem) goes between "startime" and "stoptime0". The description is optional.
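The following is a minimal sketch of a user-defined test problem which follows the guidelines above: it is a parameterless library procedure, it uses a sized zg_glob1 type rather than predefined INTEGER, its local state is reinitialized on every call so that repeated execution follows the same path without overflow, and its result is stored in a package-level variable so an optimizer cannot simply discard the computation. The package USER_SINK, the procedure name, and the body of the loop are illustrative assumptions; the driver which calls the procedure should be patterned on Figure 6-2, not on this sketch.

   with ZG_GLOB1;
   package USER_SINK is
      -- Package-level variable: storing the result here keeps an optimizer from
      -- treating the test problem's computation as dead code.
      RESULT : ZG_GLOB1.INTEGER16 := 0;
   end USER_SINK;

   with ZG_GLOB1, USER_SINK;
   use ZG_GLOB1;   -- make the operators of the zg_glob1 types directly visible
   procedure USER_TEST_PROBLEM is
      -- Parameterless procedure compiled into the program library.  The standardized
      -- driver (patterned on Figure 6-2) WITHs this unit, passes the test name string
      -- to zg_glob2.put_test_name, and calls it between "startime" and "stoptime0".
      SUM : ZG_GLOB1.INTEGER16 := 0;   -- reinitialized on every call, so repeated
                                       -- execution is repeatable and cannot overflow
   begin
      for I in ZG_GLOB1.INTEGER16(1) .. ZG_GLOB1.INTEGER16(10) loop
         SUM := SUM + I;
      end loop;
      USER_SINK.RESULT := SUM;
   end USER_TEST_PROBLEM;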
To use CA to compare results, including those on user-defined test problems, see the User's Guide Section 9.1.6 "Adding Subgroups, Tests, and/or Main Programs".
A form for submitting a change request and a sample template are included in the appendices of this document. Section 6.8 "CORRECTNESS OF TEST PROBLEMS" contains additional discussion on constructing test problems.
Three primary sets of output are available to the user: the operational software results file generated by executing the ACES performance tests, the automated analysis produced by the multi-system Comparative Analysis tool (CA), and the Single System Analysis (SSA) report. Each of these will be discussed below, as will the optional reports produced by Condense.
The reader is reminded of the distinction between test problems and test programs, as explained at the beginning of Section 3.
Each test problem, when executed, generates a standardized output describing the timing and code expansion size measurements produced when the test problem is executed.
An ACES reader should know the following external information about a system to properly interpret the results.
* Identification of the system being tested.
+ Version number of compiler and the compiler options used in the different command files.
+ Operating system version number and relevant parameters (e.g., priority, available memory).
+ Hardware characteristics: amount of memory, type and number of disk drives (which can greatly impact the compile speeds).
+ Whether other users were on the system when measurements were made.
+ The dates the measurements were made.
* Identification of adaptations performed.
In particular, the reader must know what math library was used.
+ The math packages described in Annex A of the Ada 95 RM.
+ An implementor-supplied generic math package conforming to the recommendations of the Association for Computing Machinery, Special Interest Group on Ada, Numerics Working Group (NUMWG).
+ An interface to an implementor-supplied math library.
+ The ACES-supplied generic math package using the implementation-independent version of zm_depen (MATH_DEPENDENT).
+ The ACES generic math package using a version of zm_depen tailored to the target.
For a detailed discussion of the alternative ways of adapting math, refer to the User's Guide Section 5.1.6.1.1 "Alternative Methods for Math".
When compilation systems are compared, it is possible that performance differences in the test problems which use math functions may be due to the technique used to implement the math library. An optimized, vendor-supplied math library might be considerably faster than the ACES math library with the representation-independent version of zm_depen. Because the NUMWG recommendations were not included in the Ada 83 language standard, and because no Ada 83 implementation is required to provide a math library conforming to NUMWG (or to provide any math library at all), the ACES includes a math library designed for maximum portability. If a major project is committed to using a compilation system, it might invest the resources to develop (or adapt) an optimized math library which is tuned to the target and which provides significantly better performance than the ACES portable math library.
The following paragraphs describe the format of the results file produced by compiling and running the test suite. Figure 5-1 gives an example, taken from the results of compiling and running the Conversion Fixed subgroup of the Arithmetic group. The execution results are described in the User's Guide Section 6.2.2 "Execution Time Log File", so that discussion is not repeated here, although the example does include execution results. A self-targeted system may include mixed compilation and execution-time results; a cross compiler is likely to place these results in different files. The ACES program Condense handles both cases automatically. The headers, bracketed by backslashes "\", tell Condense what to expect. This output may be interspersed with various messages from the operating system and the compiler; the headers allow Condense to easily ignore these extraneous messages.
===========================================================================
\ACES begin\ OVERHEAD
\ACES begin el\ 40199.0900 19 SEP 1991
\ACES end el\ 40199.6100 19 SEP 1991
\ACES end \ OVERHEAD
\ACES begin\ OVERHEAD
\ACES begin lnk el\ 40200.1600 19 SEP 1991
\ACES end lnk el\ 40200.6800 19 SEP 1991
\ACES end \ OVERHEAD
\ACES begin\ AR_CX01_.ADA
\ACES begin el\ 40218.2500 19 SEP 1991
\ACES end el\ 40227.8700 19 SEP 1991
\ACES end \ AR_CX01_.ADA
\ACES begin\ AR_CX02_.ADA
\ACES begin el\ 40230.0500 19 SEP 1991
\ACES end el\ 40239.3400 19 SEP 1991
\ACES end \ AR_CX02_.ADA
\ACES begin\ AR_CXM01.ADA
\ACES begin el\ 40241.4800 19 SEP 1991
\ACES end el\ 40249.6000 19 SEP 1991
\ACES end \ AR_CXM01.ADA
\ACES begin\ AR_CXM01
\ACES begin lnk el\ 40250.2000 19 SEP 1991
!INFO: Linking AR_CXM01
\ACES end lnk el\ 40261.8200 19 SEP 1991
\ACES end \ AR_CXM01
\ACES begin mainprogram\ ******************** ar_cxm01
                                                outer loop count
                                                inner loop count  |
                        bits    microseconds                   |  |
    problem name        size    min         mean               |  |   sigma
\ACES_problem_name\ ar_cx_conv_fixed_01
fix1:=afix1(fix2);
\ACES_measurements\ 360.0 2.4836E+00 2.5437E+00 17 4 2.8%
\ACES_problem_name\ ar_cx_conv_fixed_02
fix2:=afix2(fix1);
\ACES_measurements\ 160.0 2.1310E+00 2.1675E+00 17 3 1.5%
\ACES end mainprogram\ ******************** ar_cxm01
===========================================================================
It is not ordinarily necessary for the user to understand this output, but for those who are curious, an explanation follows.
Subgroup results are self contained and always begin with the measurement flag, "\ACES begin\ OVERHEAD", which Condense uses to correct for the overhead time associated with measuring compile and link times. The timing measurements are all written by Ada programs. The first measurement is preceded by "\ACES begin" and the second is preceded by "\ACES end". The headers also indicate which kind of measurement is being made: "el" is the abbreviation for elapsed time; "cp" is the abbreviation for CPU time. The time stamps bracket an activity which we wish to time--a compile or a link. A complete set for a compile looks like:
\ACES begin\ AR_CXM01.ADA
\ACES begin el\ 40241.4800 19 SEP 1991
\ACES end el\ 40249.6000 19 SEP 1991
\ACES end \ AR_CXM01.ADA
We start with the file name, then the time stamp. The compile is then begun. After the compile, we have another time stamp, and then the file name is repeated, to confirm that we have not lost our place. The set for a link time is exactly parallel. We start and end with a main program name. This is actually sufficient, since Condense knows the main program names from the Structure (weights) file. The time measurements here also say "lnk", giving us additional confirmation.
\ACES begin\ AR_CXM01
\ACES begin lnk el\ 40250.2000 19 SEP 1991
!INFO: Linking AR_CXM01
\ACES end lnk el\ 40261.8200 19 SEP 1991
\ACES end \ AR_CXM01
The general pattern is several compiles, one link, and perhaps some execution results. The explanation of execution results is in the User's Guide, Section 7 "RUNNING THE PERFORMANCE TESTS", Section 7.3 "OUTPUT". Most test problems are in separate files. Then the main program is compiled (it WITHs the individual test problems which have just been compiled) and linked. The main program may then be executed on self-targeted systems. (There are exceptions. The Compilation Unit Size subgroup of the Systematic Compile Speed group contains 25 test problems whose only purpose is to record compile-time measurements. These problems are not bound into executables, and therefore produce no run-time measurements. The compile-time measurements produced by these tests are used by the Library Assessor; see Section 8.3.)
NOTE: When using Harness-generated scripts, compile and link time stamps may not be generated. For each script generation step, time stamps are omitted if there is a subgroup in which some (but not all) of the tests are selected.
The Condense tool produces three optional reports and two transportable data files in addition to the execution and compilation results databases. The Condense reports should be considered "intermediate". They may not reflect the final state of the database used for analysis, because the user may adjust the database manually (adding error codes or selecting different results) after Condense is run. Final error counts should be taken from the CA and SSA reports.
The Condense reports are useful in determining what tests did and did not run, how many times tests were run, and which tests have exceptional results. For execution results, the Harness provides this information interactively.
Condense input files and processing are discussed in more detail in the User's Guide Section 9.3. Input files (and their default names) are:
* System Names file - "za_cosys.txt".
* Structure (weights) file - Name given in System Names file; default is "za_cowgt.txt".
* Request file - Name given in the System Names file; default is "za_cnreq.txt".
* Log files - Names given in the System Names file.
* Database files (if any) - Names given in the System Names file; following each system name are the descriptions for execution_condensed and compilation_condensed along with the name of the file. The default extensions are ".e00" and ".c00", respectively. Additional runs of the program produce additional files, numbered consecutively.
* Harness System Name file, "zh_cosys.txt" - Needed only for option to merge execution-time/code-size databases created by Harness.
* Harness Execution Database files - Names given in Harness System Name file; defaults are "??_test.dbs" where "??" is the group abbreviation. Needed only for option to merge databases created by Harness.
Condense output files are the database files corresponding to each log file input, two transportable data files, and the report files. The database files and the transportable data files are discussed in more detail in the User's Guide Section 9.3. The optional reports are produced if selected in the Analysis Menu or the Condense Request file.
Condense produces the following reports, the default names for which are given in parentheses. They consist of the name of the system plus the individual extensions.
* No Data Report ("system_name".nda)
* Exceptional Data Report ("system_name".exc)
* Multiple Results Report ("system_name".mul)
Each report includes a section for execution-time data and one for compilation/link time data.
Each report begins with a header that lists the system information as given in the System Names file ("za_cosys.txt"). The system name and system comments, compilation log file name and database file names are listed. The execution log file name is listed unless execution data was obtained from the Harness databases, in which case the directory containing the Harness files is listed.
In each report, test problem names have the format GG_SS_NAME, where 'GG' is a group abbreviation, 'SS' is a subgroup abbreviation, and 'NAME' identifies the test within the group and subgroup. Main program names have the format 'GG_SSMNN' where 'GG' and 'SS' are group and subgroup abbreviation, the 'M' is a constant that appears in this position in all main programs, and 'NN' is the number of the main program within the group. Test unit file names have the format '(GG_SSMNN) gg_ssNN_' where the string enclosed in parentheses is the name of the main program that calls the unit contained in the file, and the file name consists of a group and subgroup abbreviation and 'NN', which indicates the number of the file in the group. (The Structure (weights) file, "za_cowgt.txt", contains the mapping of files to tests.)
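For example, in Figure 5-3 the problem name AR_CF_CONV_FLT_01 denotes a test in the Conversion Float ('cf') subgroup of the Arithmetic ('ar') group; the main program name GN_INM01 denotes main program 01 of the Generics ('gn') Instantiation ('in') subgroup; and the file entry '(GN_INM01) gn_in01_' identifies file 01 of that subgroup together with the main program which calls the unit it contains.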
If Condense is run in incremental mode (adding log file data to an existing database), it reprocesses an entire group if any new data is added to that group. Such a group will appear in the Condense reports. If no new data is added to a group, the results for that group are copied from the old to the new database, and do not appear in the new reports.
Execution results that are marked "not_applicable" do not appear in the reports. When compilation times are checked against corresponding execution times, execution times that are marked "not_applicable" are treated as if they were missing.
The No Data Report lists each test and file for which no results were found in the log files. See Figure 5-2 for an example of the No Data Report.
ADA COMPILER EVALUATION SYSTEM
NO DATA REPORT <Date> <Time>
======================================================================
SYSTEM: system_1_name
COMMENTS:
EXECUTION TIME/CODE SIZE LOG : system_1_name.log
EXECUTION TIME/CODE SIZE DATABASE : system_1_name.e01
COMPILATION TIME LOG : system_1_name.log
COMPILATION TIME DATABASE : system_1_name.c01
======================================================================
======================================================================
EXECUTION TIME / CODE SIZE - NO DATA
======================================================================
PROBLEM ERROR
----------------------------------------------------------------------
ap: APPLICATION
de: DATA_ENCRYPTION_STANDARD
AP_DE_DES_05 EXECUTION
AP_DE_DES_06 EXECUTION
xh: EXCEPTION_HANDLING
pn: PRAGMA_NUMERIC_ERROR
ALL MISSING
gn: GENERICS
in: INSTANTIATION
GN_IN_ENUM_IO_01 MISSING
GN_IN_ENUM_IO_02 MISSING
======================================================================
EXECUTION : The log file contains a test name but no measurements.
It is assumed that the test failed during execution.
An execution-time error code has been inserted in the
database for this test.
MISSING : The log file contains no results for this test.
ALL MISSING : The log file contains no results for any test in this
group or subgroup.
DATABASE UNCHANGED : No new data for this group was input. The
database is unchanged for this group.
======================================================================
======================================================================
COMPILATION TIME / LINK TIME - NO DATA
======================================================================
PROGRAM FILE CMP TIME ERROR LNK TIME ERROR
----------------------------------------------------------------------
ap: APPLICATION
ai: ARTIFICIAL_INTELLIGENCE
(AP_AIM04) ap_ai04_ MISSING -
AP_AIM04 MISSING MISSING
ms: MISCELLANEOUS
il: INTERFACE_LANGUAGE_ASSEMBLY
ALL MISSING ALL MISSING
sy: SYSTEMATIC_COMPILE_SPEED
cu: COMPILATION_UNIT_SIZE
SY_CUM03 PARTIAL -
SY_CUM20 CALCULATION MISSING
======================================================================
PARTIAL : An incomplete result was found. No data was entered
in the database for this measurement.
CALCULATION : An error occurred in the calculation of this result.
No data was entered in the database for this
measurement.
MISSING : The log file contains no results for this program or
file.
ALL MISSING : The log file contains no results for any program or
file in this group or subgroup.
DATABASE UNCHANGED : No new data for this group was input. The
database is unchanged for this group.
======================================================================
System: <system name> Page <Page #>
Figure 5-2 Condense No Data Report (Continued)
A test problem will be listed in this report if:
* No results are found in the log file for the test.
* A test name was found in the log, but no corresponding measurement was found.
The report is divided into the following columns, with one test problem occupying one row:
* PROBLEM - Test problem name
* ERROR - The status of the test result
The possible entries under the "ERROR" column are:
* MISSING - No measurements were found in the log file for this test.
* EXECUTION - The test name, with no corresponding measurement, was found in the log. It is assumed that the test began to run, but failed after writing its name. Condense has inserted the execution error code in the database for this test.
* ALL MISSING - Applies to an entire group or subgroup. No test in the group or subgroup has results.
* DATABASE UNCHANGED - Applies to an entire group, when running in incremental mode. The group has data in the existing database, but has no new data in the log file which is being processed. Results for this group may not be complete, but this information appears in earlier reports.
No data is inserted in the database for a test listed as MISSING. An execution-time error code is inserted for tests listed as EXECUTION.
Listed in this report are:
* Each test file or main program with no compilation time.
* Each main program with no link time (except in the case of some Systematic Compile Speed main programs which have no associated link time).
The report is divided into the following columns, with one test file or main program occupying one row:
* PROGRAM - Main program name.
* FILE - Test unit file name.
* CMP TIME ERROR - Status of the compilation time result.
* LNK TIME ERROR - Status of the link time result.
The possible entries under the "CMP TIME ERROR" and "LNK TIME ERROR" columns are:
* MISSING - No measurements were found in the log file for this file or main program.
* PARTIAL - A partial compilation or link result was found. The result is missing a beginning or ending name marker, or a time stamp.
* CALCULATION - A compile or link result could not be calculated (the calculation did not result in a zero or positive time).
* ALL MISSING - Applies to an entire group or subgroup. No test file or main program in the group or subgroup has results.
* DATABASE UNCHANGED - Applies to an entire group, when running in incremental mode. The group has data in the existing database, but has no new data in the log file which is being processed. Results for this group may not be complete, but this information appears in earlier reports.
No data is inserted in the database for a test file or main program listed as MISSING, PARTIAL, or CALCULATION. If a compilation time is not present or is invalid, the corresponding link time is discarded.
The Exceptional Data Report lists each test and file for which exceptional or widely varying results were found in the log files. See Figure 5-3 for an example of the Exceptional Data Report.
ADA COMPILER EVALUATION SYSTEM
EXCEPTIONAL DATA REPORT <Date> <Time>
======================================================================
SYSTEM: system_1_name
COMMENTS:
EXECUTION TIME/CODE SIZE LOG : system_1_name.log
EXECUTION TIME/CODE SIZE DATABASE : system_1_name.e01
COMPILATION TIME LOG : system_1_name.log
COMPILATION TIME DATABASE : system_1_name.c01
======================================================================
======================================================================
EXECUTION TIME / CODE SIZE EXCEPTIONAL DATA
======================================================================
PROBLEM TIME UNR VFY CMP EXE DEP LNK NEG WDR OTH
-----------------------------|----|---|---|---|---|---|---|---|---|---|
ap: APPLICATION
ai: ARTIFICIAL_INTELLIGENCE
AP_AI_ARTIE 1* 1
ap: APPLICATION
cr: CYCLIC_REDUNDANCY_CHECK
AP_CR_CRC_00 1*
ar: ARITHMETIC
cf: CONVERSION_FLOAT
AR_CF_CONV_FLT_01 2*
dt: DELAYS_AND_TIMING
dp: DELAY_PROBLEMS
DT_DP_DELAY_07 1*
DT_DP_DELAY_ZERO_00 2*
DT_DP_DELAY_ZERO_05 2*V
======================================================================
* : A result from this category chosen for analysis.
V : Multiple valid but widely varying results found in data. The
smallest was chosen.
OTHER : Includes excess time and packaging errors.
======================================================================
======================================================================
COMPILATION TIME / LINK TIME EXCEPTIONAL DATA
======================================================================
PROGRAM FILE CTIME LTIME CMP LNK INC DEP WDR PAR CAL EXE DATA
----------------------|-----|-----|---|---|---|---|---|---|---|-------|
ap: APPLICATION
tr: TRIE
(AP_TRM01) ap_tr01_ 2 1* ERRORS
(AP_TRM01) ap_tr02_ 2 1* ERRORS
AP_TRM01 2*V -
AP_TRM01 link 2* -
do: DATA_STORAGE
rp: REPRESENTATION_PACK_UNPACK
(DO_RPM01) do_rp01_ 1 1* CMP ERROR
gn: GENERICS
in: INSTANTIATION
(GN_INM01) gn_in01_ 2*V MISSING
======================================================================
* : A result from this category chosen for analysis.
V : Multiple valid but widely varying results found in data. The
smallest compile time and its corresponding link time were chosen
CMP ERROR : The execution time data corresponding to this file
contains compilation errors. A compilation error code
has been inserted in the database for this file.
WITHDRAWN : The execution time data corresponding to this file
contains a withdrawn error code. A withdrawn error code
has been inserted in the database for this file.
VALID : The execution time data corresponding to this file
is valid.
VALID/ERR : The execution time data corresponding to this file
contains both valid times and execution errors. If only
one compile time is present, it is chosen for analysis.
If multiple compile times are present, the user must
choose among them.
MISSING : The execution time data corresponding to this file
is missing. There is no valid or exceptional data.
ERRORS : The execution time data corresponding to this file
contains errors. Result will not be used in analysis.
======================================================================
System: <system name> Page <Page #>
Figure 5-3 Condense Exceptional Data Report (Continued)
A test will be listed in this report if:
* It has more than one valid time measurement, and the times vary by 50% or more (see the example following this list).
* It has one or more results which are error codes or unreliable times.
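For example, a test with two valid times of 2.0 and 4.5 microseconds would be flagged as widely varying, while one with times of 2.0 and 2.2 microseconds would not; the specific values here are illustrative only.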
The report is divided into the following columns, with one test problem occupying one line:
* PROBLEM - Test problem name.
* TIME - Number of valid measurements, and "V" if widely varying.
* UNR - Number of unreliable time results.
* VFY - Number of verification errors.
* CMP - Number of compilation time errors.
* EXE - Number of execution-time errors.
* DEP - Number of dependent test results.
* LNK - Number of link time errors.
* NEG - Number of large negative time errors.
* WDR - Number of withdrawn test results.
* OTH - Number of tests with packaging or excess time results.
One of these columns will be marked with an asterisk (*) for each test. The column with the asterisk is the category whose result was selected for analysis, while all other results for this test are deselected. The selected result will be read by CA and SSA and used in their processing; the deselected results are preceded by Ada-style comments (--) and will be ignored by CA and SSA.
For example, if an asterisk appears in the "TIME" column, as for test ap_ai_artie in Figure 5-3, a result with a valid time (in this case there is only one) is selected in the database. If there is more than one valid time, the smallest is chosen. See the User's Guide Section 9.3.4.2 "Selecting from Several Results for Analysis", for more detail on how Condense chooses among multiple results.
A test unit file or main program will be listed in this report if:
* It has more than one valid compile-time measurement, and the times vary by 50% or more.
* It has one or more results which are error codes.
* The corresponding execution results are invalid, and thereby invalidate the compilation and link results.
* Condense cannot choose among several compilation measurements, because corresponding execution results are a mixture of valid and invalid times.
The report is divided into the following columns, with a test unit file occupying one line, and a main program usually occupying two lines (one line for compilation time results and one for link results):
* PROGRAM - Main program name.
* FILE - Test unit file.
* CTIME - Number of valid compilation time measurements, and "V" if widely varying.
* LTIME - Number of valid link time measurements.
* CMP - Number of compilation time errors.
* LNK - Number of link time errors.
* INC - Number of inconsistent results.
* DEP - Number of dependent test results.
* WDR - Number of withdrawn test results.
* PAR - Number of partial results (if test had only partial results it will not be listed in this report, but in the No Data Report).
* CAL - Number of calculation error results (if test had only calculation errors, it will not be listed in this report, but in the No Data Report).
* EXE DATA - The type of the corresponding execution results.
The possible entries in the EXE DATA column are:
* CMP ERROR - The execution-time data corresponding to this file contains compilation errors. A compilation error code has been inserted in the database for this file.
* WITHDRAWN - The execution-time data corresponding to this file contains a withdrawn error code, and a withdrawn error code has been inserted in the database for this file. Even though the test has been withdrawn from the test suite, its file has been modified to print a withdrawn error code so that all files are accounted for and do not appear to be missing.
* VALID - The execution-time data corresponding to this file is valid.
* VALID/ERR - The execution-time data corresponding to this file contains both valid times and execution errors. If only one compile time is present, it is chosen for analysis. If multiple compile times are present, the user must choose among them. The results are listed in the execution_condensed file, whose default name is "system_name".e##, where "##" is a sequence number starting at "00"; each new run is put in the next numbered file. See the System Names file for the actual file name.
* MISSING - The execution-time data corresponding to this file is missing (no valid or exceptional data) or is not applicable.
* ERRORS - The execution-time data corresponding to this file contains errors. Results will not be used in analysis unless they are manually selected in the database by the user.
One column for each file or main program will be marked with an asterisk (*). This indicates that for this file or program, a result from this category was selected for analysis, while all other results for this test are deselected. The selected result will be read by CA and SSA and used in their processing; the deselected results are preceded by the Ada-style comment delimiter (--) and will be ignored by CA and SSA. See the User's Guide Section 9.3.4.2 "Selecting from Several Results for Analysis", for more detail on how Condense chooses among multiple results.
Example 1 - If an asterisk appears in the "CMP" column, as for file "do_rp01_" in Figure 5-3, the corresponding execution results for the file contain compilation errors, and a compilation error code has been selected in the database for this file. The compilation error invalidates the compilation time collected.
Example 2 - If an asterisk appears in the "CTIME" column, as for file "gn_in01_", the corresponding execution results are valid, missing, or not_applicable. The smallest compilation time, and its link time, have been selected for analysis.
Example 3 - If an asterisk appears in the "INC" column, as for file "ap_tr01_" in Figure 5-3, an inconsistent results error code has been selected in the database for this file, because corresponding execution results were erroneous or were a mixture of valid and erroneous results. The user must decide which result is to be used in these situations. See the User's Guide Section 9.3.5 "Adding Data to the Database (Incremental Mode)" and Section 9.3.6 "Modifying the Database Manually" for more detail on resolving inconsistent errors.
The Multiple Data Report lists each test and file for which there are two or more results. Test problems and files for which there are no results or one result are not listed. Multiple results will occur when a test program is run more than once. See Figure 5-4 for an example of the Multiple Data Report.
ADA COMPILER EVALUATION SYSTEM
MULTIPLE RESULTS REPORT <Date> <Time>
======================================================================
SYSTEM: system_1_name
COMMENTS:
EXECUTION TIME/CODE SIZE LOG : system_1_name.log
EXECUTION TIME/CODE SIZE DATABASE : system_1_name.e01
COMPILATION TIME LOG : system_1_name.log
COMPILATION TIME DATABASE : system_1_name.c01
======================================================================
======================================================================
EXECUTION TIME / CODE SIZE MULTIPLE DATA
======================================================================
PROBLEM VALID ERRORS TOTAL
-------------------------------------------|-------|--------|-------|---
ap: APPLICATION
ai: ARTIFICIAL_INTELLIGENCE
AP_AI_ARTIE 1 1 2
ap: APPLICATION
tr: TRIE
AP_TR_TRIE_01 2 0 2
======================================================================
VALID : Valid times.
ERRORS : Includes unreliable times, verification errors, excess time
errors, execution, compilation, and link errors, withdrawn,
packaging, dependent errors, and large negative times.
======================================================================
======================================================================
COMPILATION TIME / LINK TIME MULTIPLE DATA
======================================================================
PROGRAM FILE CMP TIME ERRORS TOTAL LNK TIME ERRORS TOTAL
-----------------------|--------|------|-----|---|--------|------|----|-
ap: APPLICATION
tr: TRIE
(AP_TRM01) ap_tr01_ 2 1 3 - - -
AP_TRM01 2 0 2 2 0 2
======================================================================
VALID : Valid times.
ERRORS : Includes compilation and link errors, withdrawn, dependent,
and inconsistent errors.
======================================================================
System: <system name> Page <Page #>
Figure 5-4 Condense Multiple Data Report (Continued)
The report is divided into the following columns, with one test problem occupying one row:
* PROBLEM - Test problem name.
* VALID - Number of valid times (includes delay problems).
* ERRORS - Number of exceptional results. Includes unreliable times, verification errors, excess time errors, execution, compilation, and link errors, withdrawn, packaging, dependent errors, and large negative times.
* TOTAL - Total number of results.
The report is divided into the following columns, with one test unit file or main program occupying one row.
* PROGRAM - Main program name.
* FILE - Test file name.
* CMP TIME - Number of valid compilation times.
* ERRORS - Number of exceptional compilation results. Includes compilation and link errors, withdrawn, dependent, and inconsistent errors.
* TOTAL - Total number of compilation results.
* LNK TIME - Number of valid link times.
* ERRORS - Number of exceptional link results. Includes compilation and link errors, withdrawn, dependent, and inconsistent errors.
* TOTAL - Total number of link results.
A dash (-) in each of the link columns indicates that link time results are not associated with a file name.
The ACES Comparative Analysis (CA) tool analyzes collected sets of measurement data, produced by executing the test suite, from two or more systems. The CA report is designed to aid the reader in comparing these results. The system factors, described in Section 5.3.2.2.1, provide a single number summary of the relative performance of each system being compared. The outliers, described in Section 5.3.2.2.5, draw the reader's attention to the data points which are exceptions to the overall pattern.
The reader should remember that these numbers summarize data from each system (hardware and software). No attempt is made to separate these two components. System one may do better than system two because the hardware is faster or because the compiler generates more optimized code. Only in the case where the compiled code is run on exactly the same hardware can reliable conclusions about compiler quality be reached from the CA results. The statistical background and rationale for the CA analysis are described in Section 7.
CA assists the ACES user in satisfying the following high-level requirements.
* Compare the performance of several implementations.
The raw data matrix of measurements can be directly inspected to answer any specific performance questions. Summary statistics on overall performance characteristics are computed by the analysis program, CA. These statistics assist the end user in interpreting the significance of the measurement data. Where measurement data from several systems are available, CA computes system factors reflecting overall system performance, and a residual matrix which gives insight into the performance characteristics of each system.
The residual matrix will flag test problems where a particular system performed significantly better or worse than expected, relative to the average of that system over all problems, and to the average of that problem over all systems. This flagging isolates the test problems where a system performs either strongly or weakly. Users can then examine the constructions and features used in these problems to see why the system behaves anomalously on the problem.
CA is a program that compares data between systems. So that an end user who is testing only one system can still use CA to assess that system's performance, a set of sample data is distributed as part of the ACES product. This sample data set consists of the problem factors computed by analyzing the results of the several systems on which the ACES test suite was executed during development to verify the portability of the ACES and the support tools. The sample data set will be updated with each release of the ACES, using the then-current releases of Ada compilation systems, so it will track the average performance of Ada systems over time.
* Isolate the strong and weak points of a specific system.
Executing CA with these two sets of measurement data (the sample set and the data collected from the one system the user tested) permits isolation of weak and strong points relative to the "average" implementation observed in the trial systems. Single System Analysis (SSA) can also be helpful here.
* Determine what significant changes were made between releases of a compilation system.
Presenting the measurements from successive releases of an Ada compilation system to CA highlights the differences between the two releases. Test problems using language features which were changed between releases should produce different measurements. Where these differences are large, CA will flag the test problems. Where a large difference is observed in a test problem which uses many language features, it may not be obvious why the difference occurred. The ACES contains many small tests which should isolate differences, and a large test problem can be studied to see what features it used. It is hoped that this approach will be sufficient to explain the performance of large test problems.
* Predict the performance of design approaches.
Information permitting estimation of the performance of different design approaches can be obtained by studying the actual results. The tests permit the observation of performance of some language features, such as rendezvous, which might be too slow to permit their frequent inclusion in some applications. The coding style tests are intended to be compared against other tests which perform the same function in a different manner. Comparing the results can suggest the faster alternative for a particular target system. The ACES provides information to permit users to make such decisions in an informed manner.
For large test problems, it is hard to tie performance to a specific language construction. If the Whetstone benchmark program is relatively slow, it is not easy to know why, without a detailed examination of the generated code. It could be due to procedure calling, parameter passing, array subscription, math library routines, arithmetic computations, loop constructions, or any combination of them. The test suite contains many small tests of particular features so that specific problem areas can be identified. Larger test problems combine the features in ways that reflect typical usage, and provide samples of code containing sequences of language feature usage. This permits the exposure of the interaction between features, and of the dependence of the performance of a construction on its context.
Not too much emphasis should be placed on any one test problem. Even ignoring the possibility of measurement errors, large residuals may be due to "unusual" interactions with the system rather than inherent properties of the system and the problem. For example, a test problem to measure the speed of access to intermediate scoped variables might be unusual due to the presence of extra instructions to set up a base register to maintain addressability in an S/360-like architecture. An unusual timing measurement should be viewed as an indication of the need for further study. It may be necessary to examine the machine code produced to see if a problem is due to peculiar system interactions not directly tied to the features used in the particular test problem.
The system factors give a measure of overall performance with respect to the workload defined by the test problems in the test suite. Readers who are interested in a different workload should pay more attention to the test problems which reflect their interests. For example, where a reader is not interested in either tasking or fixed point operations, the results of test problems for these features can be ignored. If the total number of tests not of interest to a reader is large, it might be best to rerun CA ignoring these tests. This can be done by assigning a weight of 0.0 (see Section 9.1.5 of the ACES User's Guide) to these problems, or simply by not running them at all. All readers can get value from an examination of the test results to determine whether, on the systems of their choice, there are any language constructions which they would want to avoid for performance reasons. If a system has particularly bad performance on a feature of critical interest, it might be necessary to base the choice of a system not on the fastest average performance, but on ensuring that the worst-case performance will be tolerable.
Input files are discussed in more detail in the User's Guide. The files (and their default names) are:
* System Names file - "za_cosys.txt"
* Request file - Given in the System Names file; default is "za_careq.txt"
* CA database file(s) - Given in the System Names file; default is "za_cadb.txt"
* Database files (produced by Condense) - Given in the System Names file
* Structure (weights) file - Given in the System Names file; default is "za_cowgt.txt"
Output files, also discussed in more detail in the User's Guide, depend on the options selected in Menu or in the request file. If the single output file option is selected, then all the output from the current request will be written to one file with the designated file name. Otherwise, one file will be produced for each group selected (and for the summary of all groups, if selected).
The three reports produced by CA are listed below. Each is further discussed in the following paragraphs.
Note that, in each case, there are 5 such reports, one for each metric (compile time, link time, total compile and link time, code size, and execution time).
* SUMMARY-OF-ALL-GROUPS-LEVEL REPORTS
+ High Level Summary For Selected Metric(s)
- Bar Charts
+ Intermediate Level Summary For Selected Metric(s)
- System Factors for Selected Metrics
- Errors for Selected Metrics
- For each group - comparing all systems
- For each system - comparing all groups
+ Full Report for Selected Metric(s)
* GROUP-LEVEL REPORTS
Comparative Analysis produces these output tables as part of the Full Report.
+ System Factors and Confidence Intervals
+ Significant Differences
+ Data Summary
+ Raw Data
+ Outliers and Individual Residuals
+ Goodness of Fit
+ Pairwise Comparisons
+ Number of Test Problems for All Groups
+ Test Problems with Errors (optional)
+ Sorted List of Outliers (optional)
* APPLICATION PROFILE REPORT (Special Analysis)
The high level bar charts are a simple graphical representation of system factors or the percent successful for the selected metrics (execution time, code size, compile time, link time, or combined compile/link time). Here, as for the other high level reports from CA, there is no new data being presented; the method of presentation is different. Users can choose between vertical or horizontal bar charts. An example of each is given below, in Figures 5-5 and 5-6. Figure 5-5 shows the execution time system factors produced by a Summary of All Groups Report from Comparative Analysis. Figure 5-6 shows the percentage of performance tests run successfully for those same three groups.
----------------------------------------------------------------------
Comparing System Factors
----------------------------------------------------------------------
2.46
+---------+
|#########|
|#########|
|#########|
|#########|
|#########|
2.00 |#########|
|#########|
|#########|
|#########|
|#########|
|#########|
|#########|
1.50 |#########|
|#########|
|#########|
|#########|
|#########| 1.16
|#########| +---------+
|#########| |#########|
1.00 |#########| |#########|
|#########| |#########|
|#########| |#########|
|#########| |#########|
|#########| |#########|
|#########| |#########|
|#########| |#########|
0.50 |#########| |#########|
|#########| |#########|
|#########| 0.28 |#########|
|#########| +---------+ |#########|
|#########| |#########| |#########|
|#########| |#########| |#########|
|#########| |#########| |#########|
0.00 +---------+ +---------+ +---------+
a_93 b_93 c_93
Execution Times --- <Date> <Time> --- <Version> --- Page <Page #>
----------------------------------------------------------------------
Comparing Successes
----------------------------------------------------------------------
0.0 20.0 40.0 60.0 80.0 100.0
|.........|.........|.........|.........|.........|
+-------------------------------------------------+
|#################################################|
|#################################################|
|#################################################|
a_93 |#################################################|
99.77 |#################################################|
|#################################################|
|#################################################|
|#################################################|
+-------------------------------------------------+
+----------------------------------------------+
|##############################################|
|##############################################|
|##############################################|
b_93 |##############################################|
93.58 |##############################################|
|##############################################|
|##############################################|
|##############################################|
+----------------------------------------------+
+-----------------------------+
|#############################|
|#############################|
|#############################|
c_93 |#############################|
59.66 |#############################|
|#############################|
|#############################|
|#############################|
+-----------------------------+
Execution Times --- <Date> <Time> --- <Version> --- Page <Page #>
The Intermediate Level Summaries also include system factors for selected metrics and error reporting in terms of the percent of successful tests. The difference is in the level of detail presented. The Intermediate Level reports show the data from the Group Level analysis in a format convenient for making comparisons between systems for each group, or in a format convenient for making comparisons between groups for each system. Examples of both are presented below. Figure 5-7, CA: System Factors for All Groups for All Systems, makes it easy to see that the rank order of the three systems is the same on every group. Figure 5-8, CA: Graphical Summary of Successes For Each Group, allows the user to see at a glance that the third system (which was an embedded target with limited memory) ran many fewer tests than the other two systems. Figure 5-9, CA: Summary of System Factors for All Groups For One System, makes it easy to see on which groups the system was faster (compared to its average) and on which groups the system was slower. The variability within each group can also be easily observed. In this example, performance was better than average on Delays_and_Timing and on Generics, and slower on the Systematic_Compile_Speed group. Performance on Exception_Handling (among others) was highly variable, while performance on the Arithmetic group was uniformly (slightly) faster than average. Finally, there is a summary table showing the total successes for all systems being compared, by group. An example may be found in Figure 5-10, CA: Graphical Summary of Successes For All Systems.
----------------------------------------------------------------------------
Summary of System Factors for All Groups for All Systems
----------------------------------------------------------------------------
============================================================================
Low Mean High Ratio 0 0.7 1.4 2.1 2.8 3.5 4.2 4.9 5.6
application (ap) ----------------|....|....|....|....|....|....|....|....|
a_93 2.72 3.05 3.42 1.00 | |--+-| |
b_93 0.24 0.26 0.29 0.09 | |+| |
c_93 1.07 1.31 1.59 0.43 | |+-| |
arithmetic (ar) -----------------|....|....|....|....|....|....|....|....|
a_93 1.83 2.06 2.33 1.00 | |-+-| |
b_93 0.22 0.25 0.29 0.12 | |+| |
c_93 1.11 1.22 1.34 0.59 | |+| |
classical (cl) ------------------|....|....|....|....|....|....|....|....|
a_93 1.61 1.91 2.26 1.00 | |--+-| |
b_93 0.18 0.20 0.24 0.11 ||+| |
c_93 1.04 1.20 1.40 0.63 | |-+| |
data_storage (do) ---------------|....|....|....|....|....|....|....|....|
a_93 1.71 2.05 2.46 1.00 | |--+--| |
b_93 0.16 0.19 0.22 0.09 ||+| |
c_93 1.00 1.10 1.20 0.53 | |+| |
data_structures (dr) ------------|....|....|....|....|....|....|....|....|
a_93 2.11 2.34 2.59 1.00 | |-+-| |
b_93 0.25 0.28 0.30 0.12 | |+| |
c_93 1.12 1.18 1.24 0.51 | |+| |
delays_and_timing (dt) ----------|....|....|....|....|....|....|....|....|
.
.
statements (st) -----------------|....|....|....|....|....|....|....|....|
a_93 2.09 2.38 2.72 1.00 | |-+-| |
b_93 0.26 0.28 0.31 0.12 | |+| |
c_93 1.12 1.17 1.23 0.49 | |+| |
storage_reclamation (sr) --------|....|....|....|....|....|....|....|....|
a_93 2.64 3.03 3.49 1.00 | |--+--| |
b_93 0.28 0.32 0.36 0.10 | |+| |
c_93 0.91 1.20 1.58 0.40 | |-+-| |
subprograms (su) ----------------|....|....|....|....|....|....|....|....|
a_93 1.81 2.18 2.63 1.00 | |--+--| |
b_93 0.17 0.20 0.23 0.09 ||+| |
c_93 1.11 1.18 1.26 0.54 | |+| |
systematic_compile_speed (sy) ---|....|....|....|....|....|....|....|....|
a_93 3.49 3.81 4.15 1.00 | |-+--| |
b_93 0.24 0.26 0.29 0.07 | |+| |
c_93 |---- missing ---- |
tasking (tk) --------------------|....|....|....|....|....|....|....|....|
a_93 2.01 2.25 2.52 1.00 | |-+-| |
b_93 0.44 0.51 0.60 0.23 | |+| |
c_93 0.71 0.81 0.92 0.36 | |+| |
---------------------------------|....|....|....|....|....|....|....|....|
Low Mean High Ratio 0 0.7 1.4 2.1 2.8 3.5 4.2 4.9 5.6
============================================================================
----------------------------------------------------------------------------
Graphical Summary of Successes For Each Group
----------------------------------------------------------------------------
============================================================================
success/total = % 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
application (ap) -------|....|....|....|....|....|....|....|....|....|....|
a_93 98/ 98 = 100% |----+----+----+----+----+----+----+----+----+----|
b_93 87/ 98 = 89% |----+----+----+----+----+----+----+----+---- |
c_93 27/ 98 = 28% |----+----+---- |
arithmetic (ar) --------|....|....|....|....|....|....|....|....|....|....|
a_93 114/114 = 100% |----+----+----+----+----+----+----+----+----+----|
b_93 114/114 = 100% |----+----+----+----+----+----+----+----+----+----|
c_93 104/114 = 91% |----+----+----+----+----+----+----+----+----+- |
.
.
tasking (tk) -----------|....|....|....|....|....|....|....|....|....|....|
a_93 130/131 = 99% |----+----+----+----+----+----+----+----+----+----|
b_93 118/131 = 90% |----+----+----+----+----+----+----+----+----+ |
c_93 82/131 = 63% |----+----+----+----+----+----+- |
----------------------------------------------------------------------------
Graphical Summary of Successes For Each System on All Groups
-----------------------------------------------------------------------------
success/total = % 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
--------------------------|....|....|....|....|....|....|....|....|....|....|
a_93 | |
1741/1745 = 100% |----+----+----+----+----+----+----+----+----+----|
b_93 | |
1633/1745 = 94% |----+----+----+----+----+----+----+----+----+-- |
c_93 | |
1041/1745 = 60% |----+----+----+----+----+----+ |
--------------------------|....|....|....|....|....|....|....|....|....|....|
success/total = % 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
============================================================================
Execution Times --- <Date> <Time> --- <Version> --- Page <Page #>
============================================================================
----------------------------------------------------------------------------
Summary of System Factors for All Groups for a_93
----------------------------------------------------------------------------
============================================================================
Low Mean High 0 0.6 1.2 1.8 2.4 3.0 3.6 4.2 4.8 5.4
All Groups --------------|....|....|....|....|....|....|....|....|....|
2.01 2.46 3.00 | |--+----| |
application -------------| . . . |
2.72 3.05 3.42 | |-+--| |
arithmetic --------------| . . . |
1.83 2.06 2.33 | |-+-| |
classical ---------------| . . . |
1.61 1.91 2.26 | |--+--| |
data_storage ------------| . . . |
1.71 2.05 2.46 | |--+---| |
data_structures ---------| . . . |
2.11 2.34 2.59 | |+--| |
delays_and_timing -------| . . . |
0.99 1.67 2.80 | |-----+--------| |
exception_handling ------| . . . |
2.27 2.98 3.91 | |-----+-------| |
generics ----------------| . . . |
0.91 1.55 2.65 | |----+--------| |
input_output ------------| . . . |
2.20 2.91 3.85 | |-----+-------| |
miscellaneous -----------| . . . |
1.22 2.11 3.67 | |-------+------------| |
optimizations -----------| . . . |
2.16 2.34 2.53 | |+-| |
program_organization ----| . . . |
2.65 2.97 3.33 | |--+--| |
statements --------------| . . . |
2.09 2.38 2.72 | |--+--| |
storage_reclamation -----| . . . |
2.64 3.03 3.49 | |--+---| |
subprograms -------------| . . . |
1.81 2.18 2.63 | |--+---| |
systematic_compile_speed | . . . |
3.49 3.81 4.15 | |--+--| |
tasking -----------------| . . . |
2.01 2.25 2.52 | |-+-| |
---------------------------|....|....|....|....|....|....|....|....|....|
Low Mean High 0 0.6 1.2 1.8 2.4 3.0 3.6 4.2 4.8 5.4
============================================================================
Execution Times --- <Date> <Time> --- <Version> --- Page <Page #>
==============================================================================
----------------------------------------------------------------------------
Graphical Summary of Successes For All Systems
----------------------------------------------------------------------------
============================================================================
success/total = % 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
Totals for each group ---|....|....|....|....|....|....|....|....|....|....|
application (ap) ------| |
297/297 = 100% |----+----+----+----+----+----+----+----+----+----|
arithmetic (ar) -------| |
342/417 = 82% |----+----+----+----+----+----+----+----+- |
classical (cl) --------| |
261/264 = 99% |----+----+----+----+----+----+----+----+----+----|
data_storage (do) -----| |
303/303 = 100% |----+----+----+----+----+----+----+----+----+----|
data_structures (dr) --| |
726/912 = 80% |----+----+----+----+----+----+----+----+ |
delays_and_timing (dt) | |
123/144 = 85% |----+----+----+----+----+----+----+----+--- |
exception_handling (xh) |
174/174 = 100% |----+----+----+----+----+----+----+----+----+----|
generics (gn) ---------| |
75/ 81 = 93% |----+----+----+----+----+----+----+----+----+- |
input_output (io) -----| |
348/366 = 95% |----+----+----+----+----+----+----+----+----+--- |
interfaces (in) -------| |
| |
miscellaneous (ms) ----| |
51/ 51 = 100% |----+----+----+----+----+----+----+----+----+----|
object_oriented (oo) --| |
| |
optimizations (op) ----| |
972/972 = 100% |----+----+----+----+----+----+----+----+----+----|
program_organization (po) |
222/282 = 79% |----+----+----+----+----+----+----+---- |
protected_types (pt) --| |
| |
statements (st) -------| |
276/276 = 100% |----+----+----+----+----+----+----+----+----+----|
storage_reclamation (sr) |
195/195 = 100% |----+----+----+----+----+----+----+----+----+----|
subprograms (su) ------| |
240/240 = 100% |----+----+----+----+----+----+----+----+----+----|
systematic_compile_speed (sy) |
252/252 = 100% |----+----+----+----+----+----+----+----+----+----|
tasking (tk) ----------| |
402/426 = 94% |----+----+----+----+----+----+----+----+----+-- |
user_defined (ud) -----| |
| |
Total for all groups --| |
5259/5652 = 93% |----+----+----+----+----+----+----+----+----+-- |
-------------------------|....|....|....|....|....|....|....|....|....|....|
success/total = % 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
============================================================================
The full report at the summary level is very similar to the full report at the group level, which is discussed in detail in the next section. There are two differences. The raw data for the summary report is the collection of system factors produced by the group level analyses. Also, the table showing Number of Test Problems for All Groups appears in the Summary Report, but not in the group level reports. Figure 5-11, from the Summary Over All Groups Report, tells us the number of valid individual test problems for each group from each system. This data is produced in each individual group report, but with this table, we can more easily compare systems based on the number of tests that ran.
========================================================================
No. Problems | sys_1 sys_2 sys_3 | Possible
------------------------------------------------------------------------
application | 25 32 31 | 99
arithmetic | 17 65 63 | 139
classical | 21 75 77 | 88
data_storage | 21 39 36 | 101
data_structures | 69 144 138 | 304
delays_and_timing | 13 16 16 | 48
exception_handling | 39 46 38 | 58
generics | 12 22 21 | 27
input_output | 31 63 83 | 122
interfaces | 12 5 6 | 12
miscellaneous | 0 0 0 | 17
object_oriented | 9 8 9 | 9
optimizations | 134 198 171 | 324
program_organization | 24 24 9 | 94
protected_types | 10 11 7 | 12
statements | 37 50 44 | 92
storage_reclamation | 9 17 8 | 65
subprograms | 27 44 46 | 80
systematic_compile_speed| 0 0 0 | 84
tasking | 34 38 38 | 142
user_defined | 0 0 0 | 20
========================================================================
Total | 544 897 841 | 1937
========================================================================
Comparative Analysis produces these output tables as part of the Full Report.
* System Factors and Confidence Intervals
* Significant Differences
* Data Summary
* Raw Data
* Outliers and Individual Residuals
* Goodness of Fit
* Pairwise Comparisons
* Number of Test Problems for All Groups
* Test Problems with Errors (optional)
* Sorted List of Outliers (optional)
The output tables from CA are discussed below in some detail, with examples of each. The data is real, but the system names have been changed. Notice that there is not always enough space to print long system names; names that are too long will be truncated. The amount of truncation varies from table to table. In general, the user should make the first two or three letters of each system name unique to ensure readability in all tables under all conditions.
A large quantity of data can easily overwhelm the reader. The objective is to summarize the important patterns in the data to aid the reader in interpreting the results. This is done by fitting the data to a model. Section 7 "CA REPORT BACKGROUND" discusses the statistical model used by CA to analyze the performance data collected by executing the performance tests. The concept is fairly straightforward. There is a row of data representing each test problem and a column for each compilation system. The CA tool fits the performance data to a statistical model that assumes measurements can be estimated by the product of a factor for each system and a factor for each problem. It calculates "average" factors for each problem and each system, using a statistically robust technique which can tolerate missing data and is not greatly distorted by the presence of a few values (problem/system pairs) which widely differ from the model.
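To make the product model concrete, the following fragment is a minimal sketch, in Ada, of a simple (non-robust) fit performed in logarithmic space. It is not the CA source; the robust fitting technique actually used is described in Section 7, and the procedure name, types, and data values below are invented for illustration.
with Ada.Text_IO;
with Ada.Numerics.Elementary_Functions;
procedure Product_Model_Sketch is
   use Ada.Text_IO;
   use Ada.Numerics.Elementary_Functions;
   --  Three problems measured on two systems (invented times).  The
   --  product model assumes each measurement is approximately
   --  Problem_Factor (P) * System_Factor (S).
   type Matrix is array (1 .. 3, 1 .. 2) of Float;
   Data : constant Matrix := ((10.0,  20.0),
                              (50.0, 100.0),
                              ( 5.0,  11.0));
   Problem_Factor : array (1 .. 3) of Float;
   System_Factor  : array (1 .. 2) of Float;
   Sum            : Float;
begin
   --  Problem factor: geometric mean of the problem's measurements
   --  (a simple fit in log space, not CA's robust technique).
   for P in 1 .. 3 loop
      Sum := 0.0;
      for S in 1 .. 2 loop
         Sum := Sum + Log (Data (P, S));        -- natural logarithm
      end loop;
      Problem_Factor (P) := Exp (Sum / 2.0);
   end loop;
   --  System factor: geometric mean of (measurement / problem factor).
   for S in 1 .. 2 loop
      Sum := 0.0;
      for P in 1 .. 3 loop
         Sum := Sum + Log (Data (P, S) / Problem_Factor (P));
      end loop;
      System_Factor (S) := Exp (Sum / 3.0);
      Put_Line ("System" & Integer'Image (S) & " factor:" &
                Float'Image (System_Factor (S)));
   end loop;
end Product_Model_Sketch;
For these invented values the second system's factor comes out at roughly twice the first system's, which would be read (for execution times) as the second system being about half as fast overall.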
The first table produced by CA (after listing the system identifications) reports the system factors and confidence intervals. The system factors are calculated using techniques applicable to sample data from a population that follows the product model. The reported system factor is considered to be an approximation of the population system factor. Figure 5-12 gives this approximation as the "mean" system factor, along with the endpoints of the 95% confidence interval. This is the smallest interval for which the probability of containing the population system factor is at least 0.95. That is, for 19 out of 20 such reports, the population system factor ("actual" system factor) lies within the cited range.
See Figure 5-12 for an example of the System Factors and Confidence Intervals report.
======================================================================
---- System Factors and Confidence Intervals (including graph)
======================================================================
Systems Low Mean High Ratio | 0.1 2.3
------------------------------------|--------------------------------|
sys_1 0.38 0.93 2.26 1.00 | |--------+------------------|
sys_2 1.80 1.87 1.94 2.00 | |-+| |
sys_3 0.06 0.15 0.37 0.16 |+---| |
======================================================================
Interpretation of system factors: The system factors produced by CA are only meaningful in relation to the others in the same report. As explained in Section 7, the data is fitted to a mathematical model. If a different set of systems is used in a comparison, different system factors are produced. It is very unlikely that the relative comparison between two systems will change, but their system factors will not, in general, have the same values when they are part of different subsets of systems being compared.
Confidence intervals: A range of values is given for each system factor. For each system, a low value, a high value, and a point (mean) estimate are provided. In addition, there is a ratio column to simplify relative comparisons. In this column the value for the first system is always 1.0.
Depending on how well the data fits the model, these low and high values may be close together, or far apart. In Figure 5-12, there is a relatively precise estimate of the system factors for sys_2. The point estimate is 1.87, and the range is 1.80 .. 1.94. Our estimate for sys_1 is much less precise. The point estimate is 0.93 and the range is 0.38 .. 2.26. The ratio column shows that the sys_2 point estimate is twice that of sys_1. However, because of the uncertainty in our system factor values for sys_1, we cannot say with confidence that these two values are different. This is reflected in Figure 5-13, CA: Significant Differences, which does not show a significant difference between the system factors for sys_1 and sys_2.
============================================================================
---- Significant Diff = * | ---- Data Summary: Total n = 24
============================================================================
sys_1 sys_2 sys_3 | Valid NoData Comp RunTim Exclu Other
----------------------------------------------------------------------------
sys_1 - * | 12 12 0 0 0 0
sys_2 - * | 22 0 0 0 2 0
sys_3 * * | 21 0 2 0 0 1
============================================================================
The confidence intervals and point (mean) estimates are also given graphically, which facilitates quick comparisons.
See Figure 5-13 for an example of a Significant Differences report.
A visual examination of the graphs in Figures 5-5 and 5-12 will not always tell us whether we can reliably conclude that two system factors are significantly different. The Significant Differences table will do this. In this table, a "*" means a significant difference and "-" means no significant difference. The table is always symmetrical around the diagonal from the upper left to the lower right corner. This diagonal is left blank.
In Figure 5-13, sys_1 and sys_2 are not significantly different; this means that we cannot confidently conclude that sys_1 is faster than sys_2, even though, on the average, the first system ran the tests twice as fast as the second.
(If we examine the data in the Raw Data Table, Figure 5-15, we find that in most cases sys_1 was faster. However, there are many cases where sys_2 was faster. Averages can conceal as much as they reveal.)
Figure 5-13 includes the Data Summary Table from one of the group reports. In order to make the reports as compact as possible, the Data Summary Table will be combined with the significant differences table whenever that is possible. Generally these two tables will be combined when the number of systems being compared is less than or equal to eight.
The table shown in Figure 5-14 is from a Summary of All Groups report. The format and interpretation are the same in Figures 5-13 and 5-14, except that the Summary of All Groups report includes a column for number of groups ("Gps"). Of course, the numbers in Figure 5-14 reflect the cases from all the groups, rather than from a single group, as in Figure 5-13.
============================================================================
---- Significant Diff = * | ---- Data Summary: Total n = 1745
============================================================================
a_93 b_93 c_93 | Gps Valid NoData Comp RunTim Exclu Other
----------------------------------------------------------------------------
a_93 * * | 17 1579 2 1 0 93 70
b_93 * * | 17 1583 9 39 44 29 41
c_93 * * | 16 1012 479 86 109 0 59
============================================================================
A list of headers with an explanation of their meaning follows.
* Gps: Groups - This is the number of groups used in the summary. The user can select which groups to exclude by giving a weight of zero to a group to be excluded (this is discussed in the User's Guide in Section 9.1.5 "Modifying the Structure (Weights) File, "za_cowgt.txt"") or by not providing the data for that group. Otherwise, all group results in the current CA database are included. The number of predefined groups in the ACES is 21 for Ada 95 systems, and 18 for Ada 83 implementations.
* Valid: Valid data - This is the good data that the analysis is based on. This is all the good data in the database except for any that might have been excluded. See the following "Exclu: Excluded data" bullet.
* NoData: No Data - The data is missing and we do not know why. Reasons for missing data are not always available. Some reasons are produced automatically when compiling and running the test suite, such as the compile-time error codes generated by the dummy test programs. Others will not be known, unless the user enters the appropriate code. For example, run-time errors from which the system did not recover may not generate any data for Condense to read.
* Comp: Compile-time failure - This problem failed at compile time.
* RunTim: Run-time failure - This problem failed at run time.
* Exclu: Excluded data - This data was excluded from the analysis. We must have data from at least two systems for each problem. If for some problem we have data from only one system, this data is not useful for our comparison: we have nothing to compare it with. This data is excluded. This exclusion is controlled by a constant defined in the body of the CA package:
minNprobsInRow : CONSTANT := 2 ;
For each system, we must have data from at least four problems. With fewer data points, the confidence intervals we generate would be so large as to be meaningless. This exclusion is also controlled by a constant defined in the body of the CA package (both exclusion rules are illustrated in the sketch following this list):
minNprobsInCol : CONSTANT := 4 ;
* Other - All other reasons. These include:
+ Dependent => system dependent: this test used a system-dependent feature which was not supported by this implementation. These problems may have appeared at compile time or they may have been detected by self-checking code at run time.
+ Packaging => packaging error: this test did not run because an earlier test in the same main program failed and the system did not recover.
+ Unreliable => unreliable time: this test ran successfully but the time measurements were highly variable. Rerunning might help.
+ Withdrawn => withdrawn test: this test has been withdrawn.
+ DelayTest => delay problem: this test ran successfully, but the time measured is the time for an Ada delay, and is not appropriate for CA.
+ LinkError => link time: this test failed at link time.
+ ClaimExcessTime => claim excess time: this test did not complete because it would have required an excessive amount of time to do so. The claim tests in the Storage Reclamation group try to detect failures to recover storage that the system has (implicitly) allocated. Without the test for excessive time, some of these problems may take many hours to run to completion on some target systems.
+ NegativeTime => negative time: this test produced a large negative time which indicates a fundamental timing failure.
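The two exclusion rules given under "Exclu" above can be illustrated with a short sketch. This is not the CA source: the procedure name, the matrix layout, the use of a negative value to mark missing data, and the order in which the two checks are applied are all assumptions made for this example.
with Ada.Text_IO;
procedure Exclusion_Sketch is
   use Ada.Text_IO;
   --  Measurements for 5 problems on 3 systems; a negative value marks
   --  missing data (an assumption made only for this sketch).
   type Matrix is array (1 .. 5, 1 .. 3) of Float;
   minNprobsInRow : constant := 2;  -- data from at least 2 systems per problem
   minNprobsInCol : constant := 4;  -- data from at least 4 problems per system
   Data : constant Matrix := (( 58.1,  17.2, 595.9),
                              (115.0,  -1.0,  -1.0),  -- only one system has data
                              ( 72.8,  11.0,  22.5),
                              ( -1.0,  33.5,  31.3),
                              (328.4, 589.9,  -1.0));
   Row_Excluded : array (1 .. 5) of Boolean := (others => False);
   Col_Excluded : array (1 .. 3) of Boolean := (others => False);
   Count        : Natural;
begin
   --  Exclude problems (rows) with data from fewer than two systems.
   for P in 1 .. 5 loop
      Count := 0;
      for S in 1 .. 3 loop
         if Data (P, S) >= 0.0 then
            Count := Count + 1;
         end if;
      end loop;
      Row_Excluded (P) := Count < minNprobsInRow;
   end loop;
   --  Exclude systems (columns) with data from fewer than four of the
   --  remaining problems (applying the checks in this order is an
   --  assumption of the sketch).
   for S in 1 .. 3 loop
      Count := 0;
      for P in 1 .. 5 loop
         if not Row_Excluded (P) and then Data (P, S) >= 0.0 then
            Count := Count + 1;
         end if;
      end loop;
      Col_Excluded (S) := Count < minNprobsInCol;
      Put_Line ("System" & Integer'Image (S) & " excluded: " &
                Boolean'Image (Col_Excluded (S)));
   end loop;
end Exclusion_Sketch;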
The No Data Report from Condense can provide more detailed information on why individual problems failed. The Data Summary Report in SSA provides a more detailed breakdown of failure categories for each system, as do the reports generated by the Harness.
This table (see Figure 5-15) shows the raw data that the analysis results are based on, including the weights. If a system has no data for a problem, the word "missing" is printed in place of the numeric result. If only one system has data for a problem, no comparison can be performed for that problem. In this case, the word "excluded" is printed in place of the result for the system having a result. (See Section 5.3.2.2.3.) There is a count of unused problems for each subgroup, and for the table as a whole.
GENERICS (gn) - product model
===============================================================================
---- Raw data
===============================================================================
Raw Data: | sys_1 sys_2 sys_3 | Weights
-------------------------------------------------------------------------------
instantiation (in) -- Missing: 1
enum_io_01 | 58.10 17.20 595.90 | 1.0
enum_io_02 | 115.00 36.90 1263.00 | 1.0
enum_io_03 | 115.00 33.50 1044.70 | 1.0
enum_io_04 | 72.80 11.00 22.50 | 1.0
enum_io_05 | 144.10 33.50 31.30 | 1.0
enum_io_06 | 328.40 589.90 107.10 | 1.0
enum_io_07 | 659.60 1171.00 214.80 | 1.0
enum_io_08 | 1269.40 4875.50 | 1.0
enum_io_09 | missing missing missing | 1.0
subprogram (su) -- Missing: 1
subprogram_01 | 20013.20 40833.20 721.00 | 1.0
subprogram_02 | 84.20 21.70 11.00 | 1.0
subprogram_03 | 72.90 12.60 2.42 | 1.0
subprogram_04 | 21.90 5.89 0.00 | 1.0
subprogram_05 | missing 33.10 1.02 | 1.0
subprogram_06 | missing 32.30 1.05 | 1.0
subprogram_07 | missing 27.10 0.92 | 1.0
subprogram_08 | missing 20.00 0.96 | 1.0
subprogram_09 | missing missing excluded | 1.0
subprogram_10 | missing 21.00 0.85 | 1.0
subprogram_11 | missing 31.80 0.80 | 1.0
subprogram_12 | missing 59.50 2.94 | 1.0
subprogram_13 | missing 58.70 2.93 | 1.0
subprogram_14 | missing 49.80 2.79 | 1.0
subprogram_15 | missing 41.90 3.80 | 1.0
-------------------------------------------------------------------------------
---- Total missing: 2
===============================================================================
Test problem names (but not compile speed problem names) are always printed without the first six letters. The first six letters are the group abbreviation, an underscore, the subgroup abbreviation, and another underscore. These abbreviations are included in the page header and in the table body. Subgroup names and abbreviations are printed at the start of each subgroup, along with a count of the missing problems.
When we fit the data to a simple model, there will always be discrepancies between the model and the data. If these discrepancies are large, we call them outliers.
Residuals are defined by the formula Residual * System Factor * Row Mean = Actual. A residual of 1.00 would mean that we have a perfect fit. A larger residual would indicate that a particular test result was larger (for measurements of time, this means slower) than expected. A small residual means that a test result was smaller (faster, for time measurements) than expected. Residuals that are much smaller or larger than expected are flagged in the output. These residuals are called outliers. Minuses are used to flag results that are smaller than expected and pluses are used to flag results that are larger than expected (slower).
An example of an outlier statistics summary table is shown in Figure 5-16.
========================================================================
---- Outlier Statistics: residual * system factor * row mean = actual
========================================================================
Bounds Expect Got | sys_1 sys_2 sys_3
------------------------------------------------------------------------
-- Very Low : 0.08 2 4 | 0 3 1
- Low : 0.12 2 0 | 0 0 0
+ High : 5.63 2 0 | 0 0 0
++ Very High: 8.28 2 3 | 0 0 3
------------------------------------------------------------------------
Totals : 6 7 0 3 4
=======================================================================
The column headings are described as follows:
* Bounds: The expected value of the residuals is always 1.0. The lower bounds for residuals flagged with High and Very High indicators are given here. The upper bounds for residuals flagged with Low and Very Low indicators also appear here.
* Expect: Based on normal variation, this is the number of outliers we would expect in each category. NOTE: The total of the Expect column for each category may not match the "Totals" row because of rounding.
* Got: This is the actual number of outliers in each category. These are further broken down by system.
See Figure 5-17 for an example of an individual residuals table.
========================================================================
---- Residual * System Factor * Row Mean = Actual
========================================================================
Residuals: | sys_1 sys_2 sys_3 | Means
------------------------------------------------------------------------
instantiation (in) -- Missing: 1
enum_io_01 | 0.28 0.04-- 17.58++ | 223.73
enum_io_02 | 0.26 0.04-- 17.67++ | 471.63
enum_io_03 | 0.31 0.05-- 17.33++ | 397.73
enum_io_04 | 2.21 0.17 4.19 | 35.43
enum_io_05 | 2.22 0.26 2.97 | 69.63
enum_io_06 | 1.03 0.92 2.07 | 341.80
enum_io_07 | 1.04 0.92 2.08 | 681.80
enum_io_08 | 0.44 0.85 | 3072.45
enum_io_09 | missing missing missing | missing
subprogram (su) -- Missing: 1
subprogram_01 | 1.05 1.07 0.23 | 20522.47
subprogram_02 | 2.32 0.30 1.86 | 38.97
subprogram_03 | 2.67 0.23 0.54 | 29.31
subprogram_04 | 2.54 0.34 0.00-- | 9.26
subprogram_05 | missing 1.04 0.39 | 17.06
subprogram_06 | missing 1.04 0.42 | 16.67
subprogram_07 | missing 1.04 0.43 | 14.01
subprogram_08 | missing 1.02 0.60 | 10.48
subprogram_09 | missing missing excluded | missing
subprogram_10 | missing 1.03 0.51 | 10.93
subprogram_11 | missing 1.05 0.32 | 16.30
subprogram_12 | missing 1.02 0.62 | 31.22
subprogram_13 | missing 1.02 0.63 | 30.82
subprogram_14 | missing 1.01 0.70 | 26.30
subprogram_15 | missing 0.98 1.10 | 22.85
------------------------------------------------------------------------
---- Total missing: 2
========================================================================
System Factor | 0.93 1.87 0.15 |
========================================================================
Seven outliers appear in this example. Three are very high and four are very low. The means are problem means, based on the raw data. The "missing" information is the same data that appears in the raw table. System factors are given at the end of the table.
Residuals are a measure of how well a data point fits the model as explained in the beginning of this section. For example, the residual for subprogram_14 from sys_2 is 1.01. This means that the fit is quite good. The predicted value from the model is:
predicted = system_factor * problem mean.
49.18 = 1.87 * 26.30
The actual value, from the Raw Data table (Figure 5-15), is 49.80.
residual = actual / predicted.
1.01 = 49.80 / 49.18
If we examine the outlier for sys_3 on problem enum_io_01, we find that the predicted is
0.15 * 223.73 = 33.56,
while the actual is 595.9. This estimate is off by a factor of almost 18.
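The arithmetic in this worked example can be written as a small Ada fragment. This is a sketch only, not the CA source; the function names are invented, and the outlier bounds are simply the values printed in Figure 5-16 for this particular data set.
with Ada.Text_IO;
procedure Residual_Sketch is
   use Ada.Text_IO;
   --  Residual * System Factor * Row Mean = Actual, so
   --  Residual = Actual / (System Factor * Row Mean).
   function Residual (Actual, System_Factor, Row_Mean : Float) return Float is
   begin
      return Actual / (System_Factor * Row_Mean);
   end Residual;
   --  Outlier bounds as printed in Figure 5-16 for this data set;
   --  CA derives them from the observed variation in the residuals.
   Very_Low  : constant Float := 0.08;
   Low       : constant Float := 0.12;
   High      : constant Float := 5.63;
   Very_High : constant Float := 8.28;
   function Flag (R : Float) return String is
   begin
      if R < Very_Low then
         return "--";
      elsif R < Low then
         return "-";
      elsif R > Very_High then
         return "++";
      elsif R > High then
         return "+";
      else
         return "";
      end if;
   end Flag;
   --  sys_3 on gn_in_enum_io_01: actual 595.90, system factor 0.15,
   --  problem mean 223.73 (Figures 5-15 and 5-17).
   R : constant Float := Residual (595.90, 0.15, 223.73);
begin
   Put_Line ("Residual:" & Float'Image (R) & " " & Flag (R));
   --  Prints a value near 17.8, flagged "++" (a very high outlier); the
   --  table shows 17.58, presumably because CA uses unrounded factors.
end Residual_Sketch;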
The outlier flag tells the user that something unusual has occurred. The ACES user should always examine the raw data corresponding to the outliers to ascertain what has happened.
Two of the more puzzling outcomes are described and interpreted below. In both cases, the clue that something unusual has happened is that ALL the (non-missing) results for a problem are flagged as outliers.
One question that always comes up when estimating parameters for a model is, how well does the data fit the model? If the fit is very poor, then the estimated parameters are not meaningful. The fit is never perfect for real data. A related question concerns the fit between the model and the data for each system that is examined. Some systems show a good fit between the data and the model; other systems may be exceptional. A related issue is examined in Section 5.3.2.2.5 where residuals are discussed. If the fit were perfect, the residuals would all be 1.0 and there would be no outliers.
The goodness of fit table is based on the same information presented in the residual table. "Av Error" is the arithmetic mean of the absolute differences between the logarithms of the actual values and the logarithms of the predicted values.
Av Error = SUM ( | Log (Actual) - Log (Predicted) | ) / N
If the data fit the product model perfectly, each actual would be equal to the predicted, and the mean would be zero. Thus, the calculated "Av Error" summarizes the extent to which the data fail to fit the model.
Since
Residual = Actual / Predicted (see 5.3.2.2.5),
we may also say that
Av Error = SUM ( | Log (Residual ) | ) / N.
To bring the focus back to the actuals and residuals (rather than their logarithms), we also report "Exp Err", which is the exponential (inverse logarithm) of "Av Error".
Exp Err = e ** Av Error (since we are using natural logarithms)
Thus, for perfect fit data, we would have
Exp Err = e ** 0.0
= 1.0.
The closer "Exp Err" is to 1.0, the better the fit.
Reading Figure 5-18, CA: Goodness of Fit, we see that the average error is 1.00 and the exponential of the average error is 2.72 (natural logarithms were used). The worst fit was for sys_3, which had an average error of 1.38, almost twice as large as that of the first two systems. The exponential of the average error for sys_3 is 3.96. This is the number to compare with the values in the residual table shown in Figure 5-17, since they are scaled the same. If we look back at the residual table, we see that the outliers for sys_3 are very extreme; three are greater than 17.0, which means that these values were more than 17 times slower than expected.
=========================================================================
---- Goodness of fit: Variation per system: | ln (x) - ln (predicted) |
=========================================================================
Total | sys_1 sys_2 sys_3
-------------------------------------------------------------------------
Sample n 55 | 12 22 21
Av Error 1.00 | 0.75 0.77 1.38
Exp err 2.72 | 2.13 2.17 3.96
=========================================================================
(The other outlier is printed as 0.00, which means that it must be very small. If we look at the raw data table, we find that the actual value is 0.00, so this is a problem which sys_3 optimized to zero. Zeroes always appear as outliers since they never fit the product model.)
If we examine the other two systems more closely, we will see why a simple examination of the residual table is not a good way to assess goodness of fit. Sys_2 also has extreme outliers. The first three cases are approximately 20 times faster than expected, yet the overall average error is less than for sys_3.
The reason for this phenomenon is the large number of cases where the fit for sys_2 is extremely good (values close to 1.0). The first system does no better overall than sys_2, even though it has no outliers; it has only 3 cases close to 1.0, compared to the 13 cases sys_2 has that are within 10% of a perfect fit (between 0.92 and 1.07). Ten of those close fits are between 0.98 and 1.05. Sys_3 has only one case that is within 10% of a perfect fit.
The conclusion to be drawn is that the data summary provided by the goodness of fit table does not tell us anything we could not figure out by closely studying the residual table; but it is faster. (Remember that this residual table is one of the smallest such tables in the CA output.)
The relationship between goodness of fit and the confidence intervals should be mentioned. If everything else is equal (which it rarely is), wide confidence intervals will be associated with poor fit, and narrow confidence intervals will be associated with good fit. Confidence intervals are also a function of sample size. If everything else is equal (which it rarely is), larger samples will yield smaller confidence intervals and smaller samples will yield larger confidence intervals.
Missing data can distort our findings and occasionally even cause us to reach incorrect conclusions if we ignore it. If there is no missing data, then this section is irrelevant; in fact, in that case, the pairwise comparison table will not be produced. We must emphasize that it is different patterns of missing data between systems that we are concerned about. If we compare several systems which are all missing the same data, then our results are not distorted. However, when we compare three or more systems, and data for some problems is available on some of these systems but not on others, then misleading conclusions can be reached.
This problem never occurs when we are only comparing two systems. Such comparisons are based on the problems where data is available on both systems. All other data is excluded. See Figure 5-19 for an example of the CA: Pairwise Comparisons report.
======================================================================
---- pairwise comparisons: total n = 24
======================================================================
1 Systems | sys_1 sys_2 sys_3 | Mean
2 n: | 12 22 21 | Vari-
3 Sys Factor: | 0.93 1.87 0.15 | ation
----------------------------------------------------------------------
4 sys_1 n: | 12 11 |
5 Sys Factor: | 0.93 1.01 | 4.0%
----------------------------------------------------------------------
4 sys_2 n: | 12 21 |
5 Sys Factor: | 0.47 1.89 | 37.9%
----------------------------------------------------------------------
4 sys_3 n: | 11 21 |
5 Sys Factor: | 0.35 0.15 | 66.2%
======================================================================
Read Figure 5-19, CA: Pairwise Comparisons, as follows:
* The first row of this table contains the names of the systems being compared (left title: "Systems").
* The second row contains the sample sizes of the systems being compared (left title: "n:")
* The third row contains the system factors from the complete analysis for the systems being compared (left title: "Sys Factor:"). For the system named "sys_1", the system factor based on all the available data (n = 12) is 0.93.
* The fourth row contains the number of cases shared by the two systems being compared (left title: "sys_1 n:"). The systems being compared are the system on the left (sys_1 for row 4) and the systems listed across the top.
* The fifth row contains system factors for the system listed on the left, based on the sample data which the two systems have in common (left title: "Sys Factor:"). These are the modified system factors. At the end of this line is an indicator of how much variation there is between system factors (the "Mean Variation" column). This number, expressed as a percent, is the average of the absolute values of the differences between each modified system factor in the line and the system factor based on all the data, divided by the system factor based on all the data:
percent = 100.0 * SUM ( | modified_system_factor - overall_system_factor | ) / ( number_of_comparisons * overall_system_factor )
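The calculation can be sketched in Ada as follows. This is not the CA source; the names are invented, and the example simply reproduces the sys_2 line of Figure 5-19 under the mean-variation reading given above.
with Ada.Text_IO;
procedure Mean_Variation_Sketch is
   use Ada.Text_IO;
   type Float_Array is array (Positive range <>) of Float;
   --  Percent variation between the modified system factors in one line
   --  of the pairwise table and the overall system factor for that line.
   function Mean_Variation
     (Modified : Float_Array;
      Overall  : Float) return Float
   is
      Sum : Float := 0.0;
   begin
      for I in Modified'Range loop
         Sum := Sum + abs (Modified (I) - Overall);
      end loop;
      return 100.0 * Sum / (Float (Modified'Length) * Overall);
   end Mean_Variation;
begin
   --  The sys_2 line of Figure 5-19: modified factors 0.47 and 1.89
   --  against an overall factor of 1.87; prints a value near 37.9.
   Put_Line (Float'Image (Mean_Variation ((0.47, 1.89), 1.87)));
end Mean_Variation_Sketch;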
The pattern established by the fourth and fifth rows repeats through the rest of the table for the rest of the systems. For the comparison between the first ("sys_1") and second ("sys_2") systems, the values (from row 3) are 0.93 versus 1.87. However, if we base the comparison on the 12 cases where both systems have data, the values are 0.93 versus 0.47. If we look at all the data, it appears that sys_1 is twice as fast as sys_2. If we look only at the common data, sys_2 is twice as fast as sys_1. Not only did sys_2 run nearly twice as many cases as sys_1, but on the cases they both ran, it ran twice as fast. Missing data can lead to the wrong conclusions. (This is actual data, not an artificially constructed example.)
(A user doing an evaluation of these three systems might have already removed sys_1 from consideration because of the large number of failures. But this will not always be appropriate. In some other group(s) the other two systems may have failed a large number of tests. In general the number of failures is an important piece of evaluation information, but it is not always sufficient.)
The conclusions reached by examining the other comparisons do not change when the missing data impact is considered.
If the impact is large, and if you are interested in comparing these two systems, then you should rerun the analysis for this group with just these two systems. In this case there is probably no need to rerun the analysis. But if the findings were less clear-cut, then rerunning is recommended.
NOTE: the numbers on the extreme left (1,2,3,4,5) were added for reference. They are not part of the table as printed by CA.
Figure 5-20 from the Summary Over All Groups Report tells us the number of valid individual test problems for each group from each system. This data is produced in each individual group report, but here we can more easily compare systems based on the number of tests that ran.
============================================================================
No. Problems | sys_1 sys_2 sys_3 | Possible
------------------------------------------------------------------------
application | 25 32 31 | 99
arithmetic | 17 65 63 | 139
classical | 21 75 77 | 88
data_storage | 21 39 36 | 101
data_structures | 69 144 138 | 304
delays_and_timing | 13 16 16 | 48
exception_handling | 39 46 38 | 58
generics | 12 22 21 | 27
input_output | 31 63 83 | 122
interfaces | 12 5 6 | 12
miscellaneous | 0 0 0 | 17
object_oriented | 9 8 9 | 9
optimizations | 134 198 171 | 324
program_organization | 24 24 9 | 94
protected_types | 10 11 7 | 12
statements | 37 50 44 | 92
storage_reclamation | 9 17 8 | 65
subprograms | 27 44 46 | 80
systematic_compile_speed| 0 0 0 | 84
tasking | 34 38 38 | 142
user_defined | 0 0 0 | 20
========================================================================
Total | 544 897 841 | 1937
========================================================================
Users may optionally ask CA to provide a list of all missing test problems for each system in the comparison. Figure 5-21, Individual Problem Errors by System, is an example. All systems are listed, even though some may have no missing problems.
GENERICS (gn) - product model
======================================================================
---- Individual Problem Errors by System
======================================================================
----------------------------------------------------------------------
sys_1
----------------------------------------------------------------------
---- No errors found
----------------------------------------------------------------------
sys_2
----------------------------------------------------------------------
gn_in_enum_io_02 link time
gn_in_enum_io_03 link time
gn_in_enum_io_05 link time
gn_in_enum_io_07 link time
gn_in_enum_io_09 link time
----------------------------------------------------------------------
sys_3
----------------------------------------------------------------------
gn_in_enum_io_02 compile time error
gn_in_enum_io_03 compile time error
gn_in_enum_io_04 compile time error
gn_in_enum_io_05 compile time error
gn_in_enum_io_06 compile time error
gn_in_enum_io_07 compile time error
gn_in_enum_io_08 no data
gn_in_enum_io_09 no data
Users may also optionally ask CA for a sorted list of outliers by system. Not all residuals appear in this list; only those extreme cases which CA has flagged as outliers. These large residual values are sorted and listed by system. See Figure 5-22, Sorted Problem Outliers by System, for an example.
GENERICS (gn) - product model
======================================================================
---- Sorted Problem Outliers by System
======================================================================
----------------------------------------------------------------------
sys_1
----------------------------------------------------------------------
gn_in_enum_io_06 0.22--
gn_su_subprogram_02 0.43--
gn_in_enum_io_04 0.54-
gn_su_subprogram_07 2.49++
gn_su_subprogram_12 2.50++
gn_su_subprogram_14 2.63++
gn_su_subprogram_11 4.10++
----------------------------------------------------------------------
sys_2
----------------------------------------------------------------------
gn_in_enum_io_01 1.97+
gn_in_enum_io_06 4.60++
----------------------------------------------------------------------
sys_3
----------------------------------------------------------------------
gn_in_enum_io_01 0.51-
gn_su_subprogram_02 3.32++
gn_su_subprogram_03 35.15++
The Application Profile Report option is designed for users who can estimate what Ada features will go into their application and who can assign a run-time weight to these features. It is NOT for users who already have an application (in some form); they should benchmark their application.
If users can estimate the run-time weight of the Ada features in their applications, then a natural way to evaluate performance is to calculate a weighted average of the performance of these features; a minimal sketch of such a weighted average is given after the list below. This is what the Application Profile Report mode does. In this mode the CA program uses an additive model, not a product model. In this mode (and only in this mode) data can be selected from any (or all) of the test groups. This selection process is described in the User's Guide Section 9.1.5 "Modifying the Structure (Weights) File". Briefly, the process is:
* Within groups, exclude subgroups by giving them default weights of 0.0.
* Within subgroups without default weights, all cases with nonzero weights are selected.
* Select the desired groups in the menu or in the request file.
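The arithmetic behind the Application Profile Report's weighted average can be sketched in Ada as follows. This is not the CA source; the names, the example times, and the weights are invented, and any normalization CA may apply before averaging is not shown.
with Ada.Text_IO;
procedure Profile_Sketch is
   use Ada.Text_IO;
   type Float_Array is array (Positive range <>) of Float;
   --  Weighted average of the execution times of the selected test
   --  problems for one system (the additive model used by the
   --  Application Profile Report).
   function Weighted_Average (Times, Weights : Float_Array) return Float is
      Sum_T : Float := 0.0;
      Sum_W : Float := 0.0;
   begin
      for I in Times'Range loop
         Sum_T := Sum_T + Weights (I) * Times (I);
         Sum_W := Sum_W + Weights (I);
      end loop;
      return Sum_T / Sum_W;
   end Weighted_Average;
begin
   --  Invented example: three selected problems, with the problem that
   --  dominates the planned application given the largest weight.
   Put_Line (Float'Image (Weighted_Average (Times   => (12.5, 430.0, 3.8),
                                            Weights => ( 1.0,   5.0, 2.0))));
end Profile_Sketch;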
The Application Profile Report mode produces a subset of the regular report. The following report sections are provided.
* System Factors and Confidence Intervals (including graph)
* Significant Differences
* Data Summary
* Raw Data
* Pairwise Comparisons
The system factors correspond to weighted averages. All other tables are interpreted in the same way as before.
The following report sections are not appropriate and not provided: outlier analysis (we are not fitting a model); goodness of fit analysis (we are not fitting a model); summary of all groups (we are not doing analysis by group).
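For illustration only: a weighted average of feature performance is conventionally computed as the sum of each weight times the corresponding measured time, divided by the sum of the weights. Whether CA normalizes in exactly this way is not specified here, so the sketch below is a generic illustration with hypothetical weights and times, not the CA computation.

with Ada.Text_IO; use Ada.Text_IO;

procedure Profile_Average_Sketch is
   type Real is digits 6;
   --  Hypothetical run-time weights and measured times (microseconds)
   --  for three Ada features an application is expected to exercise.
   Weights : constant array (1 .. 3) of Real := (0.5, 0.3, 0.2);
   Times   : constant array (1 .. 3) of Real := (12.0, 105.0, 3.4);
   Weighted_Sum : Real := 0.0;
   Weight_Total : Real := 0.0;
begin
   for I in Weights'Range loop
      Weighted_Sum := Weighted_Sum + Weights (I) * Times (I);
      Weight_Total := Weight_Total + Weights (I);
   end loop;
   Put_Line ("Weighted average:"
             & Real'Image (Weighted_Sum / Weight_Total));
end Profile_Average_Sketch;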
The ACES Single System Analysis (SSA) tool is designed to help extract the information implicit in the relationships between related test problems. It analyzes the measurements obtained from executing the ACES test suite on one system by comparing related test problems. Some relationships between problems which allow comparisons include:
* Optimizable and hand-optimized versions of a problem.
* Performance of the same operations using different coding styles (including Ada 83 and Ada 95 styles for accomplishing the same result).
* Versions with and without specification of certain pragmas, such as SUPPRESS or INLINE.
These comparisons add no information to that already present in the "raw" performance data. However, with over seventeen hundred separate test problems, comparing the related test problems by hand is a time-consuming task, particularly for anyone not initially familiar with the relationships between the test problems. The SSA tool knows the relationships between test problems, and the report it generates highlights the significant results.
The CA tool displays how one system performs relative to other systems. The SSA tool compares results from related test problems executed on the same system. Roughly speaking, the CA tool provides data most useful for selecting between different compilation systems while the SSA tool provides data to help programmers efficiently use a compilation system after it has been selected.
The SSA report can be used for comparisons between systems by manually examining two (or more) reports on a table by table basis. This would be tedious if complete reports were being perused, but could be very useful for someone with a tightly focused interest.
To elaborate on the differences between the SSA tool and the CA tool, consider two test problems, one of which is a hand-optimized version of the other. When either all sample systems perform the optimization or none of them do, the CA residuals for both problems might not flag any system as anomalous. An examination of the CA residual matrix will not tell the ACES reader whether or not any system optimized the optimizable problem. This information is precisely what the SSA tool will provide. Both types of information can be valuable; if there are no performance differences between the different systems, then for a source selection activity, the systems are comparable and need not be distinguished. Programmers writing code for a system may want to know whether an optimization is performed; unless they are concerned with portability, they may not care whether the optimization is performed by any other compilation system.
By reporting whether specific optimizations are performed, the SSA tool permits a programmer to change coding styles to adapt to the strengths and weaknesses of a system. A programmer who knows that a compilation system performs no loop invariant motion will know that using temporary variables and performing loop invariant motion "by hand" in the source text can be a profitable operation. With an optimizing compiler, such a coding style may be superfluous and duplicate effort that the compiler will perform by itself. There may be some systems where record processing is faster when coded as a sequence of operations on each component of the record. On such a system, a programmer would minimize the use of record operations in time-critical code. Other areas where the SSA tool may highlight potential performance problems include: aggregate processing; package elaboration; subprogram linkages; passing of unconstrained parameters; tasking constructions; exception processing; and block entry.
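To make the loop-invariant-motion example concrete, the fragment below shows the transformation performed by hand. It is a generic sketch with hypothetical names, not an ACES test problem.

procedure Invariant_Motion_Sketch is
   type Vector is array (1 .. 100) of Integer;
   Data : Vector := (others => 1);
   A, B : Integer := 3;
begin
   --  Straightforward coding: A * B is invariant but, without
   --  compiler optimization, may be recomputed on every iteration.
   for I in Data'Range loop
      Data (I) := Data (I) + A * B;
   end loop;

   --  Loop-invariant motion "by hand": the invariant expression is
   --  computed once into a temporary before the loop.
   declare
      Scale : constant Integer := A * B;
   begin
      for I in Data'Range loop
         Data (I) := Data (I) + Scale;
      end loop;
   end;
end Invariant_Motion_Sketch;

On a compiler that already performs loop-invariant motion, the two loops should take essentially the same time; on one that does not, the second form will typically be faster.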
The sections of the SSA report are as follows:
* High Level Summary Report
+ Execution
+ Code Size
+ Compilation
+ Errors
* Main Report
+ Sections
- Language Feature Overhead
- Optimizations
- Runtime System Behavior
- Coding Style Variations
- Ancillary Data - Options
- Problem Descriptions
- Statistical Tables
- Table Summaries
- Missing Data Report
* Table of Contents
This section lists the individual tables in the Main Report and the page numbers on which they may be found. If there is not sufficient data to permit a finding to be made, there is no entry in the table of contents for it. The order of presentation is the same for every analysis.
* Missing Data Report
This is an optional section which lists the test problems in the Main Report, by tables, for which data was not available.
The actual report consists of four files, corresponding to each of these four sections. Each of these will be discussed and illustrated in turn.
Input and output files are discussed in more detail in the User's Guide. A brief description is provided below.
* Input files
+ System Names file - "za_cosys.txt"
+ Request file - defined in the System Names file. Default is "za_sareq.txt"
+ Structure (weights) file - defined in the System Names file. Default is "za_cowgt.txt". Weights are not used in SSA, but the structure file is essential.
+ Table templates for SSA
- "za_salft.ssa" - template file for language feature tests
- "za_saopt.ssa" - template file for optimization tests
- "za_sarts.ssa" - template file for run-time system tests
- "za_sasty.ssa" - template file for coding style tests
* Output files
+ High Level Summary report - "system_name.hls"
+ Table of Contents - "system_name.toc"
+ Main Report - "system_name.rep"
+ Missing Data Report - "system_name.mis"
The High Level Summary Reports provide useful information on execution time data, code size (relative to line counts and semicolon counts), compilation times, and errors. The reports are different in each of these areas. Examples and explanations are given below. Note that the execution time reports and the compilation time reports include examples and failure analyses. There is no separate failure analysis in the code size reports because code size is collected for the same tests as execution time. Compilation time data is collected for complete programs (sets of tests linked as single executables), so failures are analyzed separately in this case.
One summary of execution time performance is an examination of the results from the classical benchmarks, Whetstone and Dhrystone, which are found in the Classical (Cl) group. The first is a measure of numeric programming speed; the second is a measure of non-numeric speed. Results are presented graphically in Figure 5-23 and reflect the differences between turning checking on and off. Differences resulting from optimization for space (as opposed to time) are also presented if the compiler supports this option.
==============================================================================
------------------------------------------------------------------------------
Execution Time Performance Report
(Examples) - in milliseconds
------------------------------------------------------------------------------
==============================================================================
--------------------------------------------------------------------------
Test Name Checking Optimize Time 0 100 200 300 400 500 600 700
-------------------------------------|....|....|....|....|....|....|....|
--------------------------------------------------------------------------
Whetstone On Time 395 |----+----+----+---- |
Whetstone Off Time 358 |----+----+----+-- |
Whetstone* Off Time 606 |----+----+----+----+----+----+ |
Whetstone Off Space 363 |----+----+----+--- |
-------------------------------------|....|....|....|....|....|....|....|
Test Name Checking Optimize Time 0 100 200 300 400 500 600 700
--------------------------------------------------------------------------
* Double Precision Floating Point Arithmetic
---------------------------------------------------------------------------
If the user has requested error data, then several additional tables will be produced. The first table, shown in Figure 5-24, gives an example. Valid times include null times as well as delay times. Timing errors include unreliable time measurements, verification errors, and timeouts due to excessive times. Problems that failed include missing problems (those with no data), errors, not-applicable tests (set by the user), withdrawn tests, and the 25 tests for which there is never any execution time data. These 25 tests, all in the systematic compile speed group, appear in the Harness totals but not in the totals for the SSA reports. Otherwise, these report numbers should be identical. Figure 5-25, SSA: Execution Time - Failure Analysis by Category, shows a more detailed breakdown of this data.
============================================================================
----------------------------------------------------------------------------
Failure Analysis Report - Execution Results
Graphical Summary of Successes And Failures
----------------------------------------------------------------------------
============================================================================
number/total = % 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
------------------------|....|....|....|....|....|....|....|....|....|....|
problems that ran | |
Valid Times | |
1543/1745 = 88% |----+----+----+----+----+----+----+----+---- |
Timing Errors | |
22/1745 = 1% |- |
problems that failed | |
Missing problems | |
120/1745 = 7% |--- |
Errors | |
60/1745 = 3% |-- |
------------------------|....|....|....|....|....|....|....|....|....|....|
number/total = % 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
---------------------------------------------------------------------------
Total = 1745 Excluded from total = 25
============================================================================
----------------------------------------------------------------------------
<system name> High Level Report <Date> <Time> Page <Page #>
===========================================================================
---------------------------------------------------------------------------
Failure Analysis Report - Execution Results
Graphical Breakdown of Failures by Category
---------------------------------------------------------------------------
===========================================================================
number/total = % 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
------------------------|....|....|....|....|....|....|....|....|....|....|
unreliable time | |
17/202 = 8% |---- |
verification | |
0/202 = 0% | |
negative time | |
0/202 = 0% | |
claim excess time | |
5/202 = 2% |- |
packaging error | |
0/202 = 0% | |
compile time error | |
22/202 = 11% |----+ |
link time | |
0/202 = 0% | |
run time error | |
28/202 = 14% |----+-- |
dependent test | |
10/202 = 5% |-- |
inconsistent data | |
0/202 = 0% | |
no data | |
120/202 = 59% |----+----+----+----+----+----+ |
Not applicable | |
0/202 = 0% | |
withdrawn test | |
0/202 = 0% | |
------------------------|....|....|....|....|....|....|....|....|....|....|
number/total = % 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
----------------------------------------------------------------------------
============================================================================
<system name> High Level Report <Date> <Time> Page <Page #>
Figure 5-26, SSA: Execution Time - Failure by Groups, breaks this data down even further, by category and by group. This report, except for the 25 tests in the Systematic Compile Speed group, which never have execution results, is identical to the Harness report produced by the Write_groups command.
============================================================================
----------------------------------------------------------------------------
Failure Analysis Report - Execution Results
by Group and by Type of Failure
----------------------------------------------------------------------------
============================================================================
----------------------------------------------------------------------------
Groups Data Summary Categories
----------------------------------------------------------------------------
Va Dy Un Vr Ng Xc Pk Cm Ln Rn Dp In No NA Wd Sum
----------------------------------------------------------------------------
application
99 0 0 0 0 0 0 0 0 0 0 0 0 0 0 99
arithmetic
114 0 0 0 0 0 0 0 0 0 0 0 25 0 0 139
classical
87 0 0 0 0 0 0 0 0 0 0 0 1 0 0 88
data_storage
101 0 0 0 0 0 0 0 0 0 0 0 0 0 0 101
data_structures
242 0 0 0 0 0 0 0 0 0 0 0 62 0 0 304
delays_and_timing
26 15 0 0 0 0 0 0 0 0 0 0 7 0 0 48
exception_handling
58 0 0 0 0 0 0 0 0 0 0 0 0 0 0 58
generics
25 0 0 0 0 0 0 0 0 0 0 0 2 0 0 27
input_output
116 0 0 0 0 0 0 0 0 0 0 0 6 0 0 122
interfaces
0 0 0 0 0 0 0 0 0 0 0 0 12 0 0 12
miscellaneous
17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 17
object_oriented
0 0 0 0 0 0 0 0 0 0 0 0 9 0 0 9
optimizations
324 0 0 0 0 0 0 0 0 0 0 0 0 0 0 324
program_organization
74 0 0 0 0 0 0 0 0 0 0 0 20 0 0 94
protected_types
0 0 0 0 0 0 0 0 0 0 0 0 12 0 0 12
statements
92 0 0 0 0 0 0 0 0 0 0 0 0 0 0 92
storage_reclamation
64 0 0 0 0 1 0 0 0 0 0 0 0 0 0 65
subprograms
80 0 0 0 0 0 0 0 0 0 0 0 0 0 0 80
systematic_compile_speed
84 0 0 0 0 0 0 0 0 0 0 0 0 0 0 84
tasking
126 8 0 0 0 0 0 0 0 0 0 0 8 0 0 142
user_defined
0 0 0 0 0 0 0 0 0 0 0 0 20 0 0 20
----------------------------------------------------------------------------
Totals 1729 23 0 0 0 1 0 0 0 0 0 0 184 0 0 1937
----------------------------------------------------------------------------
---------------------------------------------------------------------------
In Figure 5-26, counts appearing in some of the columns do not represent run-time failures, as follows:
* Va Valid timing results
* Dy Delay tests not included in execution time analysis
* Cm Tests not completing compilation
* Ln Tests not completing linking
* Dp Tests requiring some unsupported implementation-dependent feature
* NA Tests labelled as not applicable by the user
* Wd Withdrawn tests
The other columns (except for the "Sum" column) represent run-time failures, as follows:
* Un Tests whose timing measurements are statistically unreliable
* Vr Tests for which the null loop timing could not be verified
against the baseline established at the beginning of testing
* Ng Tests reporting negative execution times (indicating unexpected
treatment of the timing mechanism)
* Xc Tests that were aborted because preliminary analysis during
execution showed they would take excessive time to complete
* Pk Tests not reporting results because of packaging (occurring
after an uncompleted test in the same executable)
* Rn Tests exhibiting some unexplained run-time error
* In Tests with inconsistencies between compile-time status and run-time status
* No Tests for which no data is available
The Compile Speed Summary is based on the combined compile and link times for the over 700 main programs in the ACES performance test suite. After the timing loop code has been included, there is a total line count of over 300,000 lines. This report gives results based on a count of physical lines and separate results based on a count of semicolons. In both cases, the counts include the test files and the main program files. (In most cases, there is one test per test file and several tests are WITHed by the main program and called from it.)
Figure 5-27, SSA: Compile Speed - Frequency (Semicolons Per Minute), gives an example of the distribution of compile speeds (semicolons per minute). Frequency is simply the number of programs in each classification. At a glance we can see that a typical program compiled at a rate of 50 to 90 semicolons per minute. Nine programs compiled at fewer than 10 semicolons per minute, and nineteen compiled at more than 170 semicolons per minute. Figure 5-28, SSA: Compile Speed - Frequency (Lines Per Minute), shows a similar table based on physical lines instead of semicolons.
Figure 5-29, SSA: Compile Speed - Average, High, Low, provides another way to examine the data in the first two tables. For both physical lines and semicolons, the total count is given, followed by the main programs which had the highest average lines (semicolons) per minute and the lowest average lines (semicolons) per minute. These findings are followed by the results from some selected large performance tests, including Dhrystone and Whetstone (see Figure 5-30, SSA: Compile Speed - Examples).
If the user has requested error (failure) data, then the tables in Figures 5-31, 5-32, and 5-33 will be produced. Figure 5-31, SSA: Compile Speed - Failure Analysis, gives a graphical picture of successes versus failures. Figure 5-32, SSA: Compile Speed - Failures by Category, breaks the failures down by category. Figure 5-33, SSA: Compile Speed - Failures by Group, breaks the failures down by category and by group. The column headers in Figure 5-33 have the same meanings as in Figure 5-26, described in the previous section. Note that only the "Cm", "Ln", and "No" columns represent failures at compilation time.
================================================================================
--------------------------------------------------------------------------------
Compilation Speed Report
Semicolons per minute
--------------------------------------------------------------------------------
================================================================================
Frequency ";"/min 0 10 20 30 40 50 60 70 80 90 100
----------------------|....|....|....|....|....|....|....|....|....|....|
9 1 .. 10 |---- |
----------------------| |
12 11 .. 20 |----+- |
----------------------| |
17 21 .. 30 |----+--- |
----------------------| |
23 31 .. 40 |----+----+- |
----------------------| |
20 41 .. 50 |----+----+ |
----------------------| |
52 51 .. 60 |----+----+----+----+----+- |
----------------------| |
51 61 .. 70 |----+----+----+----+----+ |
----------------------| |
62 71 .. 80 |----+----+----+----+----+----+- |
----------------------| |
83 81 .. 90 |----+----+----+----+----+----+----+----+- |
----------------------| |
30 91 .. 100 |----+----+----+ |
----------------------| |
27 101 .. 110 |----+----+--- |
----------------------| |
19 111 .. 120 |----+---- |
----------------------| |
23 121 .. 130 |----+----+- |
----------------------| |
33 131 .. 140 |----+----+----+- |
----------------------| |
14 141 .. 150 |----+-- |
----------------------| |
5 151 .. 160 |-- |
----------------------| |
3 161 .. 170 |- |
----------------------| |
19 >= 171 |----+---- |
----------------------| |
Total: 502 | |
----------------------|....|....|....|....|....|....|....|....|....|....|
Frequency ";"/min 0 10 20 30 40 50 60 70 80 90 100
------------------------------------------------------------------------------
============================================================================
----------------------------------------------------------------------------
Compilation Speed Report
Lines per minute
----------------------------------------------------------------------------
============================================================================
Frequency Lines/min 0 10 20 30 40 50 60 70 80 90 100
----------------------|....|....|....|....|....|....|....|....|....|....|
7 1 .. 10 |--- |
----------------------| |
5 11 .. 20 |-- |
----------------------| |
7 21 .. 30 |--- |
----------------------| |
8 31 .. 40 |---- |
----------------------| |
10 41 .. 50 |----+ |
----------------------| |
3 51 .. 60 |- |
----------------------| |
16 61 .. 70 |----+--- |
----------------------| |
13 71 .. 80 |----+- |
----------------------| |
12 81 .. 90 |----+- |
----------------------| |
30 91 .. 100 |----+----+----+ |
----------------------| |
24 101 .. 110 |----+----+-- |
----------------------| |
22 111 .. 120 |----+----+- |
----------------------| |
28 121 .. 130 |----+----+---- |
----------------------| |
38 131 .. 140 |----+----+----+---- |
----------------------| |
.
.
----------------------| |
7 261 .. 270 |--- |
----------------------| |
26 >= 271 |----+----+ |
----------------------| |
Total: 502 | |
----------------------|....|....|....|....|....|....|....|....|....|....|
Frequency Lines/min 0 10 20 30 40 50 60 70 80 90 100
---------------------------------------------------------------------------
<system name> High Level Report <Date> <Time> Page <Page #>
============================================================================
Compilation Speed : (physical lines)
============================================================================
----------------------------------------------------------------------------
----- average lines per minute : 145.81
-- based on total line count of 190970.0
----------------------------------------------------------------------------
0 150 300 450 600 750 900 1050 1200
--------------------------------|....|....|....|....|....|....|....|....|
Highest Lines Per Minute | |
-----------------------------| Main program : io_txm02 |
lines / minute : 843 |----+----+----+----+----+--- |
line count : 545 | |
semicolons / minute: 469 |----+----+----+ |
semicolon count : 303 | |
--------------------------------|....|....|....|....|....|....|....|....|
Lowest Lines Per Minute | |
-----------------------------| Main program : sy_cum10 |
lines / minute : 2 | |
line count : 76 | |
semicolons / minute: 1 | |
semicolon count : 40 | |
--------------------------------|....|....|....|....|....|....|....|....|
0 150 300 450 600 750 900 1050 1200
============================================================================
Compilation Speed : (semicolons)
============================================================================
----------------------------------------------------------------------------
----- average semicolons per minute : 80.48
-- based on total semicolon count of 105404.0
----------------------------------------------------------------------------
0 150 300 450 600 750 900 1050 1200
--------------------------------|....|....|....|....|....|....|....|....|
Highest Semis Per Minute | |
-----------------------------| Main program : io_txm02 |
lines / minute : 843 |----+----+----+----+----+--- |
line count : 545 | |
semicolons / minute: 469 |----+----+----+ |
semicolon count : 303 | |
--------------------------------|....|....|....|....|....|....|....|....|
Lowest Semis Per Minute | |
-----------------------------| Main program : sy_cum10 |
lines / minute : 2 | |
line count : 76 | |
semicolons / minute: 1 | |
semicolon count : 40 | |
--------------------------------|....|....|....|....|....|....|....|....|
0 150 300 450 600 750 900 1050 1200
============================================================================
===========================================================================
---------------------------------------------------------------------------
Compile Speed Report
Compilation Speed : (Examples)
---------------------------------------------------------------------------
===========================================================================
---------------------------------------------------------------------------
Test Name Checking Optimize 0 150 300 450 600 750 900 1050 1200
--------------------------------|....|....|....|....|....|....|....|....|
Dhrystone On Time | |
-----------------------------| Main program : cl_dhm01 |
lines / minute : 246 |----+--- |
line count : 756 | |
semicolons / minute: 119 |--- |
semicolon count : 367 | |
.
.
--------------------------------|....|....|....|....|....|....|....|....|
Whetstone On Time | Double Precision Arithmetic |
-----------------------------| Main program : cl_whm03 |
lines / minute : 99 |--- |
line count : 290 | |
semicolons / minute: 59 |- |
semicolon count : 172 | |
--------------------------------|....|....|....|....|....|....|....|....|
Whetstone Off Space | |
-----------------------------| Main program : cl_whm04 |
lines / minute : 100 |--- |
line count : 287 | |
semicolons / minute: 60 |-- |
semicolon count : 172 | |
--------------------------------|....|....|....|....|....|....|....|....|
Avionics Off Time | |
-----------------------------| Main program : ap_avm01 |
lines / minute : 456 |----+----+----+ |
line count : 1974 | |
semicolons / minute: 199 |----+- |
semicolon count : 862 | |
--------------------------------|....|....|....|....|....|....|....|....|
Kalman Filter Off Time | |
-----------------------------| Main program : ap_kfm01 |
lines / minute : 486 |----+----+----+- |
line count : 4762 | |
semicolons / minute: 251 |----+--- |
semicolon count : 2461 | |
--------------------------------|....|....|....|....|....|....|....|....|
Test Name Checking Optimize 0 150 300 450 600 750 900 1050 1200
---------------------------------------------------------------------------
<system name> High Level Report <Date> <Time> Page <Page #>
============================================================================
----------------------------------------------------------------------------
Failure Analysis Report - Compilation Results
Graphical Summary of Successes And Failures
----------------------------------------------------------------------------
============================================================================
number/total = % 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
------------------------|....|....|....|....|....|....|....|....|....|....|
problems that ran | |
Valid Times | |
554/656 = 84% |----+----+----+----+----+----+----+----+-- |
Timing Errors | |
0/656 = 0% | |
problems that failed | |
Missing problems | |
102/656 = 16% |----+--- |
Errors | |
0/656 = 0% | |
------------------------|....|....|....|....|....|....|....|....|....|....|
number/total = % 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
----------------------------------------------------------------------------
Total = 656 Excluded from total = 0
============================================================================
----------------------------------------------------------------------------
<system name> High Level Report <Date> <Time> Page <Page #>
============================================================================
----------------------------------------------------------------------------
Failure Analysis Report - Compilation Results
Graphical Breakdown of Failures by Category
----------------------------------------------------------------------------
============================================================================
number/total = % 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
------------------------|....|....|....|....|....|....|....|....|....|....|
unreliable time | |
0/102 = 0% | |
verification | |
0/102 = 0% | |
negative time | |
0/102 = 0% | |
claim excess time | |
0/102 = 0% | |
packaging error | |
0/102 = 0% | |
compile time error | |
0/102 = 0% | |
link time | |
0/102 = 0% | |
run time error | |
0/102 = 0% | |
dependent test | |
0/102 = 0% | |
inconsistent data | |
0/102 = 0% | |
no data | |
102/102 = 100% |----+----+----+----+----+----+----+----+----+----|
Not applicable | |
0/102 = 0% | |
withdrawn test | |
0/102 = 0% | |
------------------------|....|....|....|....|....|....|....|....|....|....|
number/total = % 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
----------------------------------------------------------------------------
============================================================================
---------------------------------------------------------------------------
<system name> High Level Report <Date> <Time> Page <Page #>
==============================================================================
----------------------------------------------------------------------------
Failure Analysis Report - Compilation Results
by Group and by Type of Failure
----------------------------------------------------------------------------
============================================================================
----------------------------------------------------------------------------
Groups Data Summary Categories
----------------------------------------------------------------------------
Va Dy Un Vr Ng Xc Pk Cm Ln Rn Dp In No NA Wd Sum
----------------------------------------------------------------------------
application
40 0 0 0 0 0 0 0 0 0 0 0 0 0 0 40
arithmetic
23 0 0 0 0 0 0 0 0 0 0 0 9 0 0 32
classical
40 0 0 0 0 0 0 0 0 0 0 0 1 0 0 41
data_storage
16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16
data_structures
54 0 0 0 0 0 0 0 0 0 0 0 10 0 0 64
delays_and_timing
12 0 0 0 0 0 0 0 0 0 0 0 1 0 0 13
exception_handling
17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 17
generics
6 0 0 0 0 0 0 0 0 0 0 0 1 0 0 7
input_output
31 0 0 0 0 0 0 0 0 0 0 0 2 0 0 33
interfaces
0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 2
miscellaneous
6 0 0 0 0 0 0 0 0 0 0 0 1 0 0 7
object_oriented
0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 3
optimizations
73 0 0 0 0 0 0 0 0 0 0 0 0 0 0 73
program_organization
20 0 0 0 0 0 0 0 0 0 0 0 5 0 0 25
protected_types
0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 4
statements
19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 19
storage_reclamation
52 0 0 0 0 0 0 0 0 0 0 0 1 0 0 53
subprograms
15 0 0 0 0 0 0 0 0 0 0 0 1 0 0 16
systematic_compile_speed
101 0 0 0 0 0 0 0 0 0 0 0 8 0 0 109
tasking
126 0 0 0 0 0 0 0 0 0 0 0 5 0 0 131
user_defined
0 0 0 0 0 0 0 0 0 0 0 0 20 0 0 20
----------------------------------------------------------------------------
Totals 651 0 0 0 0 0 0 0 0 0 0 0 74 0 0 725
----------------------------------------------------------------------------
---------------------------------------------------------------------------
The Code Size Report is based on the size measurements for the Ada code lines which are bracketed by the timing loop. Since many of the ACES performance tests are only a few lines, the average number of lines per test is quite small (on the order of 20 lines per test). This report gives results based on a count of physical lines and separate results based on a count of semicolons. The structure of this report is similar to the structure of the Compile Speed Report described in the previous section.
A frequency distribution based on bytes per semicolon (and bytes per line) is given first. See Figure 5-34, SSA: Code Size - Semicolons, and Figure 5-35, SSA: Code Size - Lines. Next, there are summary tables for both semicolons and physical lines. In both cases, the average bytes per unit and the total test count are given. The next two tables list the performance tests which had the highest average bytes per line (or semicolon) and the lowest average bytes per line (or semicolon). See Figure 5-36, SSA: Code Size - Average, High, Low, for an example of this kind of table. These findings are followed by the results from some selected large performance tests, including Dhrystone and Whetstone. Figure 5-37, SSA: Code Size - Examples, shows what these example reports look like.
============================================================================
----------------------------------------------------------------------------
Code Size Report
Bytes Per Semicolon
----------------------------------------------------------------------------
============================================================================
Frequency Bytes/";" 0 100 200 300 400 500 600 700 800 900 1000
----------------------|....|....|....|....|....|....|....|....|....|....|
37 = 0.0 |- |
----------------------| |
0 > 0.0, < 1.0 | |
----------------------| |
259 1 .. 10 |----+----+-- |
----------------------| |
470 11 .. 20 |----+----+----+----+--- |
----------------------| |
256 21 .. 30 |----+----+-- |
----------------------| |
149 31 .. 40 |----+-- |
----------------------| |
57 41 .. 50 |-- |
----------------------| |
65 51 .. 60 |--- |
----------------------| |
19 61 .. 70 | |
.
.
12 141 .. 150 | |
----------------------| |
5 151 .. 160 | |
----------------------| |
5 161 .. 170 | |
----------------------| |
19 >= 171 | |
----------------------| |
Total: 1433 | |
----------------------|....|....|....|....|....|....|....|....|....|....|
Frequency Bytes/";" 0 100 200 300 400 500 600 700 800 900 1000
----------------------------------------------------------------------------
=================================================================================
---------------------------------------------------------------------------------
Code Size Report
Bytes Per Line
---------------------------------------------------------------------------------
=================================================================================
Frequency Bytes/line 0 100 200 300 400 500 600 700 800 900 1000
----------------------|....|....|....|....|....|....|....|....|....|....|
37 = 0.0 |- |
----------------------| |
0 > 0.0, < 1.0 | |
----------------------| |
461 1 .. 10 |----+----+----+----+--- |
----------------------| |
511 11 .. 20 |----+----+----+----+----+ |
----------------------| |
174 21 .. 30 |----+--- |
----------------------| |
106 31 .. 40 |----+ |
----------------------| |
27 41 .. 50 |- |
----------------------| |
33 51 .. 60 |- |
----------------------| |
10 61 .. 70 | |
----------------------| |
22 71 .. 80 |- |
----------------------| |
7 81 .. 90 | |
----------------------| |
6 91 .. 100 | |
----------------------| |
5 101 .. 110 | |
----------------------| |
5 111 .. 120 | |
----------------------| |
11 121 .. 130 | |
----------------------| |
18 >= 131 | |
----------------------| |
Total: 1433 | |
----------------------|....|....|....|....|....|....|....|....|....|....|
Frequency Bytes/line 0 100 200 300 400 500 600 700 800 900 1000
--------------------------------------------------------------------------------
<system name> High Level Report <Date> <Time> Page <Page #>
========================================================================
------------------------------------------------------------------------
Code Size Report
Code Size Report : (physical lines)
------------------------------------------------------------------------
========================================================================
------------------------------------------------------------------------
----- average bytes per line : 13.79
-- based on total line count of 18842
------------------------------------------------------------------------
0 150 300 450 600 750 900 1050 1200
--------------------------------|....|....|....|....|....|....|....|....|
Highest Bytes Per Line | |
--------------------------| Test : dr_ba_bool_arrays_06 |
bytes per line : 464 |----+----+----+ |
line count : 1 | |
bytes per semicolon: 464 |----+----+----+ |
semicolon count : 1 | |
--------------------------------|....|....|....|....|....|....|....|....|
Lowest Bytes Per Line | |
--------------------------| Test : op_as_alge_simp_01 |
bytes per line : 0 | |
line count : 1 | |
bytes per semicolon: 0 | |
semicolon count : 1 | |
--------------------------------|....|....|....|....|....|....|....|....|
0 150 300 450 600 750 900 1050 1200
------------------------------------------------------------------------
<system name> High Level Report <Date> <Time> Page <Page #>
============================================================================
----------------------------------------------------------------------------
Code Size Report
Code Size Report : (Examples)
----------------------------------------------------------------------------
============================================================================
----------------------------------------------------------------------------
Test Name Checking Optimize 0 5 10 15 20 25 30 35 40
--------------------------------|....|....|....|....|....|....|....|....|
Dhrystone On Time | |
-----------------------------| Test : cl_dh_dhrys_01 |
bytes per line : 19 |----+----+----+---- |
line count : 47 | |
bytes per semicolon: 36 |----+----+----+----+----+----+----+- |
semicolon count : 25 | |
--------------------------------|....|....|....|....|....|....|....|....|
Dhrystone Off Time | |
-----------------------------| Test : cl_dh_dhrys_02 |
bytes per line : 12 |----+----+-- |
line count : 47 | |
bytes per semicolon: 23 |----+----+----+----+--- |
semicolon count : 25 | |
--------------------------------|....|....|....|....|....|....|....|....|
.
.
--------------------------------|....|....|....|....|....|....|....|....|
Whetstone Off Space | |
-----------------------------| Test : cl_wh_whet_04 |
bytes per line : 13 |----+----+--- |
line count : 134 | |
bytes per semicolon: 20 |----+----+----+----+ |
semicolon count : 88 | |
--------------------------------|....|....|....|....|....|....|....|....|
Avionics Off Time | |
-----------------------------| Test : ap_av_ew |
bytes per line : 8 |----+--- |
line count : 1 | |
bytes per semicolon: 8 |----+--- |
semicolon count : 1 | |
--------------------------------|....|....|....|....|....|....|....|....|
Kalman Filter Off Time | Procedure Call |
-----------------------------| Test : ap_kf_kalman |
bytes per line : 8 |----+--- |
line count : 2 | |
bytes per semicolon: 16 |----+----+----+- |
semicolon count : 1 | |
--------------------------------|....|....|....|....|....|....|....|....|
Test Name Checking Optimize 0 5 10 15 20 25 30 35 40
----------------------------------------------------------------------------
The Table of Contents lists the names describing groups of related tests being compared and the page number in the main report where the relationships between the problems in the group are presented.
Typical table of contents entries are shown in Figure 5-38, SSA: Table of Contents.
------------------------------------------------------------------------------
Optimizations
------------------------------------------------------------------------------
Algebraic Simplification - Integer page 12
.
.
------------------------------------------------------------------------------
Coding Style
------------------------------------------------------------------------------
Array Assignment page 22
.
.
------------------------------------------------------------------------------
The Main Report always begins with the Anomalous Data Report, Figure 5-39. This report has two sections: unexpected null (zero) execution times, and null times with non-zero sizes. Both of these sections may themselves be null (i.e., not find any tests that meet these criteria). This section is considered important, however, since it may indicate serious problems with the ACES execution timing. Certainly, any tests listed here should be examined more closely. It is possible that one or more of the listed tests did not actually have a zero time; any criterion used here is necessarily arbitrary, and the ACES would rather list a plausible test problem than fail to list one with real measurement difficulties.
----------------------------------------------------------------------
Anomalous Execution Time Results
----------------------------------------------------------------------
Unanticipated Zero Times
----------------------------------------------------------------------
The following is a list of test problems with
suspect timing measurements. They were reported
as taking zero (or not statistically significantly
different from zero) but were not anticipated to
be optimized into nulls. They may reflect a
problem with the ACES timing measurement technique.
The total number of test problems with timing
measurements reported as taking zero (or not
statistically significantly different from zero)
which were not anticipated to be optimized into nulls
was 0.
----------------------------------------------------------------------
Zero Times and Non-zero Sizes
----------------------------------------------------------------------
The following is a list of test problems with
timing measurements of zero (or not statistically
significantly different from zero) and with
a code expansion size greater than zero. While
this is not necessarily an error (the target
processor may be able to completely overlap
execution of the test problem with the measurement
code), it is suspect and should be examined.
The total number of test problems which were measured
as taking zero time AND having a code expansion size
greater than zero was 0.
----------------------------------------------------------------------
<system name> Main Report <Date> <Time> Page <Page #>
In four major categories (Language Feature Overhead, Optimizations, Run-time System Behavior, and Coding Style Variations) a table is printed for each group of related test problems. The names and brief descriptions of the problems are presented. These tables are always one of two types, Multiple Comparisons or Paired Comparisons. Each table begins with a paragraph of descriptive text explaining the purpose of the comparison(s). Some tables are followed by a "Table Summary" which summarizes the important findings, either in tabular form or in text form (or both). All these alternatives are illustrated in the figures that follow.
Multiple Comparison reports are used to display the results of several related test problems, such as those in which the same operation has been coded in different styles.
Each problem is named and briefly described, and the execution times (and, if available, code sizes) are presented for them.
The report will display data from test problem sets and an indicator of the statistical differences between the results. These differences are based on the magnitude of the estimates for each problem and the observed variation in each problem. When there is no statistically significant difference between test problem results, there is no performance reason for preferring one alternative to another. When there is a statistically significant difference, the user must still make a decision as to whether it is large enough to justify modifying coding style for performance reasons.
In cases where the ACES user has had to copy results (e.g., from an embedded target without an upload capability where these results would have had to be typed in by hand) complete data may not be available. In such a case, the SSA may not have enough information to perform the similar group comparisons; however, it will perform the rest of the analysis and print as much of the report as it can.
A typical Multiple Comparison Report is shown in Figure 5-40. The header presents the type and name of the group; in this example, it is a set of problems comparing an IF statement against an exception handler for detecting constraint errors. The Code Size measurements are displayed in the same standard format as the Execution Time data, except that there is no statistical analysis and therefore no Similar Groups section. If code-size data is available, it was measured exactly and requires no further analysis. The unit of measurement is bytes (8 bits).
The report contains:
* Test problem names.
* Execution time in microseconds (problems for which timing data is unavailable will be flagged as such).
* A graphic representation of the execution time under the heading "Bar Chart". This is presented to make simple visual comparison easy, permitting a reader to judge whether two problems are roughly comparable without having to examine the numeric values. (One could flip through the main section of the report looking for groups where large differences show up.)
* Similar Groups column. If the vertical lines align, then there is no statistically significant difference among the associated performance times. If they do not align, then there are differences. In Figure 5-40, the differences among the first three problems, st_is_if_code_style_30, st_is_if_code_style_28, and st_is_if_code_style_31, are not significant (and should be ignored). The differences between st_is_if_code_style_29 and all of the other tests are significant. (In this case the difference is obvious, but this is not always true.)
-----------------------------------------------------------------------------
Coding Style Variations
-----------------------------------------------------------------------------
IF versus Exception Handler for Detecting Constraint Errors
-----------------------------------------------------------------------------
Description
--------------------------------------------------------------------------
These problems determine whether there is a performance difference
between using an exception handler to process a fault condition versus
using an explicit IF statement. The problems measure both coding styles
for the case when the condition is true and when it is false. It is
common for the time to raise an exception to be much higher than the
time to process an ELSE alternative; when fault conditions will be rare,
it is more appropriate to emphasize the case where the condition is TRUE.
(Block entry overheads will be a major factor in determining the
performance of the exception handler approach).
-----------------------------------------------------------------------------
Test Execution Bar Similar
Name Time (Microsec) Chart Groups
-----------------------------------------------------------------------------
st_is_if_code_style_30 0.8 |
st_is_if_code_style_28 0.8 |
st_is_if_code_style_31 1.3 |
st_is_if_code_style_29 528.3 ************************ |
-----------------------------------------------------------------------------
Test Code Bar
Name Size(Bits) Chart
-----------------------------------------------------------------------------
st_is_if_code_style_28 416.0 ********************
st_is_if_code_style_30 448.0 *********************
st_is_if_code_style_31 448.0 *********************
st_is_if_code_style_29 512.0 ************************
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
<system name> Main Report <Date> <Time> Page <Page #>
-----------------------------------------------------------------------------
Individual Test Descriptions
-----------------------------------------------------------------------------
st_is_if_code_style_28 =>
This is the first of a set of 4 test problems
to explore possible stylistic alternatives
using IF statements or exception handlers.
Problem ST_IS_IF_CODE_STYLE_28 uses an IF
statement to test for a discrepancy, which
IS found.
----------------------------------------------------------------
st_is_if_code_style_29 =>
This is the second of a set of 4 test problems
to explore possible stylistic alternatives
using IF statements or exception handlers.
Problem ST_IS_IF_CODE_STYLE_29 uses an exception
handler to detect that a discrepancy occurred, and
in this problem the exception IS raised.
----------------------------------------------------------------
st_is_if_code_style_30 =>
This is the third of a set of 4 test problems
to explore possible stylistic alternatives. It
uses an IF statement to detect a discrepancy and
in this problem NO discrepancy is present.
---------------------------------------------------------
st_is_if_code_style_31 =>
This is the fourth of a set of 4 test problems
to explore possible stylistic alternatives. It
uses an exception handler to detect a discrepancy and
in this problem NO discrepancy is present.
------------------------------------------------------------------------
TABLE SUMMARY
------------------------------------------------------------------------
Entering a frame and raising an exception takes approximately 635.7
times as long as a comparable IF statement (where ELSE is skipped).
Entering a frame and NOT raising an exception takes
approximately the same time as a comparable IF statement (where
ELSE is taken).
Time in microseconds to use an IF versus
exception handler to detect constraint errors.
+-----------------------------+------------+------------+
| | IF | HANDLER |
+-----------------------------+------------+------------+
| discrepancy detected | 0.831 | 528.3 |
| discrepancy NOT detected | 0.751 | 1.300 |
+-----------------------------+------------+------------+
Figure 5-40 SSA Main Report - Multiple Comparison Report (Continued)
The "Table Summary" (which is not present in all tables) provides a compact explanation of the findings. In addition, there is a tabular summary of the most important data included in some of the table summaries.
An additional form of table summary includes an equation with data dependent parameters. Figure 5-41, SSA: Main Report - Equation Fitting, provides an example. Here execution time (y) is estimated as a function of the number of parameters in a rendezvous (x).
-----------------------------------------------------------------------------
Runtime System Behavior
-----------------------------------------------------------------------------
Increasing Number of Parameters Passed in a Rendezvous
-----------------------------------------------------------------------------
Description
--------------------------------------------------------------------------
These test problems explore the effect of passing an increasing
number of parameters in a rendezvous. The test problem with no
parameters is likely to show a different pattern because some
setup code may be omitted in this case.
-----------------------------------------------------------------------------
Test Execution Bar Similar
Name Time Chart Groups
-----------------------------------------------------------------------------
tk_rz_task_no_parameters 1026346.4 **************** |
tk_rz_task_one_parameter 1047034.6 ***************** |
tk_rz_task_ten_parameters 1076699.0 ***************** |
tk_rz_task_hundred_parameter1500489.6 ************************ |
-----------------------------------------------------------------------------
Code Size
-----------------------------------------------------------------------------
tk_rz_task_no_parameters 208.0 ************************
tk_rz_task_one_parameter 208.0 ************************
tk_rz_task_ten_parameters 208.0 ************************
tk_rz_task_hundred_parameter 208.0 ************************
-----------------------------------------------------------------------------
-----------------------------------------------------------------
TABLE SUMMARY
-----------------------------------------------------------------
The average time to perform a rendezvous with "N" integer
parameters, based on a least squares fit of the time for passing
0, 1, 10 and 100 is (in the form time = a * N + b)
y = 1032959.8 + 4673.2 * x
Many systems will simplify the case of zero parameters, so the
formula will often not fit this case well.
Time in microseconds for reference to pass N
integer parameters in rendezvous.
+-----------------------------+------------+
| 0 parameters | 1026346.4 |
| 1 parameter | 1047034.6 |
| 10 parameters | 1076699.0 |
| 100 parameters | 1500489.6 |
+-----------------------------+------------+
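The equation in the table summary of Figure 5-41 is an ordinary least-squares line fitted to the four (parameter count, time) pairs shown above. As an illustration only (this is not the SSA code), the following sketch reproduces coefficients close to those reported.

with Ada.Text_IO; use Ada.Text_IO;

procedure Rendezvous_Fit_Sketch is
   type Real is digits 12;
   Points : constant := 4;
   X : constant array (1 .. Points) of Real := (0.0, 1.0, 10.0, 100.0);
   Y : constant array (1 .. Points) of Real :=
         (1026346.4, 1047034.6, 1076699.0, 1500489.6);
   Sum_X, Sum_Y, Sum_XY, Sum_XX : Real := 0.0;
   A, B : Real;  --  time = a * N + b, as in the table summary
begin
   for I in 1 .. Points loop
      Sum_X  := Sum_X  + X (I);
      Sum_Y  := Sum_Y  + Y (I);
      Sum_XY := Sum_XY + X (I) * Y (I);
      Sum_XX := Sum_XX + X (I) * X (I);
   end loop;
   A := (Real (Points) * Sum_XY - Sum_X * Sum_Y)
        / (Real (Points) * Sum_XX - Sum_X * Sum_X);
   B := (Sum_Y - A * Sum_X) / Real (Points);
   Put_Line ("a (microseconds per parameter) =" & Real'Image (A));
   Put_Line ("b (base rendezvous time)       =" & Real'Image (B));
end Rendezvous_Fit_Sketch;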
Paired Comparison tables are used primarily for detecting whether a particular optimization is performed, based on whether or not there is a significant difference between two problems where one is a hand-optimized version of the other. Figure 5-42 is an example of a Paired Comparison Table. The optimization being tested for is exponentiating by zero. The "Description" column lists the pair of problems being compared. The "Optimized?" column indicates whether the SSA believes that the optimization has been performed. Possible answers are YES, MAYBE, NO, and MISSING. The detection is based on the difference between the execution times of the two problems. If the calculated probability that the actual times are different is at least 0.9, then the software reports YES (the optimization is performed). If the probability is no more than 0.1, then the software reports NO (the optimization is not performed). If the probability is between 0.1 and 0.9, then the software reports MAYBE (no decision can be made). If one or both times are missing, the table will not be printed. If other data needed for the statistics is missing, the result will be MISSING.
-----------------------------------------------------------------------------
Optimizations
-----------------------------------------------------------------------------
Algebraic Simplification : Exponentiating by Zero
-----------------------------------------------------------------------------
Description Optimized?
--------------------------------------------------------------------------
These two test problems check whether the system is simplifying an
integer expression which is explicitly exponentiated by zero (test problem
OP_AS_ALGE_SIMP_05) versus an expression which references the simple
literal one (test problem AR_IO_INTEGER_OPER_01). On an optimizing
compiler, these problems will take the same time.
-----------------------------------------------------------------------------
Time : op_as_alge_simp_05 ( 0.3 ) vs
ar_io_integer_oper_01 ( 0.2 ) no
-----------------------------------------------------------------------------
Size : op_as_alge_simp_05 ( 128.0 ) vs
ar_io_integer_oper_01 ( 128.0 )
-----------------------------------------------------------------------------
op_as_alge_simp_05 => ii := ll ** 0;
-------------------------------------------------------------------
ar_io_integer_oper_01 => kk := 1;
-----------------------------------------------------------------------------
Any measurements which are subject to error (as are the timing results from the ACES) must be interpreted with care. If we know that time A is 10.0 and that time B is 20.0, we do not know whether we can safely say that time B is twice as large as time A, or even that time B is larger than time A, unless we know how large the measurement error is. If the error magnitude is less than 0.1, then both of those conclusions are safe. If the error magnitude is greater than 100.0, then neither conclusion is justified (not only can we not say that B is larger than A, we cannot even say that A and B are nonzero times).
Statisticians have studied problems like this and we make use of their findings. Statistically significant differences are differences which can be assumed to be reliable and believable differences. That is, they are the result of differences in the objects being measured, rather than of errors in the measurements.
The statistical analysis for Multiple Comparisons uses Bonferroni's method (refer to Chapter 3 of Analysis of Messy Data, Volume 1, by G. Milliken and D. Johnson, Van Nostrand Reinhold, 1984, for details). This method was selected because it permits all pairwise comparisons and does not require equal sample sizes. The statistical analysis for Paired Comparisons uses the method explained in Chapter 5 (pages 139-140) of Biostatistics: A Foundation for Analysis in the Health Sciences by Wayne W. Daniel, John Wiley and Sons, 1978.
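For background only (this describes the general idea of the method, not the CA or SSA implementation): Bonferroni's approach controls the overall error rate across k pairwise comparisons by testing each individual comparison at the adjusted level alpha/k. A minimal sketch with assumed values:

with Ada.Text_IO; use Ada.Text_IO;

procedure Bonferroni_Sketch is
   Alpha       : constant Float := 0.05;  --  assumed overall significance level
   Comparisons : constant       := 6;     --  e.g., all pairs among 4 test problems
   Adjusted    : constant Float := Alpha / Float (Comparisons);
begin
   --  Each of the pairwise comparisons is tested at the adjusted level.
   Put_Line ("Per-comparison significance level:" & Float'Image (Adjusted));
end Bonferroni_Sketch;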
The determination of significant differences requires knowledge of the number of samples and of the variation; for the ACES execution-time measurements, the number of samples is the outer timing loop count and the variation is the observed standard deviation. As noted above, when the ACES user has had to copy results by hand (e.g., from an embedded target without an upload capability), complete data may not be available; in that case the SSA omits the similar-group comparisons but performs the rest of the analysis and prints as much of the report as it can.
We do need to distinguish between statistical significance and practical significance. If the measurements are very accurate, then very small differences will be statistically significant, that is, they will be real differences. However, the reader of these results may not care whether one coding style is really one percent faster. The rules for interpretation are simple: first, look only for differences which are statistically significant; then make a judgment as to whether those differences are large enough to be of practical importance. Such decisions depend on non-statistical considerations, that is, on the user's judgment.
Some of the performance tests produce ancillary information as a side effect of executing them. This data is collected by the SSA tool and presented in the SSA report. The types of information included in the ancillary data report are:
* Rates for the Whetstone and Dhrystone programs in statements per second
* Time per rendezvous for the tasking tests which perform rendezvous
* Time for one procedure call on cl_ac_acker_01, cl_ac_acker_02, and su_sl_local_tak
* Observed numeric errors in the calculation for cl_gm_gamm_01 and cl_gm_gamm_02
* Whether the Storage Reclamation (sr_im*) test problems immediately reuse space
* Details about the treatment of arithmetic expressions
* Size of packed data structures
* Activation record size
* Whether asynchronous I/O is performed
* Details of task scheduling, including whether the system uses a run-till-blocked scheduler or time-slices among equal-priority tasks (and the quantum size where appropriate), and whether the system waits until outstanding delay statements are completed before completing an aborted task.
* Whether generic instantiations are shared
* Whether constraint checking is performed when pragma suppress is requested
* Other miscellaneous information
The Missing Data Report, Figure 5-43, lists the test problems in the Main Report, by title, for which no timing measurements were reported and, if the appropriate error codes have been entered, the correct reason for the missing data.
---------------------------------------------------------------------
Optimizations
---------------------------------------------------------------------
Algebraic Simplification: Integer
---------------------------------------------------------------------
op_as_alge_simp_06 is missing err_at_compilation_time
---------------------------------------------------------------------
Coding Style
---------------------------------------------------------------------
Array Assignment
---------------------------------------------------------------------
dr_ao_array_oper_07 is missing err_at_execution_time
---------------------------------------------------------------------
This section describes how time and code expansion measurements are taken, the constraints placed on test problems to permit them to be measured, the sources of measurement errors, and the steps taken to minimize errors and the error bounds.
An ACES user not familiar with computer performance tests may wonder why there needs to be any discussion about timing measurement, anticipating measurement code similar to:
START_TIME := CALENDAR.CLOCK;
test_problem_to_be_executed;
STOP_TIME := CALENDAR.CLOCK;
ELAPSED_TIME := STOP_TIME - START_TIME;
output_problem_name_and_elapsed_time;
Before discussing the ACES approach to measuring execution time, it is worthwhile to explore some of the reasons why this simple approach was not adopted.
* Precision
The precision of the type CALENDAR.TIME returned by the CALENDAR.CLOCK function is implementation dependent. This precision is typically in the range from one to fifty milliseconds, but on one system was one second. The precision of the clock measurement will introduce quantization errors. Where one clock tick is a large fraction of the measured difference between START_TIME and STOP_TIME, the measured problem execution time cannot be considered very precise.
The standard benchmarking technique to deal with this concern is to execute a test problem multiple times so that the execution time will be large enough that the quantization errors are acceptably small. Then the measured time is divided by the number of iterations. The overhead of a NULL loop is then subtracted off and an estimate for the execution time of a single iteration is reported.
Where no consideration has been given to this question, the results will contain an unknown implementation-dependent quantization error. Measurements of "fast" test problems will contain many zero values where the clock did not change between starting and stopping measurement.
* Economy
In an attempt to compensate for the precision problem, a test developer could explicitly code looping factors into the test problems to ensure that they execute for a "sufficiently long" time. However, this approach leads to a suite of tests that executes for a long time on some systems and unbearably long on others. Execution times reflect differences in code generation techniques, optimizations, and target hardware characteristics; the variations are not uniform across all test problems on a system.
In an effort to avoid test problems which don't run long enough to be measured reliably, it is easy to develop test problems which run for many hours. This is not an economical use of computer resources. Using more sophisticated measurement techniques will permit comparable accuracy with much less elapsed time and much less use of computer resources.
* Error estimates
The naive approach will give no estimate of the repeatability of the measurements. As a physical process, clock readings should be expected to be subject to random variations. Statistical arguments would suggest that the variability in measurements should be observed and reported. At a minimum this would suggest calculation of sample standard deviation. Calculation of confidence levels for measurements is possible and would be reassuring. Neither statistical concept is supportable with the simple measurement methodology.
It is necessary to perform multiple measurements to estimate errors.
* Optimization
Ada permits compilers to generate code which executes some statements in a different order than presented in the source text. Using the naive approach, it is possible that an optimizing compiler might move test problem code outside the calls on the clock function. This is a potential problem for any measurement technique, but design approaches which explicitly consider the problem and try to allow for it are more likely to be successful than the straightforward approach which ignores the issue.
There are many decisions to be made in implementing an Ada compilation system. Enhancing compilation system performance in one area may involve degrading performance in other areas. General areas of trade-off include time versus space, and implementation choices for data handling, error handling, register usage, scheduling, and portability requirements. Each compilation system is built to objectives and requirements which shape its implementation throughout its life, and those objectives may not match the primary concerns of the ACES evaluator's particular project.
Most Ada users are greedy, wanting EVERYTHING: low cost; fast execution time; fast compile time; small generated code size; small compilation system size; small disk utilization; timely corrections to reported problems; good diagnostics; good debuggers; powerful Ada program library system; recompile/MAKE facility; good listings (cross-reference, set/use, storage allocation maps); easy portability to other platforms; robust operations (no compiler/link/library system crashes); easy installation; ... these desires are often in conflict. An implementor may choose an implementation technique which makes one Ada language feature execute slowly in order to make another feature execute quickly. In comparing implementations, such possible trade-offs must be kept in mind. Slow times on tests using one feature might be balanced by good results on tests using features on the "other side" of a trade-off. Below are listed sets of language features where enhancing one may degrade another.
* Time versus space.
* Compile speed versus execution speed.
* Compile speed versus compile-time information (listings and diagnostics).
* Exception raising and handling times versus frame entry/exit overheads.
* Display versus static links for referencing objects of intermediate scope.
* Nonscalar parameters by copy versus by reference.
* Generics by macro versus shared code.
* Time-slicing of equal-priority tasks versus run-till-blocked scheduling (time-slicing may be slower but "nicer").
* Fair versus unfair scheduling at a selective wait.
* Efficiency due to optimization versus complications in reliability, verifiability, and debugging.
* Minimizing interrupt latency versus complexity of run-time system.
* Average versus worst-case exception propagation time.
* Implementing unconstrained discriminant records using the heap may avoid some limitations on the bounds of the discriminant, at some cost in access time.
* Pre-elaboration of units may decrease the elaboration times and save some space, but prevent easy restarting of a program.
* Asynchronous I/O support versus blocking I/O (if all tasks in a program halt while any task performs I/O, then the system uses "blocking" I/O, which is easier and faster to implement, but may give undesirable performance for multitasking applications).
* Efficiency versus maintainability (a run-time system written in Ada may be slower but more maintainable than if it were written in assembler).
* Efficiency versus portability (a run-time system written in Ada may be slower but more portable than if it were written in assembler).
* Run-time efficiency in exception propagation versus diagnostics (maintaining information about where an exception was originally raised may make the system slower and larger, but may produce more helpful diagnostic messages).
* Single user program library versus multiuser library.
* Efficiency versus robust program library.
* Minimizing task switch time versus maximizing register usage (code for a single task which uses many registers may execute faster once scheduled, but may increase the time required to save and restore registers for a task switch).
* Implementation-dependent extensions versus optimization of standard features (a system may provide implementation-dependent extensions for efficient support of interrupts or "monitor" tasks, or it may try to recognize special cases in "general" Ada code; the trade-off involves portability and tuning of applications).
* Filling "holes" in a record with fields which do not naturally and densely pack the record to provide for fast record comparison versus "wasted" overhead if no record comparisons are made.
* Heap management procedures may trade off packing density (storage utilization) versus allocation/deallocation speed.
* Distributed overhead versus cost-per-use. Some features might be implemented by techniques which add a little additional overhead in several areas but permit the feature to be implemented quickly, as opposed to a technique which only adds a larger overhead when the feature is used. This may be true of the following features:
Unconstrained arrays
Exception processing
Recursive subprograms
Because a system does one side of a trade-off slowly does not guarantee that it will do the other side quickly; it is quite possible to do both slowly.
Understanding the trade-offs of the compilation system does not necessarily make the evaluation process easier. However, if a project has a requirement for a fast task switch time, they might understand the reluctance of an implementor to slow down all other processing in order to speed up context switching.
The ACES timing loop code is designed to measure the execution time and code expansion size of Ada code. The timing loop code is designed to satisfy several requirements.
* It produces accurate measurements. It attempts to satisfy predetermined error tolerances and confidence levels.
* Its output shows the variations in the observations, that is, the standard deviation of the measurements.
* Sufficient information will be displayed to permit interested users to compute other standard statistical functions. A user will have sufficient information to compute a z-statistic (refer to statistics textbooks, such as Introduction To the Theory of Statistics by A. Mood and F. Graybill, McGraw-Hill, 1963, where it is discussed in the chapter on the Central-Limit Theorem).
* The timing loop code is portable. Its execution does not require special test equipment, system-dependent calls, or operator interaction. It does not need to be manually tuned to the properties of the target system. In particular, it is not necessary to modify a system parameter based on the speed of the target machine, either making it larger to produce a measurement with sufficient accuracy, or making it smaller to cut down on elapsed time to perform the tests.
* The timing loop code does not require the support of integer types with more than 16 bits of precision.
* The timing loop code does not assume a system clock which is precise to the level of a few machine instructions.
* The timing loop code tests for jitter (small random variations in the clock readings) and tries to compensate for it. The compensation is achieved by requiring each measurement to execute for a minimum elapsed time - long enough so that the number of clock ticks will "average out" the random jitter.
* The timing loop code can be adapted to compilation systems which defer code generation until link time. This can be difficult, because a system which defers code generation will have access to information about the entire Ada program, and can apply optimizations across compilation units. This can circumvent the standard techniques to thwart optimizing compilers, making it hard to ensure that a test problem is actually executed once for every iteration of the timing loop. Such an optimizer can perform detailed analysis to determine that a test problem is actually a no-op.
Calling assembler-coded subprograms should be sufficient to thwart deferred code generators. Assembler routines can perform arbitrary operations including reading and writing blocks of data between memory and I/O devices. A compiler must assume that a called assembler routine could reference and/or modify any externally visible variable. For a compiler to analyze an arbitrary assembler routine and determine correctness-preserving optimizing transformations is unfeasible. An assembler subprogram could branch to an absolute address, presumably in the operating system, whose operations are not available to the Ada compilation system. An arbitrary assembler routine might be self-modifying, and a compiler must assume that it is. Including a call on an assembler routine in the timing loop should be sufficient to thwart unintended optimizations by compilation systems using deferred code generation. A compilation system which does not support interface to assembler-coded subprograms, and does deferred optimization, will make it difficult to collect measurements. It is expected that serious Ada compilation systems for Mission Critical Computer Resource (MCCR) applications will provide the capability to link to assembler code, so that the applications can access unique I/O and processor-dependent features only visible to assembler-coded routines.
* The timing loop code minimizes the use of computing resources where this does not compromise the other requirements.
* When the time to execute a test problem is not converging to stable values, the timing loop does not increase the number of repetitions (or cycles) excessively in the vain hope that it will eventually converge. In particular, the elapsed time for any individual test problem (which does not fall into an infinite loop due to errors in the implementation) should be bounded.
After a test problem has executed for an excessive time and the timing loop regains control, even if the measurements have not converged to the confidence level requested, execution should stop. The variable ZG_GLOB3.EXCESSIVE_TIME controls this cutoff and in the distributed ACES has a value of 30 minutes. This value is adaptable by the user.
After a test problem has executed for a shorter but still substantial time, the timing loop prints a message to the results file. It is reassuring when running the test suite interactively to know that the program has not fallen into an infinite loop.
The error tolerances computed by the software clock vernier determine an error bound on the test problems. The technique for testing these bounds is described in detail in the subsections of Section 6.4. The significant point to remember in this section is that the ACES does try to ensure that the test problems are correct in the sense that the timing measurements satisfy predetermined error tolerances.
By specifying desired tolerances before executing the test problems the ACES fixes a priori goals for measurement accuracy.
The timing loop code measures time by referencing the system clock. Clock readings are subject to the usual statistical variations associated with physical measurements, and can be expected to show random variations. Assuming that the errors are normally distributed, there are statistical techniques to estimate the accuracy of the results achievable in the presence of such random errors.
Some systematic sources of potential errors are not amenable to statistical methods for compensation. The methods used to detect the presence of such sources are discussed in the following subsections.
The timing loop calculates and outputs the standard deviation of the observations, and an indicator if the predetermined confidence level was not achieved. Refer to Section 6.4.2 "The Timing Loops". Timing measurements are subject to errors; the statistical fluctuations in the measurements are an indicator of the variability in the underlying process.
The test problem times being measured are usually much smaller than the granularity of the system clock. The ACES works around this problem by using a "timing loop" which repetitively executes the Ada problem to be measured. The time to execute the timing loop overhead is subtracted out of each measurement.
The ACES uses a "dual loop" approach to measuring test problems. It calculates the time to execute a test problem by executing the test multiple times (to make the total execution time large enough to accurately measure with the relatively coarse clocks available for program use) and subtracts out the overhead introduced by the looping structure. As part of the initialization in each test program (Inittime), the ACES computes the execution time of the NULL control loop. A verification error is reported if the NULL loop time measured in the test problem is not within the range observed in the Pretest.
The ACES minimizes the effect of small random errors by requiring that the test problem execute for more than some minimal time before accepting a measurement. See the discussion of the clock vernier in Section 6.4.1 "Clock Resolution Problems" for details. This minimum time is chosen to be large enough that random errors (jitter) observed during program initialization by Inittime will not contribute more than an additional one percent error to the measurement. The number of iterations performed is printed, labeled "inner loop count".
The magnitude of remaining random errors is statistically estimated by repeated measurements. The timing loop code uses standard statistical techniques (Student's t-test) to determine whether the sample mean will be within a predetermined tolerance level of the population mean with a predetermined confidence level. The percent standard deviation of the measurements is printed, labeled "sigma". A large sigma indicates a measurement where substantial errors may be present. If the timing loop was not able to satisfy the stated confidence level, an error indication (a "#" character) is also output.
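In symbols, a standard form of such a stopping criterion (the exact expression used inside the ACES timing loop may differ in detail) is:

   t(1 - alpha/2, n - 1) * s / sqrt(n)  <=  r * xbar

where n is the number of measurements, xbar their mean, s their sample standard deviation, r the requested relative tolerance, and t(1 - alpha/2, n - 1) the Student's t quantile for the requested confidence level. When the inequality holds, the sample mean is within the stated tolerance of the population mean at the stated confidence.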
Measurement of test problems which have been optimized into NULL statements could produce small non-zero values for a minimum time because of noise in the timing loop. The ACES checks for this possibility before printing a result. If the code expansion size measurement is zero, it will print the time as zero.
If the measured problem time is less than the precomputed NULL loop time, a meaningless negative execution time could be reported for the test problem. This condition is explicitly tested for in the ACES timing loop and appropriate compensating action taken:
* When the difference between the NULL loop and the measured test problem is small (that is, less than the variations observed in the initialization of the NULL timing loop) then the difficulty is assumed to be due to noise in the measurement process and the calculated time is replaced with a zero measurement. Replacing small negative values with zero improves the cosmetic appearance of the output and is necessary because the analysis programs consider negative values to be special codes used to indicate the reasons for problem failures.
* When the difference between the NULL loop and the measured test problem is large, the difficulty is assumed to be more serious. An error code is printed. This condition can (most often) occur when there are contending tasks in the system, which were executing when the NULL loop was initialized but were terminated when the actual test problem was run. It can also occur, as discussed in Section 6.3.2 "Systematic Errors", when the instructions generated to execute the timing loop in the NULL loop are significantly longer than in a test problem.
A careful ACES user should examine all test problems which execute in zero time.
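The decision rule described above can be summarized in a short sketch (hypothetical names; the actual ACES code folds this logic into the timing loop and uses specific negative codes for each failure reason):

   function Problem_Time
     (Measured, Null_Time, Null_Variation : Duration) return Duration
   is
      Difference : constant Duration := Measured - Null_Time;
      Error_Code : constant Duration := -1.0;  -- stands in for the ACES
                                               -- negative error codes
   begin
      if Difference >= 0.0 then
         return Difference;                    -- normal case
      elsif abs Difference <= Null_Variation then
         return 0.0;                           -- small discrepancy: noise
      else
         return Error_Code;                    -- large discrepancy: error
      end if;
   end Problem_Time;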
If the timing loop code generated for the NULL statement takes a different amount of execution time than for a general test problem, the calculated performance estimate for the test problem will be erroneous. Such an error cannot be compensated for by performing additional iterations since it is not a random term which can be "averaged out". It is a systematic error. Listed below are potential causes for such variations.
* Systematic errors would result if different instructions were generated for the timing loop code for measuring the NULL statement than for general test problems.
* An optimizing compiler performing inter-unit analysis might be able to determine that the null timing loop occurring in the zg_init programs is a no-op, and report the loop overhead time as zero. If it did this, then all the non-null ACES timing measurements will be larger than they should be because the compilation system will be executing the loop overhead instructions but the timing loop overhead that is subtracted off will not properly reflect the actual loop overhead.
Many compilation systems do not perform inter-unit analysis because exploiting this information will either introduce compilation dependencies on the body of a unit (making it necessary to recompile units whenever the body changes, even if the specification is not modified) or will require code generation to be deferred until sometime after the initial compilation (for example, it could be done by the linker or by a separate optimizer but might have to be redone if the bodies of units are modified).
The approach of the ACES to minimize this source of systematic errors has been to:
+ Code the timing loop with techniques which make it difficult for optimizing compilers to translate the timing loop into a no-op. If this is successful in thwarting optimization of the timing loop, users will not have to deal with this problem.
+ Have the zg_init and zp_basic programs write error messages if they measure timing loop overheads which are not statistically significantly greater than zero. This will alert users if the condition occurs and they can take appropriate steps to modify the timing loop code before spending much time in collecting performance measurements which may be inaccurate.
+ Have the Single System Analysis (SSA) program report the names of test problems with zero times which were not expected to have zero times - this will alert the users to systems which are performing unexpected optimizations which might invalidate timing measurements.
* Establishment and maintenance of addressability to external packages can introduce unanticipated overhead.
* Some compilers can generate different code for a construct depending on the nesting level where the construction occurred.
* Memory organization in processors can cause the timing loop to take variable amounts of time.
* Pipeline processors exaggerate the effect of statement context on performance.
* On a multiprogramming target system, concurrent users will contend for resources.
* The system clock may not be accurate.
* The code for similar constructions may vary if certain "magic numbers" are crossed.
* The size of the test program can modify performance.
The following sections outline the steps taken in the timing loop code to minimize the impact of these causes.
Systematic errors would result if different instructions were generated for the timing loop for measuring the NULL statement than for general test problems. An optimizing compiler which keeps the timing loop control variables in registers for the NULL statement might not be able to keep them in registers in general. This would result in different instructions for the NULL loop and for the test problem, and would invalidate the timing hypothesis. If this were to happen, complex test problems will be reported as being slower than they really are, because the measured times would include the extra time (relative to the NULL timing loop) to maintain the loop control variables in memory.
To prevent this from happening, the timing loop uses library scope variables. A compiler must assume that library scope variables can be modified by external subprograms, so their values cannot safely be kept in registers across a call when the test problem includes an external subprogram call. The timing loop therefore contains a conditional call on an external subprogram to "break" register allocation. Making the call conditional on external variables should inhibit compile-time optimizers, and arranging for the condition to be false at run time, so that the subprogram is never actually called, keeps the time to execute the NULL loop to a minimum.
Not calling a subprogram will also decrease the variability in execution time of the timing loop code between the NULL loop and "real" test problems on target machines with memory caches. This is because the cache will not be "flushed" by loading the code for the external procedure into the cache on each iteration of the test problem. There could be a discrepancy introduced here if the test code always called a procedure and, depending on the alignment of the test problem and the procedure relative to the cache, calling the procedure forced a cache fault in the NULL loop and not in the test problem (or vice versa). System sensitivity to cache alignments is a real effect (not induced by measurement artifact) which can make it risky to extrapolate timing measurements from small test problems to other usages. Minimizing execution time for the NULL loop improves the measurements in two ways: first (and least important), it reduces the time to execute the NULL loop itself; second, it increases the fraction of elapsed time spent executing the test problem, reducing the relative overhead due to the NULL loop.
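The following sketch, with hypothetical unit and variable names, illustrates the combination of techniques described above (in the ACES itself the control variables are in the package zg_glob3, and on systems with deferred code generation the external routine would be assembler-coded, as discussed earlier):

   package ZG_Glob_Sketch is
      Break_Optimization : Boolean := False;   -- set from outside; never
                                               -- True in normal runs
      Iteration_Count    : Integer := 0;
      procedure External_Break;                -- the role played by an
                                               -- assembler-coded routine
                                               -- on systems with deferred
                                               -- code generation
   end ZG_Glob_Sketch;

   package body ZG_Glob_Sketch is
      procedure External_Break is
      begin
         --  Arbitrary side effect on a library-scope variable; the point
         --  is that the compiler must assume such variables may change.
         Iteration_Count := 0;
      end External_Break;
   end ZG_Glob_Sketch;

   with ZG_Glob_Sketch; use ZG_Glob_Sketch;
   procedure Timing_Loop_Body_Sketch is
   begin
      loop
         --  test problem code would appear here
         if Break_Optimization then
            External_Break;       -- condition is False at run time, so
         end if;                  -- the call costs nothing, but it
                                  -- "breaks" register allocation
         Iteration_Count := Iteration_Count + 1;
         exit when Iteration_Count >= 1_000;
      end loop;
   end Timing_Loop_Body_Sketch;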
An example of a systematic error was encountered with the DEC Ada Version 1.0 system. The base address of a package was required in a register. For some problems, the base address was loaded outside the timing loop, and for others it was loaded on each iteration. The timing loop therefore had a variable number of instructions. The measured time to execute a (short) test problem including the timing loop code was less than the measured time to execute the NULL timing loop initialization code.
Of course, even if the same machine instructions are generated, they may take different times to execute due to machine architecture variations. These sources of variation are discussed later. It should be clear that when the instructions generated for the timing loop code are different, the time to execute the timing loop code can vary.
On target processors which have different branch instruction formats (e.g., long and short branches), the timing loop code execution time for a large test problem will be longer than for the NULL loop. The computed time to execute a large test problem will be reported as longer than it should be because the branch in the timing loop code will now be a long format branch, which makes the timing loop overhead longer than the NULL loop time computed during program initialization. For a large test problem, the difference in execution time due to a long format branch will usually be (relatively) small.
The approach of the ACES to minimize this source of systematic errors has been to exercise care in the design and testing of the timing loop code. The initialization code in the timing loop tests the consistency of the NULL loop time, and large variations in the initialization code may suggest variations in the generated code. The most likely cause of variation is contending processes on the target system, but if contention can be eliminated as a source of variation, then investigation of alternate code may be appropriate. If the size of the NULL loop code varies between different test programs, compiled with the same suppression and optimization options, then the user has strong reason to suspect that different code is being produced. For some test problems on some target machines, anomalous results may still occur. It is hoped that this discussion will alert ACES users to the potential problems in collecting measurements.
Establishment and maintenance of addressability to external packages can introduce unanticipated overhead. As has been seen in decades of experience with the IBM S/360 architecture, the efficiency of compilers in maintaining addressability is a major factor in determining overall performance. The amount of time it takes to execute a high order language statement such as "X := Y;" can vary greatly depending on whether base registers are available for "X" and/or "Y". An optimizing compiler will preserve base registers across several statements. This effect is a reflection of a real and unavoidable aspect of target processors with base register addressing. What appears to a casual programmer to be irrelevant differences between statements occurring in different locations in a program can have major performance impacts. The ACES includes a sample of large test problems to try to observe these effects.
There are many popular machine architectures where this particular complication in obtaining high performance code does not occur. Several test problems reference data from multiple packages to observe possible addressability overheads. In particular, test problems po_pa_pkg_ovrhd_01 through po_pa_pkg_ovrhd_09 are designed to specifically study this issue.
The ACES minimizes this problem's possible effect on the calculation of timing measurements by ensuring that the variables controlling the timing loop are in an external package, zg_glob3, and that the timing loop contains a call on an external procedure, which potentially can modify all register settings.
Some compilers can generate different code for a construct depending on the nesting level where the construction occurred. The timing loop code surrounds each test problem with two levels of loop nesting. A compiler might keep the innermost FOR loop index in a register. Here, the time it takes to enter a FOR loop will depend on the nesting level of the FOR loops, because the nested loops must save and restore the registers for the outer loops. Predictable complications are avoided in the timing loop code by not using FOR loops for timing.
Memory organization in processors can cause the timing loop to take variable amounts of time. The time to execute the timing loop can vary with the alignment of the code in the address space. Where the test problems being measured are of differing sizes, the beginning and end of the timing loop (Startime and Stoptime0) will be in different relative locations. This can cause performance differences. Consider the following examples:
* On a virtual memory system, for two small test problems of the same size, the entire test problem code might be assigned to one memory page in one example and span two memory pages in the other. If a page fault occurs, the execution time will be orders of magnitude slower than when no page fault occurs. Assuming that the test problem is being executed with a sufficient working set size so that enough physical memory is available to keep the test problem loaded, one iteration through the test problem will be sufficient to load all the necessary virtual memory space into real memory. If there is not sufficient real memory, the ACES test problems will measure a paging system while it thrashes, which is neither helpful nor interesting. An application with insufficient memory on a virtual memory system, either on the physical machine or in the allocation of real memory permitted by the target operating system, will run slowly.
The way to fix the problem is well known and independent of the use of Ada: give the program sufficient memory. An Ada compilation system may aggravate the problem by generating programs with large working sets!
The ACES timing loop will compensate for initial gross timing errors, caused by the initial page fault(s) to load the test problem into memory in the following three ways:
+ The inner timing loop will increase the test problem repetition count until the measured execution time for the test problem is greater than the value of "MIN_JITTER_COMPENSATION". On most target systems, this time will be greater than the time to execute a small test problem including processing a page fault, so the normal timing loop control mechanism will be to increase the repetition count until the execution time for the test problem is sufficiently large.
+ Stoptime0 will reset the inner timing loop whenever a new measurement is much less than or much greater than the execution time of the current best estimate. When an initial execution is much longer than subsequent measurements due to processing page faults, the later measurements will observe a much faster apparent execution time. This will trigger an increase of the repetition count and a re-entry into the inner timing loop.
+ The outer timing loop reports the minimum time over several samples, which is the value used for later analysis. An initial measurement which is slower than prior readings (but not so much slower as to trigger the reset of the inner timing loop) will be discarded. If the variation in measurements when the first execution of the test problem includes page fault processing is small enough so that the measurements were within the requested confidence level anyway, the errors introduced by processing the memory faults were a small percentage of the total problem execution time.
The interested reader should refer to Section 6.5 "HOW TEST PROBLEMS ARE MEASURED", for more discussion of the timing loop code.
Even when page faults do not occur, some virtual memory systems may incur some additional overheads in accessing memory pages, such as keeping track of the least recently used page.
* Cache memory organizations can introduce variability. Cache memory organizations typically partition physical memory into fixed size blocks and define a simple memory mapping from "main" memory to cache frames, using high order memory address bits. Several "main" memory blocks will map to the same cache frame, and the cache management hardware tracks the current contents of the frame and will load the "main" memory block into the cache frame as memory references require.
In a worst case, the head and tail of the timing loop might be allocated to "main" memory locations which map into the same cache frame, forcing a cache fault on each iteration. Similarly, the test problem being measured may call procedures or make data references which force cache faults. In a best case, one pass through the timing loop code and test problem will load all the instructions into the cache where they will remain for subsequent iterations of the timing loop, permitting all instruction fetches to be made from the cache with no main memory accesses.
The time to execute the timing loop instructions may sometimes include cache faults and sometimes avoid cache faults, depending on the contents of the instructions enclosed by the timing loop and the memory alignment of the timing loop code (with respect to their mapping into cache frames). On a large target system, there may be sufficient cache memory to hold an entire test problem (or even the entire test program of which it is a part). When the ACES testing is performed on an unloaded system and typical applications normally execute in a multiprogramming environment, the ACES test results may overestimate the speed achievable by the processor.
Designing programs to minimize cache faulting behavior is rather difficult. It is typical practice to assume that the cache fault rate will be roughly constant. It also helps analysis that the time penalty for a cache fault is much smaller than for a paging fault in a virtual memory system, making the payoff from adapting code to minimize cache faults much less than for paging systems. A simple way to exploit a cache memory system is to write small programs, which fit entirely within the high speed cache. The ACES includes both large and small test problems.
* Analyzing the performance of processors which prefetch instructions is more complex than for processors which do not. When an instruction is ready for execution (prefetched) as soon as the flow of control is ready to execute it, the processor will execute it faster. This effect can make the time to execute a linear sequence of statements appear faster than expected.
For example, the statements "X1 := Y1; X2 := Y2;" can take less than twice the time to execute as "X1 := Y1;" alone. This effect is explicitly examined in the ACES in test problems ar_io_integer_oper_04, st_ov_ovrhd_02, and st_ov_ovrhd_01, which contain one, two, and three integer assignment statements (respectively) using distinct variables. If the execution times for these problems form a linear series, there is no "prefetch" effect.
Instruction prefetching can make the execution time for the instructions at the beginning and end of the timing loop vary depending on the code being enclosed. If the flow of control sequentially "falls into" the instructions forming the end of the timing loop, the instructions will have been prefetched and the processor will not have to wait before executing them. On the other hand, if the flow of control has jumped to the instructions at the end of the timing loop, the instructions may not have been prefetched and the processor may have to wait for them. On some prefetching processors, the outcome of a branch may be predicted and instructions fetched ahead of need. This is straightforward (although requiring more expensive hardware) for unconditional branch and branch-and-link instructions. An example: the last statement in the test problem is a procedure call, so that the last instruction executed in the test problem is a RETURN; or the test problem terminates with something like:
IF boolean_function_returning_true THEN
proc0;
ELSE
RAISE program_error;
END IF;
where the statement will take the THEN alternative and then execute a machine level jump instruction to get to the end of the IF statement. Prefetching could cause the time to execute the NULL loop to be different from the time to execute the timing loop code in a test problem.
The ACES minimizes the potential for systematic errors due to prefetching by including a conditional subprogram call within the timing loop (in Stoptime0) which will force a break in the prefetching and preclude prefetching of the decrement and test code at the end of the timing loop.
Because the NULL loop sequentially "falls into" the end of the timing loop, subtracting off this NULL loop time may add a systematic error (of one prefetch setup time) for test problems which branch into the end of the timing loop code on prefetching processors. Some prefetching processors include logic which will anticipate branching instructions and will fetch the instructions beginning a subprogram, and some will even prefetch both alternatives of a conditional branch. On such processors the coding of the timing loop will not suppress prefetching, but the actual measurements returned will be accurate since the instructions will be prefetched anyway. The specific conditions under which any particular processor prefetches instructions can be very complex. For complex processors, the results of short test problems should not be given too much emphasis, although they will still be useful for code expansion measurements and for some comparisons to related problems.
* Memory bank contention can introduce delays. High speed memory systems are designed to process several memory requests concurrently. This supports I/O devices with Direct Memory Access (DMA) connections. High performance CPUs also exploit concurrent memory access by having several memory requests outstanding simultaneously (e.g., prefetching several sequential instructions and making a data reference). Such memory systems do not work at full speed when requests to the same bank of memory are made. Although true multiport memory chips exist, they are much more expensive than standard memory chips and are not ordinarily used for main memory systems.
On multi-bank systems, the alignment of instruction and data references can introduce delays in completing memory references.
Many simple processors do not have a memory bank organization, and this complication in performance prediction does not occur.
* To save costs, a memory system may contain memory of various speeds. Code executed in slow memory will not run as fast as when it is loaded into higher speed memory. On embedded targets, Read Only Memory (ROM) may have a different access speed than RAM.
The ACES approach to this potential problem is to acknowledge that it may exist. ACES users need to be aware of the characteristics of the processors they are testing. If the NULL timing loop is executed out of high speed memory and some test problem is executed out of slow speed memory, then the measurements will contain an error term reflecting the difference in execution time between executing the NULL loop in different speed memories. The ACES CA tool will detect this difference for small test problems because they will be reported as faster than expected (perhaps zero time) when the NULL loop is executed from slow memory and the test problem is in fast memory, or they will be reported as much slower than expected when the assignments to slow and fast memory are reversed.
* Alignment in memory can impact performance, and different alignments of the timing loop code in the NULL loop and in each test problem can introduce systematic measurement errors on some processors. For example, some memory systems respond with fixed width aligned blocks, such as 64 bit wide chunks which always have a beginning byte address which is 0 mod 8. A memory request which is not aligned with the memory system will take multiple memory cycles to service. For example, fetching a 64 bit word which is not aligned on a 0 mod 8 byte boundary will be slower than when it is aligned. When the instructions forming the timing loop have different memory alignments, then a systematic error will be introduced.
The alignment effect discussed here is relative to instructions rather than data references, since the references to the variables which control the timing loop in the NULL loop and in general test problems will be to the same library scope variables, whose alignment will not change.
The ACES approach to this problem is to test as part of zg_init that the NULL loop has reliable times when replicated four times. It is hoped that the timing loop will not have the same alignment by chance. Because each target processor may be sensitive to different alignments, it is not possible to systematically explore all variations which might be important to some particular target processors. The variations between four NULL loops are indicative of the variations which may occur in the code. The variation between measurements of the NULL loop time is printed (differences between test problem measurements less than this value should not be considered significant) and this variation is used to calculate the fastest execution time where a variation of this magnitude will be less than the requested error tolerance.
The variation in NULL loop measurements is also used to detect probable faults in measurements. If the elapsed time to execute a test problem is less than for the NULL loop, a negative value would be calculated for the test problem execution time; if the absolute value of the estimated negative execution time is less than the variation in the NULL loop, the result is reported as a zero; if it is greater than the variation in the NULL loop, it is reported as erroneous.
If the initialization routines do not produce consistent times, the ACES user will be told that there is a problem. Users may have to accept that when the NULL loop measurement is not consistently repeatable, all calculated timing loop measurements reflect the error associated with the computation of the NULL loop time (as displayed each time the test program is executed). The relative error introduced by considering this effect can be large for short test problems.
Systematic errors in the execution-time measurements can result when the timing loop executes differently in the NULL loop than for general test problems. This source of potential errors has been minimized in the ACES by careful design and testing of the timing loop code. For some targets, the results may not be completely successful. It is hoped that this discussion will alert ACES users about the potential of systematic errors in the ACES measurements they obtain. The ACES includes related test problems to detect the presence of systematic errors not compensated for.
The fact that a test problem can take a different amount of execution time depending on the memory address it is loaded into, or the nature of the Ada statements preceding it, is a direct result of the memory organization of some processors. This fact complicates the interpretation of results and essentially precludes the naive extrapolation of short test problems to overall system evaluations.
The ACES approach of including both small and large test problems will help in making fair assessments.
Pipeline processors exaggerate the effect of statement context on performance. A pipeline processor can have several instructions at different stages of execution at any one time: fetching the instruction; performing address computations; fetching the data operands; performing the arithmetic/logical operation; and storing results. The time to execute any particular instruction will vary depending on the state of prior instructions in the pipeline. For example, when one instruction references the result of a prior instruction, the processor must wait until the result is available. This condition is called a pipeline interlock. Some current designs of pipeline processors have no provisions for hardware interlocks and require the programmer (that is, the high level language compilers) to ensure that the executed code is hazard-free.
The performance of a pipeline processor may be sensitive to the mix of operations performed. There may be several dedicated functional units, as in machines derived from the CDC 6600 series processors. Because an instruction will have to wait for an available logical processing unit, performing a sequence DIVIDE, DIVIDE, ADD, ADD may be much slower than the sequence DIVIDE, ADD, ADD, DIVIDE (assuming that data dependencies between the instructions permit the reorganization). Short test problems, which execute only a few instructions before they branch, will underestimate the performance benefit of pipeline processors, since they present the pipeline with worst case behavior.
Test problems which execute a long sequence of instructions between branches will execute well on pipeline processors.
On a pipeline processor with a "loop mode", after the first time the loop is executed, the CPU will retrieve instructions from the pipeline and will not access memory to fetch instructions until the loop is exited. This can improve the execution time of many loops dramatically, since high speed CPUs tend to be limited by the speed of memory references. A statement like "SUM:=SUM+X(I);" can be repeatedly timed to take "n" microseconds, but if it is the only statement in the body of a loop, it might well take much less than the time for a NULL loop plus "n" times the number of iterations of the loop.
A pipeline processor presents potential problems to programmers constructing benchmarks, to readers of benchmark results, and to programmers attempting to obtain optimal performance on the processor. They make the execution time of a sequence of machine instructions depend strongly on the surrounding instructions. This fact must be remembered when considering the reported result from any test problem. For short test problems, the pipeline will never be fully "primed" and the reported result will be a pessimistic estimate of overall performance. Linearly extrapolating from the speed of a simple statement measured as a test problem to executing the same statement within a tight loop can dramatically underestimate the speed of a pipeline processor. When a tight loop contains an external procedure call (not suitable for inline expansion), performing the CALL (or Branch And Link, or equivalent) instruction and returning control will force two "breaks" in sequential instruction prefetching, which would not occur if the test problem did not include any branching instructions. A test problem for two or three simple assignments in a row may take much less than two or three times the execution time of a single assignment. When a test problem contains a loop, the code for the test problem might be small enough to fit within the "loop mode" of the pipeline processor and require no references to main memory for instruction references. Such a "loop mode" could run much faster than a simple linear extrapolation would predict.
When examining measurements on pipeline processors with loop modes, the user should remember that small test problems containing simple loops should run much faster than problems without loops. This is an expected consequence of the target machine. Users must understand this aspect of the target machine when they use the ACES test results to extrapolate the performance of other programs. The vendors of such machines typically include advice on how best to exploit them. General advice with respect to the machine properties should be verified with the Ada compiler being used, since the compiler may introduce overheads which preclude exploiting the features of the hardware in Ada in the same way that they can be exploited in other languages. For example, the compiler might perform all array aggregate operations with calls on routines in the RTS, which precludes using loop mode operations, whereas an explicit FOR loop structure may result in much faster execution.
The ACES has attempted to minimize systematic measurement errors in the timing loop by using a measurement process which will work for pipeline and nonpipeline processors. For the individual test problems, the effects of pipeline processors should not be viewed as a source of systematic errors which must be minimized, but as an unavoidable aspect of some target processors which application programmers will have to deal with. It is not simply a "feature" which hardware designers put into processors to complicate the life of benchmark developers. The ACES will execute the test problems as written on each processor it is run on. It does not claim to be optimal code for each target processor.
It is an oversimplification to assume that each language feature will have a fixed execution time and that a user can estimate the total time for a procedure to execute by adding up the time for each feature multiplied by the number of executions of the statement. Optimizing compilers falsify this assumption even on simple processors, moving operations out of loops, removing dead assignments, evaluating foldable expressions at compile time, reusing common subexpressions, and applying all the techniques available to them to improve generated code. Pipeline processors falsify the assumption for any sequence of machine instructions. Prefetching processors, to a lesser extent than pipeline ones, also falsify the assumption for arbitrary sequences of machine instructions. The ACES test suite contains both large and small test problems. For some target processors, it will be unsafe to extrapolate from short test problems. Many "real" applications contain only short sequences of code between branches, so it could be that the short problems predict "real" performance better than the "optimal" performance that the target is capable of.
How fast a pipeline processor executes a particular application is influenced by how much branching (either due to subprogram calling or to conditional statements) the application does and by the patterns of memory references. The results from the ACES test problems relative to loop unrolling are of particular interest on pipeline processors, since this is an optimization technique which can make a large difference. If unrolling loops prevents a "loop mode" operation, it can degrade performance; otherwise it might enhance performance considerably. Loop unrolling is the expansion at compile time of a loop into the equivalent sequence of statements.
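As an illustration only (this is not one of the ACES test problems), the second fragment below is the kind of code an unrolling optimizer might produce from the first:

   procedure Unrolling_Sketch is
      type Vector is array (1 .. 8) of Integer;
      A, B : Vector := (others => 0);
   begin
      --  Rolled form: one addition and one branch per element.
      for I in Vector'Range loop
         A (I) := B (I) + 1;
      end loop;

      --  Unrolled by a factor of two: half as many branches; an odd
      --  trip count would need a separate clean-up iteration.
      declare
         I : Integer := Vector'First;
      begin
         while I < Vector'Last loop
            A (I)     := B (I) + 1;
            A (I + 1) := B (I + 1) + 1;
            I := I + 2;
         end loop;
      end;
   end Unrolling_Sketch;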
The ACES approach to minimizing systematic errors due to execution on pipeline processors is to ensure that the execution time of the timing loop code in the NULL loop will be comparable to the execution time for general test problems. Including an external procedure call in the timing loop will introduce a "break" in the pipeline.
On a multiprogramming target system, concurrent users will contend for resources. On such a system, the time between the beginning and end of a test problem is not necessarily all spent executing that problem. The processor may have been diverted to execute another, higher priority process. System services may pre-empt system resources; for example, servicing I/O completion interrupts is usually a higher priority operation than any user program.
There are two things an ACES user can do to minimize the impact of contending users. First, the user can arrange for exclusive use. This is not typical of usage of multiprogramming systems, but the primary audience for the ACES is MCCR applications, many of which are executed in a stand-alone environment. A small amount of contention can be compensated for by the design of the timing loop, which uses multiple cycles, and by the use of minimum time for analysis. Condense will extract the minimum time among the repetitions of the test problem for analysis by CA. If contention occurs in short bursts, the effects will be limited to the cycle where the burst occurred and it will always add time to the measurement. Because the timing loop synchronizes with the system clock, if the interference from contending tasks stops before the next system clock tick, it will not contaminate the next measurement cycle. Taking the minimum measurement for timing analysis is comparable to assuming that the minimum measurement does not have spurious additions.
Second, the user may adapt the ACES to use CPU time measurements. For most problems, the difference between CPU time and elapsed time on a dedicated processor should be small. Exceptions include:
* Tasking tests - On multiprocessors, parallel processing can produce smaller elapsed time than CPU time. On both uniprocessors and multiprocessors, the accounting system which "charges" time to processes may treat task scheduling as an overhead operation.
* I/O test problems - The CPU may be idle, waiting for the completion of an I/O operation.
* Delay test problems.
Here, CPU and elapsed time are not comparable. In general, CPU time measurements require the use of system-dependent routines, which are not universally available. Most multiprogramming operating systems which maintain CPU time measurements do so for accounting purposes and the accuracy demanded for this task is not great. Few systems accurately allocate the CPU time during interrupt processing to the job which was responsible for the interrupt. The additional overhead required to proportion this time correctly is substantial, the differences in billing would be very small, and it is not always clear what job should be charged (consider servicing the system clock interrupts).
The approach of the ACES to minimize this source of systematic errors is to instruct users to try to execute the test problems when there are not concurrent users on the system, or to use CPU time measurements. The timing loop code contains checks for consistency of measurements and will indicate failure to converge. When contending users are executing on the target, the measurements will often be flagged as unreliable.
The system clock may not be accurate. There are tests for clock accuracy in the test suite. The "zp_tcal1.ada" and "zp_tcal2.ada" programs test for long term drift in the clock. If these programs show that the clock is inaccurate, the ACES users should have the problem corrected before proceeding. The timing loop initialization code, Inittime, checks for jitter (random variation in the clock) and compensates for it by ensuring that each test problem will execute for a long enough time that the relative error attributable to jitter will be less than the tolerable relative error.
The code for similar constructions may vary if certain "magic numbers" are crossed. For example, on a DEC VAX, the length of an instruction can vary depending on how many bits are required to reference a data object or to encode a literal value. Long format instructions can be slower than short format instructions. If relative jump instructions with variable-length operands are used, the length of the jump instruction that returns to the head of the timing loop can vary with the size of the test problem. Small test problems, including the NULL loop, may be able to use short format instructions, while larger test problems may require longer formats. Therefore, large test problems may be incorrectly reported as slower by the difference between the execution times of the two jump formats.
The ACES approach to this potential problem is to assume that the systematic errors will be small relative to the requested error tolerances (systematic errors here referring to those introduced when the timing loop overhead for NULL loop is different from the overhead of the timing loop for a general test problem which is large enough to force a long format branch). Since large test problems are often long running, the difference in NULL loop time will often be a small fraction of the test problem time. In the test problems themselves, compilation systems will be presented with specific code sequences, and the compilers will either generate "short format" instructions or not. Comparing times of related test problems can reveal the presence of special formats.
The size of the test program can modify performance. Some compilers do not optimize large programs; some forms of analysis might take excessive time on large programs and are not attempted, so the same expressions produce better code in small blocks than in large blocks. This might result in the timing loop code for the NULL loop being optimized while the timing loop code for the test problem being measured is not, introducing a systematic error into the measurements.
ACES test problems mc_cc_consis_ck_time_loop_01 through mc_cc_consis_ck_time_loop_05 all test the same statement for consistency. If the performance of a system degrades on large programs, examination of these results might disclose this fact, although it is quite possible that some compilers will have "cutoffs" which are too large to be detected by these test problems.
The test suite includes procedures designed to expose systematic errors. Typically the tests are variations on a problem, so that the system's performance under various conditions may be observed, and the source of the error identified.
One way is to present test problems which repeat the same construction once, twice, and multiple times. If the measured times are not multiples of the single occurrence, a measurement anomaly is present. Discrepancies might be due to cache organization or memory alignment. They might be due to different code being generated for the timing loop itself. They might be due to possible sharing of setup code between repetitions of the constructions. They might be due to better flow through prefetching. The detailed explanation for such discrepancies will require examination of the machine code and the target hardware. While the ACES does not perform this search automatically, the presence of timing anomalies can be detected by comparing the measurements between sets of related test problems. For example:
* su_se_external_01 and op_de_dead_12. These test problems are, respectively, one procedure call and ten calls on the same procedure.
* ar_io_integer_oper_04, st_ov_ovrhd_02, and st_ov_overhd_01. These test problems are, respectively, one, two, and three simple assignment statements (using different variables). The Single System Analysis (SSA) looks at these results in "Sequence of Assignments" in the Language Feature section of the main report.
It would be easier for benchmark developers if processors were simpler to analyze, but the ultimate users of compilers derive considerable benefit from the non-simple aspects of processors, and it is the responsibility of benchmark developers to develop the capability to evaluate real processors.
The ACES timing loop code consists of two basic phases. The first phase is a search for the appropriate number of times to iterate the test problem to produce a reliable measurement. The second phase repetitively executes the test problem (using the iteration count from the first phase) until consistent measurements are obtained. There are refinements added to these basic phases to enhance accuracy and to prevent any one test problem from executing for an "excessive" time.
On most target systems, the resolution of the system clock is such that many instructions can be executed between changes in the clock. This observation leads to two conclusions.
First, the test problems need to be iteratively executed to ensure that they will consume sufficient elapsed time to permit reliable measurements to be made. This is a standard benchmarking technique. It involves dividing the elapsed time measurement by the number of iterations and subtracting off a precomputed NULL loop time.
Second, where the system clock is accurately (but coarsely) maintained, a software clock vernier can be constructed which permits high resolution measurements to be made. No measurement may be more accurate than its standard. However, the accuracy of the standard is not the same as the precision of the measurement process which compares measurements to the standard. On many IBM S/360 computer systems, the time measurement standard is based on the 60 Hz power line frequency. This is considerably more accurate over a small interval than the 16.7 millisecond resolution of the clock would indicate, but not as accurate as the 26 microsecond resolution of the value returned by a supervisor call requesting the time.
Accurate measurements can be performed whenever there is an accurate hardware clock with minimal jitter and a short program loop whose time is accurately known that can detect when the clock is incremented.
The timing loop code proceeds as outlined in Figure 6-1, where the major vertical lines indicate a time when the clock changes value: that is, a "clock tick".
1. Synchronize the beginning measurement with the system clock by looping until the clock changes. This is shown at time S1 of Figure 6-1.
2. Execute the test problem starting at time S1 and ending at time S2 of Figure 6-1.
3. Loop, reading the clock until it changes, counting the number of iterations this takes. This is shown at S3 of Figure 6-1.
The time measurement desired is then S2 - S1. It is computed as (S3 - S1) - NUMBER_OF_ITERATIONS_COUNTED * TIME_PER_TICK. The vernier loop reads the clock, compares the value to the prior reading to see whether it has changed, and increments a counter when there was no change. TIME_PER_TICK, the time required for one iteration of this vernier loop, is computed during initialization.
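As an illustration (the numbers are assumed for this example only), suppose TIME_PER_TICK has been measured during initialization as 10 microseconds and the vernier loop counts 320 reads of the clock before the tick at S3. Then approximately 320 * 10 = 3200 microseconds of the interval S3 - S1 elapsed after the test problem completed, and subtracting 3.2 milliseconds from S3 - S1 recovers an estimate of S2 - S1.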
The principles of a software clock vernier have been published before by both Wichmann (ALGOL60 Compilation and Assessment, 1973) and by Koster (Low Level Self-Measurement in Computers, 1969).
Some systems overlap the display of output results and the continuation of normal processing. On these systems, when output is completed, an interrupt will be generated which must be serviced to complete all the output operations. In addition, while output is in progress, it will steal memory cycles. To ensure valid results on systems which do this, it is important to ensure that processing associated with output operations is completed before starting the next measurement. This potential source of systematic errors is minimized by including a one second DELAY, sufficient to permit I/O completion, before starting each measurement. This is done in Startime.
The time between ticks of the system clock is typically many instruction execution times; a one millisecond interval between clock ticks corresponds to a thousand instructions on a one million instructions per second (MIPS) machine. To obtain reliable estimates of test problem execution times, the ACES timing loop executes each test problem repetitively within a loop and subtracts out an estimate of the NULL loop overhead. The number of iterations used for each test problem is determined dynamically. This permits execution on target processors of different speeds without having to modify the source text or interactively interrogate an operator for a proper value of the iteration count. While the precise value of the iteration count is not critical, too large an iteration count will take excessive computer resources and elapsed time to collect measurements, while too small a count can produce inaccurate measurements.
On some systems the granularity of the clock will be small enough, and/or the time to execute a call on the function CALENDAR.CLOCK will be large enough, that TIME_PER_TICK will be greater than the clock granularity. On these systems the clock will return a different value each time it is read, and the vernier compensation serves no purpose. On such target systems the software clock vernier will not add precision to the measurement process; it will actually introduce a small error term into the measurements, because the timing loop must execute the vernier compensation code (basically the time to perform a call on CALENDAR.CLOCK and a comparison of two values of type TIME), which is added into the NULL loop compensation time and subtracted out of each measurement. On such systems the timing loop code could be simplified by removing the clock vernier loop. The ACES does not distribute a version of the timing loop code without a software clock vernier because:
* Using a software clock vernier on target systems where TIME_PER_TICK is greater than the clock granularity does little harm.
* Not using a software clock vernier on target systems where TIME_PER_TICK is smaller than the clock granularity will make the measurements take longer by requiring more iterations to obtain consistent measurements than otherwise necessary, or it will result in accepting less precise measurement results.
* Most target systems explored during the development of the ACES have a TIME_PER_TICK which is much smaller than the clock granularity. Using a software clock vernier is usually beneficial.
* The use of the ACES would be complicated if it required each user to select variations of the timing loop based on the relationship between TIME_PER_TICK and the clock granularity. Interested users could make this system adaptation if they desire.
The remainder of this section describes the details of the timing loop code. This is provided here for reference, so that an ACES user can understand what the numbers printed after executing an ACES test program mean:
* Size - code expansion size in bits, output as a floating point number.
* Min and mean execution times - minimum and mean execution times for the test problems, output in microseconds.
* Inner loop and outer loop count - timing loop variables that are displayed.
* Sigma - standard deviation as a percentage of the mean, printed in the column headed by the title "sigma".
* Unreliable indicator - the "#" occurring after the "%" symbol for some entries in the output indicates that the measurement is not considered reliable based on Student's t-test.
An ACES user not concerned about the meaning of these fields can skip the rest of this section.
The timing loop code requires some initialization before it can be used to measure test problems. The NULL loop time is computed in the initialization code, as is the value of "TIME_PER_TICK". The jitter compensation time calculation proceeds in two steps. The initialization code repetitively synchronizes with the system clock and then counts the number of times it can read the system clock until the clock changes. If no jitter exists in the system clock, this count will be constant (barring quantization errors). The standard deviation of the count of ticks is a measure of the system jitter. The minimum value of the elapsed time for a test problem measurement is then the maximum of:
* The standard deviation compensation. A minimum elapsed time is computed such that the observed standard deviation is less than the requested error tolerance with the requested confidence level.
* The quantization compensation. A minimum elapsed time is computed such that an error of one TIME_PER_TICK interval will be less than the requested error tolerance with the requested confidence level.
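As an illustration of the quantization criterion (values assumed only for this example, and ignoring the confidence level adjustment for simplicity): if TIME_PER_TICK is 10 microseconds and the requested relative error tolerance is one percent, each measurement must span at least 10 / 0.01 = 1000 microseconds of elapsed time so that an error of one TIME_PER_TICK interval remains below the tolerance.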
The code for the ACES timing process contains two loops, an inner loop and an outer loop. The inner code loop, as outlined in the clock vernier discussion, will (a sketch in Ada follows the list):
* Synchronize with the system clock.
* Execute the test problem NCOUNT times. NCOUNT is an integer variable whose value is controlled by the outer code loop. Actually, NCOUNT is represented by two integers, ZG_GLOB3.NCOUNT_UPPER and ZG_GLOB3.NCOUNT_LOWER, so that the ACES does not require support for integers larger than 16 bits in order to run the timing loop. The actual value is NCOUNT_UPPER * 2**14 plus NCOUNT_LOWER.
* Re-synchronize with the system clock, counting loops as discussed in the vernier compensation discussion.
* Compute an estimate for the test problem execution time, based on NULL loop time and the vernier compensation term.
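The following self-contained sketch illustrates the shape of this inner loop. It is a simplification: the literal constants, the single-integer NCOUNT, and the procedure names are illustrative stand-ins, not the actual ZG_GLOB3/ZG_GLOB4 declarations, and the real code obtains TIME_PER_TICK, the NULL loop time, and the iteration count from the Pretest and Phase 1 logic rather than from literals.

with Ada.Calendar; use Ada.Calendar;
with Ada.Text_IO;
procedure Inner_Loop_Sketch is
   -- Illustrative values only; in the ACES these come from the Pretest
   -- initialization code and ZG_GLOB3.  NCOUNT is shown as a single integer;
   -- the real code composes it as NCOUNT_UPPER * 2**14 + NCOUNT_LOWER.
   Time_Per_Tick  : constant Duration := 0.000_010; -- time for one vernier read (assumed)
   Null_Loop_Time : constant Duration := 0.000_001; -- per-iteration loop overhead (assumed)
   NCount         : constant Integer  := 1_000;     -- iteration count from the Phase 1 search
   S1, T, Base    : Time;
   Vernier_Count  : Integer := 0;
   Per_Iteration  : Duration;
   procedure Test_Problem is
   begin
      null;  -- the test problem being measured would go here
   end Test_Problem;
begin
   -- 1. Synchronize with the system clock: spin until the clock changes (time S1).
   S1 := Clock;
   loop
      T := Clock;
      exit when T /= S1;
   end loop;
   S1 := T;
   -- 2. Execute the test problem NCOUNT times (S1 to S2 of Figure 6-1).
   for I in 1 .. NCount loop
      Test_Problem;
   end loop;
   -- 3. Re-synchronize with the clock, counting the reads made before the
   --    next tick (the software clock vernier), which ends at S3 of Figure 6-1.
   Base := Clock;
   loop
      T := Clock;
      exit when T /= Base;
      Vernier_Count := Vernier_Count + 1;
   end loop;
   -- Estimated time per execution: (S3 - S1) minus the vernier correction,
   -- divided by NCOUNT, minus the per-iteration NULL loop overhead.
   Per_Iteration := ((T - S1) - Vernier_Count * Time_Per_Tick) / NCount
                    - Null_Loop_Time;
   Ada.Text_IO.Put_Line (Duration'Image (Per_Iteration));
end Inner_Loop_Sketch;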
The outer code loop implements the two-step jitter compensation calculation by executing and examining the results of the inner loop.
The first phase of the timing process is a search for the appropriate value of NCOUNT, that is, the smallest value of NCOUNT large enough that the measurement will provide a reliable estimate. The search starts with NCOUNT having a value of one; if the current value of NCOUNT is too small, it is replaced by 2*NCOUNT+1 and the inner loop is executed again. NCOUNT is never made larger than BASIC_ITERATION_COUNT (BASIC_ITERATION_COUNT is also represented as two 16 bit integers, ZG_GLOB3.BASIC_ITERATION_COUNT_UPPER and _LOWER). Without this limitation, for test problems translated into NULLs, no value of NCOUNT would be large enough to result in a reliable measure of problem time, and the Phase 1 timing loop logic would try to increase NCOUNT without limit. It might eventually stop when NCOUNT exceeded SYSTEM.MAX_INT (although when constraint checking is suppressed even that is not guaranteed), but this could represent more elapsed time than is either desirable or necessary to determine that a test problem has been translated into a NULL. The sequence of values of NCOUNT is 1, 3, 7, ... 2**N-1.
The second phase basically uses the value of NCOUNT from the first phase to perform the inner loop code repetitively, testing after each cycle if it should terminate the measurement of the test problem. Cycles are repeated until one of these conditions is met:
* The number of cycles is greater than ZG_GLOB3.MAX_ITERATION_COUNT.
* The number of cycles is greater than ZG_GLOB3.MIN_ITERATION_COUNT and a Student's t-test indicates that, to the requested confidence level in zg_glob3, the timing measurements are being drawn from a sample with the same mean.
* Any individual test problem has executed for more than thirty minutes.
* The CPU time is less than 10% of the elapsed time and the elapsed time is greater than ten seconds (only applicable when measuring CPU time).
* Large variations are observed between cycles, and Phase 2 is restarted with a larger value of NCOUNT.
Measurement of the problem will terminate if the number of cycles is greater than ZG_GLOB3.MAX_ITERATION_COUNT. MAX_ITERATION_COUNT is an integer which limits the number of cycles the timing loop will perform. It can be modified by an interested ACES user; it should not be made smaller than MIN_ITERATION_COUNT or larger than ZG_GLOB3.T_VALUE'LAST - 1. The confidence level test for termination is based on the standard error of the mean, which is defined as the standard deviation of the samples divided by the square root of the number of samples: when the measurements are statistically independent, the standard error will asymptotically approach zero (it is not necessary to make any assumptions about the distribution of the errors, such as being normally distributed).
Measurement of the problem will terminate if the number of cycles is greater than ZG_GLOB3.MIN_ITERATION_COUNT and a Student's t-test indicates that, to the requested confidence level, the timing measurements are being drawn from a sample with the same mean. (Refer to a standard statistical text for a detailed description of the Student's t-test, such as Introduction To the Theory of Statistics, by A. Mood and F. Graybill.) The confidence level is built into zg_glob3. It may be modified by adjusting the named number ZG_GLOB3.TIMER_CONFIDENCE_LEVEL and the contents of the array ZG_GLOB3.T_VALUE, as described in the comments in the source files "zg_glob3.*".
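The following fragment sketches the kind of convergence check described above. The measurement values, the t value, and the tolerance are illustrative assumptions, not the ZG_GLOB3 definitions; the real code works with the cycle measurements accumulated by the timing loop itself.

with Ada.Text_IO;
with Ada.Numerics.Elementary_Functions; use Ada.Numerics.Elementary_Functions;
procedure Convergence_Sketch is
   -- Illustrative cycle measurements (in seconds) and parameters.
   Times     : constant array (1 .. 5) of Float := (1.02, 0.98, 1.01, 0.99, 1.00);
   T_Value   : constant Float := 2.776;  -- Student's t, 4 degrees of freedom, 95% (assumed)
   Tolerance : constant Float := 0.05;   -- requested relative error tolerance (assumed)
   N         : constant Float := Float (Times'Length);
   Mean      : Float := 0.0;
   Sum_Sq    : Float := 0.0;
   Std_Error : Float;
begin
   for I in Times'Range loop
      Mean := Mean + Times (I);
   end loop;
   Mean := Mean / N;
   for I in Times'Range loop
      Sum_Sq := Sum_Sq + (Times (I) - Mean) ** 2;
   end loop;
   -- Standard error of the mean: sample standard deviation / sqrt(number of samples).
   Std_Error := Sqrt (Sum_Sq / (N - 1.0)) / Sqrt (N);
   -- Terminate the measurement when the confidence interval around the mean
   -- is within the requested relative tolerance of the mean itself.
   if T_Value * Std_Error <= Tolerance * Mean then
      Ada.Text_IO.Put_Line ("converged; mean =" & Float'Image (Mean));
   else
      Ada.Text_IO.Put_Line ("not converged; perform another cycle");
   end if;
end Convergence_Sketch;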
When any individual test problem has executed for more than thirty minutes, the test problem is terminated when the outer loop code gets control and a descriptive error message is generated. This test is intended to prevent excessive execution times and should not occur often. An informational message is also written when more than five minutes have passed since a test problem started and the outer code loop gets control.
To prevent any one test problem from taking excessive elapsed time when measuring CPU time, the outer timing loop terminates when the CPU time is less than 10% of the elapsed time and the elapsed time is greater than ten seconds. Without this modification, the timing loop code would execute for days in measuring the CPU time for a "DELAY 1.0;" statement. This modification only affects test problems which are not CPU bound: language features whose use can make a test problem non-CPU bound include DELAY statements; I/O statements where the CPU will be idle until a physical I/O operation completes; perhaps tasking on some target systems where task scheduling is charged to system overhead rather than to each user job; or a system which has heavy contention from other (higher priority) user jobs or system daemons. If this condition is detected, it will be noted on the output and an "unreliable measurement" error code reported.
The outer code loop may be "terminated" when large variations are observed between cycles and Phase 2 restarted with a larger value of NCOUNT. This "restarted" Phase 2 will then be terminated by one of the conditions listed here. Large variations between cycles can be caused by transient conditions, such as when the initial execution of a test problem on a virtual memory system causes a page fault forcing a wait until the code is loaded, or when a system daemon usurps the CPU (to process an interrupt or to provide periodic service to a timed queue).
It is best to avoid taking measurements on a system while it is processing transients. If the timing measurement will converge, it is better to increase the value of NCOUNT and re-enter Phase 1; if the transients go away and the time measurements stabilize, a reliable measurement will be generated. If the behavior does not stabilize, the system should "waste" little execution time in an attempt to produce good measurements. When a test problem takes a variable amount of time for each execution, increasing the number of iterations and dividing by the iteration count will not always converge; increasing NCOUNT in such a case will simply make the ACES run longer (perhaps much longer) trying to measure the test problem. Therefore, the ACES timing loop code only increases NCOUNT in Phase 2 when the current measurement satisfies all of the conditions listed below:
* It is greater than ZG_GLOB3.EXCESSIVE_VARIABILITY or less than 1/2 ZG_GLOB3.EXCESSIVE_VARIABILITY.
* It is less than 1.0 / real(NCOUNT). This (relatively) arbitrary cutoff is perhaps more intuitively obvious when written as:
IF time * real(NCOUNT) > 1.0
THEN ... do not increase NCOUNT ...
END IF;
Time * NCOUNT is the total time taken to execute the test problem. When this total time is greater than one (1) second the time is large enough to produce reliable timing estimates. Estimates based on small values of total problem time will be sensitive to small errors in clock measurements.
* NCOUNT is less than BASIC_ITERATION_COUNT. An informational message is displayed when the variation in measurements between cycles is large, but the cutoff parameters prevent extending the Phase 1 search.
This "termination" and re-entry into Phase 1 serves two purposes. First, using this technique, the ACES timing loop can produce more accurate measurements by ignoring transient behavior. Second, by not assuming that all variations are due to transient noise, it will not spend an inordinate amount of time increasing the iteration count to force a test problem timing measurement to converge when the test problem is inherently variable.
During initial development of the ACES timing loop, the code always re-entered the Phase 1 search whenever the times varied by a factor of two. On one trial system some test problems containing I/O operations executed for over a day without terminating because it always re-entered Phase 1 and increased NCOUNT; I/O performance on this system varied more than the confidence levels were set for.
Each test problem is measured by "plugging" it into a template which will, when executed, measure and report on the execution time and size of the test problem contained within it.
The timing loop code consists of four code files which are incorporated into the source by a preprocessor (Include) which supports text inclusion. The PRAGMA INCLUDE ("Inittime") incorporates a small section of the verification code to ensure that initialization code parameters created during Pretest for the different compilation options are correct. There is also an option to incorporate the initialization code inline for every test program when it is necessary to work around verification problems. The other code files bracket each test problem to be timed and provide a place for writing an identification of the test. They are responsible for computing the execution time and code expansion size of the test problem they enclose, and for outputting the measurements obtained. The general form for all test programs appears in Figure 6-2, Timing Loop Template.
with zg_glob1; use zg_glob1;
with zg_glob2; use zg_glob2;
with zg_glob3; use zg_glob3;
with calendar; use calendar; -- declarations
begin -- initializations can go here
.
.
.
pragma include ("inittime"); -- once per main program
-- initializations can go here too
.
.
.
zg_glob2.put_test_name("..."); -- print name of test problem
pragma include ("startime"); -- first test problem
-- test problem code goes here
...
pragma include ("stoptime0");
zg_glob2.put_comment_optional(".."); -- description goes here
pragma include ("stoptime2");
.
.
.
zg_glob2.put_test_name("..."); -- print name of test problem
pragma include ("startime"); -- second test problem
... -- test problem code goes here
pragma include ("stoptime0");
zg_glob2.put_comment_optional(".."); -- description goes here
pragma include ("stoptime2");
... -- additional test problems enclosed by
-- put_test_name, startime / stoptime0 /
-- stoptime2 follow
end;
In Figure 6-2 the "..." after the PRAGMA INCLUDE("Inittime") would be replaced by initialization code, that after the PRAGMA INCLUDE("Startime") would be replaced by the test problem to be timed, and that after the PRAGMA INCLUDE("Stoptime0") would be replaced by calls on ZG_GLOB2.PUT_COMMENT_OPTIONAL which print a description of the problem. It is appropriate to include lists of related tests and the purpose of the test.
The purpose of each of the included files is as follows:
* Inittime - This code includes the "zg_cpy.*" files that were created during the Pretest and verifies that the timing loop parameters can be used to predict the execution time of the NULL loop.
* Startime - This code is the head of the timing loop.
* Stoptime0 - This code is the tail of the timing loop.
* Stoptime2 - This code outputs the measurements collected on the current test problem.
During the development of ACES Version 1.0 the timing loop was modified to inhibit optimizations on compilation systems performing inter-unit analysis. An optimizing compilation system which performs such inter-unit analysis is potentially able to detect that the NULL timing loop is a no-op (even though it calls on a subprogram in an independent unit) and generate no code for it. On such a compilation system, the times for all ACES test problems (which were not optimized into a null) would be too large: the loop overhead for these test problems would be incurred on each iteration but would not be compensated for, because the NULL loop overhead would have been measured as zero due to the optimization. Note that systems which perform this optimization either create dependencies on the body of a unit or perform flow analysis after the initial compilation of a unit (in a separate optimizing process or as a part of a linking process).
The timing loop was modified in order to inhibit its optimization into a null. An optimizing compilation system as discussed above, which generates a null for the timing loop, would also be able to optimize several of the individual ACES test problems into nulls (or to reduce the total problem time significantly) by recognizing that the procedures ZG_GLOB4.PROC0, ZG_GLOB4.PROC1, ZG_GLOB4.PROC2, ZG_GLOB4.PROC3 and ZG_GLOB4.PROCI1 have NULL bodies and that the calls on them can be optimized away (and by observing that the procedure calls do not modify any variables, permitting wider application of common subexpression elimination, preservation of register contents, and/or loop invariant motion). Therefore, these procedures in ZG_GLOB4 were modified by adding an IF statement with a condition which will always be FALSE, but where this will not be easily determinable, so that the body cannot easily be omitted. Adding this IF statement will usually add about three instructions on a typical target (LOAD; COMPARE; CONDITIONAL_BRANCH). The difference in the number of instructions executed may be larger than this, however, because the addition of the procedure calls in the THEN part can inhibit an otherwise possible simplification of the subprogram linkage convention: a subprogram which contains no calls on other subprograms can avoid the creation of an activation record (this is called "leaf-node" optimization).
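The fragment below sketches the kind of guard described; the package and the flag are illustrative, not the actual ZG_GLOB4 source. Because a compiler performing only per-unit analysis cannot easily prove that the flag is never TRUE, it cannot discard the body of Proc0, and the call in the THEN part inhibits leaf-node simplification of the subprogram linkage.

package Guard_Sketch is
   Never_Set : Boolean := False;  -- never assigned TRUE anywhere in the program
   procedure Proc0;
   procedure Proc1;
end Guard_Sketch;

package body Guard_Sketch is
   procedure Proc1 is
   begin
      null;
   end Proc1;
   procedure Proc0 is
   begin
      if Never_Set then  -- always FALSE at run time, but not easily determinable
         Proc1;
      end if;
   end Proc0;
end Guard_Sketch;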
This change to the procedure in zg_glob4 had a potential effect on the timing measurement for some problems. The affected problems were renamed for ACES Version 1.0. Users making historical comparisons with ACEC Version 3.1 or earlier are urged to consult the Version Description Document for ACES Version 1.0 for more information.
The measurement of code expansion is performed in the timing loop. This can be done in either of two ways: using a label'ADDRESS attribute or a Get Address (zg_getad) function. In both cases, the measurement is computed as the difference between the addresses at the end of "Startime" and the beginning of "Stoptime0". These techniques compute the differences between values of type SYSTEM.ADDRESS. The resulting sizes are multiplied by STORAGE_UNIT to get the code size in bits. Presenting size in bits will permit easy comparison between target machines with different values of word sizes (values of STORAGE_UNIT).
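A minimal sketch of the label'ADDRESS technique follows. It is written against the Ada 95 package System.Storage_Elements so that the address subtraction is portable; the actual Startime/Stoptime0 code, and the alternative zg_getad function, are organized differently.

with System;
with System.Storage_Elements; use System.Storage_Elements;
with Ada.Text_IO;
procedure Size_Sketch is
   Start_Addr, Stop_Addr : System.Address;
   Size_In_Bits          : Storage_Offset;
begin
   <<Start_Mark>> Start_Addr := Start_Mark'Address;
   -- ... the test problem code would go here ...
   <<Stop_Mark>> Stop_Addr := Stop_Mark'Address;
   -- The difference of the two addresses, in storage elements, times the
   -- number of bits per storage element gives the code expansion in bits.
   Size_In_Bits := (Stop_Addr - Start_Addr) * System.Storage_Unit;
   Ada.Text_IO.Put_Line (Storage_Offset'Image (Size_In_Bits));
end Size_Sketch;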
Out-of-line code which supports a construction used in a test problem will not be counted in the code expansion measurement. Run-Time System (RTS) routines will not be measured here. If the compiler calls on run-time library routines to translate a feature, the space measured will be that required to perform the call. This differs from timing measurements which will include the time to perform any called subprogram.
Several issues are associated with correctness, which will be discussed in turn along with the steps taken in the ACES to minimize problems due to them. Some of the issues have been mentioned in other contexts in this document, but deserve explicit discussion. Issues include:
* Validity of test problems.
* Correct translation of test problems.
* Satisfaction of intention.
* A priori error bounds (see Section 6.3).
* Random errors (see Section 6.3.1).
* Systematic errors (see Section 6.3.2).
Considerable care has been taken to ensure that the test problems are valid Ada. The test problems have been attempted on multiple Ada compilation systems.
Some test problems are system dependent. Any test suite which seeks to address all the Ada features of interest to MCCR applications must include tests of system-dependent features. These test problems are noted in the VDD Appendix D, "SYSTEM DEPENDENT TEST PROBLEMS."
Determining the correctness of an Ada implementation is not the charter of the ACES. That is the responsibility of the ACVC project. However, it is misleading and undesirable to credit an implementation for producing wrong answers quickly. Self-checking is included on some test problems where nonobvious errors occurred during testing. If a test problem provokes a compile-time error or aborts during execution, failure is obvious and does not call for self-testing code. Users who detect errors in an Ada implementation when executing the ACES should report them to the compilation system maintainers for correction.
If a difficulty is discovered with a test problem on one particular system, this may be sufficient to justify changes to the test suite, depending on the nature of the difficulty encountered and its impact on portability. The appropriate response to the discovery of errors in a compiler may be to simply report them to implementors for correction.
Test problems are included in the test suite for specific reasons. They may fail to satisfy their intentions for two reasons.
* A problem might be optimized with respect to the timing loop.
* Specific optimization test problems might be susceptible to an unanticipated optimization, resulting in the problem not being a test for the original optimization.
For the timing loop code to work it is necessary that the test problem be executed once for each iteration of the timing loop. It must follow the same computational path on each iteration, or it will not be valid to divide by the number of iterations to compute the individual problem execution time. For measurements on the test problems to realistically represent the behavior of the test problem in a general context, it is necessary to ensure that code is not optimizable with respect to the timing loop. In particular, the test problems should:
* Use live variables - Variables containing the results of a computational sequence in a code fragment must be referenced after being assigned. If they are not, an optimizing compiler might determine that the computations are irrelevant and ignore them.
* Not be loop invariant, in whole or in part - If the code fragment is completely invariant, one execution of it would suffice for all values of the loop iteration count. Even where the code fragment is not completely invariant, it might contain a (sub)expression which is invariant and subject to loop motion, such as "A**2" where "A" is a loop invariant variable.
* Not be strength-reducible with respect to the timing loop.
* Not be unduly foldable - A code fragment which is interesting to measure often requires initialization code to ensure correct and repeatable execution. Any such setup code should not permit the "interesting" code to be evaluated at compile time.
When the flow through a code fragment is determined by the values of variables that the code later modifies, for example
IF first_time_flag = false THEN
initialize_procedure;
first_time_flag := true;
END IF;
...
it is necessary to ensure that the same path is followed on each repetition. If the test problem is following different computational paths on each repetition of the timing loop it is incorrect to compute an average time by simply dividing by the number of repetitions.
Some code fragments will, when executed repetitively, generate a numeric overflow, which will raise an exception and disrupt the flow of control. Consider the statement
I:=I+1;
where "I" is never reset. This will eventually reach the integer overflow limit and raise an exception (if checking is not suppressed). Code fragments with either of these characteristics should be modified. Numeric variables should have the same value at the start of each iteration, either by reinitializing them or by insuring that their new computed value equals their initial value.
If a user determines that a problem does not satisfy its intent on some target, that is sufficient reason to file a problem report. If a compiler is able to find a better way to translate a test problem, that is not a fault of the compiler. It is not necessarily a reason to withdraw the test problem; the performance of various systems on this example may be a good way to expose the use of the unanticipated optimization techniques. If such an optimizing compiler results in the test suite not having any remaining tests which address the originally intended optimization, it may then be desirable to construct and incorporate into the ACES test suite additional test problems which are not amenable to the unanticipated optimization. The comment block of the original test problem would be updated to reflect the optimizations and the new related test problem.
Projects developing code for embedded targets may not have target hardware available or accessible when they perform an evaluation. If they have access to a software simulator, they may be able to run the ACES, although some adaptation may be necessary to save testing time. The ACES user must determine how the simulator treats time. If programs referencing CALENDAR.CLOCK on the simulator return the actual time-of-day, then the ACES timing loop will provide measurements of the execution speed of the simulator. The speed of the simulator will rarely be of much interest; what is usually of primary concern is what the execution speed would be on the target.
Some simulators provide estimates for what the eventual target speed would be (based on information on the speed of individual instructions). These estimates of "simulated" time may be accessible to programs running on the simulator. It would be easiest for ACES use if the "simulated" time were reflected in the values returned by CALENDAR.CLOCK (then the ACES might be run without modification). If "simulated" time is accessible from some other call, the ACES must be modified to use it. Because it would be likely in the case of a simulator that the measured CPU time would be more than the elapsed time, the simplest way may be to use the ACES CPU time option with the function returning the "simulated" time replacing the CPU time function, and also removing the CPU time cutoff code in ZG_GLOB3.TERMINATE_TIMING_LOOP and ZG_GLOB3.STOPTIME2.
If it is possible to measure simulated time, it would be appropriate to adapt various values in the ACES to minimize the testing time. In particular, the values of BASIC_ITERATION_COUNT_LOWER and BASIC_ITERATION_COUNT_UPPER should be set to small values (note that the program zp_basic determines these, but as distributed this test will itself probably take an excessive time to run). MAX_ITERATION_COUNT should also be set to a smaller value, both because simulated times should be more reliable and to limit the execution time. It is common for simulators to reflect the timing of the hardware only imprecisely: precisely modeling all the hardware interlocks and memory conflicts possible on a modern processor adds overhead, so some simulators simplify their timing computations in order to run faster. It might also be appropriate to attempt only a subset of ACES tests to reduce the testing time.
If it is not possible to measure simulated time, it might still be useful to run some of the ACES performance test problems, not to collect timing data but simply to observe whether failures occur. This may be particularly helpful for the Implicit Storage Reclamation subgroup of tests (although, if successful they will take a long time to complete on a simulator).
This section of the guide presents the background for the analysis program, CA. Readers without a statistical background may wish to skip some sections. All readers should read at least the first few pages. CA assumes a mathematical model and fits it to the observed data. It also checks on how well the model describes the data. The general model, and its application to timing data, code expansion data, and compilation and link times are each discussed in turn.
There are three books containing information related to benchmarking efforts which would provide interested ACES users with additional background on the design of benchmarking test suites; the analysis and interpretation of the data produced by test suites; possible pitfalls in measurement and interpretation of data; and the implementation of compilation systems (hardware and software).
The book ALGOL 60 Compilation and Assessment, by B. Wichmann, Academic Press 1973, is broader in scope than benchmarking, discussing several other topics relevant to language implementations, support environments, and compilation system testing. Its contents include:
* Comments on performance measurement techniques.
* An analysis of 24 implementations and 42 simple ALGOL statements using a product model and residual matrix (fit using a least-squares criterion).
* Usage statistics for ALGOL 60 which formed the basis of the classical Whetstone benchmark.
* An introduction to implementing block structured languages.
* A discussion of six different compiler implementation approaches and their performance impacts.
* Discussion of program environments.
* Discussion of testing compilers.
The second book of interest is Performance and Evaluation of LISP Systems, by R. Gabriel, MIT Press 1986. This book contains a more recent treatment of measurement techniques and test suite design in the context of LISP and covers some more modern processors than the first book. Particularly relevant are the comments in the conclusion of this book on the tangible benefits of benchmark suites. It lists two major benefits, an increasingly educated audience and improved implementations, where the improvements in implementations come in two sorts: revealing errors in implementations for correction, and revealing "performance bugs" which implementors can correct and thereby improve the efficiency of their compilation systems. The ultimate beneficiary is the user, who gets correct and efficient systems. These are very worthwhile benefits and supplement the value gained by providing users a capability to collect information to make informed selections between alternative implementations.
The third book of interest is Computer Architecture: A Quantitative Approach, by J. Hennessy and D. Patterson, Morgan Kaufmann 1990, which discusses pitfalls in benchmarking relative to optimizing compilers and pipelined processors.
There has been a growing controversy in the literature about how to characterize overall system performance and about the need to review test problems which demonstrate variations in relative performances between systems. The reporting of only "average" performance is increasingly recognized as simplistic and potentially misleading.
"The Keynote Address for the 15th Annual International Symposium on Computer Architecture", May 30 - June 3, 1988, by D. Kuck, printed in Computer Architecture News, Volume 17 Number 1, March 1989, emphasizes several points of interest relative to the performance of benchmarks on systems. First, he pointed out that performance is unstable; the megaflop rates vary widely between the different Livermore Loops kernels, having a ratio of 15 between the best and worst rates on one system (machine and compiler). He concluded that it is not good to use a single benchmark as a measurement tool - but that this is only the most benign part of the story.
Secondly, Mr. Kuck examined results from two different systems and plotted the performance ratios between them. The first system was 1.5 to 2.5 times faster than the second for nine problems; the second system was more than 1.5 times faster than the first in another nine cases; and in six cases the systems were within 25% of each other (there are a total of 24 test problems). Based on this data, deciding which system is faster is not a well-defined question; the answer varies widely depending on which problem is being run. Other work he had done convinced him that the instability in both the megaflop rate and the relative performance of systems is NOT an anomaly of the Livermore Loops due to the small size of the code; large variations in performance over different problems are a real property of systems, and when he examined real applications their "unstable" behavior (variation in absolute and relative performance) was greater than for the Livermore Loops test programs.
The recent paper "Performance Variation Across Benchmark Suites", by C. Ponder in Computer Architecture News, Volume 19 Number 4, June 1991 emphasizes the inability of any single number to represent the performance of a system. It recommends that consideration be given to "performance plots" which display the range of variations in the relative performance of a system over a suite of benchmarks, and that users examine extreme cases to try to understand the reasons for their non-uniform behavior. He argues that "too much" information from "too many" benchmarks is not a problem if the results are properly presented, and that the more objectionable benchmarks (those testing only one operation or being too small) may provide the most interesting insights.
The article "Are SPECmarks as Useless as Other Benchmarks?", by O. Selin in UnixWorld August 1991 reports that L. Nirenberg of Competition Technology Corporation developed a simple linear regression model using Dhrystone MIPs and Linpack megaflops to predict the SPECmark (which is a performance metric based on running ten large programs) "quite accurately". The tentative conclusion that the article suggests is that all reasonable benchmarks which concentrate exclusively on CPU power metrics are highly statistically correlated and that measurement of any will permit prediction of all of them; from this he concluded that extensive (and expensive) benchmark testing is not warranted. Experience with regression and other statistical techniques (such as principal component analysis) in related fields suggests that such models are not surprising. Extrapolating to the ACES test suite suggests that, given access to a large representative sample of ACES performance results, it should be possible to statistically predict ACES system factors based on measuring a few problems. While probably true, doing so would give ACES users little information about the variances in language feature performance and would discard one of the unanticipated benefits of the test suite: finding errors in implementation.
The approach used in the ACES Comparative Analysis (CA) tool emphasizes the observed range of variations by providing a residual matrix which displays the differences between the model fit and the actual data, so that extreme cases are isolated for detailed examination. It also provides confidence intervals, so that the estimates of "average" performance it computes include an explicit estimate of the goodness-of-fit of the statistical model to the data: systems with very similar relative performances over the test suite will have narrow intervals, while systems with dissimilar relative performances will have broad confidence intervals. The residual matrix is similar in purpose to the "performance plots" suggested by Ponder and provides the information on the variations in performance discussed by Kuck.
This section explains the rationale for the CA model and for the statistical estimation techniques used. One useful concept in our discussion, with which some users may not be familiar, is the geometric mean. Whereas the arithmetic mean of N numbers is the quotient of their sum by N, the geometric mean is the Nth root of their product. That is,
A = Sum * (1/N)
G = Product ** (1/N).
Note that the logarithm of the geometric mean is the arithmetic mean of the logarithms. Figure 7-1 contrasts these means for several sets of two numbers each.
Arithmetic Mean:
5 + 5 = 10 10 / 2 = 5
2 + 8 = 10 10 / 2 = 5
1 + 9 = 10 10 / 2 = 5
Geometric Mean:
5 * 5 = 25 Sqrt (25) = 5
2 * 8 = 16 Sqrt (16) = 4
1 * 9 = 9 Sqrt (9) = 3
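The two means can be computed as sketched below (illustrative code, not part of CA); the example reproduces the last row of Figure 7-1, where the pair (1, 9) has an arithmetic mean of 5 and a geometric mean of 3.

with Ada.Text_IO;
with Ada.Numerics.Elementary_Functions; use Ada.Numerics.Elementary_Functions;
procedure Mean_Sketch is
   type Score_Array is array (Positive range <>) of Float;
   Scores : constant Score_Array := (1.0, 9.0);

   function Arithmetic_Mean (S : Score_Array) return Float is
      Sum : Float := 0.0;
   begin
      for I in S'Range loop
         Sum := Sum + S (I);
      end loop;
      return Sum / Float (S'Length);
   end Arithmetic_Mean;

   function Geometric_Mean (S : Score_Array) return Float is
      Log_Sum : Float := 0.0;
   begin
      -- The logarithm of the geometric mean is the arithmetic mean of the
      -- logarithms, which also avoids overflow of the product.
      for I in S'Range loop
         Log_Sum := Log_Sum + Log (S (I));
      end loop;
      return Exp (Log_Sum / Float (S'Length));
   end Geometric_Mean;
begin
   Ada.Text_IO.Put_Line ("arithmetic mean:" & Float'Image (Arithmetic_Mean (Scores)));
   Ada.Text_IO.Put_Line ("geometric mean: " & Float'Image (Geometric_Mean (Scores)));
end Mean_Sketch;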
The requirements for a model to describe the ACES test results are listed below.
(1) The model should be consistent with simple data pattern interpretations.
There are several obvious results in the following (see Figure 7-2). First, System A performs better than any other system; its performance figures are consistently less by a wide margin. Any model that led us to a different conclusion would be suspect. Second, systems C and D are very similar; our results should confirm this. The results of a comparison between system B and systems C and D are less obvious. The arithmetic mean shows B to be the slowest system. The geometric mean ranks B, C, and D at the same level. The reason for this discrepancy is not hard to find. The arithmetic mean is influenced by the high absolute time on problem_1 by system B. The geometric mean depends solely on the relative ratios between problems. Comparing B and C, for example, we find that on problem_1, B takes twice as long as C, but on problem_2, C takes twice as long as B. Since the scores on problem_3 are equal, the geometric mean rates the two systems as equal.
(2) The model should be simple; it should have an intuitive interpretation. Both of the models we consider below are very simple.
(3) The model should summarize the data well. If the model does not fit the data, then it will not be useful. Both of the models we discuss fit our data well.
(4) The model parameters must be estimable.
SIMPLE INTERPRETATIONS
--------------------------------------------------------------------------------------
Consistent with simple data pattern interpretations
--------------------------------------------------------------------------------------
SYSTEMS
--------------------------------------------------------------------------------------
Problems A B C D
--------------------------------------------------------------------------------------
problem_1 20 80 40 40
problem_2 10 20 40 20
problem_3 4 20 20 40
--------------------------------------------------------------------------------------
Arithmetic 11.3 40.0 33.3 33.3
Geometric 9.3 31.7 31.7 31.7
--------------------------------------------------------------------------------------
While there are variations, two simple models are widely discussed in the literature. See these references, as well as the references discussed in the introduction to this section, for additional information.
* Fleming, Philip J., and Wallace, John J. How Not to Lie with Statistics: The Correct Way to Summarize Benchmark Results. CACM, Mar. 1986, Vol 29, No. 3. pp. 218-221.
* Smith, James E. Characterizing Computer Performance with a Single Number. CACM, Oct. 1988, Vol 31, No. 10. pp. 1202-06.
These are the product model and the additive model. The product model assumes that computer performance can be modeled as the product of a system factor (for each system) and a problem factor (for each problem). This model is intuitively appealing because we often say that one system is twice as fast as another; or that one system is 50 percent faster. Implicitly, a product model is assumed. A product model corresponds to a geometric mean.
An alternative approach to comparing system performance is to simply add the execution times of a set of programs and use this number as a figure of merit. We can interpret these totals in the same way as the results of the product model analysis. For reference purposes, we call this approach the additive model. This model corresponds to an arithmetic mean.
Some characteristics of these two approaches have been identified. The product model gives equal weight to all test problems, whether large or small. The product model is concerned with the ratios between system performances on a given test problem. On the other hand, the additive model clearly gives added importance to long-running problems since these contribute more to the total score. In some cases, when a very large problem is part of an analysis group, this can mean that this one problem effectively determines the results.
Figure 7-3, Product vs. Additive Model, gives us an example which compares the product and additive models. Except for system A, which is clearly better than all the other systems, the pattern of results is not obvious from examining the raw data. The geometric mean and the arithmetic mean lead us to different rankings of systems. The rank order for the geometric means is A, B, D, C while the rank order for the arithmetic means is A, (C, D), B as C and D are tied. This confusion reflects the fact that the data from systems B, C, and D do not reveal any clear winners. Any single number summary here can be misleading. (The ACES handles this confusion by presenting the reader with confidence intervals and outliers to aid in interpreting the system factors for each system.)
PRODUCT VS. ADDITIVE MODELS
------------------------------------------------------------------------
SYSTEMS
Problems A B C D
------------------------------------------------------------------------
problem_1 20.0 70.0 33.3 40.0
problem_2 10.0 20.0 33.3 20.0
problem_3 4.0 20.0 33.3 40.0
------------------------------------------------------------------------
Geometric mean 9.3 30.4 33.3 31.7
------------------------------------------------------------------------
ratio (A = 1.0) 1.0 3.3 3.6 3.4
------------------------------------------------------------------------
Arithmetic mean 11.3 36.7 33.3 33.3
------------------------------------------------------------------------
ratio (A = 1.0) 1.0 3.2 2.9 2.9
We should emphasize that the product and additive models can produce different results even when all problems contribute equally. In the product model, the least average ratio defines the best system. In the additive model, the least total time defines the best system. One advantage to the additive model is that the zero times are naturally included in the results. Zeroes are more difficult to include in the product model.
The total time required for an application (or set of applications) is the most generally accepted measure of performance. Less time means better performance. Given a set of benchmarks, this leads to the conclusion that a weighted arithmetic average is the appropriate measure to use for system evaluation. If the user can assign meaningful weights to the test problems (or to a subset of them), then this approach should predict the performance a user will get. For reasons discussed below, the ACES does not simply add scores to compute a system factor. First, the data is "normalized" by dividing by the mean score for that problem on the systems being compared. (The reasons for this step are discussed next.) Then, a robust mean is computed for each system, robust confidence intervals are estimated, and a multiple comparison is done.
This approach is called the Special (Application Profile) Analysis Mode in the ACES. It is also discussed in Section 5.3.2.3. The User's Guide Section 9.1.5 "Modifying the Structure (Weights) File" discusses how to accomplish this goal. For users who know enough about their needs to assign appropriate weights, this is probably the most valuable analysis for them to perform.
This is not the only ACES analysis mode for several reasons: (1) Not all users have the time or the information to derive the necessary weights; (2) Some sets of analysis results are definitive without going to this extra trouble; sometimes one (hardware and software) system is clearly better; (3) Even users of the Application Profile approach can learn something from doing the "regular" ACES analysis. The Application Profile mode does not produce an outlier analysis. This can be a very valuable part of the ACES findings. Regardless of how the overall system comparison is done, outlier analysis can tell the user which specific problems are very fast or very slow, relative to the performance by that system on other problems, and relative to the performance by other systems on that problem.
Since the ACES contains tests with widely varying execution times, and since the length of time that these tests run is not necessarily an indicator of how important a problem is, we do not want some problems to be more important simply because they run longer. We can make each problem equally important by weighting the problems appropriately. One solution would be to use the reciprocal of the average problem speed.
We do not, a priori, have any reason to think that any problem is more important than another. (If it is more important for some users, then they can weigh it accordingly.)
In the ACES analysis mode, we first proceed by normalizing the scores on each problem by dividing by the mean for that problem across all the systems being compared. Normalization has no effect on the relative system factors produced by the product model (in the absence of missing data). It does equalize the importance of problems in the additive model. Figure 7-4, Normalization - Models, illustrates the impact of normalization. Without normalization, the additive and product model give different rankings to system C. With normalization, the rankings are identical and the system factors are very close.
NORMALIZATION - MODELS
------------------------------------------------------------------------
Raw Data Normalized Data
------------------------------------------------------------------------
Systems A B C A B C
------------------------------------------------------------------------
prob 1 2 4 8 0.43 0.86 1.71
prob 2 3 6 12 0.43 0.86 1.71
prob 3 5 10 20 0.43 0.86 1.71
prob 4 110 220 60 0.85 1.69 0.46
------------------------------------------------------------------------
Arithmetic 30 60 25 0.53 1.07 1.40
------------------------------------------------------------------------
ratio (A = 1.0) 1.0 2.0 0.83 1.0 2.0 2.6
------------------------------------------------------------------------
Geometric 7.6 15.2 18.4 0.51 1.02 1.24
------------------------------------------------------------------------
ratio (A = 1.0) 1.0 2.0 2.4 1.0 2.0 2.4
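The normalization step itself is simple; the sketch below (illustrative code, not the CA implementation) applies it to the raw data of Figure 7-4: each score is divided by the mean score for that problem across the systems being compared.

with Ada.Text_IO;
procedure Normalize_Sketch is
   -- Raw data from Figure 7-4: rows are problems, columns are systems A, B, C.
   Raw          : constant array (1 .. 4, 1 .. 3) of Float :=
                    ((2.0, 4.0, 8.0),
                     (3.0, 6.0, 12.0),
                     (5.0, 10.0, 20.0),
                     (110.0, 220.0, 60.0));
   Normalized   : array (1 .. 4, 1 .. 3) of Float;
   Problem_Mean : Float;
begin
   for P in 1 .. 4 loop
      Problem_Mean := 0.0;
      for S in 1 .. 3 loop
         Problem_Mean := Problem_Mean + Raw (P, S);
      end loop;
      Problem_Mean := Problem_Mean / 3.0;
      for S in 1 .. 3 loop
         -- Dividing by the problem mean gives every problem equal weight.
         Normalized (P, S) := Raw (P, S) / Problem_Mean;
         Ada.Text_IO.Put (Float'Image (Normalized (P, S)) & "  ");
      end loop;
      Ada.Text_IO.New_Line;
   end loop;
end Normalize_Sketch;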
Normalization was originally introduced to explicitly equalize problem importance while running the additive model. However, it turns out that normalization is also important for our models in an unexpected way. Missing data is common in ACES results. Even today, with more mature Ada compilers, some tests fail to compile; others fail to run. This missing data can distort results. Normalization reduces this effect for both the additive and the product models.
Normalization affects the geometric mean only because of missing data. If there were no missing data, the ratio of geometric means for raw or normalized data would be the same; missing data distorts these values, sometimes greatly. Dividing by a problem mean divides each system's product by the same quantity, so their ratios remain unchanged. With missing data, not all products are divided by the same set of problem means, since problems with missing scores are simply not included.
If we remove the largest value in Figure 7-4 (the score for problem 4 on system B), and recompute our results, we get a new winner. However, if we normalize first and then recompute, the results are much more similar. Figure 7-5, Normalization - Missing Data, summarizes these calculations. We do not want failure on slow problems to improve the relative standing of a system.
                       NORMALIZATION - MISSING DATA
------------------------------------------------------------------------
                         Raw Data                 Normalized Data
------------------------------------------------------------------------
Systems              A       B       C          A       B       C
------------------------------------------------------------------------
prob 1               2       4       8       0.43    0.86    1.71
prob 2               3       6      12       0.43    0.86    1.71
prob 3               5      10      20       0.43    0.86    1.71
prob 4             110       -      60       1.29       -    0.71
------------------------------------------------------------------------
Arithmetic          30     6.7      25       0.65    0.86    1.46
------------------------------------------------------------------------
A = 1.0            1.0     0.2     0.8        1.0     1.3     2.3
------------------------------------------------------------------------
Geometric          7.6     6.2    18.4       0.57    0.86     1.4
------------------------------------------------------------------------
A = 1.0            1.0     0.8     2.4        1.0     1.5     2.4
Normalization is not sufficient to compensate for the impact of missing data. It merely reduces it. When there is missing data, we run a check on our findings by doing each pairwise comparison with only the data common to the two systems.
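The following fragment (again written for this guide, not part of the CA program) illustrates how missing scores can be handled: geometric means are formed over whichever scores are present, and the pairwise check just described is restricted to the problems which both systems completed. The data are the raw scores of Figure 7-5, with a sentinel value standing in for the missing score.

-- Minimal sketch of handling missing data: missing scores are skipped
-- when forming geometric means, and a pairwise check uses only the
-- problems measured on both systems.  Data are from Figure 7-5.
with Ada.Text_IO;
with Ada.Float_Text_IO;
with Ada.Numerics.Elementary_Functions;
procedure Missing_Data_Example is
   use Ada.Text_IO;
   use Ada.Float_Text_IO;
   use Ada.Numerics.Elementary_Functions;

   Missing : constant Float := -1.0;   -- sentinel for a score that is absent

   type Score_Table is array (1 .. 4, 1 .. 3) of Float;
   Raw : constant Score_Table :=
     ((  2.0,     4.0,  8.0),
      (  3.0,     6.0, 12.0),
      (  5.0,    10.0, 20.0),
      (110.0, Missing, 60.0));

   --  Geometric mean over the available scores for one system.
   function Geometric (S : Integer) return Float is
      Log_Sum : Float   := 0.0;
      Count   : Natural := 0;
   begin
      for P in Raw'Range (1) loop
         if Raw (P, S) /= Missing then
            Log_Sum := Log_Sum + Log (Raw (P, S));
            Count   := Count + 1;
         end if;
      end loop;
      return Exp (Log_Sum / Float (Count));
   end Geometric;

   --  Ratio of system S2 to system S1, restricted to common problems.
   function Pairwise_Ratio (S1, S2 : Integer) return Float is
      Log_Sum : Float   := 0.0;
      Count   : Natural := 0;
   begin
      for P in Raw'Range (1) loop
         if Raw (P, S1) /= Missing and then Raw (P, S2) /= Missing then
            Log_Sum := Log_Sum + Log (Raw (P, S2) / Raw (P, S1));
            Count   := Count + 1;
         end if;
      end loop;
      return Exp (Log_Sum / Float (Count));
   end Pairwise_Ratio;

begin
   for S in Raw'Range (2) loop
      Put ("System" & Integer'Image (S) & "  geometric mean =");
      Put (Geometric (S), Fore => 3, Aft => 1, Exp => 0);
      New_Line;
   end loop;
   Put ("B relative to A, common problems only =");
   Put (Pairwise_Ratio (1, 2), Fore => 2, Aft => 1, Exp => 0);
   New_Line;
end Missing_Data_Example;

On these data the raw geometric means reproduce the 7.6, 6.2, and 18.4 of Figure 7-5, while the common-data comparison still reports system B as 2.0 times slower than system A.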
We have not been able to conclude that either the product or the additive model is always superior. For historical reasons, the CA program uses the product model. In practice, it makes very little difference: using the data we have available, almost all conclusions (system factors, significant differences, outliers, goodness of fit) are the same.
Other researchers have argued differently, but their data were based on evaluations of supercomputers, where performance variations between problems were often orders of magnitude apart, first favoring one system and then another. None of the ACES data we have seen displays these characteristics.
We want our statistical methods to estimate system factors that we can use to compare different systems and reach reliable conclusions about which are better. In statistical terminology, we want robust estimators that include confidence intervals and a multiple comparison technique. We also want to identify outliers: data points that do not fit the model.
Robust statistical estimation techniques are important for several reasons: (1) The data may not fit the classical statistical assumptions. In this case the classical statistical procedures may lead to misleading or wrong conclusions. (2) Even though the ACES timing loop has many built-in tests to detect timing errors, it is still possible, under some circumstances, to gather erroneous times. We want to use a statistical estimation technique that will not be seriously distorted by a few bad measurements. (3) Finally, some data may not fit our simple model, even when the timing measurements are accurate. In both cases (2) and (3), we want to identify and flag the aberrant data points for further analysis.
We have selected methods that meet these requirements.
* System Factors and Confidence Intervals - The point estimates and confidence intervals are computed with the M-estimators described in chapters 11 and 12 of Understanding Robust and Exploratory Data Analysis, by David C. Hoaglin, Frederick Mosteller, and John W. Tukey.
* Multiple Comparisons - For multiple comparisons, we use the Bonferroni method presented on pages 33-34 of Analysis of Messy Data, by George A. Milliken and Dallas E. Johnson.
* Outliers - We compute the outlier range from an analysis of the natural logarithm of (actual/predicted). Our primary viewpoint is still the product model: we would like to be able to say that one system is nn% faster (or slower) than another. From this perspective, the obvious way to interpret our data for good or bad fit is to ask what we have to multiply the predicted value by to produce the actual value. We take logarithms to linearize the data for statistical purposes.
The other choice here is ABS (actual - predicted). However, this can lead to anomalies when applied to raw (or normalized) data. If the predicted value is 1.0 and we have actual values of 0.05 and 20.0, the first prediction misses by 0.95, which is much smaller than the second miss (19.0). Yet the first case is 20 times faster than expected and the second is 20 times slower. This problem is solved by using logarithms, since
Log (actual / predicted) = Log (actual) - Log (predicted).
* Goodness of fit - We use the sum of the absolute values of ( Log (actual) - Log (predicted) ) as our measure of goodness of fit. Based on our analysis of outliers, this appears to be an appropriate measure. Note that it is analogous to the sum of the absolute deviations from the arithmetic mean. A small sketch of the residual and goodness-of-fit calculation follows this list.
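The following fragment illustrates the residual and goodness-of-fit calculation for one system. The actual and predicted values and the outlier threshold shown here are hypothetical and serve only as an illustration; CA derives its outlier range from a robust analysis of the residuals rather than from a fixed cutoff.

-- Minimal sketch of log residuals, outlier flagging, and goodness of fit.
-- The data and the threshold are hypothetical; this is not the CA program.
with Ada.Text_IO;
with Ada.Float_Text_IO;
with Ada.Numerics.Elementary_Functions;
procedure Fit_Example is
   use Ada.Text_IO;
   use Ada.Float_Text_IO;
   use Ada.Numerics.Elementary_Functions;

   type Vector is array (Positive range <>) of Float;

   Actual    : constant Vector := (0.05, 1.2, 0.9, 20.0);
   Predicted : constant Vector := (1.0,  1.0, 1.0,  1.0);

   Threshold : constant Float := 2.0;   -- |log residual| above this is flagged
   Fit       : Float := 0.0;            -- sum of |log residuals|
   Residual  : Float;
begin
   for I in Actual'Range loop
      --  Residual = Log (actual / predicted) = Log (actual) - Log (predicted).
      Residual := Log (Actual (I)) - Log (Predicted (I));
      Fit := Fit + abs Residual;
      if abs Residual > Threshold then
         Put ("Problem" & Integer'Image (I) & " is an outlier; log residual =");
         Put (Residual, Fore => 3, Aft => 2, Exp => 0);
         New_Line;
      end if;
   end loop;
   Put ("Goodness of fit (sum of absolute log residuals) =");
   Put (Fit, Fore => 3, Aft => 2, Exp => 0);
   New_Line;
end Fit_Example;

Note that the first and last problems are flagged symmetrically: one runs 20 times faster than predicted and the other 20 times slower, and the magnitudes of their log residuals are equal, which is exactly the behavior the logarithmic measure is chosen to provide.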
This section discusses how well the product model assumed by CA can be expected, in principle, to fit measurements of execution time. The product model will be satisfied if we can make the "linear" assumption that there is a time associated with each language construction, and the total time for a problem is the sum of the times for the constructions it uses. Then, if the relative speeds of various constructions on different hardware are roughly constant, the data will fit the product model. For non-optimizing compilers, this assumption could easily be satisfied. For optimizing compilers, the linear assumption might break down.
Many optimizing transformations (e.g., common subexpression elimination, folding, load/store suppression, most machine idioms, strength reduction, and automatic inline expansion of subprograms) will not modify the asymptotic complexity ("Big O") of the generated code. For a linear sequence of Ada statements, the code generated by either an optimizing or a non-optimizing compiler will have linear complexity; the optimizing compiler will simply have a smaller coefficient.
This is not true for loop invariant motion: the difference between an optimizing compiler which moves an expression evaluation out of a loop and a non-optimizing compiler which does not is determined by the number of loop iterations, rather than being a fixed property of the optimization technique. For most optimizing techniques, the difference in execution time between optimized and non-optimized translations will be some factor based on the code generation approach and on whether the test problem contains constructions which are amenable to optimization.
Some test problems are not amenable to optimizations. Consider two compilation systems for the same target machine, one which does straightforward code generation and one which produces optimal code. There are some test problems where the straightforward code will be the best possible code for the target machine. The optimizing compiler will not be able to produce any better code for that problem. The analysis program will show the optimizing compiler as executing slower than expected for these simple test problems. In this example, the real problem that the optimizing compiler has is that it cannot do any better than the best code for the target machine, but based on the average performance on the other test problems, it is expected to.
A product model has been widely and successfully applied in similar studies. Citing the performance of one compilation system as a factor (or percentage) of another is a common practice and implicitly assumes a product model. Replacing the target hardware with a processor which is uniformly some multiple of the speed of the original should intuitively result in a proportional change in the evaluation of the compilation system on the new processor relative to the old one; this also assumes a product model. A product model is simple and intuitively understandable. A detailed examination of the model assumptions and sample results suggests that a perfect fit should not be expected. This lack of perfection is not a fatal flaw. The main purpose of the statistical analysis is to determine and present underlying trends in the large volume of performance data.
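As a concrete illustration of the linear assumption, the following fragment uses hypothetical per-construction times and usage counts (none of these numbers come from ACES measurements). Because the second system's construction times are a uniform multiple of the first system's, every problem total scales by the same factor and the product model fits exactly; an optimization whose benefit depends on the data, such as loop invariant motion, would break this uniformity.

-- Minimal sketch of the "linear" assumption behind the product model.
-- All construction times and usage counts are hypothetical.
with Ada.Text_IO;
with Ada.Float_Text_IO;
procedure Product_Model_Example is
   use Ada.Text_IO;
   use Ada.Float_Text_IO;

   type Construction is (Assignment, Call, Loop_Overhead);
   type Cost_Array   is array (Construction) of Float;
   type Count_Array  is array (Construction) of Natural;

   --  Hypothetical per-construction times (microseconds) for two systems;
   --  System_2 is uniformly 2.5 times slower than System_1.
   System_1 : constant Cost_Array := (Assignment => 1.0, Call => 10.0, Loop_Overhead => 3.0);
   System_2 : constant Cost_Array := (Assignment => 2.5, Call => 25.0, Loop_Overhead => 7.5);

   --  Hypothetical usage counts for two test problems.
   Problem_A : constant Count_Array := (Assignment => 50, Call => 2, Loop_Overhead => 5);
   Problem_B : constant Count_Array := (Assignment => 10, Call => 8, Loop_Overhead => 1);

   function Total (Costs : Cost_Array; Counts : Count_Array) return Float is
      Sum : Float := 0.0;
   begin
      for C in Construction loop
         Sum := Sum + Costs (C) * Float (Counts (C));
      end loop;
      return Sum;
   end Total;
begin
   --  Both ratios print as 2.50: a perfect product-model fit.
   Put ("Problem A ratio (System_2 / System_1) =");
   Put (Total (System_2, Problem_A) / Total (System_1, Problem_A), Fore => 2, Aft => 2, Exp => 0);
   New_Line;
   Put ("Problem B ratio (System_2 / System_1) =");
   Put (Total (System_2, Problem_B) / Total (System_1, Problem_B), Fore => 2, Aft => 2, Exp => 0);
   New_Line;
end Product_Model_Example;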
Tasking tests executed on multiprocessor target systems which assign separate Ada tasks to separate hardware processors measure a conceptually different quantity than the same programs executed on a single-processor system. The use of priorities to control whether the task performing the ACCEPT or the task performing the ENTRY call arrives at the rendezvous first may not work on multiprocessor targets.
An ACES user might want to extrapolate measurements from one member of a family of processors to other members in the family. For example, data may be available for Vendor One's system executing on a VAX-station 2000 and Vendor Two's system executing on a VAX-8800. CA will numerically compare the two sets of measurements, but this result confounds the machine effects with the compilation system effects.
The question the user would like to have answered is: "If both vendor systems were run on the same hardware, which would be faster?" The difference in hardware speed between two members of a family is rarely a uniform factor. The relative speeds of floating point processing, cache memory systems, instruction prefetching, software simulation of some instructions, etc., usually result in some instruction sequences performing faster or slower than the average difference between the two processors. If a user is willing to accept that one member of a family is X times faster than another member (or, equivalently, that each has a given MIP rate which is assumed to be applicable for the applications of interest), then the user can compare the factors computed by CA with the relative hardware speeds to see how well the two performance estimates agree.
Large differences between systems will show up this way; however, users need to be aware that hardware options (such as the different floating point hardware support as provided for several families of machines) can complicate such direct comparisons. If a user needs to know performance on a particular target configuration, the ACES should be run on that configuration. Extrapolating from "similar" systems can introduce errors. Dividing each test measurement by the target hardware's average instruction time will estimate the (fractional) number of instructions to execute each test problem.
Some people have been tempted to use this measure to "factor out" target machine hardware characteristics and measure the compiler code generator (independent of the target hardware). This is not really useful for two reasons:
* First, it is difficult to determine target hardware scaling factors. Different Instruction Set Architectures (ISAs) take different numbers of instructions to perform similar tasks. Assessment of "overall" processor speed requires a determination of the appropriate mix of instructions and memory characteristics (cache faults, wait states, prefetching, etc.). Simply counting instructions can be misleading because a RISC ISA will typically require more (but simpler) instructions to translate a test problem than a Complex Instruction Set Computer (CISC) processor; if a RISC processor can execute its simple instructions quickly and cheaply, it may be a more cost-effective design. Furthermore, many compilers do not use all the features of the target machine. An MC-68020 is upward compatible with an MC-68000, so a compiler which treats an MC-68020 as an MC-68000 will work; a "hardware factor" determined on the assumption that all features are used will give misleading results.
* Second, it is not helpful to do so. The purpose of the ACES is to evaluate, not compiler writers, but compilation systems, including both software (compilers and run-time libraries) and hardware. Target architectures make a significant contribution to overall system performance which should not be ignored. In an academic environment, it might be desirable to give higher marks to a highly optimizing compiler generating code for a feeble target processor than to a straightforward compiler producing fair code for a capable target processor. However, a primary purpose of the ACES is to provide a capability to evaluate Ada implementations for their suitability for MCCR applications. If the first complete system (optimizing compiler/slow hardware) is slower than the second (straightforward compiler/fast hardware), that is the determination that the ACES is interested in making. A project may well select a slower system if its performance is adequate, but it would be foolish for the ACES to report that the first system is faster.
Compilation systems may have implementation-dependent facilities which can be used to improve the performance of the generated code. Their use is not portable, and so results are not necessarily comparable between systems. The syntax and semantics of the string which is passed as an actual FORM parameter for file OPEN and CREATE operations are defined by each implementation. On some systems, there may be settings which produce much faster performance than the defaults. For example, there may be options for specifying shared or exclusive file access, contiguous disk allocations, the size of the initial and secondary extents, multiple buffering of sequential files, suppression or enabling of read-after-write checking for physical I/O operations, and the physical properties of the disk device (seek times, rotational latency, error recovery procedures). These can all have major performance impacts.
The ACES is designed for portable operation, and uses the default settings. To get an idea of the difference tuning can make in performance, ACES users can experiment with the various options provided and select the ones which provide the best execution time while satisfying the program's requirements for I/O functionality. When this has been done, readers should be aware of the modification and interpret the results accordingly; in particular, remember that not all systems may have had the same degree of performance tuning. Once a project has selected a compilation system for use, it may be very interested in the performance changes resulting from modifying FORM parameters. ACES users should feel free to explore such modifications, but they should consider the modified test problems to be "new problems" which do not replace the original, portable versions distributed. Results on modified problems can be very interesting, and important for best exploiting a system.

In addition to FORM parameters, there are other special features which an implementation may provide that can enhance performance. Special pragmas are the feature most likely to occur. Again, an ACES user may modify some problems to observe the effect of nonpredefined pragmas, but should consider these to be "new problems" which are not necessarily comparable between different systems.
This section discusses how well the product model assumed by CA can be expected, in principle, to fit measurements of code expansion size.
The product model assumption will be satisfied for non-optimizing compilers which have a size associated with each language construction; the problem factors computed by CA will be the estimates for these sizes. For optimizing compilers, as with timing measurements, it is not reasonable to assume that the total code expansion size will be a simple sum over the features used; however, as with execution-time optimization, most code space optimizations should be linear with respect to the constructions used. Removing unreachable code, the primary space-specific optimization, is linear.
Other optimization techniques should match the product model nicely; loop invariant code motion does not modify space measurements as it does timing measurements. The code expansion measurements in the ACES reflect the difference between memory addresses at the beginning and end of the code fragment bracketed by the timing loop. This could introduce errors in the measurements if the compiler generated machine code for the Ada text within the timing loop in different address ranges. For example, some compilers may allocate code generated for generic package instantiations to different control sections. Compilers which produce the same amount of code could have very different measurements depending on where the space is allocated.
The ACES tries to avoid this problem by minimizing the use of declarative regions within the timing loop. Where the code bracketed by the timing loop is a call on a subprogram, the space measurement will not reflect the space associated with the referenced subprogram, although the execution time for the subprogram will be accounted for in the timing measurements. This discrepancy between time and space measurements should not confuse a reader.
Consider a system which translates a language construction by a call on an RTS routine where the RTS routine is not loaded unless the language feature is used in the program. If total program size, including all the loaded RTS routines, were considered, it would not be reasonable to expect a linear factor to adequately model usage. A call on the support routine will take "n" words of code, and the RTS routine takes "m" words when loaded. If the feature is referenced once in a program, it will take "n+m" words; if the feature is referenced twice it will take "2*n+m" words; if referenced "p" times, it will take "p*n+m" words. While this is asymptotically linear in "p", for small programs (small "p"), the non-linear behavior introduced by the "m" will thwart attempts to fit one linear factor to the language construction.
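The following fragment illustrates this "p*n + m" behavior with hypothetical values of n and m:

-- Minimal sketch of the "p*n + m" size behavior described above.
-- N and M are hypothetical word counts chosen only for illustration.
with Ada.Text_IO;
procedure Size_Model_Example is
   use Ada.Text_IO;
   N : constant := 5;     -- words of code per call on the RTS routine
   M : constant := 200;   -- words occupied by the RTS routine when loaded
begin
   for P in 1 .. 4 loop
      Put_Line ("references =" & Integer'Image (P)
                & "  total size =" & Integer'Image (P * N + M)
                & "  size per reference =" & Integer'Image ((P * N + M) / P));
   end loop;
   --  The size per reference falls from 205 to 55 words as P grows, so no
   --  single linear factor fits both small and large values of P; measuring
   --  code expansion inside the timing loop avoids the constant term M.
end Size_Model_Example;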
This problem is avoided by measuring code expansion sizes rather than total program size. Because of the measurement technique used, subtracting addresses at the beginning and end of the timing loop, a few anomalous measurements might occur. If a compiler allocates the code within the timing loop to different control sections, a test problem may have a code expansion measurement which is much smaller than the space actually allocated to it (since the measurement misses space not contiguously allocated). It is even conceivable that a negative space measurement may be reported, if the end of the loop is allocated in memory at a lower address than the beginning of the loop. Outliers should be examined carefully. A small code expansion size sometimes results when the generated code calls on run-time support routines rather than generating code inline. Expansion sizes and execution times may not be strongly correlated; sometimes small sizes will correspond to long execution times, because of loops in the test problem, calls on external routines (either in the run-time library or user-defined subprograms), or slow instructions. Some correlations should be observed; for example, a test problem which takes zero space should also take zero time.
This section discusses how well the product model assumed by CA can be expected, in principle, to fit measurements of the time to compile and link each test program.
Most of the programs were developed to measure execution-time performance, and do not necessarily represent a set of compilation units which will expose all the relevant compilation-time variables. However, the Systematic Compile Speed group is a set of tests designed to systematically measure factors which may affect compilation speed. Users may decide to examine only the results from this group when looking at compile (and link) times.
CA provides the option for users to select compilation times, or link times, or combined compilation and link times. Link time can be a significant fraction of total time to transform the source of a compilation unit into an executable load module. The ACES recommends including link time in the compilation rate measurements. This has not been traditional in the quoting of compiler performance, since many compilation systems for prior languages have used system standard linkers which are common to all language processors on a target system and are often written and maintained by different organizations from the one writing the compiler. The Ada language definition requires checking of subprogram type signatures between separate compilation units, and while this could be done at execution time, it is possible (and more efficient) to do the checking at link time.
Different Ada systems have partitioned the checking work differently between the compiler, the linker, and the execution-time environment. On a system with a linking loader which links modules only after a program execution has been requested, it will not be possible to separately measure link time, or to combine link time with compile times. The time associated with linking will appear to users as a slow program load. Linking loaders complicate measurements; however, they are not a widely used implementation approach for Ada compilation systems. When testing such a system, an ACES user will report on the observable times (compilation times for the test programs and execution times for the test problems) and ignore program load times.
The ACES measures the time to compile and link various source files. This approach avoids non-productive controversy about how to count lines. Should comments and blank lines be counted? Should only executable statements be counted, and if so, should initialization in a declarative region be counted as executable statements? There are several Ada features whose use may impact compile rates.
* Compilation unit size - Several code generator paradigms build a representation of a program (or a piece of a program such as a subprogram, basic block, linear sequence of statements, ...) and perform various manipulations on the structure to try to produce good code. Some of the algorithms used are of more than linear time complexity; they run much slower on large units than on short ones. The hope is that, in exchange for taking more time to compile, better code will be generated. On the other hand, the time associated with loading each phase of a compiler may mean that every compilation incurs some fixed overhead, so a short program may not take much less time than a slightly longer one.
* Presence of generic instantiation - The amount of time associated with instantiating a generic unit can be substantial. The compiler must check that type signatures match, and if it is treating instantiation as a form of "macro expansion", it may try to optimize generated code. When an actual generic parameter is a literal, the compiler may use this value to fold expressions in the generic body. Some Ada compilers treat a generic instantiation as a macro expansion and effectively recompile the source code when instantiating the generic unit, substituting the actual generic parameters, with the expected impact on performance.
* "With" clauses - Each library unit referenced in a WITH clause can require considerable processing time. The time to search a program library to find the matching definition can be substantial, and can vary with the usage history of the program library.
* Disk usage - The placement of source, object, and temporary files on disk can produce better performance by reducing the time associated with physical I/O. Contiguous allocation of files on separate devices on separate channels would lead to the best times. Contending users, or separate files on the same disk causing disk head contention, will tend to degrade performance.
* System Parameters - Compilers are sensitive to various system tuning parameters, such as the amount of main memory, working set sizes, and contention for CPU and secondary storage. When the CA program indicates that one compilation unit compiles much faster (or slower) on one system than on another, the reader may want to examine the compilation unit to see what language features are being used.
The ACES Symbolic Debugger Assessor is a procedure to assess the quality of a symbolic debugger associated with an Ada compilation system. The Symbolic Debugger Assessor determines the functional capabilities of a symbolic debugger, and measures the performance impact when a program is executed under the debugger. The Symbolic Debugger Assessor includes a set of programs, a set of scenarios describing what operations the evaluator should have the symbolic debugger perform, instructions for evaluating the performance of the debugger, and a summary report form. Because symbolic debuggers provide different user interfaces and capabilities, ACES users will have to adapt the tests to each implementation.
The approach used in the Symbolic Debugger Assessor is based on debugging scenarios (or scripts) which the user will perform in the debugger. Each scenario tells the user which program to use, the operations to be performed, what measurements to collect, and what constitutes a successful performance. Most scenarios determine if a capability is present, and what restrictions are placed on the capability. Execution time is collected for some tests.
A readme file, "yb_readm.txt", explains the scenarios, and the programs provided for each scenario contain comments indicating the operations to be performed on each program. A readme file, "yb_readme.txt", explains the scenarios, and provides the programs provided for each scenario contain comments indicating the operations to be performed. The readme file is found in Appendix C of the User's Guide. The first step in each scenario is to set a breakpoint on an Ada function which contains comments describing what is to be done in the scenario. Comments throughout the test program explain the code and the debugging operations to be performed, leading the user step-by-step through the operations to be performed.
Debugger command files are a convenient way both to execute a sequence of debugger commands and to document what commands were performed. The ACES includes a set of sample executable debugger command files which contain all the commands used to perform the scenarios on one system (VMS). Each debugger command file contains comments explaining the operations performed. The ACES also includes a sample command file to compile all the programs needed for the Symbolic Debugger Assessor for a VMS and for a UNIX system.
The Symbolic Debugger Assessor includes a standardized report format to simplify comparison of results between compilation systems. The report form to be filled in includes space for both quantitative results and the evaluator's qualitative impressions of the debugger. Space is also provided to record the operations used to test each capability.
Because the capabilities of a debugger can change when running optimized code, ACES users may need to perform two separate evaluations: one with the code compiled with an optimize option and one without.
A scenario was developed to examine the capabilities discussed in each of the following sections. In each case, the user is instructed to execute a program under the debugger and perform the specified operations. A separate program without tasking is provided to test other capabilities when the debugger does not handle tasking programs well.
* System State - Determines whether adequate information about the current state of the program is provided by the debugger.
* Breakpoints - Checks for limitations on using breakpoints. The capability to display all breakpoints, and to remove a breakpoint, is also tested.
* Single Stepping - Tests several variations of the single stepping capability.
* Examination of Variables - Tests examination of variables. The capability to examine objects of different types is also tested.
* Break Support - Break support is the capability for a programmer to asynchronously interrupt a program executing under the debugger and have the debugger prepared to accept and process debugger commands.
The test for this capability uses a program which falls into an infinite loop. The programmer will try to interrupt the program and then execute other debugger commands.
* Watchpoints - "Watchpointing" is the monitoring of a variable value and breaking when it changes or when it takes on a specified value. This is checked by a scenario which requests the user to monitor several variables for change in their values. There are two independent dimensions to vary: the attributes of the variable being monitored, and the way the variables are modified. It may be that some debuggers will provide only a limited scope for checking statements which might modify variables -- perhaps only explicit assignments are checked and assignments through aliases may be missed.
Some debuggers will not provide any direct support for "watchpoints". The user may be able to use a macro facility to compare values against a set of "triggers" and break when a variable assumes a particular value.
* Traceback or Walkback - The capability to display the subprogram history is tested. The test problem specifies that the user set a breakpoint in the exception handler of one subprogram and request a traceback from that point.
* Program History - A scenario is provided to test the debugger's capability to display program history. The programmer will set a breakpoint after an IF statement and request the debugger to show which path the IF statement took.
* Modify Variables - Some debuggers provide the capability to modify variables of the program and resume execution. A scenario is provided to test this capability.
* Control Flow Direction - The capability of restarting execution from different locations is tested, although starting execution at an arbitrary location is not generally safe. There are cases where modifying execution flow is feasible and helpful, and those cases are tested.
* Integration with Editor - A scenario is provided to determine whether the debugger is integrated with an editor and the compiler.
* Foreign Languages - If the compilation system supports interfacing to subprograms written in other languages, the debugger should be executable on programs which contain calls on foreign subprograms. It is not expected that all the features of the symbolic debugger be usable on subprograms written in another language. The system documentation should indicate restrictions on foreign language debugging.
* Time Critical Programs - To test the usability of the debugger for time critical code, an example was constructed which has several tasks. A breakpoint is set in one task which will last long enough to force a conditional entry call in another task to fail if "wall clock" time is used.
* Interrupt Handlers - A test is provided to determine whether restrictions are placed on using a debugger inside a task tied to an interrupt. Because some systems may disable all interrupts inside a task tied to an interrupt, the system can lose interrupts and the program may crash when the breakpoint is tripped.
Tying tasks to interrupts is implementation dependent. Test programs using these features will have to be adapted by users to the system being tested. This may not be possible on systems which do not support the capability of tying tasks to interrupts. Some systems provide alternative ways of processing interrupts, such as tying interrupts to procedures. The evaluator of such a system may want to adapt the debugger tests using the non-standard features.
* Conditional Breakpoints - Test problems are included to determine the extent to which a debugger supports conditional breakpoints.
* Macro Facility - A scenario is provided to test the macro facility. A debugger macro facility is a simple method of providing considerable power. It might be the mechanism used to provide conditional breakpoints.
* Tracing - A scenario is provided to determine whether the debugger output can be directed to a file for later analysis.
* Isolation of Debugger I/O - In this scenario, a program executed under the debugger will perform console input and output. The ACES user will decide whether the program output is clearly distinguished from the debugger output and whether the program input is clearly distinguishable from debugger command input.
* Access to Machine Properties - A debugger should provide a way for users to examine the low level properties of the machine. Some difficult problems (such as those involving suspected compiler code generation errors) may require examination of machine registers and an instruction-by-instruction stepping capability.
* Isolate Exception - A scenario is included to determine whether the programmer can use the symbolic debugger to isolate the statement which raised a predefined exception. A program will raise (and handle) multiple exceptions before encountering the one which causes the program to fail.
The user should record whether the breaks occurred on the statement raising the exception or in the handler.
* Non-terminating Tasks - A subprogram cannot complete until all the tasks it created have terminated. A scenario is provided to:
+ Determine whether the debugger has the capability to list all the tasks created by a particular subprogram, and the status of these tasks.
+ Determine whether the debugger has the capability to list all the tasks in a program, identify the parent of each, and the status of these tasks.
* Failure in a Declarative Region - A subprogram can raise an exception during the elaboration of its declarative region: for example, by violating a range constraint or by evaluating an initialization expression which calls on a function whose execution raises an exception. The scenario determines whether breaking on an exception raised in a declarative region is supported.
* Library Package Initialization Failures - The sequence of statements in a package body (LRM 7.3 paragraph 2) can raise an exception. For a library package, this could cause failure before the main program is entered.
A specific test problem was constructed which fails in a library package elaboration and the user is instructed to isolate the fault.
* Library Elaboration Order - A scenario is included to display the order in which library packages are elaborated.
* Execution-time Degradation by Debugger - The run-time performance of the system may be seriously degraded when the debugger is used. Specific options, such as watchpointing variables, may be particularly expensive.
A program derived from the ACES performance test suite is run under the debugger several times, with breakpoints and watchpoints set. The elapsed time collected in the performance test is entered on the report form and compared to the elapsed time to run under the debugger.
* Task Deadlock - A tasking program which deadlocks is included and users can observe whether the debugger system informs them that the program has no runnable tasks.
* Limits on Number of Breakpoints - A test scenario is provided to test for limits on the number of active breakpoints. The user will try to set 10, 30 and 50 breakpoints.
* Program Termination - A scenario is provided which determines whether the debugger treats program termination as an implicit breakpoint.
* Debugger Command Files - Debugger command files allow the user to specify a sequence of debugger commands in a file which can be invoked in the debugger. A scenario is provided to determine whether command files are supported.
The results of the symbolic debugger evaluation include a list of scenarios and the quantitative results for each one: whether the debugger passed or failed, and the execution time (where applicable). The evaluator may also wish to comment on issues such as the quality of the documentation and the user interface, the ease of learning, availability of additional features, and the relative importance of the scenario/capability.
The report form ("yb_tmplt.txt") contains a list of scenarios, a brief description of each scenario, and space for the user to input both quantitative results and subjective evaluations and comments. Space is also provided for system information, such as the name and version number of the compiler and host operating system, host hardware information, compiler options used, the date of the test, and the name of the evaluator. The user creates a tabular report by using a standard system text editor to enter test results into the report form.
The report form included with the Symbolic Debugger Assessor provides a way to summarize the results of the tests and easily compare the results between systems. No analysis tool is considered necessary. The relative importance of each scenario must be decided by the user, and will depend on the user's project. If the project will not be using Ada tasking, then poor debugger support for tasking will not be serious. Unusually awkward operations should be noted on the report template.
This section describes the Diagnostic Assessor. This assessor:
* Covers diagnostics produced by the compiler, linker, and run-time system.
* Considers both error conditions and unusual constructions about which warning messages would be appropriate.
* Maximizes objectivity and minimizes subjective judgments.
* Is adaptable to multiple systems.
The Diagnostic Assessor contains a set of programs and directions, consisting of:
* A set of (intentionally erroneous) programs.
* A set of instructions explaining how to evaluate the system responses to the set of erroneous programs.
* A summary report form to fill in with results as they become available.
A readme file, "yd_readme.txt", explains the scenarios, and the programs provided for each scenario contain comments indicating the operations to be performed. The readme file is found in Appendix D of the User's Guide.
There is a separate Ada compilation unit for most of the diagnostic test problems. This minimizes cases where a failure to process one condition leads to another condition not being attempted. There are command files or scripts to compile, link, and execute the diagnostic test problems on DEC Ada under VMS. Users can adapt these examples to the implementation-dependent requirements for the system they are evaluating. A report form to fill in with results is distributed as a text file ("yd_tmplt.txt"). The report form contains general questions common to all problems; in addition, there are specific questions for some problems. The data collection is in terms of a series of yes/no questions, e.g., "Does the text of the message clearly define the difficulty?" A "yes" response is always a favorable one.
The completed summary report form is a two-way table, with a row for each test problem and a column for each of the general questions shared by all diagnostic messages.
The general questions within the template are:
a) Is any diagnostic message printed?
b) Is the message in the general area of the difficulty? "General area" means that a statement designating an error location is written in the diagnostic message.
c) Is the message at the correct specific location? "Specific location" means that the diagnostic message identifies the location of the error to within a token.
d) Does the text of the message clearly define the difficulty? Each test will have comments explaining exactly what is wrong.
Some problems have additional questions. These include:
e) Is relevant non-local information listed where appropriate?
Examples include: location of declarations for variables; location of IF statements not terminated with an END IF.
f) Is error recovery appropriate?
A few problems are intended to test error recovery and require that no message should be generated for succeeding statements.
Other specific questions which are covered by these template questions will be raised in the upcoming sections as each problem is outlined.
The command file is split into four separate entities, giving the user a simple way to complete only a subsection of the total diagnostic testing. The four areas are compile-time errors, compile-time warnings (or informative messages), link-time messages, and run-time messages.
The four main groups of diagnostic tests are described below:
* Compile-Time Errors - The LRM and the validating procedures require many errors to be detected at compile time. The ACVC does not try to evaluate the clarity or precision of the diagnostic messages generated for erroneous programs; the purpose of the ACES diagnostic evaluation is to assess the quality of these diagnostics. In addition to erroneous conditions, the ACES diagnostic evaluation suite includes examples of "suspicious" conditions which a programmer should be warned about; these reflect potential errors, inefficiencies, or poor programming style.
Deciding whether a diagnostic message clearly identifies a problem is a subjective judgment. For some programmers (on some errors) anything more than "SYNTAX" and a pointer to where the error was detected is superfluous. Programmers familiar with a particular compiler eventually learn to understand what the compiler "means" by cryptic messages. The ACES diagnostic tests are intended to provide an assessment of how helpful a system's diagnostic messages are to programmers who are not experienced with the compilation system being evaluated.
To minimize subjective judgment in the diagnostic message evaluations, the construction of the ACES diagnostic tests emphasizes conditions where specific information would be helpful to programmers. The assessment procedure involves checking for the presence of this information. Each test identifies the specific information expected to be supplied in a diagnostic message.
The Diagnostic Assessor contains examples where specific non-local information is expected and also examples where only local information needs to be provided. In both cases, the specific information expected is well defined.
The compile-time errors tested for are:
+ Invalid characters
+ Mismatching bracketing structures
+ Naming conflicts
+ Circular order dependency
+ Self-WITH
+ Invalid named aggregate associations
+ Invalid CASE alternatives
+ Non-returning function
+ Invalid assignment to a CONSTANT
+ Visibility error
+ For I IN -1..10 LOOP
+ Cascading error messages
+ Suppress duplicate error messages
+ Improper type qualification
+ Improper type conversion
+ Diagnostic limits
+ Inconsistent package body
* Compiler Warning Messages - Warning and informational messages report conditions for which the LRM does not require rejection by a validated compiler. Some of these conditions are erroneous but too difficult to detect for the LRM to require that they be reported (for example, references to uninitialized variables, or programs which depend on the mechanism of parameter passing), while others are not strictly errors but represent unusual or suspicious code (for example, unreachable code, unused variables, and statements which can be determined at compile time to raise an exception). It is valuable for users to find the "warnable" conditions as early in the program development cycle as possible, because they may represent program logic errors.
The following list shows the warning conditions for which test problems are provided:
+ Propagating exceptions beyond visibility
+ Reference to uninitialized variables
+ Uncalled entry
+ Mode inconsistent parameter usage
+ Compile-time warning for constraint violations
+ SELECT statement guaranteed to raise PROGRAM_ERROR
+ Unrecognized pragma
+ Ignored PRAGMA INLINE
+ Improper pragma location
+ Duplicate ACCEPT entries in a SELECT
+ Superfluous DELAY alternatives in a SELECT
+ Excessively large data structure
+ Unreferenced variables
+ Dead variable assignments
+ Unreachable code
+ Unneeded WITH clause
+ Invalid mode dependency
+ Notification of obsoleteness
* Linker Messages - There is a class of error conditions which cannot be detected until after a program has been compiled, but before it executes. Depending on a compiler's design, messages for these conditions may be generated by the compiler, linker, or loader, but the intent is that they be detected and reported to users before the program is executed.
In the scenarios listed below, when mention is made of compiling a unit and then modifying it and recompiling it, the ACES provides a disk file with the original and the modified version of the source text of the unit. It is not intended that an ACES user will edit a test problem file. Providing all necessary versions of the source permits the process of compilation and recompilation to be performed using command files or scripts, as in the rest of the ACES test suite. The ability to execute the tests in batch mode is as important for the diagnostic evaluations as it is for the performance tests. Examples of link-time errors and warnings include:
+ Missing units in library
+ Improper subunits
+ Link with obsolete units
+ Insufficient disk space
+ Omission of unreferenced subprograms (warning/informational message hoped for)
* Run-time Error Messages - A set of test problems is included to test whether several specific conditions generate an execution-time error:
+ Deadlock detection
+ Access errors
+ Exhausting storage
+ Unhandled exception propagated from main procedure
+ Uninitialized variables detectable at run time
The ACES Diagnostic Assessor provides for a tabular report by including a Summary Report form ("yd_tmplt.txt") reflecting the test problems, which users fill in with a standard system text editor as they obtain information. There is provision for users to add descriptive comments about individual test problems, which might be used to record error messages for failed test problems or implementation restrictions encountered.
Not all the test problems are of equal importance to the overall diagnostic capability of a system, since different user organizations have different priorities.
The Summary Report form provides for representing the percentages of each type of error response in each group (the percentage of messages printed for compiler errors, warnings, linker faults, and run-time faults).
A readme file, "yl_readme.txt", explains the scenarios, and the programs provided for each scenario contain comments indicating the operations to be performed. The readme file is found in Appendix E of the User's Guide.
Ada program library systems are assessed by using a set of compilation units, operations (scripts) to be performed on a library, and a checklist for evaluating system responses. These are listed in the following sections. Because there is no standard for library commands, the ACES user will have to adapt the scripts to the compilation system being evaluated.
The Library Assessor contains different scenarios, or scripts, which the ACES user must adapt to the system under test. Each scenario describes what must be accomplished to perform it. Users will typically measure execution time, disk space, or whether some operation is possible.
The scripts define each step to perform and the measurements to make. The scenarios are as order independent as possible; however, because the sequence of library updates can affect the correctness and performance of library commands, the sequence of commands within a scenario must be the same on different systems to ensure that the operations are comparable.
Where possible, the user should construct the sequence of operations required to perform the library evaluation scenarios as an explicit command file or script which can be executed. On most command-line-based systems, this will be possible and advantageous: it clearly records the operations performed (reducing reliance on handwritten logs), it makes scenarios easily repeatable, and it provides for unattended operation. Some systems (including some with graphic user interfaces) may not be able to encapsulate library commands in an executable form. On such systems, the ACES user will have to manually enter commands and record responses.
The ACES distributes sample library command files for DEC Ada on VMS which can be used as models for adapting to other systems.
Sets of test problems (programs and scripts of operations to perform) are included that examine the capabilities listed in each of the following sections.
* Program Library Contents - One script is provided which: creates a library, compiles a unit into it, and then requests a listing of library contents. The ACES user must then verify that the name of the compiled unit is listed as one of the elements in the library.
* Status of a Particular Unit - The user tests the program library manager to determine if it can provide the following information about units entered in it.
+ The full compilation unit name.
+ The creation date when compiled into the library.
+ Status - current or obsolete.
+ Compilation options.
+ Version number (or other identification) of the compiler used when the unit was entered into the library.
+ Information about dependent units.
+ The type of the unit, that is, whether it is a subprogram declaration or body; a package declaration or body; a generic declaration or body; or a subunit (as discussed in LRM 10).
+ The disk space occupied by the unit. This is important information which many systems unfortunately do not provide.
* Variations - A test problem is provided to determine if the library system supports the concept of variations. Variations are modified versions of a unit.
* Space Reclamation - A scenario is provided to detect whether the library system reuses space from deleted compilation units.
There are three scripts to test for reclamation:
+ Successful recompilation of a unit will delete the old version of the unit from the library. There is a script to test whether this space is reclaimed.
+ Unsuccessful recompilation of a unit should not allocate any permanent disk space in the library. There is a script to test if space is allocated and not reclaimed for unsuccessful recompilations.
+ Explicit deletion of units should also make space available. This is checked by a version of the script which compiles and then deletes the units.
* Library Integrity after Manual Abort - A scenario is provided to test whether the library system is particularly sensitive to corruption when the compiler is prematurely terminated by using host operating system techniques to abort it--a Control-Y on VMS, a Control-C on many UNIX systems, or a BREAK on many other systems. Several systems have fragile program libraries, which are very vulnerable to corruption by program aborts. A fragile system will have "windows of vulnerability" during which its internal files are inconsistent; if the compiler is killed in such a vulnerable state, the library may be corrupted.
* Concurrent Library Access - A set of scenarios is provided to test whether concurrent access to a library is supported, in general or with restrictions.
+ A scenario is provided to test whether two independent processes can concurrently read a library.
+ A scenario is provided to test whether two independent processes, one performing a compile and one a link in the same library, will operate concurrently.
+ A scenario is provided to test whether two independent processes, each compiling into the same library, can operate concurrently.
There are some compilation systems where the concept of concurrent access is not applicable, such as a single-user, single-tasking operating system, e.g., MS-DOS on an IBM-compatible PC or MacOS on an Apple Macintosh. On such systems, these scenarios are not applicable and should not be attempted.
* Detection of Missing Unit - A test scenario is provided to determine whether the library system notifies the programmer when a subunit is deleted, making the parent unit no longer current, and if it clearly identifies the missing unit when it is requested.
* Smart Recompilation - This set of scenarios tests for "smart recompilation". There are also some performance tests that measure the system's compile speed to perform smart recompilations. These new tests appear in the Systematic Compile Speed group, Smart Recompilation subgroup.
The sequence of scenarios has increasing complexity. The command files display the dependent unit names before and after submitting the unit for (re)compilation. Examination of time stamps should permit users to observe if dependent units were recompiled. There may be some systems where a directory listing will not display the time stamps and other (implementation-dependent) methods must be discovered to determine if a unit has been recompiled. In addition to this recompilation check for dependent units, the command files record the total recompilation times for comparison between systems.
* Compilation Order Tool - A test scenario is provided to determine whether the library system provides a tool to list the compilation units in a valid compilation order.
* Movement of Intermediate Objects - A test scenario is included to determine if the library system provides a capability to move compilation units between libraries without recompiling the source.
* Deleting Units - A test scenario is included to determine if the library system provides a capability to delete a unit. This capability has been implicitly assumed in several other tests. The evaluator shall record whether deleting a parent unit will also delete its now obsolete subunits from the library.
* Library Creation - A scenario is provided to create and delete a program library.
* Consistency Verification - A scenario is provided to determine whether the system has a capability to verify the internal consistency of a program library.
* Library Support for Large Programs - A set of scenarios is provided to determine how well the library system supports large programs. A library system may fail to support large programs in two ways:
+ It might exceed "hard" capacity limits. For example, a library may not permit more than "N" units, where "N" is not sufficient to support the project.
+ Performance might degrade severely as the number of units grows. This would make development of large programs infeasible.
* NULL Unit Size - A scenario is provided to determine the disk size of a NULL unit.
* Generic/Non-generic Unit - A scenario is provided to determine the size of a generic unit definition, a unit which is an instantiation of this generic unit, and a (logically equivalent) non-generic version of this unit. This scenario is replicated with a small generic unit (less than a dozen statements) and a large unit.
* Hidden Space - A scenario is provided to determine whether a library system allocates "hidden space" for units which is not reflected in the sizes reported by the library query facility. For example, several systems store unit dependency information in separate files, and the space in these files is not necessarily reclaimed when the units are removed from the library.
* Modifications to a Dependent Unit - A scenario is provided to determine whether modifying a dependent unit marks as obsolete the executable programs based on the original version of the modified unit.
* Contents of the Program Library - A scenario is provided to determine whether a program library system can extract the source text of the compilation units it contains.
* Impact Analysis - This scenario determines whether the library manager provides a capability to list the units that would be affected if the named unit were modified.
* Order-Of-Elaboration List Capability - This scenario determines whether the system provides a facility to print the order in which bodies will be elaborated.
* Closure Time Estimator - This scenario determines whether the system provides a facility to estimate the time to recompile all the selected units in the closure of a specified program. It also requests the recompilation so that the estimated time can be compared with the actual time.
Because of the limited number of library test problems, the ACES Library Assessor provides for a straightforward tabular report by providing a Summary Report form reflecting the test problems. Users fill in this report, "yl_tmplt.txt", with a standard system text editor as they obtain information. There is provision for users to add descriptive comments about individual test problems.
Not all the test problems are of equal importance to the efficient use of a system, as different user organizations have different priorities. In addition to the boolean data reflecting presence of a capability, executing these test problems also generates some execution time and disk space measurement data.
This section describes the capacity assessor. This assessor:
* Tests for compile-time and run-time capacity limits of the Ada compilation system.
* Is adaptable to multiple systems.
* Tests features of the language that could be subject to limits too low for the timely and satisfactory completion of some projects.
* Uses a common command language shell for all tests on a given system.
* Uses portable Ada source generators to produce the code for the compile-time tests.
* Generates standard Ada source code to test the various system capacities.
* Provides a summary report form for recording test results.
* Uses a branch-and-bound search technique where applicable to find limits in the compile-time and run-time tests.
A readme file, "yc_readme.txt", explains the scenarios, and the programs provided for each scenario contain comments indicating the operations to be performed. The readme file is found in Appendix F of the User's Guide.
The ACES user will have to adapt the scripts to the system being evaluated. The user will typically measure maximum compilable sizes of generated Ada source code within user-specified ranges, and maximum attainable sizes of run-time constructs in supplied Ada source programs.
The individual capacity tests are as order independent as possible. However, because the success of each test scenario depends on proper interpretation of intermediate results and proper action based on those results, the scenarios must be executed in the sequence of commands supplied within each predefined script.
The capacity tests normally run without user interaction. The user should collect the operations required to run the Capacity Assessor tests into a command file or script. The supplied script records the operations performed (reducing reliance on handwritten logs), makes scenarios repeatable, and provides for unattended operation. Some systems may not allow such grouping of commands in the Capacity Assessor command scripts; on such systems, the ACES user will have to enter the commands manually and record the results.
The ACES distributes sample command files for VMS and UNIX systems. These files help document the operations the tests perform and can be used as models for adapting to other systems. The file, "yc_parms.dat", contains default testing limits for the compile-time and run-time tests.
Under normal circumstances, when a compile-time test completes it will be for one of the following reasons:
* The system accepted the maximum value specified by the user.
* The system rejected the minimum value specified by the user.
* The time constraint requested by the user expired.
* The test has found a limit for the tested feature within the user-specified bounds.
The compile-time and run-time tests obtain input from "yc_parms.dat" consisting of the test ID, minimum test size, maximum test size, maximum test duration, and a test description string. The output consists of log information written on each iteration and a Results Summary, produced at the end of each test, containing the information used to complete the Capacity Report.
Some systems may crash when a capacity is reached or exceeded. If a crash should occur, manual inspection of data files and logs produced during the test should reveal the value being tested, as well as the last successful test value.
This group of tests is aimed at finding limits in the Ada compiler and linker. Different procedures are recommended for self-targeted and cross-targeted systems. On self-targeted systems, each program should be executed so that run-time failures can be detected. On cross-targeted systems where downloading files complicates the testing process, it is recommended that the user use compiler/linker status codes (where available) and defer execution of the programs until a limit has been found. It is still important to verify on cross-targeted systems that the largest program accepted by the compiler/linker executes properly; if executing this program reveals a run-time failure, then the ACES user should modify the testing procedure to include executing each generated program, a slower but safer testing procedure.
Some limitations are due to operating system or resource limits. It may be possible to adjust system quotas to better support the compiler being evaluated. This assessor does not distinguish between system and compiler limits, but reports when limits are found in a particular configuration of a system with a certain set of resources for Ada compilations.
A branch-and-bound search technique is used to find compile-time limits for the tested features. When a range surrounding a limit has been established, a binary search is used to supplement the search algorithm.
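The overall search can be pictured with the following sketch. It is illustrative only: the hypothetical formal function "compiles" stands in for the source generator and compiler invocation performed by the command scripts, and the sketch assumes that the minimum size compiles and the maximum size fails, since otherwise no limit lies within the user bounds.
generic
   with function compiles (size : natural) return boolean;
procedure find_limit (min_size, max_size : in     natural;
                      largest_pass,
                      smallest_fail      :    out natural);

procedure find_limit (min_size, max_size : in     natural;
                      largest_pass,
                      smallest_fail      :    out natural) is
   low  : natural := min_size;   -- largest size known to compile
   high : natural := max_size;   -- smallest size known to fail
begin
   while high - low > 1 loop
      declare
         trial : constant natural := (low + high) / 2;
      begin
         if compiles (trial) then
            low := trial;        -- any limit is above TRIAL
         else
            high := trial;       -- the limit is at or below TRIAL
         end if;
      end;
   end loop;
   largest_pass  := low;
   smallest_fail := high;
end find_limit;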
The following section describes the compile-time capacity tests.
* Number of names in a compilation unit - This test generates Ada compilation units containing numerous named numbers. This test determines whether there is a compile-time limit on named objects less than the user-specified upper bound for the compiler being tested, without confounding the interpretation of name space with the target system's memory limits.
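For illustration, the generated unit takes roughly the following form (a hedged sketch with hypothetical names; the actual generator controls the count). Because named numbers are compile-time constants that occupy no run-time storage, limits found here reflect the compiler's handling of names rather than the target's memory.
package named_numbers is
   n_0001 : constant := 1;
   n_0002 : constant := 2;
   n_0003 : constant := 3;
   -- ... one named number per position, up to the trial size
   n_0500 : constant := 500;
end named_numbers;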
* Number of simple variables - This test generates Ada compilation units containing numerous variable declarations. The test may encounter limitations based on target processor memory structure. This test determines whether there is a compile-time limit on variable objects less than the user-specified upper bound for the compiler being tested.
* Number of literals in an enumeration type - This test generates Ada compilation units containing enumeration type declarations consisting of a large quantity of enumeration literals. This test determines whether a limit on the size of an enumeration type exists which is less than the user-specified upper bound.
* Number of elements in an aggregate - This test generates Ada compilation units which contain large aggregate structures. It determines if a compile-time limit exists on aggregate size less than the user-supplied upper bound.
* Alternatives in a "CASE" statement - This test generates Ada source units which contain "CASE" statements of increasing size. The test determines whether there exists a compile-time limit on the number of "CASE" alternatives which is less than the user-supplied upper bound.
* Alternatives in a "SELECT" statement - This test generates Ada source units which contain "SELECT" statements of increasing size. It determines whether there exists a compile-time limit on the number of "SELECT" alternatives which is less than the user-supplied upper bound.
* Number of constrained formal parameters - This test generates Ada source code which contains a subprogram with a multitude of parameters of a constrained type. It determines whether a limit exists on such parameters that is less than the user-supplied upper bound.
* Number of unconstrained formal parameters - This test is different from the previous one in that all parameters are of the unconstrained type STRING. It determines if a limit exists which is less than the user-supplied upper bound.
* Number of tasks - This test generates Ada source which contains simple tasks. It increases the number of tasks in the generated source, within the user-supplied bounds, to determine whether a limit exists on the number of tasks within that range.
* Number of simple operands in an arithmetic expression - This test checks for the number of simple operands in an arithmetic expression. The sequence of expressions takes the form:
X1 + X2
X1 + X2 + X3
X1 + X2 + X3 + X4
It determines whether the compiler imposes a limit on expressions which is less than the user-specified bound.
* Levels of parentheses in an arithmetic expression - This test checks for a limit which is less than the user-specified bound on the number of nested parentheses in a sequence of arithmetic expressions of the form:
X1 + X2
(X1 + X2) + (X3 + X4)
((X1 + X2) + (X3 + X4)) + ((X5 + X6) + (X7 + X8))
where each variable occurrence at level "N" is replaced by the expression "(X+Y)" at level "N+1" and all variables are assigned unique names.
* Levels of nesting of actual parameters - This test checks for a limit which is less than the user-specified bound on the depth of nesting of function calls in actual parameters in a sequence of the form:
F( X1, X2 )
F( F( X1, X2 ), F( X3, X4 ))
F( F( F( X1, X2 ), F( X3, X4 )), F( F( X5, X6 ), F( X7, X8 )));
where each variable occurrence at level "N" is replaced by the expression "F(X,Y)" at level "N+1" and all variables are renumbered.
* Number of characters in a line - This test determines how many characters the compiler will allow in a line of source code. It produces source lines of increasing length, within the bounds of the user-supplied values, to determine whether a limit on source line length exists within those bounds.
* Depth of nested subprograms - This test determines if a compiler imposes a limit on the depth of nested subprogram definitions which is less than the user-specified bound.
* Depth of nested "IF" statements - This test determines if a compiler imposes a limit on the depth of nested "IF" statements which is less than the user-specified bound.
* Depth of nested "CASE" statements - This test determines if a compiler imposes a limit on the depth of nested "CASE" statements which is less than the user-specified bound.
* Depth of nested "BLOCK" statements - This test determines if a compiler imposes a limit on the depth of nested "BLOCK" statements which is less than the user-specified bound.
* Depth of nested "ACCEPT" statements - This test determines if a compiler imposes a limit on the depth of nested "ACCEPT" statements which is less than the user-specified bound.
* Depth of nested variants in a record - This test determines if a compiler imposes a limit on the depth of nested variants in a record which is less than the user-specified bound.
* Statements in a compilation unit - This test checks for a maximum number of Ada statements allowed in a compilation unit. It generates compilation units of specified sizes and attempts to compile them. It determines whether a limit exists on the size of an Ada compilation unit within the user-supplied bounds.
* Number of non-declarative statements in a block - This test checks for a limit on the number of non-declarative statements in a block.
* Number of characters in a fully qualified name - This test determines the maximum length allowed for an Ada fully-qualified name in the compiler under test. It generates Ada programs with names of lengths within a user-specified range until such a limit, provided one exists, is determined, or until the test maximum has been proven to be supported.
* Number of executable statements in a program - This tests the ability of a system to support large programs. To avoid confounding total size with limitations on the size of blocks or the size of individual compilation units, the generated test programs are constructed from a set of small library units, none of which contains more than 100 executable statements. The structure of each procedure is trivial:
procedure procedure_1 is
begin
procedure_call(1.5) ;
procedure_call(2.5) ;
procedure_call(3.5) ;
.
.
.
end procedure_1 ;
where the body of "procedure_call" adds the formal parameter to a running total (a sketch follows). Note that using numeric literals of the form "integer.5" permits precise determinations because half-integers are model numbers and are precisely represented. To permit testing for a large number of literals, a floating point type with at least 6 digits of precision is used here. At the end of the main program the running total is verified to have the appropriate value.
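The supporting pieces might look like the following sketch (hypothetical names; the generated programs use their own naming, sizes, and expected totals):
package totals is
   type real is digits 6;                   -- at least 6 digits of precision
   running_total : real := 0.0;
   procedure procedure_call (x : in real);
end totals;

package body totals is
   procedure procedure_call (x : in real) is
   begin
      running_total := running_total + x;   -- half-integers are exact
   end procedure_call;
end totals;

with totals, text_io;
procedure check_total is
   expected : constant totals.real := 7.5;  -- 1.5 + 2.5 + 3.5 in this sketch
begin
   totals.procedure_call (1.5);
   totals.procedure_call (2.5);
   totals.procedure_call (3.5);
   if totals.running_total = expected then
      text_io.put_line ("running total verified");
   else
      text_io.put_line ("running total incorrect");
   end if;
end check_total;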
* Number of dimensions in an array - This test checks on the maximum number of dimensions permitted in an array. To avoid confounding the test with total memory capacity considerations, the bounds of the array are "one..one" where "one" is a non-static variable with the value one (1).
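A generated three-dimensional case might look like the following sketch (hypothetical names); because the bounds are "one..one", the object stays one element in size no matter how many dimensions are declared.
procedure dims_3 is
   one : integer := 1;                 -- non-static, as the test requires
   type cube is
      array (one .. one, one .. one, one .. one) of integer;
   c : cube;
begin
   c (one, one, one) := 0;             -- touch the single element
end dims_3;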
* Nested aggregates - This test checks on the depth of nested aggregates. It defines a corresponding nested object into which the aggregate is copied. The structure uses a record whose components are themselves records which contain components, and so on. Each non-lowest level record definition contains two components so that the aggregate essentially defines a binary tree. After assignment of the aggregate to a record, the generated program verifies, by a series of component by component tests, that the expected values are assigned.
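A depth-two case of the structure described above might look like the following sketch (hypothetical names; the generated tests nest much deeper and check every component):
procedure nest_2 is
   type leaf is record
      v : integer;
   end record;
   type level_1 is record
      left, right : leaf;
   end record;
   type level_2 is record
      left, right : level_1;
   end record;
   r : level_2;
begin
   r := (left  => (left => (v => 1), right => (v => 2)),
         right => (left => (v => 3), right => (v => 4)));
   if r.left.left.v /= 1 or else r.right.right.v /= 4 then
      raise program_error;             -- component-by-component check
   end if;
end nest_2;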
* Nested subunits - This test checks on the depth of nested subunits. In determining this capacity, initial test runs may leave non-deleted units in the library and exhaust the space in the program library. The objective is therefore to delete from the program library all the subunits created in each step. This may require user adaptation. The simplest approach is to have users delete the current (sub)library and create a new one for each step, or to delete all the units with a specified prefix using wildcards. Wildcard deletes are used in "yc_serch.com".
* Nested INLINE subprograms - This test checks on the depth to which INLINE subprograms can be expanded. To avoid possibly misleading results, the generated test determines whether the subprograms actually were expanded INLINE, since beyond some depth a system may stop expanding and process the calls as ordinary (non-inlined) subprogram calls. The subprograms selected are readily expanded INLINE by systems which do any inline expansion: they do not define exception handlers, they are not recursive, they declare no dynamic-sized structures, and they do not declare local storage.
The test checks for actual INLINE expansion by taking the 'ADDRESS attribute of a label within a subprogram which has been specified as INLINE and testing whether the address of a subprogram occurs between labels which bracket the call on the subprogram. The two versions of this generator are "yc_ct27.lab", for systems which support the label'ADDRESS attribute, and "yc_ct27g.gad", for targets which do not support the label'ADDRESS attribute.
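A hedged sketch of the technique follows (hypothetical names). It assumes that label'ADDRESS is supported and that SYSTEM.ADDRESS values can be compared with the ordering operators; where either assumption fails, the alternative generator mentioned above is intended for targets without label'ADDRESS, and a conversion to an integer type would otherwise be needed.
with system, text_io;
procedure inline_check is
   probe : system.address;     -- address of a label inside the subprogram

   procedure candidate is
   begin
      <<inside>> probe := inside'address;
   end candidate;
   pragma inline (candidate);

   expanded : boolean;
begin
   -- assumes "<=" is available for SYSTEM.ADDRESS on this target
   <<before>> candidate;
   <<after>>  expanded := before'address <= probe
                          and then probe <= after'address;
   if expanded then
      text_io.put_line ("call appears to have been expanded inline");
   else
      text_io.put_line ("code for CANDIDATE lies outside the call site");
   end if;
end inline_check;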
* Nested generic subprograms - This test explores the capacity of a system to instantiate nested levels of generic subprogram definitions.
* Nested generic packages - This test explores the capacity of a system to instantiate nested levels of generic package definitions.
* Size of the literal pool in a compilation unit - This test explores the capacity of a system to accept programs with an increasing number of literals. It may be limited by internal compiler restrictions or by addressing limitations on the target machine (the target must be able to access a data section containing values for all the declared literal values). To avoid complications due to targets supporting instructions with immediate operands, or compilers using substring searches (what appears to be a "new" string literal may be a substring of an existing literal), this test uses floating-point literals (which do not correspond to simple integers). The test program is one compilation unit with multiple procedures (none larger than 100 executable statements) containing code of the form "PROCEDURE_nn(literal); ...", so that results are not confounded with limitations on the size of individual blocks.
* Size of statically sized subprogram declarative region - This test explores the largest total size of objects which can be declared in a procedure's declarative region. This is not a test for the maximum size of any single object. Some target systems may impose limitations on the size of an allocated data segment which is more constrained than the memory addressable by a program (such as target machines with memory-segmented architectures). This test generator determines limits by declaring a large number of (relatively) small, statically sized objects. Some compilers use different allocation algorithms for statically and dynamically sized composite objects which result in their being allocated in different memory regions.
On most systems, when a task is created, memory space is reserved in the stack space for the procedures the task will call. The size of a declarative region can therefore vary when it occurs within a task. The Ada length clause "FOR T'STORAGE_SIZE USE SIMPLE_EXPRESSION;" provides users with some control over this size for tasks (see the sketch below). This test problem focuses on testing sizes outside the context of tasking programs, but users should expect different results for sizes when they are dealing with tasking programs.
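For example, the length clause might be applied as in the following sketch (hypothetical names; the size shown is an arbitrary illustrative value in storage units):
procedure task_storage_demo is
   task type worker is
      entry start;
   end worker;
   for worker'storage_size use 16 * 1024;   -- storage reserved per task
   task body worker is
   begin
      accept start;
   end worker;
   w : worker;
begin
   w.start;
end task_storage_demo;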
* Size of a statically-sized library package declarative region - This test explores the largest total size of objects which can be allocated within a static library package. This limit may differ from the size which can be allocated within a procedure's declarative region because library packages can be allocated statically (not on the stack) and can be subject to different constraints than the stack-based segments. The test is restricted to statically sized objects because systems may treat library packages with statically-determined sizes differently from those containing dynamically sized objects.
The run-time capacity tests, as their name implies, exercise the Ada run-time system to its limits. Each of these tests requires execution on the target system. Error conditions that occur as a result of exceeding run-time limits that do not cause system failure are handled by way of exception handlers.
The run-time capacity tests are dynamic in nature, growing larger as they execute. Where possible, these tests are designed so that they use a branch-and-bound search technique to find a range in which a limit exists on the feature being evaluated. In cases where this is not possible, the tests simply grow until a STORAGE_ERROR is raised. Precautions are taken in the test programs to handle exceptions.
The following paragraphs define the ACES Capacity Assessor's run-time tests:
* Number of elements in an array - This capacity test determines a range which includes the maximum number of array elements allowed in a dynamically-defined array. A simple array of small elements is used to conduct this test. The size of the test array depends on a parameter passed to a subprogram, which in turn declares an array of that size.
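A minimal sketch of one probe at a given size follows (hypothetical names); the outer search supplies the sizes to try. Declaring the array inside a called subprogram lets a STORAGE_ERROR raised while elaborating that subprogram's declarative part propagate to the caller's handler.
procedure array_probe (n : in positive; ok : out boolean) is

   procedure try (size : in positive) is
      a : array (1 .. size) of character;   -- size known only at run time
   begin
      a (1)    := ' ';
      a (size) := ' ';
   end try;

begin
   try (n);
   ok := true;
exception
   when storage_error =>
      ok := false;                          -- this size is too large
end array_probe;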
* Number of dynamically-created tasks - This capacity test finds the number of tasks that the run-time system allows to be created from a task type declaration at run time. The test program contains a task type definition and repeatedly creates task objects via an access type. The tasks are tracked in a list so that all of them can be properly terminated when an exception is raised as a result of such a creation (a sketch follows).
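A hedged sketch of the approach (hypothetical names; the real test also applies the user-supplied bounds from "yc_parms.dat"):
with text_io;
procedure task_count is
   task type worker is
      entry shut_down;
   end worker;
   task body worker is
   begin
      accept shut_down;                  -- idle until told to terminate
   end worker;

   type worker_ptr is access worker;
   type node;
   type node_ptr is access node;
   type node is record
      t    : worker_ptr;
      next : node_ptr;
   end record;

   created : node_ptr := null;           -- list of all created tasks
   count   : natural  := 0;
begin
   begin
      loop
         created := new node'(t => new worker, next => created);
         count   := count + 1;
      end loop;
   exception
      when storage_error | tasking_error =>
         null;                           -- creation limit reached
   end;
   declare                               -- terminate every created task
      p : node_ptr := created;
   begin
      while p /= null loop
         p.t.shut_down;
         p := p.next;
      end loop;
   end;
   text_io.put_line ("tasks created:" & natural'image (count));
end task_count;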
* Depth of subprogram or function calling - This capacity test determines the maximum subprogram calling depth allowed by the system before an error occurs. In some, and possibly all, cases the error will be a STORAGE_ERROR exception. The program for this test contains a recursive function that will attempt to achieve the maximum calling depth supported by the run-time system.
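A minimal sketch of the recursive probe (hypothetical names); the frame-local value and the arithmetic after the call are intended to discourage the compiler from reducing the recursion to a simple loop.
with text_io;
procedure call_depth is
   deepest : natural := 0;               -- deepest level reached so far

   function descend (depth : natural) return natural is
      here : constant natural := depth;  -- keep the frame non-trivial
   begin
      deepest := depth;
      return descend (depth + 1) + here; -- not a simple tail call
   end descend;

   dummy : natural;
begin
   dummy := descend (1);
exception
   when storage_error =>
      text_io.put_line ("calling depth reached:" &
                        natural'image (deepest));
end call_depth;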
* Number of non-task type dynamically-allocated objects - This capacity test determines the maximum number of objects that can be allocated in a collection via an access type at run time. The collection consists of a non-task variable type and is allocated on the heap. The number of successful allocations is reported to the user.
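A minimal sketch (hypothetical names); every allocated object remains in the collection, so the count at the point of failure approximates the collection's capacity for objects of this size.
with text_io;
procedure heap_objects is
   type block is array (1 .. 256) of integer;   -- a small non-task object
   type block_ptr is access block;
   p     : block_ptr;
   count : natural := 0;
begin
   begin
      loop
         p := new block;                         -- allocate on the heap
         count := count + 1;
      end loop;
   exception
      when storage_error =>
         null;                                   -- collection exhausted
   end;
   text_io.put_line ("objects allocated:" & natural'image (count));
end heap_objects;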
* Size of a stack-declared array - This test determines the maximum size of an array in terms of bits. It is similar to the run-time test, "Number of elements in an array", but instead of counting the number of array elements, it calculates the number of bits used by the stack-declared array. It reports the size of the largest array prior to causing an error condition. The array in this test is declared in the declarative region of a subprogram.
* Size of a heap-declared array - This test determines the maximum size of an array allocated on the heap via a NEW operator in terms of bits.
* Size of a library-declared array - This test is similar to the previous two tests, but deals with arrays declared in a library package. Because of limited control over exceptions raised while elaborating a library package declarative region, the size increases by powers of two from the starting value and is therefore a coarser test than the others. The size increases until a STORAGE_ERROR occurs or the test limit is exceeded.
* Size of a collection - This test explores the largest total size of objects which can be allocated within the unnamed collection (the global heap). Some target systems may have limitations on the size of this data segment that are more constrained than the total amount of data addressable by a program. The test generator determines limits by allocating a large number of (relatively) small objects.
* Size of a data segment - This test checks for the possible existence of segment size limits. It starts by determining the smallest array which is too large to be handled, and then declares two arrays of half that size (so that their combined size exceeds the size of the largest array which was accepted). If the two smaller arrays also fail, then we can safely assume that the failure is due either to a limit on the total memory available to the program or to a segment limit. If the two halves are accepted, we may have found a limit on object size for a simple array, but have no information on possible segment size limits greater than the object size limit (see the sketch below).
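A hedged sketch of the follow-up step, assuming a hypothetical constant "half" equal to half of the smallest single-array size that failed; if this unit also fails, the limit is on total memory or segment size rather than on individual object size.
procedure segment_check is
   half : constant := 100_000;   -- illustrative value only
   type vec is array (1 .. half) of integer;
   a, b : vec;                   -- two objects, each half the failing size
begin
   a (1)    := 0;
   b (half) := a (1);
end segment_check;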
Note that no array can have more elements than its index type enumerates. This limit may come into play before limitations on object size. The test limits are given in "yc_parms.dat".
A report form, "yc_tmplt.txt", is provided in which to record the limits found by using the Capacity Assessor. The report form provides space to enter the test parameters (minimum_value, maximum_value, and elapsed time); whether the time limit was reached (yes or no); whether a limit was found (yes or no); the smallest value to fail; the largest value to pass; and a comment field for the user's observations of the test.
The detailed description of each test problem is included in the VDD. That document contains the following appendices:
APPENDIX  NAME                             CONTENTS
A         Test Problem Descriptions        List of test problem names with a brief description of each. New or withdrawn tests are identified.
B         Test Problem to Source File Map  List of test problems and the source file they are contained in.
C         Distribution Description         List of files on the distribution media.
D         System-Dependent Test Problems   List of test problems which exercise system-dependent features.
E         Debugger Assessor                Descriptions of the scenarios.
F         Diagnostic Assessor              Descriptions of the scenarios.
G         Capacity Assessor                Descriptions of the scenarios.
H         Library Assessor                 Descriptions of the scenarios.
I         New and Modified Tests           List of new performance tests for this release and a list of modified tests from ACES Version 2.0.
The ACES is designed to be easily extendible. If an organization created its own unique test problems, or modified existing test problems, these new problems would not be described in the documentation distributed with the ACES.
Test problems which fail to execute will be flagged on the CA output. There are several reasons why problems may fail, not all of which are due to serious problems in the compilation system being tested. Reasons include:
* The test problem could uncover a compiler error.
* The compiler could generate incorrect code for the test problem, which then fails at execution time.
* The test could have been skipped, with no attempt made to execute it. When new test problems are added to the test suite, this will be the status of those problems on systems not yet tested. An organization with only a limited time for testing might not attempt to execute all the ACES test programs. Data for all test problems skipped for lack of time would be unavailable.
* The test may be inappropriate. For example, the tests for file I/O are not appropriate for targets which do not support devices capable of maintaining a file system. The data for some system-dependent test problems may not be available for the following reasons:
+ The tests are not supported on the target. A test may use a floating point type with more precision than the target supports, or it may use an unsupported feature of Chapter 13 of the LRM; for example, several validated systems do not support the attribute label'ADDRESS. There are a few test problems which require checking to be suppressed; dr_rd_rec_discr_03 refers to a discriminated record with an invalid discriminant and intentionally violates a range check, assuming it can reference an 8-bit field as if it were an unsigned byte with a range of 0..255.
+ The tests require modifications which were not performed. An example is the set of tasking test problems which tie hardware interrupts to ACCEPT statements. These problems require implementation-dependent modifications to execute, assuming that the target system supports the feature at all. Until this modification is performed, they cannot be executed as intended. Similarly, the test problem which calls an assembly language program will need to be modified for each target to adapt to the interface conventions of the target.
* A test may execute, but produce results which are considered unreliable. When the timing loop is unable to achieve the requested statistical confidence level within the maximum number of permitted outer timing loop cycles, the measurement is considered unreliable. The Condense tool which extracts measurements examines the confidence level indicator and, if it is set, inserts in the database a negative numeric value which is recognized by CA as indicating an unreliable measurement.
An unreliable measurement does not imply that the compilation system did not execute the test problem properly. It indicates that the timing loop did not measure the execution time reliably. Simply rerunning the test program is often sufficient to produce a reliable measurement; there may be less contention when the test is rerun.
* The test program may have failed to verify the NULL loop time. The time for the NULL loop is computed during the Pretest and verified in each test program execution. If verification fails, an error code is written. This usually means that there was contention on the system and will often be resolved if the program is rerun.
* The problem may have been estimated to take excessive time and not been attempted. This is possible for the tests in the SR group IM subgroup.
Any test problem which retains the same name between releases of the ACES shall be comparable. Modifications to a test problem between releases must not change the performance of that test problem. This convention strictly limits the type and extent of possible modifications. The ACES is maintained so that historical data from earlier releases of the ACES remains comparable with the most current release.
If and when a test problem is determined to be incorrect, and correcting it would imply a possible performance modification, the test will be withdrawn and a corrected version entered with a new name.
ACES users have two formal paths to provide feedback to influence future ACES development. They can submit written problem reports and they can write change requests. No telephone support is provided. Written problem reports and change requests will be accepted as described below.
The procedure to request changes in either operations or in interpretation is the same. Readers and users may submit different types of requests: Readers would be likely to request modification to analysis output, or the addition of new test problems (or areas which should be tested). Users may request changes in the packaging of problems into programs, or modifications to control procedures.
The depth of detail of a change request may vary. Users may request the incorporation of a new test problem (which is submitted for consideration), or there may be a less specific request asking for more emphasis on some areas of concern. The more specific a request is, the easier it will be to respond to. The Change Request will be logged and evaluated, and a determination will be made.
After completing the form on the next page, mail it to:
Brian Andrews, HOLCF Technical Director
88 CG/SCTL
3810 Communications, Suite 1
Wright-Patterson AFB OH 45433-5706
E-mail: andrewbp@msrc.wpafb.af.mil
ADA COMPILER EVALUATION SYSTEM
CHANGE REQUEST
Originator's Name ________________________________________________
Organization ________________________________________________
Address ________________________________________________
Telephone ________________________________________________
Date ________________________________________________
SYSTEM IDENTIFICATION
ACES VERSION ______________________________
Compilation System Version ______________________________
Host Operating System Version ______________________________
Target Operating System Version ______________________________
Hardware Identification ______________________________
(If a test program is submitted for incorporation into the ACES, identify where it has been tested)
CHANGE DESCRIPTION AND JUSTIFICATION
_______________________________________________________________
_______________________________________________________________
_______________________________________________________________
_______________________________________________________________
_______________________________________________________________
_______________________________________________________________
_______________________________________________________________
_______________________________________________________________
_______________________________________________________________
(attach more pages if necessary)
This section tells the user how and where to report problems with the ACES. Problem reports will be logged and evaluated and a determination will be made.
After completing the form on the next page, mail it to:
Brian Andrews, HOLCF Technical Director
88 CG/SCTL
3810 Communications, Suite 1
Wright-Patterson AFB OH 45433-5706
E-mail: andrewbp@msrc.wpafb.af.mil
ADA COMPILER EVALUATION SYSTEM
SOFTWARE PROBLEM REPORT
Originator's Name ________________________________________________
Organization ________________________________________________
Address ________________________________________________
Telephone ________________________________________________
Date ________________________________________________
SYSTEM IDENTIFICATION
ACES VERSION ______________________________
Compilation System Version ______________________________
Host Operating System Version ______________________________
Target Operating System Version ______________________________
Hardware Identification ______________________________
(if a test program is submitted for incorporation into the ACES, identify where it has been tested)
PROBLEM DESCRIPTION
Source File with Problem ___________________________________________________
Explanation _____________________________________________________________
_____________________________________________________________
_____________________________________________________________
_____________________________________________________________
_____________________________________________________________
_____________________________________________________________
_____________________________________________________________
_____________________________________________________________
_____________________________________________________________
(attach more pages if necessary)
ACEC Ada Compiler Evaluation Capability
ACES Ada Compiler Evaluation System
ACVC Ada Compiler Validation Capability
CA Comparative Analysis
CISC Complex Instruction Set Computer
CPU Central Processing Unit
CRC Cyclic Redundancy Check
DEC Digital Equipment Corporation
HOLCF High Order Language Control Facility
I/O Input/Output
LRM (Ada) Language Reference Manual (ANSI/MIL-STD-1815A)
MCCR Mission Critical Computer Resource
MIP Million Instructions Per Second (a measure of hardware speed)
MIS Management Information System
NUMWG Numerics Working Group (ACM SIGAda organization)
RAM Random Access Memory
RISC Reduced Instruction Set Computer
RM (Ada 95) Reference Manual (ISO/IEC 8652 (1995))
ROM Read Only Memory
RTS Run-Time System
SSA Single System Analysis (ACES analysis tool)
VAX Virtual Address eXtension (DEC family of processors)
VDD Version Description Document
VLIW Very Long Instruction Word
VMS Virtual Memory System (DEC operating system for VAX processors)
Cross Reference Index for ACES Document Set
Acronyms, Abbreviations
Primer 9
RG 13.1
UG 11
Adding/Modifying Tests
RG 6.8
UG 5.4.3, 9.1.6
Addressing
Primer 3.2.2, 3.2.5
RG 2.4.3, 6.7, 8.4.1, 10
UG 4.3.3.3, 5.1.1, 5.1.2
Analysis, Running
Primer 1.2, 2.1.3, 3.2.4, 3.2.11, 3.3, 4, 4.1.1, 5, 6.4.1, 6.5,
6.5.2, 7.3.3
RG 2.4.2.3, 2.4.2.4, 2.4.2.5, 5.1
UG 7, 9.1, 9.2, 9.3, 9.4, 9.5, 9.6
Capacity Assessor
Primer 6.1, 6.5, 6.5.1, 7.3.4
RG 2.4.3, 3.6.4, 8.4
UG 4.3.3.2, 8.4, App. F
Code Size
Primer 1.4.1, 1.4.2, 3.2.2, 3.2.5, 5.3.2, 7.2.1, 7.2.2
RG 5.2.2.1.1, 5.2.2.2.1, 5.2.2.3.1, 5.4.2, 5.4.2.2, 5.4.4.1
UG 4.3.3.3, 5.1.1, 5.1.2, 9.3, 9.3.2, 9.4
Comparative Analysis
Primer 3.2.14, 5.2, 5.2.1, 7.1
RG 2.4.2.3, 5.3, 7.2
UG 9.4
Compatibility of Test Suite
RG 11
UG 4.3.2, 4.3.3.3, 5.2, 5.1.4, 5.1.6.1, 9.4.3, 10.1
Condense
Primer 3.2.13, 5.1
RG 5.2
UG 9.1.7, 9.3
CPU Time
Primer 3.2.2, 3.2.3, 3.2.4, 3.2.5
RG 6.3.2.6, 6.4.2.2, 6.4.2.3
UG 4.3.3.3, 5.1.1, 5.1.2, 5.1.3, 5.4.1
Data Summary Table
Primer 7.1.2
RG 5.3.2.2.3
Decision Issues
Primer 1.4, 1.4.1, 1.4.2, 1.4.3, 3, 3.1, 7.2.2
RG 3.2.6.1, 3.6.2, 5.3, 5.3.2.2.6, 5.3.2.2.7, 5.4.4.3, 6.4.1,
7.2, 7.3
UG 5.1.1, 5.1.6.3, 5.4.3, 6.1, 6.3, 6.9.6, 6.10.7, 9.1.1, 9.3.7,
9.4.3, 9.6
Diagnostic Assessor
Primer 6.3, 7.3.2
RG 2.4.3, 3.6.3, 8.2
UG 2.2, 4.3.3.2, 8.2, App. D
Erroneous Tests
RG 10
UG 6.10
Exceptional Data Report
Primer 3.2.13, 5.1
RG 5.2.2.2
UG 9.3.3
File Name Conventions
Primer 2.1
RG 5.2.2
UG 4.3.3.1.1
Globals
Primer 3.2.1, 3.2.5, 6.5, 6.5.1
RG 3.2.5.2, 8.4.2
UG 4.3.3.3
Harness
Primer 2.1.2, 3.2.11, 4.1, 4.3
RG 5.2
UG 6.0
History (ACES)
Primer 1.1
RG 2, 2.1, 2.2, 3.1, 7.1
UG 2
Include
Primer 2.1, 3.2.5, 3.2.10, 4.3.2
RG 4.3.3.1.1, 5.5.2, 6.5, 6.11, 9.1.6
Interfacing Spreadsheets
Primer 1.1, 5.1, 5.2.4, 7.1.2
RG 2.3, 9.3, 9.3.1, 9.3.2
UG 2.3, 9.3
Interpreting Results
Primer 1.2, 3, 4.2, 7
RG 5.3.2.2.1, 7, 6.1
UG 2, 2.1, 4.2, 9.3.4.2
Level of Effort
Primer 6.5.1.3, 6.5.4
RG 2.4, 7.2, 7.6, 8.1
UG 6.5.1.3, 6.5.4
Math Implementation
Primer 3.2.5
RG 3.2.3
UG 5.1.4, 5.1.6.1
Operating System Commands
See Unix Commands
Optimization
Primer 5.3.1, 5.3.3, 7.2.2
RG 6.2, 6.3.2, 6.6
UG 5.1.6.2, 9.1.1
Output
Primer 1.2, 1.4.1, 3.1.1, 3.2.1, 4.3.1, 7.1.1
RG 5
UG 7.3, 9.1, 9.2.2, 9.3.3, 9.4.1, 9.5.1
Performance Tests
Primer 1.2, 2.1.3, 4, 4.1
RG 5.4.2.2, 5.4.4.4, 8.3.1
UG 4.3.2, 7.0
Pretest
Primer 3.2, 3.2.1 - 3.2.15
RG 3.2.3, 6.3.1, 6.5, 10
UG 5.2, App. B
Program Library Assessor
Primer 6.3, 6.4, 6.4.1, 7.3.3
RG 2.4.3, 3.6.2, 8.3
UG 4.3.3.2, 8.3, App. E
Referenced Documents
RG 1, 1.1, 1.2
UG 1, 1.1, 1.2
Reports
Primer 4.2, 6.5.2
RG 2.4.2.3, 2.4.2.4, 2.4.2.5, 2.4.3, 3.1, 3.2.4.2, 3.2.7.4, 3.6,
4, 5
UG 4.2.1, 6.5.3.3, 6.10.7, 6.11, 7.1, 7.3, 8
Resources Needed
Primer 3.5, 5.1, 6.3.1, 6.5.2
RG 3.2.7.3, 3.4, 3.6.2, 3.6.4, 8.2.1, 8.3.1
UG 4.2, 4.2.1
Setup
See Pretest
Simulators
RG 6.9
UG 5.4.1
Single System Analysis
Primer 3.2.15, 5.3, 7.2
RG 5.4
UG 9.5
Symbolic Debugger Assessor
Primer 6.2, 6.2.1, 6.2.2, 7.3.1
RG 2.4.3, 3.6.1, 8.1
UG 8.1, App. C
Testing Scenarios
Primer 1.4
RG 2.1.3.1, 3.2.1, 3.2.3, 3.6.2, 5.1, 5.4, 6.3.2.6, 8.1, 8.2,
8.4, 10
UG 4.2.2, 6.11
Timing Techniques
Primer 2.1, 2.1.3, 3.1, 3.2.5, 3.2.10, 6.4, 7.1.2
RG 6.0, 6.2, 6.3, 6.3.1, 6.3.2, 6.3.2.5, 6.7, 7.3
UG 7.3
Usability
Primer 1, 1.2
RG 2.1, 2.4.2.2, 3.1, 3.2.1, 3.2.3, 3.6.2, 5.1, 5.4, 6.3.2.6,
8.1, 8.2, 8.4, 10
UG 4.2.2, 5.3, 5.1.6.1.3, 6.11
User Adaptation
RG 2.4
UG 6.11, 9.6
User Feedback
RG 12
UG 10