LANGUAGE STUDY NOTE LSN-030-DR

STUDY NOTE ON "FOR ALL" EXTENSION FOR ADA

Robert B. K. Dewar
16 Apr 1991, 05:48

This note follows up on the suggestion made at the mapping workshop by
the Massively Parallel Architectures group for implementation of
loop-level parallelism.

The basic idea was to allow an "all" keyword in a loop, which would give
the implementation freedom to execute the loop in an arbitrary serial
order, in parallel, or in some mixture of the two approaches. A key part
of the idea is that normal serial implementations be permitted to ignore
the "all" completely and treat the loop as a normal serial loop.

Note: our original syntax during this discussion was:

   for all I in 1 .. N loop

but Tucker subsequently proposed:

   for I in all 1 .. N loop

which has the advantage of making clear that "all" and "reverse" do not
belong together in the same loop, and is also syntactically parallel
with the existing reverse syntax. I have adopted Tucker's syntax for the
rest of this note, since it seems preferable, but this is not a major
point.

Background
----------

A number of other languages, most notably parallelized versions of
FORTRAN and C, have introduced parallel loop constructs (in parallel
FORTRAN, the construct is DO ALL) to provide loop-level parallelism. The
essential elements are a loop structure which allows the iterations of
the loop to be executed in parallel, and a mechanism for providing
replicated variables (i.e. variables which have a separate instance for
each iteration of the loop).

Vectorizing compilers can often detect an opportunity in a language like
FORTRAN for vectorizing a simple loop like:

      DO 10 I = 1, N
         A(I) = B(I) * C(I)
   10 CONTINUE

However, in more complex loops, it may be difficult for the compiler to
determine that a loop can be vectorized (for example, if an external
routine is called, it is not possible to tell whether or not that
subroutine has side effects).
There are also problems caused by temporary variables, as in:

      DO 10 I = 1, N
         T = E(I) * F(I)
         A(I) = B(I) * C(I) * T
   10 CONTINUE

where normal FORTRAN semantics provide only one copy of T, which
furthermore must be left set to the value from the last iteration. This
tends to make it harder to parallelize the loop. In the case of calls to
external subroutines, the replicated-variable problem becomes more
important (because of local variables within the called subroutine).

The provision of syntax in parallel FORTRAN to specify explicitly that
the compiler may execute a loop in parallel makes it easier for the
compiler to take advantage of parallel architectures, and is also
arguably desirable from a specificational point of view, since many of
these loops are not inherently serial from a conceptual point of view.

The Situation in Ada
--------------------

The problem of replicated variables is not an issue in Ada, since we
have local block declarations. For instance, the second example above
can be written in Ada as:

   for I in 1 .. N loop
      declare
         T : LONG_FLOAT := E(I) * F(I);
      begin
         A(I) := B(I) * C(I) * T;
      end;
   end loop;

which seems a perfectly satisfactory method for indicating the
equivalent of "replicated variables". Furthermore, in the case of
externally called procedures, the procedures are in any case prepared
for possible parallel execution by separate Ada tasks.

However, Ada presents a more severe difficulty in the form of its
exception semantics. The canonical semantics of Ada require this example
to be executed one iteration at a time, and all subsequent iterations to
be abandoned if an exception occurs. The permitted relaxations in RM
11.6 are recognized as inadequate to address this problem (see the
discussion of Study Topic S7.3-A(1) in the Requirements document).
Suppressing exceptions, or implementing the compiler with
FLOAT'MACHINE_OVERFLOWS set to FALSE, are possible workarounds, but are
clearly undesirable as long-term solutions.
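As a purely illustrative sketch (using the "for all" syntax proposed
later in this note, and not part of the formal proposal), the two
mechanisms combine naturally: the declare block supplies the replicated
temporary, while the "all" keyword releases the serial ordering:

   for I in all 1 .. N loop
      declare
         -- One instance of T per iteration, so no serial dependence
         -- arises from the temporary.
         T : LONG_FLOAT := E(I) * F(I);
      begin
         A(I) := B(I) * C(I) * T;
      end;
   end loop;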
In any case, the other arguments mentioned above for the introduction of
explicit loop-level parallelism also apply to Ada:

   o It is conceptually desirable to indicate that a loop is not
     required to be executed serially if the loop is essentially a set
     operation.

   o It may be difficult, and in some cases impossible, for a compiler
     to determine that a loop can be executed in parallel. Even when it
     is possible for a compiler to make the determination, such
     compilers are complex, and the explicit syntax allows a much
     simpler approach.

The Basic Proposal
------------------

The basic proposal is as follows. The syntax in 5.5 is modified:

   loop_parameter_specification ::=
      identifier IN loop_control_specifier discrete_range

   loop_control_specifier ::= [REVERSE] | ALL

The iterations of a loop for which the ALL keyword is specified may be
executed in any serial order, or in parallel, or in some combination of
the two approaches (for example, if there are 1000 iterations, then ten
separate threads of control may each execute 100 iterations, choosing
any serial order within each thread of control).

This is of course an informal statement, and we note that there are a
number of delicate semantic points. Some of these are raised in a note
from Jim Hassett, sent by email on April 5th; a copy of this email note
is included as an appendix. Among the issues are:

   o Tasking constructs, particularly accept statements and entry calls

   o Transfers of control (goto, exit, return, exceptions)

These will be addressed in separate sections.

A key part of the proposal is that simple serial implementations can
entirely ignore the ALL keyword, since any semantic model must include
the possibility of simple serial execution of the iterations in order,
i.e. a semantic interpretation identical in all respects to the normal
semantics of the loop in the absence of the ALL keyword.

Note: a simple serial implementation may still be able to take advantage
of the ALL to improve the code in some situations.
For example, in the loop:

   for I in 0 .. N loop
      A(I) := B(I) * C(I);
   end loop;

it is the case on most serial architectures that the loop will run
faster backwards. Sometimes one will even see Ada programs include a
reverse keyword solely for the purpose of improving the generated code:

   for I in reverse 0 .. N loop
      A(I) := B(I) * C(I);
   end loop;

However, if the proposed extension is adopted, programmers should be
encouraged to use the ALL keyword freely; indeed, only loops where
serial execution is essential should omit the ALL. In this case the loop
would be written:

   for I in all 0 .. N loop
      A(I) := B(I) * C(I);
   end loop;

and a compiler for a serial machine would be free to reverse the order
of the loop if this improved efficiency. More complex examples involving
nested loops and efficient use of cache memories can be constructed.

Relation to Requirements
------------------------

Study Topic S7.3-A(1) reads: "Statement Level Parallelism: Ada 9X should
accommodate compiler techniques for efficiently mapping sequences of Ada
statements, including particularly appropriate loops, onto vector
architectures."

Study Topic S7.2-A(1) reads: "Managing Large Numbers of Tasks: Ada 9X
shall provide for the efficient creation, initialization, and
termination of large numbers of tasks."

S7.3-A(1) is directly addressed by this proposal. In a loop such as:

   for I in all 0 .. N loop
      A(I) := B(I) * SQRT (C(I));
   end loop;

the potential exists for easily vectorizing the loop on an appropriate
architecture, or distributing it across multiple processors in a MIMD
architecture. The one issue concerns exceptions: when we address
exception semantics later on, we must be sure that they are formulated
in such a way as to avoid inhibiting vectorization of such loops.

The discussion of S7.3-A(1) specifically mentions two possibilities:
automatic recognition of parallelism, and the introduction of
specialized loop constructs.
The "for all" proposal clearly addresses the second of these
possibilities. An issue not further discussed in this LSN is whether the
flexibility provided by "for all" is sufficient to meet the user needs,
or whether the further facilitation of automatic recognition is still
needed. Note also that the issue of subprograms having side effects can
be addressed by providing mechanisms for annotating the specifications
of subprograms to indicate that they are side-effect free. Although this
is not specifically addressed in the Requirements document, it can be
seen as a special case of Requirement R9.3-A(1) (Allow Additional
Compile-Time Restrictions), and furthermore the initial proposed mapping
document suggested such a facility.

Regarding Study Topic S7.2-A(1), the discussion of this topic indicates
that the main problem lies in giving tasks separate identities, since
the necessary rendezvous calls in Ada-83 can constitute a serial
bottleneck. The proposed "for all" facility provides a potential
solution:

   for I in all 1 .. N loop
      TASKS (I).INITIATE (I);
   end loop;

This allows all the calls to the INITIATE entries in the vector of tasks
to be made simultaneously, avoiding the serial bottleneck. If the Ada 9X
proposal for task discriminants is adopted, then this can also be
written:

   for I in all 1 .. N loop
      declare
         T : WORK_TASK (I);
      begin
         null;
      end;
   end loop;

However, the provision of task discriminants does not solve the problem
in the absence of the "for all", and the second form is not clearly
advantageous, so the need for task discriminants does NOT appear to be
established by this requirement, at least in the presence of the "for
all" proposal.

Note: the current mapping document proposes a syntax for initialization
of arrays using an implicit loop notation, which is suggested as a
solution to the problem of giving tasks identities.
However, it would appear that this implicit loop construct also needs a
"for all" possibility if the serial bottleneck is to be avoided, so
again the requirement for task identification does NOT appear to be a
justification for the implicit loop initialization construct.

Handling Exceptions
-------------------

There are two issues to be addressed. First, if an exception is raised,
what is the status of the other iterations of the loop? Second, what if
more than one exception is raised on separate iterations?

The general approach should be to give as much flexibility as possible.
The idea is that a programmer is satisfied to know that something has
gone wrong somewhere in the loop, but does not need to know exactly
where, and is willing to regard the whole state of the loop as undefined
in some sense. Although this seems a little vague compared to normal Ada
canonical semantics, it is arguably conceptually reasonable for a
high-level view of the loop as a concerted set operation. From a
pragmatic point of view, it is important to allow this much freedom,
since there are vector architectures which provide essentially this free
semantics at the hardware level (i.e. the operation indicates success or
failure, but does not provide detailed information as to the location of
the error).

An informal statement of the desirable semantics might be something like
the following:

   If an exception occurs during one iteration, then the execution of
   the loop is abandoned, and other iterations may be in any of the
   following states:

      o not started
      o completed
      o abandoned after partial completion

   If more than one exception occurs on separate iterations, then the
   execution is abandoned as described above, and one of the exceptions
   (chosen non-deterministically) is propagated.

The difficulty will lie in precisely defining what it means to abandon a
partially executed iteration.
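An illustrative sketch of these informal semantics (the names COMPUTE
and RECOVER are invented for the example) shows what a handler enclosing
a "for all" loop could and could not assume:

   begin
      for I in all 1 .. N loop
         A(I) := COMPUTE (B(I));   -- any iteration may raise an exception
      end loop;
   exception
      when CONSTRAINT_ERROR =>
         -- Execution of the loop has been abandoned: other iterations
         -- may be unstarted, completed, or partially completed, so the
         -- elements of A cannot be assumed meaningful here. If several
         -- iterations raised exceptions, the one propagated was chosen
         -- non-deterministically.
         RECOVER;
   end;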
One possibility is to model the possible threads of control as Ada tasks
and appeal to the abort semantics for this abandonment. This has the
desirable property of allowing complete freedom (in particular,
variables being assigned can become undefined), but on the other hand we
have never clearly defined the semantics of abort in Ada (for example,
in some implementations abort can cause storage leaks -- is this
permissible?). Since abort is regarded as an unusual emergency situation
in Ada-83, leaving it imprecisely defined may be acceptable. However,
raising an exception in a "for all" loop is not an emergency situation,
and we certainly would not tolerate storage leaks as a result.

Note: here we are talking about storage leaks resulting from failure to
free storage implicitly allocated by the compiler -- if the iterations
explicitly allocate storage, then of course the programmer must expect
that this storage may be lost, a situation which is just an extension of
normal exception behavior.

Finally, if Ada 9X includes finalization, the interaction of the
abandonment with finalization actions must be considered.

The difficulty here is not so much the implementation, since after all
an implementation need only complicate things as much as is desirable to
take advantage of the architecture, and the exception semantics are
essentially designed to allow the implementation to do anything
reasonable at the level of the generated code. The difficulty lies in
providing a reasonable semantic explanation of the set of allowed
behaviors. This LSN does not further address this issue.

Other Control Transfers
-----------------------

As Jim Hassett points out in his note, it would be possible simply to
disallow transfers of control out of the loop. However, this is an odd
restriction, and furthermore it unnecessarily complicates serial
implementations, since they could then not simply ignore ALL (they would
at least have to perform the required semantic checking).
It seems likely that once the exception semantics are worked out, it is
easy enough to allow transfers of control outside the loop (since after
all, exceptions are simply a special case of this general semantics).
The statement is something like:

   If a GOTO, EXIT, or RETURN transfers control outside a loop for which
   the ALL keyword is specified, then execution of the loop is
   abandoned.

where the concept of a loop being abandoned is as described for the
exception case above.

Tasking Constructs
------------------

There seems to be no particular problem in allowing entry calls, and
indeed it is essential to allow them if this proposal is to address
S7.2-A(1). One point that needs to be made is that the possibility of
entry calls means that we cannot simply say that a FOR ALL loop is
executed either:

   o serially, in some non-deterministically chosen order, or

   o in parallel, with each iteration modeled as a task

because we want to allow the mixed possibility, and it is possible,
using entry calls, to write a program which can distinguish the mixed
case from either of the pure cases given above. In other words, the set
of allowed behaviors must be greater than the set implied by the two
"pure" semantics given above. That is why the original rule explicitly
mentioned the mixed possibility. This is annoying, since it complicates
the semantic description, but it has no effect on the implementation
difficulties.

The business of accept statements is more problematical. Jim Hassett
gives the following example:

   task T is
      entry A;
      ...
   end T;

   task body T is
   begin
      for J in all 1 .. 10 loop
         accept A;
         ...
      end loop;
      ...
   end T;

and asks what the semantics are. The difficulty is that if the
iterations are modeled as separate tasks, they would not in any case be
permitted to issue the accept statements (since they would be for other
than the current task).
Statically disallowing accept statements in FOR ALL loops is one
possible approach, but it does add a (relatively slight) burden to
serial implementations which simply want to ignore the ALL keyword. One
possibility is to allow such accept statements to be rejected at
execution time with an exception. This might be attractive if it were
done in the context of a general permission to issue accept statements
in procedure bodies, with the understanding that execution of such
accept statements may raise an exception dynamically if they refer to
other than the currently executing task.

The other approach is to attempt to specify an appropriate semantics.
The use of accept does of course limit the possible parallelism, since
we certainly assume that the same entry cannot be accepted by two
separate iterations at the same time. But it may not be too difficult to
specify a semantics which does permit the accept with this notion of
serialization.

It does seem to make reasonable semantic sense to use an accept
statement in a FOR ALL loop. Consider the loop:

   for I in 1 .. N loop
      accept GET_DATA do
         A(I) := DATA_ITEM;
      end;
   end loop;

Now of course this loop has serial semantics in Ada-83. But suppose that
conceptually A really represents a set, and the order in which the items
are obtained is not important. In this case we would prefer the
programmer to be able to write:

   for I in all 1 .. N loop
      accept GET_DATA do
         A(I) := DATA_ITEM;
      end;
   end loop;

to indicate this fact clearly. For this particular loop, it is likely
that the serialization implied by the accept would negate any efficiency
advantage, but in a more complicated loop:

   for I in all 1 .. N loop
      accept GET_DATA do
         A(I) := DATA_ITEM;
      end;
      -- Now do a bunch of calculations on A(I)
   end loop;

the parallelism would provide a potential implementation advantage. To
summarize, it is desirable to accommodate the possibility of accept
statements in the semantics if this can be done without excessively
complicating them.
Rejecting accept statements, either statically or dynamically, in FOR
ALL loops seems undesirable from both uniformity and usability points of
view.

Summary
-------

The proposed FOR ALL construct seems to provide useful approaches to
satisfying the user needs identified in sections 7.2 and 7.3 of the
Requirements document, as reflected by Study Topics S7.2-A(1) and
S7.3-A(1). In particular, the issue of identifying tasks without
introducing serial bottlenecks seems to be adequately addressed by this
proposal, which appears to reduce or even eliminate the need for either
task discriminants or the implicit loop initialization construct.

The impact on implementations is minimal. Serial implementations may
simply ignore the ALL, and parallel implementations may do only what is
desirable to take advantage of the particular architectural
possibilities of the parallel target architecture. The semantics are
intended to be sufficiently flexible that any reasonable architecture
can be accommodated.

The principal difficulty lies in providing a reasonably simple semantic
specification at the language description level. Given that it is
desirable to encourage the widespread use of ALL in loops to indicate
conceptually a lack of serial dependency, it is preferable that these
semantics be embodied in the main part of the language, rather than
relegated to an annex.

Assuming that it is possible to provide a relatively straightforward
semantic description, the FOR ALL feature seems to be an attractive
candidate for inclusion in Ada 9X. In particular, it provides important
additional functionality on parallel architectures, without any adverse
impact on existing implementations for serial architectures.
Robert Dewar

Appendix: Email note from Jim Hassett, 5 Apr
--------------------------------------------

At the Mapping Workshop, the Massively Parallel Architectures group
called for the mapping team to examine the possibility of adding a
"for-all" parallel loop construct to Ada. The intent was to address the
concerns of Study Topic S7.3-A(1) (Statement Level Parallelism) with
minimal impact on the language design. I participated in the discussions
of the massive-parallelism group, and fully support the suggested
construct. Precise definition of this construct is not quite as simple
as it might seem, however, and I want to raise a few issues. These deal
primarily with the semantics of parallel execution and restrictions
which may be needed regarding this construct.

The suggestion was that the syntax of the for loop could simply be
extended by inserting the optional keyword "all". The presence of the
"all" keyword would indicate that the separate executions of the
sequence of statements of the loop (each with a different value of the
loop parameter) could be carried out in any order or in parallel. Since
this would allow serial execution in normal for-loop order, any
implementation could ignore the "all" keyword if it did not support
parallel processing. The idea of allowing a for-all loop to be treated
as an ordinary serial loop is crucial to minimizing the cost of this
language change. It allows the change to be handled by a simple
front-end modification.

For parallel execution, it is probably best to propose a canonical
execution model in which each execution of the loop body is performed by
a separate task. This takes advantage of existing semantics for parallel
execution to deal with issues related to task interactions, shared
variables, and scheduling. No new parallel-execution concepts are needed
(though some mechanism is needed for communicating the loop parameter to
each task).
Actual implementations can take advantage of optimizations applicable to
the tasks, which are entry-less and anonymous, so they cannot be called
or directly aborted, and needn't support 'Callable or 'Terminated.

But we have to reconcile the serial and parallel execution models, to
minimize differences in semantics. Consider this example:

   task T is
      entry A;
      ...
   end T;

   task body T is
   begin
      for all J in 1..10 loop
         accept A;
         ...
      end loop;
      ...
   end T;

This is a plausible construct for serial execution, but it is not clear
(to me, anyway) how to define the semantics of parallel execution of
accept statements. If parallel execution is modeled by distinct tasks,
such accept statements would have to be prohibited, since the body of
the loop is not treated as part of the sequence-of-statements of the
task T, but is contained by an implicit task body which has no entries.

Similarly, transfers of control are a problem: return, exit, and goto
statements inside for-all loops must either be prohibited or given
special semantics for parallel execution (e.g., any transfer of control
out of a parallel for-all loop could implicitly abort all of the other
loop-body tasks).

A simple approach to these problems is to prohibit accept, return, exit,
and goto statements inside for-all loops, on the grounds that parallel
execution is not meaningful. But this would require serial
implementations to reject certain loops which would be allowed without
the "all" keyword. An alternative would be to require serial execution
of any for-all loop that contains such statements, retaining normal
serial loop semantics. Implementations should issue warnings when such
cases preclude potential parallel execution.

A more difficult issue is the handling of exceptions, which cannot be
eliminated by decree. Note that for the serial model, if an exception
(without an explicit handler) is raised inside the loop, not only is
that iteration of the loop abandoned, but no subsequent iterations will
be initiated.
Also, the exception propagates out of the loop. But under the canonical
model for parallel execution, an exception (without an explicit handler)
raised by some execution of the loop body causes that particular
execution to be abandoned, with no effect on any other execution. This
tasking model also prevents exceptions from propagating out of the loop
body. A significant programmer effort would be needed even to detect
such exceptions from outside the loop.

An implicit mechanism could be devised for handling exceptions, possibly
aborting the implicit sibling tasks and re-raising exceptions at the end
of the loop statement. This would somehow have to deal with the
possibility of several (possibly different) exceptions being raised by
different executions of the loop body.

Note that these issues really only concern the semantics of parallel
execution. Any strictly serial implementation needn't deal with these
problems at all, due to the assumption that serial execution will use
the ordinary loop model. Ideally, the total impact on uniprocessor
implementations will be in allowing an optional keyword. The LRM may
need a page or two on the minutiae of parallel execution, but anyone
(including implementers) not concerned with multiprocessors can ignore
the whole section.

I think this proposal comes about as close as possible to the ideal goal
for study topics: satisfying a need (for explicit statement-level
parallelism) with minimal impact on implementations and users.

- Jim Hassett