LSN003.PrEpil

This LSN examines the actions associated with the "epilogue" of a
protected operation that might have changed the state of a protected
record.

---------------------------------------

Language Study Note on Reliance on the Protected Record State

Offer Pazy
March 25, 1991

1. Introduction

The straightforward way to describe the sequence of events when a
protected record changes its state is as follows: after the change, and
without releasing the lock, the barrier expressions are evaluated for
all entries with non-empty queues (possibly in priority order), and
each evaluation that yields true causes the corresponding entry body to
be executed, until no waiter remains whose barrier is satisfied. It is
a property of the barrier expression that it can be evaluated by any
task that changes the state of the object; hence there is no need for
an extra context switch just for this purpose. The primary goal of this
definition is to close the window between the time the state of the
protected record is modified and the time a high-priority waiter is
scheduled while holding the lock.

This model suggests that the task which changes the state of a
protected record is the one that actually evaluates all the barrier
expressions and executes all the entry bodies. This is not required,
though. Other implementations may be provided as long as the following
invariant is maintained: callers queued at the time of the
state-changing operation whose barriers evaluate to true are serviced
without allowing other intervening locking operations to execute
(including the queuing of new entry callers).

2. Discussion

We define the semantics in such a way that a deterministic model can be
presented to the application while allowing some implementation freedom
for more specialized usages. Here we talk about the "guarantees". We do
want to ensure certain properties of the behavior, and as long as the
implementation supports them, it is free to choose its own model for
the actual implementation.
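The service sequence described in the introduction can be sketched as
follows. This is an illustrative single-threaded Python model, not Ada;
the class and all names are ours, and the lock is treated as implicitly
held for the whole epilogue:

```python
from collections import deque

class ProtectedRecord:
    """Toy model of the epilogue: the task that changed the state
    keeps the lock and services every queued caller whose barrier
    is now true, re-checking all barriers after each entry body."""

    def __init__(self):
        self.state = 0
        self.queues = {}     # entry name -> deque of (barrier, body)
        self.serviced = []   # names of entries whose bodies ran

    def enqueue(self, entry, barrier, body):
        # A caller whose barrier is false joins the entry queue.
        self.queues.setdefault(entry, deque()).append((barrier, body))

    def change_state(self, operation):
        operation(self)                  # the protected operation
        # Epilogue, still "holding the lock":
        progressed = True
        while progressed:
            progressed = False
            for entry, queue in self.queues.items():
                while queue and queue[0][0](self):  # barrier true?
                    _, body = queue.popleft()
                    body(self)                      # run entry body
                    self.serviced.append(entry)
                    progressed = True               # state changed:
                                                    # re-check barriers
```

For instance, with three callers queued on an entry whose barrier is
`state > 0` and whose body consumes one unit of `state`, an operation
that sets `state` to 2 services exactly two of them before the lock
would be released; the third stays queued.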
For mode changes, such a guarantee is essential. An example may
illustrate this case. Assume that at one point in time a certain
barrier is false and N tasks (with a variety of priorities) are waiting
for it to become true. A task comes in and changes the state of the
protected record such that the barrier becomes true. The issue here is
what is guaranteed regarding the existing queue of waiters. It is
important to ensure that all of them will be resumed (semi-atomically)
before any other task, even a high-priority one, can "penetrate" the
queue. We are then able to give a meaning to the 'COUNT attribute, and
also to allow the state-changing task to rely on its view of the queue
when deciding whom, and how many, to release. If we allowed a slice of
the queue to change after that task finished (e.g. by a high-priority
task queuing itself in front of existing tasks), that invariant would
no longer hold.

This semantic model still allows for a variety of implementation
techniques (see below); in particular, an implementation is allowed to
actually switch to the task of an entry whose barrier expression is
satisfied in order to execute its body. However, the semantics are
drafted carefully so as not to require such a context switch, and to
isolate the programmer from such differences between implementations.

The implementation model implies that a high-priority task (or an
interrupt handler) may be required to execute entry bodies on behalf of
other, potentially lower-priority, tasks. This may seem like a
potential for an unbounded priority inversion. We do not think so.
(There are implementation and semantic problems with choosing other
behaviors; see below.) In terms of the amount of CPU utilization needed
at a certain priority level, it makes no difference when the actual
bodies are executed, since they will be elevated to the higher priority
when they do execute. If we remember the intended purpose of these
operations, we know that they should be very short and bounded in time.
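The mode-change guarantee makes a user-level broadcast expressible: the
releaser snapshots how many tasks are currently queued (the analogue of
E'COUNT) and opens the barrier for exactly that many. A minimal Python
sketch, with names and structure of our own invention and the lock left
implicit:

```python
from collections import deque

class Gate:
    """Release exactly the waiters queued at signal time; tasks that
    arrive afterwards find the barrier closed again."""

    def __init__(self):
        self.to_release = 0        # barrier: to_release > 0
        self.waiters = deque()

    def wait(self, name):
        self.waiters.append(name)  # caller queues on the entry

    def signal_all(self):
        # State change: open the barrier for everyone queued *now*.
        self.to_release = len(self.waiters)   # 'COUNT analogue
        released = []
        while self.waiters and self.to_release > 0:
            released.append(self.waiters.popleft())
            self.to_release -= 1              # entry body
        return released
```

Because the queue cannot be "penetrated" between the snapshot and the
last release, `to_release` reaches zero exactly when the pre-existing
waiters have all been served; a later arrival waits for the next
signal.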
A situation similar to the one above may arise when broadcast semantics
are needed from the synchronization primitive built with the protected
record. Using a traditional kernel, a similar activity would also have
to occur in an atomic manner, i.e. making all the waiting tasks ready.
If the number of waiting tasks is unknown, the same problem exists, and
the kernel will spend an unbounded amount of time accomplishing this.
We envision that the code inside locking operations is similarly short
and low-level.

This does, however, place the responsibility on the designer of a
real-time system to guard against potential problems, but this is no
different from using any other known primitive. In the real-time
domain, one has to be more careful and knowledgeable about what one is
doing. In particular, critical sections must be extremely short and
bounded. Also, one has to fully understand the timing properties of the
semantics of these mechanisms (whatever they are), to realize their
effect on the application, and to accommodate them. We think that when
one is aware of the exact semantics and understands the issues, one is
back to a kernel-like level of control, and can program around the
problems if they exist.

Specifically, we treat protected records (in the real-time domain) as a
potentially very low-level primitive, one which can be programmed for
either broadcast or singlecast. One should also know what kind of
waiters, and how many, are expected. The underlying assumption is that
one can never completely avoid some general bookkeeping activity in the
guts of the kernel that does not always follow pure priority rules, and
that those activities are very short. As long as one can account for
them in the scheduling analysis, everything is fine. These operations
just do the bare minimum (at a very high priority for a short time),
and then let the rest of the more complex work be done outside of the
critical region under the normal priority rules.
3. Possible Problems With the Proposed Semantics

The following are some of the identified problems:

- The critical sections are executed at the ceiling priority and not at
  their respective tasks' priorities. Furthermore, the entire activity
  after the state change has to be scheduled as one chunk at the
  corresponding priority level. As a result, some loss of utilizable
  schedulability is unavoidable.

- The fact that a high-priority task may be tied up evaluating barriers
  and executing entry bodies on behalf of lower-priority tasks can
  result in longer-than-necessary critical sections and, in the case
  where the task is actually an interrupt handler, in increased
  interrupt latency.

- Even though waiters on entry barriers may have lower priority than
  the state-changing task, in practice they are executed before their
  allotted time. It seems natural to let "really" higher-priority tasks
  be dispatched first. The current semantics allows waiting tasks to
  proceed before the time is appropriate for their priority levels. If
  during this time they are assigned any resources that could be needed
  by higher-priority tasks (as is likely), they have caused an
  inversion.

4. Alternative Implementation Techniques

The main problem is that if the task that has caused the state to
change does not itself perform the operation on behalf of all waiters,
then we run into the problem of making sure that, when the original
caller wakes up, the state of the protected record has not yet changed.
One way to do so is to transfer the ownership of the lock during the
dispatch: one could serve the truly higher-priority tasks (before the
elevation to the ceiling) immediately, mark the queues somehow, and
serve the rest later. In a preemptive, dynamic, multi-CPU system this
is not trivial to do; the overhead and the length of the global
critical section may be too long.
Also, this technique is a special-purpose one, does not scale nicely,
and in general is not appropriate as the semantics for a
general-purpose real-time language.

In an alternative semantics, the state-changing task merely notes that
one task can proceed, makes it "runnable" again, and then proceeds out
of the protected record. Once that task has completed its work, there
will be a context switch to the other task, which will consume the
resource, mark the next waiter as "runnable", and proceed; and so on.
By the time each of the tasks has executed, there will have been the
same number of context switches. However, the first (presumably
high-priority) task will have been "delayed" only for a bounded and
minimal time. The only overhead of this approach is that the barriers
will be evaluated twice (once by the releasing task to check whether
any other task can proceed, and again on behalf of the task itself when
it actually executes). This technique requires a clever implementation
to avoid problems when a caller withdraws. It also makes it difficult
to broadcast data.

Yet another alternative approach would be to schedule the protected
record rather than a particular task. In such a scheme the ready queue
would consist of both tasks and protected records. The priority of the
protected record would be the highest priority of the tasks waiting on
one of its open entries. In effect, the protected record is viewed as a
schedulable object which needs a thread to execute. When a protected
record is given the processor, the first task on its open entry queue
is the thread which is used to execute the entry. With this approach,
if a task times out or is aborted, it is simply removed from the entry
queue of the protected record. If other tasks remain queued, the
protected record is left in the ready queue (although its priority may
have changed). If no tasks are left on the open entry queues, the
protected record is removed from the ready queue.
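The baton-passing alternative (wake one waiter, which consumes the
resource and wakes the next) can be sketched as follows. This
single-threaded Python model is our own illustration, not a tasking
implementation; it counts barrier evaluations only to make the
"evaluated twice" overhead visible:

```python
from collections import deque

def baton_release(waiters, state):
    """One release per waiter: the state-changing task checks the
    barrier once and marks the first waiter runnable; each waiter,
    when dispatched, re-evaluates the barrier, consumes, and wakes
    the next.  Returns the service order and the number of barrier
    evaluations."""
    evals = 0
    def barrier():
        nonlocal evals
        evals += 1
        return state["count"] > 0

    order = []
    runnable = deque()
    # Releasing task: evaluate once, mark one waiter runnable, leave.
    if waiters and barrier():
        runnable.append(waiters.popleft())
    # Later context switches: each released waiter runs in turn.
    while runnable:
        task = runnable.popleft()
        if barrier():                  # second evaluation, by the waiter
            state["count"] -= 1        # consume the resource
            order.append(task)
            if waiters and barrier():  # wake the next waiter
                runnable.append(waiters.popleft())
    return order, evals
```

With three queued waiters and two units of resource, the sketch
releases the first two waiters in order and leaves the third queued,
each released waiter having paid one extra barrier evaluation compared
with the eager model.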
5. Summary

The beauty of the proposed approach (flexibility and low level) is that
all the above alternatives can easily be programmed explicitly by the
user, based on the application's needs. The *number to wakeup*, the
baton, and other flags can all be kept in the protected record and in
the barrier expressions.