Language Study Note LSN-029-DR
Comments on Aspects of Unsigned Integer Types in Ada-9X
Robert B. K. Dewar
02 Feb 1992

Let's start off by limiting the field of discussion to unsigned types with two's complement modular operations. I realize there are a couple of other issues (range-checked unsigned and 1's complement), but I don't intend to address those here.

Question 1: What unsigned types should be provided?

MD 4.0 merely suggests that at least one unsigned type be provided, which is essentially simply a mechanism for forcing the implementation of the feature (and is consistent with the general approach in Ada-83, e.g. at least one float type must be provided). Presumably one expects that a sensible choice is made, and furthermore that the implementor will provide other appropriate types.

How reasonable is this expectation? Well, we have good and bad experience in Ada-83. Generally, implementors have provided a reasonable selection of integer base types. In the floating-point field, the experience is much less encouraging. No implementor has provided the IEEE extended float type on machines that provide it, despite the fact that reasonable exploitation of the IEEE capabilities in such environments requires such access. There are even worse cases of implementors providing only one float type on machines where there are clearly two types provided by the hardware.

Some suggestions have been made to require more specific compliance. I suggested requiring at least UNSIGNED_8 (a common type used to manipulate untyped byte strings) and the UNSIGNED type corresponding to INTEGER (which often, but not always, would correspond to the address length). Someone (I forget who right now) appropriately commented that this requirement could be undesirable on a 16-bit machine which had implemented a 32-bit INTEGER, leading to a situation in which the (relatively useless) UNSIGNED_32 was required, and the (much more important) UNSIGNED_16 was not provided.
This seems a good point, and I don't like my suggestion any more (although the idea of requiring UNSIGNED_8 still seems potentially useful).

Bryce Bardin (if I remember right) suggested that an implementation be required to provide unsigned types corresponding to each implemented signed integer type. I understand the motivation here, and generally am sympathetic, but it has one very important flaw. It would extend the annoying rule in Ada-83 that if you implement a long integer type, you are forced to implement additional silly things that are useless (like loops and indexing). I would like to see more implementors provide a 64-bit signed type on 32-bit machines (but I don't want 64-bit indexing and loops). On the other hand, I find a 64-bit unsigned type on a typical 32-bit machine to be completely useless.

Tucker, without specifically suggesting a formal rule, notes that the MRT intention is that unsigned types be provided for all lengths which are conveniently supported by the hardware. This seems exactly the right idea, although of course it is hard to formalize. Perhaps the right compromise on this issue is to add a clear statement requiring implementation of all lengths supported by the hardware, without clearly specifying what that means (it is then a subject of judgment, which is not so terrible!)

Incidentally, I would like to see a similar rule stated for floating-point in the numerics annex. This would *definitely* require some additional floating-point types to be implemented in some compilers (but presumably would not be too burdensome, given that the numerics annex does not have to be implemented) -- it really is absurd to say that a compiler implements the numerics annex (and therefore is presumably suitable for numeric applications) when it does not implement all the available floating-point types.

Question 2: What are unsigned types used for?

This is a question that should definitely be asked. The following are possible answers:

1. UNSIGNED_8 (i.e.
unsigned bytes) is a convenient type for manipulating untyped byte streams (such as appear in communication protocols). In C, the type "unsigned char" (which is really an integer type) is just right for this, and this is a common usage pattern in C. For some purposes, a signed 8-bit type can be used but is often less convenient, because the high order bit (which is, for example, often serving as a parity bit) really does not have the semantics of a sign.

2. Address arithmetic. On many machines, address values naturally correspond to unsigned integers. Even on the IBM-PC (286) with its non-linear address space, the type UNSIGNED_16 is appropriate both for segment and offset arithmetic. Of course address arithmetic is fairly specialized. Compiler writers use it all the time, which is why this interest sometimes seems to be over-represented in the compiler community! On the other hand, address arithmetic arises more often than one would think in low level systems type programming. C almost always provides, in the form of "unsigned", an appropriate type for such operations, and of course pointer arithmetic implicitly uses the same kind of unsigned operations in typical implementations.

3. General purpose use for numeric subranges, related to representation considerations. For example, if a financial program involves dealing with 200 companies, then an UNSIGNED_8 value can fit a company number.

4. Specialized use in applications specifically requiring modular arithmetic. Typically these modulus values may not be a power of 2, but, as I understand this requirement, it is easier to construct such arithmetic based on power-of-2 modular arithmetic types than on ordinary signed types (is this true?).

It seems clear that the overwhelmingly important uses are in categories 1 and 2, but there has been some support for also meeting the other needs.

Question 3: Does Ada-83 Satisfy the Need?
Generally the approach in Ada-83 for unsigned is something like:

   type UNSIGNED_8 is range 0 .. 255;
   for UNSIGNED_8'SIZE use 8;

   function "+" (A, B : UNSIGNED_8) return UNSIGNED_8;
   pragma INTERFACE ("+", INTRINSIC);  -- or some equivalent
   ...

How well does this work? Well, it gives a type with the right range, right representation, and right operations, which can be implemented in efficient form if the compiler and code generator care to do so. (I use the 9X term INTRINSIC here, but Ada-83 compilers are of course free to implement something equivalent, and many do; for example, in the Alsys compilers the equivalent language name is KNOWN_TO_CODE_GENERATOR.)

What are the disadvantages of this approach? There are three:

FIRST: it does not work for the largest integer type, since there is no signed parent type that can be used to provide the (largely fictional) base type required.

SECOND: the signed base type is pretty well hidden away, but not completely. For example, the loop:

   for I in UNSIGNED_8'BASE'FIRST .. 100 loop
      ...
   end loop;

is an annoying case of reappearance. Of course a code generator can do appropriate special casing to avoid any general impact on the quality of code for the normal cases, but you can't always trust this to happen (at least one compiler fails to do this special casing today). Similar instances of unwelcome reappearance of the base type can show up in the generated code. After all, when we write:

   A, B, C : UNSIGNED_8;
   ...
   A := B + C;

the formal semantics is essentially that the values of B and C are 16-bit signed values that are passed to the "+" routine, which returns a 16-bit signed value that must be range checked before the store in A. Of course the actual code can optimize this check away, but again, you can't count on compilers to always do the optimal thing.

THIRD: When the type is used to instantiate a generic, the normal signed operators reappear inside the generic, according to the usual Ada-83 rules.
Let's look at these disadvantages in more detail.

The FIRST problem, that we cannot provide an unsigned type at the length of the largest integer type, is a serious one. There are three general approaches to solving it.

The first approach is to build kludges. We can define an unsigned type which is not really a numeric type with the right operations, and then provide methods (unary plus, use of strings, etc.) for introducing numeric literals, so we get something of the sort:

   type UNSIGNED_32 is private;
   function "+" (X : INTEGER) return UNSIGNED_32;
   function "+" (X : STRING)  return UNSIGNED_32;

   A, B : UNSIGNED_32;
   ...
   A := +123;
   B := +"16#8000_0000#";

Well, it works, and with appropriate compiler cooperation can generate efficient code, but it is certainly unpleasant, and no one would regard it as a clean solution (or one that adequately meets the requirements).

The second approach is to read the RM imaginatively (both the ARG and URG have experimented with this imaginative reading) to allow implementors to introduce unsigned types in packages other than STANDARD. The reasoning is something like the following:

 o STANDARD can use magic to introduce new base types, and there is no explicit statement that such magic cannot be used in other packages.

 o The requirement for symmetrical types only applies to predefined integer types in STANDARD, and therefore would not apply to new magic types in other packages.

The status of this reasoning is a bit dubious (but as in the past, we have been willing to bend the RM if the need is great enough), and, as far as I know, the relevant AI is not yet in approved status (I think it is 597, if I remember the numbers right). [John Goodenough, can you confirm the current status on this point?]

As I remember, the reason the ARG did not complete this work was that the interaction with generics is not understood. Do these new magic types match range <> in a generic? If so, what arithmetic operations do they get (what are their predefined operations)?
This gets tricky, and no answers seem completely happy.

Note: at least one implementor (Verdix) has gone ahead and implemented this solution (with an appropriate warning that its validity depends on RM readings that have not yet been approved and are in dubious status). I am not sure what Verdix does with generics in this case.

The third approach is simply to implement a longer signed type. For example, if you need a 32-bit unsigned type, then provide a 64-bit signed type which can act as the notional base type (and provide the understood generic predefined operations). This seems the cleanest response to the problem, but, as far as I know, no implementor has taken this approach. Why not? I think the answer is clear. Given the current ACVC interpretation of the Ada-83 rules, if you implement a 64-bit signed type, then you have to implement a bunch of useless operations, such as conversions, indexing, loops and cases, which enormously increases the difficulty of such an implementation (just implementing the basic arithmetic operations is a much simpler, though not trivial, task).

It is one of the annoying consequences of our current interpretation of the rules that, given two compilers:

   Compiler A: has only 32-bit integers

   Compiler B: has everything compiler A has, but also implements
               64-bit integers, restricting the operations to
               simple arithmetic.

we declare compiler A valid, but compiler B invalid, even though compiler B clearly has more functionality (in such decisions, I tend to support Stallman's contempt for restrictions imposed by standards!) Anyway, I think this interpretation is clearly responsible for the fact that the third approach has not been more popular. If Ada-9X were more permissive in this respect, this solution might be more attractive -- more on that later.

Turning our attention to the SECOND problem, that of having the signed base type reappear, this is more of an implementation problem than a semantic problem.
The one semantic problem is that it is conceptually worrisome to have these types not be the same as their base types (a relationship which is of course required for normal signed types). If Ada-9X permits the explicit use of 'BASE as a type mark, then this problem is aggravated. Nevertheless the conceptual problem is relatively minor, and probably it is the implementation consequences that are more significant. One question here is whether doing things right using this Ada-83 approach (using special casing where appropriate) is any harder than implementing a separate Ada-9X approach, such as is suggested in MD 4.0. My guess is that it's pretty much a wash (neither approach is likely to be significantly more work).

Now, what about the THIRD problem -- difficulties in generic instantiation? The problem is that inside the generic the normal predefined (i.e. signed) operators reappear, which seems wrong -- or is it? To study this further we need examples of actual usage. As I consider these examples, I find four separate cases:

1. Cases in which signed operations are essential (even for dealing with unsigned cases). As might be expected, examples in this category are generics which are suitable for use with both signed and unsigned types.

2. Cases where it does not matter whether you get the signed or unsigned operations. As might be expected, examples in this category are also generics which are suitable for use with both signed and unsigned types.

3. Cases of generics suitable only for use with unsigned types, where the use of the unsigned (modular) operations is essential.

4. Cases of generics that are applicable to either signed or unsigned types, but where the use of unsigned (modular) operations for the case where an unsigned type is used is essential.

Let's try to find examples in each category. First, a case where signed operations are essential. For this, consider:

   generic
      type ELEMENT is range <>;
      ...
   package STATISTICS is
      function AVERAGE (A : ARRAY_OF_ELEMENT) return ELEMENT;
      function STD_DEV (A : ARRAY_OF_ELEMENT) return ELEMENT;
      ...
   end;

It makes perfect sense to compute the average and standard deviation of a set of unsigned or signed numbers, and of course the answers happen to be of the appropriate type (the average is signed or unsigned, depending on the input type, and the standard deviation always happens to be positive). This generic is thus suitable for use with both signed and unsigned integer values. However, the body very likely requires the use of signed arithmetic, and *certainly* does not depend on modular arithmetic. Consider how an average is computed -- it's not so easy; you can't just go adding up the total and dividing, because of overflow. Two obvious approaches are to detect and handle overflow when it happens, or to fiddle around with subtractions and scaling (generating intermediate negative numbers) to avoid the overflow -- in either case modular arithmetic will derail the computation.

Is the use of this package for unsigned numbers realistic? I think so. Here are two examples. Instantiate with an UNSIGNED_8 type and use it to compute these statistics for an incoming encoded byte stream in an attempt, as part of a decryption program, to identify the particular code being used. Instantiate with an UNSIGNED_32 type and use it to compute these statistics for the positions of files on a disk. The average is the preferred head rest position, and the standard deviation indicates the expected average seek performance.

Now, what about a case in which it does not matter whether signed or unsigned operations are used:

   generic
      type ELEMENT is range <>;
      ...
   package SORTOPS is
      procedure SORT (A : in out ARRAY_OF_ELEMENT);
      function MEDIAN (A : ARRAY_OF_ELEMENT) return ELEMENT;
      ...
   end;

Here the most likely implementation of both sort and median is to simply use the comparison operator on elements that originally exist in the given array.
Since the signed and unsigned comparisons give the same results (remember, for example, that we are comparing the 8-bit unsigned comparison with the 16-bit signed comparison), it makes no difference which set of operators is used.

Is the use of this package with unsigned types realistic? Again I think so. Here are some suggestive examples. Instantiate with an UNSIGNED_8 type and use it to compute the median of a sequence of bytes. This operation is needed for certain compression algorithms that work by encoding the distribution of bytes to make it evenly distributed. Instantiate with an UNSIGNED_32 type and use it to sort a bunch of modules in memory so that the list is stored in memory order.

Now, what about the third case -- one which is intended only for unsigned types and where the unsigned operations are essential. That's a lot harder to find. Certainly one can find examples that are suitable for instantiation with a particular unsigned base type. For example:

   generic
      type ELEMENT is range <>;
      ...
   package CYCLIC is
      function CRC_16 (A : ARRAY_OF_ELEMENT) return CHECKSUM;
      ...
   end;

The point here is that it makes sense to instantiate this only with a type that is an 8-bit unsigned type (the algorithm is specific to this case). It's still useful to create such generics because, unless the CRC_16 operation is primitive (unlikely!), it does not get derived, so if you really are using derived types, then this kind of generic is useful. On the other hand, it is important to note that this is not an example where completely different types are instantiated. Are there better examples? I don't know; I can't think of any!

The fourth case (where it makes sense to use it for either signed or unsigned operations, but it is important that the unsigned operations be used in the unsigned case) is really hard to find examples for. Again, I have failed to find any non-artificial examples (let alone convincing ones).

The bottom line here is that the Ada-83 solution may not be as bad as it seems.
Its weakness rests on the importance of convincing examples of case 3 or case 4 (where unsigned operations are important in the generic). Any examples would be most welcome!

Question 4: What should we do in Ada-9X?

The issue for Ada-9X is how to provide unsigned facilities with a minimum of special new features, and especially with a minimum of new core features or peculiar magic.

Many of the questions revolve around the generics issue. For example, one of the early proposals for 9X was to avoid the reemergence of predefined operations in generics. One "advantage" of this was to ease the task of dealing with unsigned by ensuring that the "undesirable" reemergence of the signed operations did not occur. However, from our previous discussion of this issue, it is not at all clear that the reemergence is necessarily so undesirable, and it is also the case that it addresses the generic case which seemed hardest to find examples for (generics that work for both signed and unsigned types, where it is important that the unsigned operations be used for unsigned types).

Similarly, the formulation in MD 4.0, where all unsigned types are derived from an unsigned class type, is designed to facilitate the use of generics where unsigned operations are important and which can be instantiated with different unsigned types -- a case that at least I couldn't find any convincing examples to support.

Going back to our generic examples, the one case where the Ada-83 solution was found clearly wanting was the cyclic checksum example, where we had a generic that was suitable for instantiation with any 8-bit unsigned type. An interesting observation is that in Ada-9X, there are two other possible approaches to solving this problem:

   Use a generic with a generic formal derived type

   Use class wide programming (e.g. an UNSIGNED_8'CLASS formal parameter)

At least one, and likely both, of these approaches will be in the final Ada-9X, so this case at least is taken care of satisfactorily.
Given this observation, the Ada-83 solution, combined with other 9X features, seems to satisfy most requirements for generic instantiation. The only exceptions are cases for which we have not yet found good examples:

   Cases of generics which are instantiated with unsigned types of
   different sizes, where unsigned operations are essential.

   Cases of generics which are instantiated with both signed and
   unsigned types, where unsigned operations must be used for the
   unsigned case.

Note that at least some proposals that have been made seem to have the functionality of dealing with the first of these two cases (which is still in our dubious category, given the absence of examples), while excluding the cases where we did have good examples, and which the Ada-83 approach allowed. For example, if we have a specific "mod" type, and a corresponding "mod" generic formal type, then we accommodate the most dubious case, but not the ones in which signed and unsigned types can both be sensibly used.

The Ada-83 approach will of course still be possible in Ada-9X, no matter what other capabilities are provided. If the new unsigned capabilities of Ada-9X are in some significant respects *less* powerful than the Ada-83 solution, then at best both solutions will persist in practice, and at worst we may find that the new Ada-9X solutions just aren't very useful.

One scenario to be worried about is an implementor who has to decide whether to optimize the Ada-83 solution and/or an alternative Ada-9X solution. It is of course the case that a lot of existing code uses the Ada-83 solution. This is particularly true for UNSIGNED_8, where the Ada-83 solution seems to work pretty well. This means that lots of 9X code, derived from this 83 code, will continue to use the Ada-83 solution, and compiler writers are likely to continue investing effort in making this solution work. Will they also invest effort in an alternative 9X solution?
The answer to that question is not clear, but could very well be no in some circumstances. If compilers appear in which the Ada-83 solution is carefully optimized, but the 9X solution is implemented in a correct, but not carefully optimized, manner, then we certainly may have situations in which the new 9X solution is useless and wastes implementor effort.

Question 5: Can we use the Ada-83 Solution for Ada-9X?

Going back to our three problems with the Ada-83 solution, what can be done in Ada-9X to minimize the impact of these problems and make the Ada-83 approach more viable?

First, there is the problem of the largest signed integer and its corresponding unsigned type. It seems fundamental to an Ada-83 solution that at least minimal support of the double length signed type is needed. For example, if we want 32-bit unsigned, then we will need 64-bit signed. Obviously this is not a zero level implementation effort. However, Ada-9X could make it much easier to implement, for example by specifically allowing an implementation to reject uses other than basic arithmetic. This does of course raise generic contract problems, but you can't have it both ways. Note that the introduction of INT_CLASS operations in 9X tends to be exactly counter-productive here, in requiring significant support for the longest integer type. It should also be noted that minimal support of double length signed integers is definitely of significant value in its own right, so the extra implementation work here (which would of course only be required if you needed the longest unsigned type, and were implementing the systems programming annex) seems justifiable.

Second, there is the problem that UNSIGNED_x'BASE /= UNSIGNED_x. As we discussed before, this is a possible implementation problem in generating efficient code, but probably not a significant one. I think the conceptual difficulty (and slight inconsistency) is acceptable.

Third, there is the generic issue.
As has been argued at some length, it is not at all clear that this *is* a problem. Furthermore, the Ada-9X attempts to solve this problem may actually be cures that are worse than the disease.

It certainly seems that the "Ada-83" solution is a possible path to handling the unsigned requirement in Ada-9X. In fact the situation in some Ada-83 compilers is not at all bad with respect to unsigned. What *is* seriously wrong in Ada-83 is that different manufacturers solve the unsigned problem with completely different, incompatible packages, so one is in a Pascal-like situation of writing code that depends on local junk and is seriously non-portable. Ada-9X at a minimum will provide a standard package interface for unsigned operations.

One advantage of staying as close as possible to the Ada-83 solution is that the Ada-9X design will provide an instantly implementable recommendation for implementation of a corresponding package in existing Ada-83 compilers. Such opportunities for retrofitting Ada-83 with Ada-9X packages are an important tool for easing the transition from Ada-83 to Ada-9X.

Summary

In this note, we have examined some issues in the implementation of unsigned integer types, and in particular looked at the generic instantiation issue. Ada-83 provides paths for addressing unsigned which are not necessarily so bad, and we should look again at staying closer to the Ada-83 direction in satisfying the unsigned requirement, rather than devising completely new approaches which may after all be not entirely satisfactory, as well as causing additional transition problems.