Yet another Forth structures package

Many ways to add a feature like C's struct or Pascal's RECORD have been presented and discussed in the Forth community. One of them was posted on the USENET newsgroup comp.lang.forth in 1989 (unfortunately I don't remember, by whom; possibly John Hayes) and convinced me with its simplicity, elegance, and power (in a word, with its Forth-ness).

I have used this basic approach ever since, e.g, in the parser generator Gray [ertl89][ertl97]. It also inspired my approach for an object-oriented Forth extension. The package I present here adds automatic handling of alignments, a bit of syntactic sugar, and optimization of fields with offset 0.

Why explicit structure support?

If we want to use a structure containing several fields, we could simply reserve memory for it, and access the fields using address arithmetic. As an example, consider a structure with the following fields
a
is a float
b
is a cell
c
is a float
Given the (float-aligned) base address of the structure we get the address of the field
a
without doing anything further.
b
with float+
c
with float+ cell+ faligned
It is easy to see that this can become quite tiring.

Moreover, it is not very readable, because seeing a cell+ tells us neither which kind of structure is accessed nor what field is accessed; we have to somehow infer the kind of structure, and then look up in the documentation, which field of that structure corresponds to that offset.

Finally, this kind of address arithmetic also causes maintenance troubles: If you add or delete a field somewhere in the middle of the structure, you have to find and change all computations for the fields afterwards.

So, instead of using cell+ and friends directly, how about storing the offsets in constants:

0 constant a-offset
0 float+ constant b-offset
0 float+ cell+ faligned constant c-offset
Now we can get the address of field x with x-offset +. This is much better in all respects. Of course, you still have to change all later offset definitions if you add a field. You can fix this by declaring the offsets in the following way:
0 constant a-offset
a-offset float+ constant b-offset
b-offset cell+ faligned constant c-offset
Since we always use the offsets with +, using a defining word cfield that includes the + in the action of the defined word offers itself:
: cfield ( n "name" -- )
    create ,
does> ( name execution: addr1 -- addr2 )
    @ + ;

0 cfield a
0 a float+ cfield b
0 b cell+ faligned cfield c
Instead of x-offset +, we now simply write x.

The structure field words now can be used quite nicely. However, their definition is still a bit cumbersome: We have to repeat the name, the information about size and alignment is distributed before and after the field definitions etc. The structure package presented here addresses these problems.

Usage

You can define a structure for a (data-less) linked list with
struct
    cell% field list-next
end-struct list%
With the address of the list node on the stack, you can compute the address of the field that contains the address of the next node with list-next. E.g., you can determine the length of a list with:
: list-length ( list -- n )
\ "list" is a pointer to the first element of a linked list
\ "n" is the length of the list
    0 begin ( list1 n1 )
        over
    while ( list1 n1 )
        1+ swap list-next @ swap
    repeat
    nip ;

You can reserve memory for a list node in the dictionary with list% %allot, which leaves the address of the list node on the stack. For the equivalent allocation on the heap you can use list% %alloc (or, for an allocate-like stack effect (i.e., with ior), use list% %allocate).

Note that in ANS Forth the body of a created word is aligned but not necessarily faligned; therefore, if you do a

create name foo% %allot
then the memory alloted for foo% is guaranteed to start at the body of name only if foo% contains only character, cell and double fields.

You can also include a structure foo% as field of another structure, with:

struct
...
    foo% field ...
...
end-struct ...
Instead of starting with an empty structure, you can also extend an existing structure. E.g., a plain linked list without data, as defined above, is hardly useful; You can extend it to a linked list of integers, like this:
list%
    cell% field intlist-int
end-struct intlist%
intlist% is a structure with two fields: list-next and intlist-int.

This feature is also known as extended records [wirth88]. It is the main innovation in the Oberon language; in other words, adding this feature to Modula-2 led Wirth to create a new language, write a new compiler etc. Adding this feature to Forth just requires a few lines of code.

You can specify an array type containing n elements of type foo% like this:

foo% n *
You can use this array type in any place where you can use a normal type, e.g., when defining a field, or with %allot.

The first field is at the base address of a structure and the word for this field (e.g., list-next) actually does not change the address on the stack. You may be tempted to leave it away in the interest of run-time and space efficiency. This is not necessary, because the structure package optimizes this case and compiling such words does not generate any code. So, in the interest of readability and maintainability you should include the word for the field when accessing the field.

Naming convention

The field names that come to (my) mind are often quite generic, and, if used, would cause frequent name clashes. E.g., many structures probably contain a counter field. The structure names that come to (my) mind are often also the logical choice for the names of words that create such a structure.

Therefore, I have adopted the following naming conventions:

This naming convention does not work that well for fields of extended structures; e.g., the integer list structure has a field intlist-int, but has list-next, not intlist-next.

Implementation

The central idea in the implementation is to pass the data about the structure being built on the stack, not in some global variable. Everything else falls into place naturally once this design decision is made.

The type description on the stack is of the form align size. Keeping the size on the top-of-stack makes dealing with arrays very simple.

field is a defining word that uses create and does>. The body of the field contains the offset of the field, and the normal does> action is

@ +
i.e., add the offset to the address, giving the stack effect addr1 -- addr2 for a field.

This simple structure is slightly complicated by the optimization for fields with offset 0, which requires a different does>-part (because we cannot rely on there being something on the stack if such a field is invoked during compilation). Therefore, we put the different does>-parts in separate words, and decide which one to invoke based on the offset. For a zero offset, the field is basically a noop; it is immediate, and therefore no code is generated when it is compiled.

Acknowledgments

Marcel Hendrix provided helpful comments on the paper.

Glossary

%align ( align size -- )
align the data space pointer to the alignment align.
%alignment ( align size -- align )
extract the alignment of a type descriptor.
%alloc ( align size -- addr )
allocate size address units with alignment align, giving a data block at addr; throws an ior code if not successful.
%allocate ( align size -- addr ior )
allocate size address units with alignment align, similar to allocate.
%allot ( align size -- addr )
allot size address units of data space with alignment align; the resulting block of data is found at addr.
cell% ( -- align size )
char% ( -- align size )
create-field ( align1 offset1 align size "name" -- align2 offset2 )
name execution: -- addr

like field, but without the does>-part.

dfloat% ( -- align size )
double% ( -- align size )
end-struct ( align size "name" -- )
name execution: -- align size2

size2 is size aligned with align; this ensures that all elements of an array of name elements have alignment align.

field ( align1 offset1 align size "name" -- align2 offset2 )
name execution: addr1 -- addr1+offset1

create a field name with offset offset1, and the type given by size align. offset2 is the offset of the next field, and align2 is the alignment of all fields.

float% ( -- align size )
nalign ( addr1 n -- addr2 )
addr2 is the aligned version of addr1 wrt the alignment n.
%size ( align size -- size )
extract the size of a type descriptor.
sfloat% ( -- align size )
struct ( -- align size )
an empty structure, used to start a structure definition.

References

[ertl89] M. Anton Ertl.
GRAY - ein Generator f\"ur rekursiv absteigende Ybersetzer. Praktikum, Institut f\"ur Computersprachen, Technische Universit\"at Wien, 1989. In German.
[ertl97] M. Anton Ertl.
GRAY - ein Generator f\"ur rekursiv absteigende Ybersetzer. In Forth-Tagung, Ludwigshafen, 1997. In German.
[wirth88] Niklaus Wirth.
From Modula to Oberon. Software--Practice and Experience, 18 no. 7 pp. 661-670, July 1988.