Spreadsheet structure discovery with logic programming

[ Jocelyn Ireson-Paine's Home Page | Publications | Document (shortened) in Word format ]

ABSTRACT

Our term "structure discovery" denotes the recovery of structure, such as the grouping of cells, that was intended by a spreadsheet.s author but is not explicit in the spreadsheet. We are implementing structure-discovery tools in the logic-programming language Prolog for our spreadsheet analysis program Model Master, by writing grammars for spreadsheet structures. The objective is an "intelligent structure monitor" to run beside Excel, allowing users to reconfigure spreadsheets to the representational needs of the task at hand. This could revolutionise spreadsheet "best practice".

Prolog has many benefits for spreadsheet analysis, so our paper includes a tutorial on its use.
We also describe a formulation of spreadsheet reverse-engineering based on "arrows".

INTRODUCTION

This is the third paper in a EuSpRIG series. Our first, presented at EuSpRIG 2001 [Ireson-Paine, 2001] introduced the Model Master (MM) spreadsheet-description language, and showed how we could build spreadsheets by compiling them from MM programs, with advantages for readability, modularity, and code-reuse. In an MM version of Thomas Grossman's queuing simulation, where each column represented a server, we could change the number of servers simply by altering one constant.

We also introduced decompilation - translating spreadsheets to MM programs . and its benefits for reverse-engineering and error-detection. An MM program is a summary of a spreadsheet's calculations, useful in many ways. For example, Graham Macdonald has suggested that the commercial world would find it valuable in determining whether a spreadsheet accurately reflects a legal contract.

Our second paper, submitted like the current one to EuSpRIG 2004 [Ireson-Paine, 2004] took decompilation further, making it rigorous via "spreadsheet algebra" (the phrase should be understood in the same way as, for example, "vector algebra"), which treats spreadsheets as mathematical entities. We implemented functions for operating on these, and an interface resembling those found in computer-algebra systems, which enabled the results to be displayed as MM programs or recompiled into spreadsheets.

One example decompiled one of Ray Panko's spreadsheets and put it through a series of transformations aimed at making it more intelligible. The freshly decompiled version listed cells individually, giving them their original, uninformative names - D1, D3, and so on. As we proceeded with our transformations, we were able to batch related cells together into arrays and give them names derived from neighbouring labels. Our final listing was easier to read and revealed several intentional errors in the spreadsheet.

Our motivation was that there is no "best" form for such displays. To illustrate: we recently posted on the EuSpRIG discussion list a reference to John Raffensperger's spreadsheet style guidelines [Raffensperger]. Louise Pryor replied that these are indeed good from the point of view of a user who is reading the results and changing some inputs. However, they ignore the needs of other types of users, especially those maintaining and updating the spreadsheet. To us, it is obvious that a spreadsheet must be available in different forms, depending on its user's needs. Blueprints, architectural drawings, circuit diagrams and maps respect this: even library catalogues are searchable both by author and by title. Yet spreadsheets lag behind, forcing users to adapt to one and only one way of ordering their information. Spreadsheet algebra banishes such rigidity.

At the end of that paper, we introduced "structure discovery" - discovering logically related cells and redescribing the spreadsheet so that they can be seen as a single data structure. In this paper, we explain structure discovery and describe an implementation which allows the user to define and apply heuristics for finding structure in spreadsheets. Our goal is an "intelligent structure monitor" for Excel, allowing spreadsheets to be reconfigured to the representational needs of the task at hand.

The structure-discovery heuristics are logical specifications of the spreadsheet, and should be coded as close to logic as possible. We are doing so using the logic-programming language Prolog. This has many advantages for spreadsheet research, so a key part of this paper is an introduction to logic programming for spreadsheet analysis. We also introduce structure discovery and spreadsheet grammars, and formulate reverse engineering in terms of. "arrows".

STRUCTURE DISCOVERY

In this section, we explain what we mean by structure discovery. Let us give an analogy from the world of conventional programming languages. When a compiler compiles an IF statement, it does so using labels and jumps. The statement:

  IF Condition THEN
    ThenPart
  ELSE
    ElsePart
  END IF

would be converted into code similar to that below, its exact form depending on the instruction set of the machine:

      Condition
      If false, GOTO L1 
      ThenPart
      GOTO L2
  L1:
      ElsePart
  L2:

Suppose now that we lose the source of the program whose IF-statement this is. To reconstruct it, we would need a "decompiler": a program that tries to go back from machine code to the source that generated it. (Many decompilers exist - see the De Compilation page [Decompilation].)

This IF statement was explicit in the source file, but implicit in the machine code. Spreadsheets too have high-level implicit structure. The difference is that they contain no explicit listing of this structure - it resides only in the author's brain, a location not readily open to public inspection.

Consider as example a three-column Income/Outgoings/Profit spreadsheet, where each cell in the third column computes the difference of its row-mates in the first two:

  Income Outgoings Profit
                   =A2-B2
                   =A3-B3
                   =A4-B4

Logically, the columns are single entities, in the same sense in which the block of machine instructions above is a single high-level statement. Anybody trying to read or maintain the spreadsheet would be greatly helped by knowing both this and what all the third-column calculations have in common. Here, it is obvious; it may not be in larger spreadsheets.

Such analysis is even more vital when spreadsheets complicate their layout for visual effect. For example, consider a buy-to-rent spreadsheet which performs the same calculation for a number of properties, arraying them in blocks across the sheet, where each block calculates the profit P from rental income R, monthly mortgage payment M, and other costs O.

  Property 1   Property 2   Property 3
  M  O         M  O         M  O       
  R            R            R         
        P            P            P

Large spreadsheets of this kind can be confusing, especially when the blocks are split across worksheets.

Representing structure in MM

How can MM make such structure explicit? MM views the world in terms of "attributes" - arrays of elements related by equations to other attributes. In a freshly decompiled spreadsheet, each attribute corresponds to a single cell. Structure discovery entails working out which cells can be grouped together into arrays. Thus, our Income/Outgoings/Profit spreadsheet would look like this at first:

  <A1 A2 A3 B1 B2 B3 C1 C2 C3>
  where A1="Income" B1="Outgoings" C1="Profit"
        C2=A2-B2 C3=A3-B3 C4=A4-B4

but could have its attributes grouped and renamed to become more informative:

  <Income[1..3] 
   Outgoings[1..3] 
   Profit[1..3]
  >
  where Profit[all t] = Income[t] - Outgoings[t]

Similarly, the Property one could become

  <Mortgage{Property1,Property2,Property3} 
   OtherCosts{Property1,Property2,Property3} 
   Rent{Property1,Property2,Property3} 
   Profit{Property1,Property2,Property3}
  >
  where Profit[all p] = Rent[p] - (OtherCosts[p] + Mortgage[p])

Of course, the cells do not actually have names, unless assigned via Excel's "name cells" feature. However, one can try guessing plausible names from neighbouring labels. Our spreadsheet algebra interface has functions for this, renaming attributes and redisplaying the resulting program. The user can call these as many times as needed, without comitting irrevocably to any set of names.

How can we discover implicit structure?

We are building up a library of structure-recognition heuristics (or "patterns" or "grammars") that describe common structures and can be automatically matched against spreadsheets. We also allow users to define new patterns, perhaps ones so specific that they apply only to one spreadsheet: some possibilities are suggested in the next section.

This is a pragmatic approach, intended to get something running quickly. More powerful methods exist, the ultimate surely being to learn patterns via inductive logic programming [ILP]. We made some other suggestions near the end of our spreadsheet algebra paper.

Markus Clermont has written an impressive program for discovering semantically-related regions within spreadsheets [Clermont]. It searches for evidence of relatedness from a variety of hints; for example, it might propose that cells mentioned together in a cell range or the argument to a SUM should be grouped. Our approach does not pretend to be as comprehensive, but probably shades into Clermont's. Our heuristics are coded in Prolog, which can perform any computation that any other language can, so our patterns could actually invoke any kind of search, including those performed by his program. An example is given at the end of the Logic Programming and Prolog section, where we discuss searching for related formulae in a column.

PATTERNS, PATTERN LANGUAGES, AND GRAMMARS

In this section, we go on to explain what we mean by spreadsheet patterns. To develop some intuitions, we start with examples from other domains.

Patterns and pattern-matching

Filename patterns

Unix and DOS, naturally enough, both allow filenames to be written in full, e.g. del scratch.tmp, copy ss1.xls ss2.xls. But as a shorthand in commands which accept more than one file at a time, they also allow patterns which match sets of filenames. Thus del *.tmp deletes all files with the .tmp extension, the asterisk standing for any sequence of characters. So we have symbols (the letters) that stand for themselves, but also symbols that stand for sets of strings.

Regular expressions

Regular expressions were invented by mathematician Stephen Cole Kleene in the mid-1950s, as a notation to easily manipulate "regular sets", formal descriptions of the behaviour of finite state machines. Today, they form an indispensable part of Unix commands such as "grep", which searches for a string in a set of files. The following examples demonstrate some features:

a The letter a

a|b An a or b

[a-z] Any letter between a and z

([a-z])* Any number of such letters

[a-z] ([0-9])* One such letter followed by any number of digits between 0 and 9

Once again, we have symbols that can stand for sets of strings. We also have operators that combine patterns into bigger patterns: here, "|" makes the "or" of two patterns, and "*" following a pattern repeats it indefinitely.

Incidentally the Unix filename patterns mentioned above are apparently a special case of regular expressions [Tribble].

Snobol patterns

Snobol [Griswold et. al.] is a string-manipulation language written by David Farber, Ralph Griswold, and I. Polonsky of Bell Labs - and, incidentally, with a name derived in a deliberately silly manner: StriNg Oriented and symBOlic Language. It provided a rich string-matching formalism comparable to regular expressions but implementated differently. People used it for general string hacking well into the 1980s, before it was overtaken by Perl. As these examples, which do the same as the regular expressions, demonstrate, there is more than one way to write patterns:

"a"

"a" | "b"

any("abcdefghijklmnopqrstuvwxyz")

arbno( any("abcdefghijklmnopqrstuvwxyz") )

any("abcdefghijklmnopqrstuvwxyz") arbno( any("0123456789") )

Grammars

All the above patterns are grammars. A grammar is a formal definition of the syntactic structure of a language, normally written as rules which specify the structure of "phrases" in the language. Each rule has a left-hand side naming a syntactic category (for example, "noun phrase" or "verb" below) and a right-hand side defining it. The right-hand side can contain "terminal symbols" which stand for themselves, like the letters in the filename patterns, and "non-terminal symbols", which name other rules. There are no examples of these above. It may also contain operators for combining patterns, such as the regular expression "|" and "*", or the Snobol "|", "any" and "arbno". The example below is a grammar for a fragment of English:

  sentence      --> noun_phrase verb_phrase noun_phrase?
  noun_phrase   --> proper_noun | determiner adjective* improper_noun
  verb_phrase   --> verb
  proper_noun   --> "Mary" | "John" | "Fido"
  improper_noun --> "apple" | "ball" | "cat" | "fish"
  verb          --> "bites" | "eats" | "kicks" | "loves" | "sees" 
  adjective     --> "big" | "brown" | "small" 
  determiner    --> "the" | "a" | "each" | "every"

As in the patterns above, "|" indicates a choice between alternatives; "*" repeats an element indefinitely; "?" marks an optional component. The arrow "-->" means "is defined as". There are many possible notations: this is that used by Prolog Definite Clause Grammars.

Spreadsheet grammars

From these examples, it seems reasonable that a spreadsheet grammar - a grammar for describing parts of spreadsheets - would let us name and refer to to grammatical rules, and would have operators for combining elements within these rules. It might not need terminal symbols, since, at least for general-purpose rules intended to apply to many spreadsheets, it's unlikely that we would want references to specific labels or formulae.

A big difference will be dimensionality. In spreadsheets, one can go down a column and across from worksheet to worksheet as well as along a line; the grammar will need operators for specifying where in this space one element in a rule lies relative to the previous one.

What about content? Suppose that we have a spreadsheet laid out as follows (this one is from a project on modelling house prices). Each row consisting of a label followed by a cell:

  Household Income  cell              
  Interest Rate     cell
  Population        cell

Using the notation above, we could say that the rows follow the grammar

  row --> label cell

Now, suppose the spreadsheet had been in columns instead of rows:

  Household Income   Interest Rate   Population                 
  cell               cell            cell

We could now think in terms of columns, describing each by the rule:

  column --> label DOWN cell

We have introduced a "DOWN" operator to indicate the need to go down rather than along.

Somewhat more complicated is our Income/Outgoings/Profit spreadsheet, where each column contains three cells headed by a label. We could describe its columns thus:

  column --> label DOWN cell DOWN cell DOWN cell

or, since programmers loath writing anything more than once if they can possibly devise a way to automatically repeat it, we might introduce a "repeat N times" operator:

  column --> label (DOWN cell)*3

We can see that this rule is actually a special case of a more general rule which applies to a great number of spreadsheets and says that a column is a label with any number of cells beneath:

  column --> label (DOWN cell)*

Once more, we use "*" without a right-hand argument to denote indefinite repetition.

Finally, how about the Property spreadsheet, the one that's split into blocks, with attributes splattered across the worksheet. There are several possible descriptions. One is to start with the blocks as structural units:

  spreadsheet      --> (block ALONG(12))*3
  block            --> label DOWN 
                       (mortgage_part ALONG(2) other_costs_part) DOWN
                       rent_part DOWN
                       profit_part
  mortgage_part    --> cell
  other_costs_part --> cell
  rent_part        --> cell
  profit_part      --> cell

Here, we write an explicit ALONG operator, analogous to DOWN, rather than using implicit concatenation as we did in our first example. One advantage of this is that we can give the operator an argument, so that ALONG(12) means "go along 12 cells".

Another equally valid description would be to take the structural units to be the attributes:

  spreadsheet       --> mortgage_parts AND 
                        other_costs_parts AND 
                        rent_parts AND 
                        profit_parts
  mortgage_parts    --> DOWN             (cell ALONG(12))*3
  other_costs_parts --> DOWN ALONG(3)    (cell ALONG(12))*3
  rent_parts        --> DOWN(2)          (cell ALONG(12))*3
  profit_parts      --> DOWN(3) ALONG(6) (cell ALONG(12))*3

This time, we've given DOWN as well as ALONG an argument, for when we want to move down more than one row. We have also introduced an AND operator, which superimposes elements on one another without moving along or down.

What have we achieved? We have adapted ideas from pattern-matching languages and Prolog DCG grammars to describe structure in spreadsheets. Some descriptions will apply to many spreadsheets; others, such as the Property ones, will be more specific.

LOGIC PROGRAMMING AND PROLOG

We next show how logic programming and Prolog apply to spreadsheets and how they can be used to implement and match spreadsheet grammars. Logic programming is extraordinarily powerful: perhaps this section will convert some of the many who ignore it.

What is logic programming?

Logic programming languages are quite different from languages like C and Fortran in which one gives the computer a sequence of instructions - read data, assign to a variable, print variable - which it has to follow. In logic programming, by contrast, the programmer writes statements in logic that describe the properties that the solution must have.

By analogy, the C or Fortran programmer writing a program to build a house would code instructions telling the computer to pick up a brick, lay it on another brick, put a window next to it, and so on. The logic programmer, however, would write logical statements describing what it means for something to be a house - there are walls, which are parallel to one another and perpendicular to the ground, and composed of bricks in one of several patterns, and may contain a window, and other things. Compare how, later in this section, we use Prolog to find the labels in a spreadsheet. Our program is not a set of instructions, but a specification of what it means for something to be a label.

Logic programs consist of "facts" or logical statements. Some facts are unconditionally true; some depend on the truth of others. To run a logic program, one gives it a "goal": a logical statement which it is to prove from the facts. We start with one of the simplest possible examples of Prolog, namely these two facts:

  mortal(X) :- man(X).
  man(socrates).

The first says that X is mortal if X is a man; the symbol ":-" means "if". In everyday language, all men are mortal. The second says that Socrates is a man. So the first fact is conditionally true - it depends on whether its subject is a man - while the second is unconditionally true.

To run this program, we ask Prolog the goal:

  ?- mortal(socrates).

Prolog answers by finding the first fact, binding X to 'socrates', inferring that Socrates can be proven mortal if he can be proven a man, finding the second fact, thus proving him a man, and finally replying 'yes' to the goal.

More interesting are goals that don't merely check the truth of some statement, but find values for variables. If we ask

  ?- mortal(Y).

then Prolog replies that Y=socrates.

Prolog

Prolog was the first widely used logic-programming language, and although others exist, it is still by far the most popular. There are some very good implementations around; we recommend SWI Prolog [SWI], which is efficient, free, and runs on Unix and Windows.

Analysing spreadsheets in Prolog - representing cells

How can we apply logic programming to spreadsheets? Let us consider a single worksheet, describing each occupied cell by a fact. For our little Income/Outgoings/Profit spreadsheet, this gives six unconditional facts:

  cell( 1, 1, 'Income' ).
  cell( 2, 1, 'Outgoings' ).
  cell( 3, 1, 'Profit' ).
  cell( 3, 2, '=A2-B2' ).
  cell( 3, 3, '=A3-B3' ).
  cell( 3, 4, '=A4-B4' ).

The first two arguments of each are the cells' X,Y coordinates, taking the origin A1 to be (1,1). The third argument is the cell's contents, written between single quotes as a kind of Prolog string called an "atom". Of course, spreadsheet files do not come neatly divided up into Prolog facts. However, it is easy to write a Visual Basic program to iterate over a spreadsheet and dump its cells as Prolog, and we have done so as part of MM.

Now we want to ask questions about these cells. Just as we gave Prolog the goal mortal(socrates) above, here we can give it spreadsheet-related goals:

  ?- cell( X, Y, 'Income' ).
  ?- cell( 3, 2, C ).

The first asks Prolog for the coordinates of the cell whose contents is 'Income'. The second goes the other way, asking for the contents of the cell at (3,2), i.e. at C2.

Analysing spreadsheets in Prolog . representing formulae

This is not yet enthralling. One problem is that we are storing the formulae as indivisible strings. To do any meaningful analysis, we really need them to be expressions which can be broken into their constituent operators, numbers and cell references. Luckily, this is easy to do in Prolog - more so than in most other languages. Indeed, one of the first large artificial intelligence projects implemented in Prolog was MECHO, a program to read and answer 'A'-level mechanics problems. MECHO had to analyse the English text of the problems and convert into sets of equations, decide which variables the equations should be solved for, and finally solve them by isolating these unknowns. Clearly, this required non-trivial expression processing.

So let us rewrite the final three facts like this:

  cell( 3, 2, cell_ref(1,2)-cell_ref(2,2) ).
  cell( 3, 3, cell_ref(1,3)-cell_ref(2,3) ).
  cell( 3, 4, cell_ref(1,4)-cell_ref(2,4) ).

The third argument of each fact has now become a "term" or tree structure representing a formula. The minus sign has its usual significance; the "cell_ref" subterms have no intrinsic meaning to Prolog, being something we have arbitrarily decided shall represent cell references. Thus, cell_ref(1,2) denotes the cell at (1,2), i.e. A2. Note that we shall leave the first three facts as they are, because these cells only contain strings. This will be important below when we discuss finding labels.

Analysing spreadsheets in Prolog . representing cell dependency

What can we do with such expression trees? Computer science has devised many algorithms for their manipulation: one can scan for specified subexpressions, replace one subexpression by another, calculate the complexity of an expression, and so on. Such algorithms are readily coded in Prolog: see [Bratko] and [Sterling and Shapiro] for good introductions to this and Prolog in general.

One thing MM does is to scan each formula for cell references, building up a new collection of facts which specify that cell's precedents. We use these in finding labels, so let us show what they look like. In our example, they would be

  precedent( 1, 2, 3, 2 ).
  precedent( 2, 2, 3, 2 ).
  precedent( 1, 3, 3, 3 ).
  precedent( 2, 3, 3, 3 ).
  precedent( 1, 4, 3, 4 ).
  precedent( 2, 4, 3, 4 ).

The first two state that cell C2 depends on A2 and on B2; similarly for the other four.

Analysing spreadsheets in Prolog - finding labels

With these "precedent" facts now available, we can make our analysis more interesting. Consider the problem of finding all the labels in a spreadsheet, that is, those cells which contain only text and don't participate in the calculations. This is important, because as they are not part of the calculations, we don't want MM to show them in its summaries.

One thing we could try is to pose the goal

  ?- cell( X, Y, C ), atomic( C ).

What we have here is something new: a "compound goal" consisting of two parts joined by a comma, which means "and". It reads "find a cell at (X,Y) whose contents is C, and where C is atomic". "atomic" is a built-in which tests its argument to see whether it is an atom. The three label cells would pass this test, so here we have a simple query which finds all the labels.

The problem, and why we introduced the "precedent" facts, is that though this is sufficient for our small example, it might not be in general. There could be cells which contain strings but which are referenced by other cells, being therefore part of a calculation rather than mere annotation. So let us add another criterion to our query:

  ?- cell( X, Y, C ), 
     atomic( C ),
     not( precedent( X, Y, X2, Y2 ) ).

This now reads, using Prolog's built-in "not" operator: "find a cell at (X,Y) whose contents is C, and where C is atomic, and where there is no cell (X2,Y2) whose precedent is (X,Y)". That is, it finds only those cells which contain strings, and which are not referenced by any other cell, which is what we wanted.

A query interface for spreadsheet analysis

All Prolog implementations can be run interactively, in a mode where they read goals typed at the terminal, try proving them, and output the result. The component responsible is the "top-level interpreter" or TLI, and is what dealt with the goals shown above. The point is that it gives us a query interface for free, without the need to do any user-interface programming. Any questions can be posed, provided that appropriate definitions - such as those to be shown in the following section - have been added to the list of facts.

Structure discovery in Prolog

Let us link up with spreadsheet grammars and structure discovery. We' start by returning to the house prices spreadsheet, which consisted of rows all formatted as a label followed by a cell. Once we have it as a sequence of "cell" and "precedent" facts, how can we detect this structure?

We begin by adding the following to our collection of facts:

  label( X, Y, L ) :- 
    cell( X, Y, L ), 
    atomic( L ),
    not( precedent( X, Y, X2, Y2 ) ).

This is a conditional fact like the "mortal(X)" with which we introduced Prolog. It reads "the cell at (X,Y) with contents L is a label if L is atomic and there is no cell (X2,Y2) whose precedent is (X,Y)".

Next, we add another fact:

  label_naming_cell( L, X, Y, C ) :-
    label( XL, Y, L ),
    X is XL + 1,
    cell( X, Y, C ).

This builds on our definition of "label", and also calls Prolog's built-in "is" operator, which binds the variable on its left to the result of evaluating the expression on its right. What this gives is a fact which describes a group of two cells: a label and a cell to its right. It reads as "L is a label naming the cell at (X,Y) with contents C if there is a label L at (XL,Y), and X is XL+1, and there is a cell with contents C".

Were we then to put the query

  ?- label_naming_cell( L, X, Y, C ).

it would, should such a group exist, bind X and Y to the cell's address, C to its contents, and L to the label - its putative name.

Loose ends

There are a few loose ends to tie up. First, what is the connection between spreadsheet structure grammars and facts such as label_naming_cell? Compare the definition of label_naming_cell with the grammar rulerow --> label ALONG cell. The definition has a subgoal which finds a label, a subgoal which increments an address (thus moving along the row), and a subgoal which finds a cell. The grammar rule contains elements which do the same; they.re just more concise. In fact, one can be translated into the other, via techniques well-known in Prolog. So spreadsheet grammars are easily implemented in Prolog.

Second, we wanted to use spreadsheet grammars to propose cell groupings and names for these groups, which we ciuld feed to MM. How do we get this information out of them? Recall that the goal label_naming_cell(L,X,Y,C) above sets the variable L to a label, and (X,Y) to the address of the cell we assume it to be naming. Rather than just asking the TLI to display these, we can capture them and feed them under program control to the rest of MM. In practice, we would want to search for all answers to the goal (all possible label-cell pairs) and collect them together, passing them to MM and also, perhaps, displaying them on the spreadsheet so the user can see which regions have been matched. This is also relatively easy to manage.

Third, we won.t have just one rule. As a trivial example, we might have our label-cell rule, and also a catch-all rule which says that any cell on its own is to be considered as an attribute. We need to ensure that these are matched against the spreadsheet in the correct order, so that the catch-all doesn't grab everything before the other rule has had its chance. This means that either the user has to allocate a priority to each rule, or we sort them by specificity, so that the most specific rules are tried first.

Finally, we mentioned in the Structure Discovery section that although our approach to structure-discovery is deliberately simple, it probably shades into that of more ambitious programs such as Clermont's. We give a brief example. Consider a column headed by a label. This is a good candidate to be grouped as a single attribute. Now, such columns are often constructed by drag-and-copy, thus propagating a formula down the column. Formulae replicated in this way may be identical, or may differ just in that a constant or cell reference is incremented down the column. Searching for such replication has two benefits. First, it provides additional evidence for the cells' relatedness. Second, it gives us an underlying pattern that all the formulae must follow. In our MM summary of the spreadsheet, we can quote this pattern rather than the individual formulae, making the summary shorter and simpler. We have experimented with appropriate algorithms which boil down, in essence, to forming equivalence classes. An elementary method, which we are currently using, is to compare the formulae in the first two cells, work out a difference between them (e.g. A2+B2 would differ from A3+B3 in that there are cell references either side of the "+" whose Y parts are both incremented), and then test whether this difference continues throughout the rest of the cells.

ARROWS AND REVERSE ENGINEERING

This section describes a formalism we have devised that provides a concise formulation of reverse-engineering. It requires slightly more maths than the rest of the paper.

Let us begin by imagining one worksheet of a spreadsheet in value view. This is, in effect, an array whose indices are cell addresses and whose elements are values, i.e. numbers, strings, and other data. Mathematically, we can consider it a function from cell addresses to values.

Value view is boring, so let's switch to formula view. Now, each cell maps to a formula or expression, and the worksheet becomes a function from cell addresses to expressions

MM views the world as a collection of attributes, each attribute being an array. These too can be considered functions, from the attribute.s indices to MM expressions.

An MM program, then, is modelled as a collection of functions, one for each attribute. Mathematically, we regard it as an indexed function, the indices being the attributes.

Compiling an MM program translates it into a spreadsheet; i.e., translates this indexed function into a single function from cell addresses to Excel expressions.

Now, before we can compile an MM program, we need to know how each attribute maps onto the spreadsheet. In our Income/Outgoings/Profit spreadsheet, each attribute mapped onto a column; in the Property spreadsheet, the attributes (Rent, Profit and so on) mapped onto cells staggered across the worksheet. In general, the compiler will assume a default mapping for each attribute (it allocates attributes to columns headed by their name), but this can be overridden by the user. In any case, MM clearly needs to know what the mapping from attribute to spreadsheet will be.

We now see that compiling an MM program becomes a matter of taking the attribute-to-expression functions, pre-composing with the attribute-to-spreadsheet mappings, and forming the union of the results. To generate a spreadsheet output file, we iterate over the domain of this union arrow, pumping out each element and its image - i.e. the cell address and its contents.

Decompilation goes in the opposite direction, from the spreadsheet-to-expression function to a set of attribute-to-expression functions. The trial and error of spreadsheet algebra entails finding a decomposition optimal for intelligibility; structure discovery infers appropriate domains for the attribute functions. In general, there is no unique decomposition, hence the need for trial, error, and human assistance.

Our latest version of MM takes this model literally, and represents the functions as Prolog data structures. We note for Prolog programmers that we found this good for manipulating functions in a representation-independent way, without needing to worry about whether they are stored as facts (clauses) or as association lists, trees, and so on.

Finally, many branches of mathematics like to consider functions as "arrows". Amongst other things, this gives us, via "commutative diagrams", a handy graphical way to depict them, and enables some proofs to be done largely by drawing. We found this very useful, and think likewise of our functions as arrows.

IMPLEMENTATION

The current MM is coded in SWI Prolog. Advantages over Kawa, in which we coded the previous version, include: easier to generate .exe files for; probably less memory-heavy; the syntax is more readable, allowing one to define new operators, so making it easy to prototype notations for patterns without writing a parser; the TLI's query interface also helps rapid prototyping. And logic programming is wonderful. SWI Prolog may not be as portable as the Java-based Kawa, but it does run on all the machines we.ve needed so far.

The user interface

MM's user interface will allow the user to enter patterns, match them against a spreadsheet, and display the portions matched. Roughly half the interface is concerned with this; the other half provides the spreadsheet algebra functions described in our previous paper.

We are prototyping the user interface as a Web application where MM will run on a Web server, the user operating it via a browser. This is less interactive than a windowing interface, but easier to prototype, since the browser takes over the grunt work of rendering. We have taken advantage of this before in our Spreadsheet Autopublisher [Ireson-Paine; Paine and Ramsden, 2002], a Web-based application which automatically converted simple spreadsheets or MM programs submitted to it into other applications runnable over the Web.

FUTURE WORK

We need to experiment with alternative ways to write patterns. Once the notation is good enough to be fixed, we would like to see researchers collaborate in building up a shared library of structure patterns.

Intelligent structure monitoring and flexible best practice

Our goal is to run MM as an "intelligent structure monitor" alongside Excel. This could revolutionise spreadsheet "best practice.. We have said already that different spreadsheet tasks require different layouts. A layout convenient to a user entering data and reading off results may be much less convenient to the maintenance programmer. The problem is that all guidelines for best-practice demand a single layout. But no layout can be simultaneously optimum for tasks as different as data entry and maintenance. The answer is to move from a view of spreadsheets as single unchanging entities to a relativistic approach where they can freely adapt to the user's needs. MM will provide this.

REFERENCES

Ivan Bratko. Prolog Programming for Artificial Intelligence, Second Edition, Addison-Wesley, 1990.

Markus Clermont. A Scalable Approach to Spreadsheet Visualisation, http://www.ifi.uni-klu.ac.at/Publications/Dissertations/pubfiles/pdffiles/2003-0175-MC.pdf, Dissertation, Fakultät für Wirtschaftswissenschaften und Informatik, Universität Klagenfurt.

R. E. Griswold, J. F. Poage, I. P. Polonsky. The SNOBOL4 programming languge, 2nd edition, Prentice-Hall, 1971.

Jocelyn Ireson-Paine. Ensuring Spreadsheet Integrity with Model Master, http://www.j-paine.org/eusprig2001_as_html/eusprig2001.html, Proceedings of EuSpRIG 2001, Amsterdam 5-6 July 2001.

Jocelyn Ireson-Paine (2). The Spreadsheet Autopublisher, http://www.bized.ac.uk/virtual/vla/autopublisher/index.html.

Jocelyn Ireson-Paine (3). Spreadsheet algebra , http://www.j-paine.org/spreadsheet_algebra.html, Submitted to EuSpRIG 2004.

Jocelyn Ireson-Paine (4). MM Interactive, http://www.j-paine.org/mm_interactive/.

Jocelyn Ireson-Paine and Andrew Ramsden. Publishing and Using Spreadsheets on the Web, http://www.economics.ltsn.ac.uk/cheer/ch15_1/ramsden.htm, Computers in Higher Education Economics Review, Volume 15, Issue 1, 2002.

John Raffensperger (August 2003), New Guidelines for Spreadsheets, International Journal of Business and Economics, Volume 2, Number 2, http://www.ijbe.org/table%20of%20content/pdf/vol2-2/(6)ms9088rr.pdf . An older version is available at http://www.mang.canterbury.ac.nz/people/jfraffen/spreadsheets/index.html.17:25 6/4704.

Leon Sterling and Ehud Shapiro. The Art of Prolog, 2nd edition, MIT Press, 1994.

David Tribble. Filename Pattern Matching, http://david.tribble.com/text/fpattern.html, originally from The C/C++ Users Journal, December 1997.

De Compilation, http://www.program-transformation.org/twiki/bin/view/Transform/DeCompilation.

Inductive Logic Programming, http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/ilp.html.

SWI-Prolog, http://www.swi-prolog.org/.

[ Jocelyn Ireson-Paine's Home Page | Publications | Document (shortened) in Word format ]

`a`	The letter `a`
`a\|b`	An `a` or `b`
`[a-z]`	Any letter between `a` and `z`
`([a-z])*`	Any number of such letters
`[a-z] ([0-9])*`	One such letter followed by any number of digits between `0` and `9`