[ Jocelyn Ireson-Paine's Home Page | Publications | Document (shortened) in Word format ]

Spreadsheet algebra
Jocelyn Ireson-Paine

ABSTRACT

We introduce spreadsheet algebra. This treats spreadsheets as mathematical objects and operates on expressions containing spreadsheets, giving us rules for expanding, factorising, simplifying and rearranging them. We implemented it in our spreadsheet analysis program Model Master, and demonstrate by reverse engineering one of Panko's spreadsheets, producing a concise listing of the formulae with cells renamed to meaningful identifiers derived from the labels. This is useful when error-checking. We have also used it to automatically add numerical differentiation code to a process-control spreadsheet, and to generate new spreadsheets by joining existing ones.

We take a relativist approach, believing there is no one "right" way to display spreadsheet expressions. Different tasks need different presentations. In this, we imitate conventional computer algebra, supporting the trial and error needed during reverse-engineering.

INTRODUCTION

In 1953, while IBM were busy inventing Fortran, Kahrimanian and Nolan independently wrote the first programs that could differentiate symbolically. Fortran, seeing the world in terms of numbers, only lets you differentiate numerically; by using arrays of the things, you can lay a grid over your variables and get a numerical answer to their rate of change, to as fine a degree of accuracy as you want. But what Kahrimanian and Nolan had realised was that the process of symbolically differentiating the expression x³+2x to make 3x²+2 is as much about "getting an answer" as is that of numerically evaluating the expression (x[1]-x[0])/dt to get 5. Their work kicked off the development of computer algebra systems. And since then, computer algebra has pushed this notion of "getting an answer" ever further into mathematical territory, capturing the behaviour of polynomials, series, tensors, ..

One of the things about such objects is that they exist in pieces. There is a single symbol for the number 5, but not for the polynomial (1+x)ⁿ. The polynomial doesn.t even have a single presentation. Sometimes, we prefer to write it as 1+nx+n(n-1)x²/2+..., 1+n!x/1!(n-1)!+n!x²/2!(n-2)!+..., or as ∑C(n,k)x^k. It all depends on which particular patterns we want to make explicit and how far we are trying to squeeze the expression down to fit into the rudimentary capacity of human short-term memory. Computer algebra designers know the importance of rearranging and re-presenting expressions, and have put years of effort into algorithns for factorisation, expansion and simplification. We shall argue the need to extend this work to spreadsheets.

While computer algebra was extending the range of mathematical objects we can compute with, computer science was extending the range of computational entities we can mathematise about. Not least of these were computer programs themselves. It took computing some time to understand programs mathematically; but once it had, it realised that they too could be the objects of computer manipulation. Spreadsheets are programs; and in this paper, we describe a computer algebra system for "spreadsheet algebra", where the fundamental mathematical components are spreadsheets.

The rest of this paper is structured as follows. The next section illustrates conventional computer algebra. We then introduce "spreadsheet algebra", starting with an introduction to the idea of computer programs as mathematical entities. This is followed by a section on how spreadsheet algebra relates to our spreadsheet--analysis program Model Master and - the longest section in the paper - a demonstration of reverse engineering, generating a comprehensible listing from one of Panko's spreadsheets. This is followed by two smaller examples, in one of which we used spreadsheet algebra to insert numerical-differentiation code into a process-control spreadsheet. We finish with a section on further work.

COMPUTER ALGEBRA

For those unfamiliar with it, we introduce computer algebra via a short session with the Maple algebra system [Hart and Wolfe ]. We shall go on to compare it with our spreadsheet algebra system. The lines ending in semicolons below are the user.s input; they are followed by indented lines showing Maple.s responses:

  p := (a+b)^2;   
    p := (a + b)²

  q := expand(p);      
    q := a² + 2 a b + b²

  r := factor(q);  
    r := (a + b)²

The first command assigns the expression (a+b)^2 to the Maple "expression variable" p. If you saw a similar statement in Fortran, you would expect it to evaluate the expression, adding a to b, squaring the result, and assigning the resulting number to p. However, like other computer algebra systems, Maple is about computing with expressions, not just numbers. One adds, subtracts, squares numbers; one simplifies, factorises, expands expressions. So Maple needs variables, like p, in which expressions can be held.

In the second command, the user passes this variable to expand, a Maple function that treats its argument as an expression and expands it out as far as possible. The result is assigned to variable q.

Finally, we call another built-in, factor , to refactorise it, showing that we can get back to the original expression. The key points are:

Computer algebra systems manipulate algebraic expressions without evaluating them into numbers.
You can display the same expression in different ways, for example by factorising or expanding it.
You can store expressions in variables.
You can have several versions of an expression on the go at the same time.

SPREADSHEET ALGEBRA

As we have said, programs and spreadsheets are mathematical entities, and we can do algebra with them too. In this section, we explain.

Doing algebra with programs

What does it mean, to do algebra with programs? Consider the following program, which assigns the numbers 1 to 10 to the arrays p and q,

  for i := 1 to 10 do    
    p[i] := i;  
  for i := 1 to 10 do    
    q[i] := i;

Now, this has the same effect as

  
  for i := 1 to 10 do    
    a[i] := i;    
    b[i] := i;

From this, it's tempting to propose a general law of programs which says

  
  for I := L to U do    
    S1;  
  for I := L to U do    
    S2;

always has the same effect as

  
for I := L to U do    
  S1;    
  S2;

where S1 and S2 are arbitrary statements. Notice how this resembles the distributive law a(b+c)=ab+ac for numbers.

Why is this useful? Well, were this law always true, it would help in optimising our code, because it would give us a rule we could follow to merge loops, thus reducing the time spent testing and incrementing loop counters.

As it happens, this law is not universally applicable: consider what would happen if S2 were a statement that halted the program, or if S1 and S2 both updated the same array but with different values. Such problems arise, for example, arises with variables and assignment when different statements find themselves updating the same piece of store. However, that doesn.t destroy the validity of program algebra laws; it.s just that conventional languages are complicated and it.s hard to give simple examples here. The topic is discussed in [Bowen; Visser et. al.].

The good news for EuSpRIG is that spreadsheets are actually simpler to handle in this way, as we shall see..

Doing algebra with spreadsheets

So if spreadsheets are programs, and we can do algebra on programs, we can do algebra with spreadsheets. But why bother?

The answer is that we want to achieve quality through reverse engineering. In its raw or unrefined state, a spreadsheet file is about as legible as a stretch of DNA or the inside of an amoeba. We think we can make out some parts because of the boundaries between them; others, we reveal by staining with powerful dyes. But point mutations have corrupted evolution's drag-and-drop copying; different dyes disagree on the regions they delimit; much of what we find looks suspiciously like left-over junk from previous versions; and the Author wrote no comments and won't pick up His phone. In such circumstances, it's essential to carry out trial dissections, saving useful parts in killing bottles or pinning them labelled to a board; bracketting repeated patterns in a database and comparing them for matches. We need to classify, clean, label and store parts for later reuse; to store, as well, different dissections; and to transpose component parts from one to another. Spreadsheet algebra lets us do this.

The slogan is: preserve meaning while maximising utility. In our program algebra example, utility was efficiency: the program with merged loops was faster than that without. In other contexts, it could be memory occupancy or ease of data entry . or, as in our big example later on, readability. Preserve meaning, but don.t preserve presentation if utility is better served by a different presentation.

ALGEBRA AND MODEL MASTER

At EuSpRIG 2001, we presented a paper on Model Master (MM), a spreadsheet design language and its compiler [Ireson-Paine, 2001]. The MM language enables spreadsheets to be described in concise mathematical notation. Meaningful identifiers can be used in place of cell names; programs can be divided into modules which can be stored in files and reused from one spreadsheet to another; cells to which the same calculation is being applied can be grouped so that the calculation appears only once; formatting commands can be used to display the same program in many different layouts.

To demonstrate, we reimplemented one of Tom Grossman's queueing simulations in MM. Our program appears at the end of [Ireson-Paine, 2001], and implements a 4-server queue. Grossman's original spreadsheet represents each server as a column in which successive rows represent successive transactions. To alter the number of servers in the simulation entails changing the number of columns, a major structural change. We were able to do this in MM just by editing one constant in our code.

MM also lets us decompile spreadsheets into MM programs, regardless of whether they were originally generated in MM. The program will lack much of the structure intended by its spreadsheet's author - for example, in an accounts spreadsheet where one column tracks income over several years, there's nothing to indicate that all these cells are "doing the same thing". However, MM can transform the program into one that maps the entire column onto one multi-cell attribute. Discovering though which cells should be grouped, how they should be named, and what their indices are, can take a lot of trial and error. Spreadsheet algebra supports this, allowing users to put a through a gamut of transformations until they find the one that best reveals its structure.

SPREADSHEET ALGEBRA EXAMPLE: REVERSE ENGINEERING

This, the main section, demonstrates spreadsheet algebra for reverse engineering. We started with a spreadsheet sent by Steve Davis of Clemson University, who is evaluating MM as a debugging tool. The spreadsheet turned out to have been written by Ray Panko for the work reported in "Applying code inspection to spreadsheet testing" [Panko]. We show it below in value view, omitting to save space some labels found below and to the right of the calculation area. These all described errors - for example, cell J28 contained the string F5;; living costs summer;; cells in sum should be f14..f20 - and were evidently not part of the original problem.

Cash Budget Assume: 2% inflation/semester

Fall Spring Summer

Cash, beginning $1,000 $1,360 $673

Outflows - School costs $4,468 $4,474 $4,480

Living costs $4,172 $4,213 $4,485

Inflows- Loans 3000 3000 3000

Support from home $6,000 $5,000 $6,000

Cash,end $1,360 $673 $980

======= ======= =======

School (contractual) - Tuition $4,115 $4,115 $4,115

Fees $53 $53 $53

$300 306 312

Living (contractual) Monthly Months

Housing $450 4 $1,800 $1,800 $1,800

Insurance $53 4 $212 $212 $212

Living Costs (other)

Food $330 4 $1,320 1346 1373

Entertainment $150 4 $600 612 624

Transportation $40 4 $160 $161 164

Clothing $21 4 $80 82 84

Stage 1

The following sections show some spreadsheet algebra manipulations, and should be compared with the Maple session in Section 2.

Our first step was to start the spreadsheet algebra system and load the spreadsheet:

  panko := load( "Panko.slk" );

Because of the file's extension, load assumes it to be a spreadsheet and automatically decompiles it into an MM program. The":="assigns this program to the variable panko. Just as p, q and r in the Maple session are variables which hold algebraic expressions, so panko is a variable which holds a expression. And, just as Maple automatically lists the expression resulting from the commands typed at it, so MM displays the spreadsheet resulting from the load command, listing it as an MM program:

  <A1 A3 A4 A6 A8 A10 A13 A14 A15 A16 A17 A18 A19 A20 
   B1 B4 B5 B6 B7 B10 B11 B13 B14 B15 B17 B18 B19 B20 
   C13 C14 C15 C17 C18 C19 C20 
   D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D14 D15 D17 D18 D19 D20 
   E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12 E14 E15 E17 E18 E19 E20 
   F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F14 F15 F17 F18 F19 F20 
   H1 H7 
   I19 I20 I23 I24 I27 I32 I37 
   J25 J26 J28 J29 J30 J31 J33 J34 J37 
   L5 L7
  >
  where
  A1 = "Cash Budget"
  B1 = " Assume:"
  D1 = 0.02
  E1 = "inflation/semester"
  H1 = "Panko, R.R. Applying code inspection to spreadsheet testing, Journal of Management Information Systems, (in press?)"
  D2 = "Fall"
  E2 = "Spring"
  F2 = "Summer"
  A3 = "Cash, beginning"
  D3 = 1000
  <etc>
  D9 = "======="
  E9 = "======="
  F9 = "======="
  <etc>
  I27 = "4 mechanical errors in formulas with 3 opns and cell references"
  J28 = "F5;; living costs summer;; cells in sum should be f14..f20"
  <etc>

The program contains a bracketed list of attribute declarations followed by the equations (some omitted to save space). As always in MM, all attributes referenced in the equations are declared. This provides redundancy for the compiler (when compiling and error-checking MM programs) and for the reader (when struggling to make sense of other people's code).

Stage 2

The next step was to remove the labels, as they are not part of the calculations. We did this by calling detach_labels, which removes all cells containing string constants not referenced by other cells. These can fairly be assumed to be labels:

  panko_delabelled := detach_labels( panko );

MM then displays the result:

  <B14 B15 B17 B18 B19 B20 
   C14 C15 C17 C18 C19 C20 
   D1 D3 D4 D5 D6 D7 D8 D10 D11 D12 D14 D15 D17 D18 D19 D20 
   E3 E4 E5 E6 E7 E8 E10 E11 E12 E14 E15 E17 E18 E19 E20 
   F3 F4 F5 F6 F7 F8 F10 F11 F12 F14 F15 F17 F18 F19 F20
  >
  where
  D1 = 0.02
  D3 = 1000
  E3 = D8
  F3 = E8
  D4 = SUM(D10:D12)
  E4 = SUM(E10:E12)
  F4 = SUM(F10:F12)
  D5 = SUM(D14:D20)
  E5 = SUM(E14:E20)
  F5 = SUM(F12:F19)
  D6 = 3000
  E6 = 3000
  F6 = 3000
  D7 = ROUND(1000+D4+D5-D3-D6,-3)
  E7 = ROUND(1000+E4+E5-E3-E6,-3)
  F7 = ROUND(1000+F4+F5-F3-F6,-3)
  D8 = D3-D4-D5+D6+D7
  E8 = E3-E4-E5+E6+E7
  F8 = F3-F4-E5+F6+F7
  D10 = 4115
  <etc>

Note that the labels have been removed. Note also that we now have two versions of the spreadsheet on the go simultaneously, just as our Maple session had several versions of an expression

Stage 3

When MM loads a spreadsheet, it gives all attributes the same name as the cell they came from. We wanted to rename them to make the code easier to understand. Inspection suggested the labels in columns A and B might make meaningful names for the cells in columns D, E and F, so to test this, we sliced columns A and B out of the labels and then joined them:

  labels := panko without panko_delabelled;
  col_A := by_row( slice( labels, A1:A20 ) );
  col_B := by_row( slice( labels, B1:B20 ) );
  cols_AB := zip( col_A, col_B, append );
  cols_AB_cleaned := zip( cols_AB, mangle );

Here, the first line subtracts the spreadsheet without labels, built in Stage 2, from the entire spreadsheet, giving just the labels. The next two lines extract columns A and B (up to row 20, the final one of importance). The by_row function modifies slice's result to be a function from row number to label rather than from cell address (i.e. row and column) to label.

Next, in the fourth command, we zip together corresponding rows of columns A and B. We use MM's zip function, which runs down the columns passed in the first two arguments, taking corresponding cells from each and applying the third argument, itself a function, to them. Here, this function is append: it joins or appends two strings. So the effect is to append the A and B labels of each row.

We then had to clean up the results. The problem is that we want to use the labels as identifiers. As in most languages, these can contain letters and digits, but not such things as minus signs, brackets, and spaces. So we ran down the joined names, passing each through the mangle function. This "mangles" each name (the word is commonly used by compiler writers) to remove or replace unwanted characters. Thus the labels Cash Budget and Assume become Cash_BudgetAssume, for example. The next thing is to pair the cell names of column D with the new identifiers that replace them:

 
  renamings := make_map( by_row(col_D), cols_AB_cleaned );

giving us

  D3 > Cash_beginning
  D4 > Outflows_School_costs
  D5 > Living_costs
  D6 > Inflows_Loans
  D7 > Support_from_home
  D8 > Cash_end
  D10 > School_contractual_Tuition
  D11 > Fees
  D14 > Housing
  D15 > Insurance
  D17 > Food
  D18 > Entertainment
  D19 > Transportation
  D20 > Clothing

where the listing uses > to mean "maps to". So what this listing shows is meaningful names suggested for the cells in column D.

Stage 4

Columns D, E and F are obviously related, reporting the same quantities for the three school terms autumn ("fall"), spring and summer. A feature of MM - indeed, a key point - is that an attribute can name not just one cell as in the examples so far, but an entire collection of cells. So for example, assume we have a spreadsheet where column A represents the income of some company, with rows 1 (for 1990) up to 14 (for 2003). We could declare this in MM as

  <Income : [1990..2003]>

and write references such as Income[1995] in the equations.

In the previous paragraph, the multi-cell attribute Income had numeric indices. MM also allows symbolic indices akin to Pascal's enumerated types. We use these when building tables whose cells do not have a natural ordering; for example the branches of a department store or the colours a dress can be bought in. Thus we might have a business with branches in Oxford, Cambridge and London. To store the income for each branch at one particular time, we would declare:

  <Income : {Oxford,Cambridge,London}>

while to hold income for these branches from 1990 onwards, the attribute would gain a dimension, becoming:

  <Income : {Oxford,Cambridge,London} * [1990..2003]>

We now show how we applied multi-cell attributes and enumerated types to our running example. Since our renaming of the cells in column D made sense, we decided to do the same to columns E and F. But we also decided the three cells D1, E1 and F1 should be treated as part of one attribute, indexed by the symbolic indices fall, spring and summer. Thus any reference to D1 would be replaced by Cash_BudgetAssume[fall]; reference to E1 by Cash_BudgetAssume[spring]; and so on. We did the same for all the other attributes in these three columns.

  cols_DEF := slice( panko_delabelled, D1:F20 );
  cols_DEF_merged := merge( cols_DEF, cols_AB_cleaned, D:F, [fall,spring,summer] );

which gave:

  <B14 B15 B17 B18 B19 
   C14 C15 C17 C18 C19 C20 
   Cash_BudgetAssume[fall,spring,summer]
   Cash_beginning[fall,spring,summer] 
   Cash_end[fall,spring,summer]
   Clothing[fall,spring,summer] 
   D12 
   E12 
   Entertainment[fall,spring,summer]
   F12 
   Fees[fall,spring,summer] 
   Food[fall,spring,summer]
   Housing[fall,spring,summer] 
   Inflows_Loans[fall,spring,summer]
   Insurance[fall,spring,summer]
   Living_costs[fall,spring,summer] 
   Outflows_School_costs[fall,spring,summer]
   School_contractual_Tuition[fall,spring,summer] 
   Support_from_home[fall,spring,summer]
   Transportation[fall,spring,summer]
  >
  where
  Cash_BudgetAssume[fall] = 0.02
  Cash_beginning[fall] = 1000
  Cash_beginning[spring] = Cash_end[fall]
  Cash_beginning[summer] = Cash_end[spring]
  Cash_end[fall] = Cash_beginning[fall]-Outflows_School_costs[fall]-
    Living_costs[fall]+Inflows_Loans[fall]+Support_from_home[fall]
  Cash_end[spring] = Cash_beginning[spring]-Outflows_School_costs[spring]-
    Living_costs[spring]+Inflows_Loans[spring]+Support_from_home[spring]
  Cash_end[summer] = Cash_beginning[summer]-Outflows_School_costs[summer]-
    Living_costs[spring]+Inflows_Loans[summer]+Support_from_home[summer]
  Clothing[fall] = C20*20
  Clothing[spring] = ROUND(Clothing[fall]*(1+Cash_BudgetAssume[fall]),0)
  Clothing[summer] = ROUND(Clothing[spring]*(1+Cash_BudgetAssume[fall]),0)
  D12 = 300
  E12 = ROUND(D12*(1+Cash_BudgetAssume[fall]),0)
  Entertainment[fall] = C18*B18
  Entertainment[spring] = ROUND(Entertainment[fall]*(1+Cash_BudgetAssume[fall]),0)
  Entertainment[summer] = ROUND(Entertainment[spring]*(1+Cash_BudgetAssume[fall]),0)
  F12 = ROUND(E12*(1+Cash_BudgetAssume[fall]),0)
  Fees[fall] = 53
  Fees[spring] = 53
  Fees[summer] = 53
  Food[fall] = C17*B17
  Food[spring] = ROUND(Food[fall]*(1+Cash_BudgetAssume[fall]),0)
  Food[summer] = ROUND(Food[spring]*(1+Cash_BudgetAssume[fall]),0)
  Housing[fall] = C14*B14
  Housing[spring] = Housing[fall]
  Housing[summer] = Housing[fall]
  Inflows_Loans[fall] = 3000
  Inflows_Loans[spring] = 3000
  Inflows_Loans[summer] = 3000
  Insurance[fall] = C15*B15
  Insurance[spring] = Insurance[fall]
  Insurance[summer] = Insurance[fall]
  Living_costs[fall] = SUM(Housing[fall]:Clothing[fall])
  Living_costs[spring] = SUM(Housing[spring]:Clothing[spring])
  Living_costs[summer] = SUM(F12:Transportation[summer])
  Outflows_School_costs[fall] = SUM(School_contractual_Tuition[fall]:D12)
  Outflows_School_costs[spring] = SUM(School_contractual_Tuition[spring]:E12)
  Outflows_School_costs[summer] = SUM(School_contractual_Tuition[summer]:F12)
  School_contractual_Tuition[fall] = 4115
  School_contractual_Tuition[spring] = 4115
  School_contractual_Tuition[summer] = 4115
  Support_from_home[fall] = ROUND(1000+Outflows_School_costs[fall]+
    Living_costs[fall]-Cash_beginning[fall]-Inflows_Loans[fall],-3)
  Support_from_home[spring] = ROUND(1000+Outflows_School_costs[spring]+
    Living_costs[spring]-Cash_beginning[spring]-Inflows_Loans[spring],-3)
  Support_from_home[summer] = ROUND(1000+Outflows_School_costs[summer]+
    Living_costs[summer]-Cash_beginning[summer]-Inflows_Loans[summer],-3)
  Transportation[fall] = C19*B19
  Transportation[spring] = ROUND(Transportation[fall]+1+Cash_BudgetAssume[fall],0)
  Transportation[summer] = ROUND(Transportation[spring]*(1+Cash_BudgetAssume[fall]),0)

Not bad, is it?

Recall that we made the new attribute names by concatenating labels from corresponding rows in columns A and B. That isn't actually quite good enough, because the column A labels sometimes qualify more than one cell in column B. A4 - Outflows - appears to qualify both B4 and B5, so we should have prefixed it to both. Even so, our transformation has worked pretty well.

ERROR DETECTION

In developing MM, we want to make spreadsheets easier to read, and, of course, to check. While the main point of this particular paper is to introduce spreadsheet algebra, we ought also to say something about error detection. The spreadsheet in our example originated, as we mentioned, in Panko's work on detecting errors via code inspection, for which he seeded it with eight errors. Of these, four were in columns D, E and F:

E19=ROUND(D19+(1+$D$1),0). D19+ should be D19*.
F3=F4-F5+E6+E7. Should be F3=F4-E5+E6+E7.
D20=C20*20. Should be C20*B20.
F5=SUM(F12..F19). Should be SUM(F14..F20).

Has our listing, with its renamed cells, helped us detect them?

These formulae correspond to the equations for Transportation[spring], Cash_end[summer], Clothing[fall] and Living_costs[summer] in Stage 5 of the example. The first error is relatively easy to spot by inspection, helped by the way the listing groups together the elements of an attribute and makes it easy to compare the spring and summer equations. Likewise, the second error stands out because the right hand side refers to Living_costs[spring] and not Living_costs[summer]. The third error, C20*20 is not obvious (though it might be had we bothered to work out out a good replacement name for C20), but the fourth error stands out because the attribute F12 in the right hand side of Living_costs[summer] should, by comparison with the other elements of Living_costs, be Housing[summer].

SPREADSHEET ALGEBRA EXAMPLE: NEW SPREADSHEETS FROM OLD

Boolean algebra, linear algebra, and of course the ordinary kind of algebra that operates on numbers, all posess primitive operators such as set union, addition, or matrix multiplication, and fundamental laws that define how these behave. Boolean algebra has De Morgan's law; numerical algebra has the distributive law and others. So far, we haven't seen such things in spreadsheet algebra, where the functions have been ad-hocced up for convenience in reverse engineering: rename, merge, slice, detach_labels. We did, however, invent some, and the others are based on them. They include:

spreadsheet addition, which joins together two spreadsheets or MM programs, concatenating their attributes and equations:
```
  (<a,b,c> where a=b+c) plus (<d,e,f> where d=e*f)
    <a b c d e f
    >
    where
    a = b+c
    d = e*f  
```
spreadsheet extension, an analogue to drag-and-drop copying, which extends attributes and calculations over a range:
```
  (< a b c > where a=b+c) over [1..10]
    <a[1..10] b[1..10] c[1..10]
    >
    where
    a[all I] = b[I]+c[I]
```
spreadsheet division, which "divides" a set of identical attributes by their range:
```
  <a[1..10] b[1..10] c[1..10> / [1..10]
    <a b c>
```

This is how we use our operators to combine calculations and data from several spreadsheets into one:

  a := load("Company.slk");  
    <A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14
     B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12 B13 B14
     C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14
    >   
    where
    C1 = A1+B1
    C2 = A2+B2
    <etc>

  b := merge( a, "Income", A1:A14, [1990..2003 ] );
  b := merge( b, "Outgoings", B1:B14, [1990..2003 ] );
  b := merge( b, "Profit", C1:C14, [1990..2003 ] );
  b := roll( b, "Income", $+$ );
  b := roll( b, "Outgoings", $+$ );
  b := roll( b, "Profit", $+$ );
    <Income[1990..2003] Outgoings[1990..2003] Profit[1990..2003]
    >   
    where
    Profit[all T] = Income[T]-Outgoings[T]

  c := (b plus < MonthlyProfit: [1990..2003] >
               where MonthlyProfit[all Y] = Profit[Y] / 12
       );
    <Income[1990..2003] MonthlyProfit[1990..2003]
     Outgoings[1990..2003] Profit[1990..2003] 
    >   
    where
    Profit[all T] = Income[T]-Outgoings[T]
    MonthlyProfit[all Y] = Profit[Y]/12

Here, we first renamed and merged the cells in the input spreadsheet, and then added a MonthlyProfit attribute. The main point here is how we added the new attribute, something that would entail inserting a new column and cell-by-cell formulae if done directly in Excel. We did have to go to some effort to do the renaming and merging (i.e. to make the structure of the input spreadsheet explicit), but we plan various ways to simplify the process: by using formats, via user-defined spreadsheet algebra functions which, for example, would package up the above sequence of merges and rolls into one call; by automating equation rolling. Also, the user can always store the spreadsheet after it has been converted to an MM program, and start from that each time.

We next show a more complicated example where we load and combine three spreadsheets:

  a := load("company.slk"); <etc>
    <Income[1990..2003] Outgoings[1990..2003] Profit[1990..2003]
    >  
    where
    Profit[all T] = Income[T]-Outgoings[T]

  b := load("OxfordData.slk"); <etc>
    <Income[1990..2003] 
    > 
    where 
    Income = [ 347, 326, 291, 285, 279, 251, 277, 301, 340, 461, 312, 332, 324, 318 ]
    Outgoings = [ 213, 210, 225, 252, 274, 287, 291, 312, 345, 311, 341, 221, 219, 232 ]

  c := load("CambridgeData.slk"); <etc>
    <Income[1990..2003] 
    > 
    where 
    Income = [ 431, 455, 498, 501, 522, 526, 547, 561, 558, 412, 494, 504, 509, 517 ]
    Outgoings = [ 341, 358, 293, 287, 260, 287, 281, 315, 327, 363, 389, 290, 287, 281 ]

  d := (a on b) plus (a on c);
    <Income_1[1990..2003] Income_2[1990..2003] Outgoings_1[1990..2003] 
     Outgoings_2[1990..2003] Profit_1[1990..2003] Profit_2[1990..2003] 
    >
    where 
    Income_1 = [ 431, 455, 498, 501, 522, 526, 547, 561, 558, 412, 494, 504, 509, 517 ]
    Income_2 = [ 347, 326, 291, 285, 279, 251, 277, 301, 340, 461, 312, 332, 324, 318 ]
    Outgoings_1 = [ 341, 358, 293, 287, 260, 287, 281, 315, 327, 363, 389, 290, 287, 281 ]
    Outgoings_2 = [ 213, 210, 225, 252, 274, 287, 291, 312, 345, 311, 341, 221, 219, 232 ]
    Profit_1[all T] = Income_1[T]-Outgoings_1[T]
    Profit_2[all T] = Income_2[T]-Outgoings_2[T]

In this example, we loaded another income-outgoings-profit spreadsheet and reformatted it (with similar commands as in the previous example, here omitted to save space). We then loaded two two-column data spreadsheets, one for the Oxford branch of the business and one for the Cambridge branch. Then we glued all three together with the expression (a on b) plus (a on c). Each on superimposes attributes of the same name. So the first one copies the Oxford data into the income-outgoings-profit spreadsheet; the second copies the Cambridge data into a new copy of that spreadsheet. The net result is a spreadsheet with two copies of the calculations. With a little more effort, we can merge the attributes to make the branch an explicit index, similar to the season in our first example:

    <Income[1990..2003]*{Oxford,Cambridge} Outgoings[1990..2003]*{Oxford,Cambridge}
     Profit[1990..2003]*{Oxford,Cambridge}
    >
    where 
    Income = [ [ 431, 455, 498, 501, 522, 526, 547, 561, 558, 412, 494, 504, 509, 517 ],
               [ 347, 326, 291, 285, 279, 251, 277, 301, 340, 461, 312, 332, 324, 318 ] ]
    Outgoings = [ [ 341, 358, 293, 287, 260, 287, 281, 315, 327, 363, 389, 290, 287, 281 ],
                  [ 213, 210, 225, 252, 274, 287, 291, 312, 345, 311, 341, 221, 219, 232 ] ]
    Profit[all T,all B] = Income[T,B]-Outgoings[T,B]

Incidentally, we can also generate spreadsheet templates where the attributes' bounds (such as 1990 and 2003) are not wired in as above but passed as parameters. This process of generalisation enables us to produce a module that can be glued into any spreadsheet, regardless of the size of the row or column we are attaching it to.

SPREADSHEET ALGEBRA EXAMPLE: ADDING CODE FOR NUMERICAL DIFFERENTIATION

As a final example, we show an application inspired by work with Jack Ponton of Edinburgh University's Chemical Engineering Department. Suppose we have a spreadsheet that calculates the concentration x of some chemical at various timepoints, and we want also to see the rate of change of concentration, x_dot. If we assume the difference between timepoints is always the same - 1000, say - we can approximate the rate of change by just dividing this into the successive differences in concentration. To do this with MM simply entailed loading the spreadsheet (which, to save space in the listing, is just one column into which the concentrations can be typed as data), and adding to it a new attribute x_dot and the difference code:

  a := load( "Ponton.xls" );
    <A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
    >

  b := merge( a, "x", A1:A10, [1..10] );
    <x[1..10]
    >

  c := (b plus <x_dot[2..10]>) where 
       x_dot[all T] = (x[T] - x[T-1]) / 1000; 
    <x[1..10] x_dot[2..10]
    >
    where
    x_dot[all T] = (x[T]-x[T-1])/1000

We can do more ambitious transformations in the same way, for example adding various schemes for numerical integration. Since we can store these in libraries and parameterise them to fit any range of attribute, spreadsheet algebra gives us an extremely powerful generic way to add them to almost any spreadsheet we could want.

FURTHER WORK AND CONNECTIONS TO OTHER RESEARCH

Automatic structure discovery

Listings much bigger than our example would become tedious to inspect, and there is always a risk of missing errors even in small listings. Can we automatically compares formulae and flags a warning when they differ? Markus Clermont reported his research into this at EuSpRIG 2002; dissertation, [Clermont]. His program "stains" the raw spreadsheet, proposing sets of cells that, in our terminology, might be grouped to become elements of a single attribute. Some cells get proposed because they contain similar formulae, but he uses other means too, for example proposing that cells mentioned in a cell range should be grouped. In our example, this would be like saying that D14 to D20 should be considered for grouping, as should D10 to D12. In MM, we have independently experimented with something similar, using unification and inverse unification as a way to detect related formulae. [Our next paper in this series, on structure discovery and logic programming, takes this further.]

We believe there are other methods worth investigating. Discrete Fourier transforms might help discover repeating patterns of related but non-contiguous cells like these:

  XYZ      XYZ      XYZ

  XYZ     

  XYZ     

  XYZ

It might be worth looking in general into the perceptual techniques developed in computer vision, and certainly at blackboard systems and other methods that artificial intelligence has developed for fusing knowledge from different sources.

Another possibility is searching with parallel terraced scans, a technique Hofstadter's Fluid Analogy Research Group have used in solving analogy problems [Hofstadter,Hofstadter (2)], and one we believe promising for resolving tensions between different ways to view a problem. With spreadsheets, one such tension is that between lumping and splitting: grouping too coarsely (the extreme being to cover the entire worksheet with just one attribute) versus dividing too finely (the extreme being to split the spreadsheet into one attribute for every cell).

Perhaps the most promising technique is inductive logic programming [ILP]. This learns, given logical descriptions of examples of a concept, general rules describing the concept itself. Unlike neural nets and other machine learning techniques, the learnt concepts are easy to translate into English and to use in programs. ILP has some very convincing industrial applications - drug companies use it to learn the relation between biological activity and 3D structure in vast collections of molecules - and we believe it could be invalauable in learning libraries of common patterns. This is something we are experimenting with.

A standard for interchange of structure-discovery data

There are so many possible structure-discovery techniques that we are unlikely to find them all in the same program. This means we need a standard representation of patterns, errors and other information, and a way to share it between tools. What kind of information should be exchanged? We suggest:

High-level specification of what a spreadsheet, or regions therein, is doing.
Semantically related regions and the reasons for believing them to be related, together with a description of how they map onto the worksheet surface.
Names, derived from labels, proposed for regions. Inductive logic programming will be useful for learning from example where authorss tend to put labels in relation to the cells they name.
Errors, such as a cell where an equation "ought" to resemble those around it but doesn't.
Proposed corrections for errors.

The algebra language

Is our algebra language appropriate? We had to give the user a means of performing element-by-element manipulation of rows and columns, as when appending the cells in columns A and B to make attribute names. We provided a functional language, partly because we liked it and partly because it mapped cleanly onto our implementation language Kawa, a dialect of Scheme. However, this requires expertise the average user will certainly lack.

It is worth asking whether a graphical interface might help - something along the lines of MINOS: Steps Towards a Syntax-free Computer Algebra Program [Tintarev, 1999], where the author adopts a desktop metaphor, with different icons on the desktop representing different expressions to be manipulated.

Incremental reverse-engineering

We are working on algorithms by which a user can compile MM code to a spreadsheet, change the compiled spreadsheet, then calculate the equivalent change to the MM program. This would be useful if you develop a spreadsheet in MM and then want to change cell styles and formats from Excel, having the changes reflected back to your MM program. Doing so involves some reverse-engineering of the kind described earlier, but under constraints that the new MM program constructed by decompiling the modified spreadsheet resemble the original as much as possible. In general, this is not easy because changes to the spreadsheet may cut across parts of the MM program, rather as though we were to translate a sentence into French, throw a random eror into it, and then try to translate the result back into an English sentence with the "same" error. We are experimenting with an appropriate formalisation of the notion of "part", so that if we change something in the compiled spreadsheet, we can find the minimal enclosing entity in the MM code that it is a part of, and adjust the code accordingly. This may be a new approach to reverse engineering.

REFERENCES

Jonathan Bowen. Recent Developments in the Algebra of Programs, http://vmoc.museophile.com/algebra/section3_6.html, in A Brief History of Algebra and Computing: An Eclectic Oxonian View, http://vmoc.museophile.com/algebra/algebra.html.

Markus Clermont. A Scalable Approach to Spreadsheet Visualisation, http://www.ifi.uni-klu.ac.at/Publications/Dissertations/pubfiles/pdffiles/2003-0175-MC.pdf, Dissertation, Fakultät für Wirtschaftswissenschaften und Informatik, Universität Klagenfurt.

Thomas Grossman. Spreadsheet Queueing Simulation Templates, http://www.ucalgary.ca/%7Egrossman/simulation/index.html.

Dave Hart and Clinton Wolfe. Basic Algebra in Maple, http://www.indiana.edu/~statmath/math/maple/index.html.

Douglas Hofstadter. Analogies and Roles in Human and Machine Thinking, in Metamagical Themas: Questing for the Essence of Mind and Pattern, Basic Books, 1996.

Douglas Hofstadter. Fluid Concepts and Creative Analogies: Computer Models of the Fundamental Mechanisms of Thought, Penguin, 1998.

Inductive Logic Programming, http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/ilp.html.

Jocelyn Ireson-Paine. Ensuring Spreadsheet Integrity with Model Master, http://www.j-paine.org/eusprig2001_as_html/eusprig2001.html, Proceedings of EuSpRIG 2001, Amsterdam 5-6 July 2001.

Raymond R. Panko. Applying Code Inspection to Spreadsheet Testing, http://panko.cba.hawaii.edu/papers/ss/2phase.htm.

Kyril Tintarev. MINOS: Steps Towards a Syntax-free Computer Algebra Program, http://science.uniserve.edu.au/pubs/callab/vol3/tintarev.html, CAL-laborate, Volume 3, October 1999.

Eelco Visser et. al. The Program Transformation Wiki, http://www.program-transformation.org/twiki/bin/view/Transform.