[ Jocelyn Ireson-Paine's Home Page | Publications | Dobbs Code Talk Index | Dobbs Blog Version ]

How to Reveal Implicit Structure in Spreadsheets

For a conference, I once drew a cartoon about spreadsheets. It showed a City gent staring despairingly at an Excel formula while behind him, Excel developers in Canary Wharf fretted. Like this:

In the cartoon, I'd assumed it was his own spreadsheet, but it could have been someone else's "legacy" spreadsheet that he was trying to decipher — and that's what I'm going to write about. My mind has been running on the topic since a colleague asked me to explain my work on what I call "structure discovery": making explicit the things an author knows about his spreadsheet but didn't write down.

To demonstrate, I've written a tiny cash-flow spreadsheet, pictured below. Its formulae are realistic, but I've reduced the number of rows and simplified the currency and date formats so that it doesn't take up too much space on the screen:

The first thing one needs is a listing of the spreadsheet's formulae. Oddly, Excel doesn't provide this; but below, I do so by reading the spreadsheet into a reverse-engineering tool I've developed. This parses the Excel file, extracts the formulae, associates them with their cells, and lists them. That's something that will be useful again later on, when I go on to transform the formulae to make them easier to read. But to start with, here are the formulae without any processing:

Cells
  Sheet1!A1:F12
where:
  Sheet1!B1 = "Simple cash-flow spreadsheet"
  Sheet1!B2 = "Demonstrates Jocelyn Ireson-Paine's reverse-engineering software"
  Sheet1!F4 = "Opening cash"
  Sheet1!F5 = 1000
  Sheet1!B7 = "Month"
  Sheet1!C7 = "Income"
  Sheet1!D7 = "Expenses"
  Sheet1!E7 = "Net profit"
  Sheet1!F7 = "Cash at start of period"
  Sheet1!B8 = 1
  Sheet1!C8 = 120
  Sheet1!D8 = 27
  Sheet1!E8 = Sheet1!C8-Sheet1!D8
  Sheet1!F8 = Sheet1!F5
  Sheet1!B9 = 2
  Sheet1!C9 = 120
  Sheet1!D9 = 41
  Sheet1!E9 = Sheet1!C9-Sheet1!D9
  Sheet1!F9 = Sheet1!F8+Sheet1!E8
  Sheet1!B10 = 3
  Sheet1!C10 = 150
  Sheet1!D10 = 64
  Sheet1!E10 = Sheet1!C10-Sheet1!D10
  Sheet1!F10 = Sheet1!F9+Sheet1!E9
  Sheet1!A12 = "Totals"
  Sheet1!C12 = SUM(Sheet1!C8:Sheet1!C10)
  Sheet1!D12 = SUM(Sheet1!D8:Sheet1!D10)
  Sheet1!E12 = SUM(Sheet1!E8:Sheet1!E10)

Having the formulae in a single listing I can print is nice, but it would be even better if the cell addresses were replaced by meaningful names. So suppose again that the cash-flow spreadsheet is someone else's that I'm trying to decipher, and suppose also that from its layout, cell colouring, and labels, I've guessed which cells belong together. For example, suppose I've guessed that cells B8:B10 are successive months, and that C8:C10 are successive incomes. I could summarise my guesses like this:

'Sheet1'!F5      is opening_cash.
'Sheet1'!B8:B10  is months.
'Sheet1'!C8:C10  is income.
'Sheet1'!D8:D10  is expenses.
'Sheet1'!E8:E10  is net_profit.
'Sheet1'!F8:F10  is cash_at_start_of_period.
'Sheet1'!D12     is total_expenses.
'Sheet1'!C12:C12 is total_income.
'Sheet1'!E12:E12 iso total_profit.

Equipped with such a mapping from addresses to names, one could then regard each group of cells as an array. For example, B8:B10 would consist of the elements months[1], months[2] and months[3]. Similarly, C8:C10 would consist of income[1], income[2], and income[3]. It's then fairly simple to rewrite each formula, replacing cell addresses in it by named array elements. I've implemented such transformations in my reverse-engineering software, and below, apply them to the cash-flow spreadsheet. Notice how much more intelligible it becomes:

Cells
  total_profit[1]
  total_expenses[1]
  total_income[1]
  Sheet1!A1:F12
  cash_at_start_of_period[1:3]
  net_profit[1:3]
  expenses[1:3]
  income[1:3]
  months[1:3]
  opening_cash[1]
where:
  Sheet1!B1 = "Simple cash-flow spreadsheet"
  Sheet1!B2 = "Demonstrates Jocelyn Ireson-Paine's reverse-engineering software"
  Sheet1!F4 = "Opening cash"
  Sheet1!B7 = "Month"
  Sheet1!C7 = "Income"
  Sheet1!D7 = "Expenses"
  Sheet1!E7 = "Net profit"
  Sheet1!F7 = "Cash at start of period"
  Sheet1!A12 = "Totals"
  cash_at_start_of_period[1] = opening_cash[1]
  cash_at_start_of_period[2] = cash_at_start_of_period[1]+net_profit[1]
  cash_at_start_of_period[3] = cash_at_start_of_period[2]+net_profit[2]
  expenses[1] = 27
  expenses[2] = 41
  expenses[3] = 64
  income[1] = 120
  income[2] = 120
  income[3] = 150
  months[1] = 1
  months[2] = 2
  months[3] = 3
  net_profit[1] = income[1]-expenses[1]
  net_profit[2] = income[2]-expenses[2]
  net_profit[3] = income[3]-expenses[3]
  opening_cash[1] = 1000
  total_expenses[1] = SUM(expenses[1]:expenses[3])
  total_income[1] = SUM(income[1]:income[3])
  total_profit[1] = SUM(net_profit[1]:net_profit[3])

Although such a listing is more intelligible than the original formulae, it can become tedious and very long if the spreadsheet has lots of repeated formulae such as those for net_profit. To overcome this, I've implemented a "run detector". This compares each formula with the ones next to it, looking to see whether it's either identical or differs by 1 in a cell address or constant. When it finds a run of two or more formulae that are thus related, it summarises them by factoring out the common parts and replacing these by a variable, written here as V0. Here's output from the run-detector applied to the cash-flow spreadsheet:

total_profit[1] = SUM(net_profit[1]:net_profit[3]).
total_expenses[1] = SUM(expenses[1]:expenses[3]).
total_income[1] = SUM(income[1]:income[3]).
Sheet1!A12 = "Totals".
Sheet1!B1 = "Simple cash-flow spreadsheet".
Sheet1!B2 = "Demonstrates Jocelyn Ireson-Paine's reverse-engineering software".
Sheet1!B7 = "Month".
Sheet1!C7 = "Income".
Sheet1!D7 = "Expenses".
Sheet1!E7 = "Net profit".
Sheet1!F4 = "Opening cash".
Sheet1!F7 = "Cash at start of period".
cash_at_start_of_period[1] = opening_cash[1].
cash_at_start_of_period[V0>1] = 
cash_at_start_of_period[V0-1]+net_profit[V0-1].
net_profit[V0] = income[V0]-expenses[V0].
expenses[1] = 27.
expenses[2] = 41.
expenses[3] = 64.
income[V0<3] = 120.
income[3] = 150.
months[V0] = V0.
opening_cash[1] = 1000.

In case this isn't clear, the equation months[V0] = V0 means that for all V0 within the bounds of the months table, the V0 month is V0. That is, the first element of the months table is 1, the second is 2, and so on. Similarly, the equations for cash_at_start_of_period say that the first element of cash_at_start_of_period is equal to opening_cash, and that each subsequent element is the value for the previous month plus that month's net profit. This, by the way, is a common pattern in spreadsheets: an initialisation formula followed by a run of formulae that pass information from one row to the next. In effect, it's recursion, but over space rather than time.

The run-detector also works on "raw" spreadsheet listings: those without name substitutions. Here it is, applied again to the cash-flow spreadsheet:

Sheet1!A12 = "Totals".
Sheet1!B1 = "Simple cash-flow spreadsheet".
Sheet1!B2 = "Demonstrates Jocelyn Ireson-Paine's reverse-engineering software".
Sheet1!B7 = "Month".
Sheet1!B[V1 in 8:10] = V1-7.
Sheet1!C7 = "Income".
Sheet1!C[8:9] = 120.
Sheet1!C10 = 150.
Sheet1![V0 in C:E]12 = SUM(Sheet1![V0]8:Sheet1![V0]10).
Sheet1!D7 = "Expenses".
Sheet1!D8 = 27.
Sheet1!D9 = 41.
Sheet1!D10 = 64.
Sheet1!E7 = "Net profit".
Sheet1!E[V1 in 8:10] = Sheet1!C[V1]-Sheet1!D[V1].
Sheet1!F4 = "Opening cash".
Sheet1!F5 = 1000.
Sheet1!F7 = "Cash at start of period".
Sheet1!F8 = Sheet1!F5.
Sheet1!F[V1 in 9:10] = Sheet1!F[V1-1]+Sheet1!E[V1-1].

I had to invent a notation for variables that range over row and column numbers. This is illustrated by the line Sheet1!B[V1 in 8:10] = V1-7. This says that Sheet1!B8 = 1, and that Sheet1!B9 = 2, and so on. Similarly, the left-hand side Sheet1![V0 in C:E]12 says that V0 ranges over column addresses C to E. And on its right-hand side, the subexpression Sheet1![V0]8 denotes Sheet1!C8 or Sheet1!D8 or Sheet1!E8.

Such a listing is useful because it helps discover runs of related formulae, and hence determine which cells are part of the same group. For example, it refers repeatedly to rows 8:10 and 9:10. If we didn't already know that the cells in these rows of each column were related, the listing would be a good hint. That isn't actually necessary in my cash-flow example, where the grouping is obvious by eyeballing the spreadsheet, but it can be very useful. I once worked on a 10,000-cell 200-sheet monster of a housing-finance spreadsheet which contained some 60 20×40 interrelated tables. The run-detector was invaluable in showing which formulae were related. 60 × 20×40 is 48,000 cells; I was able to reduce these to about 120 equations. I say 120, because each table tended to be defined by two sets of formulae: one set to initialise its first row, and one to progress down subsequent rows, in a similar way to the recursion mentioned above. The formulae within each set differed only by column address, and hence the run-detector could summarise them into one equation.

Finally, here is a spreadsheet that I've seeded with errors:

This spreadsheet looks like the one I showed earlier, but Excel is not happy with it, which it signals with the little green triangles in cells E9 and D12. It displays these when it thinks a formula is out of step with those around it. To find out what's wrong, I listed the formulae with name substitutions again; and three errors become apparent. The value for cash_at_start_of_period[2] is a hard-wired constant rather than a formula. The formula for net_profit[2] subtracts from income[1] and not income[2], an error hard to see by inspecting the spreadsheet because both cells have the same value. And the SUM making up total_expenses[1] has run off the end of its range:

Cells
  total_profit[1]
  total_expenses[1]
  total_income[1]
  Sheet1!A1:F12
  cash_at_start_of_period[1:3]
  net_profit[1:3]
  expenses[1:3]
  income[1:3]
  months[1:3]
  opening_cash[1]
where:
  Sheet1!B1 = "Simple cash-flow spreadsheet"
  Sheet1!B2 = "Demonstrates Jocelyn Ireson-Paine's reverse-engineering software"
  Sheet1!F4 = "Opening cash"
  Sheet1!B7 = "Month"
  Sheet1!C7 = "Income"
  Sheet1!D7 = "Expenses"
  Sheet1!E7 = "Net profit"
  Sheet1!F7 = "Cash at start of period"
  Sheet1!A12 = "Totals"
  cash_at_start_of_period[1] = opening_cash[1]
  cash_at_start_of_period[2] = 1093 Error
  cash_at_start_of_period[3] = cash_at_start_of_period[2]+net_profit[2]
  expenses[1] = 27
  expenses[2] = 41
  expenses[3] = 64
  income[1] = 120
  income[2] = 120
  income[3] = 150
  months[1] = 1
  months[2] = 2
  months[3] = 3
  net_profit[1] = income[1]-expenses[1]
  net_profit[2] = income[1]-expenses[2] Error
  net_profit[3] = income[3]-expenses[3]
  opening_cash[1] = 1000
  total_expenses[1] = SUM(expenses[1]:Sheet1!D11) Error
  total_income[1] = SUM(income[1]:income[3])
  total_profit[1] = SUM(net_profit[1]:net_profit[3])

In bigger spreadsheets, the "out-of-step" errors would also become evident when I list by runs. The spreadsheet here doesn't really have enough related formulae for this to happen. However, the following extract from a by-runs listing does show that something is amiss with net_profit:

net_profit[V0<3] = income[1]-expenses[V0].
net_profit[3] = income[3]-expenses[3].

All three kinds of error, by the way, have occurred in real spreadsheets. I've got repeated formulae wrong in my own spreadsheets, though I've always caught the error before the spreadsheets went live. I seem to remember horror stories of naïve spreadsheeters overtyping a formula with the number it has calculated, to "save the computer work"; and other anecdotes of the same being done to defraud. And errors in SUM ranges are notorious. SF author and critic David Langford once received only half the royalties he was due because a SUM in his publisher's royalties spreadsheet missed some cells. And the European Spreadsheet Risks Interest Group has a horror story (search for "$38K spreadsheet error") of a hospital's Medicare bad-debts spreadsheet whose SUM ranges were misprogrammed. Because of this, the hospital claimed $38,240 more to compensate it for bad debts than it should have.

Making existing spreadsheets easier to understand is one important aspect of spreadsheet safety, and is the main point of this posting. But it's also important to avoid bugs when writing new spreadsheets. I've worked on this too, and I'll finish with a summary of the work. It's based on a language I've developed called Excelsior, which can compile listings like those above into spreadsheets. This means that you can write meaningful identifiers rather than cell addresses. Separate layout statements in the program specify how these identifiers map to cells.

Excelsior also lets you write reuseable modules that can be shared between different spreadsheets. In other words, code reuse. That's something lacking from Excel, where to share code between spreadsheets, you have to copy cells from one spreadsheet and paste into another. And then, of course, you end up having to maintain several copies of your formulae.

The following links give more information on this work. In How to Document a Spreadsheet: an Exercise with Cash-Flow and Loans, I show how to develop, in Excelsior, a spreadsheet that represents several loans, and for each month, decides which is best to borrow from. The algorithm is fairly simple, but it's subtle, and it would be tricky to get right if coding directly in Excel. Having the formulae all available in front of you, with sensible cell names, makes it much easier.

A much longer exercise of the same kind is Gliders, Hasslers, and the Toadsucker: Writing and Explaining a Structured Excel Life Game. In it, I show how to develop a spreadsheet that implements a well-known cellular automaton, John Conway's game of Life. I wrote my article for maths teachers who might want to introduce secondary-school students to Life. Having the primary source code be an Excelsior listing, rather than a spreadsheet, makes it much easier to show students the equations that update the cellular automaton.

Rather different is the science-fiction generator at Ireson-Ireson-Paine Spreadsheet Parts Repository Excel science-fiction generator. I based this on a jokey diagram in an old SF anthology, Gahan Wilson's The Science Fiction Horror Movie Pocket Computer from Harry Harrison and Brian Aldiss's The Year's Best Science Fiction No. 5. It's a familiar idea: a network of linked boxes where each box contains a plot event such as "Earth falls into the Sun" or "And so everybody dies". The links between boxes indicate which events can follow which. My spreadsheet implements this as a transition graph, which it recurses over to generate a story. I wrote the spreadsheet entirely in Excelsior; the recursion would have been very tricky to program if using only Excel.

Those examples use the idea that one can avoid bugs by generating spreadsheets from easily readable programs. Another way to avoid bugs is to build spreadsheets from prefabricated components. That's demonstrated in Spreadsheet Components, Google Spreadsheets, and Code Reuse, where I apply it to Google's online spreadsheet, enabling users to load library formulae into it in the same way they can load charts. The use of modules is an important part of spreadsheet safety, but much neglected.