[ Jocelyn Ireson-Paine's Home Page | Publications | Dobbs Code Talk Index | Dobbs Blog Version ]

# How to Reveal Implicit Structure in Spreadsheets

For a conference, I once drew a cartoon about spreadsheets. It showed a City gent staring despairingly at an Excel formula while behind him, Excel developers in Canary Wharf fretted. Like this:

In the cartoon, I'd assumed it was his own spreadsheet, but it could have been someone else's "legacy" spreadsheet that he was trying to decipher — and that's what I'm going to write about. My mind has been running on the topic since a colleague asked me to explain my work on what I call "structure discovery": making explicit the things an author knows about his spreadsheet but didn't write down.

To demonstrate, I've written a tiny cash-flow spreadsheet, pictured below. Its formulae are realistic, but I've reduced the number of rows and simplified the currency and date formats so that it doesn't take up too much space on the screen:

The first thing one needs is a listing of the spreadsheet's formulae. Oddly, Excel doesn't provide this; but below, I do so by reading the spreadsheet into a reverse-engineering tool I've developed. This parses the Excel file, extracts the formulae, associates them with their cells, and lists them. That's something that will be useful again later on, when I go on to transform the formulae to make them easier to read. But to start with, here are the formulae without any processing:

```Cells
Sheet1!A1:F12
where:
Sheet1!B2 = "Demonstrates Jocelyn Ireson-Paine's reverse-engineering software"
Sheet1!F4 = "Opening cash"
Sheet1!F5 = 1000
Sheet1!B7 = "Month"
Sheet1!C7 = "Income"
Sheet1!D7 = "Expenses"
Sheet1!E7 = "Net profit"
Sheet1!F7 = "Cash at start of period"
Sheet1!B8 = 1
Sheet1!C8 = 120
Sheet1!D8 = 27
Sheet1!E8 = Sheet1!C8-Sheet1!D8
Sheet1!F8 = Sheet1!F5
Sheet1!B9 = 2
Sheet1!C9 = 120
Sheet1!D9 = 41
Sheet1!E9 = Sheet1!C9-Sheet1!D9
Sheet1!F9 = Sheet1!F8+Sheet1!E8
Sheet1!B10 = 3
Sheet1!C10 = 150
Sheet1!D10 = 64
Sheet1!E10 = Sheet1!C10-Sheet1!D10
Sheet1!F10 = Sheet1!F9+Sheet1!E9
Sheet1!A12 = "Totals"
Sheet1!C12 = SUM(Sheet1!C8:Sheet1!C10)
Sheet1!D12 = SUM(Sheet1!D8:Sheet1!D10)
Sheet1!E12 = SUM(Sheet1!E8:Sheet1!E10)
```

Having the formulae in a single listing I can print is nice, but it would be even better if the cell addresses were replaced by meaningful names. So suppose again that the cash-flow spreadsheet is someone else's that I'm trying to decipher, and suppose also that from its layout, cell colouring, and labels, I've guessed which cells belong together. For example, suppose I've guessed that cells B8:B10 are successive months, and that C8:C10 are successive incomes. I could summarise my guesses like this:

```'Sheet1'!F5      is opening_cash.
'Sheet1'!B8:B10  is months.
'Sheet1'!C8:C10  is income.
'Sheet1'!D8:D10  is expenses.
'Sheet1'!E8:E10  is net_profit.
'Sheet1'!F8:F10  is cash_at_start_of_period.
'Sheet1'!D12     is total_expenses.
'Sheet1'!C12:C12 is total_income.
'Sheet1'!E12:E12 iso total_profit.
```

Equipped with such a mapping from addresses to names, one could then regard each group of cells as an array. For example, B8:B10 would consist of the elements `months[1]`, `months[2]` and `months[3]`. Similarly, C8:C10 would consist of `income[1]`, `income[2]`, and `income[3]`. It's then fairly simple to rewrite each formula, replacing cell addresses in it by named array elements. I've implemented such transformations in my reverse-engineering software, and below, apply them to the cash-flow spreadsheet. Notice how much more intelligible it becomes:

```Cells
total_profit[1]
total_expenses[1]
total_income[1]
Sheet1!A1:F12
cash_at_start_of_period[1:3]
net_profit[1:3]
expenses[1:3]
income[1:3]
months[1:3]
opening_cash[1]
where:
Sheet1!B2 = "Demonstrates Jocelyn Ireson-Paine's reverse-engineering software"
Sheet1!F4 = "Opening cash"
Sheet1!B7 = "Month"
Sheet1!C7 = "Income"
Sheet1!D7 = "Expenses"
Sheet1!E7 = "Net profit"
Sheet1!F7 = "Cash at start of period"
Sheet1!A12 = "Totals"
cash_at_start_of_period[1] = opening_cash[1]
cash_at_start_of_period[2] = cash_at_start_of_period[1]+net_profit[1]
cash_at_start_of_period[3] = cash_at_start_of_period[2]+net_profit[2]
expenses[1] = 27
expenses[2] = 41
expenses[3] = 64
income[1] = 120
income[2] = 120
income[3] = 150
months[1] = 1
months[2] = 2
months[3] = 3
net_profit[1] = income[1]-expenses[1]
net_profit[2] = income[2]-expenses[2]
net_profit[3] = income[3]-expenses[3]
opening_cash[1] = 1000
total_expenses[1] = SUM(expenses[1]:expenses[3])
total_income[1] = SUM(income[1]:income[3])
total_profit[1] = SUM(net_profit[1]:net_profit[3])
```

Although such a listing is more intelligible than the original formulae, it can become tedious and very long if the spreadsheet has lots of repeated formulae such as those for `net_profit`. To overcome this, I've implemented a "run detector". This compares each formula with the ones next to it, looking to see whether it's either identical or differs by 1 in a cell address or constant. When it finds a run of two or more formulae that are thus related, it summarises them by factoring out the common parts and replacing these by a variable, written here as `V0`. Here's output from the run-detector applied to the cash-flow spreadsheet:

```total_profit[1] = SUM(net_profit[1]:net_profit[3]).
total_expenses[1] = SUM(expenses[1]:expenses[3]).
total_income[1] = SUM(income[1]:income[3]).
Sheet1!A12 = "Totals".
Sheet1!B2 = "Demonstrates Jocelyn Ireson-Paine's reverse-engineering software".
Sheet1!B7 = "Month".
Sheet1!C7 = "Income".
Sheet1!D7 = "Expenses".
Sheet1!E7 = "Net profit".
Sheet1!F4 = "Opening cash".
Sheet1!F7 = "Cash at start of period".
cash_at_start_of_period[1] = opening_cash[1].
cash_at_start_of_period[V0>1] =
cash_at_start_of_period[V0-1]+net_profit[V0-1].
net_profit[V0] = income[V0]-expenses[V0].
expenses[1] = 27.
expenses[2] = 41.
expenses[3] = 64.
income[V0<3] = 120.
income[3] = 150.
months[V0] = V0.
opening_cash[1] = 1000.
```

In case this isn't clear, the equation `months[V0] = V0` means that for all `V0` within the bounds of the `months` table, the `V0` month is `V0`. That is, the first element of the `months` table is 1, the second is 2, and so on. Similarly, the equations for `cash_at_start_of_period` say that the first element of `cash_at_start_of_period` is equal to `opening_cash`, and that each subsequent element is the value for the previous month plus that month's net profit. This, by the way, is a common pattern in spreadsheets: an initialisation formula followed by a run of formulae that pass information from one row to the next. In effect, it's recursion, but over space rather than time.

The run-detector also works on "raw" spreadsheet listings: those without name substitutions. Here it is, applied again to the cash-flow spreadsheet:

```Sheet1!A12 = "Totals".
Sheet1!B2 = "Demonstrates Jocelyn Ireson-Paine's reverse-engineering software".
Sheet1!B7 = "Month".
Sheet1!B[V1 in 8:10] = V1-7.
Sheet1!C7 = "Income".
Sheet1!C[8:9] = 120.
Sheet1!C10 = 150.
Sheet1![V0 in C:E]12 = SUM(Sheet1![V0]8:Sheet1![V0]10).
Sheet1!D7 = "Expenses".
Sheet1!D8 = 27.
Sheet1!D9 = 41.
Sheet1!D10 = 64.
Sheet1!E7 = "Net profit".
Sheet1!E[V1 in 8:10] = Sheet1!C[V1]-Sheet1!D[V1].
Sheet1!F4 = "Opening cash".
Sheet1!F5 = 1000.
Sheet1!F7 = "Cash at start of period".
Sheet1!F8 = Sheet1!F5.
Sheet1!F[V1 in 9:10] = Sheet1!F[V1-1]+Sheet1!E[V1-1].
```

I had to invent a notation for variables that range over row and column numbers. This is illustrated by the line `Sheet1!B[V1 in 8:10] = V1-7`. This says that `Sheet1!B8 = 1`, and that `Sheet1!B9 = 2`, and so on. Similarly, the left-hand side `Sheet1![V0 in C:E]12` says that `V0` ranges over column addresses C to E. And on its right-hand side, the subexpression `Sheet1![V0]8` denotes `Sheet1!C8` or `Sheet1!D8` or `Sheet1!E8`.

Such a listing is useful because it helps discover runs of related formulae, and hence determine which cells are part of the same group. For example, it refers repeatedly to rows 8:10 and 9:10. If we didn't already know that the cells in these rows of each column were related, the listing would be a good hint. That isn't actually necessary in my cash-flow example, where the grouping is obvious by eyeballing the spreadsheet, but it can be very useful. I once worked on a 10,000-cell 200-sheet monster of a housing-finance spreadsheet which contained some 60 20×40 interrelated tables. The run-detector was invaluable in showing which formulae were related. 60 × 20×40 is 48,000 cells; I was able to reduce these to about 120 equations. I say 120, because each table tended to be defined by two sets of formulae: one set to initialise its first row, and one to progress down subsequent rows, in a similar way to the recursion mentioned above. The formulae within each set differed only by column address, and hence the run-detector could summarise them into one equation.

Finally, here is a spreadsheet that I've seeded with errors:

This spreadsheet looks like the one I showed earlier, but Excel is not happy with it, which it signals with the little green triangles in cells E9 and D12. It displays these when it thinks a formula is out of step with those around it. To find out what's wrong, I listed the formulae with name substitutions again; and three errors become apparent. The value for `cash_at_start_of_period[2]` is a hard-wired constant rather than a formula. The formula for `net_profit[2]` subtracts from `income[1]` and not `income[2]`, an error hard to see by inspecting the spreadsheet because both cells have the same value. And the SUM making up `total_expenses[1]` has run off the end of its range:

```Cells
total_profit[1]
total_expenses[1]
total_income[1]
Sheet1!A1:F12
cash_at_start_of_period[1:3]
net_profit[1:3]
expenses[1:3]
income[1:3]
months[1:3]
opening_cash[1]
where:
Sheet1!B2 = "Demonstrates Jocelyn Ireson-Paine's reverse-engineering software"
Sheet1!F4 = "Opening cash"
Sheet1!B7 = "Month"
Sheet1!C7 = "Income"
Sheet1!D7 = "Expenses"
Sheet1!E7 = "Net profit"
Sheet1!F7 = "Cash at start of period"
Sheet1!A12 = "Totals"
cash_at_start_of_period[1] = opening_cash[1]
cash_at_start_of_period[2] = 1093 Error
cash_at_start_of_period[3] = cash_at_start_of_period[2]+net_profit[2]
expenses[1] = 27
expenses[2] = 41
expenses[3] = 64
income[1] = 120
income[2] = 120
income[3] = 150
months[1] = 1
months[2] = 2
months[3] = 3
net_profit[1] = income[1]-expenses[1]
net_profit[2] = income[1]-expenses[2] Error
net_profit[3] = income[3]-expenses[3]
opening_cash[1] = 1000
total_expenses[1] = SUM(expenses[1]:Sheet1!D11) Error
total_income[1] = SUM(income[1]:income[3])
total_profit[1] = SUM(net_profit[1]:net_profit[3])
```

In bigger spreadsheets, the "out-of-step" errors would also become evident when I list by runs. The spreadsheet here doesn't really have enough related formulae for this to happen. However, the following extract from a by-runs listing does show that something is amiss with `net_profit`:

```net_profit[V0<3] = income[1]-expenses[V0].
net_profit[3] = income[3]-expenses[3].
```