Programmer Documentation for the SocialCalc Engine

Revision 8, 17 July 2007
Originally authored by Dan Bricklin


This document explains the SocialCalc Engine. It starts with an overview of the main routines and data structures, and then dives into the major routines, explaining their functionality and the related routines. It includes sample data and code.

The main files for the SocialCalc Engine are DataFiles.pm, SheetFunctions.pm, and Sheet.pm. They are in the /lib/SocialCalc directory.

DataFiles.pm is mainly used to support the actual reading, writing, and listing of files, and the operations on the data file tree maintained by SocialCalc. It relates to uploading the files to servers, maintaining a list of users, etc. Many implementations of spreadsheet functionality can be created without need for this module. The actual encoding and decoding of the spreadsheet data into files is performed by other modules. DataFiles.pm is not discussed further in this document.

The spreadsheet formula functions, such as SUM, COUNT, IRR, SIN, and COS, are provided by the routines in SheetFunctions.pm. The definition of the functions tries to follow the computational specification for those functions in the Open Formula Format of the Open Document Format for Office Applications (http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office-formula).

Most of the rest of the code to implement the main internal functionality of a spreadsheet are in Sheet.pm. This functionality includes the parsing and creating of save format data, sheet manipulation commands, rendering, and recalculation.


Understanding %sheetdata -- the key data structure

Sheet.pm is most concerned with the in-memory representation of a spreadsheet. This data is stored in the form of a single hash, usually called %sheetdata (or something similar if there is more than one instance). %sheetdata holds the associated data for a single 2-dimensional grid of cells.

Understanding how the cell data is stored in %sheetdata is key to understanding the operation of many of the routines in Sheet.pm, so we will cover it first. (A detailed specification of the structure of %sheetdata is in the comments at the beginning of the parse_sheet_save function.)

The main data in %sheetdata are the cell contents and attributes, global attribute values, named range information, a "clipboard" holding copies of cell data, and miscellaneous flags such as error indications and the "recalculation needed" flag.

For efficiency reasons, some of the attribute values in %sheetdata are stored as index numbers into an array. Those arrays and their full values are also stored in %sheetdata.

Cell information is addressed by cell coordinates, specified as strings in the normal "A1", "B7" form.

Within %sheetdata there are separate hashes for each defined cell's datatype, formula, datavalue, and valuetype.

The datatype specifies how to interpret the other values, and is either "v" (plain numeric value), "t" (text value), "f" (formula), or "c" (constant).

The formula is the text of the cell's formula (without the leading "=") if the cell's datatype is "f", it is the text of a "constant" if the datatype is "c", and not present if the datatype is "t" (text) or "v" (plain numeric value).

The datavalue is the cell's current value, either calculated using the formula, converted from the "constant" form, or just a value input by itself. This is the value that is used in calculation and for input to the formatting process for display. The normal Perl issues with strings vs. binary numeric values apply to this value.

The cell's valuetype indicates the type of the datavalue. It consists of one or more characters. The first character is "n" for numeric values, "t" for text, "e" for error, etc. The second and following characters, if present, are the sub-type. For example, "nl" is a logical numeric value, "nd" is a date numeric value, and "e#NAME?" is a specific error value of "#NAME?". For formulas, the valuetype is determined as part of the calculation process. For example, adding a date to a time results in a datetime value (type "ndt"). (This determination is sometimes made using the lookup_result_type function in Sheet.pm.)

"Constants" are values that are typed in using some formatting, such as a trailing "%" for a percent value or with embedded slashes for a date value, and such that you would want to preserve that exact text for use when presenting the cell contents for editing but that text cannot be used directly for calculation. For example, a cell might have "15%" as the text that was typed in to the formula bar. It would be stored as datatype of "c", formula of "15%, datavalue of ".15", and valuetype of "n%".

Within %sheetdata there is a cellattribs hash which contains a hash for each defined cell, accessed by cell coordinate. The formatting attributes are specified by key/value pairs within this hash. If the coord key exists, then the cell is considered non-blank, meaning it may have attributes set or a non-empty value. (In a spreadsheet, a cell may be "blank" but have formatting that will apply when a future value is typed into the cell.)

As an example, here is a simple spreadsheet with four cells: A1, A2, B1, and B2. Cell A1 has the bold text "Value:" and cell B2 has the bold text "Square root:". Column A is set explicitly to 120 pixels wide. Cell B1 contains just the number 9. Cell B2 contains a formula that computes the square root of the contents of cell B1. Here are the respective values in %sheetdata:

   $sheetdata{datatypes}->{A1} = "t"
   $sheetdata{datavalues}->{A1} = "Value:"
   $sheetdata{valuetypes}->{A1} = "t"
   $sheetdata{cellattribs}->{A1}->{coord} = "A1"
   $sheetdata{cellattribs}->{A1}->{font} = 1

   $sheetdata{datatypes}->{A2} = "t"
   $sheetdata{datavalues}->{A2} = "Square root:"
   $sheetdata{valuetypes}->{A2} = "t"
   $sheetdata{cellattribs}->{A2}->{coord} = "A2"
   $sheetdata{cellattribs}->{A2}->{font} = 1

   $sheetdata{datatypes}->{B1} = "v"
   $sheetdata{datavalues}->{B1} = 9
   $sheetdata{valuetypes}->{B1} = "n"
   $sheetdata{cellattribs}->{B1}->{coord} = "B1"

   $sheetdata{datatypes}->{B2} = "f"
   $sheetdata{formulas}->{B2} = "SQRT(B1)"
   $sheetdata{datavalues}->{B2} = 3
   $sheetdata{valuetypes}->{B2} = "n"
   $sheetdata{cellattribs}->{B2}->{coord} = "B2"

   $sheetdata{colattribs}->{A}->{width} = 120

   $sheetdata{fonts}->[1] = "normal bold * *"
   $sheetdata{fonthash}->{"normal bold * *"} = 1

   $sheetdata{sheetattribs}->{lastrow} = 2
   $sheetdata{sheetattribs}->{lastcol} = 2


Saving and loading %sheetdata

The data in %sheetdata can be rendered into a save file format using the create_sheet_save routine. This takes a reference to %sheetdata as an argument and returns a string with that data in a save format. The string consists of multiple lines, each in the general format of "linetype:param1:param2..." The specific format is documented in the comments at the beginning of the parse_sheet_save routine.

The text is all 8-bit bytes, with UTF-8 characters taking up multiple bytes as necessary. This seemed to work best with various modules used for input/output, etc. (During formula evaluation strings with multi-byte characters are treated as UTF-8 encoding so that comparisons and length operations are handled correctly.)

The example spreadsheet used above would produce the following string:

version:1.3
cell:A1:t:Value\c:f:1
cell:B1:v:9
cell:A2:t:Square root\c:f:1
cell:B2:vtf:n:3:SQRT(B1)
col:A:w:120
sheet:r:2:c:2
font:1:normal bold * *

Note that there are escape sequences (using backslashes) for backslash, colon, and newline (\\, \c, and \n, respectively). The escape encoding is done using the encode_for_save function (with decoding done using the decode_from_save function).

Data in save file format can be parsed using the parse_sheet_save routine. This routine takes a reference to an array of lines of characters (such as from reading a saved file) and a reference to a %sheetdata hash. It initializes %sheetdata to empty and then fills it in as it parses each line. Return characters are ignored, so you can edit the data with something like Notepad on a Windows system.


Overview of the other main routines in Sheet.pm

There are three other main routines in Sheet.pm. They will be described in more detail later.

The first routine is execute_sheet_command which manipulates %sheetdata in response to human-readable command strings. It is used to set and change cell contents, insert and delete rows and columns, copy/cut/paste, etc.

The second routine is recalc_sheet. It executes the formulas in all the cells to calculate new values. It is usually invoked after use of execute_sheet_command.

The last routine is render_sheet. It renders the entire sheet into HTML for displaying in a browser. It has a variety of options, such as for rendering with or without the grid visible.


A simple testbed to show the main routines

Here is a simple Perl program, simplesheet1.pl, that can act as a testbed and as an illustration of the use of various routines in Sheet.pm. It is meant to be used in response to a CGI request with a web server. It should be installed in a directory with a sub-directory of "SocialCalc" containing Sheet.pm, SheetFunctions.pm, and Strings.pm from the /lib/SocialCalc directory in the distribution.

The program implements a simple system that:

Receives the definition of a spreadsheet in the save data format
Parses that data with parse_sheet_save
Executes any commands using execute_sheet_command
Recalculates the sheet's cell values using recalc_sheet
Renders the sheet into HTML using render_sheet
Creates updated save data using create_sheet_save

Here is the code:

--- Start source of simplesheet1.pl ---
#!/usr/bin/perl

   use strict;
   use SocialCalc::Sheet;
   use SocialCalc::SheetFunctions;
   use SocialCalc::Strings;
   use CGI qw(:standard);

   my %sheetdata;

   my $q = new CGI;
   my @lines = split(/\n/,$q->param("savestr"));
   parse_sheet_save(\@lines, \%sheetdata);

   my $command = $q->param("command");
   if ($command) {
      execute_sheet_command(\%sheetdata, $command);
      }

   recalc_sheet(\%sheetdata);

   my ($stylestr, $outstr) =
      render_sheet(\%sheetdata, "", "", "s", "a", "inline", "", "", "");

   my $savestr = create_sheet_save(\%sheetdata);

   print <<"EOF";
Content-type: text/html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Simple Spreadsheet System</title>
</head>
<body>
<h2>SIMPLE SPREADSHEET SYSTEM</h2>
<h3>Rendered Sheet:</h3><hr>
$outstr
<hr>
<form action="" method="POST">
<h3>Command:</h3>
<input type="text" name="command" value="" size="60">
<h3>Saved Data:</h3>
<textarea name="savestr" rows="7" cols="60">$savestr</textarea>
<br><br>
<input type="submit" name="render" value="Execute and Render">
</form>
</body>
</html>
EOF
--- End source of simplesheet1.pl ---

When invoked using a browser, this script should display an empty rendered sheet, an input field for a command, a text area labeled "Saved Data" displaying the save format data for the empty sheet, and a button labeled "Execute and Render".

To load a more interesting spreadsheet, namely our example, copy the following into the Saved Data text area:

version:1.3
cell:A1:t:Value\c:f:1
cell:B1:v:9
cell:A2:t:Square root\c:f:1
cell:B2:vtf:n:3:SQRT(B1)
col:A:w:120
sheet:r:2:c:2
font:1:normal bold * *

If you then press the button the page should repaint with the appropriate rendered sheet.

If you type a command into the command input field, such as "set B1 value n 25", and then press the button, you will see the updated sheet and save data. There is a list of various commands in the comments associated with the execute_sheet_command routine.

This program implements a "stateless" spreadsheet editing system. The program running on the server does not remember anything between invocations. All of the information comes from forms sent with the request and then populated with updated information in the response.


The clipboard

Each spreadsheet (i.e., one 2-dimensional grid) has a clipboard associated with it. When a range of one or more cells are copied or cut, the contents of the cells (attributes, values, formulas, etc.) are copied into similar hashes in %sheetdata{clipboard}, such as %sheetdata{clipboard}->{datavalues}. The coordinates of the range where the clipboard contents came from (e.g., "E7:G9") are saved in %sheetdata{clipboard}->{range}.

In order to use those values in other spreadsheets, the references to fonts and other attributes stored as indexes into arrays needs to be normalized appropriately. This is done by App::SocialCalc::Commands::Tools::execute_tools_command in the code for loading from a sheet.

Traditional GUI spreadsheets, such as Multiplan on the Mac (one of the first?) and Excel do not actually make copies of cells being copied/cut/pasted. Instead the range being selected is remembered (and shown with a moving dotted line). The actual copying is done without the need for a clipboard when the paste operation is done. This works well in a memory constrained system. In SocialCalc, though, the program actually makes a copy.

The copied cells maintain their original cell coordinates in the copy. The saved range value helps the pasting code determine appropriate offsets. In a spreadsheet, relative cell references (e.g., the first cell reference in "=E20*$E$5" is relative, the second is absolute) need to be modified to point to cells in the same relative position in the pasted copy.

This use of a sheet-specific clipboard, instead of a session-related clipboard, facilitates the creation of stateless interaction with regards to the server. Also, the clipboard survives communications disruptions and session breaks while editing.


The execute_sheet_command routine

Other than recalculation, most of the manipulations of %sheetdata are performed by the execute_sheet_command routine. This routine takes as arguments a reference to %sheetdata and a command string.

The command string is in the form of "command-name param1 param2..." Commands include "set", "copy", "cut", "paste", "insertcol", and "sort". The list of commands and their format can be found by looking at the code.

One command in particular is helpful for using the testbed: The set command. Some of the forms of this command are: "set coord value type numeric-value", "set coord text type text-value", and "set coord formula formula-body-less-initial-=". For example: "set B1 value n 25" and "set A3 text t More things..."

The paste, fillright, filldown, and sort commands move cells from one place to another. They need to correctly handle relative cell references in formulas. This is done using the offset_formula_coords routine. That routine is given an offset that the new location is on the sheet from the original location of the formula. The offset is applied to a formula by first parsing the formula into tokens (explained later), fixing up appropriate cell references, and then recreating a text version of the formula.

The insertcol, insertrow, deletecol, and deleterow commands need to modify all cell references that refer to cells that are moved by those operations. That is done using the adjust_formula_coords routine. This changes all cell references to cells that start at a specified row and column by specified offsets. In this case, even absolute references may be modified.

This style of API, using human-readable command strings to effect changes to the sheet, was done for a variety of reasons. One is to make it easy to keep a log of changes to the sheet. This log could be used to start from a checkpoint and then run forward to simulate step-by-step changes. This could be used as a "poor man's undo", by stopping before the last step. Another reason is that this opens up a simple macro language. An "interactive console mode" is easily implemented, as we see with the simplesheet1.pl testbed.


The render_sheet routine

This routine is used to render the data in %sheetdata in HTML or other forms for display. It takes as arguments a reference to %sheetdata as well as several strings and a mode setting ($editmode). It returns $stylestr, a string containing the text for use within a <style> element, and $outstr, a string with the HTML for the <table> element representing the rendered spreadsheet.

The $editmode may be any of the following: ajax, publish, embed, inline, or null ("").

The ajax mode creates an HTML table that includes a visible grid between all the cells as well as row and column headers. The publish mode creates an HTML table with no grid, and the cssc values (explicit CSS class names) are used for cells that have one set instead of the one the cell would otherwise get. The embed mode creates lines of text formatted for use with /lib/App/SocialCalc.pm::create_embeddable_JS. The formats of $stylestr and $outstr make it easy for simple Javascript to create the HTML with explicit style information without needing to repeat the same long style strings that would normally be named styles in a style sheet. The inline mode produces HTML with all of the style information in the <td> tags instead of in stylesheet classes (except explicit cssc classes). This is similar to embed mode, except that the expansion has already been done. The null mode renders normal HTML without a grid, but without cssc used.

The string parameters are $extratableattribs, $extratdattribs, $styleprefix, $anchorprefix, $editcoords, $onclickstr, and $linkstyle. The $extratableattribs and $extratdattribs are strings that are added (along with a leading space) to the <table> and <td> tags, respectively, generated by this routine. They can be used, for example, to add a class attribute to all cells or an ID to the table. The $styleprefix is added to a number assigned to each distinct style combination to create a CSS class name. If you have more than one sheet displayed on a page you can use this to distinguish among them. The default is "s". The $anchorsuffix is currently unused. It was originally to be used to provide an anchor tag at the beginning of each row for a non-Ajax editing system. This has been replaced by an ID tag on each row header. The $editcoords are the cell coordinates (e.g., "B9") for the cell with the edit cursor. It is used to set different CSS classes for the headers on that row and column and for that cell. The $onclickstr is a string that is added to each <td> tag. The characters "$coord" in the string are replaced by the cell coordinates for each cell. When publishing to the web this string is usually blank, and when editing maybe something like onclick="rc0('$coord');". The $linkstyle string is used by the wiki_page_command function that creates the link to other pages when rendering text using wiki-text formatting.

In ajax mode (with the grid) an extra 10 rows are displayed past the maximum active rows. This gives some extra space for adding new data during editing. Otherwise, only as many rows and columns as the extent of the sheet are displayed.

The render_sheet routine handles merged cells, creating appropriate colspan and rowspan attributes. It also tries to handle cell borders as well as possible. The spreadsheet model is of single borders between each cell on each side. The HTML model is of four borders on each cell, with two between each cell on each side. The <table> style "border-collapse:collapse" helps with this. Some browsers have issues with multiple settings for shared borders and the routine tries to only have one border actually set for each shared border. It also takes care of adding a thin gray dotted border between cells without borders when the grid is displayed.

Rendering each cell involves looking at the explicit cell format settings as well as all the defaults to come up with the style setting. The style sets the font, borders, layout, etc., and ends up being expressed as a string of CSS settings.

Each cell's value is formatted for display as determined by its valuetype. This formatting (i.e., currency symbols, thousands separators, number of digits to the right of the decimal point, etc.) is controlled at the highest level by the format_value_for_display function, which returns the formatted value as HTML.

Text values are formatted using the format_text_for_display function. If an explicit format has not been set for this cell, this function looks to the sub-type (the characters after the "t" in the valuetype) of the value itself as a guide to which format to use. For example, valuetype "tw" would use the format "text-wiki" and "n%" would be displayed scaled by 1/100 with a trailing "%" character.

Numeric values are formatted using the format_number_for_display function. This function takes as arguments the raw numeric value, the valuetype, and the format string. The format string is in the traditional spreadsheet style like "#,##0.0". If no format string is provided (or the string is "Auto") the valuetype is used to provide a default.

Most of the actual work of number formatting is performed by the format_number_with_format_string function. This function takes as arguments a numeric value and a format string. (An optional third argument is now obsolete.) It returns a string with the formatted value as HTML. (While most of a final value is usually plain text, HTML tags are usually used for colors or links.)

Format strings are in a form similar to that used by most other spreadsheets, such as Microsoft Excel or OpenOffice Calc. The current implementation does much of what Excel 2003 does, with a few extensions. The Excel Help text documents some of the specifics of number formatting in the "Create or delete custom number format" help item and OpenOffice Calc in "Number Format Codes". (Search for "custom number format" in Excel or on the web.) Not implemented are features such as "e" scientific format (you only get that in General format when Perl thinks you should), the plain "@" for text formatting, fractions ("/"), and the "*" repeat functionality (it always repeats the character 5 times). Added are the ability to have any number of sections with conditions (see below); [RED], etc., augmented with [style=css-style-spec] (e.g., [style=font-weight:bold;color:red]); and @r, @s, and @u for raw text, special characters escaped, and url-encoded text insertion. A good way to learn what Excel uses is to choose, in Excel, a specific pre-defined format and then click the "custom" category to see the format string for it. To see examples of what SocialCalc does, look at the share/socialcalc/definitions.txt file.

The formatting starts by making sure that a parsed form of the format string is available in %format_defs. The parsing is done by parse_format_string, and that parsed form is kept around from cell to cell to save the parsing time since it is common for many cells to use the same format string. The parsed form consists of an array of operators and operands. The operators are numbers representing commands such as copy a character into the output, set the color, set the CSS style, insert an integer digit, insert a currency symbol, etc. The operands are the characters to copy, the color, etc.

In the original text form, there may be multiple sections separated by semicolons. The sections can have an optional comparison to see if the value meets a particular condition to which that section's format should be applied, for example "[<=100]". If there are no comparisons, then the first section applies to positive numbers, the second to negative, and the third to zero values.

The parsed form of the format string includes an array of hashes containing information about each section in the format string. In that hash is the start index in the operators and operands arrays for that section, the number of integer and fractional digits to be displayed, how many scaling commas were encountered (cause scaling by 1/1000), how many scaling % characters there were (cause scaling by 1/100 and display of a percent character), whether there should be a thousands separator in the result, and whether or not there are date/time placeholders in the format (the overhead of converting the raw value into date/time components is only done if the format includes date or time placeholders).


The determine_value_type function

The determine_value_type function takes a value as a string and uses a variety of algorithms to determine the type of value it represents and then returns that value along with a value type. It looks at extra characters, such as $, %, /, etc., and does tests with various regular expressions. The value forms that it handles try to follow the specification for the spreadsheet function VALUE(v).


The render_values_only routine

The Ajax interface to the SocialCalc Engine relies on Javascript in the browser to update the table rows and cells that make up the spreadsheet. The render_values_only routine is similar to the render_sheet routine, but instead of rendering to complete HTML <table>, <tr>, and <td> elements, it creates a %celldata hash that just has the display value for cells and some other values needed for updating the contents of cells. Major style information is not calculated, since that is assumed to be kept constant when only cell values are changed. A before and after copy of %celldata is created by the Ajax-response code for the request parameter ajaxsetcell in lib/App/SocialCalc.pm and only the changes are sent back to the browser.


The recalc_sheet routine

Formulas are recalculated by calling the recalc_sheet routine. This routine uses the simple algorithm of going through all of the cells in the spreadsheet with formulas and calling check_and_calc_cell on those cells in turn.

The check_and_calc_cell routine keeps track of whether or not the cell being evaluated has already been recalculated. If it has, the routine just returns. If not, the routine parses the formula into tokens using the parse_formula_into_tokens routine. For each cell reference, including all cells in ranges, check_and_calc_cell calls itself with that cell coordinate as an argument, basically recursing down the formula cell reference tree. Checks are made to catch circular references.

Once check_and_calc_cell has ensured that all cell references in a cell's formula have been recalculated, it calls the evaluate_parsed_formula routine to calculate the parsed formula's new value.

This is a very simplistic algorithm. It breaks down with very large numbers of formulas that refer to each other resulting in very deep recursion, or with very large numbers of formulas that don't refer to each other and therefore don't need to be recalculated on every change. Future improvements could add facilities for more minimal recalculation and recursion-free operation.

Parsing of a formula is done with the parse_formula_into_tokens routine. This takes as an argument the string making up the formula (with no leading "=") and returns a reference to a %parseinfo hash. The %parseinfo hash is filled with arrays for tokentext, tokentype, and tokenopcode. The tokentext is the characters that made up the token, e.g., "F22" or ">=". The tokentype is the type of token, e.g., $token_coord or $token_op. The tokenopcode is a single character version of operators, e.g., "+" or "G". Parsing is done using a simple state machine that looks at each character in turn.

For example, if you call parse_formula_into_tokens with "20+4*3", you will get a reference to a %parseinfo hash with the following content (the text in parenthesis corresponds to $token_num and $token_op):

   %parseinfo->{tokentype}->[0]=1 (num), %parseinfo->{tokentext}->[0]=20
   %parseinfo->{tokentype}->[1]=3 (op), %parseinfo->{tokentext}->[1]=+
   %parseinfo->{tokentype}->[2]=1 (num), %parseinfo->{tokentext}->[2]=4
   %parseinfo->{tokentype}->[3]=3 (op), %parseinfo->{tokentext}->[3]=*
   %parseinfo->{tokentype}->[4]=1 (num), %parseinfo->{tokentext}->[4]=3


The evaluate_parsed_formula routine

The actual calculation of formulas is performed by the evaluate_parsed_formula routine. This routine first converts the parsed formula into reverse polish notation using a shunting yard algorithm. This is based on the algorithm found at the time it was written on Wikipedia.

The precedence of various operators is specified in the @token_precedence array.

The result of this conversion is an array @revpolish with indexes into the arrays of token text/type/opcodes in appropriate order for executing using a simple postfix evaluation algorithm.

For example, if you call evaluate_parsed_formula with the %parseinfo data listed above it will first create a @revpolish array with the following values:

   $revpolish[0]=0
   $revpolish[1]=2
   $revpolish[2]=4
   $revpolish[3]=3
   $revpolish[4]=1

This corresponds to "20 4 3 * +" in reverse polish, and the following %parseinfo items, in order:

   %parseinfo->{tokentype}->[0]=1 (num), %parseinfo->{tokentext}->[0]=20
   %parseinfo->{tokentype}->[2]=1 (num), %parseinfo->{tokentext}->[2]=4
   %parseinfo->{tokentype}->[4]=1 (num), %parseinfo->{tokentext}->[4]=3
   %parseinfo->{tokentype}->[3]=3 (op), %parseinfo->{tokentext}->[3]=*
   %parseinfo->{tokentype}->[1]=3 (op), %parseinfo->{tokentext}->[1]=+

The @revpolish array is then processed one element after another to do the calculation. This processing involves the major structure used during formula evaluation, the @operand stack.


The @operand stack

When evaluating a parsed formula by processing the elements in the @revpolish array tokens that represent operands are pushed onto the @operand stack. When an operator is encountered, its arguments are popped off the top of the @operand stack and used in the evaluation of the operation. The result is pushed onto the @operand stack.

The format of the @operand stack is as follows: Each element of the stack is a reference to a hash with two items, type and value. The type indicates the type of operand and the value depends upon the type. Unlike traditional arithmetic expression formula evaluators, a spreadsheet formula has a very wide variety of special operand types and many operators make use of these different value types. The spreadsheet formula operator evaluation code usually has to check several special cases and can return a wide variety of value types.

For example, after processing the first element of @revpolish in our example above, which would push the value "20" onto the stack, the @operand stack would look like this:

   $operand[0]->{type}=n, $operand[0]->{value}=20

After processing the next two elements, both also operand values, the stack would look like this:

   $operand[0]->{type}=n, $operand[0]->{value}=20
   $operand[1]->{type}=n, $operand[1]->{value}=4
   $operand[2]->{type}=n, $operand[2]->{value}=3

The next element in @revpolish is the infix operator "*". This operator takes two operands. It pops those two operands, the numeric values 3 and 4, off of the stack and then pushes the product, the numeric value 12, back on. The stack then looks like this:

   $operand[0]->{type}=n, $operand[0]->{value}=20
   $operand[1]->{type}=n, $operand[1]->{value}=12

The next element in @revpolish is the infix operator "+" which works similarly to the "*" operator, resulting in the numeric value 32 being pushed onto the stack:

   $operand[0]->{type}=n, $operand[0]->{value}=32

With all of the elements in @revpolish processed, the single operand left on the stack is returned as the result of evaluate_parsed_formula, both as a value and as a type. There is some special code that checks for cases when the result is a special type such as a cell reference.

The type values of operands on the @operand stack are a superset of the normal value types. Additional types include "coord", "range", and "start".

The "coord" type indicates a cell reference in the form "coord" or "coord!sheetname" (e.g., "F8" and "F8!basesheet").

The "range" type indicates a range of cells in the form of "coord|coord|number". For example, "C5:D7" could be coded as "C5|D7|". The number starts out as a null string. When referenced cells are fetched one after another for use in processing an operator, the number is used to indicate the offset of the next item to fetch. This index is set and used by the step_through_range_up and step_through_range_down routines. Successive calls to those routines return the coordinates and optional sheet name of each cell in the range in turn, using the range type's number value to keep of which cell to return next.

The "start" type indicates the start of arguments to a spreadsheet function, that is, it represents the "open parenthesis" that starts an argument list. This helps in the processing of spreadsheet functions that take a variable number of arguments.


Fetching operands and evaluating operators

In order to evaluate an operator, its operands must be popped off the stack, sometimes being converted into the data type needed by the operator. For example, the infix "+" operator needs two numeric values as operands. If an operand is a cell reference, such as D12, the value of that operand is found by fetching the value of cell D12. If the contents of that cell are empty, then the numeric value is treated as a zero. If it is a text value, then there has to be a determination if it can be converted to a numeric value using a spreadsheet's wide variety of numeric input forms, including dates and percents. If it cannot be converted, then a numeric value of zero is used for that operand.

After the operation is performed, the resulting value and type are pushed on the stack. The determination of the type varies by operator and often depends upon the types of each of the operands. The lookup_result_type function is often used for this purpose. It takes as arguments the value types of two operands, and a reference to a hash containing mapping information. Based upon the major types (the first letter of the type value) and the sub-types (the characters following the first) it determines the result type. The %typelookup hash contains references to mapping hashes for many different types of operators. Some, like the $typelookup{plus}, the $typelookup{twoargument}, and $typelookup{oneargument) mappings, are used by many operators and spreadsheet functions, both in the Sheet.pm and SheetFunctions.pm modules.

There are a number of routines that implement the operand fetching and conversion functions. They are also used in both the Sheet.pm and SheetFunctions.pm modules.

The operand_value_and_type routine pops the top of the stack and returns it as a value and type. It is the main workhorse for getting values. If the operand is a cell reference, it follows that reference and returns the value of the referenced cell. If the cell is in another sheet, it looks to the sheet cache to find %sheetdata for that sheet. The sheet cache is accessed using the find_sheet_in_cache routine. If the operand is a range, this routine uses step_through_range_down to access successive cells in the range, starting in the upper left corner. Named cells and ranges are also expanded to access the specific cell(s).

The routines operand_as_number and operand_as_text use operand_value_and_type to get a value and then coerce it appropriately into the desired type. There are other routines, such as operand_as_coord, for getting other types and for doing the special processing they require.

The lookup_name routine takes a name string and returns a value and type. Names in SocialCalc can be just a coordinate (e.g., A1), a range (e.g., A1:B7), or a formula (e.g., "=INDEX(A1:B5,0,1)"). This routine evaluates the formula if it exists and then returns that value (which may be a range value). For coordinates and ranges it just returns either a coord or range value.


Spreadsheet functions

If, when processing the parsed formula, a function reference is encountered, the SocialCalc::SheetFunctions::calculate_function routine is used. This routine is called with arguments of the function name, a reference to the @operand stack, a reference to a string to hold error messages, a reference to the %typelookup hash, and a reference to %sheetdata.

The calculate_function routine looks up the function name in the %function_list hash. This structure uses the function names as keys and an array of two items for each value. The first item in the array for each function is a reference to a function subroutine. The second item in the array is a number indicating the number of arguments used by that function. If the number is non-negative, it indicates the required number of arguments for that function. If the value is negative, the absolute value indicates the minimum number of argument required by that function. A value of -100 indicates that the number of arguments should not be checked.

If the function name is not in the %function_list hash, it is assumed to be a name and returned as a "name" type.

All of the function subroutines take the same argument list: the name of the function being invoked, a reference to the @operand stack, a reference to an @foperand function argument stack (see below), a reference to the error string variable, a reference to the %typelookup hash, and a reference to %sheetdata.

The calculate_function routine uses Sheet::copy_function_args to copy the arguments (the operands starting at the top of the @operand stack down to the "start" item on the stack, which is discarded) onto the @foperand stack and pops them off of the @operand stack. The @foperand stack is in the reverse order of the @operand stack -- the left-most arguments to the function are at the top of the stack.

The calculate_function routine then checks for errors in the number of arguments. Then it calls the function subroutine.

Some of the function subroutines implement a single function, and many of them implement more than one. When done, they push the result onto the @operand stack and return. The definitions of what the functions are supposed to do can be found in the documentation of other spreadsheets and most importantly in the documentation of the Open Document Format for Office Applications (OpenDocument) Recalculated Formula (OpenFormula) Format (http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office-formula).

The choice of functions to implement for version 1.0 of wikiCalc and SocialCalc 1.1.0 was the "Small Group" of functions listed in the OpenFormula document, plus four wikiCalc/SocialCalc-specific functions (WKC...) used to explore some of its capabilities, especially in a server-based environment.


An example of a function routine

As an example, we will look at the series_functions function routine. This routine implements many of the most common functions, such as SUM and COUNT, as well as others such as STDEV and VAR. It takes a variable number of arguments, either as values, cell references, or cell ranges. It goes through each value in turn, including those of all the cells in a range, and calculates the running "result". The computational overhead in this routine is accessing the values, especially for cell ranges, the most common use. Doing running calculations for all possible functions is not much more overhead.

The series_functions routine goes through each element of @foperand and uses operand_value_and_type to retrieve a value for computation. The COUNT, COUNTA, and COUNTBLANK functions increment their counts depending upon the type of the values (i.e., COUNT only counts numeric values, COUNTA counts all non-blank values, and COUNTBLANK counts only blank values). If the value is numeric, a sum is accumulated, as is the product, the minimum and maximum value so far are checked for, and the accumulated values needed to compute the statistical functions are added to. A value type is determined as if there were addition after addition.

After all elements have been processed, the appropriate value is pushed onto the @operand stack.


Security

There are many aspects to security when you have an active system like SocialCalc. If you deploy the product, and especially when you use it in configurations different than normal, you should pay attention to potential security violations. These include users being able to get the server to perform operations you may not have intended. One of the areas to be concerned about are the facilities for having the server access arbitrary URLs. This is mostly concentrated in the code for accessing other sheets (find_in_sheet_cache in Sheet.pm and the WKCHTTP function in SheetFunctions.pm). You may want to disable this functionality in simple deployments where they are not needed. For find_in_sheet_cache, remove the entire IF statement (and it's ELSE clause) with the comment "URL - USE HTTP GET" and replace it with just the assignment of "unable to load" to $loaderr. For the WKCHTTP function, just remove its entry from the %function_list.

[END]