Thursday, April 1, 2010

Script: average

A few weeks ago, I was cruising the Ubuntu forums and came across a question from a poster who wanted to find the average of a series of floating-point numbers.  The numbers were extracted from some other command and were output in a column.  He wanted a command line incantation that would take the column of numbers and return the average.  Several people answered this query with clever one-line solutions; however, I thought that this problem would be a good task for a script.  Using a script, one could have a solution that was a little more robust and general purpose.  I wrote the following script, presented here with line numbers:


     1    #!/bin/bash
     2    
     3    # average - calculate the average of a series of numbers
     4    
     5    # handle cmd line option
     6    if [[ $1 ]]; then
     7        case $1 in
     8            -s|--scale)    scale=$2 ;;
     9            *)             echo "usage: average [-s scale]" >&2
    10                           exit 1 ;;
    11        esac
    12    fi
    13    
    14    # construct instruction stream for bc
    15    c=0
    16    {    echo "t = 0; scale = 2"
    17        [[ $scale ]] && echo "scale = $scale"
    18        while read value; do
    19    
    20            # only process valid numbers
    21            if [[ $value =~ ^[-+]?[0-9]*\.?[0-9]+$ ]]; then
    22                echo "t += $value"
    23                ((++c))
    24            fi
    25        done
    26    
    27        # make sure we don't divide by zero
    28        ((c)) && echo "t / $c"
    29    } | bc

This script reads a series of numbers from standard input, one per line, and prints their average.  It is invoked as follows:

average -s scale < file_of_numbers

where scale is an integer containing the desired number of decimal places in the result and file_of_numbers is a file containing the series of numbers we want to average.  If scale is not specified, then the default value of 2 is used.

To demonstrate the script, we will calculate the average size of the programs in the /usr/bin directory:

me@linuxbox:~$ stat --format "%s" /usr/bin/* | average
81766.66

The basic idea behind this script is that it uses the bc arbitrary precision calculator program to figure out the average.  We need to use something like bc, because arithmetic expansion in the shell can only handle integer math.
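The difference is easy to see with a small experiment.  The following is only a sketch; the function names int_quotient and bc_quotient are made up for this illustration and are not part of the script:

```shell
#!/bin/bash

# Shell arithmetic expansion works on integers only, so the
# fractional part of the quotient is simply dropped.
int_quotient() {
    echo $(( $1 / $2 ))
}

# Handing the same division to bc, with scale set to 2,
# keeps two decimal places.
bc_quotient() {
    echo "scale = 2; $1 / $2" | bc
}

int_quotient 7 2    # prints 3
bc_quotient 7 2     # prints 3.50
```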

To perform our calculation, we need to construct a series of instructions and pipe them into bc.  This task comprises the bulk of our script.  To do something that complicated, we employ a shell feature known as a group command.  Starting with line 16 and ending with line 29, we capture all of the standard output and consolidate it into a single stream.  That is, all of the standard output produced by the commands on lines 16-29 is treated as though it came from a single command and is piped into bc on line 29.
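Here is a minimal sketch of a group command on its own, separate from the script.  The function name shout_lines and the use of tr as the receiving command are assumptions chosen only for the demonstration:

```shell
#!/bin/bash

# Every command between the braces writes to the same standard
# output, so the pipe after the closing brace sees one combined stream.
shout_lines() {
    {   echo "first line"
        for word in second third; do
            echo "$word line"
        done
        echo "last line"
    } | tr 'a-z' 'A-Z'
}

shout_lines
```

All four lines come out in upper case, even though they were produced by three separate commands, because tr receives them as a single stream.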

We'll look at our group command piece by piece.  As you know, an average is calculated by adding up a series of numbers and dividing the sum by the number of entries.  In our case, the number of entries is stored in the variable c and the sum is stored (within bc) in the variable t.  We start our group command (line 16) by passing some initial values to bc.  We set the initial value of the bc variable t to zero and the value of scale to our default value of two (the default scale of bc is zero).

On line 17, we evaluate the scale variable to see if the command line option was used and if so, pass that new value to bc.

Next, we start a while loop that reads entries from our standard input.  Each iteration of the loop causes the next entry in the series to be assigned to the variable value.

Lines 20-24 are interesting.  Here we test to see if the string contained in value is actually a valid floating point number.  To do this, we employ a regular expression that will only match if the number is properly formatted.  The regular expression says that, to match, value may start with an optional plus or minus sign, followed by zero or more numerals, followed by an optional decimal point, and must end with one or more numerals.  If value passes this test, an instruction is inserted into the stream telling bc to add value to t (line 22) and we increment c (line 23); otherwise value is ignored.
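To get a feel for what the pattern accepts, it can be tried against a few sample strings.  This is a sketch only; the variable number_re and the function is_number are names invented for the example, though the pattern itself is copied from line 21:

```shell
#!/bin/bash

# The same pattern used on line 21, stored in a variable so that
# the backslash reaches the regular expression engine unchanged.
number_re='^[-+]?[0-9]*\.?[0-9]+$'

# Succeeds (exit status 0) only if the argument matches the pattern.
is_number() {
    [[ $1 =~ $number_re ]]
}

for candidate in 42 -3.14 +.5 7. abc 1.2.3 1e5; do
    if is_number "$candidate"; then
        echo "$candidate: accepted"
    else
        echo "$candidate: rejected"
    fi
done
```

The first three candidates are accepted and the rest are rejected.  Note that 7. fails because the pattern requires at least one numeral at the end, and 1e5 fails because exponent notation is not covered.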

After all of the numbers have been read from standard input, it's time to perform the calculation.  First, we test to see that we actually processed some numbers.  If we did not, then c would equal zero and the resulting calculation would cause a "division by zero" error, so we test the value of c, and only if it is not equal to zero do we insert the final instruction for bc.
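It can help to look at the instruction stream itself before it reaches bc.  The sketch below wraps the same logic as lines 15-28 in a function and leaves off the pipe; the name build_stream is hypothetical, and the -r option to read and the final return are additions made for the example:

```shell
#!/bin/bash

# Hypothetical helper: emit the bc instructions for the numbers
# on standard input, without actually running bc.
build_stream() {
    local c=0 value
    echo "t = 0; scale = 2"
    while read -r value; do
        # only process valid numbers
        if [[ $value =~ ^[-+]?[0-9]*\.?[0-9]+$ ]]; then
            echo "t += $value"
            ((++c))
        fi
    done
    # When c is 0 the arithmetic test fails and no division is emitted.
    ((c)) && echo "t / $c"
    return 0
}

printf '%s\n' 1.5 oops 2.5 | build_stream
```

For this input the function prints four lines: the initial assignments, "t += 1.5", "t += 2.5", and "t / 2".  The invalid entry is skipped, and with empty input only the first line appears, so bc never sees a division by zero.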

This script would make a good starting point for a series of statistical programs.  The most significant design weakness of the script as written is that it fails to check that the value supplied to the scale option is really an integer.  That's an improvement I will leave to my faithful readers...
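As a starting point for that exercise, here is one possible sketch of such a check.  The function name valid_scale is made up, and the assumption is that only non-negative whole numbers should be accepted as a scale:

```shell
#!/bin/bash

# Hypothetical helper: succeeds only if the argument consists
# entirely of numerals, i.e. it is a non-negative whole number.
valid_scale() {
    [[ $1 =~ ^[0-9]+$ ]]
}

for s in 4 0 2.5 -1 abc; do
    if valid_scale "$s"; then
        echo "$s: ok"
    else
        echo "$s: usage: average [-s scale]"
    fi
done
```

In the script itself, a test like this could run right after the scale variable is assigned on line 8, printing the usage message to standard error and exiting when the test fails.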

Further Reading

The following man pages:
  • bc
  • bash (the "Compound Commands" section covers group commands and the [[]] and (()) compound commands)
The Linux Command Line
  • Chapter 20 (regular expressions)
  • Chapter 28 (if command, [[]] and (()) compound commands and && and || control operators)
  • Chapter 29 (the read command)
  • Chapter 30 (while loops)
  • Chapter 33 (positional parameters)
  • Chapter 35 (arithmetic expressions and expansion, bc program)
  • Chapter 37 (group commands)

1 comment:

  1. For the scaling you can use
        echo "scale = ${scale:-2}"
    instead of always printing scale = 2 and then printing the desired scale after that, if defined (default value parameter expansion). You might also check that $scale is a valid integer, maybe with scale=$((scale)) or similar.

    Also note (though it is not the topic of this article) that this can all be replaced by a small AWK script.
