
One of the things I’ve done most in my last two decades in the IT business is writing shell scripts. Sometimes it was to combine and condense different data sources, sometimes to produce a custom report, sometimes to encrypt or decrypt data. In the end, there was a wide range of use cases where a small script helped to achieve the outcome I was looking for. Often these scripts supported parts of global processes with high financial or legal impact on the business. There is nothing wrong with this as long as everything is working. Unfortunately, the only constant in the IT business is change, and scripts that have silently done their job in the background for years suddenly stop working. In a well-managed IT organization you know where those scripts are used, you know the purpose of each script, and you can easily adapt them to the new environment. At least, this is the theory. Let’s have a look at a real-life example:

#!/bin/sh
set -x
TC8=$(ps -ef | grep tomcat)
if ! [ -n "$TC8" ] ; then
    /home/portal/tc8/bin/catalina.sh start &
    exit
fi

Time to talk about enterprise-grade shell scripts. I’m pretty sure the script did its job when it was deployed to the production machine, but does anyone know what its purpose is? With some experience in this area, you will recognize that the script does something with a Tomcat instance. Based on the path, I would assume it’s a Java portal running on Tomcat under the account of the user “portal”. But why does it use the ampersand and the exit command, and why does it only work sometimes? As you have probably noticed, this script is an example of what an enterprise-grade script should not look like.

Header

This brings us to the question: what should an enterprise-grade shell script look like? First and most obvious, some additional context would be helpful. What is the purpose of the script, who wrote it, and when? A typical header that I use in my scripts looks like this:

#
# Author:       Thomas Bendler <code@thbe.org>
# Date:         Fri Apr 17 22:48:34 CEST 2020
#
# Note:         To debug the script run it with: bash -vx <script>
#
# Prerequisite: This release needs a shell that can handle functions.
#               If the shell is not able to handle functions, remove the
#               error section.
#
# Release:      1.0.0
#
# ChangeLog:    v0.1.0 - Initial release
#               v0.9.0 - Prepare go-live
#               v1.0.0 - Production go-live
#
# Purpose:      Watchdog for a Tomcat 8 instance
#

Without reading a single line of code, you know who wrote the script and when. You know which version is deployed, its version history, its purpose, and so on. Depending on the processes in the company, it can be handy to add further information such as the URL of the Git repository, the deployment pipeline that was used, or approvals. Whatever is used, it should be used in every script, and the structure of the information should be the same in every script.

Shebang

A script file can be written in different languages: shell, Python, Ruby, or something else. To tell the operating system which interpreter should run the script, the so-called “shebang” is used. The shebang is the hash sign plus an exclamation mark, followed by the path of the interpreter that should execute the script. In real life, you’ll find lots of shebangs like the following:

#!/usr/local/bin/ksh

Here we see that someone manually installed a Korn shell on that box (because it’s in /usr/local/bin/) and that the script uses this shell. This works in principle, but it’s not what an enterprise-grade shell script should look like. Imagine that at some point we want to deploy the script on another box, and on this box the Korn shell was installed through the package manager. This usually means that the path to the shell is /usr/bin/ksh, so before we can deploy the script we need to change the shebang. The better approach is to use the env binary instead. You’ll find env on practically every box under /usr/bin/env. It takes the name of the interpreter as its argument. So assuming I would like to use the Korn shell, the shebang would be:

#!/usr/bin/env ksh

env searches the PATH for ksh and calls the first match, so the shebang no longer depends on where the shell was installed. This also works for Perl, Python, and other script interpreters.
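To check which binary env would pick on a given box, you can let the shell search the PATH in the same way. This is only a minimal sketch; the function name resolve_interpreter is made up for this illustration and is not part of the script discussed here:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the location of the first match in PATH.
# For external programs this is the binary "env <name>" would execute
# (for shell builtins, "command -v" only prints the name).
resolve_interpreter() {
  command -v "${1}"
}

# "sh" is used here because it exists practically everywhere;
# replace it with ksh, python3, or whatever the shebang names.
resolve_interpreter sh
```

If the function prints nothing and returns a non-zero status, the interpreter is not in the PATH and a shebang based on env would fail on that box.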

Script behavior

Regardless of the programming language, it’s always beneficial to know how a script behaves: when it stops, when it aborts, what is allowed, and what is not. In a shell script this is controlled by the set command, which should be used identically in all scripts:

### General script behavior ###
set -euo pipefail

In the end, this is the combination of three options, which are explained here:

-e           stops the script as soon as a command fails
-u           stops the script as soon as an unset variable is used
-o pipefail  makes a pipeline fail if any command in it fails, not only the last one
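The effect of pipefail and -u is easiest to see in a small experiment. The following is a sketch that assumes bash; the variable names (STATUS_WITHOUT_PIPEFAIL, OPTIONAL_GREETING, and so on) are invented for the demonstration:

```shell
#!/usr/bin/env bash
# Default behavior: only the last command of a pipeline (cat) counts.
set +o pipefail
false | cat
STATUS_WITHOUT_PIPEFAIL=${?}

# With pipefail, the failing "false" determines the pipeline status.
set -o pipefail
STATUS_WITH_PIPEFAIL=0
false | cat || STATUS_WITH_PIPEFAIL=${?}

# With -u, optional variables need an explicit default to stay usable.
set -u
GREETING=${OPTIONAL_GREETING:-hello}

echo "without pipefail: ${STATUS_WITHOUT_PIPEFAIL}"
echo "with pipefail:    ${STATUS_WITH_PIPEFAIL}"
echo "greeting:         ${GREETING}"
```

Keep in mind that -e does not react to commands whose result is tested, for example in an if condition or on the left side of ||, which is why the second pipeline above does not end the script.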

Error handling

A central element of an enterprise-grade shell script is proper error handling. Errors happen, unfortunately more often than anybody wants, so it’s key to deal with them in a predictable way. I use a function called error_handling() to achieve this:

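### Error handling ###
error_handling() {
  LAST_EXIT_CODE=${?}
  # Remove the EXIT trap so the exit below does not call this function again
  trap - EXIT
  # A command that failed under "set -e" leaves RETURN_CODE at 0
  if [ "${RETURN_CODE}" -eq 0 ] && [ "${LAST_EXIT_CODE}" -ne 0 ]; then
    RETURN_CODE=${LAST_EXIT_CODE}
    EXIT_REASON="Unexpected error (exit code ${LAST_EXIT_CODE})"
  fi
  if [ "${RETURN_CODE}" -eq 0 ]; then
    echo_verbose "${SCRIPT_NAME} successful!"
  else
    echo_error "${SCRIPT_NAME} aborted, reason: ${EXIT_REASON}"
    echo; script_usage
  fi
  exit "${RETURN_CODE}"
}
trap "error_handling" EXIT HUP INT QUIT TERM
RETURN_CODE=0
EXIT_REASON="Finished!"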

The way this function works is quite simple. Whenever the script exits, or receives one of the signals listed at the end of the trap line, the function error_handling() is called. You also see an important rule for writing enterprise-grade scripts: use meaningful names for functions, variables, constants, and other script components. The purpose of the variable ${SCRIPT_NAME} is much easier to understand than that of ${0}.
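How an EXIT trap gets to see the exit status can be tried out in isolation. This is a minimal sketch, not the author’s handler; trap_demo and TRAP_OUTPUT are names chosen for the example, and the function body runs in a subshell so that exit only ends the demonstration:

```shell
#!/usr/bin/env bash
# The parentheses make the function body a subshell, so "exit" ends
# only the demo and the EXIT trap belongs to that subshell.
trap_demo() (
  trap 'echo "trap saw exit status ${?}"' EXIT
  exit 3
)

# "|| true" keeps the non-zero status from stopping a script using set -e
TRAP_OUTPUT=$(trap_demo) || true
echo "${TRAP_OUTPUT}"
```

Inside the trap, ${?} holds the status the shell is about to exit with, which is what a central handler can use to tell a clean exit from a failure.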

Output

You might have also noticed that I use additional functions in the error handling function:

### Print out information if in verbose mode ###
echo_verbose() { if [ ${ARGUMENT_VERBOSE} -eq 1 ]; then echo "${@}"; fi }

### Print out information on error channel ###
echo_error() { echo "${@}" >&2; }

Good enterprise-grade scripts run silently by default and only become verbose when called with the respective option; the first function takes care of that. The second function makes sure that error output goes to STDERR instead of STDOUT. This enables calling programs to separate the normal script output from the error messages.
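What this separation means for a caller can be shown with a short experiment. The sketch below repeats the two helper functions so that it runs on its own; the function report and the variables STDOUT_ONLY and STDERR_ONLY are made up for the illustration:

```shell
#!/usr/bin/env bash
ARGUMENT_VERBOSE=0
echo_verbose() { if [ "${ARGUMENT_VERBOSE}" -eq 1 ]; then echo "${@}"; fi; }
echo_error() { echo "${@}" >&2; }

# Hypothetical piece of script logic that uses all three output paths
report() {
  echo_verbose "starting"
  echo "result"
  echo_error "warning"
}

# Capture only STDOUT and discard STDERR
STDOUT_ONLY=$(report 2>/dev/null)

# Capture only STDERR: first point STDERR at the capture,
# then send STDOUT to /dev/null
STDERR_ONLY=$(report 2>&1 >/dev/null)

echo "STDOUT: ${STDOUT_ONLY}"
echo "STDERR: ${STDERR_ONLY}"
```

Because ARGUMENT_VERBOSE is 0, the "starting" message does not show up in either capture.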

Dry run

Another good practice is to offer the possibility of a dry run:

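### Don't execute commands if in dry run mode ###
execute_command() {
  COMMAND_RETURN_CODE=0
  if [ ${ARGUMENT_DRYRUN} -eq 1 ]; then
    echo "Command to execute: ${*}"
  else
    # Record a failure instead of returning it, so that "set -e" does not
    # abort the script before the caller has checked COMMAND_RETURN_CODE
    "${@}" || COMMAND_RETURN_CODE=${?}
  fi
}
COMMAND_RETURN_CODE=0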

When the execute_command function is used and the script was called with the dry run flag, the command is only displayed and not executed. Unfortunately, this function isn’t that straightforward and requires some more thought before it is used widely, especially with pipes or redirects, because those are processed by the calling shell and not by the function.
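The caveat about redirects can be made concrete. The sketch below uses a reduced copy of the dry run function and a temporary file; DEMO_FILE and the other variable names are assumptions for this demonstration, and passing the command line to sh -c is just one possible workaround:

```shell
#!/usr/bin/env bash
ARGUMENT_DRYRUN=1

# Reduced copy of the dry run helper, enough for this demonstration
execute_command() {
  if [ "${ARGUMENT_DRYRUN}" -eq 1 ]; then
    echo "Command to execute: ${*}"
  else
    "${@}"
  fi
}

DEMO_FILE=$(mktemp)

# Pitfall: the calling shell applies the redirect before execute_command
# runs, so even in dry run mode the file is written to.
execute_command echo "payload" > "${DEMO_FILE}"
PITFALL_CONTENT=$(cat "${DEMO_FILE}")

# Workaround: hand the complete command line, including the redirect,
# to a child shell as a single string.
: > "${DEMO_FILE}"
WORKAROUND_OUTPUT=$(execute_command sh -c "echo payload > '${DEMO_FILE}'")
WORKAROUND_CONTENT=$(cat "${DEMO_FILE}")
rm -f "${DEMO_FILE}"

echo "pitfall wrote:    ${PITFALL_CONTENT}"
echo "workaround wrote: ${WORKAROUND_CONTENT:-nothing}"
```

In the first call the dry run message itself ends up in the file. In the second call the file stays empty, because the redirect is part of the string that is only printed.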

Usage

Now that we have covered the execution helpers, we need to provide a function for the script usage:

### Print out usage information ###
script_usage() { cat <<EOT
usage:   ${SCRIPT_NAME} [-n] [-v] [-h]
example: ./${SCRIPT_NAME} -v

arguments (optional):
-n:      Dry run
-v:      Be verbose
-h:      Print this help
EOT
}

Defaults

With all the functions in place, we can start with the script code itself. The first thing to do is to initialize the variables; otherwise the script could fail because of unset variables:

### Default script variables ###
export LC_ALL=C
export LANG=C
ARGUMENT_DRYRUN=0; ARGUMENT_VERBOSE=0
SCRIPT_NAME=$(basename "${0}")

Setting the language variables to an explicit value is also common good practice in shell scripting. It makes the output of called programs much more predictable, which is beneficial if that output is parsed or used for further actions.
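A quick way to see why the locale matters is sorting. In the C locale, sort compares plain byte values, so the result is the same on every box, whereas other locales may interleave uppercase and lowercase letters. A small sketch with invented sample data:

```shell
#!/usr/bin/env bash
# In the C locale uppercase letters sort before lowercase letters,
# because the comparison is done on byte values.
SORTED=$(printf '%s\n' b A a B | LC_ALL=C sort | paste -sd ' ' -)
echo "${SORTED}"
```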

Arguments

The next step takes care of the arguments that might have been passed to the script during execution:

### Get the arguments used at script execution ###
while [ ${#} -ne 0 ]; do
  case "${1}" in
    -n|--dry-run) ARGUMENT_VERBOSE=1; ARGUMENT_DRYRUN=1 ;;
    -v|--verbose) ARGUMENT_VERBOSE=1 ;;
    -h|--help)    script_usage; exit ;;
  esac;
  shift
done

The code snippet is pretty straightforward: it loops as long as arguments are left and processes each argument passed to the script.
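The loop shown above silently ignores arguments it does not know. If that is not wanted, a catch-all branch can reject them. The following variant is only a sketch: the wrapper function parse_arguments and the return code 2 are assumptions for this example and not part of the original script:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the argument loop that rejects unknown options
parse_arguments() {
  ARGUMENT_DRYRUN=0; ARGUMENT_VERBOSE=0
  while [ ${#} -ne 0 ]; do
    case "${1}" in
      -n|--dry-run) ARGUMENT_VERBOSE=1; ARGUMENT_DRYRUN=1 ;;
      -v|--verbose) ARGUMENT_VERBOSE=1 ;;
      *)            echo "Unknown argument: ${1}" >&2; return 2 ;;
    esac
    shift
  done
}

parse_arguments -n
echo "dry run: ${ARGUMENT_DRYRUN}, verbose: ${ARGUMENT_VERBOSE}"
```

Wrapping the loop in a function also makes it possible to test the parsing without running the rest of the script.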

Prerequisites

The last part before the script code starts is the check whether the prerequisites are met:

### Check script prerequisite ###
TOMCAT_CATALINA_SCRIPT=/home/portal/tc8/bin/catalina.sh
if ! [ -x ${TOMCAT_CATALINA_SCRIPT} ]; then
  RETURN_CODE=1
  EXIT_REASON="The Tomcat catalina script (${TOMCAT_CATALINA_SCRIPT}) is not executable, aborting!"
  exit
fi

Main script logic

Now that we have everything covered and in place, it’s time to implement the script logic:

### Check if Tomcat is running and start Tomcat if stopped ###
TOMCAT_STATUS=$(ps -ef | grep -E "[t]omcat " || echo "no")
echo_verbose "Is Tomcat running: ${TOMCAT_STATUS}"
if [ "${TOMCAT_STATUS}" = "no" ] ; then
  echo_verbose "Try to start Tomcat ..."
  execute_command ${TOMCAT_CATALINA_SCRIPT} start
  if [ ${COMMAND_RETURN_CODE} -ne 0 ]; then
    RETURN_CODE=${COMMAND_RETURN_CODE}
    EXIT_REASON="Could not start Tomcat, aborting!"
    exit
  fi
fi

As I mentioned at the beginning of the post, one of the questions was why the script sometimes works and sometimes doesn’t. The reason is the way the script checks whether a Tomcat process exists. If you do a simple grep on tomcat, grep will occasionally find its own process in the process list, because its command line contains the search term as well. If you instead write the pattern as a regular expression such as [t]omcat, the pattern still matches the string tomcat, but the grep command line, which contains the literal brackets, no longer matches itself.
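The effect can be reproduced without a running Tomcat by feeding grep a simulated process list. The two lists below are invented sample data (including the path /opt/tomcat); they only mimic what ps -ef could print while the respective grep is running:

```shell
#!/usr/bin/env bash
# Simulated "ps -ef" output: one Tomcat process plus the grep looking for it
FAKE_PS_PLAIN='portal 101 java -Dcatalina.base=/opt/tomcat Bootstrap start
portal 202 grep tomcat'
FAKE_PS_BRACKET='portal 101 java -Dcatalina.base=/opt/tomcat Bootstrap start
portal 202 grep -E [t]omcat'

# The plain pattern also matches the grep command line itself
MATCHES_PLAIN=$(printf '%s\n' "${FAKE_PS_PLAIN}" | grep -c 'tomcat')

# The bracket pattern matches "tomcat" but not the text "[t]omcat"
MATCHES_BRACKET=$(printf '%s\n' "${FAKE_PS_BRACKET}" | grep -E -c '[t]omcat')

echo "plain pattern:   ${MATCHES_PLAIN} matching lines"
echo "bracket pattern: ${MATCHES_BRACKET} matching lines"
```

With the plain pattern, the check would report a running Tomcat even when only the grep line is present, which explains the sporadic failures of the original watchdog.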

The enterprise-grade shell script

Now we can put everything together and deploy the enterprise-grade shell script in a production environment with high financial impact, without fearing that we operate hidden time bombs that create severe risks in case of failure, especially if the developers have left the company in the meantime:

#!/usr/bin/env bash
#
# Author:       Thomas Bendler <code@thbe.org>
# Date:         Fri Apr 17 22:48:34 CEST 2020
#
# Note:         To debug the script run it with: bash -vx <script>
#
# Prerequisite: This release needs a shell that can handle functions.
#               If the shell is not able to handle functions, remove the
#               error section.
#
# Release:      1.0.0
#
# ChangeLog:    v0.1.0 - Initial release
#               v0.9.0 - Prepare go-live
#               v1.0.0 - Production go-live
#
# Purpose:      Watchdog for a Tomcat 8 instance
#

### General script behavior ###
set -euo pipefail

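### Error handling ###
error_handling() {
  LAST_EXIT_CODE=${?}
  # Remove the EXIT trap so the exit below does not call this function again
  trap - EXIT
  # A command that failed under "set -e" leaves RETURN_CODE at 0
  if [ "${RETURN_CODE}" -eq 0 ] && [ "${LAST_EXIT_CODE}" -ne 0 ]; then
    RETURN_CODE=${LAST_EXIT_CODE}
    EXIT_REASON="Unexpected error (exit code ${LAST_EXIT_CODE})"
  fi
  if [ "${RETURN_CODE}" -eq 0 ]; then
    echo_verbose "${SCRIPT_NAME} successful!"
  else
    echo_error "${SCRIPT_NAME} aborted, reason: ${EXIT_REASON}"
    echo; script_usage
  fi
  exit "${RETURN_CODE}"
}
trap "error_handling" EXIT HUP INT QUIT TERM
RETURN_CODE=0
EXIT_REASON="Finished!"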

### Print out information if in verbose mode ###
echo_verbose() { if [ ${ARGUMENT_VERBOSE} -eq 1 ]; then echo "${@}"; fi }

### Print out information on error channel ###
echo_error() { echo "${@}" >&2; }

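### Don't execute commands if in dry run mode ###
execute_command() {
  COMMAND_RETURN_CODE=0
  if [ ${ARGUMENT_DRYRUN} -eq 1 ]; then
    echo "Command to execute: ${*}"
  else
    # Record a failure instead of returning it, so that "set -e" does not
    # abort the script before the caller has checked COMMAND_RETURN_CODE
    "${@}" || COMMAND_RETURN_CODE=${?}
  fi
}
COMMAND_RETURN_CODE=0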

### Print out usage information ###
script_usage() { cat <<EOT
usage:   ${SCRIPT_NAME} [-n] [-v] [-h]
example: ./${SCRIPT_NAME} -v

arguments (optional):
-n:      Dry run
-v:      Be verbose
-h:      Print this help
EOT
}

### Default script variables ###
export LC_ALL=C
export LANG=C
ARGUMENT_DRYRUN=0; ARGUMENT_VERBOSE=0
SCRIPT_NAME=$(basename "${0}")

### Get the arguments used at script execution ###
while [ ${#} -ne 0 ]; do
  case "${1}" in
    -n|--dry-run) ARGUMENT_VERBOSE=1; ARGUMENT_DRYRUN=1 ;;
    -v|--verbose) ARGUMENT_VERBOSE=1 ;;
    -h|--help)    script_usage; exit ;;
  esac;
  shift
done

### Check script prerequisite ###
TOMCAT_CATALINA_SCRIPT=/home/portal/tc8/bin/catalina.sh
if ! [ -x ${TOMCAT_CATALINA_SCRIPT} ]; then
  RETURN_CODE=1
  EXIT_REASON="The Tomcat catalina script (${TOMCAT_CATALINA_SCRIPT}) is not executable, aborting!"
  exit
fi

### Check if Tomcat is running and start Tomcat if stopped ###
TOMCAT_STATUS=$(ps -ef | grep -E "[t]omcat " || echo "no")
echo_verbose "Is Tomcat running: ${TOMCAT_STATUS}"
if [ "${TOMCAT_STATUS}" = "no" ] ; then
  echo_verbose "Try to start Tomcat ..."
  execute_command ${TOMCAT_CATALINA_SCRIPT} start
  if [ ${COMMAND_RETURN_CODE} -ne 0 ]; then
    RETURN_CODE=${COMMAND_RETURN_CODE}
    EXIT_REASON="Could not start Tomcat, aborting!"
    exit
  fi
fi

Final thoughts

Let’s conclude this exercise of writing enterprise-grade shell scripts with some thoughts. The first question I normally get is: isn’t this over-engineered? It depends; in the end it’s all about risk, financial impact, maintainability, and so on. Templates and practices like this are used in environments where the flawless operation of scripts that support processes is key, where it’s not acceptable to pause a business for a week because a script doesn’t work anymore and no one is able to fix it. These are the typical scenarios where it is worth the effort to standardize shell scripts as shown and to require those structures before something goes into production. In hobby environments it’s not strictly required, but even there it comes in handy once you have to change a script you wrote years ago. However you do it in the end, enjoy coding and happy scripting!