Basic Shell Scripting for Operators
Small scripts, big leverage
Open interactive version (quiz + challenge)Real-world analogy
A good ops script is a recipe card. It has a title, ingredients checked at the start, steps in order, a taste-test before serving, and a signal if something burned. Kitchens that cook from memory produce great chefs and terrible consistency.
What is it?
Shell scripting for operators is small, safe, repeatable automation. You’re not writing software products — you’re making the system click the same buttons you would, but reliably and logged.
Real-world relevance
A healthcheck script pings a list of endpoints every 5 minutes, writes status to a log, and exits non-zero if any endpoint fails. A simple monitoring tool watches for non-zero exits and pages the on-call engineer. Two hours of writing prevents years of silent outages.
Key points
- Shebang and safe options — Start every script with #!/usr/bin/env bash and set -euo pipefail. That tiny 2-line habit eliminates a huge class of silent failures.
- Variables and quoting — Use double quotes around variables: "$HOST" not $HOST. Unquoted variables cause pain when values contain spaces. Use ${VAR:-default} to handle missing values safely.
- Conditionals and tests — [[ -f file ]] for file exists, [[ -d dir ]] for dir, [[ -n str ]] non-empty string, [[ $a == $b ]] string compare, (( a > b )) arithmetic. Know both [[ ]] and (( )) and use them deliberately.
- Loops the right way — for file in /var/log/*.log; do ... done. while IFS= read -r line; do ... done < file. Don’t parse ls — loop files with glob patterns; shells handle spaces and newlines better.
- Functions and return codes — Wrap logic in small functions. Return 0 for success, non-zero for failure. Propagate failures with ‘|| return 1’. Keep scripts under 200 lines; beyond that, consider Python.
- Logging discipline — Timestamp every line. Log to a file with >> and 2>&1. Prefix with the script name. Use levels (INFO/WARN/ERROR) so future grep is easy.
- Exit codes that downstream tools trust — If your script is invoked by cron, systemd, or monitoring, its exit code is how the world knows. Exit non-zero on any real failure. Exit 0 silently on ‘nothing to do’ only when safe.
- When to stop writing bash — If you need JSON parsing, HTTP with retries, or complex logic, move to Python. jq is okay for light JSON, but avoid 500-line bash monsters. The right tool beats heroic scripting.
Code example
#!/usr/bin/env bash
# /usr/local/bin/disk_alert.sh
set -euo pipefail
LOG=/var/log/disk_alert.log
THRESHOLD=85
log(){ printf '%s [%s] %s\n' "$(date -Is)" "$1" "$2" >> "$LOG"; }
check_filesystem() {
local fs="$1"
local pct
pct=$(df -P "$fs" | awk 'NR==2 {gsub("%",""); print $5}')
if (( pct >= THRESHOLD )); then
log ERROR "$fs at ${pct}%% (>= ${THRESHOLD}%%)"
return 1
fi
log INFO "$fs at ${pct}%%"
return 0
}
rc=0
for fs in / /var /home; do
check_filesystem "$fs" || rc=1
done
exit "$rc"Line-by-line walkthrough
- 1. Shebang and script path
- 2. Safer bash options
- 3. Log and threshold variables
- 4. Blank separator
- 5. log helper function
- 6. Blank separator
- 7. Check function header
- 8. Read fs arg
- 9. Parse df percent
- 10. Compare to threshold
- 11. Log ERROR and fail
- 12. Log INFO and pass
- 13. Blank separator
- 14. Main loop
- 15. Iterate over filesystems
- 16. Accumulate failure bit
- 17. Exit with final code
Spot the bug
#!/bin/bash
FS=$1
pct=$(df -P $FS | awk 'NR==2 {print $5}')
if [ $pct -gt 85 ]; then echo full; fiNeed a hint?
What are the unsafe parts here?
Show answer
Missing set -euo pipefail; variables not quoted; pct still contains ‘%’ from df and will break numeric comparison; no logging; no exit code for downstream monitoring. Fix: add safe options, quote "$FS", strip % with awk gsub, use (( pct >= 85 )), and exit non-zero on failure with logged context.
Explain like I'm 5
A script is a recipe the computer follows perfectly, every time. A bad recipe leaves out steps and doesn’t say if the cake burned. A good recipe tells you clearly what happened — even at 3 AM.
Fun fact
Many of the most reliable scripts in corporate IT are short: 20–60 lines, well-logged, and running unchanged for years. Length is not quality; discipline is.
Hands-on challenge
Write a bash script that checks disk usage on /, /var, /home; logs each result; and exits non-zero if any is above 85%. Test it with a synthetic threshold to prove the failure path works.
More resources
- Bash manual (GNU)
- ShellCheck (ShellCheck)
- Bash scripting for sysadmins (Learn Linux TV)