Scientific Computing

Python subprocess tee to screen and variable

Python subprocess can run a long-running program while both capturing its output to a variable and printing it to the screen as it arrives. This reassures the user that the program is still working and shows status messages without waiting for the program to finish.

This example demonstrates the “tee” subprocess behavior.
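A minimal sketch of one way to get this behavior is shown here, assuming the child program writes line-oriented text. The function name `tee_run` is hypothetical, and merging stderr into stdout is a design choice for the sketch rather than a requirement.

```python
import subprocess
import sys


def tee_run(cmd):
    """Run cmd, echoing each output line as it arrives and also collecting it.

    Returns (returncode, captured_text).
    """
    lines = []
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # assumption: merge stderr into the same stream
        text=True,
        bufsize=1,  # line-buffered on the parent side
    ) as proc:
        for line in proc.stdout:
            sys.stdout.write(line)  # show the line to the user right away
            sys.stdout.flush()
            lines.append(line)  # and keep it for later use
    # leaving the "with" block waits for the child, so returncode is set
    return proc.returncode, "".join(lines)
```

For example, `code, text = tee_run([sys.executable, "-u", "-c", "print('hello')"])` prints the line live and also returns it in `text`. The child still needs to flush its own output (for Python children, the `-u` option) for lines to appear promptly.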

Python subprocess multi-line Python script

Python subprocess can run inline multi-line Python code. This is useful for cross-platform demonstrations, or for production code that needs to launch a new Python instance.

import subprocess
import sys

# -u requests unbuffered output so that the child's prints appear live
cmd = [sys.executable, "-u", "-c", r"""
import sys
import datetime
import time

for _ in range(5):
    print(datetime.datetime.now())
    time.sleep(0.3)
"""]

subprocess.check_call(cmd)

Matlab batch use stdout

The “matlab -batch” option is useful for running Matlab scripts from the command line. When parsing text that Matlab writes to stdout, especially if only a single line is expected, Matlab may first print extraneous text such as a licensing notice. For example, with a prerelease version the command:

matlab -batch "disp(matlabroot)"

outputs to stdout:

    Prerelease License -- for engineering feedback and testing
	purposes only. Not for sale.

/Applications/MATLAB_R2023b.app

A workaround in shell scripts is to keep only the last line of output:

set -e  # stop on error

r=$(matlab -batch "disp(matlabroot)" | tail -n1)

cd "${r}"
# and so on
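The same idea can be sketched in Python for scripts that are not shell-based. The function names here are hypothetical, and unlike `tail -n1` this sketch takes the last non-blank line, which is an assumption about what the caller wants.

```python
import subprocess


def last_nonempty_line(text):
    """Return the last line of text that is not blank, stripped of whitespace."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("no non-blank output to parse")
    return lines[-1]


def matlab_root(matlab="matlab"):
    """Ask Matlab for matlabroot, ignoring any banner text printed before it."""
    ret = subprocess.run(
        [matlab, "-batch", "disp(matlabroot)"],
        stdout=subprocess.PIPE,
        text=True,
        check=True,  # raise if Matlab exits with an error, like "set -e"
    )
    return last_nonempty_line(ret.stdout)
```

Calling `matlab_root()` requires Matlab on PATH; `last_nonempty_line()` can be reused for any command that prints a banner before its answer.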

Open file in default program from Terminal

It can be convenient to open a file in its default program without leaving the Terminal. The examples use a file named “file.txt”, but this works for any file type that has an associated default program on the computer.

  • macOS: open file.txt
  • Linux: xdg-open file.txt
  • Windows: start file.txt
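From a Python script, the per-platform commands above might be wrapped roughly as follows. This is a sketch with hypothetical function names; on Windows it uses `os.startfile` from the standard library rather than the `start` shell command, which is a substitution made for the sketch.

```python
import os
import subprocess
import sys


def opener_command(path, platform=sys.platform):
    """Return the argv list that opens path, or None where os.startfile is used."""
    if platform == "darwin":
        return ["open", path]
    if platform.startswith("win"):
        return None  # Windows: handled by os.startfile instead of a command
    return ["xdg-open", path]  # assumption: other platforms provide xdg-open


def open_default(path):
    """Open path in the default program associated with its file type."""
    cmd = opener_command(path)
    if cmd is None:
        os.startfile(path)  # this function only exists on Windows
    else:
        subprocess.run(cmd, check=True)
```

For example, `open_default("file.txt")` launches the default text editor on a desktop system.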

rsync private Git avoid sharing credentials

Recommended: rather than using Rsync, it is more convenient to give the remote host read-only Git access via:


Rsync over SSH allows one to edit and update code without putting credentials on the remote host.

Laptop to remote host:

rsync -r -t -v -z --exclude={build/,.git/} ~/myProg login@host:myProg
  • --exclude={build/,.git/}: skip the build/ directory and Git information, since syncing them wastes time and may fail
  • -z, --compress: compress data for transfer
  • -r, --recursive: recursively sync directories
  • -t, --times: preserve modification times
  • -v, --verbose: verbose output

CMake ExternalProject and Git filters

Git filters may clash with the CMake ExternalProject update step. The “download” step invokes checkout and the “update” step may stash and invoke the Git filters, causing the build to fail.

There is no straightforward way to turn off Git filters from within CMake ExternalProject.

Solution: use a Git pre-commit hook instead of Git filters. Users with Git filters need to disable them and, where possible, convert them to pre-commit hooks.

Things that did not work

For reference, these did not help override Git filters.

ExternalProject_Add(...
  GIT_REMOTE_UPDATE_STRATEGY "CHECKOUT"
  UPDATE_COMMAND ""
)
ExternalProject_Add_Step(MyProj gitOverride
  DEPENDERS update
  COMMAND git -C <SOURCE_DIR> config --local filter.strip-notebook-output.clean cat
  COMMAND git -C <SOURCE_DIR> config --local --list
  COMMENT "CMake ExternalProject: override git config to strip notebook output"
  LOG true
  INDEPENDENT true
)

Strip Jupyter notebook outputs from Git

Jupyter notebook outputs can be large (plots, images, etc.), bloating the Git repo history and slowing Git operations as the history grows. Notebook outputs can also reveal personal information such as usernames, the Python executable path, directory layout, and data outputs.

Strip all Jupyter outputs from Git tracking with a client-side Git pre-commit hook. We use Git pre-commit hook because Git filters can interfere with other programs such as CMake ExternalProject.
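To illustrate what stripping outputs means at the file level, here is a rough Python sketch. It is not the hook script itself, only a hypothetical helper, and it assumes the notebook is stored as nbformat 4 JSON, where each code cell carries "outputs" and "execution_count" fields.

```python
import json


def strip_outputs(notebook):
    """Clear outputs and execution counts of code cells in a notebook dict.

    The dict is modified in place and also returned for convenience.
    """
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def strip_file(path):
    """Rewrite the notebook file at path with its outputs removed."""
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)
    strip_outputs(nb)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(nb, f, indent=1, ensure_ascii=False)
        f.write("\n")
```

Markdown cells are left untouched because they have no outputs. Established tools such as `jupyter nbconvert --clear-output` do this more thoroughly and are likely a better fit for a real hook.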

Tell Git user-wide where to find Git hooks:

git config --global core.hooksPath ~/.git/hooks

Edit the file ~/.git/hooks/pre-commit to contain:

Watch shell command repeat

The procps watch command runs a command repeatedly on Unix-like systems such as Linux and macOS. Typically the command is a quick shell command that reports temperature, file status, etc. A general alternative is a small standalone C program, also named watch.

On macOS “watch” is available via Homebrew. Most Linux distributions have “watch” available by default.
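Where “watch” is not installed, the core behavior can be approximated with a few lines of Python. This is only a sketch with a hypothetical function name: it reruns the command on a fixed interval and does not clear the screen or highlight differences as the real tool can.

```python
import subprocess
import time


def watch(cmd, interval=2.0, count=None):
    """Run cmd every interval seconds and return the list of exit codes.

    count=None repeats until interrupted with Ctrl-C.
    """
    codes = []
    n = 0
    while count is None or n < count:
        if n:
            time.sleep(interval)  # wait between runs, not before the first
        codes.append(subprocess.run(cmd).returncode)
        n += 1
    return codes
```

For example, `watch(["ls", "-l", "out.dat"], interval=5, count=10)` checks the status of a hypothetical file out.dat ten times, five seconds apart. The 2 second default mirrors the default interval of procps watch.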

How much time an HPC batch job took

HPC batch systems generally track resources used by users and batch jobs to help ensure fair use of system resources, even if the user isn’t actually charged money for specific job usage. The qacct command allows querying batch accounting logs by job number or username, etc.

For example:

qacct -d 7 -o $(whoami) -j

This lists the current user's jobs from the last 7 days. “ru_wallclock” is the number of seconds the job took to run.

accounting log format
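When there are many jobs, the wall-clock values can be pulled out of the qacct text with a short script. The sketch below assumes each record is printed as "key value" lines containing "jobnumber" and "ru_wallclock", with records separated by a rule of "=" characters; the exact layout can vary between scheduler versions, and the function name is hypothetical.

```python
def wallclock_seconds(qacct_text):
    """Map job number (as a string) to ru_wallclock seconds.

    Assumes "key value" lines; lines with fewer than two fields are skipped.
    """
    result = {}
    job = None
    for line in qacct_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # separator rows and blank lines
        key, value = parts[0], parts[1].strip()
        if key == "jobnumber":
            job = value
        elif key == "ru_wallclock" and job is not None:
            # some versions append an "s" suffix to the seconds value
            result[job] = float(value.rstrip("s"))
    return result
```

The text could come from `subprocess.run(["qacct", "-d", "7", "-o", user, "-j"], stdout=subprocess.PIPE, text=True).stdout`. Array job tasks that share a job number would overwrite each other in this simple mapping.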

Cache directory vs. temporary directory

The system temporary directory has long been used as a scratch pad in examples. Over time, security limitations (virus scanners) and performance issues (abrupt clearing of the system temporary directory) have led major programs to use user temporary or cache directories instead of the system temporary directory.

The XDG Base Directory specification is a standard for the user cache directory. On systems where the environment variable “XDG_CACHE_HOME” is not set, typical defaults for the user cache directory are:

  • Windows %LOCALAPPDATA%
  • macOS ${HOME}/Library/Caches
  • Linux ${HOME}/.cache
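The lookup order described above could be sketched in Python as follows. The function name and its parameters are hypothetical, and the Windows fallback used when “LOCALAPPDATA” is unset is an assumption of this sketch, not part of any specification.

```python
import os
import sys
from pathlib import Path


def user_cache_dir(env=None, platform=sys.platform, home=None):
    """Return the user cache directory, preferring XDG_CACHE_HOME when set."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    xdg = env.get("XDG_CACHE_HOME")
    if xdg:
        return Path(xdg)
    if platform.startswith("win"):
        local = env.get("LOCALAPPDATA")
        # assumption: fall back to the conventional location if unset
        return Path(local) if local else home / "AppData" / "Local"
    if platform == "darwin":
        return home / "Library" / "Caches"
    return home / ".cache"
```

A program would typically create its own subdirectory, for example `user_cache_dir() / "myprog"`, rather than writing into the cache directory directly.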