Troubleshooting¶
This page collects common issues and how to diagnose them. When something goes wrong, start with General checks and then look for a section matching your symptoms.
General checks¶
Are the scripts on your PATH?
launch.py must be available on every machine where you run containers
(local and remote). refresh.py is only needed on the local machine.
which launch.py # should print a path
which refresh.py # local machine only
If either command prints nothing, add the directory containing the scripts to
your PATH (see How to get the scripts), or call the script by its full path.
On a remote system it is easy to forget to download launch.py or to
put it somewhere that is not on the PATH.
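If the scripts live in, say, ~/bin (an assumed location; substitute wherever you actually downloaded them), a one-line addition to your ~/.bashrc puts them on your PATH:

```shell
# Assumes the scripts were downloaded to ~/bin -- substitute your actual directory.
export PATH="$HOME/bin:$PATH"
```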
Are required environment variables set?
Many problems come down to missing or incorrect environment variables. Check the critical ones on each system where you run containers:
# For Claude Code / Pi with Bedrock
echo $CLAUDE_CODE_USE_BEDROCK # should be 1
echo $AWS_PROFILE # should be your profile name
echo $AWS_REGION # should be your region, e.g. us-east-1
# For Pi with Bedrock
echo $PI_USE_BEDROCK # should be 1
If any of these are blank, add them to your ~/.bashrc (or equivalent)
and source it or open a new terminal.
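Taken together, a minimal ~/.bashrc block for the Bedrock-backed agents might look like this (the profile name and region are placeholders; use your own values):

```shell
# Placeholders -- replace with your actual profile and region.
export CLAUDE_CODE_USE_BEDROCK=1
export PI_USE_BEDROCK=1
export AWS_PROFILE=my-sso-profile
export AWS_REGION=us-east-1
```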
Do the credential files exist?
# Codex
ls ~/.codex/auth.json
# Claude / Pi (AWS)
ls ~/.aws/config
ls ~/.aws/credentials.json
# Claude Code config
ls ~/.claude.json
ls -d ~/.claude
If any are missing, run refresh.py to create them.
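To check all of these in one pass, a small shell loop over the paths listed above reports anything missing (trim the list to the agents you actually use):

```shell
# Report any expected credential file that is absent.
for f in ~/.codex/auth.json ~/.aws/config ~/.aws/credentials.json ~/.claude.json ~/.claude; do
    [ -e "$f" ] || echo "missing: $f"
done
```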
Container runtime not found¶
Symptom: missing command 'podman' in PATH or missing command
'singularity' in PATH.
Podman (Mac):
Install Podman Desktop and make sure it is running. The Podman CLI requires the Podman machine to be started.
Verify:
podman --version
Singularity (Linux / HPC):
Singularity is typically provided as a module on HPC systems. On NIH's Biowulf, for example:
module load singularity
singularity --version
This must be done after getting an interactive node (e.g.,
sinteractive). Singularity may not be available on login nodes.
Container image not found¶
Symptom: podman image '...' not found or singularity image '...'
not found.
Podman:
The default image is pulled automatically on first use. If you specified
a custom --image-name, make sure you have either built it locally with
build.py or that it is available on the registry.
In some cases, you may need to do an explicit podman pull to get Podman
to recognize the latest image:
podman pull ghcr.io/nichd-bspc/llm:latest
Singularity:
The default SIF is pulled automatically from GitHub Container Registry. If you
specified a custom --sif-path, make sure the file exists:
ls -lh /path/to/your.sif
If using oras:// URIs on a system that requires authentication to the registry, check that you can reach it.
Credentials expired or missing¶
Symptom: The agent starts but cannot connect to the model, or you see
authentication errors like ExpiredTokenException, UnauthorizedException,
or The SSO session ... has expired.
On your local machine, run refresh.py:
refresh.py
If the session is on a remote system, include the hostname:
refresh.py --remote biowulf.nih.gov
You do not need to restart the container. Because credential files are mounted into the container, the running agent can see refreshed credentials on the next request.
If it has been a long time since credentials expired, the agent may have given up retrying. Re-send your last prompt.
If Pi still keeps using the expired credentials, stop and restart
pi. Some SDK/provider paths cache resolved credentials in-process, so updating ~/.aws/credentials.json alone may not always be enough for an already-running Pi session.
Codex login failures:
refresh.py calls codex login under the hood. If this fails,
verify that Codex is installed locally:
which codex
codex --version
If Codex is not installed, follow the local install steps in Getting started: Codex.
AWS SSO login failures:
refresh.py calls aws sso login under the hood. If this
fails:
Verify AWS CLI v2 is installed:
aws --version   # must be 2.x
Verify your profile is configured:
aws configure list
Verify AWS_PROFILE is set correctly.
Try logging in manually:
aws sso login
See Setting up AWS STRIDES Single Sign-On for the full SSO setup walkthrough.
SSL/TLS connection errors¶
Symptom: Connection errors inside the container, especially on VPN or
enterprise networks. You may see messages about certificate verification
failures, CERTIFICATE_VERIFY_FAILED, or SSL: CERTIFICATE_VERIFY_FAILED.
For example:
⚠ MCP client for `codex_apps` failed to start: MCP startup failed: handshaking with MCP server failed: Send message error Transport [rmcp::transport::worker::WorkerTransport<rmcp::transport::streamable_http_client
::StreamableHttpClientWorker<codex_rmcp_client::http_client_adapter::StreamableHttpClientAdapter>>] error: Client error: HTTP request failed: http/request failed: error sending request for url
(https://chatgpt.com/backend-api/wham/apps), when send initialize request
⚠ MCP startup incomplete (failed: codex_apps)
The issue is that the container does not have access to host-installed enterprise certificates. See Enterprise TLS certificates for full details.
If you are not on VPN or an enterprise network and still see SSL errors,
make sure you are not accidentally setting LLM_DEVCONTAINER_CERTS. In some
cases, being on an enterprise network is fine (and doesn’t need --certs)
but a VPN connection to the same network does need --certs.
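To rule this out, check whether the variable is set in your current shell and clear it before relaunching:

```shell
# Prints the value if set; otherwise reports that it is not set.
printenv LLM_DEVCONTAINER_CERTS || echo "not set"
# Clear it for the current shell session.
unset LLM_DEVCONTAINER_CERTS
```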
Remote system issues¶
Credentials must be refreshed on the local machine and pushed to the remote; see Login model for why this is necessary.
Credentials not arriving on remote:
refresh.py --remote HOST uses rsync over SSH. If credentials
do not appear on the remote:
Verify you can SSH to the host without errors:
ssh HOST hostname
Check that rsync is available locally:
which rsync
Use --show-files to see what would be transferred:
refresh.py --show-files
Run with --full if you need to push entire config directories (not just credentials):
refresh.py --full --remote biowulf.nih.gov
If the remote system should use exported short-lived AWS credentials instead of trying to refresh SSO itself, push those explicitly:
refresh.py --remote biowulf.nih.gov
Environment variables not set on remote:
Environment variables exported in your local ~/.bashrc are not
available on the remote host. You must also add the relevant exports
(CLAUDE_CODE_USE_BEDROCK, AWS_REGION, model defaults, etc.) to the
remote ~/.bashrc.
When ~/.aws/credentials.json is present on the remote,
launch.py will automatically use the llm-export AWS profile, so
explicitly setting AWS_PROFILE on the remote is optional.
If you use sinteractive on Biowulf, note that the interactive node
inherits the login node’s environment, so exporting in ~/.bashrc on
Biowulf is sufficient.
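As a sketch, the remote ~/.bashrc would carry the same exports as the local one (values are placeholders; substitute your own):

```shell
# In the remote host's ~/.bashrc -- placeholders, substitute your values.
export CLAUDE_CODE_USE_BEDROCK=1
export AWS_REGION=us-east-1
# AWS_PROFILE is optional on the remote when ~/.aws/credentials.json is
# present, since launch.py then uses the llm-export profile automatically.
```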
Singularity-specific issues¶
Home directory warnings:
Singularity normally auto-mounts your home directory. launch.py
disables this for isolation (see Mounts and config).
If you see warnings about home directory handling, they can generally be
ignored.
File permission errors:
Singularity maps your host UID into the container. If files inside the
container are owned by a different user, you may get permission errors. This
usually happens when using a custom SIF built with different user
assumptions. The default image uses devuser (UID 1000), and
launch.py handles the mapping.
Module not loaded:
On HPC systems, remember to load the Singularity module before running
launch.py:
module load singularity
Podman-specific issues¶
Podman machine not running:
If podman commands fail with connection errors, make sure Podman
Desktop is running and the Podman machine is started:
podman machine list
podman machine start # if not running, can also use GUI
Image pull failures:
If pulling from GHCR fails, check network connectivity and authentication:
podman pull ghcr.io/nichd-bspc/llm:latest
The images are public, so no authentication should be needed. If on VPN, check for TLS interception issues (SSL/TLS connection errors).
Architecture mismatch:
The container is built for linux/amd64. On Apple Silicon Macs, Podman
handles the emulation transparently, but you may see a warning like this, which
is expected:
WARNING: image platform (linux/amd64) does not match the expected platform (linux/arm64)
If you encounter architecture-related errors, ensure your Podman machine is
configured for amd64 emulation.
Conda environments in containers¶
Symptom: Binaries from a mounted conda environment do not work inside the container, or packages fail to install.
Mounting conda environments only works on Linux hosts where the architecture
matches the container (linux/x86_64). On macOS with Apple Silicon, the
host conda environment is arm64 and the binaries will not run inside the
amd64 container.
Additionally, macOS filesystems are case-insensitive, which prevents some
conda packages (like ncurses) from working even if the architecture
matched.
On a compatible Linux host:
launch.py --conda-env my-env codex
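Before trying --conda-env, a quick way to confirm that your host matches the container's platform:

```shell
# Both values must match the container (Linux / x86_64) for a
# mounted conda env to work.
uname -s   # host OS, e.g. Linux or Darwin
uname -m   # host architecture, e.g. x86_64 or arm64
```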
Using dry-run for debugging¶
When something is not working, launch.py --dry-run prints the exact
container command that would be run, without actually running it. This is
useful for inspecting mounts, environment variables, and arguments:
launch.py --dry-run codex
launch.py --dry-run claude
launch.py --dry-run shell
Compare the output against what you expect: are credential paths mounted? Are the right environment variables being passed? Is the image correct?
You can also launch a shell to poke around inside the container
interactively:
launch.py shell
This mounts credentials for all agents and drops you into a bash shell inside the container, where you can inspect the environment directly.
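Once inside, a few quick commands show whether the important pieces made it into the container (variable and file names are the ones discussed above):

```shell
# Run these inside the container shell.
echo "bedrock flag: ${CLAUDE_CODE_USE_BEDROCK:-unset}"
echo "aws profile:  ${AWS_PROFILE:-unset}"
ls ~/.aws 2>/dev/null || echo "~/.aws not mounted"
```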
Claude Code-specific issues¶
Update notice: Claude may display
Update available! Run: your package manager update command
This is usually a false positive. The container uses the stable version of Claude Code as published to the Debian repository. To double-check, you can run the /doctor command from within Claude Code. For example, at the time of writing these docs, that update message was being shown but /doctor showed the following, indicating that the current version is in fact the stable version:
Diagnostics
  Currently running: package-manager (2.1.116)
  Commit: 9e176d077241
  Platform: linux-x64
  Package manager: deb
  Path: /usr/bin/claude
  Config install method: not set
  Search: OK (bundled)
Updates
  Auto-updates: Managed by package manager
  Auto-update channel: latest
  Stable version: 2.1.116
  Latest version: 2.1.123
In this case, the update is a false positive and can be ignored. Hopefully this will be fixed in future stable versions.
Copying text: In recent versions, by default Claude Code will automatically copy text that you select. If you are used to other text selection mechanisms (like tmux), you can use /config and change Copy on select to false. This will add a new entry in ~/.claude.json.
Codex-specific issues¶
WebSockets to HTTPS warning: In Codex, you may see the following message:
Falling back from WebSockets to HTTPS transport. stream disconnected before completion: invalid peer certificate: UnknownIssuer
This is usually due to setting certificates with --certs or $LLM_DEVCONTAINER_CERTS when they are not needed. Omit --certs and unset LLM_DEVCONTAINER_CERTS.