Troubleshooting

This page collects common issues and how to diagnose them. When something goes wrong, start with General checks and then look for a section matching your symptoms.

General checks

Are the scripts on your PATH?

launch.py must be available on every machine where you run containers (local and remote). refresh.py is only needed on the local machine.

which launch.py    # should print a path
which refresh.py   # local machine only

If either command prints nothing, add the directory containing the scripts to your PATH (see How to get the scripts), or call the script by its full path.
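
For example, assuming the scripts were downloaded to ~/llm-scripts (adjust the path to wherever they actually live), you could add this to your ~/.bashrc:

export PATH="$HOME/llm-scripts:$PATH"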

On a remote system it is easy to forget to download launch.py, or to accidentally put it in a directory that is not on your PATH.

Are required environment variables set?

Many problems come down to missing or incorrect environment variables. Check the critical ones (on each system you are running containers on):

# For Claude Code / Pi with Bedrock
echo $CLAUDE_CODE_USE_BEDROCK   # should be 1
echo $AWS_PROFILE               # should be your profile name
echo $AWS_REGION                # should be your region, e.g. us-east-1

# For Pi with Bedrock
echo $PI_USE_BEDROCK            # should be 1

If any of these are blank, add them to your ~/.bashrc (or equivalent) and source it or open a new terminal.
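
For example, the relevant block in ~/.bashrc might look like this (the profile and region values are placeholders; use your own):

export CLAUDE_CODE_USE_BEDROCK=1
export AWS_PROFILE=my-sso-profile
export AWS_REGION=us-east-1
export PI_USE_BEDROCK=1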

Do the credential files exist?

# Codex
ls ~/.codex/auth.json

# Claude / Pi (AWS)
ls ~/.aws/config
ls ~/.aws/credentials.json

# Claude Code config
ls ~/.claude.json
ls -d ~/.claude

If any are missing, run refresh.py to create them.

Container runtime not found

Symptom: missing command 'podman' in PATH or missing command 'singularity' in PATH.

Podman (Mac):

  • Install Podman Desktop and make sure it is running. The Podman CLI requires the Podman machine to be started.

  • Verify: podman --version

Singularity (Linux / HPC):

  • Singularity is typically provided as a module on HPC systems. On NIH's Biowulf, for example:

    module load singularity
    singularity --version
    
  • This must be done after getting an interactive node (e.g., sinteractive). Singularity may not be available on login nodes.
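
Putting these together, a typical Biowulf session might look like this (the sinteractive resource flags are only illustrative; adjust to your needs):

sinteractive --mem=8g --cpus-per-task=2
module load singularity
singularity --version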

Container image not found

Symptom: podman image '...' not found or singularity image '...' not found.

Podman:

The default image is pulled automatically on first use. If you specified a custom --image-name, make sure you have either built it locally with build.py or that it is available on the registry.

In some cases, you may need to do an explicit podman pull to get Podman to recognize the latest image:

podman pull ghcr.io/nichd-bspc/llm:latest
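
To confirm Podman now sees the image locally, you can list it (the repository shown assumes the default image name):

podman images ghcr.io/nichd-bspc/llm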

Singularity:

The default SIF is pulled automatically from GitHub Container Registry. If you specified a custom --sif-path, make sure the file exists:

ls -lh /path/to/your.sif

  • If using oras:// URIs on a system that requires authentication to the registry, check that you can reach it.
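
A quick reachability check from the host, assuming curl is available, is to hit the registry endpoint directly. An HTTP 401 here is normal for an unauthenticated request and still means GHCR answered:

curl -sI https://ghcr.io/v2/ | head -n 1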

Credentials expired or missing

Symptom: The agent starts but cannot connect to the model, or you see authentication errors like ExpiredTokenException, UnauthorizedException, or The SSO session ... has expired.

  1. On your local machine, run refresh.py:

    refresh.py
    
  2. If the session is on a remote system, include the hostname:

    refresh.py --remote biowulf.nih.gov
    
  3. You do not need to restart the container. Because credential files are mounted into the container, the running agent can see refreshed credentials on the next request.

  4. If it has been a long time since credentials expired, the agent may have given up retrying. Re-send your last prompt.

  5. If Pi still keeps using the expired credentials, stop and restart pi. Some SDK/provider paths cache resolved credentials in-process, so updating ~/.aws/credentials.json alone may not always be enough for an already running Pi session.
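
As a sanity check, look at the modification time of the exported credentials file on the system where the container runs; it should be recent after a refresh:

ls -l ~/.aws/credentials.json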

Codex login failures:

refresh.py calls codex login under the hood. If this fails, verify that Codex is installed locally:

which codex
codex --version

If Codex is not installed, follow the local install steps in Getting started: Codex.

AWS SSO login failures:

refresh.py calls aws sso login under the hood. If this fails:

  • Verify AWS CLI v2 is installed: aws --version (must be 2.x)

  • Verify your profile is configured: aws configure list

  • Verify AWS_PROFILE is set correctly

  • Try logging in manually: aws sso login

  • See Setting up AWS STRIDES Single Sign-On for the full SSO setup walkthrough
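
Putting those checks together, a quick manual verification might look like this (the --profile flag is shown explicitly; with AWS_PROFILE exported it can be omitted):

aws --version                                # must report 2.x
aws configure list --profile "$AWS_PROFILE"
aws sso login --profile "$AWS_PROFILE"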

SSL/TLS connection errors

Symptom: Connection errors inside the container, especially on VPN or enterprise networks. You may see messages about certificate verification failures, CERTIFICATE_VERIFY_FAILED, or SSL: CERTIFICATE_VERIFY_FAILED.

For example:

 ⚠ MCP client for `codex_apps` failed to start: MCP startup failed: handshaking with MCP server failed: Send message error Transport [rmcp::transport::worker::WorkerTransport<rmcp::transport::streamable_http_client
::StreamableHttpClientWorker<codex_rmcp_client::http_client_adapter::StreamableHttpClientAdapter>>] error: Client error: HTTP request failed: http/request failed: error sending request for url
(https://chatgpt.com/backend-api/wham/apps), when send initialize request

⚠ MCP startup incomplete (failed: codex_apps)

The issue is that the container does not have access to host-installed enterprise certificates. See Enterprise TLS certificates for full details.

If you are not on VPN or an enterprise network and still see SSL errors, make sure you are not accidentally setting LLM_DEVCONTAINER_CERTS. Note that the two situations can differ: working directly on an enterprise network may be fine without --certs, while a VPN connection to that same network does require --certs.
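
To check for a stray certificate setting and clear it in the current shell:

echo $LLM_DEVCONTAINER_CERTS   # should be empty if you do not need custom certs
unset LLM_DEVCONTAINER_CERTS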

Remote system issues

Credentials must be refreshed on the local machine and pushed to the remote; see Login model for why this is necessary.

Credentials not arriving on remote:

refresh.py --remote HOST uses rsync over SSH. If credentials do not appear on the remote:

  • Verify you can SSH to the host without errors: ssh HOST hostname

  • Check that rsync is available locally: which rsync

  • Use --show-files to see what would be transferred:

    refresh.py --show-files
    
  • Run with --full if you need to push entire config directories (not just credentials):

    refresh.py --full --remote biowulf.nih.gov
    
  • If the remote system should use exported short-lived AWS credentials instead of trying to refresh SSO itself, push those explicitly:

    refresh.py --remote biowulf.nih.gov
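
To confirm the files actually arrived, you can list them on the remote over SSH (adjust the hostname; the paths match the credential files listed under General checks):

ssh biowulf.nih.gov 'ls -l ~/.aws/credentials.json ~/.codex/auth.json'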
    

Environment variables not set on remote:

Environment variables exported in your local ~/.bashrc are not available on the remote host. You must also add the relevant exports (CLAUDE_CODE_USE_BEDROCK, AWS_REGION, model defaults, etc.) to the remote ~/.bashrc.

When ~/.aws/credentials.json is present on the remote, launch.py will automatically use the llm-export AWS profile, so explicitly setting AWS_PROFILE on the remote is optional.

If you use sinteractive on Biowulf, note that the interactive node inherits the login node’s environment, so exporting in ~/.bashrc on Biowulf is sufficient.

Singularity-specific issues

Home directory warnings:

Singularity normally auto-mounts your home directory. launch.py disables this for isolation (see Mounts and config). If you see warnings about home directory handling, they can generally be ignored.

File permission errors:

Singularity maps your host UID into the container. If files inside the container are owned by a different user, you may get permission errors. This usually happens when using a custom SIF built with different user assumptions. The default image uses devuser (UID 1000), and launch.py handles the mapping.
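
To see how the mapping plays out, open a shell in the container (see Using dry-run for debugging below) and compare your user against the numeric ownership of a mounted path (the directory here is a placeholder):

launch.py shell
# inside the container:
id                              # the UID you are mapped to
ls -ln /path/to/mounted/dir     # numeric owner of the mounted files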

Module not loaded:

On HPC systems, remember to load the Singularity module before running launch.py:

module load singularity

Podman-specific issues

Podman machine not running:

If podman commands fail with connection errors, make sure Podman Desktop is running and the Podman machine is started:

podman machine list
podman machine start   # if not running, can also use GUI
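
If podman machine list shows no machine at all, you may need to create one first (Podman Desktop normally does this for you, so treat this as a fallback):

podman machine init
podman machine start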

Image pull failures:

If pulling from GHCR fails, check network connectivity and authentication:

podman pull ghcr.io/nichd-bspc/llm:latest

The images are public, so no authentication should be needed. If on VPN, check for TLS interception issues (SSL/TLS connection errors).

Architecture mismatch:

The container is built for linux/amd64. On Apple Silicon Macs, Podman handles the emulation transparently, but you may see a warning like this, which is expected:

WARNING: image platform (linux/amd64) does not match the expected platform (linux/arm64)

If you encounter architecture-related errors, ensure your Podman machine is configured for amd64 emulation.
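
To confirm the emulation is working, check the architecture from inside a container shell; on the amd64 image it should report x86_64 even on an Apple Silicon host:

launch.py shell
# inside the container:
uname -m   # expected: x86_64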

Conda environments in containers

Symptom: Binaries from a mounted conda environment do not work inside the container, or packages fail to install.

Mounting conda environments only works on Linux hosts whose architecture matches the container (linux/amd64, i.e. x86_64). On macOS with Apple Silicon, the host conda environment is arm64 and its binaries will not run inside the amd64 container.

Additionally, macOS filesystems are case-insensitive, which prevents some conda packages (like ncurses) from working even if the architecture matched.

On a compatible Linux host:

launch.py --conda-env my-env codex
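
As a quick compatibility check before mounting, compare the host architecture against the environment's Python binary (my-env is the example environment name used above):

uname -m                                      # should report x86_64
file "$(conda run -n my-env which python)"    # should mention x86-64, not ARM/aarch64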

Using dry-run for debugging

When something is not working, launch.py --dry-run prints the exact container command that would be run, without actually running it. This is useful for inspecting mounts, environment variables, and arguments:

launch.py --dry-run codex
launch.py --dry-run claude
launch.py --dry-run shell

Compare the output against what you expect: are credential paths mounted? Are the right environment variables being passed? Is the image correct?
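
If the printed command is long, breaking it into one token per line can make the mounts and environment flags easier to scan (this assumes the dry-run output is a single shell command):

launch.py --dry-run codex | tr ' ' '\n' | less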

You can also launch a shell to poke around inside the container interactively:

launch.py shell

This mounts credentials for all agents and drops you into a bash shell inside the container, where you can inspect the environment directly.
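
Once inside, a few quick checks can confirm the environment and mounts (exact paths inside the container may differ depending on the mount configuration):

env | grep -E 'AWS|CLAUDE|PI_'   # confirm environment variables were passed through
ls -la ~/.aws ~/.codex           # confirm credential mounts are visible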

Claude Code-specific issues

  • Update notice: Claude may display

    Update available! Run: your package manager update command

    This is usually a false positive. The container uses the stable version of Claude Code as published to the Debian repository. To double-check, you can run the /doctor command from within Claude Code. For example, at the time of writing these docs, that update message was being shown but /doctor showed the following, indicating that the current version is in fact the stable version:

    Diagnostics
    Currently running: package-manager (2.1.116)
    Commit: 9e176d077241
    Platform: linux-x64
    Package manager: deb
    Path: /usr/bin/claude
    Config install method: not set
    Search: OK (bundled)
    
    Updates
    Auto-updates: Managed by package manager
    Auto-update channel: latest
    Stable version: 2.1.116
    Latest version: 2.1.123
    

    In this case, the update is a false positive and can be ignored. Hopefully this will be fixed in future stable versions.

  • Copying text: In recent versions, by default Claude Code will automatically copy text that you select. If you are used to other text selection mechanisms (like tmux), you can use /config and change Copy on select to false. This will add a new entry in ~/.claude.json.

Codex-specific issues

  • WebSockets to HTTPS warning: In Codex, you may see the following message:

    Falling back from WebSockets to HTTPS transport. stream disconnected before completion: invalid peer certificate: UnknownIssuer

    This is usually due to setting certificates with --certs or $LLM_DEVCONTAINER_CERTS when they are not needed. Omit --certs and unset LLM_DEVCONTAINER_CERTS.