Troubleshooting
===============
This page collects common issues and how to diagnose them. When something
goes wrong, start with :ref:`ts-general-checks` and then look for a section
matching your symptoms.
.. _ts-general-checks:
General checks
--------------
**Are the scripts on your PATH?**
:file:`launch.py` must be available on every machine where you run containers
(local and remote). :file:`refresh.py` is only needed on the local machine.
.. code-block:: bash
which launch.py # should print a path
which refresh.py # local machine only
If either command prints nothing, add the directory containing the scripts to
your ``PATH`` (see :ref:`getscripts`), or call the script by its full path.
On a remote system it is easy to forget to download :file:`launch.py` or to
put it somewhere that is not on the ``PATH``.
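For example, assuming the scripts were downloaded to :file:`~/bin` (substitute
the directory where :file:`launch.py` actually lives):

.. code-block:: bash

   # Hypothetical location; adjust to your own download directory
   echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc
   source ~/.bashrc
   which launch.py   # should now print a path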
**Are required environment variables set?**
Many problems come down to missing or incorrect environment variables. Check
the critical ones (on **each** system you are running containers on):
.. code-block:: bash
# For Claude Code / Pi with Bedrock
echo $CLAUDE_CODE_USE_BEDROCK # should be 1
echo $AWS_PROFILE # should be your profile name
echo $AWS_REGION # should be your region, e.g. us-east-1
# For Pi with Bedrock
echo $PI_USE_BEDROCK # should be 1
If any of these are blank, add them to your :file:`~/.bashrc` (or equivalent)
and source it or open a new terminal.
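A typical set of exports looks like the following; the profile name and region
are placeholders, so substitute your own:

.. code-block:: bash

   # Add to ~/.bashrc (values shown are examples)
   export CLAUDE_CODE_USE_BEDROCK=1
   export PI_USE_BEDROCK=1
   export AWS_PROFILE=my-sso-profile
   export AWS_REGION=us-east-1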
**Do the credential files exist?**
.. code-block:: bash
# Codex
ls ~/.codex/auth.json
# Claude / Pi (AWS)
ls ~/.aws/config
ls ~/.aws/credentials.json
# Claude Code config
ls ~/.claude.json
ls -d ~/.claude
If any are missing, run :ref:`refresh` to create them.
.. _ts-container-runtime:
Container runtime not found
---------------------------
**Symptom:** ``missing command 'podman' in PATH`` or ``missing command
'singularity' in PATH``.
**Podman (Mac):**
- Install `Podman Desktop <https://podman-desktop.io/>`__ and make sure it is
  running. The Podman CLI requires the Podman machine to be started.
- Verify: :cmd:`podman --version`
**Singularity (Linux / HPC):**
- Singularity is typically provided as a module on HPC systems. On
:nih:`NIH-specific` Biowulf:
.. code-block:: bash
module load singularity
singularity --version
- This must be done *after* getting an interactive node (e.g., :cmd:`sinteractive`).
Singularity may not be available on login nodes.
.. _ts-image-not-found:
Container image not found
-------------------------
**Symptom:** ``podman image '...' not found`` or ``singularity image '...'
not found``.
**Podman:**
The default image is pulled automatically on first use. If you specified
a custom ``--image-name``, make sure you have either built it locally with
:ref:`build` or that it is available on the registry.
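To check what is already available locally (``my-custom-image`` below is a
placeholder for whatever you passed to ``--image-name``):

.. code-block:: bash

   podman images                                   # list images Podman knows about
   podman image exists my-custom-image && echo "present"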
In some cases, you may need to do an explicit :cmd:`podman pull` to get Podman
to recognize the latest image:
.. code-block:: bash
podman pull ghcr.io/nichd-bspc/llm:latest
**Singularity:**
The default SIF is pulled automatically from GitHub Container Registry. If you
specified a custom ``--sif-path``, make sure the file exists:
.. code-block:: bash
ls -lh /path/to/your.sif
- If using ``oras://`` URIs on a system that requires authentication to
the registry, check that you can reach it.
.. _ts-credentials-expired:
Credentials expired or missing
------------------------------
**Symptom:** The agent starts but cannot connect to the model, or you see
authentication errors like ``ExpiredTokenException``, ``UnauthorizedException``,
or ``The SSO session ... has expired``.
1. On your **local** machine, run :ref:`refresh`:
.. code-block:: bash
refresh.py
2. If the session is on a **remote** system, include the hostname:
.. code-block:: bash
refresh.py --remote biowulf.nih.gov
3. You do **not** need to restart the container. Because credential files are
mounted into the container, the running agent can see refreshed
credentials on the next request.
4. If it has been a long time since credentials expired, the agent may have
given up retrying. Re-send your last prompt.
5. If Pi still keeps using the expired credentials, stop and restart ``pi``.
Some SDK/provider paths cache resolved credentials in-process, so updating
:file:`~/.aws/credentials.json` alone may not always be enough for an already
running Pi session.
**Codex login failures:**
:cmd:`refresh.py` calls :cmd:`codex login` under the hood. If this fails,
verify that Codex is installed locally:
.. code-block:: bash
which codex
codex --version
If Codex is not installed, follow the local install steps in
:doc:`getting-started-codex`.
**AWS SSO login failures:**
:cmd:`refresh.py` calls :cmd:`aws sso login` under the hood. If this
fails:
- Verify AWS CLI v2 is installed: :cmd:`aws --version` (must be ``2.x``)
- Verify your profile is configured: :cmd:`aws configure list`
- Verify ``AWS_PROFILE`` is set correctly
- Try logging in manually: :cmd:`aws sso login` (see the example after this list)
- See :doc:`aws-sso` for the full SSO setup walkthrough
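For example, a manual login and a quick sanity check, assuming ``AWS_PROFILE``
names the profile you configured:

.. code-block:: bash

   aws sso login --profile "$AWS_PROFILE"
   aws sts get-caller-identity --profile "$AWS_PROFILE"   # should print your account and role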
.. _ts-ssl-tls:
SSL/TLS connection errors
-------------------------
**Symptom:** Connection errors inside the container, especially on VPN or
enterprise networks. You may see messages about certificate verification
failures, ``CERTIFICATE_VERIFY_FAILED``, or ``SSL: CERTIFICATE_VERIFY_FAILED``.
For example:
.. code-block:: text
⚠ MCP client for `codex_apps` failed to start: MCP startup failed: handshaking with MCP server failed: Send message error Transport [rmcp::transport::worker::WorkerTransport>] error: Client error: HTTP request failed: http/request failed: error sending request for url
(https://chatgpt.com/backend-api/wham/apps), when send initialize request
⚠ MCP startup incomplete (failed: codex_apps)
The issue is that the container does not have access to host-installed
enterprise certificates. See :doc:`certificates` for full details.
If you are **not** on VPN or an enterprise network and still see SSL errors,
make sure you are not accidentally setting ``LLM_DEVCONTAINER_CERTS``. In some
cases, being on an enterprise network is fine (and doesn't need ``--certs``),
but a VPN connection to the same network *does* need ``--certs``.
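A quick way to check and clear the variable for the current shell:

.. code-block:: bash

   echo "$LLM_DEVCONTAINER_CERTS"   # should be empty if you don't need certs
   unset LLM_DEVCONTAINER_CERTS     # also remove any export from ~/.bashrc if present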
.. _ts-remote:
Remote system issues
--------------------
Credentials must be refreshed on the local machine and pushed to the remote;
see :ref:`container-notes-login-model` for why this is necessary.
**Credentials not arriving on remote:**
:cmd:`refresh.py --remote HOST` uses :cmd:`rsync` over SSH. If credentials
do not appear on the remote:
- Verify you can SSH to the host without errors: :cmd:`ssh HOST hostname`
- Check that :cmd:`rsync` is available locally: :cmd:`which rsync`
- Use ``--show-files`` to see what would be transferred:
.. code-block:: bash
refresh.py --show-files
- Run with ``--full`` if you need to push entire config directories (not
just credentials):
.. code-block:: bash
refresh.py --full --remote biowulf.nih.gov
- If the remote system should use exported short-lived AWS credentials instead
of trying to refresh SSO itself, push those explicitly:
.. code-block:: bash
refresh.py --remote biowulf.nih.gov
**Environment variables not set on remote:**
Environment variables exported in your *local* :file:`~/.bashrc` are not
available on the remote host. You must also add the relevant exports
(``CLAUDE_CODE_USE_BEDROCK``, ``AWS_REGION``, model defaults, etc.) to the
**remote** :file:`~/.bashrc`.
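One way to append the exports to the remote file (the hostname matches the
examples above; the values are placeholders):

.. code-block:: bash

   ssh biowulf.nih.gov 'cat >> ~/.bashrc' <<'EOF'
   export CLAUDE_CODE_USE_BEDROCK=1
   export PI_USE_BEDROCK=1
   export AWS_REGION=us-east-1
   EOF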
When :file:`~/.aws/credentials.json` is present on the remote,
:file:`launch.py` will automatically use the ``llm-export`` AWS profile, so
explicitly setting ``AWS_PROFILE`` on the remote is optional.
If you use :cmd:`sinteractive` on Biowulf, note that the interactive node
inherits the login node's environment, so exporting in :file:`~/.bashrc` on
Biowulf is sufficient.
.. _ts-singularity:
Singularity-specific issues
---------------------------
**Home directory warnings:**
Singularity normally auto-mounts your home directory. :cmd:`launch.py`
disables this for isolation (see :ref:`container-notes-persistent-mounts`).
If you see warnings about home directory handling, they can generally be
ignored.
**File permission errors:**
Singularity maps your host UID into the container. If files inside the
container are owned by a different user, you may get permission errors. This
usually happens when using a custom SIF built with different user
assumptions. The default image uses ``devuser`` (UID 1000), and
:cmd:`launch.py` handles the mapping.
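If you suspect a UID mismatch, a shell inside the container makes it easy to
compare ownership (a quick sketch; the file path is a placeholder):

.. code-block:: bash

   launch.py shell        # open a shell in the container, then inside it:
   id -u                  # the UID you were mapped to
   ls -ln /path/to/file   # numeric owner of the file giving permission errors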
**Module not loaded:**
On HPC systems, remember to load the Singularity module before running
:cmd:`launch.py`:
.. code-block:: bash
module load singularity
.. _ts-podman:
Podman-specific issues
----------------------
**Podman machine not running:**
If :cmd:`podman` commands fail with connection errors, make sure Podman
Desktop is running and the Podman machine is started:
.. code-block:: bash
podman machine list
podman machine start # if not running, can also use GUI
**Image pull failures:**
If pulling from GHCR fails, check network connectivity and authentication:
.. code-block:: bash
podman pull ghcr.io/nichd-bspc/llm:latest
The images are public, so no authentication should be needed. If on VPN,
check for TLS interception issues (:ref:`ts-ssl-tls`).
**Architecture mismatch:**
The container is built for ``linux/amd64``. On Apple Silicon Macs, Podman
handles the emulation transparently, but you may see a warning like this, which
is expected:
.. code-block:: text
WARNING: image platform (linux/amd64) does not match the expected platform (linux/arm64)
If you encounter architecture-related errors, ensure your Podman machine is
configured for ``amd64`` emulation.
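One way to confirm emulation is working is to run a one-off command in the
container and check the reported architecture (this assumes the default image
name and that the image allows overriding its command):

.. code-block:: bash

   podman run --rm ghcr.io/nichd-bspc/llm:latest uname -m   # should print x86_64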
.. _ts-conda:
Conda environments in containers
---------------------------------
**Symptom:** Binaries from a mounted conda environment do not work inside the
container, or packages fail to install.
Mounting conda environments only works on Linux hosts where the architecture
matches the container (``linux/x86_64``). On macOS with Apple Silicon, the
host conda environment is ``arm64`` and the binaries will not run inside the
``amd64`` container.
Additionally, the default macOS filesystem is case-insensitive, which prevents
some conda packages (like ``ncurses``) from working even if the architectures
matched.
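A quick way to confirm the host architecture before mounting:

.. code-block:: bash

   uname -sm   # should print "Linux x86_64" on a compatible host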
On a compatible Linux host:
.. code-block:: bash
launch.py --conda-env my-env codex
.. _ts-dry-run:
Using dry-run for debugging
---------------------------
When something is not working, :cmd:`launch.py --dry-run` prints the exact
container command that would be run, without actually running it. This is
useful for inspecting mounts, environment variables, and arguments:
.. code-block:: bash
launch.py --dry-run codex
launch.py --dry-run claude
launch.py --dry-run shell
Compare the output against what you expect: are credential paths mounted?
Are the right environment variables being passed? Is the image correct?
You can also launch a ``shell`` to poke around inside the container
interactively:
.. code-block:: bash
launch.py shell
This mounts credentials for all agents and drops you into a bash shell
inside the container, where you can inspect the environment directly.
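For example, from inside that shell you might check (a sketch; adjust to the
agent you are debugging):

.. code-block:: bash

   env | grep -E 'AWS|BEDROCK|PI_'                   # expected variables present?
   ls -d ~/.aws ~/.codex ~/.claude ~/.claude.json    # credential files mounted?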
Claude Code-specific issues
---------------------------
- **Update notice:** Claude may display
``Update available! Run: your package manager update command``
This is usually a false positive. The container uses the *stable* version of
Claude Code as published to the Debian repository. To double-check, you can
run the :cmd:`/doctor` command from within Claude Code. For example, at the
time of writing these docs, that update message was being shown but
:cmd:`/doctor` showed the following, indicating that the current version is
in fact the stable version:
.. code-block:: text
Diagnostics
Currently running: package-manager (2.1.116)
Commit: 9e176d077241
Platform: linux-x64
Package manager: deb
Path: /usr/bin/claude
Config install method: not set
Search: OK (bundled)
Updates
Auto-updates: Managed by package manager
Auto-update channel: latest
Stable version: 2.1.116
Latest version: 2.1.123
In this case, the update is a false positive and can be ignored. Hopefully this
will be fixed in future stable versions.
- **Copying text:** In recent versions, by default Claude Code will
*automatically copy* text that you select. If you are used to other text
selection mechanisms (like tmux), you can use :cmd:`/config` and change *Copy
on select* to *false*. This will add a new entry in :file:`~/.claude.json`.
Codex-specific issues
---------------------
- **WebSockets to HTTPS warning:** In Codex, you may see the following message:
``Falling back from WebSockets to HTTPS transport. stream disconnected before completion: invalid peer certificate: UnknownIssuer``
This is usually due to setting certificates with ``--certs`` or
``$LLM_DEVCONTAINER_CERTS`` when they are not needed. Omit ``--certs`` and
unset ``LLM_DEVCONTAINER_CERTS``.