Troubleshooting guide
This page is dedicated to solutions to common issues encountered when using the Covalent platform.
Workflow hanging
If a workflow is hanging, it is likely due to a task that is not completing. To get more information on the status of the workflow, users can check the details of the Covalent server via:
covalent logs
Note
In order to get more informative logs, users can start the Covalent server with covalent start -d
A workflow can hang with the local executor because a worker process in ProcessPoolExecutor can stall sporadically without failing. In this case, the user needs to restart the server and resubmit the workflow.
Workflow failing
A workflow can fail for a number of reasons. The most common reasons are:
- A task fails to execute because of a runtime error arising from the task definition. In this case, check the Covalent server logs for errors and check the task error cards in the Covalent UI.
- The executor's memory or compute resources are insufficient. For memory issues with the Dask executor, the memory allocated per worker needs to be increased (provided the user has enough memory available) via:
covalent start -n 4 -m "2GB" -d
Check out this discussion for more details.
- Some larger workflows fail for macOS users due to a "too many open files in system" error. This can be resolved by increasing the open file limit via the terminal command
ulimit -n 10240
and restarting Covalent.
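To see whether the open file limit is the likely culprit, the limit can also be inspected from Python. The following is a minimal sketch using the standard-library resource module (Unix-like systems only). The helper name open_file_limit and the 10240 threshold (taken from the ulimit command above) are illustrative assumptions. It reports the limit of the Python process it runs in, which is inherited from the shell that launched it and is not necessarily the limit of an already running Covalent server.

```python
import resource


def open_file_limit(minimum=10240):
    """Return (soft, hard, ok) for the current process's open-file limit.

    `minimum` is an illustrative threshold, not a value required by Covalent.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY means "no limit", which satisfies any minimum.
    ok = soft == resource.RLIM_INFINITY or soft >= minimum
    return soft, hard, ok


if __name__ == "__main__":
    soft, hard, ok = open_file_limit()
    print(f"soft={soft} hard={hard} sufficient={ok}")
```

If the reported soft limit is below the threshold, raise it with ulimit in the same shell before starting Covalent from that shell.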
Covalent server not starting
If dispatches fail with connection refused errors after running covalent start, it is possible that the Covalent server was unable to start.
- Ensure that you are able to run the python command and that python --version is compatible with Covalent (refer to the compatibility section). If your Python version is not compatible, or if you only have python3 installed, it is recommended to use a virtual environment (several tools can be used for this: poetry, conda, pyenv, etc.).
Users should ultimately check covalent logs for more information and submit a new issue on GitHub Discussions with any relevant log information associated with the issue.
Covalent CLI commands throw errors/warnings
- The covalent config file can periodically get corrupted due to multiple processes attempting to modify the config file simultaneously. This can sometimes be fixed by manually editing the config file. However, if this does not work, the user can delete the config file and restart the Covalent server.
Caution
The user also has the option of running covalent purge, which will delete the config file. A new config file will be created when the user runs covalent start again. This option must be used with caution.
- If a DB migration error is thrown, the database schema is not up to date with the latest version of Covalent. This can be fixed by running covalent db migrate. For more information, check out What To Do When Encountering Database Migration Errors.
Getting Result fails for long workflows
- For long-running workflows, if a user runs get_result synchronously with wait=True and observes a RecursionError: maximum recursion depth exceeded error, the result may still be pending or may already be complete, but Covalent failed during the polling process. Users should still be able to re-run the command to continue waiting for the workflow result.
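Re-running the command by hand can be automated with a small retry wrapper. The sketch below is not part of Covalent. The helper name wait_for_result and its parameters max_attempts and pause_seconds are hypothetical, and it only assumes that the polling call raises RecursionError as described above.

```python
import time


def wait_for_result(fetch, max_attempts=5, pause_seconds=1.0):
    """Call fetch() until it returns, retrying when polling hits RecursionError.

    `fetch` is any zero-argument callable that performs the blocking wait.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except RecursionError:
            if attempt == max_attempts:
                raise  # give up and surface the original error
            # Brief pause before resuming the wait.
            time.sleep(pause_seconds)
```

With Covalent imported as ct, it could be used as wait_for_result(lambda: ct.get_result(dispatch_id, wait=True)), where dispatch_id is the id returned when the workflow was dispatched.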
Executor issues after installation
- Users can get an executor not found or Covalent config file missing default values error for the executor if Covalent was not restarted after the executor was installed. This can be fixed by restarting the Covalent server.
Lattice not found error when using Self-hosted Covalent
- Errors related to the lattice not being found arise from the user trying to retrieve / access data corresponding to a dispatch id that does not exist in the database. This can happen when the self-hosted and local Covalent servers get mixed up. This can be avoided by explicitly specifying the dispatcher address when dispatching and retrieving results.
Note
In general, users should set the dispatcher_addr in the ct.dispatch()/ct.get_result() functions rather than using ct.set_config if they’d like to only temporarily change the dispatcher address.
- The dispatch id is invalid.
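One way to keep the address consistent between dispatching and retrieving is to bind it once and reuse the bound functions. The sketch below is a generic standard-library pattern, not a Covalent feature. The helper name pin_dispatcher is hypothetical, and the only assumption about Covalent is the dispatcher_addr keyword argument mentioned above. The address string is a placeholder whose exact format should be checked against the installed Covalent version.

```python
from functools import partial


def pin_dispatcher(func, dispatcher_addr):
    """Return func with dispatcher_addr fixed as a keyword argument."""
    return partial(func, dispatcher_addr=dispatcher_addr)


# Intended usage (requires Covalent; shown as comments so the sketch runs alone):
#   import covalent as ct
#   SELF_HOSTED = "<host>:<port>"  # placeholder, replace with your server
#   dispatch = pin_dispatcher(ct.dispatch, SELF_HOSTED)
#   get_result = pin_dispatcher(ct.get_result, SELF_HOSTED)
#   dispatch_id = dispatch(workflow)(*workflow_args)
#   result = get_result(dispatch_id, wait=True)
```

Because both bound functions carry the same address, a dispatch id created on the self-hosted server is always looked up on that same server.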
Connection timeout error when using Self-hosted Covalent
If a user is getting a connection timeout error while using self-hosted Covalent, it is likely that the local and self-hosted servers are getting mixed up. In this case, the user needs to ensure that the dispatcher address is explicitly set and that the corresponding Covalent server is actually running.
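Before dispatching, a quick TCP check can confirm that something is listening at the intended address. This is a minimal sketch using only the standard library. The helper name server_reachable is hypothetical, and a successful connection only shows that the port is open, not that the listener is a healthy Covalent server.

```python
import socket


def server_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False
```

For a local server, a call such as server_reachable("localhost", 48008) could be tried, assuming the server uses port 48008; substitute the host and port of the self-hosted server as configured.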