Resource allocation
canary uses two sets of resources to manage the execution of test cases: “workers”[1] (lightweight threads responsible for
managing the asynchronous execution of concurrent tests in a ProcessPoolExecutor)
and a “resource pool” that tracks the hardware resources available to the test cases themselves, which run in subprocesses.
Generally speaking, the number of workers is an upper bound on the number of test cases that can run concurrently.
By default, canary detects the physical CPU cores and GPU devices[4] available on the machine and uses them to construct the
resource pool. The resource pool can also be specified with command line options or a file, as described below. Tests are
submitted to the executor such that, for each resource type, the number of occupied slots remains less than or equal to the total
number of slots of that type. Tests request their resources from the pool as described in “Defining resources required by a test case”.
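The admission rule above can be pictured as a per-type slot counter. The sketch below is a minimal illustration under that reading, not canary's implementation: the class `SlotPool` and its methods are hypothetical, and it tracks only slot counts, not which instance ids a test receives.

```python
from collections import Counter


class SlotPool:
    """Hypothetical slot accounting: a request is admitted only if, for every
    resource type, occupied + requested <= total."""

    def __init__(self, totals: dict[str, int]) -> None:
        self.totals = dict(totals)
        self.occupied: Counter[str] = Counter()

    def fits(self, request: dict[str, int]) -> bool:
        # Resource types missing from the pool are treated as having 0 slots.
        return all(
            self.occupied[rtype] + n <= self.totals.get(rtype, 0)
            for rtype, n in request.items()
        )

    def acquire(self, request: dict[str, int]) -> bool:
        # Occupy the slots only if the whole request fits.
        if not self.fits(request):
            return False
        self.occupied.update(request)
        return True

    def release(self, request: dict[str, int]) -> None:
        # Return the slots once the test finishes.
        self.occupied.subtract(request)


pool = SlotPool({"cpus": 8, "gpus": 4})
print(pool.acquire({"cpus": 4, "gpus": 4}))  # True
print(pool.acquire({"cpus": 2, "gpus": 1}))  # False: no GPU slots left
pool.release({"cpus": 4, "gpus": 4})
print(pool.acquire({"cpus": 2, "gpus": 1}))  # True
```

A test that does not fit simply waits; in this sketch the caller would retry `acquire` after another test calls `release`.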
Note
canary extensions, such as those for HPC execution, may change the interpretation of the resource pool
and its specification on the command line. See the documentation of the specific extension for details.
Defining the resource pool
By default, the resource pool is generated automatically from a system probe[3] and consists of a single node with N CPUs and 0 GPUs, where N is the number of CPUs reported by the probe. No other resource types are assumed to exist. Users can instead define the resource pool with command line flags, a configuration file, or a combination of both, depending on the requirements of their computing environment.
Note
Any resources other than cpus and gpus[4] must be defined by the user.
The resource pool can be specified on the command line by giving the number of each resource type. For example, a resource pool having the default CPU count (as determined by canary) and 4 GPUs can be generated via
canary -r gpus=4 ...
The resource pool can also be defined in a configuration file and passed to canary:
$ cat FILE.yaml
resource_pool:
  gpus: 4
$ canary --resource-pool-file=FILE.yaml ...
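A count such as `gpus=4` on the command line or `gpus: 4` in the file is shorthand for a list of resource instances. The sketch below shows one plausible expansion, assuming each instance gets one slot and ids "0" through "n-1"; `expand_shorthand` is a hypothetical helper, and canary's actual expansion may differ.

```python
def expand_shorthand(counts: dict[str, int]) -> dict[str, list[dict[str, object]]]:
    """Hypothetical expansion of {"gpus": 4} into explicit instances.

    Assumption: each instance has one slot and ids run "0".."n-1".
    """
    return {
        rtype: [{"id": str(i), "slots": 1} for i in range(n)]
        for rtype, n in counts.items()
    }


print(expand_shorthand({"gpus": 2}))
# {'gpus': [{'id': '0', 'slots': 1}, {'id': '1', 'slots': 1}]}
```

Under this reading a count of 0 expands to an empty list, which still declares the resource type.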
For more complex resource pools, it is necessary to define resources explicitly:
resource_pool:
  resources:
    cpus:
    - id: "0"
      slots: 1
    - id: "1"
      slots: 1
    # Repeat entries until "id": "N-1" for N CPUs in total
    - id: "N-1"
      slots: 1
    gpus:
    - id: "0"
      slots: 2
    - id: "1"
      slots: 2
    - id: "2"
      slots: 4
    - id: "3"
      slots: 4
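Writing the N repeated CPU entries by hand is tedious, so a small script can generate the explicit structure shown above. This is a sketch: `explicit_pool` is a hypothetical helper, and it emits JSON, which YAML parsers generally accept; whether canary's loader reads it unchanged has not been verified here.

```python
import json


def explicit_pool(ncpus: int, gpu_slots: list[int]) -> dict:
    """Build the explicit resource_pool structure.

    ncpus: number of CPU instances, one slot each.
    gpu_slots: slot count for each GPU, in id order.
    """
    return {
        "resource_pool": {
            "resources": {
                "cpus": [{"id": str(i), "slots": 1} for i in range(ncpus)],
                "gpus": [{"id": str(i), "slots": s} for i, s in enumerate(gpu_slots)],
            }
        }
    }


# Same GPU layout as the YAML example: two 2-slot GPUs and two 4-slot GPUs.
pool = explicit_pool(ncpus=4, gpu_slots=[2, 2, 4, 4])
print(json.dumps(pool, indent=2))
```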
To see the resource pool, issue
canary config show resource-pool
Defining resources required by a test case
The resources required by a test case are inferred by comparing the case’s parameters with the resource types defined in the resource pool. For example, a test requiring 4 CPUs and 4 GPUs must define the cpus and gpus parameters, and the resource pool must contain enough cpus and gpus slots:
import canary
canary.directives.parameterize("cpus,gpus", [(4, 4)])
resource_pool:
  cpus: 32
  gpus: 4
Note
A test case is assumed to require 1 CPU if not otherwise specified by the cpus parameter.
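The inference rule and the 1-CPU default can be sketched as a filter over the test's parameters. `required_resources` is a hypothetical helper written for illustration, and `np` is a made-up parameter name standing in for any parameter that is not a resource type.

```python
from typing import Any


def required_resources(parameters: dict[str, Any], pool_types: set[str]) -> dict[str, int]:
    """Hypothetical sketch of resource inference.

    A parameter consumes resources only if its name is a resource type in
    the pool; a test that does not set ``cpus`` is assumed to need 1 CPU.
    """
    required = {name: int(value) for name, value in parameters.items() if name in pool_types}
    required.setdefault("cpus", 1)
    return required


pool_types = {"cpus", "gpus"}
print(required_resources({"cpus": 4, "gpus": 4}, pool_types))  # {'cpus': 4, 'gpus': 4}
print(required_resources({"np": 8}, pool_types))  # {'cpus': 1}
```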
If a test requires a non-default resource, that resource type must appear in the resource pool, even if its count is 0. For example, consider a test requiring n fpgas:
canary.directives.parameterize("fpgas", [n])
canary will not treat fpgas as a resource-consuming parameter unless fpgas is explicitly defined in the resource pool, either on the command line, in a configuration file, or both. Even if the system does not contain any FPGAs (i.e., the count is 0), the user must still explicitly set the count to zero. Otherwise, canary treats fpgas as an ordinary parameter and proceeds to execute the test on systems without FPGAs.
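The difference between leaving fpgas out of the pool and declaring it with 0 slots can be illustrated with a small, self-contained check. `classify` is hypothetical and only reports whether the slot counts allow the test; how canary itself reports a test that cannot run is not modeled here.

```python
from typing import Any


def classify(parameters: dict[str, Any], pool: dict[str, int]) -> str:
    """Hypothetical illustration of the fpgas rule.

    pool maps resource type -> total slots. Parameters whose names are not
    pool resource types are ordinary parameters and consume nothing.
    """
    required = {k: int(v) for k, v in parameters.items() if k in pool}
    required.setdefault("cpus", 1)  # default CPU requirement
    if any(n > pool.get(rtype, 0) for rtype, n in required.items()):
        return "cannot run: not enough slots"
    return f"runnable, consumes {required}"


params = {"fpgas": 2}
print(classify(params, {"cpus": 8}))  # runnable, consumes {'cpus': 1}
print(classify(params, {"cpus": 8, "fpgas": 0}))  # cannot run: not enough slots
```

Without the fpgas entry the test silently runs on a machine with no FPGAs; declaring `fpgas: 0` makes the shortage visible.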
Resource pool specification
The resource pool is a JSON object whose entries describe the resources available to canary[2]. For example, a machine having N CPUs is defined by:
{
  "resource_pool": {
    "additional_properties": {},
    "resources": {
      "cpus": [
        {"id": "0", "slots": 1},
        {"id": "1", "slots": 1},
        // Repeat entries until "id": "N-1" for N CPUs in total
      ]
    }
  }
}
Each resource type in resource_pool:resources is defined by an array of JSON objects, each of which describes a single instance of that resource. Each instance’s members are:
id: a string uniquely identifying this instance of the resource; and
slots: the number of slots of this instance that are available. If not defined, the number of slots is 1.
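The two member rules can be checked mechanically. The sketch below is hypothetical (`normalize_instances` is not a canary function): it rejects non-string or duplicate ids within one resource type and fills in the default of 1 slot.

```python
def normalize_instances(rtype: str, instances: list[dict]) -> list[dict]:
    """Check one resource type's instance list: ``id`` must be a unique
    string, and a missing ``slots`` defaults to 1."""
    seen: set[str] = set()
    normalized = []
    for inst in instances:
        rid = inst["id"]
        if not isinstance(rid, str):
            raise TypeError(f"{rtype}: id {rid!r} must be a string")
        if rid in seen:
            raise ValueError(f"{rtype}: duplicate id {rid!r}")
        seen.add(rid)
        normalized.append({"id": rid, "slots": inst.get("slots", 1)})
    return normalized


print(normalize_instances("gpus", [{"id": "0"}, {"id": "1", "slots": 2}]))
# [{'id': '0', 'slots': 1}, {'id': '1', 'slots': 2}]
```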
Example
A machine having 4 CPUs with one slot each and 2 GPUs with 2 slots each would be defined as:
{
  "resource_pool": {
    "additional_properties": {},
    "resources": {
      "cpus": [
        {"id": "0", "slots": 1},
        {"id": "1", "slots": 1},
        {"id": "2", "slots": 1},
        {"id": "3", "slots": 1}
      ],
      "gpus": [
        {"id": "0", "slots": 2},
        {"id": "1", "slots": 2}
      ]
    }
  }
}
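Summing the slots of each type makes the capacity of this example explicit: 4 CPU slots and 4 GPU slots, the latter spread over two GPU devices. A short sketch using only the standard `json` module, applying the default of 1 slot where `slots` is omitted:

```python
import json

POOL_JSON = """
{
  "resource_pool": {
    "additional_properties": {},
    "resources": {
      "cpus": [{"id": "0", "slots": 1}, {"id": "1", "slots": 1},
               {"id": "2", "slots": 1}, {"id": "3", "slots": 1}],
      "gpus": [{"id": "0", "slots": 2}, {"id": "1", "slots": 2}]
    }
  }
}
"""

resources = json.loads(POOL_JSON)["resource_pool"]["resources"]
# Total slots per resource type; a missing "slots" counts as 1.
totals = {
    rtype: sum(inst.get("slots", 1) for inst in instances)
    for rtype, instances in resources.items()
}
print(totals)  # {'cpus': 4, 'gpus': 4}
```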
Environment variables
When canary executes a test, it sets the following environment variables in the test process’s environment:
CANARY_<NAME>_IDS: comma-separated list of the global ids of the instances of resource NAME assigned to the test.
For example, consider a test requiring 4 CPUs and 4 GPUs, and suppose that canary acquires CPUs 10, 11, 12, and 13 and GPUs 0, 1, 2, and 3 from the resource pool. The test environment would then define CANARY_CPU_IDS=10,11,12,13 and CANARY_GPU_IDS=0,1,2,3.
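Both sides of this convention can be sketched in a few lines: building the variables from the acquired ids, and reading them back inside a test script. Both helpers are hypothetical; the keys "cpu" and "gpu" follow the variable names in the example above, and ids are kept as strings because pool ids are strings.

```python
import os
from collections.abc import Mapping


def resource_env(acquired: dict[str, list[str]]) -> dict[str, str]:
    """Hypothetical: build CANARY_<NAME>_IDS variables from acquired ids."""
    return {
        f"CANARY_{name.upper()}_IDS": ",".join(ids)
        for name, ids in acquired.items()
    }


def ids_from_env(name: str, environ: Mapping[str, str] = os.environ) -> list[str]:
    """Read the ids assigned to this test, e.g. from inside the test script."""
    value = environ.get(f"CANARY_{name.upper()}_IDS", "")
    return value.split(",") if value else []


env = resource_env({"cpu": ["10", "11", "12", "13"], "gpu": ["0", "1", "2", "3"]})
print(env)  # {'CANARY_CPU_IDS': '10,11,12,13', 'CANARY_GPU_IDS': '0,1,2,3'}
print(ids_from_env("gpu", env))  # ['0', '1', '2', '3']
```

Inside a real test, `ids_from_env("gpu")` would read from `os.environ` by default.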
Additionally, placeholders of the form %(<name>_ids)s in existing environment variables are replaced with the actual global ids. If, in the previous example, the session environment had defined CUDA_VISIBLE_DEVICES="%(gpu_ids)s", then CUDA_VISIBLE_DEVICES=0,1,2,3 would be defined in the test environment.
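The placeholder syntax resembles Python's %-style mapping format (`"%(gpu_ids)s" % {"gpu_ids": "0,1,2,3"}`), but applying `%` to every variable would fail on unrelated placeholders and on literal `%` characters. The sketch below therefore substitutes only `%(<name>_ids)s` patterns with a regular expression; it illustrates the behavior described above and is not canary's code.

```python
import re

PLACEHOLDER = re.compile(r"%\((\w+)_ids\)s")


def expand_placeholders(environ: dict[str, str], acquired: dict[str, list[str]]) -> dict[str, str]:
    """Replace %(<name>_ids)s in each value with the comma-separated ids.

    Placeholders for names that were not acquired are left untouched.
    """
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in acquired:
            return match.group(0)
        return ",".join(acquired[name])

    return {key: PLACEHOLDER.sub(substitute, value) for key, value in environ.items()}


session_env = {"CUDA_VISIBLE_DEVICES": "%(gpu_ids)s", "HOME": "/home/user"}
print(expand_placeholders(session_env, {"gpu": ["0", "1", "2", "3"]}))
# {'CUDA_VISIBLE_DEVICES': '0,1,2,3', 'HOME': '/home/user'}
```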