Run:ai Collector

The Metering & Billing Collector can integrate with Nvidia’s Run:ai to collect allocated and used resources for your AI/ML workloads, including GPUs, CPUs, and memory. This is useful for companies that use Run:ai to run GPU workloads and want to bill and invoice their customers based on the allocated and used resources they consume.
How it works
You can install the Collector as a Kubernetes pod in your Run:ai cluster to collect metrics automatically. The Collector periodically scrapes metrics from the Run:ai platform and emits them as CloudEvents to Metering & Billing, so you can track usage and monetize Run:ai workloads.
Once you have the usage data ingested into Metering & Billing, you can use it to set up prices and billing for your customers based on their usage.
Example
Let’s say you want to charge your customers $0.20 per GPU minute and $0.05 per CPU minute. Every 30 seconds, the Collector emits events like the following from your Run:ai workloads:
```json
{
  "id": "123e4567-e89b-12d3-a456-426614174000",
  "specversion": "1.0",
  "type": "workload",
  "source": "run_ai",
  "time": "2025-01-01T00:00:00Z",
  "subject": "my-customer-id",
  "data": {
    "name": "my-runai-workload",
    "namespace": "my-runai-benchmark-test",
    "phase": "Running",
    "project": "my-project-id",
    "department": "my-department-id",
    "workload_minutes": 1.0,
    "cpu_limit_core_minutes": 96,
    "cpu_request_core_minutes": 96,
    "cpu_usage_core_minutes": 80,
    "cpu_memory_limit_gigabyte_minutes": 384,
    "cpu_memory_request_gigabyte_minutes": 384,
    "cpu_memory_usage_gigabyte_minutes": 178,
    "gpu_allocation_minutes": 1,
    "gpu_usage_minutes": 1,
    "gpu_memory_request_gigabyte_minutes": 40,
    "gpu_memory_usage_gigabyte_minutes": 27
  }
}
```
Note: The Collector normalizes collected metrics to per-minute values. The normalization unit is configurable, so you can set per-second, per-minute, or per-hour pricing, similar to AWS EC2 pricing.
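For example, if the $0.20 GPU price is applied to `gpu_allocation_minutes` and the $0.05 CPU price to `cpu_usage_core_minutes` (one possible mapping of prices to fields; you can meter any of the emitted fields), the sample event above would be billed at 1 × $0.20 + 80 × $0.05 = $4.20.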
Run:ai metrics
The Collector supports the following Run:ai metrics:
Pod metrics
See the following table for supported pod metrics:
| Metric Name | Description |
|---|---|
| `GPU_UTILIZATION_PER_GPU` | GPU utilization percentage per individual GPU |
| `GPU_UTILIZATION` | Overall GPU utilization percentage for the pod |
| `GPU_MEMORY_USAGE_BYTES_PER_GPU` | GPU memory usage in bytes per individual GPU |
| `GPU_MEMORY_USAGE_BYTES` | Total GPU memory usage in bytes for the pod |
| `CPU_USAGE_CORES` | Number of CPU cores currently being used |
| `CPU_MEMORY_USAGE_BYTES` | Amount of CPU memory currently being used in bytes |
| `GPU_GRAPHICS_ENGINE_ACTIVITY_PER_GPU` | Graphics engine utilization percentage per GPU |
| `GPU_SM_ACTIVITY_PER_GPU` | Streaming Multiprocessor (SM) activity percentage per GPU |
| `GPU_SM_OCCUPANCY_PER_GPU` | SM occupancy percentage per GPU |
| `GPU_TENSOR_ACTIVITY_PER_GPU` | Tensor core usage percentage per GPU |
| `GPU_FP64_ENGINE_ACTIVITY_PER_GPU` | FP64 (double precision) engine activity percentage per GPU |
| `GPU_FP32_ENGINE_ACTIVITY_PER_GPU` | FP32 (single precision) engine activity percentage per GPU |
| `GPU_FP16_ENGINE_ACTIVITY_PER_GPU` | FP16 (half precision) engine activity percentage per GPU |
| `GPU_MEMORY_BANDWIDTH_UTILIZATION_PER_GPU` | Memory bandwidth usage percentage per GPU |
| `GPU_NVLINK_TRANSMITTED_BANDWIDTH_PER_GPU` | NVLink transmitted bandwidth per GPU |
| `GPU_NVLINK_RECEIVED_BANDWIDTH_PER_GPU` | NVLink received bandwidth per GPU |
| `GPU_PCIE_TRANSMITTED_BANDWIDTH_PER_GPU` | PCIe transmitted bandwidth per GPU |
| `GPU_PCIE_RECEIVED_BANDWIDTH_PER_GPU` | PCIe received bandwidth per GPU |
| `GPU_SWAP_MEMORY_BYTES_PER_GPU` | Amount of GPU memory swapped to system memory per GPU |
Workload metrics
See the following table for supported workload metrics:
| Metric Name | Description |
|---|---|
| `GPU_UTILIZATION` | Overall GPU usage percentage across all GPUs in the workload |
| `GPU_MEMORY_USAGE_BYTES` | Total GPU memory usage in bytes across all GPUs |
| `GPU_MEMORY_REQUEST_BYTES` | Requested GPU memory in bytes for the workload |
| `CPU_USAGE_CORES` | Number of CPU cores currently being used |
| `CPU_REQUEST_CORES` | Number of CPU cores requested for the workload |
| `CPU_LIMIT_CORES` | Maximum number of CPU cores allowed for the workload |
| `CPU_MEMORY_USAGE_BYTES` | Amount of CPU memory currently being used in bytes |
| `CPU_MEMORY_REQUEST_BYTES` | Requested CPU memory in bytes for the workload |
| `CPU_MEMORY_LIMIT_BYTES` | Maximum CPU memory allowed in bytes for the workload |
| `POD_COUNT` | Total number of pods in the workload |
| `RUNNING_POD_COUNT` | Number of currently running pods in the workload |
| `GPU_ALLOCATION` | Number of GPUs allocated to the workload |
Get started
First, create a new YAML file for the Collector configuration. Use the `run_ai` Redpanda Connect input:

```yaml
input:
  run_ai:
    url: '${RUNAI_URL:}'
    app_id: '${RUNAI_APP_ID:}'
    app_secret: '${RUNAI_APP_SECRET:}'
    schedule: '*/30 * * * * *'
    metrics_offset: '30s'
    resource_type: 'workload'
    metrics:
      - CPU_LIMIT_CORES
      - CPU_MEMORY_LIMIT_BYTES
      - CPU_MEMORY_REQUEST_BYTES
      - CPU_MEMORY_USAGE_BYTES
      - CPU_REQUEST_CORES
      - CPU_USAGE_CORES
      - GPU_ALLOCATION
      - GPU_MEMORY_REQUEST_BYTES
      - GPU_MEMORY_USAGE_BYTES
      - GPU_UTILIZATION
      - POD_COUNT
      - RUNNING_POD_COUNT
    http:
      timeout: 30s
      retry_count: 1
      retry_wait_time: 100ms
      retry_max_wait_time: 1s
```
Configuration options
See the following table for supported configuration options:
| Option | Description | Default | Required |
|---|---|---|---|
| `url` | Run:ai base URL | - | Yes |
| `app_id` | Run:ai app ID | - | Yes |
| `app_secret` | Run:ai app secret | - | Yes |
| `resource_type` | Run:ai resource to collect metrics from (`workload` or `pod`) | `workload` | No |
| `metrics` | List of Run:ai metrics to collect | All available | No |
| `schedule` | Cron expression for the scrape interval | `*/30 * * * * *` | No |
| `metrics_offset` | Time offset for queries to account for delays in metric availability | `0s` | No |
| `http` | HTTP client configuration | - | No |
Next, configure the mapping from the Run:ai metrics to CloudEvents using Bloblang:

```yaml
pipeline:
  processors:
    - mapping: |
        let duration_seconds = (meta("scrape_interval").parse_duration() / 1000 / 1000 / 1000).round().int64()
        let gpu_allocation_minutes = this.allocatedResources.gpu.number(0) * $duration_seconds / 60
        let cpu_limit_core_minutes = this.metrics.values.CPU_LIMIT_CORES.number(0) * $duration_seconds / 60
        # Add metrics as needed...
        root = {
          "id": uuid_v4(),
          "specversion": "1.0",
          "type": meta("resource_type"),
          "source": "run_ai",
          "time": now(),
          "subject": this.name,
          "data": {
            "tenant": this.tenantId,
            "project": this.projectId,
            "department": this.departmentId,
            "cluster": this.clusterId,
            "type": this.type,
            "gpuAllocationMinutes": $gpu_allocation_minutes,
            "cpuLimitCoreMinutes": $cpu_limit_core_minutes
          }
        }
```
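For example, to also meter CPU usage minutes you could mirror the `CPU_LIMIT_CORES` line above (a sketch, not part of the default mapping; it assumes `CPU_USAGE_CORES` is included in the input's `metrics` list):

```
# Hypothetical extra metric: CPU core usage normalized to minutes
let cpu_usage_core_minutes = this.metrics.values.CPU_USAGE_CORES.number(0) * $duration_seconds / 60
```

Then add `"cpuUsageCoreMinutes": $cpu_usage_core_minutes` to the `data` object in `root`.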
Finally, configure the output:
```yaml
output:
  label: 'openmeter'
  drop_on:
    error: false
    error_patterns:
      - Bad Request
    output:
      http_client:
        url: '${OPENMETER_URL:https://us.api.konghq.com}/v3/openmeter/events'
        verb: POST
        headers:
          Authorization: 'Bearer $KONNECT_SYSTEM_ACCESS_TOKEN'
          Content-Type: 'application/json'
        timeout: 30s
        retry_period: 15s
        retries: 3
        max_retry_backoff: 1m
        max_in_flight: 64
        batch_as_multipart: false
        drop_on:
          - 400
        batching:
          count: 100
          period: 1s
          processors:
            - metric:
                type: counter
                name: openmeter_events_sent
                value: 1
            - archive:
                format: json_array
        dump_request_log_level: DEBUG
```
Replace $KONNECT_SYSTEM_ACCESS_TOKEN with your own system access token.
Scheduling
The Collector runs on a schedule defined by the `schedule` parameter using cron syntax. It supports:
- Standard cron expressions (for example, `*/30 * * * * *` for every 30 seconds)
- Duration syntax with the `@every` prefix (for example, `@every 30s`)
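For example, either of the following `schedule` values scrapes every 30 seconds (a sketch showing only the relevant field of the `run_ai` input):

```yaml
input:
  run_ai:
    # Six-field cron syntax: run at seconds 0 and 30 of every minute
    schedule: '*/30 * * * * *'
    # Alternatively, duration syntax:
    # schedule: '@every 30s'
```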
Resource types
The Collector can collect metrics from two different resource types:
- `workload`: Collects metrics at the workload level, which represents a group of pods
- `pod`: Collects metrics at the individual pod level (see the example after this list)
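For example, a minimal `run_ai` input for pod-level collection might look like the following sketch (credentials and URL as in the Get started example; the metric names come from the pod metrics table above):

```yaml
input:
  run_ai:
    url: '${RUNAI_URL:}'
    app_id: '${RUNAI_APP_ID:}'
    app_secret: '${RUNAI_APP_SECRET:}'
    schedule: '*/30 * * * * *'
    # Collect metrics per pod instead of per workload
    resource_type: 'pod'
    metrics:
      - GPU_UTILIZATION
      - GPU_MEMORY_USAGE_BYTES
      - CPU_USAGE_CORES
      - CPU_MEMORY_USAGE_BYTES
```

With `resource_type: 'pod'`, the Bloblang mapping above emits CloudEvents whose `type` is `pod`, since it sets `type` from `meta("resource_type")`.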
Installation
The Metering & Billing Collector (a custom Redpanda Connect distribution) is available through the following channels:
- Binaries can be downloaded from the GitHub Releases page.
- Container images are available on ghcr.io.
- A Helm chart is also available on GitHub Packages.