Inference Metrics

Jan 2 · 5min

#LLM

Time to First Token (TTFT)
End-to-End Request Latency (e2e_latency)
Inter-token Latency (ITL)
Tokens Per Second (TPS)
Requests Per Second (RPS)
p50/p95/p99 latency
Service-Level Objective (SLO)

Time to First Token (TTFT)

Definition : How long a user needs to wait after submitting a request before seeing the first token of the model’s output.
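
As a rough illustration of the definition above, TTFT can be measured on the client by timing how long a token stream takes to yield its first item. This is a minimal sketch, not tied to any particular serving library: `time_to_first_token` and its `clock` parameter are hypothetical names, and the measurement only matches the user's wait if the request is actually sent when iteration over `stream` begins.

```python
import time
from typing import Callable, Iterable, Optional


def time_to_first_token(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Seconds from calling this function until the first token arrives.

    Returns None if the stream ends without producing any token.
    """
    start = clock()
    for _token in stream:
        # Stop at the first token; later tokens do not affect TTFT.
        return clock() - start
    return None
```

The injectable `clock` is a design choice for testing: passing a fake clock makes the result deterministic, while the default uses `time.perf_counter`.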

End-to-End Request Latency (e2e_latency)

Definition : How long it takes from submitting a query to receiving the full response, including the time spent in your queueing/batching mechanisms and network latency.
e2e_latency = TTFT + Generation_time
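
The relationship above can be checked with three client-side timestamps per request. This is a small sketch under assumptions: the `RequestTimestamps` record and `latency_breakdown` helper are illustrative names, and all timestamps are taken from the same clock, in seconds.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RequestTimestamps:
    submitted: float    # when the query was sent
    first_token: float  # when the first output token arrived
    last_token: float   # when the final output token arrived


def latency_breakdown(ts: RequestTimestamps) -> Dict[str, float]:
    """Split one request's latency into TTFT and generation time."""
    ttft = ts.first_token - ts.submitted
    generation_time = ts.last_token - ts.first_token
    return {
        "ttft": ttft,
        "generation_time": generation_time,
        # By construction this equals last_token - submitted.
        "e2e_latency": ttft + generation_time,
    }
```

For a request submitted at 0.0 s whose first token arrives at 0.5 s and last token at 2.5 s, the TTFT is 0.5 s, the generation time is 2.0 s, and the end-to-end latency is 2.5 s.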

Inter-token Latency (ITL)

Definition : The average time between consecutive tokens, also known as time per output token (TPOT).
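
One way to compute this for a single request, assuming the arrival time of each output token has been recorded, is to average the gaps between consecutive arrivals. The helper name `inter_token_latency` is hypothetical. Because the gaps telescope, the mean is simply the span from first to last token divided by the number of gaps.

```python
from typing import Optional, Sequence


def inter_token_latency(arrival_times: Sequence[float]) -> Optional[float]:
    """Mean gap in seconds between consecutive token arrivals.

    Returns None when fewer than two tokens were produced, since there
    is no gap to measure.
    """
    if len(arrival_times) < 2:
        return None
    gaps = len(arrival_times) - 1
    # Sum of consecutive gaps equals last arrival minus first arrival.
    return (arrival_times[-1] - arrival_times[0]) / gaps
```

Note that the first token's own latency (TTFT) is excluded here; only the time after the first token counts.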

Tokens Per Second (TPS)

Definition : Total TPS per system represents the total output-token throughput per second, counted across all requests being served at the same time.
As the number of requests increases, the total TPS per system increases until it reaches a saturation point for all the available GPU compute resources, beyond which it might decrease.
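
As a sketch of how total TPS might be computed from a benchmark run, the function below sums output tokens over all requests and divides by the wall-clock window that covers them. The name `total_tps` and the `(start, end, output_tokens)` record layout are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def total_tps(requests: Iterable[Tuple[float, float, int]]) -> float:
    """Total output tokens per second across all requests.

    Each record is (start_time_s, end_time_s, output_tokens). The
    window runs from the earliest start to the latest end, so
    overlapping (concurrent) requests share the same wall-clock time.
    """
    records = list(requests)
    if not records:
        return 0.0
    window_start = min(start for start, _end, _tokens in records)
    window_end = max(end for _start, end, _tokens in records)
    duration = window_end - window_start
    if duration <= 0:
        raise ValueError("benchmark window must have positive duration")
    total_tokens = sum(tokens for _start, _end, tokens in records)
    return total_tokens / duration
```

Running this at several concurrency levels and comparing the results is one way to locate the saturation point described above.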

Requests Per Second (RPS)

Definition : The average number of requests that the system can successfully complete in a 1-second period.
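
A minimal sketch of the calculation, assuming each request in a benchmark run is recorded as succeeded or failed and the run's total duration is known. The name `requests_per_second` is hypothetical.

```python
from typing import Iterable


def requests_per_second(succeeded: Iterable[bool], duration_s: float) -> float:
    """Average number of successfully completed requests per second."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    # Failed requests are excluded: only successful completions count.
    completed = sum(1 for ok in succeeded if ok)
    return completed / duration_s
```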

p50/p95/p99 latency

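The P99 value means that 99% of requests complete at or below this time; only the slowest 1% of requests have a response time longer than the P99 value. P50 and P95 are read the same way for 50% and 95% of requests.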
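
To make the percentile reading concrete, here is a small sketch using the nearest-rank method. This is only one of several percentile conventions, and monitoring tools may interpolate differently. The function name `percentile` is illustrative.

```python
import math
from typing import Sequence


def percentile(latencies: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest recorded latency such that
    at least p percent of the samples are less than or equal to it."""
    if not latencies:
        raise ValueError("latencies must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(latencies)
    # Rank is 1-based; convert to a 0-based index when reading the list.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

With 100 samples of 1 ms through 100 ms, this gives a P50 of 50 ms, a P95 of 95 ms, and a P99 of 99 ms.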

Service-Level Objective (SLO)

Defines the target performance level for a particular metric.
- For example, an SLO for TTFT might specify that 95% of chatbot interactions should have a TTFT below 200 milliseconds.
An SLO is typically a key part of a broader service-level agreement (SLA) between a service provider and its users.
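
The TTFT example above can be expressed as a simple check over measured values. This is a sketch under assumptions: `meets_slo` is a hypothetical helper, the defaults mirror the 200 ms and 95% figures from the example, and "below" is read as strictly less than the threshold.

```python
from typing import Sequence


def meets_slo(
    values_ms: Sequence[float],
    threshold_ms: float = 200.0,
    target_fraction: float = 0.95,
) -> bool:
    """True if at least target_fraction of the values are strictly
    below threshold_ms."""
    if not values_ms:
        raise ValueError("values_ms must not be empty")
    within = sum(1 for value in values_ms if value < threshold_ms)
    return within / len(values_ms) >= target_fraction
```

For instance, if 96 of 100 interactions have a TTFT of 150 ms and 4 take 400 ms, the SLO is met; if only 18 of 20 are under 200 ms, it is not.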


CC BY-NC-SA 4.0 2021-PRESENT © Alex Yang