Inference Metrics

Jan 2 · 5min

#LLM

Time to First Token (TTFT)

  • Definition : How long a user waits after submitting a query before seeing the first token of the model’s output.

End-to-End Request Latency (e2e_latency)

  • Definition : How long it takes from submitting a query to receiving the full response, including time spent in your queueing/batching mechanisms and network latency.
  • e2e_latency = TTFT + Generation_time
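
As a minimal sketch of the formula above, assuming the client records three timestamps per request (the `RequestTiming` class and its field names are hypothetical, not taken from any particular benchmarking tool):

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Hypothetical client-side timestamps for one request, in seconds."""
    submitted_at: float    # query is sent
    first_token_at: float  # first output token arrives
    last_token_at: float   # final output token arrives


def ttft(t: RequestTiming) -> float:
    # Wait between sending the query and seeing the first token.
    return t.first_token_at - t.submitted_at


def generation_time(t: RequestTiming) -> float:
    # Time spent streaming the rest of the response after the first token.
    return t.last_token_at - t.first_token_at


def e2e_latency(t: RequestTiming) -> float:
    # e2e_latency = TTFT + Generation_time
    return ttft(t) + generation_time(t)


timing = RequestTiming(submitted_at=0.0, first_token_at=0.25, last_token_at=2.25)
print(ttft(timing), generation_time(timing), e2e_latency(timing))
```

Because the timestamps are taken on the client, queueing and network delays are already folded into both terms.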

Inter-token Latency (ITL)

  • Definition : The average time between consecutive output tokens, also known as time per output token (TPOT).
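
One way to compute this, assuming you have the arrival time of every output token for a single request (the function name and input format are illustrative only):

```python
def inter_token_latency(token_times: list[float]) -> float:
    """Average gap, in seconds, between consecutive token arrival times."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure a gap")
    # Pair each token with the one that follows it and measure the gap.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return sum(gaps) / len(gaps)


# Four tokens arriving 0.25 s apart; the first arrival time is the TTFT.
arrivals = [0.25, 0.5, 0.75, 1.0]
print(inter_token_latency(arrivals))
```

Note that the first token's wait is excluded: with N tokens there are only N - 1 gaps, so TTFT does not skew the average.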

Tokens Per Second (TPS)

  • Definition : Total TPS per system is the system's total output-token throughput, measured in tokens per second.
  • As the number of requests increases, the total TPS per system increases until the available GPU compute resources are saturated, beyond which it might decrease.
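
A rough sketch of the calculation, assuming you know how many output tokens each request produced during a measurement window (names and numbers are made up for illustration):

```python
def total_tps(output_token_counts: list[int], window_seconds: float) -> float:
    """Total output tokens across all requests divided by the window length."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return sum(output_token_counts) / window_seconds


# Three requests finished inside a 4-second window.
tokens_per_request = [120, 80, 200]
print(total_tps(tokens_per_request, window_seconds=4.0))
```

Repeating this measurement at increasing request concurrency is one way to locate the saturation point described above.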

Requests Per Second (RPS)

  • Definition : the average number of requests that can be successfully completed by the system in a 1-second period.
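
For illustration, assuming each request in a measurement window is recorded as a success flag (a hypothetical format), RPS counts only the requests that completed successfully:

```python
def requests_per_second(succeeded: list[bool], window_seconds: float) -> float:
    """Successfully completed requests per second over the window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    completed = sum(1 for ok in succeeded if ok)  # failed requests are ignored
    return completed / window_seconds


# Four requests in a 2-second window, one of which failed.
outcomes = [True, True, False, True]
print(requests_per_second(outcomes, window_seconds=2.0))
```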

p50/p95/p99 latency

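  • The P99 value means that 99% of requests complete within this time; only the slowest 1% of requests have a response time longer than the P99 value. P50 and P95 are defined the same way for 50% and 95% of requests.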
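
A small sketch of how such percentiles can be computed, using the nearest-rank method as an assumption (monitoring tools often interpolate instead, so their numbers may differ slightly):

```python
def percentile_nearest_rank(samples: list[float], p: int) -> float:
    """Smallest sample such that at least p% of samples are at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in 1..100")
    ordered = sorted(samples)
    # Integer form of ceil(p * n / 100), avoiding floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


# 100 synthetic latencies: 1 ms, 2 ms, ..., 100 ms.
latencies_ms = [float(ms) for ms in range(1, 101)]
for p in (50, 95, 99):
    print(f"p{p} = {percentile_nearest_rank(latencies_ms, p)} ms")
```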

Service-Level Objective (SLO)

  • Defines the target performance level for a particular metric.
    • For example, an SLO for TTFT might specify that 95% of chatbot interactions should have a TTFT below 200 milliseconds.
  • An SLO is typically a key part of a broader service-level agreement (SLA) between a service provider and its users.
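
The TTFT example above could be checked along these lines. This is only a sketch: the function name, its parameters, and the sample data are hypothetical.

```python
def meets_ttft_slo(ttft_ms: list[float],
                   threshold_ms: float = 200.0,
                   target_percent: int = 95) -> bool:
    """True if at least target_percent% of samples are below threshold_ms."""
    if not ttft_ms:
        raise ValueError("need at least one TTFT sample")
    within = sum(1 for value in ttft_ms if value < threshold_ms)
    # Compare as integers to avoid floating-point edge cases at the boundary.
    return within * 100 >= target_percent * len(ttft_ms)


# 19 fast interactions and 1 slow one: exactly 95% are under 200 ms.
samples_ok = [150.0] * 19 + [450.0]
# 18 fast and 2 slow: only 90% are under 200 ms.
samples_bad = [150.0] * 18 + [450.0, 500.0]
print(meets_ttft_slo(samples_ok), meets_ttft_slo(samples_bad))
```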