Test Date: 2025-01-15 14:30:00
Eval Goal: Evaluate the agent's ability to handle customer inquiries, process refunds, and resolve complaints effectively.
Pass Rate: 58% (7 of 12 traces met the evaluation criteria)

The agent handles routine inquiries well but struggles with refund processing and frustrated customers. Gaps in tool usage (2 of 12 traces) and a hallucination (1 of 12 traces) need to be addressed before production deployment.

Attention Required: 5 traces (2 Failed, 2 Marginal, 1 Review)
Trace | Input | Status | Issue | Reason
trace-... | "This is ridiculous! My order #12345 hasn't arrived and it's been 2 weeks!" | Fail | Hallucination detected | Agent fabricated shipping details without querying the order system.
trace-... | "I want a full refund immediately. The product was damaged." | Fail | Task not completed | Refund tool was available but agent only provided general policy info.
trace-... | "Can you expedite my order? I need it by tomorrow." | Marginal | Partial completion | Agent checked order status but didn't attempt to expedite shipping.
trace-... | "I think there's a billing error on my last invoice." | Review | Ambiguous outcome | Agent identified a discrepancy but escalation path was unclear.
trace-... | "The tracking says delivered but I never got the package." | Marginal | Incomplete resolution | Agent verified tracking but didn't offer replacement or refund options.
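
The headline pass rate and the attention counts both follow from the per-trace statuses. A minimal Python sketch of that tally, assuming each trace carries exactly one status label; the label strings and the `summarize` helper are illustrative and not part of the evaluation tool:

```python
from collections import Counter

def summarize(statuses):
    """Return the pass rate (rounded percent) and counts of non-passing statuses."""
    counts = Counter(statuses)
    total = len(statuses)
    pass_rate = round(100 * counts["pass"] / total) if total else 0
    # Everything that is not a pass needs attention.
    attention = {s: n for s, n in counts.items() if s != "pass"}
    return pass_rate, attention

# Status mix taken from this report: 7 passed, 2 failed, 2 marginal, 1 needs review.
statuses = ["pass"] * 7 + ["fail"] * 2 + ["marginal"] * 2 + ["review"]
pass_rate, attention = summarize(statuses)
print(pass_rate, sum(attention.values()))  # 58 5
```

Marginal and Review traces count as not passed here, which is the reading that matches 7 of 12 passing with 5 traces flagged.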

What do the numbers say?

Completeness
Task Completion Rate: 75% (9/12), Fair
Correctness
Hallucination Rate: 92% (11/12 traces without hallucinations), Good
Relevance Rate: 100% (12/12), Good
Tool Usage Appropriateness: 83% (10/12), Fair
Response Quality
User Satisfaction Score: 67% (8/12), Fair
Response Clarity: 92% (11/12), Good
Persona Consistency: 83% (10/12), Fair
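
Each rate above is the number of traces meeting a criterion divided by the 12 traces evaluated. The sketch below reproduces the percentages. The 90% cutoff between Good and Fair is an assumption, chosen only because it is consistent with the labels shown (92% is Good, 83% is Fair); the report does not state the actual threshold.

```python
def rate(passed, total):
    """Percentage of traces meeting a criterion, rounded to the nearest integer."""
    return round(100 * passed / total)

def rating(percent, good_cutoff=90):
    # good_cutoff is a hypothetical threshold, not documented in the report.
    return "Good" if percent >= good_cutoff else "Fair"

# Counts of traces meeting each criterion, out of 12, as listed in the report.
TOTAL_TRACES = 12
passed_counts = {
    "Task Completion Rate": 9,
    "Hallucination Rate": 11,
    "Relevance Rate": 12,
    "Tool Usage Appropriateness": 10,
    "User Satisfaction Score": 8,
    "Response Clarity": 11,
    "Persona Consistency": 10,
}

for name, passed in passed_counts.items():
    pct = rate(passed, TOTAL_TRACES)
    print(f"{name}: {pct}% ({passed}/{TOTAL_TRACES}), {rating(pct)}")
```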
Efficiency & Performance
Output Tokens: 342 tokens (P50: 298, P95: 512), In range
Conversation Depth: 2.3 turns (P50: 2, P95: 4), In range
Latency: 3.2s (P50: 2.8s, P95: 5.1s, P99: 6.2s), In range
Persona Performance Gap: Latency +18.5%, Tokens +12.3%, Info
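
The efficiency figures pair a headline value with percentiles, and the persona gap is a relative difference. A minimal sketch of both calculations, assuming per-trace samples are available and that the gap compares persona-driven runs against a baseline run; the report states neither, and the helper names and sample values below are made up for illustration:

```python
import statistics

def percentile(samples, p):
    """p-th percentile (1 <= p <= 99), linearly interpolated between samples."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[p - 1]

def relative_gap(persona_value, baseline_value):
    """Percentage change of the persona run relative to the baseline run."""
    return round(100 * (persona_value - baseline_value) / baseline_value, 1)

# Made-up latency samples in seconds, not the report's underlying data.
latencies = [2.1, 2.5, 2.8, 2.8, 3.0, 3.4, 5.1]
print(percentile(latencies, 50), percentile(latencies, 95))

# Hypothetical baseline and persona latencies that would yield an 18.5% gap.
print(relative_gap(2.37, 2.0))  # 18.5
```

With only 12 traces, P95 and P99 are determined almost entirely by the one or two slowest traces, so they are best read as rough indicators.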