Arize Phoenix: LLM Observability That's Actually Free
Arize Phoenix is an open-source LLM observability platform that's evolved well beyond its origins as a drift detector.
Arize Phoenix occupies a useful position in the LLM tooling landscape: it’s genuinely free to self-host (the repository is published under the Elastic License 2.0), it has production-ready observability features, and it goes well beyond what “open-source LLM observability” typically implies.
This review focuses on the security-relevant features — hallucination detection, prompt monitoring, retrieval quality evaluation — and how they perform in production deployments.
The product in brief
Phoenix runs as a local or self-hosted web application. You instrument your LLM application with OpenInference (Arize’s OpenTelemetry-based instrumentation packages), which captures spans for LLM calls, retrieval steps, tool invocations, and other events. These spans are sent to Phoenix, which provides:
- Trace visualization (full request/response flows)
- LLM evaluation (hallucination, relevance, toxicity scoring via judge LLM)
- Dataset management (golden datasets, golden queries)
- Prompt playground (test prompts against datasets)
- Drift detection (input distribution, output length, score distribution)
The instrumentation is clean:
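from phoenix.otel import register

# Point the OpenTelemetry tracer provider at a running Phoenix instance.
tracer_provider = register(
    project_name="my-llm-app",
    endpoint="http://localhost:6006/v1/traces",
    auto_instrument=True,  # activate every installed OpenInference instrumentor
)
# With the matching openinference-instrumentation-* packages installed,
# OpenAI, LangChain, LlamaIndex, etc. calls are now traced automatically.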
Hallucination detection: how it works and how it performs
Phoenix’s hallucination evaluation uses a judge LLM to score responses for groundedness — whether claims in the response are supported by the provided context.
The evaluator:
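from phoenix.evals import HallucinationEvaluator, OpenAIModel

# Call shape follows the classic phoenix.evals API; newer releases may differ,
# so check the documentation for your installed version.
evaluator = HallucinationEvaluator(OpenAIModel(model="gpt-4"))

# The record keys match the variables in the evaluator's prompt template.
label, score, explanation = evaluator.evaluate(
    {
        "input": query,
        "output": response,
        "reference": retrieved_documents,  # retrieved context as a single string
    },
    provide_explanation=True,
)
# label: "hallucinated" | "factual" (a fallback label if the judge output cannot be parsed)
# score: numeric score derived from the label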
Performance is bounded by the judge model you point it at and by the quality of the context you pass. LLM-as-a-judge groundedness scoring is a probabilistic classifier, not a deterministic check: it will both miss some unsupported claims and flag some well-grounded responses as hallucinated. Arize documents this limitation and publishes evaluator-benchmark results in its LLM-as-a-judge documentation rather than asking teams to take the accuracy on faith.
Because the false-positive rate is non-trivial for high-volume use, the evaluator is most useful as a sampling tool: run it on a fraction of production traffic and watch aggregate trends, rather than treating any individual score as ground truth.
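One way to put the sampling advice into practice is sketched below. It is not part of Phoenix: `should_evaluate`, `RollingRate`, and the 5% sample rate are hypothetical names and values chosen for illustration. The idea is to hash each trace ID to pick a stable fraction of traffic, then track the share of "hallucinated" labels over a rolling window instead of reacting to any single score.

```python
import hashlib
from collections import deque

SAMPLE_RATE = 0.05  # assumed value: evaluate roughly 5% of traces


def should_evaluate(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministically select a stable fraction of traces by hashing the ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a roughly uniform value between 0 and 1.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


class RollingRate:
    """Share of 'hallucinated' labels over the most recent `window` evaluations."""

    def __init__(self, window: int = 500) -> None:
        self._flags = deque(maxlen=window)

    def add(self, label: str) -> None:
        self._flags.append(label == "hallucinated")

    def rate(self) -> float:
        # An empty window reports 0.0 rather than raising.
        return sum(self._flags) / len(self._flags) if self._flags else 0.0
```

Hashing the trace ID rather than calling a random number generator means the same trace is always in or out of the sample, which keeps re-runs and backfills consistent. An alert would then fire on a sustained change in `rate()`, not on an individual label.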
Retrieval quality evaluation
For RAG applications, Phoenix provides retrieval quality scoring: given a query and retrieved documents, does the retrieval include the information needed to answer the query?
This is distinct from hallucination detection (which asks “is the response grounded in the context?”) — it asks “did the retrieval give the model what it needed?”
This separation is where retrieval-quality scoring earns its place: a class of queries that consistently retrieves tangentially related rather than directly relevant documents will drive a downstream rise in apparent hallucination, and the retrieval-quality score surfaces the root cause before the hallucination metric does. Per Arize’s documentation, that makes it a useful leading indicator for RAG-pipeline regressions that user feedback alone tends to miss.
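To make the leading-indicator idea concrete, here is a minimal sketch that groups evaluation results by query class and lists the classes whose retrieval relevance is low. It assumes the evaluation results have already been exported into plain dictionaries; the field names (`query_class`, `retrieval_relevant`, `hallucinated`) and the thresholds are hypothetical, not a Phoenix schema.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_class(rows):
    """Aggregate per-query-class retrieval relevance and hallucination rate.

    Each row is a dict with the (assumed) keys:
    query_class (str), retrieval_relevant (bool), hallucinated (bool).
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["query_class"]].append(row)

    summary = {}
    for query_class, items in grouped.items():
        summary[query_class] = {
            "n": len(items),
            "retrieval_relevance": mean(1.0 if r["retrieval_relevant"] else 0.0 for r in items),
            "hallucination_rate": mean(1.0 if r["hallucinated"] else 0.0 for r in items),
        }
    return summary


def weak_retrieval_classes(summary, min_relevance=0.6, min_n=20):
    """Return query classes with enough samples and relevance below the threshold."""
    return sorted(
        query_class
        for query_class, stats in summary.items()
        if stats["n"] >= min_n and stats["retrieval_relevance"] < min_relevance
    )
```

Reviewing the two metrics side by side per class shows whether a rise in hallucination rate follows a drop in retrieval relevance, which points at the retriever rather than the generator.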
Prompt monitoring
Phoenix’s trace view shows you the system prompt, user message, and model response for every logged span. For security monitoring, the most useful feature is the ability to search and filter traces:
- Filter by model (catch model version changes)
- Filter by high latency (can accompany context-stuffing attacks)
- Filter by high token count (can indicate long-context injection attempts)
- Full text search on prompts (find specific injection patterns after the fact)
This is incident response tooling as much as monitoring tooling — when you suspect an attack, you can search historical traces for the pattern.
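The same filters can be applied offline to exported traces. The sketch below assumes spans have been exported into plain dictionaries; the field names (`trace_id`, `latency_ms`, `total_tokens`, `prompt`), the thresholds, and the two regex patterns are illustrative assumptions rather than Phoenix's span schema or a complete injection signature set.

```python
import re

# Illustrative patterns only; a real deployment would maintain its own list.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (your|the) system prompt", re.IGNORECASE),
]


def suspicious_spans(spans, max_latency_ms=10_000, max_tokens=8_000):
    """Yield (trace_id, reasons) for spans that trip any of the checks."""
    for span in spans:
        reasons = []
        if span.get("latency_ms", 0) > max_latency_ms:
            reasons.append("high_latency")
        if span.get("total_tokens", 0) > max_tokens:
            reasons.append("high_token_count")
        prompt = span.get("prompt", "")
        if any(pattern.search(prompt) for pattern in INJECTION_PATTERNS):
            reasons.append("injection_pattern")
        if reasons:
            yield span["trace_id"], reasons
```

Returning the reasons alongside the trace ID keeps the output usable for triage: a span flagged only for latency is a weaker signal than one that also matches an injection pattern.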
The self-hosted story
Phoenix runs in Docker with a one-command start:
docker run -p 6006:6006 arizephoenix/phoenix
For production deployments, the recommended setup adds PostgreSQL persistence and a reverse proxy. This is a few hours of engineering work but produces a robust deployment.
The alternative is Arize’s commercial cloud platform, which offers hosted Phoenix with better storage, authentication, and integration with the broader Arize product suite.
Verdict
Phoenix is the best open-source LLM observability platform for teams that want to avoid vendor lock-in and have the engineering bandwidth to self-host. The evaluation features (hallucination detection, retrieval quality) are more mature than most alternatives at any price point.
The limitations: the evaluator accuracy is good but not sufficient for high-confidence individual decisions; the self-hosted infrastructure requires maintenance; and the security-specific features (attack detection, injection monitoring) are limited compared to dedicated security tools.
For teams that need more than observability and less than a dedicated security platform, pairing Phoenix with a dedicated injection detector like Lakera Guard or Rebuff closes the gap.
For the methodology behind these comparisons, see our AI security tool evaluation framework, which covers the full tooling landscape.