The architecture
Here's how we think about the architecture. Your documents feed a Knowledge Layer made of many components. Tools and an orchestrator draw on those components in a many-to-many relationship. The orchestrator chains tools into workflows, and workflows compose into the mini-apps your teams use. Color shows what we bring versus what we build net-new, but everything on this diagram is fully customizable: every tool, every knowledge-layer component, every workflow, the orchestrator, and every mini-app. Even the pieces we've built before are customized and tuned to exactly what Viridon needs.
Flow ↑ Sources (bottom) → Knowledge Layer → Tools → Workflows → Mini-apps (top)
Figure 1. High-level overview of the architecture — what's built today and what's net-new for Viridon.
Two things the colors say. First, the components marked "all apps" (retrieval, ingestion, concept linkages, scoped retrieval, grounded Q&A, web research) are shared infrastructure every mini-app reuses. Second, almost everything is foundation we've already built and then customize to you; the bespoke pieces, namely the app-specific tools (T1, T3, T4, T6, T10, T12) and knowledge-layer modules KL·I–L, are where we spend the saved time.
01 — Ingestion · Indexing · Retrieval
Diligence question 1.1
How would you parse our source documents, and what tools or platforms would you use and why? Specifically: how do you handle complex tables — merged cells, multi-page tables, nested or irregular layouts?
We parse Viridon's corpus with layout-aware parsing rather than naive text extraction, and the distinction matters specifically because of what your documents are: 300+ page proposals with 100+ attachments, ~200-page selection reports, embedded tables and figures, mixed formatting from many authors and source systems. Copying a PDF's text layer and splitting on line breaks scrambles multi-column reading order, flattens tables into meaningless runs of numbers, and silently drops every figure. Instead we render each page, detect its regions with a layout model, recover table structure cell-by-cell, OCR anything without a text layer, and run a vision-language model over figures so that content carried in images becomes searchable text.
The backbone is the Unstructured library, which turns PDF, DOCX, XLSX, and HTML into typed elements — titles, narrative text, tables, images — each carrying its text, metadata, and a bounding box locating it on the page. Three models do the heavy lifting underneath: a layout-detection model (yolox) that finds regions and preserves reading order on complex pages; a table-transformer that recovers row and column structure, including merged cells, and emits each table as addressable HTML; and OCR (optical character recognition — turning page images into machine-readable text) for scanned pages and image regions. We route each document to the right parsing strategy — high-resolution layout-plus-OCR for the messy real-world proposals, a cheaper native-text path for clean digital PDFs, and full-page OCR for scans — rather than forcing a single mode across the corpus.
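The per-document routing described above can be sketched as a small decision function. This is a minimal illustration, not the production router: the `PdfProfile` fields and both thresholds are hypothetical, and only the three returned names ("fast", "hi_res", "ocr_only") correspond to strategy values that Unstructured's PDF partitioner accepts.

```python
from dataclasses import dataclass


@dataclass
class PdfProfile:
    """Cheap statistics gathered before parsing (hypothetical fields)."""
    pages: int
    pages_with_text_layer: int
    has_tables_or_multicolumn: bool


def choose_strategy(profile: PdfProfile) -> str:
    """Pick a parsing strategy name for one document.

    The 0.10 and 0.95 thresholds are illustrative assumptions.
    """
    if profile.pages <= 0:
        raise ValueError("document has no pages")
    text_ratio = profile.pages_with_text_layer / profile.pages
    if text_ratio < 0.10:
        return "ocr_only"  # scans: almost no page carries a text layer
    if text_ratio > 0.95 and not profile.has_tables_or_multicolumn:
        return "fast"      # clean digital PDF: native text extraction is enough
    return "hi_res"        # messy layouts: layout detection plus OCR
```

The returned name would then be passed as the `strategy` argument when partitioning the file, for example `partition_pdf(filename=path, strategy=choose_strategy(profile), infer_table_structure=True)`.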
For complex tables, merged cells and spans are recovered by the table-transformer and preserved all the way through to per-cell records, so a value like "$42M" stays bound to its row ("Phase 2 capex") and column ("2027 estimate"). Nested and irregular tables that defeat structure inference degrade gracefully: we retry without forcing a grid and keep the table as both an image and a VLM-written description, so content is never lost even when the exact grid can't be recovered. Multi-page tables are the one piece that's net-new for Viridon — because we split documents per page for throughput, a table spanning two pages first appears as two elements, and we add a reconciliation pass that detects the continuation (matching headers and column count, no intervening narrative text) and stitches them into one logical table before indexing.
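As a rough illustration of the reconciliation pass, the continuation test can be written as a predicate over two adjacent table elements. The `TableElement` type and its fields are hypothetical stand-ins for parsed table records; the three checks mirror the rule stated above (consecutive pages, matching headers and column count, no narrative text in between).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TableElement:
    """Hypothetical parsed table: page number, header cells, data rows."""
    page: int
    header: List[str]
    rows: List[List[str]] = field(default_factory=list)


def _normalize(cells: List[str]) -> List[str]:
    # Compare headers case-insensitively and ignore stray whitespace.
    return [c.strip().lower() for c in cells]


def is_continuation(prev: TableElement, nxt: TableElement,
                    narrative_between: bool) -> bool:
    """True when `nxt` looks like the next page of the same logical table."""
    if narrative_between:
        return False
    if nxt.page != prev.page + 1:
        return False
    if len(nxt.header) != len(prev.header):
        return False
    return _normalize(nxt.header) == _normalize(prev.header)


def stitch(prev: TableElement, nxt: TableElement) -> TableElement:
    """Merge two page-level tables into one logical table."""
    return TableElement(page=prev.page, header=prev.header,
                        rows=prev.rows + nxt.rows)
```

A real pass would also need a policy for continuation pages that do not repeat the header row; that case is left out of this sketch.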
Figures get dedicated handling because so much of the signal in transmission proposals and ISO/RTO studies lives in diagrams, not prose — one-line diagrams, substation layouts, deliverability-study charts. Every figure is extracted and, whether or not it carries any extractable text, passed to a vision-language model — an AI that looks at an image and describes it in words. We index that description alongside a pointer back to the original image. A query about a deliverability constraint near a given substation can then retrieve a one-line diagram that contains zero searchable text. Optionally, multi-modal embeddings can make the image itself retrievable by visual similarity — we'd add that only if evaluation shows figure-heavy queries underperforming, rather than paying for it speculatively.
The commodity stages — raw OCR and the base layout and table models — we take off the shelf rather than rebuild. Our differentiated value is the orchestration around them: strategy routing per document, the dual image-plus-structure table representation, VLM figure indexing, multi-page reconciliation, and the fallbacks that keep your hardest documents parsing into something useful instead of failing silently.
The parsing layer is built on the Unstructured Python library (unstructured 0.17.0 + unstructured-inference 0.8.10). It converts PDF, DOCX, XLSX, and HTML into a stream of typed elements — Title, NarrativeText, Table, Image, PageBreak — each carrying text, metadata, and (in the high-resolution path) a bounding-box polygon. Everything downstream — chunking, table handling, figure description, retrieval scope — keys off those types and coordinates, which is why getting this layer right is what makes the rest possible. For a 300-page proposal the document is pulled from S3, split into one PDF per page, then parsed in parallel page-buckets — a throughput decision with one consequence we handle explicitly (multi-page tables, below).
| Model / component | Role | Why it's needed for Viridon's docs |
|---|---|---|
| yolox | Layout detection — finds text blocks, titles, tables, figures as bounding boxes on the rendered page | Multi-column layouts, sidebars, callouts, headers/footers — positional detection preserves reading order instead of interleaving columns. |
| Table-transformer | Recovers row/column structure including spans for each detected table; emits text_as_html | Turns a table from a meaningless run of numbers into cell-addressable structure. |
| OCR (Tesseract) | Reads text out of rendered regions and scanned pages | Selection reports and attachments are frequently scans or image-based exports. |
| pdfminer | Native text-layer extraction (FAST path) | Cheap, accurate path for clean digital PDFs where the layout model isn't warranted. |
| PyMuPDF (fitz) | Extracts hyperlink annotations, attaches each URL to the nearest element by coordinate | Proposals and contracts cross-reference prior sections and external docs by link. |
All of the above run in the pipeline today (Figure 1).
A large share of the signal in transmission proposals and ISO/RTO studies lives in figures, not prose. The pipeline extracts every image/figure region, uploads it to S3, and keeps the base64 in metadata. If the region's OCR text comes back empty (i.e. it's a true diagram, not a text box), we run a Vision Language Model over it to generate a natural-language description of what the figure shows. The model runs on Amazon Bedrock (reached privately over PrivateLink) or as a self-hosted open model. We index that description as searchable text paired with a pointer back to the original image. The practical effect: a query like "deliverability constraint near the X substation" can retrieve a figure that contains zero extractable text, because the VLM's description of it is in the index.
Today we embed text — including the VLM-generated figure descriptions — into a single text vector space (Vespa). The figure is therefore retrievable via its description. An optional extension adds a multi-modal embedding (a CLIP-style model) so the image itself is embedded into a shared vector space and retrievable by visual similarity, not only through its description. The trade-off is additional indexing infrastructure and cost; in practice the description-indexing approach already captures most of the retrieval value, so we'd enable this only if the eval harness shows figure-heavy queries underperforming.
| Layer | Tool / model | Why this one |
|---|---|---|
| Parsing framework | Unstructured (0.17.0 / inference 0.8.10) | Typed-element output with coordinates; one interface across PDF/DOCX/XLSX/HTML; proven on long, messy documents. |
| Layout detection | yolox | Recovers reading order and region types on complex multi-column pages. |
| Table structure | table-transformer | Cell-level structure + spans → addressable HTML. |
| OCR | Tesseract (via Unstructured) | Reads scanned pages and image regions with no text layer. |
| Native text | pdfminer | Cheap, accurate path for clean digital PDFs. |
| Hyperlinks | PyMuPDF (fitz) | Recovers and re-attaches link annotations by coordinate. |
| Table → records | pandas | Turns table HTML into per-cell row/column records. |
| Figure understanding | VLM — Bedrock or self-hosted open model | Describes figures so image-only content becomes searchable text. |
| Embedding + index | text embedding model → Vespa | Embeds text + figure descriptions into the hybrid retrieval index (KL·A). |
| Object store | S3 | Original assets stay in your tenant. |
Edge vs. commodity — preview of 1.5
Raw OCR and the base layout/table models are commoditized, so we use strong off-the-shelf components there. Our differentiated value is the orchestration around them: strategy routing, the dual image-plus-structure table representation, VLM figure-description indexing, multi-page table reconciliation, and the graceful-degradation fallbacks that mean Viridon's hardest documents still parse into something useful.
Diligence question 1.2
What is your approach to chunking these documents, and what tools would you use and why? Specifically: what metadata schema would you attach to chunks (e.g. document type, ISO/RTO, date, section), and how would that metadata be created — automated extraction, manual tagging, or a mix?
We chunk in two stages. First, structural chunking from the parse (Figure 1, KL·B) breaks a document into typed elements — narrative blocks, table rows, figure descriptions — so each chunk already knows what kind of thing it is and where it sits. Then each element's text is split by a semantic chunker into topic-coherent pieces, and each piece becomes one embedding (a numeric representation of its meaning, so related passages land near each other). We do this because fixed-size windows cut through the middle of an argument or a table; topic-aware splits keep a retrievable chunk on a single idea, which is what drives retrieval quality on long proposals and ~200-page selection reports.
We don't use one chunker for everything — we match the strategy to the content to control both quality and cost. Long-form prose goes through semantic chunking (the primary strategy); short, already-structured fields (table rows, short metadata snippets) use cheaper recursive or fixed-length splitting, because running the full semantic pass on a one-line field spends embedding calls for no retrieval gain. Each Vespa record is capped at five embeddings so no single record becomes a bag of unrelated vectors. The tooling is a custom semantic chunker (Greg Kamradt's percentile-breakpoint method, run on our own embeddings rather than a third-party wrapper), LangChain's recursive character splitter for structured content, and Vespa as the index.
Every chunk lands in our standard_document schema, which carries a deep metadata set on each record: document type and subtype; page number and on-page coordinates (for exact citation); section and hierarchy IDs; table position (row, column, is-table-root); figure pointers; created/updated dates and a version; folder path; and access scope (organization, owner, collaborators). The schema also includes typed key/value maps — string→string, string→int, string→double — so Viridon-specific dimensions like ISO/RTO, sponsor, project, or filing date attach to chunks without a schema change when a new dimension shows up. Maps are how we store arbitrary per-client metadata as dictionaries rather than hard-coding columns.
That metadata drives two kinds of filtering. Hard filtering uses the exact-match maps — "only chunks where ISO_RTO = CAISO and document_type = selection report" — applied before ranking to scope the search precisely. Soft filtering uses metadata that's full-text indexed (titles, filenames, folder path, short structured fields): it's folded into the hybrid score via BM25 — the standard keyword-matching algorithm — so a matching project name boosts a chunk's relevance without excluding anything. (Hybrid search blends this keyword signal with vector search, which matches on meaning rather than exact words.) And because we set this per field, metadata can be made BM25-searchable or kept filter-only by choice — ISO/RTO can be a hard filter, a soft ranking signal, or both, depending on how you want it to behave.
Metadata is created mostly automatically, with light human curation. The parser emits structure, coordinates, page, and hierarchy; file and source systems give dates, folder path, and access scope; a classifier assigns document type/subtype; and an LLM extraction pass pulls Viridon-specific values (ISO/RTO, sponsor, project, key dates) out of the content into the maps. Humans confirm the taxonomies and correct the occasional misclassification — corrections feed back rather than being re-done each time. This whole step is the Ingestion & Structuring component (KL·B) feeding the index (KL·A).
Most file indexing (PDF, DOCX, TXT, PPTX, XLSX) follows the same shape: a structured element's text → SemanticChunker → a long_text_data[] array (each entry independently embedded) → break_element_arr_into_semi_equal_lengths (max 5) → one or more Vespa records. Semantic chunking decides how to split a block's text for embedding; the structural pass before it decides what each block is. The record-splitter is not a text splitter — it caps each record at five embeddings and distributes the chunks evenly, so retrieval granularity stays clean and no single record stores too many vectors.
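To make the record-splitting step concrete, here is a minimal sketch of one way to distribute chunks evenly under a cap of five per record. The function name `split_semi_equal` is hypothetical and this is not the source of break_element_arr_into_semi_equal_lengths; it only illustrates the behaviour described above.

```python
import math
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_semi_equal(chunks: Sequence[T], max_per_record: int = 5) -> List[List[T]]:
    """Split chunks into the fewest records with near-equal sizes.

    No record exceeds max_per_record, and sizes differ by at most one.
    """
    if max_per_record < 1:
        raise ValueError("max_per_record must be at least 1")
    n = len(chunks)
    if n == 0:
        return []
    n_records = math.ceil(n / max_per_record)
    base, extra = divmod(n, n_records)  # the first `extra` records get one more
    records, start = [], 0
    for i in range(n_records):
        size = base + (1 if i < extra else 0)
        records.append(list(chunks[start:start + size]))
        start += size
    return records
```

With eleven chunks this yields three records of sizes 4, 4, and 3 rather than 5, 5, and 1, which is the point of distributing evenly instead of filling each record to the cap.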
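Semantic chunking is the default for almost all document indexing. It is based on Greg Kamradt's percentile-breakpoint approach, customized to run on our own encode_many rather than a LangChain embedding wrapper. In outline, that approach splits the text into sentences, combines each sentence with its neighbours, embeds the combined sentences, measures the distance between consecutive embeddings, and breaks the text wherever that distance exceeds a chosen percentile of all distances.
It is used by the PDF, DOCX, TXT, PPTX, and XLSX indexers and by the shared embedding helper, which together cover the whole Viridon corpus.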
Cost discipline: semantic chunking embeds every combined-sentence pair for the distance calculation and then the final chunks — materially more embedding calls than recursive/fixed splitting. Reserving it for long-form prose and using cheaper splitters for short structured fields is a deliberate cost choice.
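For readers who want to see the shape of percentile-breakpoint chunking, the sketch below follows the publicly described method in plain Python. It is an illustration under stated assumptions, not the production chunker: `encode_many` is passed in as any callable that maps a list of strings to a list of vectors, the sentence splitter is a naive regex, and the `buffer` and `breakpoint_pct` defaults are illustrative.

```python
import math
import re
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)


def percentile(values: List[float], pct: float) -> float:
    """Linearly interpolated percentile, pct in [0, 100]."""
    s = sorted(values)
    k = (len(s) - 1) * pct / 100
    lo, hi = math.floor(k), math.ceil(k)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)


def semantic_chunks(text: str,
                    encode_many: Callable[[List[str]], List[Vector]],
                    buffer: int = 1,
                    breakpoint_pct: float = 95.0) -> List[str]:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) < 3:
        return [" ".join(sentences)] if sentences else []
    # Combine each sentence with its neighbours before embedding.
    windows = [" ".join(sentences[max(0, i - buffer): i + buffer + 1])
               for i in range(len(sentences))]
    vectors = encode_many(windows)
    distances = [cosine_distance(vectors[i], vectors[i + 1])
                 for i in range(len(vectors) - 1)]
    threshold = percentile(distances, breakpoint_pct)
    chunks, start = [], 0
    for i, d in enumerate(distances):
        if d > threshold:  # topic shift between sentence i and i + 1
            chunks.append(" ".join(sentences[start:i + 1]))
            start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks
```

Note the cost profile this implies: one embedding call per combined-sentence window to find the breakpoints, plus one per final chunk when the chunks themselves are embedded afterwards.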
Before any of the above, documents are chunked by structure: tables become row/column children (carrying row_num, col_num, is_table_root), images get VLM descriptions, and hierarchy is captured via group IDs. Only the resulting text (or image description) then goes through semantic chunking.
Each chunk is one standard_document record (inheriting a shared base schema). The fields map directly onto everything Viridon needs:
| Need | Schema field(s) | How it's created | Indexing |
|---|---|---|---|
| Document type | document_type, document_subtype | Classifier at ingest (proposal, selection report, RFI, NDA, ISO/RTO study) | filter |
| ISO/RTO, sponsor, project, dates | string_string_hard_filter_map, string_int_hard_filter_map, string_double_hard_filter_map | LLM extraction + source metadata — no schema change for new dimensions | hard filter (exact) |
| Section / hierarchy | group_id, parent_group_id, hierarchy_group_ids | Structural parse | filter / scope |
| Page & location | page_number, coordinates (map<string,float>) | Parser (Figure 1) | filter / citation |
| Table position | row_num, col_num, is_table_root, is_number_only | Table parser | filter |
| Figures | image_uuid, image_aws_key, is_image_useful | Image + VLM pipeline | filter |
| Date / recency / version | document_created_at, created_at, updated_at, version | File & source metadata | attribute (filter + rank) |
| Location in tenant | folder_path, folder_path_ids | SharePoint tree | BM25 + filter |
| Access scope (RBAC — role-based access control) | organization_id(s), owner_id, collaborator_ids | Source / SSO | filter (→ KL·H) |
| Titles | name, filename | Source | BM25 |
| Short structured fields | short_text_field_data (weightedset) | Extraction | BM25 (weighted) |
| The chunk text itself | long_text_data (array<string>) + long_text_embeddings | Semantic chunker | BM25 + semantic |
The three typed maps (string→string, string→int, string→double) are stored with exact match on both key and value, as fast-search attributes. That's what lets a query say "restrict to ISO_RTO = CAISO, document_type = selection report, year ≥ 2023" and have Vespa narrow the candidate set before ranking. Because they're maps, adding a new metadata dimension is a data change, not a schema migration.
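One way such a restriction could be expressed is sketched below as a small YQL clause builder. It assumes the maps are queried with Vespa's `sameElement` operator over `key` and `value`, which is how Vespa matches map entries; the source does not show the actual query, and the `year` key is a hypothetical example dimension.

```python
import json


def map_equals(field: str, key: str, value: str) -> str:
    """Exact match on one entry of a string-to-string map field."""
    # json.dumps gives a double-quoted, escaped literal that YQL accepts.
    return (f"{field} contains sameElement("
            f"key contains {json.dumps(key)}, value contains {json.dumps(value)})")


def map_at_least(field: str, key: str, minimum: int) -> str:
    """Numeric lower bound on one entry of a string-to-int map field."""
    return (f"{field} contains sameElement("
            f"key contains {json.dumps(key)}, value >= {int(minimum)})")


def hard_filter(*clauses: str) -> str:
    """AND the clauses together so they narrow the candidate set."""
    return " and ".join(f"({c})" for c in clauses)


example = hard_filter(
    map_equals("string_string_hard_filter_map", "ISO_RTO", "CAISO"),
    'document_type contains "selection report"',
    map_at_least("string_int_hard_filter_map", "year", 2023),
)
```

Because the dimension name is just a map key, supporting a new dimension means passing a different `key` string, with no change to the schema or to this builder.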
Fields marked enable-bm25 (long_text_data, name, filename, folder_path, short_text_field_data) are full-text searchable and carry per-field weights in the rank profile (e.g. name/filename weighted ~2× body text), so matching metadata boosts relevance rather than excluding. Fields marked attribute-only (document type, page, the hard-filter maps) are not BM25-searched — they're pure filters. This is a per-field choice: any piece of metadata can be made a soft ranking signal (indexed), a hard filter (attribute), or both. There are also trigram (gram-size 3) variants of the indexed fields for fuzzy/typo-tolerant matching, and an empty-field discount so a chunk isn't unfairly penalized for missing an optional field. The full hybrid rank profile — BM25 / semantic / n-gram weighting, multi-vector closeness over long_text_embeddings, and reranking — is covered in Section 1.4 (Retrieval).
Diligence question 1.3
What is your approach to embedding, and which model(s) or platform(s) would you use and why?
For Viridon we'd run a privacy-first, model-agnostic embedding layer on Amazon Bedrock, so every document is embedded entirely within Viridon's own AWS environment. We treat the embedding model as a swappable component rather than a fixed dependency, so the retrieval architecture doesn't change when the model does. The deployment posture (in-tenant, no data egress) is the part we'd hold fixed, because it is what lets this system stand as a clean asset in your data room at the time of exit.
The reason to anchor on Bedrock is data control. Accessed through an AWS PrivateLink VPC endpoint, embedding traffic stays on the AWS network within Viridon's chosen region and never crosses the public internet. Bedrock does not use inputs or outputs to train any model, and does not share them with model providers; data is encrypted in transit and at rest, optionally under Viridon's own KMS keys; and the service carries the compliance coverage a Blackstone-backed infrastructure company will be asked about (SOC 1/2/3, ISO 27001 and family, HIPAA-eligible, GDPR, FedRAMP). The practical consequence for the exit story: the embedding index is an owned asset with no third-party data exposure that could reprice a deal.
On model choice, the default for text is Amazon Titan Text Embeddings V2 — optimized for RAG (retrieval-augmented generation: answering from retrieved documents), multilingual, with selectable output dimensions (256 / 512 / 1024) and unit-normalized vectors. For the figure-heavy ISO/RTO and transmission material, we'd use a multimodal embedding model — Amazon Nova Multimodal Embeddings (a single model spanning text, documents, images, video and audio, with cross-modal retrieval) or Titan Multimodal Embeddings G1. This is what makes the multimodal retrieval we flagged in Parsing (1.1) real: a one-line diagram or deliverability chart is embedded into the same space as the surrounding text, so a text query can retrieve the figure directly, not only via its written description. If a privacy requirement or an eval result points elsewhere, Cohere embeddings on Bedrock — or self-hosted open-source models (e.g. BGE, E5, GTE) that run entirely inside the VPC — are also available; the choice is eval-gated and privacy-gated, not locked.
Two engineering invariants govern the layer. First, index and query must use the same model and the same dimension — both sides are wired to the chosen Bedrock model, and the vector index's tensor dimension is set to match. Second, we pick the output dimension to balance accuracy against storage and latency; these models use Matryoshka-style dimensions (embeddings trained so they can be safely truncated to a shorter length), so e.g. dropping from 1024 to 512 keeps ~99% of retrieval accuracy at half the storage. This is the detail behind the Indexing & Retrieval component (KL·A) in Figure 1.
Finally, embeddings are one signal of three, not the whole story. Retrieval blends semantic (vector closeness) with BM25 (lexical) and n-gram (fuzzy) matching, and the semantic weight is automatically zeroed for chunks that have no vector (e.g. number-only table cells) so lexical matching still works. That design is what lets the embedding model be swapped without rebuilding retrieval — the full ranker is covered in Section 1.4.
| Model | Modality | Output dims | Why / when |
|---|---|---|---|
| Titan Text Embeddings V2 (titan-embed-text-v2:0) | Text (8k tokens, 100+ languages) | 256 / 512 / 1024 | Default for prose. RAG-tuned, normalized, flexible dimension for storage/latency control. |
| Amazon Nova Multimodal Embeddings | Text · document · image · video · audio (unified) | 256 / 384 / 1024 / 3072 | Best fit for figure-heavy ISO/RTO + transmission docs; cross-modal retrieval in one space. |
| Titan Multimodal Embeddings G1 (titan-embed-image-v1) | Text + image (shared space) | 256 / 384 / 1024 | Lighter multimodal option for image-by-text / image-by-image search. |
| Cohere (on Bedrock) · open models (BGE / E5 / GTE) | Text (model-dependent) | model-dependent | Eval/privacy-gated alternatives. Open models run fully self-hosted in-VPC. |
A large share of Viridon's signal lives in diagrams — one-lines, substation layouts, deliverability charts. In Parsing (1.1) we make those searchable by indexing a VLM-written description of each figure. A multimodal embedding model goes one step further: it embeds the figure itself into the same vector space as text, so a text query retrieves the image by visual-semantic similarity, not only through its description. Nova Multimodal Embeddings is explicitly designed for exactly this — searching documents that mix infographics and text — which is why it's our lead recommendation where figure retrieval matters.
The embedding dimension must match the vector index. Bedrock's models expose Matryoshka-style dimensions, so we choose a point on the accuracy/cost curve — typically 1024, or 512 where storage and latency matter and ~99% of accuracy is retained — and set the index tensor dimension to match. The one hard rule: the same model and dimension are used at index time and query time. Changing the model later means re-embedding the corpus and updating the index dimension, which is a deliberate, eval-gated migration rather than a silent swap.
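The two rules above can be illustrated with a short sketch. The helper names are hypothetical. The truncation function shows the general Matryoshka idea (keep the leading components, then re-normalize); with Titan Text Embeddings V2 the shorter vector would normally be requested through the model's `dimensions` request field rather than truncated client-side.

```python
import math
from typing import List, Sequence


def truncate_and_normalize(vec: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def assert_same_space(index_model: str, index_dim: int,
                      query_model: str, query_dim: int) -> None:
    """Fail fast if index-time and query-time embeddings would differ."""
    if (index_model, index_dim) != (query_model, query_dim):
        raise ValueError(
            f"index uses {index_model}/{index_dim}, "
            f"query uses {query_model}/{query_dim}")


def float32_bytes(dim: int) -> int:
    """Raw storage per vector, assuming 4-byte floats."""
    return dim * 4
```

Under the 4-byte-float assumption, a 512-dimension vector takes 2,048 bytes against 4,096 for 1,024 dimensions, which is the halving of storage referred to above.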
Bedrock offers asynchronous / batch embedding jobs for indexing large corpora (the 300+ page proposals) and a latency-optimized path for query-time embedding, which maps cleanly onto our parallel indexing design and keeps interactive search fast.
Embeddings feed the semantic leg of a three-signal hybrid ranker (semantic + BM25 + n-gram, default blend ≈ 0.5 / 0.4 / 0.1). Semantic weight auto-zeroes for chunks with no vector so lexical and fuzzy matching still operate. Because retrieval is hybrid and the embedding model is abstracted behind a single encode interface, the model is genuinely swappable — the retrieval architecture, covered in 1.4, is unchanged by the choice.
Diligence question 1.4
Walk us through your retrieval approach, including the tools or platforms used and why. Specifically: is retrieval semantic-only or hybrid (keyword + semantic), and what signals drive ranking?
Retrieval is hybrid, not semantic-only, and it's a multi-stage pipeline rather than a single vector lookup. Every search fuses three signals inside one Vespa query — BM25 (lexical/keyword), semantic (vector nearest-neighbour over the chunk embeddings), and n-gram (character-trigram, typo-tolerant) — and several lightweight LLM steps wrap around that core to handle the messiness of real enterprise questions. Vespa (open-source, self-hosted in Viridon's VPC) is the engine because it does true hybrid retrieval and two-phase ranking in a single query; the query is embedded with the same Bedrock model used at index time (Section 1.3).
Before anything is searched, two things happen. Query distillation rewrites a conversational message into a standalone query — "what about their deliverability?" becomes "Sunrise project deliverability study outcome" using the recent conversation history, with explicit instructions not to over-interpret domain terms. Then multi-query expansion generates three additional phrasings (synonyms, full-forms/abbreviations like "CAISO" ↔ "California ISO", and different angles), so we typically search with four parallel queries. This lifts recall on under-specified questions, which is the common failure mode on a corpus this varied.
Each query then runs hybrid search in Vespa, scoped by organization, optional source selection, and hard metadata filters (the maps from 1.2 — e.g. ISO_RTO = CAISO). This is the Scoped Retrieval component (KL·H) in Figure 1. The lexical legs run over text plus high-value fields (name, filename, folder path); the semantic leg runs an approximate-nearest-neighbour search over the multi-vector chunk embeddings, with a candidate set of ~500 before ranking.
Ranking is where the signals combine, and it's fully configurable through Vespa rank profiles. The default profile blends semantic at 0.5, BM25 at 0.4, and n-gram at 0.1 in a first phase, then re-ranks the top 500 in a global phase with each signal linearly normalized (a master profile offers reciprocal-rank fusion — a standard method for merging several ranked lists — as an alternative). On top of that, field-level weights mean a match in a document's name or filename (weight 300) outranks the same match in body text (150), an empty-field discount keeps a chunk from being penalized for missing optional metadata, and the semantic leg auto-zeroes for chunks that have no embedding (number-only table cells) so lexical matching still surfaces them. A per-word significance step scores each query term HIGH / MEDIUM / LOW (1.0 / 0.5 / 0.01) so filler words don't pollute the lexical match while the full-query embedding stays intact.
After the parallel searches return, an application fusion layer merges and de-duplicates across the four queries and boosts documents that matched multiple phrasings (final = top score + 0.2 × second-best). This is deliberate recall amplification for results that show up under several framings. Precision passes sit on top. Vespa's global-phase rerank is always on. Optionally, a self-hosted open cross-encoder reranker (e.g. BGE-reranker), a slower and more accurate model that re-scores the shortlist for precision, can be added, and an LLM source-picker can read the shortlist and return an ordered set of the most relevant sources before the answer is generated.
Finally, custom ranking is a first-class lever, not an afterthought — the rank profile is tuned per customer. The signal weights, the field weights, significance on/off, linear-norm vs reciprocal-rank fusion, scope defaults, and image inclusion are all configurable per request or per org without code changes, and we tune them against Viridon's eval set (Section 3). A worked retrieval trace on a representative document is below; we'd deliver the full trace on a Viridon document of your choosing as the requested artifact.
A single Vespa request combines all three retrieval modes, plus the per-word significance weights injected into the lexical legs.
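As an illustration of what such a request might look like, the sketch below assembles a Vespa query body in Python. Everything beyond the schema and field names that appear in this document is an assumption: the query tensor name `q`, the rank profile name `default`, and the use of `userQuery()` against a default fieldset for the lexical legs are placeholders, not the actual request.

```python
import json
from typing import Any, Dict, List


def build_hybrid_request(query_text: str,
                         query_vector: List[float],
                         organization_id: str,
                         hits: int = 30,
                         target_hits: int = 500) -> Dict[str, Any]:
    """Build one query body that runs lexical and semantic legs together."""
    # json.dumps yields a double-quoted, escaped string literal usable in YQL.
    org = json.dumps(organization_id)
    yql = (
        "select * from standard_document where "
        f"organization_id contains {org} and ("
        # Semantic leg: approximate nearest-neighbour over chunk embeddings.
        f"({{targetHits:{target_hits}}}nearestNeighbor(long_text_embeddings, q)) "
        # Lexical legs: the user query text matched against indexed fields.
        "or userQuery())"
    )
    return {
        "yql": yql,
        "query": query_text,              # consumed by userQuery()
        "input.query(q)": query_vector,   # query embedding for the ANN leg
        "ranking.profile": "default",     # hypothetical profile name
        "hits": hits,
    }
```

The body would be sent as JSON to the Vespa search endpoint; hard metadata filters and per-term significance annotations would be added to the `where` clause in the same way as the organization scope.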
The semantic leg uses approximate nearest-neighbour over the multi-vector chunk embeddings with a ~500-candidate target; the final hits count returned is tunable (e.g. 30 for the API, up to 500 for the full RAG path).
Two phases. First-phase computes a weighted blend of the three signals; global-phase re-ranks the top 500 with each signal linearly normalized (the master profile swaps in reciprocal-rank fusion when use_reciprocal_rank is set). The defaults on standard_document:
| Signal | Default weight | Mechanism |
|---|---|---|
| Semantic | 0.5 | closeness of query embedding vs the chunk's multi-vector long_text_embeddings |
| BM25 (lexical) | 0.4 | BM25 over long_text_data, name, filename, folder_path, short fields |
| N-gram | 0.1 | nativeRank over the trigram fields — typo / partial-token tolerance |
Inside the BM25 leg, fields are weighted so identity matches win: name 300, filename 300, folder_path 200, body text 150, short fields 150. Two guards matter: an empty-field discount (×0.1) so a chunk isn't penalized for lacking an optional field, and a semantic auto-zero — if a chunk has no embedding (number-only cells), its semantic weight drops to 0 and the blend renormalizes over the lexical signals so the chunk is still retrievable. All weights are inputs, so they're overridable per request.
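The blend and the semantic auto-zero can be written out in a few lines. This sketch assumes the three signals have already been normalized to a common scale; the function name is hypothetical, and the field weights and empty-field discount are left out for brevity.

```python
from typing import Optional


def blended_score(bm25: float,
                  ngram: float,
                  semantic: Optional[float],
                  w_sem: float = 0.5,
                  w_bm25: float = 0.4,
                  w_ngram: float = 0.1) -> float:
    """Weighted blend of the three signals.

    When a chunk has no embedding (semantic is None), the semantic weight
    drops to zero and the remaining weights are renormalized to sum to 1.
    """
    if semantic is None:
        lexical_total = w_bm25 + w_ngram
        return (w_bm25 * bm25 + w_ngram * ngram) / lexical_total
    return w_sem * semantic + w_bm25 * bm25 + w_ngram * ngram
```

Renormalizing matters because a chunk without a vector would otherwise be capped at half the achievable score and could never compete with embedded chunks.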
Per-word weights are injected into the YQL so noisy lexical matches on filler words are suppressed while the full-query embedding is untouched.
The levels are HIGH (1.0) for names/entities, MEDIUM (0.5) for secondary context, and LOW (0.01) for stopwords/generic terms. Significance can be disabled per org where it isn't helping.
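A sketch of how labelled terms could be turned into an annotated lexical clause follows. It assumes the levels are carried by Vespa's per-term `significance` annotation, which takes a floating-point value; the source does not state which annotation is used, and the helper name and the OR-combination of terms are illustrative choices.

```python
import json
from typing import Iterable, Tuple

SIGNIFICANCE = {"HIGH": 1.0, "MEDIUM": 0.5, "LOW": 0.01}


def weighted_terms(field: str, labelled_terms: Iterable[Tuple[str, str]]) -> str:
    """Render (term, level) pairs as an OR of annotated YQL term matches."""
    parts = []
    for term, level in labelled_terms:
        value = SIGNIFICANCE[level]  # raises KeyError on an unknown level
        # json.dumps quotes and escapes the term for use inside YQL.
        parts.append(
            f"{field} contains ({{significance:{value}}}{json.dumps(term)})")
    return " or ".join(parts)


clause = weighted_terms(
    "long_text_data",
    [("Sunrise", "HIGH"), ("project", "LOW"), ("study", "MEDIUM")],
)
```

Only the lexical clause is annotated; the query embedding is computed from the full, unweighted query text, so down-weighting filler words does not change the semantic leg.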
Within each query's result set, signals are linearly normalized and re-scored with the per-hit weights; across the four queries, hits are de-duplicated by ID and a document that matched multiple phrasings is boosted: final = top score + 0.2 × second-highest score. This intentionally amplifies recall for documents that surface under several framings of the same question.
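The cross-query fusion rule is simple enough to sketch directly. The function name and the input shape (one mapping of hit ID to blended score per query) are assumptions made for the example; the formula is the one stated above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def fuse(results_per_query: List[Dict[str, float]],
         second_weight: float = 0.2) -> List[Tuple[str, float]]:
    """De-duplicate hits by ID and boost those found by several queries.

    final = top score + second_weight * second-highest score
    """
    scores = defaultdict(list)
    for hits in results_per_query:
        for doc_id, score in hits.items():
            scores[doc_id].append(score)
    fused = {}
    for doc_id, values in scores.items():
        values.sort(reverse=True)
        bonus = second_weight * values[1] if len(values) > 1 else 0.0
        fused[doc_id] = values[0] + bonus
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

Only the second-best score contributes, so a hit matched by four phrasings gets the same bonus formula as one matched by two; the boost rewards repeated matching without letting the count dominate the ranking.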
User message (with history)
"what was the deliverability outcome for Sunrise in CAISO?"
After distillation + expansion → 4 parallel queries
Sunrise project deliverability study outcome CAISO
Sunrise transmission deliverability assessment California ISO
Sunrise full capacity deliverability status CAISO
Sunrise interconnection deliverability result
Term significance (query 1)
Sunrise · 1.0, project · 0.01, deliverability · 1.0, study · 0.5, outcome · 0.5, CAISO · 1.0
Hard filter applied: ISO_RTO = CAISO.
Candidate chunks — per-signal scores → blended (0.4·bm25 + 0.5·sem + 0.1·ngram)
| Chunk | BM25 | Semantic | N-gram | Blended | Matched queries |
|---|---|---|---|---|---|
| A · Selection report, "Deliverability" §, p.142 | 0.82 | 0.91 | 0.40 | 0.82 | 1, 2, 3 |
| D · file Sunrise_Deliverability_Study.pdf | 0.88 (filename wt 300) | 0.66 | 0.55 | 0.74 | 1, 2, 4 |
| B · Proposal exec summary mention | 0.70 | 0.74 | 0.30 | 0.68 | 1 |
| C · Table cell "FCD status: Conditional" | 0.55 | n/a — no vector | 0.65 | 0.57 | 1, 3 |
Chunk C is a number-only cell: semantic auto-zeroes and the blend renormalizes over the lexical signals — (0.4·0.55 + 0.1·0.65) / 0.5 = 0.57 — so it's still retrieved.
Cross-query fusion (top + 0.2 × second) → final ranking
| Rank | Chunk | Fusion | Final |
|---|---|---|---|
| 1 | A | 0.82 + 0.2·0.80 | 0.98 |
| 2 | D | 0.74 + 0.2·0.71 | 0.88 |
| 3 | C | 0.57 + 0.2·0.54 | 0.68 |
| 4 | B | 0.68 (single match) | 0.68 |
A wins on strong semantic + three-query match; D rises on the filename weight and multi-query match; C — a number-only cell with no embedding — still ranks via lexical signals. This is the shape of the trace artifact we'd deliver on a representative Viridon document.
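The cross-query fusion step can be sketched as follows. The function `fuse` and the per-query dictionaries are illustrative assumptions: the trace reports each chunk's top and second-highest scores but not every per-query score, so only those two values are filled in, assigned to queries consistent with the "Matched queries" column.

```python
from collections import defaultdict

SECOND_HIT_BOOST = 0.2  # final = top score + 0.2 × second-highest score


def fuse(per_query_hits: list[dict[str, float]]) -> list[tuple[str, float]]:
    """Merge hits from several query phrasings, de-duplicated by chunk ID."""
    scores_by_id: dict[str, list[float]] = defaultdict(list)
    for hits in per_query_hits:
        for chunk_id, score in hits.items():
            scores_by_id[chunk_id].append(score)

    fused = []
    for chunk_id, scores in scores_by_id.items():
        ordered = sorted(scores, reverse=True)
        second = ordered[1] if len(ordered) > 1 else 0.0  # single match: no boost
        fused.append((chunk_id, ordered[0] + SECOND_HIT_BOOST * second))
    return sorted(fused, key=lambda item: item[1], reverse=True)


# Blended scores per query; only the values reported in the trace are included.
query_1 = {"A": 0.82, "D": 0.74, "B": 0.68, "C": 0.57}
query_2 = {"A": 0.80, "D": 0.71}
query_3 = {"C": 0.54}

final = {chunk: round(score, 2) for chunk, score in fuse([query_1, query_2, query_3])}
print(final["A"], final["D"], final["C"], final["B"])  # 0.98 0.88 0.68 0.68
```

With the two-decimal inputs shown in the trace, C computes to 0.678 and B to 0.68, so the two sit within rounding of each other and their relative order depends on the unrounded scores.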
| Layer | Tool / platform | Why |
|---|---|---|
| Hybrid index & ranking | Vespa (open source, self-hosted in-VPC) | BM25 + semantic + n-gram fused in one query; two-phase rank profiles; ANN at scale. |
| Query embedding | Bedrock model (per 1.3) | Same model + dimension as index; in-VPC. |
| Query understanding | LLM — Bedrock or self-hosted open model (distill · expand · significance · optional source-pick) | Turns messy conversational input into high-recall, intent-weighted queries. |
| Precision rerank | Open cross-encoder (e.g. BGE-reranker), self-hosted | Optional shortlist reranking for higher precision. |
| Cross-query fusion | Application layer | Merge, dedupe, multi-match recall boost. |
01 — Ingestion · Indexing · Retrieval
Diligence question 1.5
Looking across the pipeline above, which stages do you consider commoditized (off-the-shelf tooling), and where do you provide differentiated value?
Several stages of the pipeline are genuinely commoditized, and we deliberately use strong off-the-shelf components for them rather than reinventing them. Recognizing what's commodity is a feature — it's what keeps the build lean and lets us concentrate engineering where it actually differentiates. In Figure 1 terms, the commodity sits underneath the boxes; the boxes themselves — how they're composed, tuned, and extended — are where the value is.
The commoditized stages are the foundation models and retrieval primitives. Parsing leans on the Unstructured framework, OCR, and the base layout and table-structure models. Embedding is a commodity capability — Titan, Nova and Cohere on Bedrock and self-hosted open models (BGE, E5) are interchangeable, and the vector math is the same everywhere. Retrieval's core operations — BM25, approximate-nearest-neighbour vector search, n-gram matching — are off-the-shelf Vespa primitives, and basic character-based chunking is a solved problem. The platforms themselves (Unstructured, Vespa, Bedrock) are infrastructure we build on, not things we'd ever rebuild. Reinventing any of these would destroy value, not create it.
Our differentiated value is the orchestration around those commodity pieces, and the Viridon-specific layers on top of them. In parsing: strategy routing per document, the dual image-plus-structure table representation, VLM figure-description indexing, multi-page table reconciliation, and the graceful-degradation fallbacks that keep hard documents from failing silently. In chunking: the two-stage structural-then-semantic design, the tuned semantic chunker, and — the biggest piece — the metadata schema with its typed maps, hard/soft filtering, and BM25-by-choice fields. In embedding: the privacy-first in-VPC deployment posture and the model-agnostic abstraction. In retrieval: the multi-stage LLM-wrapped pipeline (distillation, expansion, term significance, cross-query fusion) and the per-customer rank profiles.
And then there's the layer with no off-the-shelf equivalent at all — the bespoke knowledge-layer modules built for Viridon: the "what changes" map (KL·D), public-doc enrichment (KL·I), the RFI Q&A + SME-delegation memory (KL·J), the standard-terms playbook (KL·K), and the onboarding glossary (KL·L), plus the app-specific tools. No general-purpose tool ships these because they encode Viridon's process, not a generic one. This is exactly the gap an off-the-shelf product leaves: it gives you the commodity retrieval box and stops — which is why, on the diagram, a tool like Glean covers only KL·A.
This division is what makes the platform both efficient and bespoke. We don't pay to rebuild commodity foundations, so the budget goes to the integration, tuning, and Viridon-specific modules that are genuinely differentiated — and those are the parts that sit in Viridon's environment as an owned asset.
| Stage | Commoditized (off-the-shelf) | Our differentiated value |
|---|---|---|
| Parsing | Unstructured framework; OCR (Tesseract); base layout (yolox) & table-structure (table-transformer) models; pdfminer; PyMuPDF | Strategy routing per document; dual image + structure table representation; VLM figure-description indexing; multi-page table reconciliation; graceful-degradation fallbacks |
| Chunking | Recursive / fixed-length character splitters (LangChain) | Two-stage structural → semantic → record-cap design; tuned semantic chunker (95th-pct, own embeddings); cost-aware strategy selection; the standard_document metadata schema (typed maps, hard/soft filtering, BM25-by-choice) |
| Embedding | The embedding model itself (Titan / Nova / Cohere on Bedrock, or self-hosted open models) | Privacy-first in-VPC Bedrock deployment; model-agnostic abstraction; index/query dimension-invariant management; multimodal cross-modal wiring |
| Indexing & retrieval | Vespa platform; BM25, ANN/vector search, n-gram primitives | Multi-stage LLM pipeline (distill, expand, significance); hybrid rank profiles (signal + field weights, empty-field discount, semantic auto-zero); cross-query fusion; per-customer rank tuning |
| Knowledge-layer modules | — no off-the-shelf equivalent — | Net-new for Viridon: KL·D "what changes" map · KL·I public-doc enrichment · KL·J RFI Q&A + SME delegation · KL·K standard-terms playbook · KL·L onboarding glossary · bespoke tools (T1, T3, T4, T6, T10, T12) |
01 — Ingestion · Indexing · Retrieval
The two artifacts requested for this section: a retrieval trace on a representative document, and the third-party tools, platforms, and models in the proposed stack.
Artifact A
For a sample query, the chunks retrieved and how they were ranked. The bars decompose each chunk's relevance into the three weighted signals, so it's visible why each ranked where it did. This expands the inline trace from Section 1.4; we'd run the full version on a Viridon document of your choosing.
Sample query (after distillation + 4-way expansion, ISO_RTO = CAISO filter applied)
"what was the deliverability outcome for Sunrise in CAISO?"
Per-query relevance — how each signal contributes to the blend (0.4·BM25 + 0.5·semantic + 0.1·n-gram)
Note the bottom result: a number-only table cell with no embedding — the semantic segment is absent (auto-zeroed) and the blend renormalizes over the lexical signals, so the cell is still retrieved.
Cross-query fusion (final = top score + 0.2 × second-best) → final order
| Rank | Chunk | Fusion | Final |
|---|---|---|---|
| 1 | Selection report — Deliverability § | 0.82 + 0.2·0.80 | 0.98 |
| 2 | Sunrise_Deliverability_Study.pdf | 0.74 + 0.2·0.71 | 0.88 |
| 3 | Table cell — FCD status | 0.57 + 0.2·0.54 | 0.68 |
| 4 | Proposal exec-summary mention | 0.68 (single match) | 0.68 |
The table cell rises above the single-match proposal mention because it matched two query phrasings and earned the fusion boost — recall amplification for results that surface under multiple framings.
Artifact B
The third-party stack the pipeline builds on. Every component is either open-source (self-hosted in Viridon's VPC) or a managed AWS service reached privately over PrivateLink — nothing requires data to leave Viridon's AWS account. Our differentiated value (Section 1.5) is the integration, tuning, and Viridon-specific modules around these — which are our own, not third-party.
| Component | Type | Deployment | Role |
|---|---|---|---|
| Parsing & structuring | | | |
| Unstructured | Library | Open-source · in-VPC | Typed-element parsing across PDF/DOCX/XLSX/HTML |
| yolox | Model | Open-source · in-VPC | Page layout / region detection |
| Table-transformer | Model | Open-source · in-VPC | Table row/column structure recovery |
| Tesseract | Engine | Open-source · in-VPC | OCR for scans & image regions |
| pdfminer | Library | Open-source · in-VPC | Native PDF text-layer extraction (FAST path) |
| PyMuPDF (fitz) | Library | Open-source · in-VPC | Hyperlink annotation extraction |
| pandas | Library | Open-source · in-VPC | Table HTML → per-cell records |
| VLM (vision-language model) | Model | Bedrock (PrivateLink) or self-hosted OSS | Figure & diagram description generation |
| Embedding | | | |
| Amazon Bedrock | Platform | AWS · PrivateLink | In-VPC model access (no-train, KMS, no public egress) |
| Titan Text Embeddings V2 | Model | Bedrock (PrivateLink) | Default text embeddings (256/512/1024-d) |
| Nova Multimodal Embeddings | Model | Bedrock (PrivateLink) | Cross-modal text + figure embeddings (lead for figures) |
| Open embedding models (BGE / E5 / GTE) | Model | Open-source · in-VPC | Self-hosted alternative; fully in-VPC |
| Cohere · Titan Multimodal G1 | Model | Bedrock (PrivateLink) | Alternative managed embeddings (eval/privacy-gated) |
| Index & retrieval | | | |
| Vespa | Platform | Open-source (Apache 2.0) · self-hosted in-VPC | Hybrid BM25 + semantic + n-gram index; two-phase rank profiles; ANN |
| LLM (query understanding) | Model | Bedrock (PrivateLink) or self-hosted OSS | Query distillation, expansion, term significance, optional source-picking |
| Cross-encoder reranker (BGE-reranker) | Model | Open-source · in-VPC | Optional precision reranking of the shortlist |
| Storage & security | | | |
| Amazon S3 | Platform | AWS · in-tenant | Original documents & extracted images |
| AWS KMS | Platform | AWS · in-tenant | Encryption at rest (customer-managed keys optional) |
| AWS PrivateLink | Platform | AWS · in-VPC | Private VPC connectivity; no public-internet egress |
Specific model selections (embedding model, VLM, query/rerank models) are finalized per Viridon's data-privacy requirements and eval results; the architecture treats each as a swappable component, and an open-source self-hosted option exists for every model role.
02 — Orchestration
Diligence question 2.1 · grounded in the demoed workflow
Walk us through how the demoed workflow is structured — the steps, how they connect, and the framework or major dependencies it is built on. More importantly: how do you think about orchestration design, and how do you decide between approaches (routing to specialists, a manager decomposing into parallel sub-tasks, or a single-agent flow)? What about the design you chose seems right for this workflow?
We'll ground this in what we demoed: the Proposal Writing Assistant — Setup workflow, end to end. In Figure 1 that's the Setup workflow in layer 4, marked deterministic, and it's the first phase of a larger AI teammate that runs Setup → Strategy → Drafting → Evaluation. The key design choice: Setup is a deterministic chain, not an agent improvising. The steps are known, ordered, and correctness-critical on a 300-page document — so we use the simplest control flow that does the job reliably, which here means a fixed sequence of tool calls rather than a free-roaming agent.
The demoed sequence: past winning proposals are ingested and broken into typed sections (KL·B, KL·C); a working template is auto-derived and the recurring variables are detected — project_name, sponsor, capacity, key dates (T8, KL·F); a "what changes" map flags the parts of the starting proposal that likely need to change this cycle — learned from how proposals have historically changed — versus the parts safe to keep (KL·D); the new bid's brief and documents are then used to fill the template variables and revise the flagged parts, touching only what likely needs to change so the rest is preserved (T4, KL·D); the ~200-page sponsor selection reports are ingested (KL·B); their win/loss themes are extracted and indexed as a retrievable advice module (KL·E); and finally, in AI editing, when the assistant recommends what to edit and how, it pulls that indexed selection-report advice (and prior sections) via retrieval, comments paragraph-by-paragraph, and proposes edits (T1, T5, drawing on KL·A / KL·E / KL·H). The full step-by-step is below.
On framework and dependencies: the orchestrator (Figure 1, foundation) chains these tools, and both the tools and the knowledge layer are exposed over MCP — the Model Context Protocol, an open standard for connecting AI assistants to tools and data. As the diagram says, the orchestrator can be an MCP client, your Claude / GPT desktop app, or a custom router — that's deliberate, because it decouples the control flow from the tools. Setup runs as a deterministic MCP tool chain; the later phases (Strategy, Drafting, Evaluation) move to orchestrated routing where the path depends on content.
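As a rough illustration of what "deterministic chain" means here, the sketch below runs a fixed list of steps in order and merges each step's output into a shared state. The step functions and their return values are hypothetical stand-ins for the real tools, and plain Python callables are used in place of MCP calls.

```python
from typing import Callable

State = dict
Step = Callable[[State], State]


def run_chain(steps: list[tuple[str, Step]], state: State) -> tuple[State, list[str]]:
    """Run every step once, in the listed order, merging each output into state."""
    trace: list[str] = []
    for name, step in steps:
        state = {**state, **step(state)}
        trace.append(name)
    return state, trace


# Hypothetical stand-ins for the real tools; each returns the keys it adds.
def ingest_proposals(state: State) -> State:
    return {"section_index": ["exec_summary", "deliverability"]}


def derive_template(state: State) -> State:
    return {"template": {"variables": ["project_name", "sponsor"]}}


def build_change_map(state: State) -> State:
    return {"change_map": {"deliverability": "likely_changes"}}


SETUP_STEPS = [
    ("ingest_proposals", ingest_proposals),
    ("derive_template", derive_template),
    ("build_change_map", build_change_map),
]

final_state, step_trace = run_chain(SETUP_STEPS, {})
print(step_trace)
```

The order lives in data (`SETUP_STEPS`), not in a model's judgment, which is what makes the run repeatable and easy to trace.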
The orchestrator also maintains structured memory across the loop, and how we model that memory is itself a design decision. Rather than carrying one ever-growing chat transcript — which conflates different concerns and quickly overflows the context window — we separate session memory into three kinds: conversation (the natural-language turns that capture user intent and constraints, like "don't delete anything"), working state (a structured scratchpad of the IDs and decisions the workflow has accumulated — the source of truth for where we are), and an episodic trace (an ordered log of every tool call, its arguments, and its outcome). Each planner step receives a deliberately bounded package assembled from these — truncated conversation, current working state, the last few trace steps — rather than everything every time. That separation is what keeps the agent within context limits, keeps machine state reliable across steps (the resumability in 2.2), and makes the whole run auditable (the execution trace in 2.3). This is session-scoped memory for the orchestration loop; cross-session institutional memory is a separate concern, covered in Section 4.
Scaling Setup into a full "AI teammate for proposal writing" is, concretely, two things: implementing the knowledge-layer modules, and building a composable tool set — Read paragraph, Create comment, Draft a section, Identify opportunities, Flow updates across the document, Evaluate against criteria, Aggregate attachments, Web research, Grounded Q&A. Those are exactly T1–T8 and T·Q in Figure 1, plus a few document-editor primitives. The teammate itself is a conversational multi-agent loop living in the multiplayer editor: it routes each turn to the right tool or subagent, drafts / comments / researches like a colleague, and proposes changes a human approves.
How we think about orchestration design comes down to a few principles. Use the simplest control flow that works — deterministic where steps are known, agentic only where they aren't. MCP as the interface, so tools are swappable and the orchestrator is replaceable. Human-in-the-loop at the right gates — the AI proposes, the human disposes; no risky or irreversible action (editing the live document, flowing a change across 300 pages) happens without explicit approval. And guardrails, governance and security throughout: RBAC-scoped retrieval (KL·H), the open-source / in-VPC deployment from our deployment principle, per-tool permissions, and a full execution trace of every step (Section 2.3). We build for performance (parallelize what's parallelizable) and extensibility (tools compose and get reused across mini-apps).
We decide between orchestration approaches by the shape of the work, and the patterns nest rather than compete. A deterministic chain for known, ordered, correctness-critical steps (Setup). A router to specialists when requests are heterogeneous and each needs a different capability (the teammate picking T1 vs T2 vs T7 per turn). A manager that decomposes into parallel sub-tasks when the work splits into independent units (evaluating all sections at once, multi-query retrieval, flowing one change across 300 pages). And a multi-agent loop for interactive, open-ended work with a human present (live drafting). For proposal writing this combination is right because Setup demands reliability, drafting is inherently interactive, evaluation is embarrassingly parallel — and, critically, every specialist tool we build composes: the comment, Q&A, research and retrieval tools built here are reused by the RFI drafter, the legal screener, and the onboarding assistant. That reuse is the whole shared-foundation thesis of Figure 1, and it's why we optimize for composition.
The teammate is a single conversational agent that calls a set of small, well-scoped tools (the same ones in Figure 1, plus editor primitives). Each is built once and reused across apps — that reuse is the point.
| Tool | What it does | Knowledge layer it draws on | Reused by |
|---|---|---|---|
| T·Q · Grounded Q&A | Cited answers over the knowledge layer | KL·A, G, H | every mini-app |
| T1 · Read & comment | Suggests improvements vs. selection-report themes | KL·C, E, G | RFI, legal |
| T2 · Draft a section | From template + structured prior wins | KL·A, B, C, F | RFI drafter |
| T3 · Identify opportunities | Where to differentiate this bid | KL·A, E, G | — |
| T4 · Flow updates | Propagate a change across 300+ pages | KL·B, D | evaluation |
| T5 · Evaluate against criteria | Score a draft vs. what wins | KL·A, E, G, H | — |
| T6 · Aggregate attachments | SME reports into one narrative voice | KL·A, B, H | RFI drafter |
| T7 · Web research & scrape | Live external + public-doc context | KL·B, I | ISO/RTO, all |
| T8 · Build a template | Auto-derive from past proposals | KL·C, D, F | — |
| Editor primitives | Read paragraph · Create comment · Apply approved edit | — | drafting surface |
A single chat transcript doesn't scale and conflates three different concerns. We model the orchestration loop's memory (planner → tool calls → synthesizer) as three separate types on a per-session SessionMemory, so each stays clean and bounded.
| Memory type | What it holds | Answers | How it's used |
|---|---|---|---|
| Conversation | Natural-language user/assistant turns | "What did they ask for?" — intent, constraints, tone | Fed to the planner, truncated by turn count + character budget |
| Working state | Structured scratchpad — IDs & decisions (list_id, task_id, last_search_query) | "Where are we right now?" | Patched / replaced explicitly; in every (size-limited) planner package; the source of truth across steps |
| Episodic trace | Ordered tool-call log — name, redacted args, success/failure, result summary, timestamps | "What happened?" | Written on each tool call; recent summaries go to the planner; powers audit, debug, replay & synthesis |
Two cross-cutting mechanisms keep this within budget (built today):
The synthesizer then produces the final answer by reading the goal, the trace, and the observations — not raw MCP blobs — which is what lets the loop stay reliable and within limits while still answering well.
The core idea: separating intent (conversation), state (working state) and history (episodic trace) lets the orchestrator stay within context limits, keep machine state reliable, and still synthesize good answers — the opposite of stuffing one growing transcript into every call.
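A minimal sketch of the three memory types and the bounded planner package is shown below. The class and field names follow the description above, but the method name `planner_package`, the helper `tail`, and the default limits are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    ok: bool
    summary: str


def tail(items: list, n: int) -> list:
    """Last n items; an n of 0 yields an empty list."""
    return items[max(len(items) - n, 0):]


@dataclass
class SessionMemory:
    conversation: list[str] = field(default_factory=list)         # natural-language turns
    working_state: dict[str, str] = field(default_factory=dict)   # IDs and decisions
    trace: list[ToolCall] = field(default_factory=list)           # ordered tool-call log

    def planner_package(self, max_turns: int = 4, max_chars: int = 500,
                        max_trace: int = 3) -> dict:
        """Assemble the bounded view handed to one planner step."""
        turns = tail(self.conversation, max_turns)
        # Enforce the character budget by dropping the oldest turns first.
        while turns and sum(len(t) for t in turns) > max_chars:
            turns = turns[1:]
        return {
            "conversation": turns,
            "working_state": dict(self.working_state),
            "recent_trace": [call.summary for call in tail(self.trace, max_trace)],
        }


memory = SessionMemory()
memory.conversation = [f"turn {i}" for i in range(6)]
memory.working_state["task_id"] = "t-1"
memory.trace = [ToolCall("search", True, f"search #{i}") for i in range(5)]
package = memory.planner_package()
print(package["conversation"], package["recent_trace"])
```

The full conversation and trace stay in `SessionMemory` for audit and replay; only the package is sent to the planner.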
| Pattern | Use when | In proposal writing |
|---|---|---|
| Deterministic chain | Steps are known, ordered, and correctness-critical | The Setup phase — fixed sequence, fully traceable |
| Router to specialists | Requests are heterogeneous; each needs a different capability | The drafting teammate routing each turn to T1 / T2 / T3 / T7 |
| Manager → parallel sub-tasks | The task splits into independent units that aggregate | Evaluating every section at once; multi-query retrieval; flowing one change across 300 pages |
| Multi-agent loop | Interactive, open-ended, human present | Live drafting in the multiplayer editor |
02 — Orchestration
Diligence question 2.2
How do you think about reliability in a multi-step workflow, and what tools or techniques do you use to achieve it? Specifically: what happens when a step fails or returns low-quality output, how do you validate output between steps, and how do you prevent the workflow from drifting off course?
Our first reliability technique is to minimize the surface area for failure: the most reliable step is a deterministic one, which is why Setup is a fixed chain (2.1) rather than an agent improvising. For the parts that genuinely need an LLM, we treat it as a fallible component and wrap it in four things — validated structured outputs, checkpointed state, risk-classified human approval, and an in-loop reflection step. Together those cover the three failure questions: what happens when a step fails, how we validate between steps, and how we keep the workflow from drifting.
When a step fails or returns low-quality output, we separate two cases. A hard failure (error, timeout, tool exception) triggers a bounded retry with backoff, then a fallback path where one exists. This is the same pattern as the parsing fallbacks in 1.1, where a failed table inference degrades to image + description rather than crashing. If the step still fails, we resume from the last checkpoint, and once retries are exhausted we escalate to a human rather than proceed on a broken step. A soft failure (the step runs but the output is malformed, low-quality, or unsupported) is caught by the validation and reflection gates below, then repaired, retried, or escalated. The principle throughout: never silently pass a bad result downstream.
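The hard-failure path (bounded retry with backoff, then fallback, then escalation) might look roughly like the sketch below. The names `run_with_recovery` and `EscalateToHuman`, the attempt count, and the delay values are illustrative assumptions; checkpoint resume is left out to keep the example short.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class EscalateToHuman(Exception):
    """Raised when retries and the fallback are both exhausted."""


def run_with_recovery(step: Callable[[], T],
                      fallback: Optional[Callable[[], T]] = None,
                      max_attempts: int = 3,
                      base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return step()
        except Exception as exc:  # hard failure: error, timeout, tool exception
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff
    if fallback is not None:
        try:
            return fallback()  # degraded but usable result
        except Exception as exc:
            last_error = exc
    raise EscalateToHuman(f"step failed after {max_attempts} attempts") from last_error
```

Passing `sleep` as a parameter is only a convenience for testing without real delays. The important property is that the function either returns a result or raises; it never returns a silent placeholder.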
We validate output between steps by making every step emit a structured, typed output against an explicit schema — so we know exactly what the model produced and can check it programmatically before the next step consumes it. Schema validation deterministically catches malformed or hallucinated structure (missing fields, wrong types, out-of-range values); a grounding check verifies that claims which should be supported by retrieved sources actually are (the anti-hallucination contract that feeds our eval harness in Section 3). Each step has an explicit input/output contract, so a downstream step never has to guess what it received.
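A small standard-library sketch of the two checks follows. The `SectionScore` shape, its field names, and the 0 to 10 range are hypothetical examples of a step contract, and the grounding check is simplified to "every cited chunk was actually retrieved", which is weaker than verifying that the claim itself is supported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SectionScore:
    section_id: str
    score: int                   # expected range: 0 to 10
    cited_chunk_ids: list[str]


def validate_section_score(raw: dict) -> SectionScore:
    """Check a model's raw output against the contract before anything consumes it."""
    errors = []
    section_id = raw.get("section_id")
    if not isinstance(section_id, str) or not section_id:
        errors.append("section_id: missing or not a string")
    score = raw.get("score")
    if not isinstance(score, int) or isinstance(score, bool):
        errors.append("score: missing or not an integer")
    elif not 0 <= score <= 10:
        errors.append("score: out of range 0-10")
    cited = raw.get("cited_chunk_ids")
    if not isinstance(cited, list) or not all(isinstance(c, str) for c in cited):
        errors.append("cited_chunk_ids: missing or not a list of strings")
    if errors:
        raise ValueError("; ".join(errors))
    return SectionScore(section_id, score, list(cited))


def is_grounded(output: SectionScore, retrieved_ids: set[str]) -> bool:
    """Simplified grounding check: at least one citation, all of them retrieved."""
    return bool(output.cited_chunk_ids) and set(output.cited_chunk_ids) <= retrieved_ids
```

A `ValueError` here maps to the "malformed output" row of the failure table, and a failed `is_grounded` maps to the "unsupported output" row.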
State management makes failure recoverable. Each step's inputs and outputs are checkpointed and steps are designed to be idempotent (safe to re-run without applying anything twice), so on any failure we know exactly where we left off and resume from the last good checkpoint — we don't re-parse a 300-page proposal or re-embed the corpus because a later step timed out. This matters most for the proposal workflow specifically, which runs over months, not minutes.
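The resume-from-checkpoint behavior can be illustrated with a small sketch that persists completed step names and state to a JSON file. The function `run_checkpointed`, the file layout, and the two example steps are assumptions for illustration; a production version would write atomically and to durable storage.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def run_checkpointed(steps: list[tuple[str, Callable[[dict], dict]]],
                     checkpoint_path: Path) -> dict:
    """Run steps in order, skipping any that a previous run already completed."""
    if checkpoint_path.exists():
        saved = json.loads(checkpoint_path.read_text())
    else:
        saved = {"done": [], "state": {}}
    for name, step in steps:
        if name in saved["done"]:
            continue  # completed earlier; do not re-run
        saved["state"] = {**saved["state"], **step(saved["state"])}
        saved["done"].append(name)
        checkpoint_path.write_text(json.dumps(saved))  # checkpoint after each step
    return saved["state"]


calls = []


def parse(state: dict) -> dict:
    calls.append("parse")
    return {"sections": 1840}


def embed(state: dict) -> dict:
    calls.append("embed")
    if calls.count("embed") == 1:
        raise TimeoutError("simulated timeout on the first attempt")
    return {"embedded": True}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "run.json"
    steps = [("parse", parse), ("embed", embed)]
    try:
        run_checkpointed(steps, path)
    except TimeoutError:
        pass
    final_state = run_checkpointed(steps, path)  # resumes after "parse"

print(calls)  # ['parse', 'embed', 'embed']
```

The second run skips `parse` entirely, which is the point: a late timeout does not force the expensive early steps to repeat.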
We prevent drift with a reflection step in the loop — a pattern we've already implemented in our agentic orchestration work. After a step (or on a cadence), a critic re-checks the work against the original objective and constraints, catches drift, and either re-anchors, re-plans, or halts. The goal and constraints are carried through every step so the agent never loses the thread, scoped tools limit how far it can wander, and explicit termination criteria plus bounded autonomy (caps on tool calls, recursion, and cost) stop a runaway loop.
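Bounded autonomy can be as simple as a budget object that every tool call must charge before it runs. The sketch below is an assumption-laden illustration: the class names, the two limits, and the shape of `actions` are hypothetical, and recursion depth is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class BudgetExceeded(Exception):
    pass


@dataclass
class Budget:
    max_tool_calls: int
    max_cost_usd: float
    tool_calls: int = 0
    cost_usd: float = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record one tool call and stop the loop if either cap is exceeded."""
        self.tool_calls += 1
        self.cost_usd += cost_usd
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded(f"tool-call cap of {self.max_tool_calls} exceeded")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost cap of ${self.max_cost_usd:.2f} exceeded")


def run_loop(actions: list[tuple[float, Callable[[], str]]],
             budget: Budget) -> tuple[list[str], Optional[str]]:
    """Run actions until done or over budget; return partial results and the reason."""
    results: list[str] = []
    for cost, action in actions:
        try:
            budget.charge(cost)
        except BudgetExceeded as exc:
            return results, str(exc)  # hard stop: surface partial result + reason
        results.append(action())
    return results, None
```

Charging before the action runs means the call that would break the cap is never executed.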
Finally, the human gate is itself a reliability control. We classify actions by risk and reversibility: read-only and reversible-in-draft actions (search, comment, propose an edit) run autonomously, while consequential or irreversible actions — applying an edit to the live document, flowing a change across 300 pages, anything external — require explicit human approval. And because applied edits are versioned (the document carries a version field), even an approved change is reversible. All of this sits on a full execution trace (Section 2.3), because you can't make reliable what you can't see.
| Failure mode | How we detect it | Response |
|---|---|---|
| Hard failure (error / timeout) | Exception, timeout, tool error | Bounded retry with backoff → fallback path → resume from last checkpoint → escalate if exhausted |
| Malformed output | Schema validation fails | Repair / re-prompt → bounded retries → escalate |
| Low-quality / unsupported output | Critic + grounding check fail | Reflection re-do with feedback → escalate to human if it doesn't converge |
| Drift from the goal | Reflection step vs. objective | Re-anchor / re-plan; halt if it can't get back on track |
| Runaway loop | Step / cost / recursion budget exceeded | Hard stop → surface the partial result and the reason |
| Action class | Examples | Autonomy |
|---|---|---|
| Read-only / retrieval | Search, read a paragraph, grounded Q&A | Autonomous |
| Generative · reversible in draft | Draft a section, propose an edit, generate a template, leave a comment | Autonomous (proposed, not applied) |
| Consequential / irreversible | Apply an edit to the live document, flow a change across the proposal, any external action (send, export) | Human approval required |
Applied edits are versioned, so an approved change can still be rolled back — reversibility is a backstop even past the approval gate.
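The action classes in the table above translate naturally into a lookup that gates execution. In this sketch the action names are hypothetical identifiers for the examples in the table, and defaulting unknown actions to the strictest class is an assumption added for safety, not something the table specifies.

```python
from enum import Enum


class ActionClass(Enum):
    READ_ONLY = "read_only"
    REVERSIBLE_DRAFT = "reversible_draft"
    CONSEQUENTIAL = "consequential"


# Hypothetical action names, classified per the table above.
ACTION_CLASSES = {
    "search": ActionClass.READ_ONLY,
    "read_paragraph": ActionClass.READ_ONLY,
    "grounded_qa": ActionClass.READ_ONLY,
    "draft_section": ActionClass.REVERSIBLE_DRAFT,
    "propose_edit": ActionClass.REVERSIBLE_DRAFT,
    "create_comment": ActionClass.REVERSIBLE_DRAFT,
    "apply_edit": ActionClass.CONSEQUENTIAL,
    "flow_change": ActionClass.CONSEQUENTIAL,
    "export": ActionClass.CONSEQUENTIAL,
}


def requires_approval(action: str) -> bool:
    """True when a human must approve before the action runs."""
    # Assumption: an unclassified action is treated as consequential.
    action_class = ACTION_CLASSES.get(action, ActionClass.CONSEQUENTIAL)
    return action_class is ActionClass.CONSEQUENTIAL


print(requires_approval("propose_edit"), requires_approval("apply_edit"))  # False True
```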
Reliability isn't a single feature — it's the combination of a deterministic backbone (2.1), validated structured contracts between steps, durable state, an in-loop critic, risk-gated approval, and full observability (2.3). The same eval harness that measures quality (Section 3) doubles as regression protection: when a prompt or model changes, it confirms existing behavior didn't break before the change ships.
02 — Orchestration
Diligence question 2.3
How do you think about observability, and what tooling do you use for it? Specifically: can we and our technical advisor see the full execution trace of a workflow — what each step retrieved, decided, and passed downstream?
Yes — fully, top to bottom. The shift that makes this real is treating a workflow run as spans, not log lines: a run is one trace, each step is a span, and nested tool calls and sub-agents are child spans. That tree is exactly what answers "what each step retrieved, decided, and passed downstream," because those relationships are a hierarchy, not a flat stream. We build it on OpenTelemetry (the open industry standard for tracing software) with LLM-specific semantic conventions, and the backend — Langfuse or Phoenix — is open-source and self-hosted, so the trace store lives inside Viridon's VPC alongside everything else. No traces of Viridon's proposals go to a third-party SaaS.
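To make the "spans, not log lines" idea concrete, here is a dependency-free sketch of a span tree. It is a stand-in for illustration only and deliberately does not use the OpenTelemetry API; the `Tracer` and `Span` classes, the span names, and the `chunks` attribute are all hypothetical.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    attributes: dict = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)
    start: float = 0.0
    end: float = 0.0


class Tracer:
    """Builds one tree per run: the outermost span is the root."""

    def __init__(self) -> None:
        self.root = None
        self._stack: list[Span] = []

    @contextmanager
    def span(self, name: str, **attributes):
        current = Span(name, attributes, start=time.monotonic())
        if self._stack:
            self._stack[-1].children.append(current)  # nest under the open span
        else:
            self.root = current
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = time.monotonic()
            self._stack.pop()


def render(span: Span, depth: int = 0) -> list[str]:
    """Indented outline of the tree, one line per span."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


tracer = Tracer()
with tracer.span("setup_run"):
    with tracer.span("ai_editing"):
        with tracer.span("retrieval", chunks=4):
            pass
        with tracer.span("comment"):
            pass

print("\n".join(render(tracer.root)))
```

The parent-child links are what a flat log cannot express: the retrieval span sits under the editing step that triggered it, so "what this step retrieved" is answered by reading its children.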
We design for two audiences with two surfaces over the same captured data. Your technical advisor gets the full span tree and audit logs — every tool call, every retrieval, every decision, validation result and hand-off, with token, cost and latency per span and the exact prompt-template and model version that produced each output, plus replay. Erin and end users get explainability instead of raw internals: a "why did it suggest this?" view that traces any AI recommendation back through the advice it used to the source page. Same data underneath; the advisor sees the engine, the user sees the reason.
Mapping directly to your three words: retrieved is the retrieval trace from Section 1.4 captured on each search span — the distilled and expanded queries, the candidate set, the per-signal scores, and what was filtered out, not just what came back. Decided is the planner's tool choice and the alternatives it weighed, the working-state diff for that step, and the validation/reflection verdict. Passed downstream is the typed output and the working-state delta — the bounded planner package handed to the next step.
The substrate already exists. The episodic trace, working state, and bounded planner package from our memory model (2.1) already record what each step did, what changed, and what was passed on. Observability is mostly turning that into spans and a UI — productionizing what the orchestration loop already captures, not bolting on a parallel logging system.
The feature that matters most for proposal writing is provenance lineage: for each AI claim or proposed edit, we record which retrieved chunk supported it and link that chunk back to its source page. So a recommendation traces cleanly as edit → selection-report advice (KL·E) → chunk on p.142 of the 2023 report. End-to-end answer-to-source lineage is what makes the user-facing explainability trustworthy — and it doubles as a clean artifact for a future data room.
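The lineage chain described above can be modeled as three linked records. This is a sketch under stated assumptions: the class names, the ID values, and the `lineage` helper are hypothetical, and only the page number, the report, and the KL·E theme come from the example in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    document: str
    page: int


@dataclass(frozen=True)
class Advice:
    advice_id: str
    theme: str
    chunk: Chunk          # the retrieved chunk the advice was mined from


@dataclass(frozen=True)
class ProposedEdit:
    edit_id: str
    section: str
    advice: Advice        # the advice that motivated the edit


def lineage(edit: ProposedEdit) -> str:
    """Walk edit -> advice -> chunk -> source page."""
    chunk = edit.advice.chunk
    return f"{edit.edit_id} -> {edit.advice.advice_id} -> {chunk.document} p.{chunk.page}"


chunk = Chunk("chunk_A", "2023 selection report", 142)
advice = Advice("advice_17", "deliverability evidence under-stated vs. winning bids", chunk)
edit = ProposedEdit("edit_1", "3.2", advice)
print(lineage(edit))  # edit_1 -> advice_17 -> 2023 selection report p.142
```

Because each record holds a reference to the one before it, the "why did it suggest this?" view is a simple traversal rather than a search through logs.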
Two things worth flagging for a technical reviewer. Logged "reasoning" is the model's stated rationale — an honest record of what it reported, not a proof of the true cause — so "decided" means the recorded decision plus its stated reasoning. And exact reproducibility is bounded by model nondeterminism: we pin prompt-template and model versions and set seeds where the provider allows, so replaying a trace is fully reliable, but re-generating a hosted model's identical output is not guaranteed. Because traces contain document content, redaction (arguments are already redacted), access control on the trace UI, retention limits, and sampling are part of the design, not afterthoughts.
| Captured | Detail |
|---|---|
| Identity & timing | Span name, parent, start/end, duration |
| Inputs | Tool name, redacted args, the bounded package the step received |
| Retrieval | Distilled + expanded queries, candidate set, per-signal scores, what was dropped, rank profile used |
| Decision | Planner tool choice + alternatives weighed; validation / reflection verdict |
| Output | Typed result + working-state delta passed downstream |
| Cost & version | Tokens, cost, latency; prompt-template + model version |
| Grounding | Which sources supported which claims — the provenance link |
| Audience | Surface | What they see |
|---|---|---|
| You + technical advisor | Full span tree + audit logs (Langfuse / Phoenix, self-hosted) | Every step's retrieval, decision, validation and hand-off; cost / latency; prompt + model versions; replay |
| Erin / end users | In-product explainability view | "Why did it suggest this?" — recommendation → advice used → source page; no raw internals |
Traces aren't only for debugging. We sample production traces into the eval set (Section 3) and monitor online quality signals — grounding-failure rate, retrieval-hit-rate, drift / halt events — not just latency and cost. That's the difference between "we have logs" and "we know it's working", and it's what turns the regression story in 2.2 into a live signal.
Artifact
The full run as a span tree: for each step, what it retrieved, decided, and passed downstream. This is a representative render of what the advisor sees in the self-hosted trace UI, where each step expands on click.
| Retrieved | Decided | Passed downstream |
|---|---|---|
| 12 past proposals + 3 selection reports from SharePoint (source scope applied) | HI_RES parse strategy (docs < 999 pp, image indexing on); table inference enabled | 1,840 typed sections + 312 RFP questions → working_state.section_index |
| 12 prior winning proposals, ranked by selection outcome (KL·F) | Template derived from highest-scoring wins; 47 recurring variables detected: project_name, sponsor, capacity_mw, cod_date … | Template + variable manifest → working_state.template |
| The starting proposal + change patterns learned across historical proposals | Flagged the parts of the starting proposal that likely need to change this cycle (e.g. project specifics, deliverability sections) vs. the parts safe to keep | "What likely needs to change" map → working_state.change_map |
| New project brief + 4 supporting documents | Revise only the flagged parts and fill the template variables; the rest left untouched | Filled draft v0 → working_state.draft_id (version 1) |
| 3 sponsor selection reports (~200 pp each) | HI_RES parse; table + figure extraction | Parsed selection-report records → index |
| Parsed selection-report records | Mined 64 win/loss advice entries; indexed as a retrievable advice module | Advice module → KL·E (now retrievable by the editor) |
| Selection-report advice + prior winning sections; top chunk: 2023 selection report, p.142 (final score 0.98). See child span for the full retrieval trace. | Propose 1 comment + 1 edit to §3.2, strengthening deliverability evidence, grounded in KL·E theme "deliverability evidence under-stated vs. winning bids" | Proposed change-set {comment_1, edit_1} → working_state.pending_changes |
| Query "deliverability outcome for Sunrise in CAISO" → 4 expanded queries → 4 chunks. Ranked: selection-report §Deliverability p.142 (0.98) · Sunrise_Deliverability_Study.pdf (0.88) · FCD-status table cell (0.68) · exec-summary mention (0.68). Full per-signal breakdown in Section 1 · Artifact A. | | Top 4 ranked chunks → comment + evaluate tools |
| | Flag §3.2 paragraph; rationale: selection-report advice (KL·E) says deliverability outcomes win on quantified evidence; current draft asserts without figures | 1 proposed comment, with provenance link → p.142 |
| | §3.2 scores 6/10 against winning bids; gap = quantified deliverability outcome | Score + gap note attached to the change-set |
| | Grounding check passed: the comment cites a real retrieved source (p.142). On-track vs. goal; no drift. Proceed to human gate. | |
| Pending change-set {comment_1, edit_1} with provenance | Surface to Erin for approve / deny; no autonomous application of edits | Awaiting human; nothing applied to the live document yet |
03 — Evaluation
Diligence question 3.1
What do you measure to know the system is working, and how do you define each metric? Specifically: how do you treat the distinct failure types — the wrong source being retrieved, an output claim not supported by the retrieved source (hallucination), and low output quality?
We measure each stage of the pipeline separately, on purpose — because a bad final answer is a symptom, and what makes an eval useful is being able to say why it was bad. The three failure types you name aren't interchangeable: they live in different stages and have different fixes, so we attribute every failure to a stage rather than scoring only the end result. That localization is the whole design of the eval.
The wrong source retrieved is a retrieval-stage failure, scored against a labeled set of which chunks are relevant per query. The headline metric is recall@k / hit-rate (did a relevant chunk make the top-k at all), because if it wasn't retrieved, the generator simply can't use it. Around that we track context recall (did we get all the chunks needed) versus context precision (are the relevant ones ranked above the noise), plus MRR for how high the first relevant chunk landed and nDCG for the rank-weighted position of all relevant chunks. We split a recall miss (relevant chunk absent, usually fatal) from a precision miss (irrelevant chunk ranked high, which dilutes context) because they have different fixes. For Viridon we add filter correctness (did a scope like ISO_RTO = CAISO actually apply), because the dangerous failure here is cross-project contamination, which scoped retrieval (KL·H) exists to prevent.
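As a rough illustration of the retrieval metrics named above, the sketch below computes hit@k, context recall, reciprocal rank (averaged over queries this gives MRR), and nDCG with binary relevance. The function names and the chunk ids are hypothetical, and it assumes each query has a labeled set of relevant chunk ids; it is not the production harness.

```python
import math
from typing import Dict, Sequence, Set


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> bool:
    """True if at least one labeled-relevant chunk is in the top-k."""
    return any(chunk in relevant for chunk in ranked[:k])


def context_recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the labeled-relevant chunks that made the top-k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / position of the first relevant chunk, or 0.0 if none was retrieved."""
    for position, chunk in enumerate(ranked, start=1):
        if chunk in relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Rank-weighted gain over all relevant chunks (binary relevance)."""
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, chunk in enumerate(ranked[:k], start=1)
        if chunk in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 1) for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def hit_rate(runs: Dict[str, Sequence[str]], labels: Dict[str, Set[str]], k: int) -> float:
    """Fraction of labeled queries with at least one relevant chunk in the top-k."""
    return sum(hit_at_k(runs[q], labels[q], k) for q in labels) / len(labels)


# Hypothetical single query: two relevant chunks, ranked 2nd and 4th.
ranked = ["c7", "c2", "c9", "c4"]
relevant = {"c2", "c4"}
print(hit_at_k(ranked, relevant, 3), context_recall_at_k(ranked, relevant, 3))
print(reciprocal_rank(ranked, relevant), round(ndcg_at_k(ranked, relevant, 4), 3))
```

Keeping hit@k and context recall as separate functions mirrors the recall-miss versus precision-miss split: a query can score a hit while still missing half the chunks it needed.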
An unsupported claim — a hallucination — is a generation-stage failure, measured as faithfulness / groundedness. The key distinction: we don't score "is it true in the world," we score "is every claim entailed by what we retrieved" — the right contract for RAG, because we control the sources. Mechanically we decompose the output into atomic claims and check each against the retrieved context (supported / unsupported / contradicted), plus citation accuracy — does the cited source actually support the claim, which the provenance lineage from 2.3 makes directly checkable. One subtlety: a hallucination is often a retrieval failure in disguise. If context-recall was low, the model filled the gap — so we only call it a generation bug when recall was high and it still invented something. That's why we measure retrieval separately rather than scoring the final answer alone.
Low output quality is the fuzziest and most domain-specific. The generic dimensions are answer relevance and completeness, instruction-following (did it respect "don't touch the boilerplate", length, tone), coherence, and format / schema validity (already enforced by the structured-output validation in 2.2). But the differentiated quality metric for proposal writing is "winning-ness" — scoring whether an edit makes a section more like the sections that have won, built from the selection-report advice in KL·E. That's a quality rubric grounded in what Viridon actually cares about, which no off-the-shelf eval framework gives you.
Around those three sits a broader taxonomy. Because this is agentic, not only RAG (Section 2), we also measure task success / completion, tool-selection accuracy, and trajectory correctness (did it reach the answer for the right reasons, not by luck), alongside operational signals: drift / escalation rate (2.2), latency, and cost. And the single best real-world quality signal for the assistant is human edit-distance / acceptance: how much Erin changes a proposed edit before accepting it. Low edit distance indicates high quality, measured on live usage with no labeling. The full taxonomy is in the detail below.
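One simple way to turn "how much Erin changes a proposed edit" into a number is a normalized, word-level difference between the proposed and the accepted text. The sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in; it is a similarity ratio rather than a true Levenshtein distance, and the function name and sample strings are illustrative assumptions.

```python
from difflib import SequenceMatcher


def edit_distance_ratio(proposed: str, accepted: str) -> float:
    """0.0 means accepted verbatim; 1.0 means no words in common.

    Word-level comparison, so reflowed whitespace does not count as an edit.
    """
    proposed_words = proposed.split()
    accepted_words = accepted.split()
    similarity = SequenceMatcher(None, proposed_words, accepted_words).ratio()
    return 1.0 - similarity


# Hypothetical example: the reviewer inserted one word before accepting.
proposed = "lead with the outcome"
accepted = "lead with the quantified outcome"
print(round(edit_distance_ratio(proposed, accepted), 3))
```

Averaged over accepted edits, a falling ratio would suggest suggestions are landing closer to what the reviewer wants.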
Two limitations to be upfront about. Several of these metrics use an LLM as judge, which is itself fallible — so we calibrate it against human labels, reserve it for scale, and keep humans on the high-stakes and subjective calls. And all of it is only as good as the ground truth it's scored against, which is the next question (3.2). Tooling stays in-VPC per the deployment principle: RAGAS-style metric computation (RAGAS is an open-source toolkit for evaluating RAG systems) plus a self-hosted judge model on Bedrock, fed by the execution traces from 2.3.
Each example is checked at three points; the first ✕ is where it breaks, which points to a specific fix. Downstream checks are moot once an upstream stage fails.
| Example query | ① Right source? | ② Faithful to source? | ③ Quality output? | Where it breaks → fix |
|---|---|---|---|---|
| "Sunrise deliverability outcome (CAISO)" | ✓ | ✓ | ✓ | Pass |
| "Interconnection cost, Project Atlas" | ✕ recall miss | n/a | n/a | Retrieval: relevant chunk absent → tune rank profile / embeddings |
| "Deliverability evidence requirements" | ✓ | ✕ unsupported | n/a | Generation: recall was high, claim invented → tighten grounding / prompt |
| "Summarize selection feedback" | ✓ | ✓ | ✕ format | Quality: verbose, ignored format → tune prompt / schema |
| "Costs for Project X" (scoped) | ✕ wrong filter | n/a | n/a | Scoping: pulled another project → fix filter (KL·H) |
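The "first ✕ wins" rule in the table above can be sketched as a small attribution function: check the stages in pipeline order and report the first one that fails. The class and field names are hypothetical, and the faithfulness score is simply the fraction of atomic claims marked as supported, matching the definition in this section.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StageChecks:
    filter_ok: bool                    # did the scope filter apply correctly?
    recall_ok: bool                    # did a relevant chunk reach the top-k?
    faithful: Optional[bool] = None    # None when the check is moot
    quality_ok: Optional[bool] = None  # None when the check is moot


def first_failure(checks: StageChecks) -> str:
    """Attribute a bad answer to the earliest failing stage."""
    if not checks.filter_ok:
        return "scoping"
    if not checks.recall_ok:
        return "retrieval"
    if checks.faithful is False:
        return "generation"
    if checks.quality_ok is False:
        return "quality"
    return "pass"


def faithfulness(claim_verdicts: List[str]) -> float:
    """Fraction of atomic claims judged 'supported' by the retrieved context."""
    if not claim_verdicts:
        return 1.0  # assumption: an output with no claims has nothing unsupported
    return claim_verdicts.count("supported") / len(claim_verdicts)


# Recall was fine but a claim was invented, so this is a generation failure.
print(first_failure(StageChecks(filter_ok=True, recall_ok=True, faithful=False)))
print(faithfulness(["supported", "supported", "unsupported", "contradicted"]))
```

Because retrieval is checked before generation, an unsupported claim is only labeled a generation bug when the needed context was actually retrieved.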
| Tier | Metric | Definition | What it targets |
|---|---|---|---|
| Retrieval: "did we find the right thing?" | | | |
| Retrieval | Recall@k / hit-rate | Fraction of queries where a relevant chunk is in the top-k | Wrong source |
| Retrieval | Context recall | Did we retrieve all the chunks needed to answer | Wrong source (recall) |
| Retrieval | Context precision | Are relevant chunks ranked above irrelevant ones | Wrong source (precision) |
| Retrieval | MRR / nDCG | Position of the first / all relevant chunks (rank-weighted) | Wrong source |
| Retrieval | Filter correctness | Did hard filters (e.g. ISO/RTO) scope correctly | Cross-project contamination |
| Generation: "did it use what it found honestly?" | | | |
| Generation | Faithfulness / groundedness | Fraction of output claims entailed by the retrieved context | Hallucination |
| Generation | Citation accuracy | Does the cited source actually support its claim | Hallucination |
| Quality: "is the output good?" | | | |
| Quality | Answer relevance / completeness | Does it answer the question, fully | Low quality |
| Quality | Instruction-following | Respects constraints — boilerplate, length, tone | Low quality |
| Quality | Format / schema validity | Structured output is well-formed (ties to 2.2) | Low quality |
| Quality | "Winning-ness" | Does an edit make a section more like winning sections (from KL·E) | Low quality (domain) |
| Agentic & operational: "did the workflow behave?" | | | |
| Agentic | Task success / completion | Did the workflow achieve the goal end-to-end | End-to-end |
| Agentic | Tool-selection accuracy | Was the right tool chosen at each step | Process |
| Agentic | Trajectory correctness | Right steps for the right reasons, not luck | Process |
| Operational | Drift / escalation rate | How often it goes off-track or needs a human (2.2) | Reliability |
| Operational | Latency / cost | Speed, tokens, spend per run | Efficiency |
| Real-world | Human edit-distance / acceptance | How much a user changes a proposed edit before accepting it | Live quality |
Every metric here needs a labeled "right answer" to score against — how we build that ground truth, and how we minimize the SME time it takes, is Section 3.2.
03 — Evaluation
Diligence question 3.2
How do you establish ground truth — the labeled "right answers" evals are scored against — and who builds that set? Where the answer depends on our subject-matter experts, how do you minimize the time required from them?
Ground truth is the real bottleneck in enterprise RAG evaluation, so we treat SME time as the scarce resource we engineer around, not an afterthought. The first move is to stop thinking of "ground truth" as one thing: it has three layers — retrieval truth (which chunks are relevant to a query), answer truth (the correct answer text), and preference / rubric truth (which of two outputs is better, or how it scores on a rubric). They cost very different amounts to label, so matching the cheapest viable label type to each metric is already a major saver — most retrieval and faithfulness checks need no authored answer at all.
The single biggest unlock for Viridon is that your archive is already a labeled dataset. A library of won proposals and ~200-page selection reports isn't just source material — a winning section is a gold answer for "how should this section read," and a selection report is labeled feedback on what was strong and weak. So a large share of the "right answers" already exist in your corpus; the work is extraction, not authoring. That turns ground truth from a cost you'd carry into an asset you already own.
Around that, we draw labels from the cheapest sources first (the ladder below). Reference-free metrics need zero SME input — faithfulness is checked against the retrieved context, not a gold answer, and schema validity is deterministic. Implicit labels from real usage are free and compounding: every time Erin accepts, edits, or rejects a proposed change, that's a label, and the edit diff tells us how it was wrong. Synthetic generation with human verification handles the rest — an LLM drafts (question, answer, source-chunk) triples from your documents, and the SME's job collapses from authoring to approving or correcting, which is several times faster.
Where SMEs are needed, we minimize their time deliberately: approve, don't author (review LLM-drafted labels rather than writing from scratch — the biggest single lever); active learning (we surface the highest-value cases — where the system is uncertain or where the judge and a human disagree — instead of asking them to label at random); a small, stratified golden set plus a large auto-graded set (a few hundred carefully chosen, human-verified examples anchor a calibrated LLM judge that handles the volume); and capture in the natural workflow (a thumbs-up, an accepted edit, or a "this source is wrong" flag in the product is a label given without extra effort). The result is that SME involvement is bounded and front-loaded, and trends toward near-zero as usage-based labels compound.
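The active-learning step can be pictured as a priority queue over unlabeled cases. The sketch below ranks cases by judge uncertainty and bumps any case where an existing human label disagrees with the judge. The `judge_pass_prob` field is an assumption (it presumes the judge exposes some pass probability or a score that can be mapped to one), and all names are illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Case:
    case_id: str
    judge_pass_prob: float              # hypothetical: judge's estimated P(output passes)
    human_label: Optional[bool] = None  # an existing label from usage, if any


def review_priority(case: Case) -> float:
    """Higher means more valuable to show to an SME."""
    # Uncertainty peaks at 1.0 when the judge is at 0.5 and falls to 0.0 at 0 or 1.
    uncertainty = 1.0 - abs(2.0 * case.judge_pass_prob - 1.0)
    judge_says_pass = case.judge_pass_prob >= 0.5
    disagrees = case.human_label is not None and judge_says_pass != case.human_label
    return uncertainty + (1.0 if disagrees else 0.0)


def sme_queue(cases: List[Case], budget: int) -> List[Case]:
    """Return the `budget` highest-value cases for SME review."""
    return sorted(cases, key=review_priority, reverse=True)[:budget]


cases = [
    Case("A", 0.95),
    Case("B", 0.50),
    Case("C", 0.90, human_label=False),  # judge and human disagree
    Case("D", 0.10, human_label=False),
]
print([c.case_id for c in sme_queue(cases, budget=2)])
```

The `budget` argument is what keeps SME time bounded: the queue is cut to a fixed number of cases per review session rather than growing with traffic.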
On who builds it: we build the harness, generate the synthetic candidates, mine the historical corpus, and run and calibrate the judge; your SMEs spend bounded, high-leverage time approving and correcting the golden set and resolving the contested cases; and the product harvests implicit labels continuously. Two caveats. Ground truth isn't static or singular — SMEs disagree and what "wins" shifts as sponsors change — so we measure inter-annotator agreement, version the golden set, and treat it as living rather than a one-time deliverable. And synthetic labels carry a bias risk — an auto-generated test set can be easy in the same ways the system is good, flattering the scores — which we counter by seeding from your real artifacts and keeping a human-authored slice as the hard anchor.
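Inter-annotator agreement is commonly reported as Cohen's kappa, which corrects raw agreement for the agreement two labelers would reach by chance. The sketch below is a plain-Python version for two annotators; the choice of kappa (rather than another agreement statistic) and the sample labels are assumptions for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if each annotator labeled at random with their own label mix.
    expected = sum(
        counts_a[label] * counts_b[label] for label in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical relevance labels from two SMEs on ten chunks (Y = relevant).
sme_1 = list("YYYYYYNNNN")
sme_2 = list("YYYYYNYNNN")
print(round(cohens_kappa(sme_1, sme_2), 3))
```

Here the two SMEs agree on 8 of 10 items, but kappa is noticeably lower than 0.8 because roughly half of that agreement would be expected by chance.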
Most coverage comes from the cheap and free sources at the top; the expensive, SME-authored slice is kept small and high-leverage. As usage grows, the free implicit labels compound and the SME share shrinks further.
| Layer | What's labeled | Typical cost | How we get it |
|---|---|---|---|
| Retrieval truth | Which chunks are relevant to a query | Low | Confirm the source, or bootstrap from a known answer's source chunk |
| Answer truth | The correct answer text | High | Mine from won proposals; synthetic + verify; small SME-authored anchor |
| Preference / rubric truth | Which output is better, or its rubric score | Low–medium | A/B preference or rubric — far cheaper than authoring gold; "winning-ness" derived from selection reports (KL·E) |
| Who | Does what |
|---|---|
| BetterBrain | Builds the eval harness; generates synthetic candidates; mines the historical corpus; runs and calibrates the judge |
| Viridon SMEs | Bounded, high-leverage time: approve / correct the golden set, resolve contested cases, set the "winning" rubric |
| The product | Harvests implicit labels continuously (accept / edit / reject) — zero added effort |
The implicit-from-usage labels are also the input to the self-learning loop in Section 4, and they're the same signal as the human edit-distance metric in 3.1 — ground truth and the feedback loop are two views of the same data.
03 — Evaluation
Diligence question 3.3
How are evals run operationally — an automated pipeline, your team, our team, or a hybrid? What is the division of labor, and what ongoing time commitment would you expect from us? Specifically: when a prompt or model changes, how do you confirm the change did not break existing behavior (regression testing)?
Evals run as a hybrid at three cadences, not one — so the answer to "automated, your team, or ours" is: all three, at different speeds. A fast automated suite runs in CI on every prompt or model change and blocks the merge if it regresses (the regression gate). A comprehensive batch runs nightly and on-demand against the full golden set for the thorough scorecard. And continuous online monitoring scores sampled production traffic on reference-free metrics. Automation does the volume; humans do the judgment.
On division of labor: BetterBrain builds and maintains the pipeline, writes the metrics, owns the CI gate, triages regressions, and calibrates the judge. The automated system does the bulk of the work — CI on every change, the nightly batch, and live monitoring — with no human in the loop. Your SMEs touch only the irreducibly human part: periodic review of the golden set and adjudicating the handful of borderline cases CI surfaces.
On your time commitment — concretely, because you asked: almost all of it is upfront, establishing and ratifying the golden set and the "winning" rubric — on the order of 15–20 SME-hours, front-loaded over the first few weeks, and mostly approve-not-author (per 3.2). After that there is no standing commitment: ongoing involvement is ad-hoc only — when the golden set needs an update because the corpus or sponsor criteria changed — averaging under 30 minutes a month, and trending down further as implicit usage labels compound.
On regression testing: the locked, versioned golden set is the regression suite. On any prompt, model, embedding/index, or tool change, we re-run it and compare to the previous baseline. Because outputs are non-deterministic we don't assert string equality — we gate on metric thresholds and no-regression deltas ("no metric dropped more than N% vs. baseline"). The most actionable technique is A/B diffing: surface only the examples that flipped pass↔fail, so a reviewer looks at the handful that changed, not all 500. And we slice and gate per segment (doc type, ISO/RTO, question type), because a sub-segment can tank while the average stays flat — the silent-degradation trap that aggregate-only eval misses.
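A minimal sketch of the gate described above, assuming scores are stored per segment and per metric: it flags any metric that falls below an absolute floor or drops by more than a relative tolerance against the baseline, and separately lists the cases that flipped between pass and fail. The function names, the 2% default, and the score values are illustrative, not the actual thresholds.

```python
from typing import Dict, List, Optional

Scores = Dict[str, Dict[str, float]]  # segment -> metric -> value


def regression_violations(
    baseline: Scores,
    candidate: Scores,
    max_rel_drop: float = 0.02,
    floors: Optional[Dict[str, float]] = None,
) -> List[str]:
    """Return one message per violated gate; an empty list means the change may merge."""
    floors = floors or {}
    violations: List[str] = []
    for segment, metrics in baseline.items():
        for metric, base in metrics.items():
            cand = candidate[segment][metric]
            floor = floors.get(metric)
            if floor is not None and cand < floor:
                violations.append(f"{segment}/{metric}: {cand:.3f} is below the floor {floor:.3f}")
            elif base > 0 and (base - cand) / base > max_rel_drop:
                violations.append(f"{segment}/{metric}: dropped from {base:.3f} to {cand:.3f}")
    return violations


def flipped_cases(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Ids of golden-set cases whose pass/fail status changed between two runs."""
    return sorted(cid for cid in before if cid in after and before[cid] != after[cid])


baseline = {"all": {"faithfulness": 0.96}, "CAISO": {"faithfulness": 0.95}}
candidate = {"all": {"faithfulness": 0.955}, "CAISO": {"faithfulness": 0.85}}
print(regression_violations(baseline, candidate))
print(flipped_cases({"q1": True, "q2": True, "q3": False}, {"q1": True, "q2": False, "q3": True}))
```

In this made-up run the aggregate barely moves while the CAISO segment fails the gate, which is the case per-segment gating is meant to catch.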
Two operational notes: The judge model is itself non-deterministic, so a "regression" can be judge noise — we pin the judge's model and prompt versions, average over runs on the golden set, and route borderline flips to a human. And we treat eval cases as version-controlled code — the golden set lives in the repo and changes via review, so the suite evolves with the same rigor as the system. The loop closes with 2.3 and 3.2: production failures caught by monitoring are promoted into the golden set, so the regression suite gets harder exactly where the system is weak. Tooling — Promptfoo, Langfuse or Phoenix — is self-hosted in-VPC per the deployment principle. The full operating model is the plan below.
Artifact
The operating model for the proposal-writing use case: the three cadences, the regression gate, and the SME time budget in one view.
Cadence 1 · automated
Cadence 2 · scheduled
Cadence 3 · continuous
Your time · upfront
~15–20 SME-hours
Front-loaded over the first few weeks — establish & ratify the golden set and "winning" rubric (mostly approve-not-author).
Your time · ongoing
< 30 min / month
Ad-hoc only — golden-set updates when the corpus or sponsor criteria change. No standing commitment.
| What we measure | How we measure it · ground truth | Target (illustrative) |
|---|---|---|
| Setup: template, variables, change detection | | |
| Template coverage | Generated template vs. SME-ratified structure from prior wins | ≥ 95% required sections present |
| Variable detection (precision / recall) | Detected variables (project_name, sponsor, capacity…) vs. SME-labeled set on held-out proposals | P ≥ 0.95 · R ≥ 0.90 |
| Change-flagging (precision / recall) | Parts flagged "likely to change" vs. SME-labeled actual changes across historical proposal pairs | R ≥ 0.90 · P ≥ 0.80 |
| Variable-fill accuracy | Filled field values vs. the new project brief | ≥ 0.95 |
| Update propagation (precision / recall) | All correct locations updated across 300+ pages, nothing else, vs. labeled change-set | R = 1.0 · P ≥ 0.95 |
| Advice & AI editing | | |
| [now] Advice retrieval (recall@k) | Relevant selection-report advice retrieved, vs. labeled edit-context → advice pairs | Recall@5 ≥ 0.90 |
| [now] Scoping / no contamination | Adversarial cross-project queries: does it ever pull another project's data? | 0 cross-project leaks |
| [now] Recommendation grounding (faithfulness + citation) | Every comment's claim checked against its cited retrieved source | Faithful ≥ 0.95 · Citation ≥ 0.98 |
| Drafted-content grounding | Drafted-section claims checked vs. brief + prior wins | ≥ 0.95 |
| "Winning-ness" of edits | LLM-judge rubric from selection-report advice (KL·E) + SME preference on a sample | Edit improves score in ≥ 80% |
| [now] Boilerplate preservation | Diff of changed text vs. the change-map; only flagged parts touched | ≥ 0.99 untouched |
| Real-world & operational | | |
| [now] Human acceptance + edit-distance | Live accept / edit / reject on proposed changes (implicit labels) | Acceptance ≥ 70% & rising |
| Workflow completion | Trace status (2.3): valid template + filled draft, no failed step | ≥ 0.98 |
| [now] Latency · cost · drift | Production monitoring (2.3) | Within budget · escalation only at human gate |
Targets are illustrative starting points; the actual thresholds are set from the baseline once the golden set is established (3.2) and become the no-regression bar in CI.
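Several rows in the table (variable detection, change-flagging, update propagation) are set-based precision / recall checks against a labeled set. A minimal sketch, with hypothetical section ids and the table's illustrative change-flagging targets:

```python
from typing import Set, Tuple


def precision_recall(predicted: Set[str], actual: Set[str]) -> Tuple[float, float]:
    """Precision and recall of a predicted set against a labeled set.

    Assumption: an empty prediction has precision 1.0 and an empty label
    set has recall 1.0, since there is nothing to get wrong.
    """
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(actual) if actual else 1.0
    return precision, recall


# Hypothetical change-flagging result: one section flagged that did not change.
flagged = {"§1.2", "§3.2", "§4.1", "§5.0"}
changed = {"§1.2", "§3.2", "§4.1"}
precision, recall = precision_recall(flagged, changed)

# Illustrative targets from the table: R >= 0.90, P >= 0.80.
print(precision, recall, recall >= 0.90 and precision >= 0.80)
```

In this example recall meets its target but precision does not, so the row would fail even though nothing that changed was missed.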
Phased rollout — where we start
We don't stand all of this up at once. The highlighted rows (tagged "now" in the table above) are our initial focus: the metrics that most directly safeguard your data, prevent unsupported claims, and reflect real-world usefulness, and that we can put in place quickly. The rest layer in as the golden set and live usage data mature.
04 — Self-learning & institutional memory
Diligence question 4.1
How does the system learn from use over time, and what tools or techniques support this? Specifically: what signal is captured (accept/reject, edits to drafts, explicit corrections); is learning applied live/in-session or through a batch "reflection" process (e.g. nightly), and why that cadence; and does this same feedback feed into how you evaluate the system?
The system learns primarily in the knowledge layer, not in the model weights. When Erin edits or rejects a proposed change, we capture the before→after diff and its context, generate a piece of reusable advice from it ("for CAISO deliverability sections, lead with the quantified outcome"), and index that advice so future retrievals surface it. It's the same machinery as the selection-report advice module (KL·E), and the result is inspectable and correctable — you can read, edit, or delete what the system has "learned" (Section 4.3). And it isn't only advice that improves: the same feedback updates other knowledge-layer components in Figure 1 — the entity & concept map / knowledge graph (KL·G), the "what changes" patterns (KL·D), and the glossary (KL·L) — so a correction can fix an entity or a relationship in the graph, not just a piece of guidance.
On cadence, it's both — split by mechanism. Anything that's just retrieval-time context applies live: a correction becomes an indexed advice entry the very next retrieval can pull, with nothing retrained, and the in-session working memory (2.1) already adapts within a task. Anything that involves synthesis or judgment is deferred to a batch / nightly "reflection": clustering many edits into one durable advice entry, resolving contradictions when SMEs disagree, promoting a pattern only once it's been seen several times, and re-ranking which advice is trusted. Why that cadence: a single edit is noisy, and you don't want one idiosyncratic correction to immediately reshape behavior for everyone — the batch step is deliberate noise control, and it's where the degradation guardrail lives (4.5).
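The promote-after-N part of the nightly reflection can be sketched as a simple count over the day's corrections. Grouping by an exact (scope, pattern) key is a deliberate simplification of the clustering described above, and the class name, threshold, and sample data are assumptions for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Correction:
    scope: str    # where the edit was made, e.g. an ISO/RTO plus section type
    pattern: str  # the lesson extracted from the before/after diff


def nightly_promotions(corrections: List[Correction], min_seen: int = 3) -> List[Tuple[str, str]]:
    """Return (scope, pattern) pairs seen at least `min_seen` times.

    A one-off correction stays available as live retrieval context but is
    not promoted to durable advice until it recurs.
    """
    counts = Counter((c.scope, c.pattern) for c in corrections)
    return sorted(key for key, seen in counts.items() if seen >= min_seen)


batch = [
    Correction("CAISO/deliverability", "lead with the quantified outcome"),
    Correction("CAISO/deliverability", "lead with the quantified outcome"),
    Correction("CAISO/deliverability", "lead with the quantified outcome"),
    Correction("CAISO/executive-summary", "shorten the opening paragraph"),
]
print(nightly_promotions(batch))
```

A real consolidation step would also need to merge near-duplicate wording and resolve contradictions, which exact-match counting does not attempt.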
The signal we capture is richer than accept/reject. The most valuable is the edit diff itself — the corrected text tells us not just that a suggestion was wrong but how, and it doubles as a free gold label. Around it: explicit accept / reject / thumbs, behavioral signals (used, ignored, asked a follow-up, re-ran the search), and explicit corrections ("this source is wrong", "this advice doesn't apply here") — the highest-value, lowest-volume signal.
On reinforcement learning, we use the framing deliberately, not the heavy machinery. We treat acceptance as a reward signal and optimize the policy that decides which advice and which retrieval configuration to surface — not the model weights. The genuine reinforcement-style mechanisms on our roadmap are (1) a system that automatically learns which advice and which result-ranking to surface — it tries different options, watches which ones lead Erin to accept the suggestion, and shifts toward the ones that work, essentially self-tuning A/B testing that runs continuously, fully in-VPC, with no model weights touched — and (2) automatic tuning of the prompts and examples against the eval metrics, so they're optimized by measurement rather than by hand. We explicitly do not fine-tune model weights: it would break the inspectability we're selling, complicate the in-VPC deployment, make evals harder, and is the wrong investment at this corpus size. What we do is accumulate the accept/reject preference pairs as an asset — the dataset that would make fine-tuning possible later, to be spent only if the eval gain ever warrants it.
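The roadmap item "self-tuning A/B testing" is, in textbook terms, a multi-armed bandit. The sketch below uses Thompson sampling over accept / reject counts as one plausible way to do it; the algorithm choice, the class name, the arm names, and the simulated acceptance rates are all assumptions, not a description of a committed design.

```python
import random
from typing import Dict, List


class AdviceBandit:
    """Chooses among retrieval / advice configurations from accept-reject feedback."""

    def __init__(self, arms: List[str], seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.accepts = {arm: 0 for arm in arms}
        self.rejects = {arm: 0 for arm in arms}

    def choose(self) -> str:
        # Draw a plausible acceptance rate for each arm from its Beta posterior
        # and pick the arm with the highest draw (Thompson sampling).
        draws = {
            arm: self.rng.betavariate(self.accepts[arm] + 1, self.rejects[arm] + 1)
            for arm in self.accepts
        }
        return max(draws, key=draws.get)

    def record(self, arm: str, accepted: bool) -> None:
        if accepted:
            self.accepts[arm] += 1
        else:
            self.rejects[arm] += 1


def simulate(true_rates: Dict[str, float], rounds: int = 500, seed: int = 0) -> Dict[str, int]:
    """Run the bandit against made-up acceptance rates; return pulls per arm."""
    bandit = AdviceBandit(list(true_rates), seed=seed)
    reviewer = random.Random(seed + 1)  # stands in for the human accept / reject
    for _ in range(rounds):
        arm = bandit.choose()
        bandit.record(arm, reviewer.random() < true_rates[arm])
    return {arm: bandit.accepts[arm] + bandit.rejects[arm] for arm in true_rates}


print(simulate({"rank_profile_a": 0.8, "rank_profile_b": 0.3}))
```

Only the counts change as feedback arrives; no model weights are involved, which is the property the paragraph above is stressing.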
And yes — the same signal feeds evaluation. Every accept / edit / reject is simultaneously a learning signal and an eval label (the human-edit-distance metric in 3.1, the implicit labels in 3.2). That coupling is also the safeguard — a new or updated advice entry is promoted only if it does not regress the golden set, so the feedback loop and the eval loop are the same flywheel: usage → advice → eval-gated promotion → better retrieval → more usage. High-impact advice can require human approval before it goes live, so what the system learns stays governed.
| Signal | What it tells us | Type |
|---|---|---|
| Edit diff (before → after) | How a suggestion was wrong; the corrected text is a free gold label | Implicit · richest |
| Accept / reject / thumbs | Coarse good / bad on a proposed change | Explicit |
| Behavioral | Used, ignored, asked a follow-up, re-ran the search | Implicit |
| Explicit correction | "This source is wrong" · "this advice doesn't apply here" | Explicit · highest-value |
| Mechanism | Cadence | Why |
|---|---|---|
| Index a correction as advice | Live (in-session) | Retrieval-time context, no retraining — next retrieval can use it |
| Same-session adaptation (working memory) | Live | Within-task only; resets per run (2.1) |
| Consolidate many edits into durable advice | Batch (nightly) | Dedup + synthesis; one edit is noisy |
| Resolve contradictions / promote after N | Batch | Judgment + noise control |
| Re-rank which advice is trusted | Batch | Needs aggregate signal |
| Promote a new/updated entry | Batch | Eval-gated — must not regress the golden set (4.5) |
| Approach | Stance | Notes |
|---|---|---|
| Knowledge-layer learning (edit → advice → index) | core · live | The primary mechanism; inspectable and correctable |
| Acceptance as a reward signal over the retrieval / advice policy | yes | Reinforcement-style, no weight changes |
| Auto-learn which advice / ranking to surface | roadmap | Tries options and favors the ones that get accepted; continuous, self-tuning A/B testing; in-VPC, no weights touched |
| Auto-tune the prompts and examples vs. eval metrics | roadmap | Optimized by measurement instead of by hand |
| Accumulate accept/reject preference pairs | yes | Banked as an asset; spent only if evals justify |
| Fine-tune model weights (RLHF) | not planned | Breaks inspectability + in-VPC simplicity; wrong investment at this scale |
04 — Self-learning & institutional memory
Diligence question 4.2
When the system "learns," what concretely changes — retrieval ranking, prompts, a memory/concept store, model weights, or something else?
In one line: what changes is data and configuration, not the model. The core mechanism is the concept / advice store — learning adds, updates, and re-ranks indexed advice entries and the entity & concept map (the knowledge graph), along with the other knowledge-layer modules in Figure 1, which together are the institutional memory we cover in 4.3. Everything else follows from that: because new advice is indexed, retrieval surfaces different, better-grounded context, and on the roadmap the ranking itself self-tunes toward what gets accepted. Prompts and examples are auto-tuned against the eval metrics (roadmap), not edited per interaction. Session memory adapts live within a task and resets per run (2.1). And the eval golden set itself grows as production failures are promoted into it — so the system's measurement improves alongside its behavior. Model weights do not change, by design.
| Candidate | Changes? | What changes |
|---|---|---|
| Memory / concept store | Yes — primary | Advice entries added / updated / re-ranked — the institutional memory (4.3) |
| Concept / entity graph (KL·G) | Yes | New or corrected entities & links — projects, customers, ISO/RTOs, terms (Figure 1) |
| Retrieval ranking | Yes | New advice changes what's surfaced; ranking self-tunes by acceptance (roadmap) |
| Prompts / examples | Yes (roadmap) | Auto-tuned against eval metrics — not edited per interaction |
| Session / working memory | Yes — live | Within-task adaptation; resets per run (2.1) |
| Eval golden set | Yes | Grows as production failures are promoted (3.2 / 4.5) |
| Model weights | No | Unchanged by design — inspectability, in-VPC, eval simplicity |
04 — Self-learning & institutional memory
Diligence question 4.3
What accumulates as institutional memory, and in what form? Specifically: is it human-readable, auditable, and correctable — can we inspect and fix what the system "knows"?
What accumulates is structured, human-readable knowledge — not opaque vectors or model weights. The institutional memory is a set of concept and advice entries in the knowledge layer: the advice mined from edits and selection reports (KL·E), the entity and concept map of customers, projects, ISO/RTOs and terms and how they link (KL·G), the "what changes" patterns (KL·D), and the onboarding glossary (KL·L). Each is a record you can read in plain language, not a number in a tensor.
In form, every entry is a structured record: a plain-language statement of the knowledge, the scope it applies to, its provenance (which edits and sources produced it), confidence and usage stats, a status, and a version history. The anatomy is below — it reads like a note with an audit trail.
To the heart of your question — yes, all of it is human-readable, auditable, and correctable. Everything the system learns is exposed to you in plain language and is fully editable: you can inspect any entry and trace it back to the edits and sources that produced it (the provenance from 2.3), correct or rewrite it, and disable, delete, or pin it. Because learning lives in the concept store and not in the weights (4.2), the entire learned state is open to inspection and repair — there is no hidden knowledge baked into a model you can't read. A curation view makes this a first-class surface (4.4).
Auditability comes from the same structure: each entry carries provenance (what created it, when), version history (what changed), and usage stats (how often it's retrieved and applied, and how often accepted) — so you can audit not just what it knows but why it knows it and whether it's actually being used. A correction simply becomes the new entry (eval-gated before it's trusted, per 4.5). And because the whole memory is a readable, ownable artifact rather than a black box, it transfers with the company — the same owned-asset principle that runs through the architecture.
| Field | What it holds | Illustrative |
|---|---|---|
| Statement | The knowledge, in plain language | "For CAISO deliverability sections, lead with the quantified outcome" |
| Scope | Where it applies | ISO_RTO = CAISO · proposal §deliverability |
| Provenance | What produced it | Derived from 3 accepted edits + selection report p.142 |
| Confidence / usage | How trusted, how often used | Seen 7× · applied 23× · 91% accepted |
| Priority | Optional manual weight — overrides auto-confidence to set precedence | Normal · High · Critical (e.g. pin a must-follow rule) |
| Status | Active / disabled / pinned | Active |
| Version history | What changed & when | v2 — broadened from "Sunrise" to all CAISO (Apr 2026) |
A fully populated example entry and the end-to-end learning loop are provided as the Section 4 artifact.
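To make the record anatomy concrete, here is a hedged sketch of how such an entry might be modeled. The field names follow the table above, but the class itself, the `revise` method, and the exact accepted count (21 of 23, chosen to match the illustrative 91%) are assumptions rather than the actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class AdviceEntry:
    statement: str                # the knowledge, in plain language
    scope: Dict[str, str]         # where it applies
    provenance: List[str]         # edits and sources that produced it
    seen: int = 0
    applied: int = 0
    accepted: int = 0
    priority: str = "Normal"      # Normal / High / Critical
    status: str = "active"        # active / disabled / pinned
    version: int = 1
    history: List[Tuple[int, str, str]] = field(default_factory=list)

    @property
    def acceptance_rate(self) -> float:
        return self.accepted / self.applied if self.applied else 0.0

    def revise(self, new_statement: str, note: str) -> None:
        """Replace the statement, keeping the old version and the reason for the change."""
        self.history.append((self.version, self.statement, note))
        self.statement = new_statement
        self.version += 1


entry = AdviceEntry(
    statement="For Sunrise deliverability sections, lead with the quantified outcome",
    scope={"ISO_RTO": "CAISO", "section": "deliverability"},
    provenance=["3 accepted edits", "selection report p.142"],
    seen=7,
    applied=23,
    accepted=21,
)
entry.revise(
    "For CAISO deliverability sections, lead with the quantified outcome",
    note='broadened from "Sunrise" to all CAISO',
)
print(entry.version, round(entry.acceptance_rate, 2), entry.history[0][2])
```

Because a revision appends to `history` instead of overwriting, a reviewer can always see what the entry used to say and why it changed.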
| Memory | Form | Source |
|---|---|---|
| Advice entries (KL·E) | Plain-language guidance with scope & provenance | Mined from accepted / rejected edits + selection reports |
| Concept & entity map (KL·G) | Entities (customers, projects, ISO/RTOs, terms) and the links between them | Ingestion, usage + corrections |
| "What changes" patterns (KL·D) | Which parts of a proposal historically need changing | Cross-proposal history + accepted edits |
| Onboarding glossary (KL·L) | Company context, terminology, how concepts connect | Corpus + curation |
All of this runs through a curation surface — the upkeep model is Section 4.4.
04 — Self-learning & institutional memory
Diligence question 4.4
How much of this is automated versus requiring human curation, and by whom?
Upkeep is almost entirely automated. The loop in 4.1 does the work with no person in it — generating advice and entity/graph updates from edits, indexing, consolidating and de-duplicating them, scoring confidence, promoting a pattern only once it's been seen enough times, and re-ranking which knowledge is trusted. Humans do not author or maintain the memory by hand.
The human role is oversight by exception. The one thing worth watching for is an incorrect deduction — the system over-generalizing a one-off edit into a rule that shouldn't apply broadly. When that happens, a reviewer corrects, narrows, disables, or deletes the entry (or adds one directly to teach something) through the curation surface (4.3). It's review-and-fix when something looks off, not continuous curation.
By whom, and how little: the curation is done by an SME or power user (e.g. Erin) for the proposal domain, with BetterBrain monitoring the memory's overall health and tuning the loop. The burden stays low because the guardrails catch most bad deductions before they ever reach a human — confidence thresholds, promote-after-N, eval-gating (4.5), and approval on high-impact entries (4.1) — so what reaches manual review is the genuine exceptions. This is consistent with the eval upkeep budget in 3.3: bounded and ad-hoc, well below a standing commitment.
| Task | Owner |
|---|---|
| Generate advice + entity/graph updates from edits & corrections | Automated |
| Index, consolidate, dedup, re-rank; score confidence; promote-after-N | Automated |
| Eval-gate changes before they're trusted (4.5) | Automated |
| Surface low-confidence / contested entries for review | Automated → flags to humans |
| Review flagged entries; correct / narrow over-general deductions | Human — SME / power user |
| Add, edit, disable, delete, prioritize entries as needed | Human — SME / power user |
| Monitor memory health; tune the loop | BetterBrain |
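The split between automated upkeep and oversight by exception can be sketched as a small routing function. This is an illustration under assumptions, not the production logic: the thresholds, the `Candidate` fields, and the three route names are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical thresholds, for illustration only.
MIN_CONFIDENCE = 0.7
MIN_OCCURRENCES = 3


@dataclass
class Candidate:
    statement: str
    confidence: float          # 0.0 to 1.0, scored by the loop
    occurrences: int           # how many times the pattern has been seen
    contested: bool = False    # conflicting signals, e.g. accepted and rejected
    high_impact: bool = False  # would affect many proposals if wrong


def triage(c: Candidate) -> str:
    """Return 'hold', 'review', or 'auto' for a candidate entry."""
    if c.occurrences < MIN_OCCURRENCES:
        return "hold"      # promote-after-N: keep collecting evidence
    if c.contested or c.high_impact or c.confidence < MIN_CONFIDENCE:
        return "review"    # flagged to the SME / power user
    return "auto"          # continues to the eval gate with no human step
```

Only the `review` route costs human time, which is why the manual burden tracks the number of genuine exceptions rather than the volume of edits.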
04 — Self-learning & institutional memory
Diligence question 4.5
How do you prevent a feedback loop from reinforcing errors or degrading the system over time?
The core risk in any feedback loop is an echo chamber: the system learns from its own outputs and reinforces its own mistakes until errors quietly become "truth." We prevent that with defense-in-depth — multiple independent safeguards, so that if one misses a problem the next one catches it — and one structural choice does most of the work.
That choice: learning is measured against an independent anchor, not against the model's own recent behavior. A new or updated entry is promoted only if it doesn't regress the golden set (3.3), and the golden set is anchored in external truth — selection reports plus a human-authored slice (3.2) — not in whatever the model did lately. So the loop can't drift to merely agree with itself.
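A minimal sketch of that gate, assuming the memory can be represented as a set of entry ids and that some scorer can grade the system against a golden case. The names `passes_gate`, `golden_score`, and `toy_scorer` are hypothetical, and the toy scorer only stands in for a real run of the pipeline against the golden set.

```python
from statistics import mean
from typing import Callable, FrozenSet, Sequence

# Hypothetical types: memory is a set of entry ids; a golden case is a dict.
Memory = FrozenSet[str]
Scorer = Callable[[Memory, dict], float]


def golden_score(memory: Memory, golden_set: Sequence[dict], scorer: Scorer) -> float:
    """Mean score on the golden set with this memory loaded."""
    return mean(scorer(memory, case) for case in golden_set)


def passes_gate(current: Memory, entry_id: str, golden_set: Sequence[dict],
                scorer: Scorer, tolerance: float = 0.0) -> bool:
    """Promote only if adding the entry does not regress the golden-set score."""
    before = golden_score(current, golden_set, scorer)
    after = golden_score(current | {entry_id}, golden_set, scorer)
    return after >= before - tolerance


def toy_scorer(memory: Memory, case: dict) -> float:
    """Stand-in scorer; a real one would run the pipeline and grade its output."""
    if case.get("breaks_with") in memory:
        return 0.0
    return 1.0 if case["needs"] in memory else 0.5


# Golden cases come from external truth, never from the model's own outputs.
golden = [
    {"needs": "lead-with-outcome", "breaks_with": "over-general-rule"},
    {"needs": "cite-study", "breaks_with": None},
]
current = frozenset({"lead-with-outcome"})
```

The key property is that `before` and `after` are both measured on the same fixed reference, so an entry cannot earn promotion merely by agreeing with what the model already does.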
On top of that, noise control and negative signal: we consolidate in batch and promote only after a pattern recurs (promote-after-N), so one idiosyncratic edit can't reshape behavior, and contradictions are resolved rather than stacked (4.1). And we learn from rejections and heavy edits, not just acceptances — so the loop isn't one-sidedly reinforcing what it already does.
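One way to picture batch consolidation with negative signal is a net-support count per pattern. This is a sketch under assumptions: the threshold of 3, the `(pattern, supports)` signal shape, and the net-count rule for resolving contradictions are illustrative choices, not the system's actual scoring.

```python
from collections import defaultdict

PROMOTE_AFTER_N = 3  # hypothetical threshold


def consolidate(batch, promote_after_n=PROMOTE_AFTER_N):
    """Return patterns whose net support in this batch reaches the threshold.

    batch: iterable of (pattern, supports) pairs, where supports is True when
    a signal backs the pattern (an accepted edit) and False when it
    contradicts it (a rejection or a heavy rewrite).
    """
    support = defaultdict(int)
    against = defaultdict(int)
    for pattern, supports in batch:
        if supports:
            support[pattern] += 1
        else:
            against[pattern] += 1

    promoted = []
    for pattern, count in support.items():
        # Contradictions are netted out rather than stored as rival entries.
        if count - against[pattern] >= promote_after_n:
            promoted.append(pattern)
    return sorted(promoted)


# Illustrative batch of signals collected since the last consolidation.
batch = (
    [("lead with outcome", True)] * 4 + [("lead with outcome", False)]
    + [("name the sponsor", True)]
    + [("short intro", True)] * 3 + [("short intro", False)] * 2
)
```

Here only "lead with outcome" would be promoted: the single "name the sponsor" edit is too thin, and "short intro" is contested enough that its net support stays below the threshold.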
Then catch and contain: per-segment regression gating and online monitoring (2.3, 3.3) catch degradation even when the aggregate looks fine. And because the memory is data, not model weights (4.1), with full provenance and version history (4.3), a bad deduction is traceable, reversible, and deletable, so the blast radius is bounded and nothing compounds silently. High-impact changes can also be rolled out gradually: applied to a small slice first (a "canary") and widened only if quality holds on that slice, before they're fully trusted.
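The reason for gating per segment rather than on the aggregate is easiest to see with numbers. The sketch below assumes scores are already computed per segment; the segment names, scores, and the 0.02 tolerance are made up for illustration.

```python
from statistics import mean


def regressed_segments(baseline, candidate, tolerance=0.02):
    """Return segments whose score dropped by more than the tolerance.

    baseline, candidate: dicts mapping segment name to an eval score.
    A segment missing from the candidate is treated as scoring 0.0.
    """
    return sorted(
        seg for seg, base in baseline.items()
        if candidate.get(seg, 0.0) < base - tolerance
    )


# Illustrative scores per ISO/RTO segment, before and after a change.
baseline = {"CAISO": 0.80, "ERCOT": 0.80, "PJM": 0.80}
candidate = {"CAISO": 0.90, "ERCOT": 0.85, "PJM": 0.70}

aggregate_improved = mean(candidate.values()) > mean(baseline.values())
blocked_by = regressed_segments(baseline, candidate)
```

In this example the aggregate rises, yet the change is still blocked because one segment regressed, which is exactly the case an aggregate-only gate would miss.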
Finally, knowledge also degrades by going stale — a sponsor's criteria shift, an ISO/RTO rule changes. Entries carry recency and versioning, older ones decay or get re-validated, and periodic re-evaluation keeps the memory current. Taken together, errors can't quietly become truth, drift is caught against an external reference, and the whole memory stays inspectable and reversible.
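Staleness handling could look like the following sketch, which discounts an entry's confidence by its age and flags it for re-validation once it falls below a floor. The exponential form, the one-year half-life, and the 0.5 floor are assumptions for illustration; the source only states that older entries decay or get re-validated.

```python
from datetime import date

HALF_LIFE_DAYS = 365.0     # hypothetical: confidence halves each year unvalidated
REVALIDATE_BELOW = 0.5     # hypothetical floor that triggers re-validation


def decayed_confidence(confidence, last_validated, today,
                       half_life_days=HALF_LIFE_DAYS):
    """Discount confidence exponentially by days since last validation."""
    age_days = max((today - last_validated).days, 0)
    return confidence * 0.5 ** (age_days / half_life_days)


def needs_revalidation(confidence, last_validated, today):
    """True when the decayed confidence has dropped below the floor."""
    return decayed_confidence(confidence, last_validated, today) < REVALIDATE_BELOW


# An entry last validated a year ago has lost half its confidence.
current = decayed_confidence(0.9, date(2025, 4, 1), date(2026, 4, 1))
```

Re-validating an entry would simply reset `last_validated`, so knowledge that keeps proving itself never decays out of use.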
| Failure mode | Safeguard |
|---|---|
| Errors reinforced as "truth" | Promotion eval-gated against an independent golden set; promote-after-N + confidence thresholds (3.3, 4.1) |
| Echo chamber (learns only from its own outputs) | Anchored to external truth — selection reports + human-authored slice (3.2); rejections & corrections weighted, not just acceptances |
| One noisy edit reshapes behavior | Batch consolidation; promote-after-N; contradictions resolved, not stacked (4.1) |
| Silent / per-segment degradation | Per-segment regression gating + online monitoring (2.3, 3.3) |
| A bad entry compounds invisibly | Full provenance + version history; inspect / disable / delete (4.3) — bounded blast radius |
| Degradation baked in irreversibly | Learning is data, not weights (4.1) — reversible and roll-back-able |
| A bad change ships widely | CI eval-gate + gradual / canary rollout before full trust |
| Stale knowledge (criteria change) | Recency / versioning; decay or re-validation of old entries; periodic re-eval |
Artifact
What is stored, where it lives, and on what cadence it updates — followed by a fully populated example of a single concept/memory entry.
Rejections & heavy edits weighted, not just acceptances · production failures grow the golden set · feedback loop and eval loop are one flywheel
| What's stored | Where (Figure 1) | Cadence |
|---|---|---|
| Raw signal — accept / edit / reject + diff + context | Signal log | Live, on every action |
| Advice entries (guidance from edits + selection reports) | KL·E advice store | Live to index · nightly to consolidate |
| Concept & entity links | KL·G knowledge graph | Nightly (+ live for direct corrections) |
| "What changes" patterns | KL·D | Nightly |
| Glossary / terms | KL·L | Nightly / on curation |
| Trust & ranking of advice | KL·E + reranker (KL·H) | Nightly, eval-gated |
| Preference pairs (accept / reject) | Eval + training-data bank | Live append; spent only if evals justify |
| Eval golden set | Eval store | Grows as production failures are promoted |
↳ The example entry is produced by step 3 of the loop (nightly consolidation of 3 edits + selection report p.142) and eval-gated to v2; this is what the loop outputs into KL·E.
This is the institutional memory in concrete form: human-readable, traceable to the edits and sources that produced it, prioritizable, and correctable — and it lives in your platform as an owned asset, not a black box.