The architecture

The whole picture, end to end.

Here's how we think about the architecture. Your documents feed a Knowledge Layer made of many components. Tools and an orchestrator draw on those components, many-to-many. The orchestrator chains tools into workflows, and workflows compose into the mini-apps your teams use. Color shows what we bring vs. what we build net-new, but everything on this diagram is fully customizable: every tool, every knowledge-layer component, every workflow, the orchestrator, and the mini-apps. Even the pieces we've built before are customized and tuned to exactly what Viridon needs.

Flow ↑ Sources (bottom)  →  Knowledge Layer  →  Tools  →  Workflows  →  Mini-apps (top)


1

Viridon Sources

your data, owned by you
SOURCE: SharePoint · proposals · RFIs · contracts · ISO/RTO public docs. Stays in your tenant. Ingested into the knowledge layer; never used to train other models.

feeds
▲   ingested & structured from   ▲
2

Knowledge Layer

far more than indexing & retrieval — many components
KL·A · Indexing & Retrieval (all apps): Hybrid BM25 + vector (Vespa). This is all Glean gives you.
KL·B · Ingestion & Structuring (all apps): Any doc type → parsed, chunked, tagged, structured records
KL·C · Section & Question Extraction: Break proposals into sections / RFP questions
KL·D · "What Changes" Map: Which parts of a past proposal likely need changing this cycle
KL·E · Selection-Report Advice: Win/loss themes mined from ~200-pg sponsor reports
KL·F · Template Generation: Auto-built proposal templates from past wins
KL·G · Concept Mapping & Linkages (all apps): Concepts, customers, projects, terms, and how they link
KL·H · Scoped Retrieval + Reranking (all apps): Project / client / global scope, RBAC-aware, reranked
KL·I · bespoke · Public-Doc Enrichment (for you): ISO/RTO transmission plans, deliverability studies
KL·J · bespoke · RFI Q&A + SME Delegation (for you): Prior RFI answers & who owned what last time
KL·K · bespoke · Standard-Terms Playbook (for you): Viridon's clause library & acceptable positions
KL·L · bespoke · Onboarding Glossary (for you): Company context, tutorials & how concepts connect

powers
▲   tools & orchestrator pull components   ▲
3

Orchestrator & Tools

both read the knowledge layer · marked tools serve every app
T·Q · Grounded Q&A / chat (all apps): Cited answers over the knowledge layer; the MCP entry point for every team
T1 · Read & comment on a paragraph: Suggests improvements vs. selection-report themes
T2 · Draft a section: From template + structured prior wins
T3 · Identify opportunities: Where to differentiate this bid
T4 · Flow updates across 300+ pages: Numbers, vendors, names everywhere
T5 · Evaluate against criteria: Score a draft vs. what wins
T6 · Aggregate & match attachments: SME reports into one narrative voice
T7 · Web research & scrape (all apps): Live external + public-doc context
T8 · Build a template: Auto-derive from past proposals
T9 · SME router: Likely owner from past delegation patterns
T10 · RFI tracker: Auto-populate items & assign owners
T11 · Clause & field extractor: Counterparty, dates, term, obligations
T12 · Screen vs. standard terms: Flag only what needs human review
T13 · Contract tracker: Repository of every NDA & agreement

chains
▲   orchestrator chains tools   ▲
4

Workflows

orchestrated or deterministic chains of tools
WF · proposal · Setup: Template + flag what's outdated
WF · proposal · Strategy: Find angles & framing
WF · proposal · Drafting: AI teammate drafts & pulls in attachments
WF · proposal · Evaluation: Review, comment, propose edits
WF · RFI · Intake & match: Parse questions · find prior Q&A
WF · RFI · Draft responses: From proposal, SME reports & past RFIs
WF · RFI · Route & track: Assign SME owners · populate tracker
WF · ISO/RTO · SME Q&A: Public docs + project history for a customer
WF · onboarding · New-joiner chat: Company context, people, terminology

composes
▲   composed of   ▲
5

Mini-apps

what each team touches
APP·1 · Proposal Writing Assistant: Origination · Erin
APP·2 · RFI Response Drafter: Origination
APP·3 · Legal Contract Screener: Legal (2-person)
APP·4 · ISO / RTO SME + Onboarding: All teams

Figure 1. High-level overview of the architecture — what's built today and what's net-new for Viridon.

Two things the colors say. First, the components marked all apps — retrieval, ingestion, concept linkages, scoped retrieval, grounded Q&A, web research — are shared infrastructure every mini-app reuses. Second, almost everything is foundation we've built and customize to you; the bespoke pieces — app-specific tools (T1, T3, T4, T6, T10, T12) and knowledge-layer modules KL·I–L — are where we spend the saved time.

Deployment & security principle

Open-source and in-VPC only — nothing leaves Viridon's AWS environment.

We've deliberately constrained the entire stack to components that are either open-source software or managed AWS services reached privately from within Viridon's own account. Parsing, the knowledge layer, the Vespa index, and orchestration all run inside Viridon's VPC — its own private, isolated network within AWS — on EKS / EC2 (Amazon's managed compute); model access goes through Amazon Bedrock (AWS's managed service for running foundation models) over AWS PrivateLink (a private connection that keeps traffic off the public internet). No documents or queries are sent to a third-party SaaS over the public internet, and nothing is locked to a proprietary cloud that we control. The result is a self-contained, owned asset that transfers with the company — the through-line of the whole architecture.

Open-source core · Self-hosted in VPC · Bedrock via PrivateLink · No third-party SaaS egress · No vendor cloud lock-in

01 — Ingestion · Indexing · Retrieval

Parsing

Diligence question 1.1

How would you parse our source documents, and what tools or platforms would you use and why? Specifically: how do you handle complex tables — merged cells, multi-page tables, nested or irregular layouts?

We parse Viridon's corpus with layout-aware parsing rather than naive text extraction, and the distinction matters specifically because of what your documents are: 300+ page proposals with 100+ attachments, ~200-page selection reports, embedded tables and figures, mixed formatting from many authors and source systems. Copying a PDF's text layer and splitting on line breaks scrambles multi-column reading order, flattens tables into meaningless runs of numbers, and silently drops every figure. Instead we render each page, detect its regions with a layout model, recover table structure cell-by-cell, OCR anything without a text layer, and run a vision-language model over figures so that content carried in images becomes searchable text.

The backbone is the Unstructured library, which turns PDF, DOCX, XLSX, and HTML into typed elements — titles, narrative text, tables, images — each carrying its text, metadata, and a bounding box locating it on the page. Three models do the heavy lifting underneath: a layout-detection model (yolox) that finds regions and preserves reading order on complex pages; a table-transformer that recovers row and column structure, including merged cells, and emits each table as addressable HTML; and OCR (optical character recognition — turning page images into machine-readable text) for scanned pages and image regions. We route each document to the right parsing strategy — high-resolution layout-plus-OCR for the messy real-world proposals, a cheaper native-text path for clean digital PDFs, and full-page OCR for scans — rather than forcing a single mode across the corpus.

For complex tables, merged cells and spans are recovered by the table-transformer and preserved all the way through to per-cell records, so a value like "$42M" stays bound to its row ("Phase 2 capex") and column ("2027 estimate"). Nested and irregular tables that defeat structure inference degrade gracefully: we retry without forcing a grid and keep the table as both an image and a VLM-written description, so content is never lost even when the exact grid can't be recovered. Multi-page tables are the one piece that's net-new for Viridon — because we split documents per page for throughput, a table spanning two pages first appears as two elements, and we add a reconciliation pass that detects the continuation (matching headers and column count, no intervening narrative text) and stitches them into one logical table before indexing.

Figures get dedicated handling because so much of the signal in transmission proposals and ISO/RTO studies lives in diagrams, not prose: one-line diagrams, substation layouts, deliverability-study charts. Every figure is extracted, and any figure region that yields no OCR text (a true diagram rather than a text box) is passed to a vision-language model, an AI that looks at an image and describes it in words. We index that description alongside a pointer back to the original image. A query about a deliverability constraint near a given substation can then retrieve a one-line diagram that contains zero searchable text. Optionally, multi-modal embeddings can make the image itself retrievable by visual similarity; we'd add that only if evaluation shows figure-heavy queries underperforming, rather than paying for it speculatively.

The commodity stages — raw OCR and the base layout and table models — we take off the shelf rather than rebuild. Our differentiated value is the orchestration around them: strategy routing per document, the dual image-plus-structure table representation, VLM figure indexing, multi-page reconciliation, and the fallbacks that keep your hardest documents parsing into something useful instead of failing silently.

Additional detail: models, table handling, VLM, multi-modal, full tool stack

The backbone — Unstructured

The parsing layer is built on the Unstructured Python library (unstructured 0.17.0 + unstructured-inference 0.8.10). It converts PDF, DOCX, XLSX, and HTML into a stream of typed elements (Title, NarrativeText, Table, Image, PageBreak), each carrying text, metadata, and (in the high-resolution path) a bounding-box polygon. Everything downstream (chunking, table handling, figure description, retrieval scope) keys off those types and coordinates, which is why getting this layer right is what makes the rest possible. For a 300-page proposal the document is pulled from S3, split into one PDF per page, then parsed in parallel page-buckets. That is a throughput decision with one consequence we handle explicitly (multi-page tables, below).

Strategy selection: three modes, chosen per document (built today)

  • HI_RES — default for PDFs up to ~999 pages with image indexing on. Pages are rendered to images, the layout model finds regions, each region is OCR'd. This is what handles the messy real-world proposals.
  • FAST — for very large PDFs (>999 pages) or when image indexing is off. Reads the native text layer via pdfminer, no layout model. Much cheaper, appropriate where layout is simple.
  • OCR_ONLY — full-page OCR for scanned documents with no text layer, common in older contracts and counterparty NDAs. Also the fallback in the lightweight read path.
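As a rough illustration of the routing described above, the sketch below picks a mode from three document properties. The PdfInfo fields, the choose_strategy name, and the rule that a missing text layer is checked first are assumptions made for this example; only the three modes and the ~999-page threshold come from the text.

```python
from dataclasses import dataclass

# Strategy names as used by Unstructured's PDF partitioning.
HI_RES, FAST, OCR_ONLY = "hi_res", "fast", "ocr_only"
MAX_HI_RES_PAGES = 999  # threshold quoted in the list above


@dataclass
class PdfInfo:
    """Hypothetical summary of a document, gathered before parsing."""
    page_count: int
    has_text_layer: bool   # False for pure scans
    image_indexing: bool   # whether figures should be indexed


def choose_strategy(doc: PdfInfo) -> str:
    """Pick one parsing mode per document (illustrative precedence)."""
    if not doc.has_text_layer:
        return OCR_ONLY          # scans: nothing for a native-text path to read
    if doc.page_count > MAX_HI_RES_PAGES or not doc.image_indexing:
        return FAST              # very large, or layout/images not needed
    return HI_RES                # default for messy real-world proposals


print(choose_strategy(PdfInfo(page_count=300, has_text_layer=True, image_indexing=True)))
```

The returned string is the kind of value that would be passed as the strategy argument when partitioning a PDF.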

The models, and what each one does

Model / component | Role | Why it's needed for Viridon's docs
yolox | Layout detection: finds text blocks, titles, tables, figures as bounding boxes on the rendered page | Multi-column layouts, sidebars, callouts, headers/footers; positional detection preserves reading order instead of interleaving columns.
Table-transformer | Recovers row/column structure including spans for each detected table; emits text_as_html | Turns a table from a meaningless run of numbers into cell-addressable structure.
OCR (Tesseract) | Reads text out of rendered regions and scanned pages | Selection reports and attachments are frequently scans or image-based exports.
pdfminer | Native text-layer extraction (FAST path) | Cheap, accurate path for clean digital PDFs where the layout model isn't warranted.
PyMuPDF (fitz) | Extracts hyperlink annotations, attaches each URL to the nearest element by coordinate | Proposals and contracts cross-reference prior sections and external docs by link.

All of the above run in the pipeline today (Figure 1).

Complex tables — merged cells, nesting, irregular layouts

  • Merged cells and spans (built today): the table-transformer recovers rowspan/colspan and we preserve them in the HTML representation, then parse that HTML with pandas into per-row/per-column child records, so each cell is indexable with its row and column context rather than as a floating value.
  • Two representations, kept together (built today): for every table we retain both the structured cell records and the original table rendered as an image (uploaded to S3, base64 in the payload). Retrieval can hit a specific cell value; a human or a VLM can still see the original visual. We never collapse the table to only one of the two.
  • Irregular / nested layouts (built today): when structure inference fails on a genuinely irregular table, the pipeline retries with infer_table_structure=False and falls back to the table image plus a VLM description. This is graceful degradation: we may not perfectly recover the grid, but we never drop the content.
  • Multi-page tables (net-new for Viridon): because we split the PDF per page for parallel throughput, a table spanning pages N and N+1 is first detected as two separate Table elements. We add a reconciliation pass: detect a table at the bottom of page N and the top of page N+1, confirm matching column count and header signature with no intervening narrative, and stitch them into one logical table before chunking, carrying the header forward onto continuation rows. Where the match is ambiguous, we keep both the stitched table and the per-page originals so nothing is lost to a wrong guess.
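The multi-page reconciliation pass can be pictured with a minimal sketch like the one below. It is not the production pass: the TableElement shape, the starts_page / ends_page flags (standing in for "no intervening narrative text"), and the function names are all hypothetical, and the ambiguous-match case that keeps both versions is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TableElement:
    """Hypothetical, simplified view of a parsed table."""
    first_page: int
    last_page: int
    header: List[str]
    rows: List[List[str]]
    starts_page: bool  # nothing but the table above it on first_page
    ends_page: bool    # nothing but the table below it on last_page


def header_signature(header: List[str]) -> Tuple[str, ...]:
    # Normalise case and whitespace so cosmetic differences do not block a match.
    return tuple(cell.strip().lower() for cell in header)


def is_continuation(prev: TableElement, nxt: TableElement) -> bool:
    return (
        nxt.first_page == prev.last_page + 1
        and prev.ends_page
        and nxt.starts_page
        and len(prev.header) == len(nxt.header)
        and header_signature(prev.header) == header_signature(nxt.header)
    )


def stitch_tables(tables: List[TableElement]) -> List[TableElement]:
    """Merge page-adjacent fragments into one logical table each."""
    stitched: List[TableElement] = []
    for table in sorted(tables, key=lambda t: t.first_page):
        if stitched and is_continuation(stitched[-1], table):
            prev = stitched[-1]
            stitched[-1] = TableElement(
                first_page=prev.first_page,
                last_page=table.last_page,
                header=prev.header,            # header carried forward
                rows=prev.rows + table.rows,   # continuation rows appended
                starts_page=prev.starts_page,
                ends_page=table.ends_page,
            )
        else:
            stitched.append(table)
    return stitched
```

Tracking last_page on the merged element lets a table that runs over three or more pages keep extending, since each new fragment is compared against the end of what has been stitched so far.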

Figures and images: VLM descriptions indexed alongside the image (built today)

A large share of the signal in transmission proposals and ISO/RTO studies lives in figures, not prose. The pipeline extracts every image/figure region, uploads it to S3, and keeps the base64 in metadata. If the region's OCR text comes back empty (i.e. it's a true diagram, not a text box), we run a vision-language model (VLM) over it, either on Amazon Bedrock (reached privately over PrivateLink) or as a self-hosted open model, to generate a natural-language description of what the figure shows. We index that description as searchable text paired with a pointer back to the original image. The practical effect: a query like "deliverability constraint near the X substation" can retrieve a figure that contains zero extractable text, because the VLM's description of it is in the index.
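A minimal sketch of that gating step, assuming the VLM call is injected as a plain function. The names figure_index_record and describe_image, and the record keys, are hypothetical and chosen only for the example.

```python
from typing import Callable, Dict


def figure_index_record(
    ocr_text: str,
    image_key: str,
    describe_image: Callable[[str], str],
) -> Dict[str, object]:
    """Build the searchable record for one extracted figure region.

    describe_image stands in for the VLM call (Bedrock or self-hosted);
    it takes the image's storage key and returns a description.
    """
    text = ocr_text.strip()
    described_by_vlm = False
    if not text:
        # No OCR text: treat the region as a true diagram and describe it.
        text = describe_image(image_key)
        described_by_vlm = True
    return {
        "text": text,                  # what gets embedded and indexed
        "image_key": image_key,        # pointer back to the original image
        "described_by_vlm": described_by_vlm,
    }
```

Passing the describer in as an argument keeps the gating logic testable without a model and leaves the choice of VLM open.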

Multi-modal embeddings (optional)

Today we embed text — including the VLM-generated figure descriptions — into a single text vector space (Vespa). The figure is therefore retrievable via its description. An optional extension adds a multi-modal embedding (a CLIP-style model) so the image itself is embedded into a shared vector space and retrievable by visual similarity, not only through its description. The trade-off is additional indexing infrastructure and cost; in practice the description-indexing approach already captures most of the retrieval value, so we'd enable this only if the eval harness shows figure-heavy queries underperforming.

Other file types (built today)

  • DOCX — a custom Docx picture partitioner extracts embedded images, so figures inside Word proposals get the same image + VLM treatment.
  • XLSX — sheet- and cell-level elements, relevant for model outputs and trackers.
  • HTML — used in the web-research / scraper path for public ISO/RTO context.

Tools & platforms in the parsing stack

Layer | Tool / model | Why this one
Parsing framework | Unstructured (0.17.0 / inference 0.8.10) | Typed-element output with coordinates; one interface across PDF/DOCX/XLSX/HTML; proven on long, messy documents.
Layout detection | yolox | Recovers reading order and region types on complex multi-column pages.
Table structure | table-transformer | Cell-level structure + spans → addressable HTML.
OCR | Tesseract (via Unstructured) | Reads scanned pages and image regions with no text layer.
Native text | pdfminer | Cheap, accurate path for clean digital PDFs.
Hyperlinks | PyMuPDF (fitz) | Recovers and re-attaches link annotations by coordinate.
Table → records | pandas | Turns table HTML into per-cell row/column records.
Figure understanding | VLM (Bedrock or self-hosted open model) | Describes figures so image-only content becomes searchable text.
Embedding + index | text embedding model → Vespa | Embeds text + figure descriptions into the hybrid retrieval index (KL·A).
Object store | S3 | Original assets stay in your tenant.

Edge vs. commodity — preview of 1.5

Raw OCR and the base layout/table models are commoditized — we use strong off-the-shelf components there. Our differentiated value is the orchestration around them: strategy routing, the dual image-plus-structure table representation, VLM figure-description indexing, multi-page table reconciliation, and the graceful-degradation fallbacks that mean Viridon's hardest documents still parse into something useful.


Chunking & metadata

Diligence question 1.2

What is your approach to chunking these documents, and what tools would you use and why? Specifically: what metadata schema would you attach to chunks (e.g. document type, ISO/RTO, date, section), and how would that metadata be created — automated extraction, manual tagging, or a mix?

We chunk in two stages. First, structural chunking from the parse (Figure 1, KL·B) breaks a document into typed elements — narrative blocks, table rows, figure descriptions — so each chunk already knows what kind of thing it is and where it sits. Then each element's text is split by a semantic chunker into topic-coherent pieces, and each piece becomes one embedding (a numeric representation of its meaning, so related passages land near each other). We do this because fixed-size windows cut through the middle of an argument or a table; topic-aware splits keep a retrievable chunk on a single idea, which is what drives retrieval quality on long proposals and ~200-page selection reports.

We don't use one chunker for everything — we match the strategy to the content to control both quality and cost. Long-form prose goes through semantic chunking (the primary strategy); short, already-structured fields (table rows, short metadata snippets) use cheaper recursive or fixed-length splitting, because running the full semantic pass on a one-line field spends embedding calls for no retrieval gain. Each Vespa record is capped at five embeddings so no single record becomes a bag of unrelated vectors. The tooling is a custom semantic chunker (Greg Kamradt's percentile-breakpoint method, run on our own embeddings rather than a third-party wrapper), LangChain's recursive character splitter for structured content, and Vespa as the index.

Every chunk lands in our standard_document schema, which carries a deep metadata set on each record: document type and subtype; page number and on-page coordinates (for exact citation); section and hierarchy IDs; table position (row, column, is-table-root); figure pointers; created/updated dates and a version; folder path; and access scope (organization, owner, collaborators). The schema also includes typed key/value maps — string→string, string→int, string→double — so Viridon-specific dimensions like ISO/RTO, sponsor, project, or filing date attach to chunks without a schema change when a new dimension shows up. Maps are how we store arbitrary per-client metadata as dictionaries rather than hard-coding columns.

That metadata drives two kinds of filtering. Hard filtering uses the exact-match maps — "only chunks where ISO_RTO = CAISO and document_type = selection report" — applied before ranking to scope the search precisely. Soft filtering uses metadata that's full-text indexed (titles, filenames, folder path, short structured fields): it's folded into the hybrid score via BM25 — the standard keyword-matching algorithm — so a matching project name boosts a chunk's relevance without excluding anything. (Hybrid search blends this keyword signal with vector search, which matches on meaning rather than exact words.) And because we set this per field, metadata can be made BM25-searchable or kept filter-only by choice — ISO/RTO can be a hard filter, a soft ranking signal, or both, depending on how you want it to behave.

Metadata is created mostly automatically, with light human curation. The parser emits structure, coordinates, page, and hierarchy; file and source systems give dates, folder path, and access scope; a classifier assigns document type/subtype; and an LLM extraction pass pulls Viridon-specific values (ISO/RTO, sponsor, project, key dates) out of the content into the maps. Humans confirm the taxonomies and correct the occasional misclassification — corrections feed back rather than being re-done each time. This whole step is the Ingestion & Structuring component (KL·B) feeding the index (KL·A).

Additional detail: two-stage model, chunkers, full metadata schema, hard vs soft filtering

The two-stage model (built today)

Most file indexing (PDF, DOCX, TXT, PPTX, XLSX) follows the same shape: a structured element's text → SemanticChunker → a long_text_data[] array (each entry independently embedded) → break_element_arr_into_semi_equal_lengths (max 5) → one or more Vespa records. Semantic chunking decides how to split a block's text for embedding; the structural pass before it decides what each block is. The record-splitter is not a text splitter — it caps each record at five embeddings and distributes the chunks evenly, so retrieval granularity stays clean and no single record stores too many vectors.
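To make the record-splitting step concrete, here is a small sketch of one way to distribute chunks evenly under a cap of five per record. It is an illustrative reimplementation, not the pipeline's break_element_arr_into_semi_equal_lengths; the function name and the exact balancing rule are assumptions.

```python
import math
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_semi_equal_groups(items: Sequence[T], max_per_group: int = 5) -> List[List[T]]:
    """Split items into the fewest groups of at most max_per_group,
    keeping group sizes within one of each other."""
    if max_per_group < 1:
        raise ValueError("max_per_group must be at least 1")
    if not items:
        return []
    n_groups = math.ceil(len(items) / max_per_group)
    base, extra = divmod(len(items), n_groups)
    groups, start = [], 0
    for i in range(n_groups):
        size = base + (1 if i < extra else 0)  # first `extra` groups take one more
        groups.append(list(items[start:start + size]))
        start += size
    return groups


# 11 chunks under a cap of 5 become three records rather than 5 + 5 + 1.
print([len(g) for g in split_into_semi_equal_groups(list(range(11)))])
```

Balancing the groups avoids a trailing record that holds a single stray chunk, which is the situation the even distribution is meant to prevent.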

1 · Semantic chunking: primary strategy (built today)

The default for almost all document indexing, based on Greg Kamradt's percentile-breakpoint approach, customized to run on our own encode_many rather than a LangChain embedding wrapper. The algorithm:

  • Split text into sentences by regex (((?<=[.?!])\s+|\n)).
  • Combine each sentence with its neighbors (buffer size 1) into "combined sentences".
  • Embed the combined sentences in batches of 500.
  • Compute cosine distance between adjacent combined sentences; mark a breakpoint wherever distance exceeds the 95th-percentile threshold.
  • Group the sentences between breakpoints into chunks.
  • Post-process with a recursive character split to enforce ~32–512 char bounds, merge tiny chunks, and split oversized ones.

Used by PDF, DOCX, TXT, PPTX, XLSX and the shared embedding helper — i.e. the whole Viridon corpus.
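The steps above can be sketched in plain Python as follows. This is a simplified illustration of the percentile-breakpoint idea, not the production chunker: encode_many is passed in as a stand-in for the embedding call, batching in groups of 500 and the final recursive post-processing are omitted, and the sentence regex is written without a capture group so that re.split does not return the separators.

```python
import math
import re
from typing import Callable, List, Sequence

SENTENCE_SPLIT = re.compile(r"(?<=[.?!])\s+|\n")


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values: Sequence[float], pct: float) -> float:
    """Linear-interpolation percentile over a non-empty sequence."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_chunks(
    text: str,
    encode_many: Callable[[List[str]], List[Sequence[float]]],
    buffer_size: int = 1,
    breakpoint_pct: float = 95.0,
) -> List[str]:
    sentences = [s.strip() for s in SENTENCE_SPLIT.split(text) if s and s.strip()]
    if len(sentences) < 2:
        return sentences
    # Each sentence is embedded together with its neighbours.
    combined = [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    vectors = encode_many(combined)
    distances = [
        cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, breakpoint_pct)
    chunks, start = [], 0
    for i, distance in enumerate(distances):
        if distance > threshold:  # topic shift between sentence i and i + 1
            chunks.append(" ".join(sentences[start:i + 1]))
            start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks
```

Because the threshold is a percentile of the document's own distances, the number of breakpoints scales with document length rather than with a fixed similarity cut-off.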

2 · Fallback splitters (built today)

  • Recursive character split (LangChain, splitting on paragraph → line → word → char to a 512-char cap) — for structured or already-sectioned content: heading/content blocks, table rows, short fields, where fixed sizes are predictable and embedding cost should stay low.
  • Fixed-length (20k char window, 200 overlap, no semantics) — only where content is already short structured snippets and the semantic pass would add cost without benefit.
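For comparison, the fixed-length fallback is simple enough to sketch in a few lines. The function name is hypothetical; the defaults mirror the window and overlap quoted above.

```python
from typing import List


def fixed_length_split(text: str, window: int = 20_000, overlap: int = 200) -> List[str]:
    """Slide a fixed-size character window over the text, with overlap
    between consecutive windows and no awareness of sentence boundaries."""
    if window <= overlap:
        raise ValueError("window must be larger than overlap")
    if not text:
        return []
    step = window - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + window] for i in range(0, last_start, step)]


print(fixed_length_split("abcdefghij", window=4, overlap=1))
```

No embedding calls are needed to decide the split points, which is why this path is the cheapest of the three.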

Cost discipline: semantic chunking embeds every combined-sentence pair for the distance calculation and then the final chunks — materially more embedding calls than recursive/fixed splitting. Reserving it for long-form prose and using cheaper splitters for short structured fields is a deliberate cost choice.

3 · Structural chunking first (built today)

Before any of the above, documents are chunked by structure: tables become row/column children (carrying row_num, col_num, is_table_root), images get VLM descriptions, and hierarchy is captured via group IDs. Only the resulting text (or image description) then goes through semantic chunking.

The metadata schema: standard_document (built today)

Each chunk is one standard_document record (inheriting a shared base schema). The fields map directly onto everything Viridon needs:

Need | Schema field(s) | How it's created | Indexing
Document type | document_type, document_subtype | Classifier at ingest (proposal, selection report, RFI, NDA, ISO/RTO study) | filter
ISO/RTO, sponsor, project, dates | string_string_hard_filter_map, string_int_hard_filter_map, string_double_hard_filter_map | LLM extraction + source metadata; no schema change for new dimensions | hard filter (exact)
Section / hierarchy | group_id, parent_group_id, hierarchy_group_ids | Structural parse | filter / scope
Page & location | page_number, coordinates (map<string,float>) | Parser (Figure 1) | filter / citation
Table position | row_num, col_num, is_table_root, is_number_only | Table parser | filter
Figures | image_uuid, image_aws_key, is_image_useful | Image + VLM pipeline | filter
Date / recency / version | document_created_at, created_at, updated_at, version | File & source metadata | attribute (filter + rank)
Location in tenant | folder_path, folder_path_ids | SharePoint tree | BM25 + filter
Access scope (RBAC: role-based access control) | organization_id(s), owner_id, collaborator_ids | Source / SSO | filter (→ KL·H)
Titles | name, filename | Source | BM25
Short structured fields | short_text_field_data (weightedset) | Extraction | BM25 (weighted)
The chunk text itself | long_text_data (array<string>) + long_text_embeddings | Semantic chunker | BM25 + semantic

Hard filtering — exact metadata scoping

The three typed maps (string→string, string→int, string→double) are stored with exact match on both key and value, as fast-search attributes. That's what lets a query say "restrict to ISO_RTO = CAISO, document_type = selection report, year ≥ 2023" and have Vespa narrow the candidate set before ranking. Because they're maps, adding a new metadata dimension is a data change, not a schema migration.
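The effect of hard filtering can be modelled in a few lines of Python. This is an in-memory illustration of the semantics only, not how Vespa evaluates the filter; the Chunk class and the hard_filter signature are hypothetical, with string_map and int_map standing in for the string→string and string→int maps.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class Chunk:
    chunk_id: str
    document_type: str
    string_map: Dict[str, str] = field(default_factory=dict)
    int_map: Dict[str, int] = field(default_factory=dict)


def hard_filter(
    chunks: Iterable[Chunk],
    document_type: Optional[str] = None,
    string_equals: Optional[Dict[str, str]] = None,
    int_at_least: Optional[Dict[str, int]] = None,
) -> List[Chunk]:
    """Keep only chunks that satisfy every condition; runs before ranking."""
    string_equals = string_equals or {}
    int_at_least = int_at_least or {}
    kept = []
    for chunk in chunks:
        if document_type is not None and chunk.document_type != document_type:
            continue
        # Exact match on key and value; a missing key fails the filter.
        if any(chunk.string_map.get(k) != v for k, v in string_equals.items()):
            continue
        if any(chunk.int_map.get(k, float("-inf")) < v for k, v in int_at_least.items()):
            continue
        kept.append(chunk)
    return kept
```

A new dimension such as a sponsor name is just another key in string_map, which is the sense in which adding metadata is a data change rather than a schema migration.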

Soft filtering — metadata as a ranking signal, BM25-searchable by choice

Fields marked enable-bm25 (long_text_data, name, filename, folder_path, short_text_field_data) are full-text searchable and carry per-field weights in the rank profile (e.g. name/filename weighted ~2× body text), so matching metadata boosts relevance rather than excluding. Fields marked attribute-only (document type, page, the hard-filter maps) are not BM25-searched — they're pure filters. This is a per-field choice: any piece of metadata can be made a soft ranking signal (indexed), a hard filter (attribute), or both. There are also trigram (gram-size 3) variants of the indexed fields for fuzzy/typo-tolerant matching, and an empty-field discount so a chunk isn't unfairly penalized for missing an optional field. The full hybrid rank profile — BM25 / semantic / n-gram weighting, multi-vector closeness over long_text_embeddings, and reranking — is covered in Section 1.4 (Retrieval).
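To show why trigram variants give typo tolerance, the sketch below compares two strings by the overlap of their character trigrams. It illustrates the general idea with a Jaccard score; it is not Vespa's n-gram matching, and both function names are made up for the example.

```python
from typing import Set


def trigrams(text: str) -> Set[str]:
    """All overlapping 3-character windows of the lower-cased text."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of trigram sets: 1.0 for identical sets, 0.0 for disjoint."""
    grams_a, grams_b = trigrams(a), trigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# A one-letter typo still shares most trigrams with the intended word.
print(trigram_similarity("deliverability", "deliverabilty"))
print(trigram_similarity("deliverability", "substation"))
```

An exact-token match would score the misspelled query as a miss, whereas the trigram overlap stays high enough to contribute to the ranking.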

How metadata is created — automated, extracted, or curated

  • Automated from the pipeline: page, coordinates, hierarchy, table position, chunk IDs, figure pointers (the parser); dates, folder path, access scope (file & source systems).
  • LLM-extracted: ISO/RTO, sponsor, project, referenced dates and other Viridon dimensions, pulled from content into the maps and short fields during ingestion.
  • Human-curated (light): confirming the document-type and ISO/RTO taxonomies and correcting misclassifications; corrections feed back so they're not repeated. The taxonomy is configured per client (here, for Viridon), on top of the mechanism that already exists.


Embedding

Diligence question 1.3

What is your approach to embedding, and which model(s) or platform(s) would you use and why?

For Viridon we'd run a privacy-first, model-agnostic embedding layer on Amazon Bedrock, so every document is embedded entirely within Viridon's own AWS environment. We treat the embedding model as a swappable component rather than a fixed dependency — the retrieval architecture doesn't change when the model does — but the deployment posture (in-tenant, no data egress) is the part we'd hold fixed, because it's important in ensuring this system becomes an asset in your data room at time of exit.

The reason to anchor on Bedrock is data control. Accessed through an AWS PrivateLink VPC endpoint, embedding traffic stays on the AWS network within Viridon's chosen region and never crosses the public internet. Bedrock does not use inputs or outputs to train any model, and does not share them with model providers; data is encrypted in transit and at rest, optionally under Viridon's own KMS keys; and the service carries the compliance coverage a Blackstone-backed infrastructure company will be asked about (SOC 1/2/3, ISO 27001 and family, HIPAA-eligible, GDPR, FedRAMP). The practical consequence for the exit story: the embedding index is an owned asset with no third-party data exposure that could reprice a deal.

On model choice, the default for text is Amazon Titan Text Embeddings V2 — optimized for RAG (retrieval-augmented generation: answering from retrieved documents), multilingual, with selectable output dimensions (256 / 512 / 1024) and unit-normalized vectors. For the figure-heavy ISO/RTO and transmission material, we'd use a multimodal embedding model — Amazon Nova Multimodal Embeddings (a single model spanning text, documents, images, video and audio, with cross-modal retrieval) or Titan Multimodal Embeddings G1. This is what makes the multimodal retrieval we flagged in Parsing (1.1) real: a one-line diagram or deliverability chart is embedded into the same space as the surrounding text, so a text query can retrieve the figure directly, not only via its written description. If a privacy requirement or an eval result points elsewhere, Cohere embeddings on Bedrock — or self-hosted open-source models (e.g. BGE, E5, GTE) that run entirely inside the VPC — are also available; the choice is eval-gated and privacy-gated, not locked.
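For orientation, a call to Titan Text Embeddings V2 through the Bedrock runtime API looks roughly like the sketch below. The helper names are illustrative, the boto3 bedrock-runtime client is assumed to be created by the caller (region, credentials, and the PrivateLink endpoint are deployment details not shown), and error handling is omitted.

```python
import json
from typing import Any, List

TITAN_V2_MODEL_ID = "amazon.titan-embed-text-v2:0"
ALLOWED_DIMENSIONS = (256, 512, 1024)


def build_titan_v2_request(text: str, dimensions: int = 1024, normalize: bool = True) -> str:
    """Serialise the JSON body Titan Text Embeddings V2 expects."""
    if dimensions not in ALLOWED_DIMENSIONS:
        raise ValueError(f"dimensions must be one of {ALLOWED_DIMENSIONS}")
    return json.dumps(
        {"inputText": text, "dimensions": dimensions, "normalize": normalize}
    )


def embed_text(client: Any, text: str, dimensions: int = 1024) -> List[float]:
    """Embed one string. `client` is a boto3 'bedrock-runtime' client
    supplied by the caller."""
    response = client.invoke_model(
        modelId=TITAN_V2_MODEL_ID,
        body=build_titan_v2_request(text, dimensions),
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(response["body"].read())["embedding"]
```

Keeping the request builder separate from the network call means the same wrapper can sit behind a single encode interface, so swapping in a different model changes only this module.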

Two engineering invariants govern the layer. First, index and query must use the same model and the same dimension — both sides are wired to the chosen Bedrock model, and the vector index's tensor dimension is set to match. Second, we pick the output dimension to balance accuracy against storage and latency; these models use Matryoshka-style dimensions (embeddings trained so they can be safely truncated to a shorter length), so e.g. dropping from 1024 to 512 keeps ~99% of retrieval accuracy at half the storage. This is the detail behind the Indexing & Retrieval component (KL·A) in Figure 1.
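The storage side of that trade-off is simple arithmetic, and the truncation idea itself can be shown in a few lines. The sketch below is illustrative only: truncating is appropriate only for models trained to support it, in practice the shorter dimension would be requested from the model directly, and the function names and the 4-bytes-per-value figure (float32) are assumptions.

```python
import math
from typing import List, Sequence


def truncate_embedding(vector: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


def storage_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage, ignoring index overhead."""
    return num_vectors * dim * bytes_per_value


# One million chunk vectors: 1024 dims versus 512 dims.
print(storage_bytes(1_000_000, 1024), storage_bytes(1_000_000, 512))
```

Re-normalising after truncation keeps the shortened vectors unit-length, so closeness scores stay comparable across the index.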

Finally, embeddings are one signal of three, not the whole story. Retrieval blends semantic (vector closeness) with BM25 (lexical) and n-gram (fuzzy) matching, and the semantic weight is automatically zeroed for chunks that have no vector (e.g. number-only table cells) so lexical matching still works. That design is what lets the embedding model be swapped without rebuilding retrieval — the full ranker is covered in Section 1.4.

Additional detail: Bedrock privacy, model options, dimensions, cross-modal retrieval

Why Amazon Bedrock: data privacy in depth (recommended for Viridon)

  • Network isolation — AWS PrivateLink creates an interface VPC endpoint in Viridon's subnets; embedding traffic stays on the AWS network within the chosen region and never traverses the public internet.
  • No training, no sharing, no retention — inputs and outputs are not used to train any foundation model, and not shared with model providers. Providers have no access to the isolated model-deployment accounts.
  • Encryption & keys — TLS in transit, KMS at rest, with the option of customer-managed keys.
  • Residency — data stays in the chosen region; no cross-region movement unless explicitly opted in.
  • Compliance — SOC 1/2/3, ISO 27001 / 27017 / 27018 / 27701, CSA STAR L2, HIPAA-eligible, GDPR, FedRAMP — the documentation a sponsor or acquirer's diligence will expect.

Recommended models

Model | Modality | Output dims | Why / when
Titan Text Embeddings V2 (titan-embed-text-v2:0) | Text (8k tokens, 100+ languages) | 256 / 512 / 1024 | Default for prose. RAG-tuned, normalized, flexible dimension for storage/latency control.
Amazon Nova Multimodal Embeddings | Text · document · image · video · audio (unified) | 256 / 384 / 1024 / 3072 | Best fit for figure-heavy ISO/RTO + transmission docs; cross-modal retrieval in one space.
Titan Multimodal Embeddings G1 (titan-embed-image-v1) | Text + image (shared space) | 256 / 384 / 1024 | Lighter multimodal option for image-by-text / image-by-image search.
Cohere (on Bedrock) · open models (BGE / E5 / GTE) | Text (model-dependent) | model-dependent | Eval/privacy-gated alternatives. Open models run fully self-hosted in-VPC.
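For orientation, a hedged sketch of building the request for the default text model. The helper name is hypothetical; the full model ID and the `inputText` / `dimensions` / `normalize` fields reflect our understanding of the Titan Text Embeddings V2 request schema and should be checked against the current Bedrock documentation. The resulting JSON string is what would be passed as the body of a Bedrock runtime InvokeModel call over the PrivateLink endpoint.

```python
import json

# Assumed full Bedrock identifier for the model listed above as titan-embed-text-v2:0.
TITAN_V2_MODEL_ID = "amazon.titan-embed-text-v2:0"
TITAN_V2_DIMS = (256, 512, 1024)  # output dimensions from the table above

def titan_v2_request_body(text, dimensions=1024, normalize=True):
    """Build the JSON body for one embedding request (hypothetical helper)."""
    if dimensions not in TITAN_V2_DIMS:
        raise ValueError(f"dimensions must be one of {TITAN_V2_DIMS}, got {dimensions}")
    if not text.strip():
        raise ValueError("text must not be empty")
    return json.dumps(
        {"inputText": text, "dimensions": dimensions, "normalize": normalize}
    )
```

Validating the dimension before the call keeps an unsupported value from reaching the index, where it would surface later as a tensor-size mismatch.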

Multimodal & cross-modal retrieval for Viridon

A large share of Viridon's signal lives in diagrams — one-lines, substation layouts, deliverability charts. In Parsing (1.1) we make those searchable by indexing a VLM-written description of each figure. A multimodal embedding model goes one step further: it embeds the figure itself into the same vector space as text, so a text query retrieves the image by visual-semantic similarity, not only through its description. Nova Multimodal Embeddings is explicitly designed for exactly this — searching documents that mix infographics and text — which is why it's our lead recommendation where figure retrieval matters.

Dimensions & the index invariant

The embedding dimension must match the vector index. Bedrock's models expose Matryoshka-style dimensions, so we choose a point on the accuracy/cost curve — typically 1024, or 512 where storage and latency matter and ~99% of accuracy is retained — and set the index tensor dimension to match. The one hard rule: the same model and dimension are used at index time and query time. Changing the model later means re-embedding the corpus and updating the index dimension, which is a deliberate, eval-gated migration rather than a silent swap.
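The hard rule can be enforced as a startup check. This is a minimal sketch under assumed names (`EmbeddingSpec`, `invariant_violations`); it only compares the two configurations and says nothing about how either is stored.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingSpec:
    model_id: str
    dimension: int

def invariant_violations(index_spec, query_spec):
    """Return a list of human-readable problems; an empty list means safe to serve."""
    problems = []
    if index_spec.model_id != query_spec.model_id:
        problems.append(
            f"model mismatch: index={index_spec.model_id}, query={query_spec.model_id}"
        )
    if index_spec.dimension != query_spec.dimension:
        problems.append(
            f"dimension mismatch: index={index_spec.dimension}, query={query_spec.dimension}"
        )
    return problems
```

Failing fast on a non-empty list turns a model change into the deliberate migration described above rather than a silent swap.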

Throughput

Bedrock offers asynchronous / batch embedding jobs for indexing large corpora (the 300+ page proposals) and a latency-optimized path for query-time embedding, which maps cleanly onto our parallel indexing design and keeps interactive search fast.

Hybrid retrieval recap (built today)

Embeddings feed the semantic leg of a three-signal hybrid ranker (semantic + BM25 + n-gram, default blend ≈ 0.5 / 0.4 / 0.1). Semantic weight auto-zeroes for chunks with no vector so lexical and fuzzy matching still operate. Because retrieval is hybrid and the embedding model is abstracted behind a single encode interface, the model is genuinely swappable — the retrieval architecture, covered in 1.4, is unchanged by the choice.

01 — Ingestion · Indexing · Retrieval

Retrieval

Diligence question 1.4

Walk us through your retrieval approach, including the tools or platforms used and why. Specifically: is retrieval semantic-only or hybrid (keyword + semantic), and what signals drive ranking?

Retrieval is hybrid, not semantic-only, and it's a multi-stage pipeline rather than a single vector lookup. Every search fuses three signals inside one Vespa query — BM25 (lexical/keyword), semantic (vector nearest-neighbour over the chunk embeddings), and n-gram (character-trigram, typo-tolerant) — and several lightweight LLM steps wrap around that core to handle the messiness of real enterprise questions. Vespa (open-source, self-hosted in Viridon's VPC) is the engine because it does true hybrid retrieval and two-phase ranking in a single query; the query is embedded with the same Bedrock model used at index time (Section 1.3).

Before anything is searched, two things happen. Query distillation rewrites a conversational message into a standalone query — "what about their deliverability?" becomes "Sunrise project deliverability study outcome" using the recent conversation history, with explicit instructions not to over-interpret domain terms. Then multi-query expansion generates three additional phrasings (synonyms, full-forms/abbreviations like "CAISO" ↔ "California ISO", and different angles), so we typically search with four parallel queries. This lifts recall on under-specified questions, which is the common failure mode on a corpus this varied.
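The fan-out step can be sketched as follows. Both function names are hypothetical, `search_fn` stands in for whatever performs one hybrid search, and the thread pool is an assumption about how the parallelism might be implemented, not a description of the production code.

```python
from concurrent.futures import ThreadPoolExecutor

def build_query_set(distilled, expansions):
    """Expanded phrasings followed by the distilled query, de-duplicated."""
    seen, queries = set(), []
    for q in [*expansions, distilled]:
        key = " ".join(q.lower().split())  # case- and whitespace-insensitive key
        if key and key not in seen:
            seen.add(key)
            queries.append(q)
    return queries

def search_all(queries, search_fn, max_workers=4):
    """Run search_fn over every query in parallel; results keep query order."""
    if not queries:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(search_fn, queries))
```

De-duplicating before the fan-out avoids spending a search on an expansion that merely restates the distilled query.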

Each query then runs hybrid search in Vespa, scoped by organization, optional source selection, and hard metadata filters (the maps from 1.2 — e.g. ISO_RTO = CAISO). This is the Scoped Retrieval component (KL·H) in Figure 1. The lexical legs run over text plus high-value fields (name, filename, folder path); the semantic leg runs an approximate-nearest-neighbour search over the multi-vector chunk embeddings, with a candidate set of ~500 before ranking.

Ranking is where the signals combine, and it's fully configurable through Vespa rank profiles. The default profile blends semantic at 0.5, BM25 at 0.4, and n-gram at 0.1 in a first phase, then re-ranks the top 500 in a global phase with each signal linearly normalized (a master profile offers reciprocal-rank fusion — a standard method for merging several ranked lists — as an alternative). On top of that, field-level weights mean a match in a document's name or filename (weight 300) outranks the same match in body text (150), an empty-field discount keeps a chunk from being penalized for missing optional metadata, and the semantic leg auto-zeroes for chunks that have no embedding (number-only table cells) so lexical matching still surfaces them. A per-word significance step scores each query term HIGH / MEDIUM / LOW (1.0 / 0.5 / 0.01) so filler words don't pollute the lexical match while the full-query embedding stays intact.
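To make the two fusion options concrete, here is a sketch of linear (min-max) normalization and reciprocal-rank fusion in plain Python. It illustrates the general techniques rather than Vespa's internals; the constant k = 60 is a commonly used value and is assumed here, and both function names are hypothetical.

```python
def min_max_normalize(scores):
    """Linearly rescale scores to [0, 1]; a constant list maps to all zeros."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of IDs: each list contributes 1 / (k + rank) per ID."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
```

Linear normalization keeps the size of score gaps, so the signal weights still matter; reciprocal-rank fusion uses only positions, which makes it robust when the signals' score scales are not comparable.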

After the parallel searches return, an application fusion layer merges and de-duplicates across the four queries and boosts documents that matched multiple phrasings (final = top score + 0.2 × second-best), a deliberate recall amplification for results that show up under several framings. Precision passes sit on top. Vespa's global-phase rerank is always on. Two further passes are optional: a self-hosted open cross-encoder reranker (e.g. BGE-reranker), a slower, more accurate model that re-scores the shortlist for precision, can be added; and an LLM source-picker can read the shortlist and return an ordered set of the most relevant sources before the answer is generated.

Finally, custom ranking is a first-class lever, not an afterthought — the rank profile is tuned per customer. The signal weights, the field weights, significance on/off, linear-norm vs reciprocal-rank fusion, scope defaults, and image inclusion are all configurable per request or per org without code changes, and we tune them against Viridon's eval set (Section 3). A worked retrieval trace on a representative document is below; we'd deliver the full trace on a Viridon document of your choosing as the requested artifact.

Additional detail: pipeline, rank profiles & weights, fusion, worked retrieval trace

The retrieval pipeline, end to end (built today)

  • 1 · Query distillation: an LLM rewrites the conversational message into a standalone query using the last several turns; streamed to the user as "Generating search query…".
  • 2 · Multi-query expansion: an LLM produces 3 extra phrasings (synonyms, full-forms/abbreviations, alternate angles); the original is appended → ~4 parallel queries.
  • 3 · Term significance: an LLM scores each word HIGH (1.0), MEDIUM (0.5) or LOW (0.01), seeing both the original question and the expanded query so it weights by actual intent.
  • 4 · Hybrid Vespa search: BM25 + semantic + n-gram fused in one YQL request per expanded query, scoped and filtered.
  • 5 · Rank profile: two-phase weighted ranking inside Vespa.
  • 6 · Cross-query fusion: merge, dedupe, multi-match boost.
  • 7 · Enrichment: hits become multimodal context (text, images via presigned URLs, table structure) for the answer LLM; optional cross-encoder rerank / LLM source-pick.

Hybrid search — three modes in one query

A single Vespa request combines all three retrieval modes, plus the per-word significance weights injected into the lexical legs:

    or userInput(@query)                                                             // BM25 (lexical)
    or ({defaultIndex:"grams"} userInput(@query))                                    // n-gram (fuzzy)
    or ({targetHits:500} nearestNeighbor(long_text_embeddings, prompt_embedding))    // semantic (ANN)

The semantic leg uses approximate nearest-neighbour over the multi-vector chunk embeddings with a ~500-candidate target; the final hits count returned is tunable (e.g. 30 for the API, up to 500 for the full RAG path).
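A hedged sketch of assembling such a request body in Python follows. The function name, the `default` rank-profile name, and the treatment of `ISO_RTO` as a plain filterable field are assumptions for illustration; `standard_document`, `grams`, `long_text_embeddings` and `prompt_embedding` are taken from the text above. The function only builds the dictionary and does not send it.

```python
def build_hybrid_query(query_text, query_embedding, iso_rto=None,
                       hits=30, target_hits=500, rank_profile="default"):
    """Build a Vespa query body combining the lexical, n-gram and ANN legs."""
    yql = (
        "select * from standard_document where ("
        "userInput(@query) "
        'or ({defaultIndex:"grams"}userInput(@query)) '
        f"or ({{targetHits:{target_hits}}}"
        "nearestNeighbor(long_text_embeddings, prompt_embedding))"
        ")"
    )
    if iso_rto is not None:
        # Accept only simple tokens so the filter value cannot alter the YQL.
        if not iso_rto.replace("_", "").isalnum():
            raise ValueError(f"unexpected filter value: {iso_rto!r}")
        yql += f' and ISO_RTO contains "{iso_rto}"'
    return {
        "yql": yql,
        "query": query_text,                            # bound to @query
        "hits": hits,                                   # final hit count returned
        "ranking.profile": rank_profile,
        "input.query(prompt_embedding)": query_embedding,
    }
```

Keeping the hard metadata filter outside the `or` group means it constrains all three legs at once, which matches the scoping behaviour described in 1.4.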

Rank profiles & the signals that drive ranking (built today)

Two phases. First-phase computes a weighted blend of the three signals; global-phase re-ranks the top 500 with each signal linearly normalized (the master profile swaps in reciprocal-rank fusion when use_reciprocal_rank is set). The defaults on standard_document:

SignalDefault weightMechanism
Semantic0.5closeness of query embedding vs the chunk's multi-vector long_text_embeddings
BM25 (lexical)0.4BM25 over long_text_data, name, filename, folder_path, short fields
N-gram0.1nativeRank over the trigram fields — typo / partial-token tolerance

Inside the BM25 leg, fields are weighted so identity matches win: name 300, filename 300, folder_path 200, body text 150, short fields 150. Two guards matter: an empty-field discount (×0.1) so a chunk isn't penalized for lacking an optional field, and a semantic auto-zero — if a chunk has no embedding (number-only cells), its semantic weight drops to 0 and the blend renormalizes over the lexical signals so the chunk is still retrievable. All weights are inputs, so they're overridable per request.
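The blend and the semantic auto-zero guard can be written out in a few lines. This is an illustrative sketch, not the rank-profile expression itself: it assumes each signal has already been normalized to a comparable 0 to 1 range, and the function name is hypothetical.

```python
DEFAULT_WEIGHTS = {"semantic": 0.5, "bm25": 0.4, "ngram": 0.1}

def blend(bm25, ngram, semantic=None, weights=DEFAULT_WEIGHTS):
    """Weighted blend of the three signals.

    When `semantic` is None (the chunk has no embedding) its weight is
    dropped and the result is renormalized over the remaining signals.
    """
    active = {"bm25": bm25, "ngram": ngram}
    if semantic is not None:
        active["semantic"] = semantic
    total_weight = sum(weights[name] for name in active)
    if total_weight == 0:
        return 0.0
    return sum(weights[name] * score for name, score in active.items()) / total_weight
```

With the default weights, a chunk scoring 0.82 / 0.91 / 0.40 on BM25 / semantic / n-gram blends to about 0.82, while a chunk with no vector and lexical scores of 0.55 and 0.65 blends to (0.4·0.55 + 0.1·0.65) / 0.5 = 0.57.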

Term significance — which words matter

Per-word weights are injected into the YQL so noisy lexical matches on filler words are suppressed while the full-query embedding is untouched:

    default contains ({weight:1.0, significance:HIGH} "deliverability")
    default contains ({weight:0.01, significance:LOW} "the")

HIGH (1.0) for names/entities, MEDIUM (0.5) for secondary context, LOW (0.01) for stopwords/generic terms. Significance can be disabled per org where it isn't helping.
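A small sketch of the label-to-weight mapping follows. The labels and weights are those given above; the function name and the choice to fall back to a uniform weight of 1.0 when significance is disabled are assumptions for illustration.

```python
SIGNIFICANCE_WEIGHTS = {"HIGH": 1.0, "MEDIUM": 0.5, "LOW": 0.01}

def weight_terms(labelled_terms, enabled=True):
    """Map (term, label) pairs to (term, weight) pairs for the lexical legs.

    With significance disabled, every term gets the same weight (assumed 1.0).
    """
    if not enabled:
        return [(term, 1.0) for term, _ in labelled_terms]
    return [(term, SIGNIFICANCE_WEIGHTS[label.upper()]) for term, label in labelled_terms]
```

Only the lexical legs consume these weights; the query embedding is computed from the full, unweighted query text.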

Cross-query fusion (built today)

Within each query's result set, signals are linearly normalized and re-scored with the per-hit weights; across the four queries, hits are de-duplicated by ID and a document that matched multiple phrasings is boosted: final = top score + 0.2 × second-highest score. This intentionally amplifies recall for documents that surface under several framings of the same question.
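The multi-match boost is simple enough to state as code. The sketch below assumes each query returns a list of (document ID, blended score) pairs; the function name is hypothetical, and the tie-break by ID is an added assumption so the output order is stable.

```python
def fuse_across_queries(results_per_query, boost=0.2):
    """De-duplicate hits by ID and apply: final = top + boost * second-highest."""
    scores_by_id = {}
    for hits in results_per_query:
        best_in_query = {}
        for doc_id, score in hits:
            # A document counts once per query, at its best score in that query.
            best_in_query[doc_id] = max(score, best_in_query.get(doc_id, score))
        for doc_id, score in best_in_query.items():
            scores_by_id.setdefault(doc_id, []).append(score)
    fused = {}
    for doc_id, scores in scores_by_id.items():
        ordered = sorted(scores, reverse=True)
        second = ordered[1] if len(ordered) > 1 else 0.0
        fused[doc_id] = ordered[0] + boost * second
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
```

A document seen under one phrasing keeps its score unchanged; one seen under two or more gains 20% of its second-best score, and any third or later match adds nothing further.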

Reranking & per-customer tuning

  • Vespa global-phase rerank: top-500 re-score, always on. (built)
  • Cross-encoder reranker: a self-hosted open cross-encoder reranker (e.g. BGE-reranker / mxbai-rerank) for higher precision on the shortlist; addable to the main path and runs fully in-VPC. (optional)
  • LLM source-picker: optionally an LLM reads the shortlist and returns an ordered most-relevant set before answer generation. (optional · built)
  • Per-customer rank profile: signal weights, field weights, significance on/off, fusion method, scope and image-inclusion defaults, all tuned to Viridon against the eval set, no code change. (tuned for Viridon)

Worked retrieval trace, representative (illustrative; full artifact on a Viridon doc)

User message (with history)

"what was the deliverability outcome for Sunrise in CAISO?"

After distillation + expansion → 4 parallel queries

Sunrise project deliverability study outcome CAISO Sunrise transmission deliverability assessment California ISO Sunrise full capacity deliverability status CAISO Sunrise interconnection deliverability result

Term significance (query 1)

Sunrise · 1.0 | project · 0.01 | deliverability · 1.0 | study · 0.5 | outcome · 0.5 | CAISO · 1.0

Hard filter applied: ISO_RTO = CAISO.

Candidate chunks — per-signal scores → blended (0.4·bm25 + 0.5·sem + 0.1·ngram)

Chunk | BM25 | Semantic | N-gram | Blended | Matched queries
A · Selection report, "Deliverability" §, p.142 | 0.82 | 0.91 | 0.40 | 0.82 | 1, 2, 3
D · file Sunrise_Deliverability_Study.pdf | 0.88 (filename wt 300) | 0.66 | 0.55 | 0.74 | 1, 2, 4
B · Proposal exec summary mention | 0.70 | 0.74 | 0.30 | 0.68 | 1
C · Table cell "FCD status: Conditional" | 0.55 | n/a (no vector) | 0.65 | 0.57 | 1, 3

Chunk C is a number-only cell: semantic auto-zeroes and the blend renormalizes over the lexical signals — (0.4·0.55 + 0.1·0.65) / 0.5 = 0.57 — so it's still retrieved.

Cross-query fusion (top + 0.2 × second) → final ranking

Rank | Chunk | Fusion | Final
1 | A | 0.82 + 0.2·0.80 | 0.98
2 | D | 0.74 + 0.2·0.71 | 0.88
3 | C | 0.57 + 0.2·0.54 | 0.68
4 | B | 0.68 (single match) | 0.68

A wins on strong semantic + three-query match; D rises on the filename weight and multi-query match; C — a number-only cell with no embedding — still ranks via lexical signals. This is the shape of the trace artifact we'd deliver on a representative Viridon document.

Tools & platforms in the retrieval stack

Layer | Tool / platform | Why
Hybrid index & ranking | Vespa (open source, self-hosted in-VPC) | BM25 + semantic + n-gram fused in one query; two-phase rank profiles; ANN at scale.
Query embedding | Bedrock model (per 1.3) | Same model + dimension as index; in-VPC.
Query understanding | LLM: Bedrock or self-hosted open model (distill · expand · significance · optional source-pick) | Turns messy conversational input into high-recall, intent-weighted queries.
Precision rerank | Open cross-encoder (e.g. BGE-reranker), self-hosted | Optional shortlist reranking for higher precision.
Cross-query fusion | Application layer | Merge, dedupe, multi-match recall boost.

Edge vs. commodity

Diligence question 1.5

Looking across the pipeline above, which stages do you consider commoditized (off-the-shelf tooling), and where do you provide differentiated value?

Several stages of the pipeline are genuinely commoditized, and we deliberately use strong off-the-shelf components for them rather than reinventing them. Recognizing what's commodity is a feature — it's what keeps the build lean and lets us concentrate engineering where it actually differentiates. In Figure 1 terms, the commodity sits underneath the boxes; the boxes themselves — how they're composed, tuned, and extended — are where the value is.

The commoditized stages are the foundation models and retrieval primitives. Parsing leans on the Unstructured framework, OCR, and the base layout and table-structure models. Embedding is a commodity capability — Titan, Nova and Cohere on Bedrock and self-hosted open models (BGE, E5) are interchangeable, and the vector math is the same everywhere. Retrieval's core operations — BM25, approximate-nearest-neighbour vector search, n-gram matching — are off-the-shelf Vespa primitives, and basic character-based chunking is a solved problem. The platforms themselves (Unstructured, Vespa, Bedrock) are infrastructure we build on, not things we'd ever rebuild. Reinventing any of these would destroy value, not create it.

Our differentiated value is the orchestration around those commodity pieces, and the Viridon-specific layers on top of them. In parsing: strategy routing per document, the dual image-plus-structure table representation, VLM figure-description indexing, multi-page table reconciliation, and the graceful-degradation fallbacks that keep hard documents from failing silently. In chunking: the two-stage structural-then-semantic design, the tuned semantic chunker, and — the biggest piece — the metadata schema with its typed maps, hard/soft filtering, and BM25-by-choice fields. In embedding: the privacy-first in-VPC deployment posture and the model-agnostic abstraction. In retrieval: the multi-stage LLM-wrapped pipeline (distillation, expansion, term significance, cross-query fusion) and the per-customer rank profiles.

And then there's the layer with no off-the-shelf equivalent at all — the bespoke knowledge-layer modules built for Viridon: the "what changes" map (KL·D), public-doc enrichment (KL·I), the RFI Q&A + SME-delegation memory (KL·J), the standard-terms playbook (KL·K), and the onboarding glossary (KL·L), plus the app-specific tools. No general-purpose tool ships these because they encode Viridon's process, not a generic one. This is exactly the gap an off-the-shelf product leaves: it gives you the commodity retrieval box and stops — which is why, on the diagram, a tool like Glean covers only KL·A.

This division is what makes the platform both efficient and bespoke. We don't pay to rebuild commodity foundations, so the budget goes to the integration, tuning, and Viridon-specific modules that are genuinely differentiated — and those are the parts that sit in Viridon's environment as an owned asset.

Additional detail: stage-by-stage commodity vs. value, and the net-new layer

Commodity vs. differentiated, stage by stage

Stage | Commoditized (off-the-shelf) | Our differentiated value
Parsing | Unstructured framework; OCR (Tesseract); base layout (yolox) & table-structure (table-transformer) models; pdfminer; PyMuPDF | Strategy routing per document; dual image + structure table representation; VLM figure-description indexing; multi-page table reconciliation; graceful-degradation fallbacks
Chunking | Recursive / fixed-length character splitters (LangChain) | Two-stage structural → semantic → record-cap design; tuned semantic chunker (95th-pct, own embeddings); cost-aware strategy selection; the standard_document metadata schema (typed maps, hard/soft filtering, BM25-by-choice)
Embedding | The embedding model itself (Titan / Nova / Cohere on Bedrock, or self-hosted open models) | Privacy-first in-VPC Bedrock deployment; model-agnostic abstraction; index/query dimension-invariant management; multimodal cross-modal wiring
Indexing & retrieval | Vespa platform; BM25, ANN/vector search, n-gram primitives | Multi-stage LLM pipeline (distill, expand, significance); hybrid rank profiles (signal + field weights, empty-field discount, semantic auto-zero); cross-query fusion; per-customer rank tuning
Knowledge-layer modules | (no off-the-shelf equivalent) | Net-new for Viridon: KL·D "what changes" map · KL·I public-doc enrichment · KL·J RFI Q&A + SME delegation · KL·K standard-terms playbook · KL·L onboarding glossary · bespoke tools (T1, T3, T4, T6, T10, T12)

Why the division is deliberate

  • Commodity stays commodity — foundation models and search primitives improve rapidly and are interchangeable; building our value on them (not instead of them) means Viridon inherits those improvements for free and isn't locked to one vendor.
  • Value concentrates in composition and context — almost all of the differentiation is in how the commodity pieces are routed, tuned, and fused, and in the modules that encode Viridon's specific documents and process.
  • The off-the-shelf ceiling — a packaged tool delivers the commodity retrieval box and nothing above it; it can't be shaped to the proposal-writing workflow, and the bespoke knowledge-layer modules simply don't exist in it.

Artifacts

The two artifacts requested for this section: a retrieval trace on a representative document, and the third-party tools, platforms, and models in the proposed stack.

Artifact A

Retrieval trace — representative document

For a sample query, the chunks retrieved and how they were ranked. The bars decompose each chunk's relevance into the three weighted signals, so it's visible why each ranked where it did. This expands the inline trace from Section 1.4; we'd run the full version on a Viridon document of your choosing.

Representative doc · 2024 CAISO selection report · ≈198 pp · embedded tables + deliverability figures

Sample query (after distillation + 4-way expansion, ISO_RTO = CAISO filter applied)

"what was the deliverability outcome for Sunrise in CAISO?"

Per-query relevance — how each signal contributes to the blend (0.4·BM25 + 0.5·semantic + 0.1·n-gram)

1. Selection report, "Deliverability" section · p.142 · narrative · matched queries 1, 2, 3 · blended 0.82
2. File: Sunrise_Deliverability_Study.pdf · filename match (field weight 300) · matched queries 1, 2, 4 · blended 0.74
3. Proposal, executive summary mention · p.4 · narrative · matched query 1 · blended 0.68
4. Table cell, "FCD status: Conditional" · p.151 · number-only cell · no embedding · matched queries 1, 3 · blended 0.57

Bar legend: Semantic (0.5) · BM25 (0.4) · N-gram (0.1)

Note the bottom result: a number-only table cell with no embedding — the semantic segment is absent (auto-zeroed) and the blend renormalizes over the lexical signals, so the cell is still retrieved.

Cross-query fusion (final = top score + 0.2 × second-best) → final order

Rank | Chunk | Fusion | Final
1 | Selection report, Deliverability § | 0.82 + 0.2·0.80 | 0.98
2 | Sunrise_Deliverability_Study.pdf | 0.74 + 0.2·0.71 | 0.88
3 | Table cell, FCD status | 0.57 + 0.2·0.54 | 0.68
4 | Proposal exec-summary mention | 0.68 (single match) | 0.68

The table cell climbs from a blended 0.57 to 0.68, level at two decimal places with the single-match proposal mention, because it matched two query phrasings and earned the fusion boost: recall amplification for results that surface under multiple framings.

Artifact B

Third-party tools, platforms & models

The third-party stack the pipeline builds on. Every component is either open-source (self-hosted in Viridon's VPC) or a managed AWS service reached privately over PrivateLink — nothing requires data to leave Viridon's AWS account. Our differentiated value (Section 1.5) is the integration, tuning, and Viridon-specific modules around these — which are our own, not third-party.

Component | Type | Deployment | Role

Parsing & structuring
Unstructured | Library | Open-source · in-VPC | Typed-element parsing across PDF/DOCX/XLSX/HTML
yolox | Model | Open-source · in-VPC | Page layout / region detection
Table-transformer | Model | Open-source · in-VPC | Table row/column structure recovery
Tesseract | Engine | Open-source · in-VPC | OCR for scans & image regions
pdfminer | Library | Open-source · in-VPC | Native PDF text-layer extraction (FAST path)
PyMuPDF (fitz) | Library | Open-source · in-VPC | Hyperlink annotation extraction
pandas | Library | Open-source · in-VPC | Table HTML → per-cell records
VLM (vision-language model) | Model | Bedrock (PrivateLink) or self-hosted OSS | Figure & diagram description generation

Embedding
Amazon Bedrock | Platform | AWS · PrivateLink | In-VPC model access (no-train, KMS, no public egress)
Titan Text Embeddings V2 | Model | Bedrock (PrivateLink) | Default text embeddings (256/512/1024-d)
Nova Multimodal Embeddings | Model | Bedrock (PrivateLink) | Cross-modal text + figure embeddings (lead for figures)
Open embedding models (BGE / E5 / GTE) | Model | Open-source · in-VPC | Self-hosted alternative; fully in-VPC
Cohere · Titan Multimodal G1 | Model | Bedrock (PrivateLink) | Alternative managed embeddings (eval/privacy-gated)

Index & retrieval
Vespa | Platform | Open-source (Apache 2.0) · self-hosted in-VPC | Hybrid BM25 + semantic + n-gram index; two-phase rank profiles; ANN
LLM (query understanding) | Model | Bedrock (PrivateLink) or self-hosted OSS | Query distillation, expansion, term significance, optional source-picking
Cross-encoder reranker (BGE-reranker) | Model | Open-source · in-VPC | Optional precision reranking of the shortlist

Storage & security
Amazon S3 | Platform | AWS · in-tenant | Original documents & extracted images
AWS KMS | Platform | AWS · in-tenant | Encryption at rest (customer-managed keys optional)
AWS PrivateLink | Platform | AWS · in-VPC | Private VPC connectivity; no public-internet egress

Specific model selections (embedding model, VLM, query/rerank models) are finalized per Viridon's data-privacy requirements and eval results; the architecture treats each as a swappable component, and an open-source self-hosted option exists for every model role.

02 — Orchestration

Orchestration design

Diligence question 2.1 · grounded in the demoed workflow

Walk us through how the demoed workflow is structured — the steps, how they connect, and the framework or major dependencies it is built on. More importantly: how do you think about orchestration design, and how do you decide between approaches (routing to specialists, a manager decomposing into parallel sub-tasks, or a single-agent flow)? What about the design you chose seems right for this workflow?

We'll ground this in what we demoed: the Proposal Writing Assistant — Setup workflow, end to end. In Figure 1 that's the Setup workflow in layer 4, marked deterministic, and it's the first phase of a larger AI teammate that runs Setup → Strategy → Drafting → Evaluation. The key design choice: Setup is a deterministic chain, not an agent improvising. The steps are known, ordered, and correctness-critical on a 300-page document — so we use the simplest control flow that does the job reliably, which here means a fixed sequence of tool calls rather than a free-roaming agent.

The demoed sequence: past winning proposals are ingested and broken into typed sections (KL·B, KL·C); a working template is auto-derived and the recurring variables are detected — project_name, sponsor, capacity, key dates (T8, KL·F); a "what changes" map flags the parts of the starting proposal that likely need to change this cycle — learned from how proposals have historically changed — versus the parts safe to keep (KL·D); the new bid's brief and documents are then used to fill the template variables and revise the flagged parts, touching only what likely needs to change so the rest is preserved (T4, KL·D); the ~200-page sponsor selection reports are ingested (KL·B); their win/loss themes are extracted and indexed as a retrievable advice module (KL·E); and finally, in AI editing, when the assistant recommends what to edit and how, it pulls that indexed selection-report advice (and prior sections) via retrieval, comments paragraph-by-paragraph, and proposes edits (T1, T5, drawing on KL·A / KL·E / KL·H). The full step-by-step is below.

On framework and dependencies: the orchestrator (Figure 1, foundation) chains these tools, and both the tools and the knowledge layer are exposed over MCP — the Model Context Protocol, an open standard for connecting AI assistants to tools and data. As the diagram says, the orchestrator can be an MCP client, your Claude / GPT desktop app, or a custom router — that's deliberate, because it decouples the control flow from the tools. Setup runs as a deterministic MCP tool chain; the later phases (Strategy, Drafting, Evaluation) move to orchestrated routing where the path depends on content.

The orchestrator also maintains structured memory across the loop, and how we model that memory is itself a design decision. Rather than carrying one ever-growing chat transcript — which conflates different concerns and quickly overflows the context window — we separate session memory into three kinds: conversation (the natural-language turns that capture user intent and constraints, like "don't delete anything"), working state (a structured scratchpad of the IDs and decisions the workflow has accumulated — the source of truth for where we are), and an episodic trace (an ordered log of every tool call, its arguments, and its outcome). Each planner step receives a deliberately bounded package assembled from these — truncated conversation, current working state, the last few trace steps — rather than everything every time. That separation is what keeps the agent within context limits, keeps machine state reliable across steps (the resumability in 2.2), and makes the whole run auditable (the execution trace in 2.3). This is session-scoped memory for the orchestration loop; cross-session institutional memory is a separate concern, covered in Section 4.
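One way to picture the three-way split is as a small data structure with a method that assembles the bounded package for each planner step. This is a sketch under assumed names and limits (`SessionMemory`, `planner_context`, six turns, three trace steps); the real bounds and field layout are design parameters, not values stated in this document.

```python
from dataclasses import dataclass, field

@dataclass
class TraceStep:
    tool: str
    arguments: dict
    outcome: str

@dataclass
class SessionMemory:
    conversation: list = field(default_factory=list)   # natural-language turns
    working_state: dict = field(default_factory=dict)  # IDs and decisions so far
    trace: list = field(default_factory=list)          # ordered TraceStep log

    def record(self, tool, arguments, outcome):
        """Append one tool call to the episodic trace."""
        self.trace.append(TraceStep(tool, dict(arguments), outcome))

    def planner_context(self, max_turns=6, max_trace_steps=3):
        """Bounded package for one planner step, never the full history."""
        turns = self.conversation[-max_turns:] if max_turns > 0 else []
        steps = self.trace[-max_trace_steps:] if max_trace_steps > 0 else []
        return {
            "conversation": turns,
            "working_state": dict(self.working_state),
            "recent_trace": steps,
        }
```

The full trace stays in memory for auditing while the planner only ever sees its tail, which is what keeps the prompt size flat as a run grows.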

Scaling Setup into a full "AI teammate for proposal writing" is, concretely, two things: implementing the knowledge-layer modules, and building a composable tool set — Read paragraph, Create comment, Draft a section, Identify opportunities, Flow updates across the document, Evaluate against criteria, Aggregate attachments, Web research, Grounded Q&A. Those are exactly T1–T8 and T·Q in Figure 1, plus a few document-editor primitives. The teammate itself is a conversational multi-agent loop living in the multiplayer editor: it routes each turn to the right tool or subagent, drafts / comments / researches like a colleague, and proposes changes a human approves.

How we think about orchestration design comes down to a few principles. Use the simplest control flow that works — deterministic where steps are known, agentic only where they aren't. MCP as the interface, so tools are swappable and the orchestrator is replaceable. Human-in-the-loop at the right gates — the AI proposes, the human disposes; no risky or irreversible action (editing the live document, flowing a change across 300 pages) happens without explicit approval. And guardrails, governance and security throughout: RBAC-scoped retrieval (KL·H), the open-source / in-VPC deployment from our deployment principle, per-tool permissions, and a full execution trace of every step (Section 2.3). We build for performance (parallelize what's parallelizable) and extensibility (tools compose and get reused across mini-apps).

We decide between orchestration approaches by the shape of the work, and the patterns nest rather than compete. A deterministic chain for known, ordered, correctness-critical steps (Setup). A router to specialists when requests are heterogeneous and each needs a different capability (the teammate picking T1 vs T2 vs T7 per turn). A manager that decomposes into parallel sub-tasks when the work splits into independent units (evaluating all sections at once, multi-query retrieval, flowing one change across 300 pages). And a multi-agent loop for interactive, open-ended work with a human present (live drafting). For proposal writing this combination is right because Setup demands reliability, drafting is inherently interactive, evaluation is embarrassingly parallel — and, critically, every specialist tool we build composes: the comment, Q&A, research and retrieval tools built here are reused by the RFI drafter, the legal screener, and the onboarding assistant. That reuse is the whole shared-foundation thesis of Figure 1, and it's why we optimize for composition.
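The decision rule above can be written down as a small lookup. This is purely an illustration of the reasoning; the type and field names are hypothetical, and a real design review weighs these factors with more nuance than four booleans.

```python
from dataclasses import dataclass
from enum import Enum

class Pattern(Enum):
    DETERMINISTIC_CHAIN = "deterministic chain"
    ROUTER = "router to specialists"
    PARALLEL_MANAGER = "manager with parallel sub-tasks"
    MULTI_AGENT_LOOP = "multi-agent loop"

@dataclass(frozen=True)
class WorkShape:
    steps_known_and_ordered: bool = False
    heterogeneous_requests: bool = False
    splits_into_independent_units: bool = False
    interactive_with_human: bool = False

def applicable_patterns(shape):
    """Return every pattern the work's shape calls for; patterns can nest."""
    checks = [
        (shape.steps_known_and_ordered, Pattern.DETERMINISTIC_CHAIN),
        (shape.heterogeneous_requests, Pattern.ROUTER),
        (shape.splits_into_independent_units, Pattern.PARALLEL_MANAGER),
        (shape.interactive_with_human, Pattern.MULTI_AGENT_LOOP),
    ]
    return [pattern for applies, pattern in checks if applies]
```

Returning a list rather than a single choice reflects the point that the patterns nest: live drafting, for example, is both interactive and heterogeneous, so it gets a loop that contains a router.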

The demoed Setup workflow — step by step

1. Upload past proposals → ingest & section-extract
   Past winning proposals are parsed into typed sections and RFP questions, ready for templating and retrieval.
   past proposals · KL·B ingestion · KL·C section extraction

2. Generate template + detect variables
   Auto-derive a working template from prior wins and detect the variables that recur throughout a proposal: project_name, sponsor, capacity, dates.
   T8 build template · KL·F template generation

3. Flag what likely needs to change
   Look at the past proposal we're starting from and flag the parts that likely need to change this cycle (learned from how proposals have historically changed) versus the parts safe to keep.
   KL·D "what changes" map

4. Fill variables & revise flagged parts
   Use the new bid's brief and documents to fill the template variables and revise the parts the map flagged, touching only what likely needs to change, so the rest is preserved.
   new brief + docs · T4 flow updates · KL·D

5. Upload selection reports → ingest
   Ingest the ~200-page sponsor selection reports alongside the proposal corpus.
   selection reports · KL·B ingestion

6. Extract & index advice
   Mine win/loss themes and actionable advice from the selection reports and index it as a retrievable knowledge module.
   KL·E selection-report advice

7. AI editing via RAG
   When recommending what to edit and how, the assistant retrieves the indexed selection-report advice (and relevant prior sections), comments paragraph-by-paragraph, and proposes edits grounded in what has won before.
   T1 read & comment · T5 evaluate · KL·A · KL·E · KL·H

8. Human approves
   Every proposed change is surfaced for the human to approve or deny before it touches the document; this is the gate before any edit is applied.
   human-in-the-loop gate

Legend: Input / gate · Tool · Knowledge-layer module · Human approval
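The control flow of the steps above, a fixed sequence followed by an approval gate, can be sketched as follows. The function name, the shape of the state dictionary, and the `proposed_edits` key are assumptions for illustration; each step stands in for an MCP tool call.

```python
def run_setup_chain(steps, state, approve):
    """Run named steps in a fixed order, then gate proposed edits on approval.

    steps:   list of (name, fn) pairs; fn(state) returns a dict of updates.
    approve: callable(edit) -> bool, standing in for the human decision.
    """
    trace = []
    for name, step in steps:
        state = {**state, **step(state)}  # each step sees the accumulated state
        trace.append(name)
    proposed = state.get("proposed_edits", [])
    approved = [edit for edit in proposed if approve(edit)]
    return {**state, "approved_edits": approved}, trace
```

Because the order is fixed, the trace is identical from run to run, and nothing reaches `approved_edits` without passing through the `approve` callback.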
Additional detail: the teammate tool set, design principles, choosing between patterns

Building the "AI teammate" — composable tools on the knowledge layer

The teammate is a single conversational agent that calls a set of small, well-scoped tools (the same ones in Figure 1, plus editor primitives). Each is built once and reused across apps — that reuse is the point.

Tool | What it does | Knowledge layer it draws on | Reused by
T·Q · Grounded Q&A | Cited answers over the knowledge layer | KL·A, G, H | every mini-app
T1 · Read & comment | Suggests improvements vs. selection-report themes | KL·C, E, G | RFI, legal
T2 · Draft a section | From template + structured prior wins | KL·A, B, C, F | RFI drafter
T3 · Identify opportunities | Where to differentiate this bid | KL·A, E, G |
T4 · Flow updates | Propagate a change across 300+ pages | KL·B, D | evaluation
T5 · Evaluate against criteria | Score a draft vs. what wins | KL·A, E, G, H |
T6 · Aggregate attachments | SME reports into one narrative voice | KL·A, B, H | RFI drafter
T7 · Web research & scrape | Live external + public-doc context | KL·B, I | ISO/RTO, all
T8 · Build a template | Auto-derive from past proposals | KL·C, D, F |
Editor primitives | Read paragraph · Create comment · Apply approved edit | drafting surface |

Orchestration design — the principles

  • Simplest control flow that works — deterministic chains where steps are known and correctness matters; agentic routing only where the path genuinely depends on content. Determinism buys reliability, observability, and speed.
  • MCP as the interface — tools and the knowledge layer are exposed over MCP, so the orchestrator (MCP client / Claude / GPT desktop / custom router) is decoupled from the tools. Either side can be swapped without rewriting the other.
  • Human-in-the-loop at the right gates — the AI proposes; a human approves or denies before any consequential action. No edit to the live document, no change flowed across the proposal, no external action without explicit approval (Figure 1's Evaluation phase makes this gate explicit).
  • Guardrails, governance & security — RBAC-aware scoped retrieval (KL·H) so the agent only ever sees what the user may see; the open-source / in-VPC deployment so nothing leaves Viridon's environment; per-tool permissioning; and a full, inspectable execution trace (Section 2.3).
  • Performance — parallelize independent work (multi-query retrieval, section-parallel evaluation), keep deterministic chains to avoid wasted LLM round-trips, cache where safe.
  • Extensibility — new use cases reuse existing tools and add a few; the orchestration pattern stays the same. This is what makes each subsequent mini-app a fraction of the first.

Orchestration memory: session-scoped, three-way split (built today)

A single chat transcript doesn't scale and conflates three different concerns. We model the orchestration loop's memory (planner → tool calls → synthesizer) as three separate types on a per-session SessionMemory, so each stays clean and bounded.

Memory type | What it holds | Answers | How it's used
Conversation | Natural-language user/assistant turns | "What did they ask for?" (intent, constraints, tone) | Fed to the planner, truncated by turn count + character budget
Working state | Structured scratchpad: IDs & decisions (list_id, task_id, last_search_query) | "Where are we right now?" | Patched / replaced explicitly; in every (size-limited) planner package; the source of truth across steps
Episodic trace | Ordered tool-call log: name, redacted args, success/failure, result summary, timestamps | "What happened?" | Written on each tool call; recent summaries go to the planner; powers audit, debug, replay & synthesis

Two cross-cutting mechanisms keep this within budget (built today):

  • Compression & summarization — when a tool output exceeds a threshold (~4,000 chars), it's compressed (heuristic or optional LLM) and the salient IDs are merged into working state, so large MCP payloads don't blow the context window or drown the signal.
  • Bounded planner package — each planner step gets a deliberately size-limited view (truncated conversation + working state + last N trace steps + recent tool summaries), enforcing intentional context propagation rather than "send everything every time".

The synthesizer then produces the final answer by reading the goal, the trace, and the observations — not raw MCP blobs — which is what lets the loop stay reliable and within limits while still answering well.
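
As a rough illustration of the three-way split and the two budget mechanisms, here is a minimal sketch. The field names, the `record_tool_call` and `planner_package` helpers, and the default budgets are assumptions made for this example, not the actual SessionMemory implementation; only the ~4,000-character threshold comes from the text above.

```python
from dataclasses import dataclass, field
from typing import Any

COMPRESS_THRESHOLD = 4000  # chars; mirrors the ~4,000-char threshold described above


@dataclass
class TraceEntry:
    tool: str
    ok: bool
    summary: str


@dataclass
class SessionMemory:
    conversation: list[str] = field(default_factory=list)        # intent
    working_state: dict[str, Any] = field(default_factory=dict)  # state
    trace: list[TraceEntry] = field(default_factory=list)        # history

    def record_tool_call(self, tool: str, ok: bool, output: str, ids: dict[str, Any]) -> None:
        # Heuristic compression (assumed): keep only the head of an oversized output.
        if len(output) > COMPRESS_THRESHOLD:
            summary = output[:200] + " ...[compressed]"
        else:
            summary = output
        self.working_state.update(ids)  # salient IDs survive compression
        self.trace.append(TraceEntry(tool, ok, summary))

    def planner_package(self, max_turns: int = 4, max_chars: int = 2000, last_n: int = 3) -> dict[str, Any]:
        turns = self.conversation[-max_turns:]
        # Enforce the character budget by dropping the oldest turns first.
        while turns and sum(len(t) for t in turns) > max_chars:
            turns = turns[1:]
        return {
            "conversation": turns,
            "working_state": dict(self.working_state),
            "recent_trace": [e.summary for e in self.trace[-last_n:]],
        }
```

The planner only ever sees the return value of `planner_package`, which is what keeps context propagation intentional rather than cumulative.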

Memory — planned / partial

  • Long-term memory (planned): user preferences, standing instructions, and org facts retrieved by ID or embedding into the planner prompt, so the agent can remember across sessions ("always use workspace X", "never post to #general"). Documented but not yet built in the MCP client; it's the bridge to the institutional memory in Section 4.
  • Conversation summarization (planned): a rolling summary of older turns instead of dropping them. Truncation is built today; summarization preserves early context more cheaply and is the next step.
  • Subagent memory (partial): each subagent runs its own plan–act loop with its own memory and returns a bounded result; the parent merges the child's working state (under subagent_last), not the full child trace, so delegation stays scoped and doesn't pollute parent context.

The core idea: separating intent (conversation), state (working state) and history (episodic trace) lets the orchestrator stay within context limits, keep machine state reliable, and still synthesize good answers — the opposite of stuffing one growing transcript into every call.

Choosing between orchestration patterns

Pattern | Use when | In proposal writing
Deterministic chain | Steps are known, ordered, and correctness-critical | The Setup phase: fixed sequence, fully traceable
Router to specialists | Requests are heterogeneous; each needs a different capability | The drafting teammate routing each turn to T1 / T2 / T3 / T7
Manager → parallel sub-tasks | The task splits into independent units that aggregate | Evaluating every section at once; multi-query retrieval; flowing one change across 300 pages
Multi-agent loop | Interactive, open-ended, human present | Live drafting in the multiplayer editor
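
To make the "manager → parallel sub-tasks" pattern concrete, here is a minimal sketch of fanning section evaluation out over a thread pool and aggregating the results. `evaluate_section` is a hypothetical stand-in for T5 (it scores by word count purely so the example runs); the worker count is likewise an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_section(section: str) -> dict:
    # Hypothetical stand-in for T5; a real evaluator would call a model.
    return {"section": section, "score": min(10, len(section.split()))}


def evaluate_all(sections: list[str], max_workers: int = 4) -> list[dict]:
    # Independent units run concurrently; pool.map preserves input order,
    # so the manager can aggregate results deterministically.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_section, sections))
```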

Why this design is right for the workflow

  • Reliability where it matters — Setup is deterministic, so a long, high-stakes document is processed the same way every time, with a clean trace and no agent drift.
  • Fit to the human reality of drafting — drafting is iterative and collaborative, so a single conversational teammate with approval gates matches how Erin actually works, rather than forcing an autonomous agent onto a human process.
  • Speed where the work parallelizes — evaluation and retrieval fan out, so review of a full proposal doesn't run serially.
  • Composition over a monolith — building specialist tools (not one giant agent) means the proposal work directly powers the RFI drafter, legal screener, and onboarding assistant. We optimize for extensibility because the second app should cost a fraction of the first.

02 — Orchestration

Reliability

Diligence question 2.2

How do you think about reliability in a multi-step workflow, and what tools or techniques do you use to achieve it? Specifically: what happens when a step fails or returns low-quality output, how do you validate output between steps, and how do you prevent the workflow from drifting off course?

Our first reliability technique is to minimize the surface area for failure: the most reliable step is a deterministic one, which is why Setup is a fixed chain (2.1) rather than an agent improvising. For the parts that genuinely need an LLM, we treat it as a fallible component and wrap it in four things — validated structured outputs, checkpointed state, risk-classified human approval, and an in-loop reflection step. Together those cover the three failure questions: what happens when a step fails, how we validate between steps, and how we keep the workflow from drifting.

When a step fails or returns low-quality output, we separate two cases. A hard failure (error, timeout, tool exception) triggers a bounded retry with backoff, then a fallback path where one exists. This is the same pattern as the parsing fallbacks in 1.1, where a failed table inference degrades to image + description rather than crashing. If the step is still failing, we resume from the last checkpoint; once those options are exhausted, we escalate to a human rather than proceed on a broken step. A soft failure (the step runs but the output is malformed, low-quality, or unsupported) is caught by the validation and reflection gates below, then repaired, retried, or escalated. The principle throughout: never silently pass a bad result downstream.
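
The hard-failure path can be sketched roughly as follows. This is an illustrative outline under assumed names (`run_step`, `EscalateToHuman`) and assumed defaults (three attempts, exponential backoff from 0.5s); it leaves out the resume-from-checkpoint stage for brevity.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class EscalateToHuman(Exception):
    """Raised when retries and the fallback path are exhausted."""


def run_step(primary: Callable[[], T],
             fallback: Optional[Callable[[], T]] = None,
             max_retries: int = 3,
             base_delay: float = 0.5,
             sleep: Callable[[float], None] = time.sleep) -> T:
    # 1) Bounded retry with exponential backoff.
    for attempt in range(max_retries):
        try:
            return primary()
        except Exception:
            if attempt < max_retries - 1:
                sleep(base_delay * 2 ** attempt)
    # 2) Fallback path, where one exists.
    if fallback is not None:
        try:
            return fallback()
        except Exception:
            pass
    # 3) Never proceed on a broken step: hand over to a human.
    raise EscalateToHuman("step failed after retries and fallback")
```

Passing `sleep` as a parameter is only there so the backoff schedule can be observed without actually waiting.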

We validate output between steps by making every step emit a structured, typed output against an explicit schema — so we know exactly what the model produced and can check it programmatically before the next step consumes it. Schema validation deterministically catches malformed or hallucinated structure (missing fields, wrong types, out-of-range values); a grounding check verifies that claims which should be supported by retrieved sources actually are (the anti-hallucination contract that feeds our eval harness in Section 3). Each step has an explicit input/output contract, so a downstream step never has to guess what it received.
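
A minimal sketch of what such an inter-step check might look like, using only the standard library. The `SectionScore` shape, its field names, and the 0 to 10 range are assumptions for illustration, and the grounding check here is deliberately simplified to "does every citation point at a chunk that was actually retrieved"; a fuller check would also test whether the chunk supports the claim.

```python
from dataclasses import dataclass


@dataclass
class SectionScore:
    section_id: str
    score: int                  # assumed range: 0..10
    cited_chunk_ids: list[str]


def validate_schema(raw: dict) -> SectionScore:
    # Deterministic checks: required fields, types, ranges.
    for name, typ in (("section_id", str), ("score", int), ("cited_chunk_ids", list)):
        if name not in raw:
            raise ValueError(f"missing field: {name}")
        if not isinstance(raw[name], typ):
            raise ValueError(f"wrong type for {name}")
    if not 0 <= raw["score"] <= 10:
        raise ValueError("score out of range")
    return SectionScore(raw["section_id"], raw["score"], raw["cited_chunk_ids"])


def check_grounding(out: SectionScore, retrieved_ids: set[str]) -> list[str]:
    # Return the citations that do not match any retrieved chunk.
    return [c for c in out.cited_chunk_ids if c not in retrieved_ids]
```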

State management makes failure recoverable. Each step's inputs and outputs are checkpointed and steps are designed to be idempotent (safe to re-run without applying anything twice), so on any failure we know exactly where we left off and resume from the last good checkpoint — we don't re-parse a 300-page proposal or re-embed the corpus because a later step timed out. This matters most for the proposal workflow specifically, which runs over months, not minutes.
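
One simple way to picture checkpoint-and-resume is sketched below. The JSON-file-per-step layout and the `run_with_checkpoints` name are assumptions chosen to keep the example self-contained; a production system would more likely use a durable store.

```python
import json
from pathlib import Path
from typing import Callable


def run_with_checkpoints(steps: list[tuple[str, Callable[[dict], dict]]],
                         state: dict,
                         ckpt_dir: Path) -> dict:
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    for name, step in steps:
        ckpt = ckpt_dir / f"{name}.json"
        if ckpt.exists():
            # Resume: load the saved output instead of redoing expensive work.
            state = json.loads(ckpt.read_text())
            continue
        state = step(state)
        ckpt.write_text(json.dumps(state))  # persist this step's output
    return state
```

Because a completed step is skipped on re-run, calling the function again after a failure is idempotent for everything that already succeeded.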

We prevent drift with a reflection step in the loop — a pattern we've already implemented in our agentic orchestration work. After a step (or on a cadence), a critic re-checks the work against the original objective and constraints, catches drift, and either re-anchors, re-plans, or halts. The goal and constraints are carried through every step so the agent never loses the thread, scoped tools limit how far it can wander, and explicit termination criteria plus bounded autonomy (caps on tool calls, recursion, and cost) stop a runaway loop.

Plan → Act (call tool) → Validate output → Reflect vs. goal ↻
On track → continue · drifting → re-anchor / re-plan · off-track or budget exceeded → halt & escalate to human
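
The loop above can be sketched as a small control function. The verdict strings, the `run_plan` name, and the single call budget are assumptions for illustration; `act` and `reflect` stand in for the tool call and the critic.

```python
from typing import Callable


def run_plan(plan: list[str],
             act: Callable[[str], str],
             reflect: Callable[[str, str], str],
             max_calls: int = 10) -> dict:
    calls, i, done = 0, 0, []
    while i < len(plan):
        if calls >= max_calls:
            # Bounded autonomy: stop and surface the partial result.
            return {"status": "halted", "reason": "budget exceeded", "done": done}
        output = act(plan[i])
        calls += 1
        verdict = reflect(plan[i], output)  # "on_track" | "drifting" | "off_track"
        if verdict == "on_track":
            done.append(output)
            i += 1
        elif verdict == "drifting":
            continue  # re-anchor: retry the same step against the original goal
        else:
            return {"status": "halted", "reason": "off track", "done": done}
    return {"status": "complete", "done": done}
```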

Finally, the human gate is itself a reliability control. We classify actions by risk and reversibility: read-only and reversible-in-draft actions (search, comment, propose an edit) run autonomously, while consequential or irreversible actions — applying an edit to the live document, flowing a change across 300 pages, anything external — require explicit human approval. And because applied edits are versioned (the document carries a version field), even an approved change is reversible. All of this sits on a full execution trace (Section 2.3), because you can't make reliable what you can't see.

Additional detail: failure handling, validation, state, reflection, risk-gating

Failure handling — by failure mode

Failure mode | How we detect it | Response
Hard failure (error / timeout) | Exception, timeout, tool error | Bounded retry with backoff → fallback path → resume from last checkpoint → escalate if exhausted
Malformed output | Schema validation fails | Repair / re-prompt → bounded retries → escalate
Low-quality / unsupported output | Critic + grounding check fail | Reflection re-do with feedback → escalate to human if it doesn't converge
Drift from the goal | Reflection step vs. objective | Re-anchor / re-plan; halt if it can't get back on track
Runaway loop | Step / cost / recursion budget exceeded | Hard stop → surface the partial result and the reason

Inter-step validation

  • Structured outputs — every step returns typed data against a schema, so the output is machine-checkable, not free text that the next step has to parse hopefully.
  • Schema validation — required fields, types, enums and ranges are enforced deterministically; a violation is a caught failure, not a downstream surprise.
  • Grounding checks — outputs that should be source-supported are verified against the retrieved evidence; unsupported claims are flagged before they propagate.
  • Explicit contracts — each step declares its input and output shape, so steps compose safely and a change to one can't quietly break the next.

State & resumability

  • Checkpointing — each step's I/O is persisted; a failure resumes from the last good step rather than restarting expensive work (parsing, embedding).
  • Idempotency — steps are safe to re-run, so retries and resumes don't duplicate or corrupt work.
  • Long-running by design — the proposal workflow spans months; durable state is what makes that survivable across interruptions.

Reflection & bounded autonomy

  • Critic in the loop — an evaluator re-checks each step's work against the goal and constraints (already implemented in our agentic orchestration products).
  • Goal anchoring — the original objective and constraints travel with the workflow so the agent doesn't lose the thread across many steps.
  • Scoped tools + termination criteria — limited tool surface and explicit stop conditions keep loops convergent.
  • Budgets — caps on tool calls, recursion depth and cost catch a runaway before it does damage or burns spend.
  • Abstention — a step can report low confidence and escalate rather than guess; "I'm not sure, here's why" beats a confident hallucination.

Risk-classified human approval

Action class | Examples | Autonomy
Read-only / retrieval | Search, read a paragraph, grounded Q&A | Autonomous
Generative · reversible in draft | Draft a section, propose an edit, generate a template, leave a comment | Autonomous (proposed, not applied)
Consequential / irreversible | Apply an edit to the live document, flow a change across the proposal, any external action (send, export) | Human approval required

Applied edits are versioned, so an approved change can still be rolled back — reversibility is a backstop even past the approval gate.
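
A minimal sketch of the risk gate as a lookup plus a default. The action names and the `may_run` helper are hypothetical; the three classes follow the table above, and treating unknown actions as consequential is an assumption made here for safety.

```python
from enum import Enum


class Risk(Enum):
    READ_ONLY = "read_only"
    REVERSIBLE_DRAFT = "reversible_draft"
    CONSEQUENTIAL = "consequential"


# Hypothetical action names mapped to the three classes in the table.
ACTION_RISK = {
    "search": Risk.READ_ONLY,
    "read_paragraph": Risk.READ_ONLY,
    "propose_edit": Risk.REVERSIBLE_DRAFT,
    "create_comment": Risk.REVERSIBLE_DRAFT,
    "apply_edit": Risk.CONSEQUENTIAL,
    "flow_change": Risk.CONSEQUENTIAL,
    "export": Risk.CONSEQUENTIAL,
}


def may_run(action: str, human_approved: bool = False) -> bool:
    # Unknown actions default to the strictest class.
    risk = ACTION_RISK.get(action, Risk.CONSEQUENTIAL)
    return risk is not Risk.CONSEQUENTIAL or human_approved
```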

How it connects

Reliability isn't a single feature — it's the combination of a deterministic backbone (2.1), validated structured contracts between steps, durable state, an in-loop critic, risk-gated approval, and full observability (2.3). The same eval harness that measures quality (Section 3) doubles as regression protection: when a prompt or model changes, it confirms existing behavior didn't break before the change ships.

02 — Orchestration

Observability

Diligence question 2.3

How do you think about observability, and what tooling do you use for it? Specifically: can we and our technical advisor see the full execution trace of a workflow — what each step retrieved, decided, and passed downstream?

Yes — fully, top to bottom. The shift that makes this real is treating a workflow run as spans, not log lines: a run is one trace, each step is a span, and nested tool calls and sub-agents are child spans. That tree is exactly what answers "what each step retrieved, decided, and passed downstream," because those relationships are a hierarchy, not a flat stream. We build it on OpenTelemetry (the open industry standard for tracing software) with LLM-specific semantic conventions, and the backend — Langfuse or Phoenix — is open-source and self-hosted, so the trace store lives inside Viridon's VPC alongside everything else. No traces of Viridon's proposals go to a third-party SaaS.

We design for two audiences with two surfaces over the same captured data. Your technical advisor gets the full span tree and audit logs — every tool call, every retrieval, every decision, validation result and hand-off, with token, cost and latency per span and the exact prompt-template and model version that produced each output, plus replay. Erin and end users get explainability instead of raw internals: a "why did it suggest this?" view that traces any AI recommendation back through the advice it used to the source page. Same data underneath; the advisor sees the engine, the user sees the reason.

Mapping directly to your three words: retrieved is the retrieval trace from Section 1.4 captured on each search span — the distilled and expanded queries, the candidate set, the per-signal scores, and what was filtered out, not just what came back. Decided is the planner's tool choice and the alternatives it weighed, the working-state diff for that step, and the validation/reflection verdict. Passed downstream is the typed output and the working-state delta — the bounded planner package handed to the next step.
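
To show why a tree answers the question better than a flat log, here is a small stand-in for a span tree carrying the three fields. It is plain Python rather than the OpenTelemetry API, and the `Span` class, its field names, and `render` are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    retrieved: Optional[str] = None
    decided: Optional[str] = None
    passed_downstream: Optional[str] = None
    children: list["Span"] = field(default_factory=list)

    def child(self, name: str, **kw) -> "Span":
        # Nested tool calls and sub-agents become child spans.
        span = Span(name, **kw)
        self.children.append(span)
        return span


def render(span: Span, depth: int = 0) -> list[str]:
    lines = ["  " * depth + span.name]
    for label in ("retrieved", "decided", "passed_downstream"):
        value = getattr(span, label)
        if value:
            lines.append("  " * (depth + 1) + f"{label}: {value}")
    for c in span.children:
        lines.extend(render(c, depth + 1))
    return lines
```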

The substrate already exists. The episodic trace, working state, and bounded planner package from our memory model (2.1) already record what each step did, what changed, and what was passed on. Observability is mostly turning that into spans and a UI — productionizing what the orchestration loop already captures, not bolting on a parallel logging system.

The feature that matters most for proposal writing is provenance lineage: for each AI claim or proposed edit, we record which retrieved chunk supported it and link that chunk back to its source page. So a recommendation traces cleanly as edit → selection-report advice (KL·E) → chunk on p.142 of the 2023 report. End-to-end answer-to-source lineage is what makes the user-facing explainability trustworthy — and it doubles as a clean artifact for a future data room.
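
A provenance record of this kind could be as small as the sketch below. The class and field names, and the example identifiers `adv_17` and `c_1`, are hypothetical; only the edit → advice → source-page chain comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    source_doc: str
    page: int


@dataclass(frozen=True)
class Provenance:
    edit_id: str
    advice_id: str  # entry in the advice module (KL·E)
    chunk: Chunk


def explain(p: Provenance) -> str:
    # Human-readable lineage for the "why did it suggest this?" view.
    return f"{p.edit_id} -> advice {p.advice_id} -> {p.chunk.source_doc}, p.{p.chunk.page}"
```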

Two things worth flagging for a technical reviewer. First, logged "reasoning" is the model's stated rationale: an honest record of what it reported, not a proof of the true cause. So "decided" means the recorded decision plus its stated reasoning. Second, exact reproducibility is bounded by model nondeterminism. We pin prompt-template and model versions and set seeds where the provider allows, so replaying a trace from its checkpointed state is reliable, but a hosted model is not guaranteed to regenerate an identical output on re-run. Separately, because traces contain document content, redaction (arguments are already redacted), access control on the trace UI, retention limits, and sampling are part of the design, not afterthoughts.

Additional detail: tooling, per-span capture, the two surfaces, provenance, limits

Tooling (self-hosted in-VPC)

  • OpenTelemetry + LLM semantic conventions (OpenInference / OpenLLMetry) — an open standard, so we're not locked to one vendor's trace format.
  • Langfuse or Phoenix as the trace backend & UI — both open-source and self-hostable, deployed inside Viridon's VPC per the deployment principle.
  • Replay — because state is checkpointed (2.2), a run can be re-executed from any step for time-travel debugging.

What's captured per span

Captured | Detail
Identity & timing | Span name, parent, start/end, duration
Inputs | Tool name, redacted args, the bounded package the step received
Retrieval | Distilled + expanded queries, candidate set, per-signal scores, what was dropped, rank profile used
Decision | Planner tool choice + alternatives weighed; validation / reflection verdict
Output | Typed result + working-state delta passed downstream
Cost & version | Tokens, cost, latency; prompt-template + model version
Grounding | Which sources supported which claims (the provenance link)

Two audiences, two surfaces

Audience | Surface | What they see
You + technical advisor | Full span tree + audit logs (Langfuse / Phoenix, self-hosted) | Every step's retrieval, decision, validation and hand-off; cost / latency; prompt + model versions; replay
Erin / end users | In-product explainability view | "Why did it suggest this?" (recommendation → advice used → source page); no raw internals

Observability → evaluation loop

Traces aren't only for debugging. We sample production traces into the eval set (Section 3) and monitor online quality signals — grounding-failure rate, retrieval-hit-rate, drift / halt events — not just latency and cost. That's the difference between "we have logs" and "we know it's working", and it's what turns the regression story in 2.2 into a live signal.

Honest limits

  • Stated rationale ≠ proof — we record the model's reported decision and reasoning; it's a faithful record of what it said, not a guarantee of the underlying cause.
  • Bounded reproducibility — version-pinning and seeds make trace replay reliable; a hosted model's exact output isn't guaranteed identical on re-run.
  • Sensitive content — traces hold document text, so redaction, RBAC on the trace UI, retention limits, and sampling are designed in from the start.

Artifact

End-to-end execution trace — demoed Setup workflow

The full run as an expandable span tree. Click any step to see what it retrieved, decided, and passed downstream. This is a representative render of what the advisor sees in the self-hosted trace UI.

TRACE · Proposal Writing Assistant: Setup · deterministic · 8 steps · 41.6s
1 · Ingest & section-extract · KL·B / C · 18.2s

Retrieved

12 past proposals + 3 selection reports from SharePoint (source scope applied)

Decided

HI_RES parse strategy (docs < 999 pp, image indexing on); table inference enabled

Passed downstream

1,840 typed sections + 312 RFP questions → working_state.section_index

2 · Generate template + detect variables · T8 · KL·F · 6.1s

Retrieved

12 prior winning proposals, ranked by selection outcome (KL·F)

Decided

Template derived from highest-scoring wins; 47 recurring variables detected — project_name, sponsor, capacity_mw, cod_date

Passed downstream

Template + variable manifest → working_state.template

3 · Flag what likely needs to change · KL·D · 3.4s

Retrieved

The starting proposal + change patterns learned across historical proposals

Decided

Flagged the parts of the starting proposal that likely need to change this cycle (e.g. project specifics, deliverability sections) vs. the parts safe to keep

Passed downstream

"What likely needs to change" map → working_state.change_map

4 · Fill variables for the new bid · T4 · KL·D · 5.0s

Retrieved

New project brief + 4 supporting documents

Decided

Revise only the flagged parts and fill the template variables; the rest left untouched

Passed downstream

Filled draft v0 → working_state.draft_id (version 1)

5 · Ingest selection reports · KL·B · 9.7s

Retrieved

3 sponsor selection reports (~200 pp each)

Decided

HI_RES parse; table + figure extraction

Passed downstream

Parsed selection-report records → index

6 · Extract & index advice · KL·E · 7.3s

Retrieved

Parsed selection-report records

Decided

Mined 64 win/loss advice entries; indexed as a retrievable advice module

Passed downstream

Advice module → KL·E (now retrievable by the editor)

7 · AI editing via RAG · T1 · T5 · 11.2s

Retrieved

Selection-report advice + prior winning sections — top chunk: 2023 selection report, p.142 (final score 0.98). See child span for the full retrieval trace.

Decided

Propose 1 comment + 1 edit to §3.2, strengthening deliverability evidence — grounded in KL·E theme "deliverability evidence under-stated vs. winning bids"

Passed downstream

Proposed change-set {comment_1, edit_1} → working_state.pending_changes

7.1 · Retrieve (hybrid) · KL·A / E / H · 1.4s

Retrieved

Query "deliverability outcome for Sunrise in CAISO" → 4 expanded queries → 4 chunks. Ranked: selection-report §Deliverability p.142 (0.98) · Sunrise_Deliverability_Study.pdf (0.88) · FCD-status table cell (0.68) · exec-summary mention (0.68). Full per-signal breakdown in Section 1 · Artifact A.

Passed downstream

Top 4 ranked chunks → comment + evaluate tools

7.2 · T1 · Read & comment · T1 · 4.6s

Decided

Flag §3.2 paragraph; rationale: selection-report advice (KL·E) says deliverability outcomes win on quantified evidence — current draft asserts without figures

Passed downstream

1 proposed comment, with provenance link → p.142

7.3 · T5 · Evaluate vs. criteria · T5 · 3.8s

Decided

§3.2 scores 6/10 against winning bids; gap = quantified deliverability outcome

Passed downstream

Score + gap note attached to the change-set

7.4 · Validate & reflect · guard · 1.4s

Decided

Grounding check passed — the comment cites a real retrieved source (p.142). On-track vs. goal; no drift. Proceed to human gate.

8 · Human approval gate · human · awaiting

Retrieved

Pending change-set {comment_1, edit_1} with provenance

Decided

Surface to Erin for approve / deny — no autonomous application of edits

Passed downstream

Awaiting human; nothing applied to the live document yet

Legend: Retrieved · Decided · Passed downstream

03 — Evaluation

What we measure

Diligence question 3.1

What do you measure to know the system is working, and how do you define each metric? Specifically: how do you treat the distinct failure types — the wrong source being retrieved, an output claim not supported by the retrieved source (hallucination), and low output quality?

We measure each stage of the pipeline separately, on purpose — because a bad final answer is a symptom, and what makes an eval useful is being able to say why it was bad. The three failure types you name aren't interchangeable: they live in different stages and have different fixes, so we attribute every failure to a stage rather than scoring only the end result. That localization is the whole design of the eval.

The wrong source retrieved is a retrieval-stage failure, scored against a labeled set of which chunks are relevant per query. The headline metric is recall@k / hit-rate (did a relevant chunk make the top-k at all), because if it wasn't retrieved, the generator simply can't use it. Around that we track context recall (did we get all the chunks needed) versus context precision (are the relevant ones ranked above the noise), plus MRR for how high the first relevant chunk landed and nDCG for the rank-weighted position of all relevant chunks. We split a recall miss (relevant chunk absent, usually fatal) from a precision miss (irrelevant chunk ranked high, which dilutes context) because they have different fixes. For Viridon we add filter correctness (did a scope like ISO_RTO = CAISO actually apply), because the dangerous failure here is cross-project contamination, which scoped retrieval (KL·H) exists to prevent.
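
For reference, the retrieval metrics named above reduce to a few lines each. This sketch uses binary relevance labels and hypothetical function names; MRR is the mean of `reciprocal_rank` over all queries, and hit-rate is the mean of `hit_at_k`.

```python
import math


def hit_at_k(ranked: list[str], relevant: set[str], k: int) -> bool:
    # Did any relevant chunk make the top-k at all?
    return any(c in relevant for c in ranked[:k])


def context_recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    # Share of the needed chunks that appear in the top-k.
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    # 1 / position of the first relevant chunk (0.0 if none was retrieved).
    for i, c in enumerate(ranked, start=1):
        if c in relevant:
            return 1.0 / i
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    # Rank-weighted gain over all relevant chunks, normalized by the ideal ordering.
    dcg = sum(1.0 / math.log2(i + 1)
              for i, c in enumerate(ranked[:k], start=1) if c in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal else 0.0
```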

An unsupported claim — a hallucination — is a generation-stage failure, measured as faithfulness / groundedness. The key distinction: we don't score "is it true in the world," we score "is every claim entailed by what we retrieved" — the right contract for RAG, because we control the sources. Mechanically we decompose the output into atomic claims and check each against the retrieved context (supported / unsupported / contradicted), plus citation accuracy — does the cited source actually support the claim, which the provenance lineage from 2.3 makes directly checkable. One subtlety: a hallucination is often a retrieval failure in disguise. If context-recall was low, the model filled the gap — so we only call it a generation bug when recall was high and it still invented something. That's why we measure retrieval separately rather than scoring the final answer alone.
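
The scoring and attribution rule in this paragraph can be sketched as below. The per-claim verdicts are assumed to come from an upstream claim checker, and the `recall_floor` of 0.8 used to decide "recall was high" is an illustrative assumption, not a tuned value.

```python
def faithfulness(verdicts: list[str]) -> float:
    # verdicts: one of "supported" | "unsupported" | "contradicted" per atomic claim.
    if not verdicts:
        return 1.0
    return sum(v == "supported" for v in verdicts) / len(verdicts)


def attribute_failure(verdicts: list[str], context_recall: float,
                      recall_floor: float = 0.8) -> str:
    if all(v == "supported" for v in verdicts):
        return "none"
    # Only call it a generation bug when retrieval did its job.
    return "generation" if context_recall >= recall_floor else "retrieval"
```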

Low output quality is the fuzziest and most domain-specific. The generic dimensions are answer relevance and completeness, instruction-following (did it respect "don't touch the boilerplate", length, tone), coherence, and format / schema validity (already enforced by the structured-output validation in 2.2). But the differentiated quality metric for proposal writing is "winning-ness" — scoring whether an edit makes a section more like the sections that have won, built from the selection-report advice in KL·E. That's a quality rubric grounded in what Viridon actually cares about, which no off-the-shelf eval framework gives you.

Around those three sit a broader taxonomy. Because this is agentic, not only RAG (Section 2), we also measure task success / completion, tool-selection accuracy, and trajectory correctness (did it reach the answer for the right reasons, not by luck), alongside operational signals — drift / escalation rate (2.2), latency, and cost. And the single best real-world quality signal for the assistant is human edit-distance / acceptance: how much Erin changes a proposed edit before accepting it. Low edit distance is high quality, measured on live usage with no labeling. The full taxonomy is in the detail below.
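
One inexpensive way to approximate the human edit-distance signal is a word-level similarity ratio from the standard library. This is a proxy, not a true Levenshtein distance, and the `edit_ratio` name is an assumption for this sketch.

```python
from difflib import SequenceMatcher


def edit_ratio(proposed: str, accepted: str) -> float:
    # 0.0 = accepted verbatim; values near 1.0 = almost fully rewritten (word-level).
    matcher = SequenceMatcher(None, proposed.split(), accepted.split())
    return 1.0 - matcher.ratio()
```

Tracked per accepted edit, a falling average suggests the assistant's proposals need less human rework over time.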

Two limitations to be upfront about. Several of these metrics use an LLM as judge, which is itself fallible — so we calibrate it against human labels, reserve it for scale, and keep humans on the high-stakes and subjective calls. And all of it is only as good as the ground truth it's scored against, which is the next question (3.2). Tooling stays in-VPC per the deployment principle: RAGAS-style metric computation (RAGAS is an open-source toolkit for evaluating RAG systems) plus a self-hosted judge model on Bedrock, fed by the execution traces from 2.3.

Where it breaks — failure localization

Each example is checked at three points; the first ✕ is where it breaks, which points to a specific fix. Downstream checks are moot once an upstream stage fails.

Example query | ① Right source? | ② Faithful to source? | ③ Quality output? | Where it breaks → fix
"Sunrise deliverability outcome (CAISO)" | ✓ | ✓ | ✓ | Pass
"Interconnection cost, Project Atlas" | ✕ recall miss | moot | moot | Retrieval: relevant chunk absent → tune rank profile / embeddings
"Deliverability evidence requirements" | ✓ | ✕ unsupported | moot | Generation: recall was high, claim invented → tighten grounding / prompt
"Summarize selection feedback" | ✓ | ✓ | ✕ format | Quality: verbose, ignored format → tune prompt / schema
"Costs for Project X" (scoped) | ✕ wrong filter | moot | moot | Scoping: pulled another project → fix filter (KL·H)
Additional detail: full metrics taxonomy, definitions, judge calibration

The metrics taxonomy

Tier | Metric | Definition | Targets
Retrieval: "did we find the right thing?"
Retrieval | Recall@k / hit-rate | Fraction of queries where a relevant chunk is in the top-k | Wrong source
Retrieval | Context recall | Did we retrieve all the chunks needed to answer | Wrong source (recall)
Retrieval | Context precision | Are relevant chunks ranked above irrelevant ones | Wrong source (precision)
Retrieval | MRR / nDCG | Position of the first / all relevant chunks (rank-weighted) | Wrong source
Retrieval | Filter correctness | Did hard filters (e.g. ISO/RTO) scope correctly | Cross-project contamination
Generation: "did it use what it found honestly?"
Generation | Faithfulness / groundedness | Fraction of output claims entailed by the retrieved context | Hallucination
Generation | Citation accuracy | Does the cited source actually support its claim | Hallucination
Quality: "is the output good?"
Quality | Answer relevance / completeness | Does it answer the question, fully | Low quality
Quality | Instruction-following | Respects constraints (boilerplate, length, tone) | Low quality
Quality | Format / schema validity | Structured output is well-formed (ties to 2.2) | Low quality
Quality | "Winning-ness" | Does an edit make a section more like winning sections (from KL·E) | Low quality (domain)
Agentic & operational: "did the workflow behave?"
Agentic | Task success / completion | Did the workflow achieve the goal end-to-end | End-to-end
Agentic | Tool-selection accuracy | Was the right tool chosen at each step | Process
Agentic | Trajectory correctness | Right steps for the right reasons, not luck | Process
Operational | Drift / escalation rate | How often it goes off-track or needs a human (2.2) | Reliability
Operational | Latency / cost | Speed, tokens, spend per run | Efficiency
Real-world | Human edit-distance / acceptance | How much a user changes a proposed edit before accepting it | Live quality

Localization — three checkpoints, three fixes

  • ① Right source retrieved? → if not, it's a retrieval problem; fix the rank profile, embeddings, or filters. No downstream metric matters until this passes.
  • ② Used faithfully? → relevant only once retrieval passed. An unsupported claim with high context-recall is a genuine generation bug; with low recall it's really a retrieval miss.
  • ③ Quality output? → evaluated last, because a faithful-but-poorly-written or off-format answer is a generation/prompt problem, not a data problem.
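
The three checkpoints amount to "report the first stage that fails". A minimal sketch, with a hypothetical `localize` helper and stage keys chosen for this example:

```python
def localize(checks: dict[str, bool]) -> str:
    # Stages are checked in pipeline order; downstream results are moot
    # once an upstream stage has failed.
    for stage in ("retrieval", "generation", "quality"):
        if not checks[stage]:
            return stage
    return "pass"
```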

LLM-as-judge — calibration & honesty

  • Calibrated against humans — the judge model is validated on a human-labeled sample so we know its agreement rate before we trust it at scale.
  • Reserved for scale — automated judging runs the volume; humans handle high-stakes, subjective, and "winning-ness" calls.
  • Self-hosted — the judge runs on Bedrock in-VPC, fed by the observability traces (2.3), so eval data never leaves Viridon's environment.
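
Judge calibration can be quantified with a raw agreement rate on the human-labeled sample. The sketch below also computes Cohen's kappa, a common chance-corrected agreement measure; using kappa here is our illustrative assumption, since the text above only commits to an agreement rate.

```python
from collections import Counter


def agreement_rate(judge: list[str], human: list[str]) -> float:
    # Fraction of examples where the judge label matches the human label.
    return sum(j == h for j, h in zip(judge, human)) / len(human)


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    n = len(human)
    observed = agreement_rate(judge, human)
    jc, hc = Counter(judge), Counter(human)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(jc[label] * hc[label] for label in set(jc) | set(hc)) / (n * n)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0
```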

Every metric here needs a labeled "right answer" to score against — how we build that ground truth, and how we minimize the SME time it takes, is Section 3.2.

03 — Evaluation

Ground truth

Diligence question 3.2

How do you establish ground truth — the labeled "right answers" evals are scored against — and who builds that set? Where the answer depends on our subject-matter experts, how do you minimize the time required from them?

Ground truth is the real bottleneck in enterprise RAG evaluation, so we treat SME time as the scarce resource we engineer around, not an afterthought. The first move is to stop thinking of "ground truth" as one thing. It has three layers: retrieval truth (which chunks are relevant to a query), answer truth (the correct answer text), and preference / rubric truth (which of two outputs is better, or how an output scores on a rubric). They cost very different amounts to label, so matching the cheapest viable label type to each metric is already a major saving: most retrieval and faithfulness checks need no authored answer at all.

The single biggest unlock for Viridon is that your archive is already a labeled dataset. A library of won proposals and ~200-page selection reports isn't just source material — a winning section is a gold answer for "how should this section read," and a selection report is labeled feedback on what was strong and weak. So a large share of the "right answers" already exist in your corpus; the work is extraction, not authoring. That turns ground truth from a cost you'd carry into an asset you already own.

Around that, we draw labels from the cheapest sources first (the ladder below). Reference-free metrics need zero SME input — faithfulness is checked against the retrieved context, not a gold answer, and schema validity is deterministic. Implicit labels from real usage are free and compounding: every time Erin accepts, edits, or rejects a proposed change, that's a label, and the edit diff tells us how it was wrong. Synthetic generation with human verification handles the rest — an LLM drafts (question, answer, source-chunk) triples from your documents, and the SME's job collapses from authoring to approving or correcting, which is several times faster.
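
To make "approve, don't author" concrete, here is a minimal sketch of how an LLM-drafted triple might move through SME review. Everything in it is illustrative: the `CandidateLabel` record, its field names, the status values, and the `review` function are hypothetical, not an existing schema.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class CandidateLabel:
    question: str
    answer: str
    source_chunk_id: str
    origin: str = "synthetic"  # drafted by an LLM from the documents
    status: str = "pending"    # pending | approved | corrected | rejected


def review(candidate: CandidateLabel, verdict: str,
           corrected_answer: Optional[str] = None) -> CandidateLabel:
    """Apply an SME verdict; the SME never writes a label from scratch."""
    if verdict == "approve":
        return replace(candidate, status="approved")
    if verdict == "correct":
        if not corrected_answer:
            raise ValueError("a correction needs the corrected answer text")
        return replace(candidate, answer=corrected_answer, status="corrected")
    if verdict == "reject":
        return replace(candidate, status="rejected")
    raise ValueError(f"unknown verdict: {verdict}")


def golden_set(candidates: List[CandidateLabel]) -> List[CandidateLabel]:
    """Only human-verified candidates are allowed into the golden set."""
    return [c for c in candidates if c.status in ("approved", "corrected")]
```

Keeping `origin` on the record means synthetic and human-authored labels can be reported separately later.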

Where SMEs are needed, we minimize their time deliberately: approve, don't author (review LLM-drafted labels rather than writing from scratch — the biggest single lever); active learning (we surface the highest-value cases — where the system is uncertain or where the judge and a human disagree — instead of asking them to label at random); a small, stratified golden set plus a large auto-graded set (a few hundred carefully chosen, human-verified examples anchor a calibrated LLM judge that handles the volume); and capture in the natural workflow (a thumbs-up, an accepted edit, or a "this source is wrong" flag in the product is a label given without extra effort). The result is that SME involvement is bounded and front-loaded, and trends toward near-zero as usage-based labels compound.

On who builds it: we build the harness, generate the synthetic candidates, mine the historical corpus, and run and calibrate the judge; your SMEs spend bounded, high-leverage time approving and correcting the golden set and resolving the contested cases; and the product harvests implicit labels continuously. Two caveats. Ground truth isn't static or singular — SMEs disagree and what "wins" shifts as sponsors change — so we measure inter-annotator agreement, version the golden set, and treat it as living rather than a one-time deliverable. And synthetic labels carry a bias risk — an auto-generated test set can be easy in the same ways the system is good, flattering the scores — which we counter by seeding from your real artifacts and keeping a human-authored slice as the hard anchor.

Where labels come from, by SME cost

  • Reference-free metrics (SME cost: none). Faithfulness (vs. retrieved context) & schema validity; no gold answer needed.
  • Implicit from usage (SME cost: none). Accept / edit / reject on proposed changes; free, compounding labels, and the diff shows how it was wrong.
  • Mined from history (SME cost: low). Won proposals = gold answers · selection reports = labeled feedback; already in the archive.
  • Synthetic + SME verify (SME cost: medium). LLM drafts (question, answer, source) triples; SME approves or corrects rather than authoring.
  • SME-authored golden slice (SME cost: high). The hard anchor + contested cases; small, stratified, deliberately bounded.

Most coverage comes from the cheap and free sources at the top; the expensive, SME-authored slice is kept small and high-leverage. As usage grows, the free implicit labels compound and the SME share shrinks further.

Additional detail: the three layers, division of labor, SME-minimization, caveats

Three layers of ground truth

Layer | What's labeled | Typical cost | How we get it
Retrieval truth | Which chunks are relevant to a query | Low | Confirm the source, or bootstrap from a known answer's source chunk
Answer truth | The correct answer text | High | Mine from won proposals; synthetic + verify; small SME-authored anchor
Preference / rubric truth | Which output is better, or its rubric score | Low–medium | A/B preference or rubric, far cheaper than authoring gold; "winning-ness" derived from selection reports (KL·E)

Who builds it — division of labor

Who | Does what
BetterBrain | Builds the eval harness; generates synthetic candidates; mines the historical corpus; runs and calibrates the judge
Viridon SMEs | Bounded, high-leverage time: approve / correct the golden set, resolve contested cases, set the "winning" rubric
The product | Harvests implicit labels continuously (accept / edit / reject), with zero added effort

Minimizing SME time — the techniques

  • Approve, don't author — SMEs review and correct LLM-drafted labels instead of writing from scratch. The biggest single lever.
  • Active learning / prioritization — label the contested cases (system uncertain, or judge-vs-human disagreement), not random samples, for more eval signal per SME-minute.
  • Small golden set + large auto-graded set — a few hundred stratified, human-verified examples calibrate an LLM judge that grades the volume; SME effort is front-loaded and bounded.
  • Capture in the workflow — thumbs, accepted edits, and "wrong source" flags in the product produce labels as a by-product of normal use.
  • Structured elicitation — when SMEs are needed, they get a tight review UI (answer + source + ✓ / ✕ / fix), so it's minutes per item, not hours.
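
The prioritization idea in the list above can be sketched as a simple ranking. This is a minimal illustration under stated assumptions: the judge is assumed to emit a pass probability, `implicit_label` stands in for a usage-derived human signal, and the scoring rule in `review_priority` is a hypothetical heuristic rather than a tuned method.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Case:
    case_id: str
    judge_score: float                     # assumed: judge's pass probability in [0, 1]
    implicit_label: Optional[bool] = None  # human signal from usage, if one exists


def review_priority(case: Case) -> float:
    """Higher means more eval signal per SME-minute (hypothetical heuristic)."""
    # Uncertainty peaks (1.0) when the judge sits at 0.5 and is 0.0 at 0 or 1.
    uncertainty = 1.0 - abs(case.judge_score - 0.5) * 2
    disagreement = 0.0
    if case.implicit_label is not None:
        judge_says_pass = case.judge_score >= 0.5
        disagreement = 1.0 if judge_says_pass != case.implicit_label else 0.0
    return max(uncertainty, disagreement)


def pick_for_review(cases: List[Case], budget: int) -> List[Case]:
    """Spend a fixed SME budget on the most contested cases, not a random sample."""
    return sorted(cases, key=review_priority, reverse=True)[:budget]
```

A confident judge verdict that matches the human signal scores near zero and is left to the auto-graded set.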

Honest caveats

  • Not static or singular — annotators disagree and "winning" drifts as sponsors change; we track inter-annotator agreement, version the golden set, and treat it as living.
  • Synthetic bias — auto-generated tests can flatter the system; we seed from real artifacts and keep a human-authored hard anchor to counter it.

The implicit-from-usage labels are also the input to the self-learning loop in Section 4, and they're the same signal as the human edit-distance metric in 3.1 — ground truth and the feedback loop are two views of the same data.

03 — Evaluation

How evals are run

Diligence question 3.3

How are evals run operationally — an automated pipeline, your team, our team, or a hybrid? What is the division of labor, and what ongoing time commitment would you expect from us? Specifically: when a prompt or model changes, how do you confirm the change did not break existing behavior (regression testing)?

Evals run as a hybrid at three cadences, not one — so the answer to "automated, your team, or ours" is: all three, at different speeds. A fast automated suite runs in CI on every prompt or model change and blocks the merge if it regresses (the regression gate). A comprehensive batch runs nightly and on-demand against the full golden set for the thorough scorecard. And continuous online monitoring scores sampled production traffic on reference-free metrics. Automation does the volume; humans do the judgment.

On division of labor: BetterBrain builds and maintains the pipeline, writes the metrics, owns the CI gate, triages regressions, and calibrates the judge. The automated system does the bulk of the work — CI on every change, the nightly batch, and live monitoring — with no human in the loop. Your SMEs touch only the irreducibly human part: periodic review of the golden set and adjudicating the handful of borderline cases CI surfaces.

On your time commitment — concretely, because you asked: almost all of it is upfront, establishing and ratifying the golden set and the "winning" rubric — on the order of 15–20 SME-hours, front-loaded over the first few weeks, and mostly approve-not-author (per 3.2). After that there is no standing commitment: ongoing involvement is ad-hoc only — when the golden set needs an update because the corpus or sponsor criteria changed — averaging under 30 minutes a month, and trending down further as implicit usage labels compound.

On regression testing: the locked, versioned golden set is the regression suite. On any prompt, model, embedding/index, or tool change, we re-run it and compare to the previous baseline. Because outputs are non-deterministic we don't assert string equality — we gate on metric thresholds and no-regression deltas ("no metric dropped more than N% vs. baseline"). The most actionable technique is A/B diffing: surface only the examples that flipped pass↔fail, so a reviewer looks at the handful that changed, not all 500. And we slice and gate per segment (doc type, ISO/RTO, question type), because a sub-segment can tank while the average stays flat — the silent-degradation trap that aggregate-only eval misses.
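
A minimal sketch of the gate described above, combining an absolute floor with a no-regression delta, evaluated per segment. The function name `gate`, the data layout, the segment names, and the default `max_drop` are assumptions for illustration. Note also that this sketch uses an absolute score drop where the text speaks of a percentage drop.

```python
from typing import Dict, List, Tuple

Scores = Dict[str, Dict[str, float]]  # {segment: {metric: score}}


def gate(baseline: Scores, candidate: Scores,
         floors: Dict[str, float], max_drop: float = 0.02) -> List[Tuple[str, str, str]]:
    """Return every (segment, metric, reason) that should block the merge."""
    failures = []
    for segment, metrics in candidate.items():
        for metric, score in metrics.items():
            floor = floors.get(metric)
            if floor is not None and score < floor:
                failures.append((segment, metric, f"below floor {floor}"))
            previous = baseline.get(segment, {}).get(metric)
            if previous is not None and previous - score > max_drop:
                failures.append((segment, metric,
                                 f"dropped {previous - score:.3f} vs. baseline"))
    return failures


# The average faithfulness is 0.96 in both runs, yet one segment regressed.
baseline = {"CAISO": {"faithfulness": 0.96}, "PJM": {"faithfulness": 0.96}}
candidate = {"CAISO": {"faithfulness": 0.99}, "PJM": {"faithfulness": 0.93}}
blocking = gate(baseline, candidate, floors={"faithfulness": 0.90})
print(blocking)
```

An empty list means the change is clean to ship; anything else blocks and goes to review. Segments present in the baseline but missing from the candidate run are not checked here and would need their own guard.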

Two operational notes. First, the judge model is itself non-deterministic, so an apparent "regression" can be judge noise; we pin the judge's model and prompt versions, average over runs on the golden set, and route borderline flips to a human. Second, we treat eval cases as version-controlled code: the golden set lives in the repo and changes via review, so the suite evolves with the same rigor as the system. The loop closes with 2.3 and 3.2: production failures caught by monitoring are promoted into the golden set, so the regression suite gets harder exactly where the system is weak. Tooling (Promptfoo, Langfuse, or Phoenix) is self-hosted in-VPC per the deployment principle. The full operating model is the plan below.

Artifact

Eval plan — Proposal Writing Assistant

The operating model for the proposal-writing use case: the three cadences, the regression gate, and the SME time budget in one view.

Cadence 1 · automated

CI / pre-merge
Trigger: Prompt, model, embedding/index, or tool change
Runs: Fast regression subset of the golden set, per stage
Metrics: Per-stage thresholds + A/B diff + per-segment slice
Who: Automated, in the pipeline
Blocking: blocks merge on regression

Cadence 2 · scheduled

Offline batch
Trigger: Nightly + on-demand
Runs: Full golden set, all metrics
Metrics: Full taxonomy (3.1) + trend tracking
Who: Automated + BetterBrain review
Scorecard: tracks trends over time

Cadence 3 · continuous

Production monitoring
Trigger: Live, on sampled real traffic
Runs: Reference-free metrics on real queries
Metrics: Faithfulness · retrieval-hit · drift/escalation · latency · cost
Who: Automated, with alerting
Monitor: promotes failures to golden set

Regression gate (on every change): Change (prompt / model / index / tool) → Re-run golden set → Compare vs. baseline (thresholds · A/B diff · per-segment) → Regression: block & review · Clean: ship

Your time · upfront

~15–20 SME-hours

Front-loaded over the first few weeks — establish & ratify the golden set and "winning" rubric (mostly approve-not-author).

Your time · ongoing

< 30 min / month

Ad-hoc only — golden-set updates when the corpus or sponsor criteria change. No standing commitment.

What we measure — proposal-writing scorecard

Rows marked [now] are the highlighted rows.

What we measure | How we measure it · ground truth | Target (illustrative)
Setup: template, variables, change detection
Template coverage | Generated template vs. SME-ratified structure from prior wins | ≥ 95% required sections present
Variable detection (precision / recall) | Detected variables (project_name, sponsor, capacity…) vs. SME-labeled set on held-out proposals | P ≥ 0.95 · R ≥ 0.90
Change-flagging (precision / recall) | Parts flagged "likely to change" vs. SME-labeled actual changes across historical proposal pairs | R ≥ 0.90 · P ≥ 0.80
Variable-fill accuracy | Filled field values vs. the new project brief | ≥ 0.95
Update propagation (precision / recall) | All correct locations updated across 300+ pages, nothing else, vs. labeled change-set | R = 1.0 · P ≥ 0.95
Advice & AI editing
[now] Advice retrieval (recall@k) | Relevant selection-report advice retrieved, vs. labeled edit-context → advice pairs | Recall@5 ≥ 0.90
[now] Scoping / no contamination | Adversarial cross-project queries: does it ever pull another project's data? | 0 cross-project leaks
[now] Recommendation grounding (faithfulness + citation) | Every comment's claim checked against its cited retrieved source | Faithful ≥ 0.95 · Citation ≥ 0.98
Drafted-content grounding | Drafted-section claims checked vs. brief + prior wins | ≥ 0.95
"Winning-ness" of edits | LLM-judge rubric from selection-report advice (KL·E) + SME preference on a sample | Edit improves score in ≥ 80%
[now] Boilerplate preservation | Diff of changed text vs. the change-map; only flagged parts touched | ≥ 0.99 untouched
Real-world & operational
[now] Human acceptance + edit-distance | Live accept / edit / reject on proposed changes (implicit labels) | Acceptance ≥ 70% & rising
Workflow completion | Trace status (2.3): valid template + filled draft, no failed step | ≥ 0.98
[now] Latency · cost · drift | Production monitoring (2.3) | Within budget · escalation only at human gate

Targets are illustrative starting points; the actual thresholds are set from the baseline once the golden set is established (3.2) and become the no-regression bar in CI.
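
For reference, the precision / recall and recall@k figures in the scorecard follow the standard set-based definitions. The sketch below shows them on made-up data. The variable names `cod_date` and `poi`, the example rankings, and the convention of returning 1.0 for an empty set are assumptions for illustration.

```python
from typing import List, Set, Tuple


def precision_recall(predicted: Set[str], labeled: Set[str]) -> Tuple[float, float]:
    """Precision: share of detections that are right. Recall: share of truth found."""
    true_positives = len(predicted & labeled)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(labeled) if labeled else 1.0
    return precision, recall


def recall_at_k(ranked: List[str], relevant: Set[str], k: int = 5) -> float:
    """Share of the relevant items that appear in the top k retrieved results."""
    if not relevant:
        return 1.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


# Variable detection: detected variables vs. an SME-labeled set.
detected = {"project_name", "sponsor", "capacity", "cod_date"}
sme_labeled = {"project_name", "sponsor", "capacity", "poi"}
print(precision_recall(detected, sme_labeled))

# Advice retrieval: one of the two relevant entries is outside the top 5.
print(recall_at_k(["a", "b", "c", "d", "e", "f"], {"a", "f"}, k=5))
```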

Phased rollout — where we start

We don't stand all of this up at once. The highlighted rows are our initial focus — the metrics that most directly safeguard your data, prevent unsupported claims, and reflect real-world usefulness, and that we can put in place quickly. The rest layer in as the golden set and live usage data mature.

Additional detail: regression mechanics, evals-as-code, the flywheel

Regression gate — mechanics

  • Threshold + delta gating — pass requires both an absolute floor (e.g. faithfulness ≥ target) and no-regression vs. baseline (no metric down more than a set delta). No string-equality assertions on stochastic output.
  • A/B pass-flip diff — the report shows only the examples that changed verdict between old and new, in both directions, so review is targeted at what moved.
  • Per-segment gating — metrics are sliced by doc type, ISO/RTO, and question type and gated per slice, so a regression in one segment can't hide behind a flat average.
  • Component + end-to-end — retrieval, generation and the full workflow are regression-tested independently (mirroring the 3.1 localization), so a failure points to a stage.
  • Judge-noise control — judge model and prompt are version-pinned; golden-set scores are averaged over runs; borderline flips go to a human, so noise isn't mistaken for regression.
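
The A/B pass-flip diff from the list above reduces to a small set comparison. A minimal sketch, assuming each run is stored as a mapping from example id to a pass/fail verdict; the function name and the example ids are hypothetical.

```python
from typing import Dict, List


def pass_flips(old: Dict[str, bool], new: Dict[str, bool]) -> Dict[str, List[str]]:
    """List only the examples whose verdict changed, in both directions."""
    shared = old.keys() & new.keys()  # compare only examples present in both runs
    regressed = sorted(i for i in shared if old[i] and not new[i])
    fixed = sorted(i for i in shared if not old[i] and new[i])
    return {"regressed": regressed, "fixed": fixed}


old_run = {"ex-001": True, "ex-002": True, "ex-003": False, "ex-004": True}
new_run = {"ex-001": True, "ex-002": False, "ex-003": True, "ex-004": True}
print(pass_flips(old_run, new_run))
```

A reviewer then reads two short lists instead of the whole golden set.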

Evals as code, and the flywheel

  • Version-controlled — the golden set and metric definitions live in the repo and change via review, evolving with the same rigor as the system.
  • Triggered, not just scheduled — the relevant suite fires on the event that matters (prompt edit, model bump, index change, tool change), tied to the change type.
  • Self-reinforcing — production failures caught by online monitoring (2.3) are promoted into the golden set (3.2), so the regression suite keeps getting harder where the system is weakest.
  • In-VPC tooling — Promptfoo / Langfuse / Phoenix, self-hosted, so eval data never leaves Viridon's environment.

04 — Self-learning & institutional memory

Feedback loop

Diligence question 4.1

How does the system learn from use over time, and what tools or techniques support this? Specifically: what signal is captured (accept/reject, edits to drafts, explicit corrections); is learning applied live/in-session or through a batch "reflection" process (e.g. nightly), and why that cadence; and does this same feedback feed into how you evaluate the system?

The system learns primarily in the knowledge layer, not in the model weights. When Erin edits or rejects a proposed change, we capture the before→after diff and its context, generate a piece of reusable advice from it ("for CAISO deliverability sections, lead with the quantified outcome"), and index that advice so future retrievals surface it. It's the same machinery as the selection-report advice module (KL·E), and the result is inspectable and correctable — you can read, edit, or delete what the system has "learned" (Section 4.3). And it isn't only advice that improves: the same feedback updates other knowledge-layer components in Figure 1 — the entity & concept map / knowledge graph (KL·G), the "what changes" patterns (KL·D), and the glossary (KL·L) — so a correction can fix an entity or a relationship in the graph, not just a piece of guidance.

On cadence, it's both — split by mechanism. Anything that's just retrieval-time context applies live: a correction becomes an indexed advice entry the very next retrieval can pull, with nothing retrained, and the in-session working memory (2.1) already adapts within a task. Anything that involves synthesis or judgment is deferred to a batch / nightly "reflection": clustering many edits into one durable advice entry, resolving contradictions when SMEs disagree, promoting a pattern only once it's been seen several times, and re-ranking which advice is trusted. Why that cadence: a single edit is noisy, and you don't want one idiosyncratic correction to immediately reshape behavior for everyone — the batch step is deliberate noise control, and it's where the degradation guardrail lives (4.5).
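
As a rough illustration of the batch step's promote-after-N rule, the sketch below counts recurring patterns and promotes only those seen often enough. It is deliberately simplified: real consolidation would cluster similar edits, whereas this version treats a pattern as recurring only on an exact match. The `consolidate` function, the `(scope, pattern)` pairs, and the threshold of 3 are assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Pattern = Tuple[str, str]  # (scope, advice pattern), both hypothetical strings


def consolidate(edits: Iterable[Pattern],
                promote_after: int = 3) -> Tuple[List[Pattern], List[Pattern]]:
    """Split the day's patterns into promotion candidates and still-pending ones."""
    counts = Counter(edits)
    promoted = [p for p, n in counts.items() if n >= promote_after]
    pending = [p for p, n in counts.items() if n < promote_after]
    return promoted, pending


todays_edits = [
    ("CAISO/deliverability", "lead with the quantified outcome"),
    ("CAISO/deliverability", "lead with the quantified outcome"),
    ("CAISO/deliverability", "lead with the quantified outcome"),
    ("PJM/interconnection", "drop the acronym glossary"),  # seen once: just noise so far
]
promoted, pending = consolidate(todays_edits)
print(promoted, pending)
```

Promotion candidates would still have to pass the eval gate before going live; that step is outside this sketch.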

The signal we capture is richer than accept/reject. The most valuable is the edit diff itself — the corrected text tells us not just that a suggestion was wrong but how, and it doubles as a free gold label. Around it: explicit accept / reject / thumbs, behavioral signals (used, ignored, asked a follow-up, re-ran the search), and explicit corrections ("this source is wrong", "this advice doesn't apply here") — the highest-value, lowest-volume signal.
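
A minimal sketch of capturing the edit diff and a normalized edit-distance from a proposed and an accepted text, using Python's standard `difflib`. The word-level similarity ratio used here is a stand-in, not a true Levenshtein distance, and the `edit_signal` function and its output keys are hypothetical.

```python
import difflib
from typing import Dict, List, Union


def edit_signal(proposed: str, accepted: str) -> Dict[str, Union[float, List[str]]]:
    """Capture how far the user moved the suggestion, and exactly what changed."""
    # Word-level similarity in [0, 1]; 1.0 means the suggestion was accepted as-is.
    similarity = difflib.SequenceMatcher(
        None, proposed.split(), accepted.split()).ratio()
    diff = list(difflib.unified_diff(
        proposed.splitlines(), accepted.splitlines(),
        fromfile="proposed", tofile="accepted", lineterm=""))
    return {"edit_distance": 1.0 - similarity, "diff": diff}


signal = edit_signal("We expect strong results.",
                     "We delivered 120 MW on schedule.")
print(signal["edit_distance"])
print("\n".join(signal["diff"]))
```

The `diff` lines are the part that carries the "how it was wrong" signal; the scalar is only a summary for trend tracking.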

On reinforcement learning, we use the framing deliberately, not the heavy machinery. We treat acceptance as a reward signal and optimize the policy that decides which advice and which retrieval configuration to surface, not the model weights. Two genuinely reinforcement-style mechanisms are on our roadmap. (1) A system that automatically learns which advice and which result-ranking to surface: it tries different options, watches which ones lead Erin to accept the suggestion, and shifts toward the ones that work. This is essentially self-tuning A/B testing that runs continuously, fully in-VPC, with no model weights touched. (2) Automatic tuning of the prompts and examples against the eval metrics, so they're optimized by measurement rather than by hand. We explicitly do not fine-tune model weights: it would break the inspectability we're selling, complicate the in-VPC deployment, make evals harder, and be the wrong investment at this corpus size. What we do instead is accumulate the accept/reject preference pairs as an asset: the dataset that would make fine-tuning possible later, to be spent only if the eval gain ever warrants it.
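
The "self-tuning A/B testing" idea can be illustrated with an epsilon-greedy bandit, one of the simplest schemes of this kind. This is an assumption for illustration only: the text does not name an algorithm, and the `AdviceBandit` class, its arm names, and the `epsilon` value are all hypothetical.

```python
import random
from typing import Dict, List, Optional


class AdviceBandit:
    """Epsilon-greedy choice among advice / ranking variants; no model weights involved."""

    def __init__(self, arms: List[str], epsilon: float = 0.1,
                 seed: Optional[int] = None) -> None:
        self.shown: Dict[str, int] = {arm: 0 for arm in arms}
        self.accepted: Dict[str, int] = {arm: 0 for arm in arms}
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def acceptance_rate(self, arm: str) -> float:
        return self.accepted[arm] / self.shown[arm] if self.shown[arm] else 0.0

    def choose(self) -> str:
        untried = [arm for arm, n in self.shown.items() if n == 0]
        if untried:
            return untried[0]  # try every variant at least once
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.shown))  # explore occasionally
        return max(self.shown, key=self.acceptance_rate)  # otherwise exploit

    def record(self, arm: str, was_accepted: bool) -> None:
        """Acceptance is the reward signal."""
        self.shown[arm] += 1
        self.accepted[arm] += int(was_accepted)
```

In use, `choose()` picks the variant to surface and `record()` logs whether the user accepted the resulting suggestion, so the selection drifts toward whatever gets accepted.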

And yes — the same signal feeds evaluation. Every accept / edit / reject is simultaneously a learning signal and an eval label (the human-edit-distance metric in 3.1, the implicit labels in 3.2). That coupling is also the safeguard — a new or updated advice entry is promoted only if it does not regress the golden set, so the feedback loop and the eval loop are the same flywheel: usage → advice → eval-gated promotion → better retrieval → more usage. High-impact advice can require human approval before it goes live, so what the system learns stays governed.

Use (accept / edit / reject) → Capture (diff + context) → Live (index advice) → Batch (consolidate + eval-gate) → Better retrieval ↻
Live advice is available to the very next retrieval; batch promotion is eval-gated, so errors aren't reinforced. The feedback loop and the eval loop are one flywheel.

Additional detail: signal taxonomy, live vs. batch, the RL spectrum

Signal captured

Signal | What it tells us | Type
Edit diff (before → after) | How a suggestion was wrong; the corrected text is a free gold label | Implicit · richest
Accept / reject / thumbs | Coarse good / bad on a proposed change | Explicit
Behavioral | Used, ignored, asked a follow-up, re-ran the search | Implicit
Explicit correction | "This source is wrong" · "this advice doesn't apply here" | Explicit · highest-value

Live vs. batch — split by mechanism

Mechanism | Cadence | Why
Index a correction as advice | Live (in-session) | Retrieval-time context, no retraining; the next retrieval can use it
Same-session adaptation (working memory) | Live | Within-task only; resets per run (2.1)
Consolidate many edits into durable advice | Batch (nightly) | Dedup + synthesis; one edit is noisy
Resolve contradictions / promote after N | Batch | Judgment + noise control
Re-rank which advice is trusted | Batch | Needs aggregate signal
Promote a new/updated entry | Batch | Eval-gated; must not regress the golden set (4.5)

The reinforcement-learning spectrum — what we do and don't

Approach | Stance
Knowledge-layer learning (edit → advice → index) | core · live: The primary mechanism; inspectable and correctable
Acceptance as a reward signal over the retrieval / advice policy | yes: Reinforcement-style, no weight changes
Auto-learn which advice / ranking to surface | roadmap: Tries options and favors the ones that get accepted (continuous, self-tuning A/B testing); in-VPC, no weights touched
Auto-tune the prompts and examples vs. eval metrics | roadmap: Optimized by measurement instead of by hand
Accumulate accept/reject preference pairs | yes: Banked as an asset; spent only if evals justify
Fine-tune model weights (RLHF) | not planned: Breaks inspectability + in-VPC simplicity; wrong investment at this scale

04 — Self-learning & institutional memory

What changes when it learns

Diligence question 4.2

When the system "learns," what concretely changes — retrieval ranking, prompts, a memory/concept store, model weights, or something else?

In one line: what changes is data and configuration, not the model. The core mechanism is the concept / advice store — learning adds, updates, and re-ranks indexed advice entries and the entity & concept map (the knowledge graph), along with the other knowledge-layer modules in Figure 1, which together are the institutional memory we cover in 4.3. Everything else follows from that: because new advice is indexed, retrieval surfaces different, better-grounded context, and on the roadmap the ranking itself self-tunes toward what gets accepted. Prompts and examples are auto-tuned against the eval metrics (roadmap), not edited per interaction. Session memory adapts live within a task and resets per run (2.1). And the eval golden set itself grows as production failures are promoted into it — so the system's measurement improves alongside its behavior. Model weights do not change, by design.

Candidate | Changes? | What changes
Memory / concept store | Yes (primary) | Advice entries added / updated / re-ranked: the institutional memory (4.3)
Concept / entity graph (KL·G) | Yes | New or corrected entities & links: projects, customers, ISO/RTOs, terms (Figure 1)
Retrieval ranking | Yes | New advice changes what's surfaced; ranking self-tunes by acceptance (roadmap)
Prompts / examples | Yes (roadmap) | Auto-tuned against eval metrics, not edited per interaction
Session / working memory | Yes (live) | Within-task adaptation; resets per run (2.1)
Eval golden set | Yes | Grows as production failures are promoted (3.2 / 4.5)
Model weights | No | Unchanged by design: inspectability, in-VPC, eval simplicity

04 — Self-learning & institutional memory

Concept layer

Diligence question 4.3

What accumulates as institutional memory, and in what form? Specifically: is it human-readable, auditable, and correctable — can we inspect and fix what the system "knows"?

What accumulates is structured, human-readable knowledge — not opaque vectors or model weights. The institutional memory is a set of concept and advice entries in the knowledge layer: the advice mined from edits and selection reports (KL·E), the entity and concept map of customers, projects, ISO/RTOs and terms and how they link (KL·G), the "what changes" patterns (KL·D), and the onboarding glossary (KL·L). Each is a record you can read in plain language, not a number in a tensor.

In form, every entry is a structured record: a plain-language statement of the knowledge, the scope it applies to, its provenance (which edits and sources produced it), confidence and usage stats, a status, and a version history. The anatomy is below — it reads like a note with an audit trail.

To the heart of your question — yes, all of it is human-readable, auditable, and correctable. Everything the system learns is exposed to you in plain language and is fully editable: you can inspect any entry and trace it back to the edits and sources that produced it (the provenance from 2.3), correct or rewrite it, and disable, delete, or pin it. Because learning lives in the concept store and not in the weights (4.2), the entire learned state is open to inspection and repair — there is no hidden knowledge baked into a model you can't read. A curation view makes this a first-class surface (4.4).

Auditability comes from the same structure: each entry carries provenance (what created it, when), version history (what changed), and usage stats (how often it's retrieved and applied, and how often accepted) — so you can audit not just what it knows but why it knows it and whether it's actually being used. A correction simply becomes the new entry (eval-gated before it's trusted, per 4.5). And because the whole memory is a readable, ownable artifact rather than a black box, it transfers with the company — the same owned-asset principle that runs through the architecture.

Anatomy of a concept entry

Field | What it holds | Illustrative
Statement | The knowledge, in plain language | "For CAISO deliverability sections, lead with the quantified outcome"
Scope | Where it applies | ISO_RTO = CAISO · proposal §deliverability
Provenance | What produced it | Derived from 3 accepted edits + selection report p.142
Confidence / usage | How trusted, how often used | Seen 7× · applied 23× · 91% accepted
Priority | Optional manual weight; overrides auto-confidence to set precedence | Normal · High · Critical (e.g. pin a must-follow rule)
Status | Active / disabled / pinned | Active
Version history | What changed & when | v2: broadened from "Sunrise" to all CAISO (Apr 2026)
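
One way to picture the anatomy above is as a plain record type. The sketch below is illustrative only: the `ConceptEntry` class, its field names and types, and the `revise` method are hypothetical, not the stored schema. The values come from the illustrative column, except `accepted=21`, which is an assumed count chosen so that 21 of 23 matches the roughly 91% shown.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ConceptEntry:
    statement: str                  # the knowledge, in plain language
    scope: Dict[str, str]           # where it applies
    provenance: List[str]           # what produced it
    seen: int = 0
    applied: int = 0
    accepted: int = 0
    priority: str = "Normal"        # Normal | High | Critical
    status: str = "Active"          # Active | disabled | pinned
    history: List[str] = field(default_factory=list)

    @property
    def acceptance_rate(self) -> float:
        return self.accepted / self.applied if self.applied else 0.0

    def revise(self, new_statement: str, note: str) -> None:
        """A correction becomes the new entry; the old text stays in the history."""
        version = len(self.history) + 2  # the original statement counts as v1
        self.history.append(f"v{version}: {note} (was: {self.statement!r})")
        self.statement = new_statement


entry = ConceptEntry(
    statement="For CAISO deliverability sections, lead with the quantified outcome",
    scope={"ISO_RTO": "CAISO", "section": "deliverability"},
    provenance=["3 accepted edits", "selection report p.142"],
    seen=7, applied=23, accepted=21,  # 21 is an assumed count (about 91% of 23)
)
print(round(entry.acceptance_rate, 2))
```

Because every field is ordinary data, inspecting, correcting, disabling, or rolling back an entry is an edit to a record rather than a change to a model.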

A fully populated example entry and the end-to-end learning loop are provided as the Section 4 artifact.

Additional detail: what accumulates, and what you can do to any entry

What accumulates as institutional memory

Memory | Form | Source
Advice entries (KL·E) | Plain-language guidance with scope & provenance | Mined from accepted / rejected edits + selection reports
Concept & entity map (KL·G) | Entities (customers, projects, ISO/RTOs, terms) and the links between them | Ingestion, usage + corrections
"What changes" patterns (KL·D) | Which parts of a proposal historically need changing | Cross-proposal history + accepted edits
Onboarding glossary (KL·L) | Company context, terminology, how concepts connect | Corpus + curation

What you can do to any entry

  • Inspect — read the entry and its provenance; trace it to the edits and sources behind it.
  • Correct / rewrite — fix the statement or narrow / broaden its scope; the correction becomes the new entry.
  • Disable / delete — turn off or remove anything that's wrong or no longer applies.
  • Prioritize / pin — set an entry's priority (Normal / High / Critical) or pin it as authoritative so it's preferred in retrieval.
  • Govern — high-impact entries can require approval before they go live (4.1), and changes are eval-gated before they're trusted (4.5).

All of this runs through a curation surface — the upkeep model is Section 4.4.

04 — Self-learning & institutional memory

Upkeep

Diligence question 4.4

How much of this is automated versus requiring human curation, and by whom?

Upkeep is almost entirely automated. The loop in 4.1 does the work with no person in it — generating advice and entity/graph updates from edits, indexing, consolidating and de-duplicating them, scoring confidence, promoting a pattern only once it's been seen enough times, and re-ranking which knowledge is trusted. Humans do not author or maintain the memory by hand.

The human role is oversight by exception. The one thing worth watching for is an incorrect deduction — the system over-generalizing a one-off edit into a rule that shouldn't apply broadly. When that happens, a reviewer corrects, narrows, disables, or deletes the entry (or adds one directly to teach something) through the curation surface (4.3). It's review-and-fix when something looks off, not continuous curation.

By whom, and how little: the curation is done by an SME or power user (e.g. Erin) for the proposal domain, with BetterBrain monitoring the memory's overall health and tuning the loop. The burden stays low because the guardrails catch most bad deductions before they ever reach a human — confidence thresholds, promote-after-N, eval-gating (4.5), and approval on high-impact entries (4.1) — so what reaches manual review is the genuine exceptions. This is consistent with the eval upkeep budget in 3.3: bounded and ad-hoc, well below a standing commitment.

Task | Owner
Generate advice + entity/graph updates from edits & corrections | Automated
Index, consolidate, dedup, re-rank; score confidence; promote-after-N | Automated
Eval-gate changes before they're trusted (4.5) | Automated
Surface low-confidence / contested entries for review | Automated → flags to humans
Review flagged entries; correct / narrow over-general deductions | Human (SME / power user)
Add, edit, disable, delete, prioritize entries as needed | Human (SME / power user)
Monitor memory health; tune the loop | BetterBrain

04 — Self-learning & institutional memory

Preventing degradation

Diligence question 4.5

How do you prevent a feedback loop from reinforcing errors or degrading the system over time?

The core risk in any feedback loop is an echo chamber: the system learns from its own outputs and reinforces its own mistakes until errors quietly become "truth." We prevent that with defense-in-depth — multiple independent safeguards, so that if one misses a problem the next one catches it — and one structural choice does most of the work.

That choice: learning is measured against an independent anchor, not against the model's own recent behavior. A new or updated entry is promoted only if it doesn't regress the golden set (3.3), and the golden set is anchored in external truth — selection reports plus a human-authored slice (3.2) — not in whatever the model did lately. So the loop can't drift to merely agree with itself.

On top of that, noise control and negative signal: we consolidate in batch and promote only after a pattern recurs (promote-after-N), so one idiosyncratic edit can't reshape behavior, and contradictions are resolved rather than stacked (4.1). And we learn from rejections and heavy edits, not just acceptances — so the loop isn't one-sidedly reinforcing what it already does.

Then catch and contain: per-segment regression gating and online monitoring (2.3, 3.3) catch degradation even when the aggregate looks fine. And because the memory is data, not model weights (4.1), with full provenance and version history (4.3), a bad deduction is traceable, reversible, and deletable — the blast radius is bounded and nothing compounds silently. High-impact changes can be rolled out gradually — applied to a small slice first (a "canary") and widened only if it holds — before they're fully trusted.
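
A gradual ("canary") rollout is often implemented with stable hashing, so the same unit always lands in the same slice as the percentage widens. The sketch below is one such approach, offered as an assumption rather than a description of the actual mechanism; `in_canary`, the id formats, and the choice of SHA-256 are hypothetical.

```python
import hashlib


def in_canary(unit_id: str, entry_id: str, percent: int) -> bool:
    """Deterministically place a unit (e.g. a project or user) in the canary slice."""
    digest = hashlib.sha256(f"{entry_id}:{unit_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    return bucket < percent


# Widening from 5% to 25% only adds units; nobody already in the slice drops out.
units = [f"project-{i}" for i in range(200)]
at_5 = {u for u in units if in_canary(u, "advice-123", 5)}
at_25 = {u for u in units if in_canary(u, "advice-123", 25)}
print(len(at_5), len(at_25))
```

Hashing on the entry id as well means each new entry gets a different canary slice, so no single project absorbs every experiment.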

Finally, knowledge also degrades by going stale — a sponsor's criteria shift, an ISO/RTO rule changes. Entries carry recency and versioning, older ones decay or get re-validated, and periodic re-evaluation keeps the memory current. Taken together, errors can't quietly become truth, drift is caught against an external reference, and the whole memory stays inspectable and reversible.
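
As one possible reading of "older ones decay or get re-validated", the sketch below applies an exponential half-life to an entry's confidence and flags it for re-validation once it falls below a threshold. The decay form, the 180-day half-life, the 0.5 threshold, and both function names are assumptions for illustration.

```python
def decayed_confidence(confidence: float, age_days: float,
                       half_life_days: float = 180.0) -> float:
    """Halve an entry's confidence every `half_life_days` since its last validation."""
    return confidence * 0.5 ** (age_days / half_life_days)


def needs_revalidation(confidence: float, age_days: float,
                       threshold: float = 0.5,
                       half_life_days: float = 180.0) -> bool:
    """Flag an entry once its decayed confidence falls below the threshold."""
    return decayed_confidence(confidence, age_days, half_life_days) < threshold


print(decayed_confidence(0.9, age_days=180))
print(needs_revalidation(0.9, age_days=30), needs_revalidation(0.9, age_days=180))
```

Re-validating an entry would reset its age to zero, so knowledge that is still confirmed in use never decays out.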

Failure mode | Safeguard
Errors reinforced as "truth" | Promotion eval-gated against an independent golden set; promote-after-N + confidence thresholds (3.3, 4.1)
Echo chamber (learns only from its own outputs) | Anchored to external truth: selection reports + human-authored slice (3.2); rejections & corrections weighted, not just acceptances
One noisy edit reshapes behavior | Batch consolidation; promote-after-N; contradictions resolved, not stacked (4.1)
Silent / per-segment degradation | Per-segment regression gating + online monitoring (3.3, 2.3)
A bad entry compounds invisibly | Full provenance + version history; inspect / disable / delete (4.3); bounded blast radius
Degradation baked in irreversibly | Learning is data, not weights (4.1); reversible and roll-back-able
A bad change ships widely | CI eval-gate + gradual / canary rollout before full trust
Stale knowledge (criteria change) | Recency / versioning; decay or re-validation of old entries; periodic re-eval

Artifact

The learning loop & a memory entry

What is stored, where it lives, and on what cadence it updates — followed by a fully populated example of a single concept/memory entry.

1 · Use
Erin works
Accepts, edits, or rejects a proposed change in the editor
2 · Capture
Signal logged
The edit diff + context is recorded — the richest signal
accept · edit diff · reject · explicit correction
3 · Update memory
Live path: index the correction as advice (KL·E). Usable by the very next retrieval: minutes, not hours.
Nightly path: consolidate · dedup · promote-after-N · re-rank trust (KL·E, KL·G, KL·D, KL·L)
Both paths feed the updated memory.
Eval-gate · must not regress golden set
Low-confidence / contested → flagged for SME review (correct · disable · pin)
4 · Propose
Better-grounded suggestion
Improved memory surfaces in the next comment or draft — assistant proposes, Erin disposes
External anchor: golden set grounded in selection reports + human-authored slice (3.2) — the loop measures against independent truth, not its own recent outputs

Rejections & heavy edits weighted, not just acceptances · production failures grow the golden set · feedback loop and eval loop are one flywheel
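The capture step can be sketched with Python's standard `difflib`, assuming a signal stores the action, a unified diff, a rough measure of how heavy the edit was, and free-form context. `CapturedSignal`, `capture`, and `change_ratio` are hypothetical names, and the similarity-based heaviness measure is one possible choice, not the confirmed one.

```python
import difflib
from dataclasses import dataclass
from typing import Dict, List

ACTIONS = {"accept", "edit", "reject", "correction"}


@dataclass(frozen=True)
class CapturedSignal:
    action: str
    diff: List[str]        # unified diff between proposed and final text
    change_ratio: float    # 0.0 = untouched, 1.0 = nothing of the proposal kept
    context: Dict[str, str]


def capture(action: str, proposed: str, final: str,
            context: Dict[str, str]) -> CapturedSignal:
    """Record one user action together with the edit diff and its context."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    diff = list(difflib.unified_diff(
        proposed.splitlines(), final.splitlines(),
        fromfile="proposed", tofile="final", lineterm=""))
    similarity = difflib.SequenceMatcher(None, proposed, final).ratio()
    return CapturedSignal(action, diff, 1.0 - similarity, context)


proposed = "We describe the methodology.\nThe outcome is 120 MW deliverable."
edited = "The outcome is 120 MW deliverable.\nWe describe the methodology."
ctx = {"ISO_RTO": "CAISO", "section": "Deliverability"}

accepted = capture("accept", proposed, proposed, ctx)
heavy_edit = capture("edit", proposed, edited, ctx)
rejected = capture("reject", proposed, "", ctx)

print(accepted.change_ratio, rejected.change_ratio)
print("\n".join(heavy_edit.diff))
```

Storing `change_ratio` alongside the diff lets the nightly path treat a heavy edit as a stronger signal than a light touch-up.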

What's stored, where, and on what cadence

What's stored | Where (Figure 1) | Cadence
Raw signal: accept / edit / reject + diff + context | Signal log | Live, on every action
Advice entries (guidance from edits + selection reports) | KL·E advice store | Live to index · nightly to consolidate
Concept & entity links | KL·G knowledge graph | Nightly (+ live for direct corrections)
"What changes" patterns | KL·D | Nightly
Glossary / terms | KL·L | Nightly / on curation
Trust & ranking of advice | KL·E + reranker (KL·H) | Nightly, eval-gated
Preference pairs (accept / reject) | Eval + training-data bank | Live append; spent only if evals justify
Eval golden set | Eval store | Grows as production failures are promoted

Example — a single concept / memory entry

↳ Produced by step 3 (nightly consolidation of 3 edits + selection report p.142), eval-gated to v2 — this is what the loop outputs into KL·E

ADV-0427 · Priority: High · Active
"For CAISO deliverability sections, lead with the quantified deliverability outcome before the methodology."
Scope
ISO_RTO = CAISO · doc_type = proposal · section = Deliverability
Confidence / usage
Seen 7× · applied 23× · 91% accepted
Provenance
Derived from 3 accepted edits (Sunrise, Aspen, Vega bids) + selection report p.142 — "evaluators rewarded proposals that quantified the deliverability outcome upfront."
Priority
High — set manually by Erin; overrides auto-confidence so it's always preferred
Status
Active · v2 · updated Apr 2026
Version history
v1 (Feb 2026) — scoped to the Sunrise bid only · v2 (Apr 2026) — broadened to all CAISO after the pattern recurred across 3 bids (nightly consolidation, eval-gated)
Actions: Edit · Narrow scope · Disable · Delete · View provenance
Every field is editable: this is what the system "knows", in full.

This is the institutional memory in concrete form: human-readable, traceable to the edits and sources that produced it, prioritizable, and correctable — and it lives in your platform as an owned asset, not a black box.
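Because the memory is data rather than model weights, an entry like ADV-0427 can be modeled as a plain record. The sketch below populates one from the example above; the class and field names, the `bid` scope key for v1, and the `rollback` / `disable` methods are hypothetical, shown only to make "traceable, reversible, and deletable" concrete.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Version:
    number: int
    date: str
    scope: Dict[str, str]
    note: str


@dataclass
class MemoryEntry:
    entry_id: str
    text: str
    provenance: List[str]
    versions: List[Version]
    status: str = "Active"
    manual_priority: Optional[str] = None  # set by a person; overrides auto-confidence

    @property
    def current(self) -> Version:
        return self.versions[-1]

    def rollback(self) -> Version:
        """Drop the latest version and fall back to the previous one."""
        if len(self.versions) < 2:
            raise ValueError("no earlier version to roll back to")
        return self.versions.pop()

    def disable(self) -> None:
        self.status = "Disabled"


entry = MemoryEntry(
    entry_id="ADV-0427",
    text=("For CAISO deliverability sections, lead with the quantified "
          "deliverability outcome before the methodology."),
    provenance=["accepted edit: Sunrise bid", "accepted edit: Aspen bid",
                "accepted edit: Vega bid", "selection report p.142"],
    versions=[
        # "bid" is a hypothetical scope key for the v1 single-bid scope.
        Version(1, "Feb 2026", {"bid": "Sunrise"}, "scoped to the Sunrise bid only"),
        Version(2, "Apr 2026",
                {"ISO_RTO": "CAISO", "doc_type": "proposal", "section": "Deliverability"},
                "broadened to all CAISO after the pattern recurred across 3 bids"),
    ],
    manual_priority="High",
)

print(entry.current.number, entry.current.scope)
```

Rolling back is a list operation on the version history, not a retraining job, which is what keeps the blast radius of a bad deduction bounded.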