BetterBrain × Viridon — Architecture Diagram

The architecture

The whole picture, end to end.

Here's how we think about the architecture. Your documents feed a Knowledge Layer made of many components. Tools and an orchestrator draw on those components, many-to-many. The orchestrator chains tools into workflows. Workflows compose into the mini-apps your teams use. Color shows what we bring vs. what we build net-new, but everything on this diagram is fully customizable: every tool, every knowledge-layer component, the workflows, the orchestrator, and the mini-apps. Even the pieces we've built before are customized and tuned to exactly what Viridon needs.

Flow ↑ Sources (bottom) → Knowledge Layer → Tools → Workflows → Mini-apps (top)


Viridon Sources

your data, owned by you

SOURCE: SharePoint · proposals · RFIs · contracts · ISO/RTO public docs. Stays in your tenant. Ingested into the knowledge layer; never used to train other models.

feeds ▲ ingested & structured from ▲

Knowledge Layer

far more than indexing & retrieval — many components

KL·A · all apps · Indexing & Retrieval: Hybrid BM25 + vector (Vespa). This is all Glean gives you.

KL·B · all apps · Ingestion & Structuring: Any doc type → parsed, chunked, tagged, structured records

KL·C · Section & Question Extraction: Break proposals into sections / RFP questions

KL·D · "What Changes" Map: Which parts of a past proposal likely need changing this cycle

KL·E · Selection-Report Advice: Win/loss themes mined from ~200-pg sponsor reports

KL·F · Template Generation: Auto-built proposal templates from past wins

KL·G · all apps · Concept Mapping & Linkages: Concepts, customers, projects, terms — and how they link

KL·H · all apps · Scoped Retrieval + Reranking: Project / client / global scope, RBAC-aware, reranked

KL·I · bespoke · Public-Doc Enrichment: ISO/RTO transmission plans, deliverability studies (for you)

KL·J · bespoke · RFI Q&A + SME Delegation: Prior RFI answers & who owned what last time (for you)

KL·K · bespoke · Standard-Terms Playbook: Viridon's clause library & acceptable positions (for you)

KL·L · bespoke · Onboarding Glossary: Company context, tutorials & how concepts connect (for you)

powers ▲ tools & orchestrator pull components ▲

Orchestrator & Tools

both read the knowledge layer · marked tools serve every app

T·Q · all apps · Grounded Q&A / chat: Cited answers over the knowledge layer — the MCP entry point for every team

T1 · Read & comment on a paragraph: Suggests improvements vs. selection-report themes

T2 · Draft a section: From template + structured prior wins

T3 · Identify opportunities: Where to differentiate this bid

T4 · Flow updates across 300+ pages: Numbers, vendors, names everywhere

T5 · Evaluate against criteria: Score a draft vs. what wins

T6 · Aggregate & match attachments: SME reports into one narrative voice

T7 · all apps · Web research & scrape: Live external + public-doc context

T8 · Build a template: Auto-derive from past proposals

T9 · SME router: Likely owner from past delegation patterns

T10 · RFI tracker: Auto-populate items & assign owners

T11 · Clause & field extractor: Counterparty, dates, term, obligations

T12 · Screen vs. standard terms: Flag only what needs human review

T13 · Contract tracker: Repository of every NDA & agreement

chains ▲ orchestrator chains tools ▲

Workflows

orchestrated or deterministic chains of tools

WF · proposal · Setup: Template + flag what's outdated

WF · proposal · Strategy: Find angles & framing

WF · proposal · Drafting: AI teammate drafts & pulls in attachments

WF · proposal · Evaluation: Review, comment, propose edits

WF · RFI · Intake & match: Parse questions · find prior Q&A

WF · RFI · Draft responses: From proposal, SME reports & past RFIs

WF · RFI · Route & track: Assign SME owners · populate tracker

WF · legal · Ingest & extract: Pull fields from every contract into tracker

WF · legal · Screen incoming: Compare NDA vs. standard terms · flag exceptions

WF · ISO/RTO · SME Q&A: Public docs + project history for a customer

WF · onboarding · New-joiner chat: Company context, people, terminology

composes ▲ composed of ▲

Mini-apps

what each team touches

APP·1 · Proposal Writing Assistant: Origination · Erin

APP·2 · RFI Response Drafter: Origination

APP·3 · Legal Contract Screener: Legal (2-person)

APP·4 · ISO / RTO SME + Onboarding: All teams

Figure 1. High-level overview of the architecture — what's built today and what's net-new for Viridon.

Two things the colors say. First, the components marked all apps — retrieval, ingestion, concept linkages, scoped retrieval, grounded Q&A, web research — are shared infrastructure every mini-app reuses. Second, almost everything is foundation we've built and customize to you; the bespoke pieces — app-specific tools (T1, T3, T4, T6, T10, T12) and knowledge-layer modules KL·I–L — are where we spend the saved time.

Deployment & security principle

Open-source and in-VPC only — nothing leaves Viridon's AWS environment.

We've deliberately constrained the entire stack to components that are either open-source software or managed AWS services reached privately from within Viridon's own account. Parsing, the knowledge layer, the Vespa index, and orchestration all run inside Viridon's VPC — its own private, isolated network within AWS — on EKS / EC2 (Amazon's managed compute); model access goes through Amazon Bedrock (AWS's managed service for running foundation models) over AWS PrivateLink (a private connection that keeps traffic off the public internet). No documents or queries are sent to a third-party SaaS over the public internet, and nothing is locked to a proprietary cloud that we control. The result is a self-contained, owned asset that transfers with the company — the through-line of the whole architecture.

Open-source core · Self-hosted in VPC · Bedrock via PrivateLink · No third-party SaaS egress · No vendor cloud lock-in

01 — Ingestion · Indexing · Retrieval

Parsing

Diligence question 1.1

How would you parse our source documents, and what tools or platforms would you use and why? Specifically: how do you handle complex tables — merged cells, multi-page tables, nested or irregular layouts?

We parse Viridon's corpus with layout-aware parsing rather than naive text extraction, and the distinction matters specifically because of what your documents are: 300+ page proposals with 100+ attachments, ~200-page selection reports, embedded tables and figures, mixed formatting from many authors and source systems. Copying a PDF's text layer and splitting on line breaks scrambles multi-column reading order, flattens tables into meaningless runs of numbers, and silently drops every figure. Instead we render each page, detect its regions with a layout model, recover table structure cell-by-cell, OCR anything without a text layer, and run a vision-language model over figures so that content carried in images becomes searchable text.

The backbone is the Unstructured library, which turns PDF, DOCX, XLSX, and HTML into typed elements — titles, narrative text, tables, images — each carrying its text, metadata, and a bounding box locating it on the page. Three models do the heavy lifting underneath: a layout-detection model (yolox) that finds regions and preserves reading order on complex pages; a table-transformer that recovers row and column structure, including merged cells, and emits each table as addressable HTML; and OCR (optical character recognition — turning page images into machine-readable text) for scanned pages and image regions. We route each document to the right parsing strategy — high-resolution layout-plus-OCR for the messy real-world proposals, a cheaper native-text path for clean digital PDFs, and full-page OCR for scans — rather than forcing a single mode across the corpus.

For complex tables, merged cells and spans are recovered by the table-transformer and preserved all the way through to per-cell records, so a value like "$42M" stays bound to its row ("Phase 2 capex") and column ("2027 estimate"). Nested and irregular tables that defeat structure inference degrade gracefully: we retry without forcing a grid and keep the table as both an image and a VLM-written description, so content is never lost even when the exact grid can't be recovered. Multi-page tables are the one piece that's net-new for Viridon — because we split documents per page for throughput, a table spanning two pages first appears as two elements, and we add a reconciliation pass that detects the continuation (matching headers and column count, no intervening narrative text) and stitches them into one logical table before indexing.
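To illustrate the span handling described above, here is a minimal sketch of expanding merged cells into per-cell records. It assumes the table arrives as HTML with rowspan/colspan attributes, that the first row is the header, and that the first column holds row labels. To stay self-contained it uses Python's standard html.parser rather than pandas; expand_spans, table_html_to_records, and the record keys row_label / col_label are hypothetical names, not the pipeline's own.

```python
from html.parser import HTMLParser


class _CellCollector(HTMLParser):
    """Collects text, rowspan and colspan for every <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of cell dicts per <tr>
        self._cell = None   # cell currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._cell = {
                "text": "",
                "rowspan": int(a.get("rowspan") or 1),
                "colspan": int(a.get("colspan") or 1),
            }

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._cell["text"] = self._cell["text"].strip()
            self.rows[-1].append(self._cell)
            self._cell = None


def expand_spans(table_html):
    """Return a dense grid (list of rows) with spanned values repeated."""
    parser = _CellCollector()
    parser.feed(table_html)
    grid = {}  # (row, col) -> text
    for r, row in enumerate(parser.rows):
        c = 0
        for cell in row:
            while (r, c) in grid:  # slot already filled by an earlier rowspan
                c += 1
            for dr in range(cell["rowspan"]):
                for dc in range(cell["colspan"]):
                    grid[(r + dr, c + dc)] = cell["text"]
            c += cell["colspan"]
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def table_html_to_records(table_html):
    """One record per body cell, bound to its row label and column header."""
    grid = expand_spans(table_html)
    if not grid:
        return []
    header, body = grid[0], grid[1:]
    records = []
    for r, row in enumerate(body, start=1):
        for c in range(1, len(row)):
            records.append({
                "row_num": r,
                "col_num": c,
                "row_label": row[0],
                "col_label": header[c],
                "value": row[c],
            })
    return records
```

Because a spanned value is copied into every slot it covers, a merged "$42M" cell stays attached to each row and column it belongs to instead of appearing once and leaving gaps.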

Figures get dedicated handling because so much of the signal in transmission proposals and ISO/RTO studies lives in diagrams, not prose — one-line diagrams, substation layouts, deliverability-study charts. Every figure is extracted, and any figure region whose OCR text comes back empty (a true diagram rather than a text box) is passed to a vision-language model — an AI that looks at an image and describes it in words. We index that description alongside a pointer back to the original image. A query about a deliverability constraint near a given substation can then retrieve a one-line diagram that contains zero searchable text. Optionally, multi-modal embeddings can make the image itself retrievable by visual similarity — we'd add that only if evaluation shows figure-heavy queries underperforming, rather than paying for it speculatively.

The commodity stages — raw OCR and the base layout and table models — we take off the shelf rather than rebuild. Our differentiated value is the orchestration around them: strategy routing per document, the dual image-plus-structure table representation, VLM figure indexing, multi-page reconciliation, and the fallbacks that keep your hardest documents parsing into something useful instead of failing silently.

Additional detail: models, table handling, VLM, multi-modal, full tool stack

The backbone — Unstructured

The parsing layer is built on the Unstructured Python library (unstructured 0.17.0 + unstructured-inference 0.8.10). It converts PDF, DOCX, XLSX, and HTML into a stream of typed elements — Title, NarrativeText, Table, Image, PageBreak — each carrying text, metadata, and (in the high-resolution path) a bounding-box polygon. Everything downstream — chunking, table handling, figure description, retrieval scope — keys off those types and coordinates, which is why getting this layer right is what makes the rest possible. For a 300-page proposal the document is pulled from S3, split into one PDF per page, then parsed in parallel page-buckets — a throughput decision with one consequence we handle explicitly (multi-page tables, below).

Strategy selection — three modes, chosen per document (built today)

HI_RES — default for PDFs up to ~999 pages with image indexing on. Pages are rendered to images, the layout model finds regions, each region is OCR'd. This is what handles the messy real-world proposals.
FAST — for very large PDFs (>999 pages) or when image indexing is off. Reads the native text layer via pdfminer, no layout model. Much cheaper, appropriate where layout is simple.
OCR_ONLY — full-page OCR for scanned documents with no text layer, common in older contracts and counterparty NDAs. Also the fallback in the lightweight read path.
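The routing rule in the three modes above can be sketched as a small pure function. This is an illustration under assumptions, not the production router: DocProfile and choose_strategy are hypothetical names, and the precedence (scans are checked before page count) is our reading of the list rather than something it states. The returned strings are the strategy values the Unstructured library accepts.

```python
from dataclasses import dataclass

HI_RES_PAGE_LIMIT = 999  # page threshold quoted in the list above


@dataclass
class DocProfile:
    page_count: int
    has_text_layer: bool          # False for scanned PDFs
    image_indexing: bool = True


def choose_strategy(doc: DocProfile) -> str:
    """Pick one parsing strategy per document."""
    if not doc.has_text_layer:
        return "ocr_only"   # scans: full-page OCR
    if doc.page_count > HI_RES_PAGE_LIMIT or not doc.image_indexing:
        return "fast"       # native text layer, no layout model
    return "hi_res"         # render pages, detect layout, OCR each region
```

Keeping the decision in one function makes the thresholds easy to tune per corpus without touching the parsing code itself.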

The models, and what each one does

Model / component	Role	Why it's needed for Viridon's docs
yolox	Layout detection — finds text blocks, titles, tables, figures as bounding boxes on the rendered page	Multi-column layouts, sidebars, callouts, headers/footers — positional detection preserves reading order instead of interleaving columns.
Table-transformer	Recovers row/column structure including spans for each detected table; emits text_as_html	Turns a table from a meaningless run of numbers into cell-addressable structure.
OCR (Tesseract)	Reads text out of rendered regions and scanned pages	Selection reports and attachments are frequently scans or image-based exports.
pdfminer	Native text-layer extraction (FAST path)	Cheap, accurate path for clean digital PDFs where the layout model isn't warranted.
PyMuPDF (fitz)	Extracts hyperlink annotations, attaches each URL to the nearest element by coordinate	Proposals and contracts cross-reference prior sections and external docs by link.

All of the above run in the pipeline today (Figure 1).

Complex tables — merged cells, nesting, irregular layouts

Merged cells and spans (built today) — the table-transformer recovers rowspan/colspan and we preserve them in the HTML representation, then parse that HTML with pandas into per-row/per-column child records, so each cell is indexable with its row and column context rather than as a floating value.
Two representations, kept together (built today) — for every table we retain both the structured cell records and the original table rendered as an image (uploaded to S3, base64 in the payload). Retrieval can hit a specific cell value; a human or a VLM can still see the original visual. We never collapse the table to only one of the two.
Irregular / nested layouts (built today) — when structure inference fails on a genuinely irregular table, the pipeline retries with infer_table_structure=False and falls back to the table image plus a VLM description. Graceful degradation — we may not perfectly recover the grid, but we never drop the content.
Multi-page tables (net-new for Viridon) — because we split the PDF per page for parallel throughput, a table spanning pages N and N+1 is first detected as two separate Table elements. We add a reconciliation pass: detect a table at the bottom of page N and the top of page N+1, confirm matching column count and header signature with no intervening narrative, and stitch them into one logical table before chunking — carrying the header forward onto continuation rows. Where the match is ambiguous, we keep both the stitched table and the per-page originals so nothing is lost to a wrong guess.
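The multi-page reconciliation pass can be sketched as follows. This is a minimal illustration under assumptions: TableFragment and its fields are hypothetical, each page fragment is assumed to carry its own header row, and the at_page_top / at_page_bottom flags are assumed to be computed upstream (after discarding running headers and footers), which is how "no intervening narrative" is represented here. The ambiguous-match case, where both versions are kept, is left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TableFragment:
    page: int
    header: List[str]
    rows: List[List[str]]
    at_page_top: bool = False
    at_page_bottom: bool = False
    pages: List[int] = field(default_factory=list)  # filled in once stitched

    @property
    def last_page(self) -> int:
        return self.pages[-1] if self.pages else self.page


def header_signature(header):
    """Normalised header; equal signatures also imply equal column counts."""
    return tuple(cell.strip().lower() for cell in header)


def is_continuation(prev, nxt):
    return (
        nxt.page == prev.last_page + 1
        and prev.at_page_bottom
        and nxt.at_page_top
        and header_signature(prev.header) == header_signature(nxt.header)
    )


def stitch(prev, nxt):
    """Merge two fragments, keeping the first header for all rows."""
    return TableFragment(
        page=prev.page,
        header=list(prev.header),
        rows=prev.rows + nxt.rows,
        at_page_top=prev.at_page_top,
        at_page_bottom=nxt.at_page_bottom,
        pages=(prev.pages or [prev.page]) + [nxt.page],
    )


def reconcile(fragments):
    """Merge runs of continuation fragments; input must be in reading order."""
    merged = []
    for frag in fragments:
        if merged and is_continuation(merged[-1], frag):
            merged[-1] = stitch(merged[-1], frag)
        else:
            merged.append(frag)
    return merged
```

Tracking last_page on the stitched result lets the same check extend a table across three or more pages, not just two.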

Figures and images — VLM descriptions indexed alongside the image (built today)

A large share of the signal in transmission proposals and ISO/RTO studies lives in figures, not prose. The pipeline extracts every image/figure region, uploads it to S3, and keeps the base64 in metadata. If the region's OCR text comes back empty — i.e. it's a true diagram, not a text box — we run a Vision Language Model over it — on Amazon Bedrock (reached privately over PrivateLink) or a self-hosted open model — to generate a natural-language description of what the figure shows, and we index that description as searchable text paired with a pointer back to the original image. The practical effect: a query like "deliverability constraint near the X substation" can retrieve a figure that contains zero extractable text, because the VLM's description of it is in the index.
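A minimal sketch of the gate described above: the VLM is called only when a region's OCR text is empty, and the resulting text is stored next to a pointer to the image. The function name index_figure, the describe callable (a stand-in for the VLM call), and the described_by_vlm flag are hypothetical; image_aws_key and long_text_data mirror field names from the metadata schema.

```python
def index_figure(image_key, ocr_text, describe):
    """Build the searchable record for one figure region.

    `describe` is any callable that takes the image key and returns a
    natural-language description (a VLM call in practice).
    """
    is_diagram = not ocr_text.strip()          # empty OCR: treat as a true diagram
    text = describe(image_key) if is_diagram else ocr_text.strip()
    return {
        "image_aws_key": image_key,            # pointer back to the original image
        "long_text_data": [text],              # what actually gets embedded and searched
        "described_by_vlm": is_diagram,
    }
```

Passing the describer in as an argument keeps the gate testable without a model and lets the same code run against Bedrock or a self-hosted model.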

Multi-modal embeddings (optional)

Today we embed text — including the VLM-generated figure descriptions — into a single text vector space (Vespa). The figure is therefore retrievable via its description. An optional extension adds a multi-modal embedding (a CLIP-style model) so the image itself is embedded into a shared vector space and retrievable by visual similarity, not only through its description. The trade-off is additional indexing infrastructure and cost; in practice the description-indexing approach already captures most of the retrieval value, so we'd enable this only if the eval harness shows figure-heavy queries underperforming.

Other file types (built today)

DOCX — a custom Docx picture partitioner extracts embedded images, so figures inside Word proposals get the same image + VLM treatment.
XLSX — sheet- and cell-level elements, relevant for model outputs and trackers.
HTML — used in the web-research / scraper path for public ISO/RTO context.

Tools & platforms in the parsing stack

Layer	Tool / model	Why this one
Parsing framework	Unstructured (0.17.0 / inference 0.8.10)	Typed-element output with coordinates; one interface across PDF/DOCX/XLSX/HTML; proven on long, messy documents.
Layout detection	yolox	Recovers reading order and region types on complex multi-column pages.
Table structure	table-transformer	Cell-level structure + spans → addressable HTML.
OCR	Tesseract (via Unstructured)	Reads scanned pages and image regions with no text layer.
Native text	pdfminer	Cheap, accurate path for clean digital PDFs.
Hyperlinks	PyMuPDF (fitz)	Recovers and re-attaches link annotations by coordinate.
Table → records	pandas	Turns table HTML into per-cell row/column records.
Figure understanding	VLM — Bedrock or self-hosted open model	Describes figures so image-only content becomes searchable text.
Embedding + index	text embedding model → Vespa	Embeds text + figure descriptions into the hybrid retrieval index (KL·A).
Object store	S3	Original assets stay in your tenant.

Edge vs. commodity — preview of 1.5

Raw OCR and the base layout/table models are commoditized — we use strong off-the-shelf components there. Our differentiated value is the orchestration around them: strategy routing, the dual image-plus-structure table representation, VLM figure-description indexing, multi-page table reconciliation, and the graceful-degradation fallbacks that mean Viridon's hardest documents still parse into something useful.


Chunking & metadata

Diligence question 1.2

What is your approach to chunking these documents, and what tools would you use and why? Specifically: what metadata schema would you attach to chunks (e.g. document type, ISO/RTO, date, section), and how would that metadata be created — automated extraction, manual tagging, or a mix?

We chunk in two stages. First, structural chunking from the parse (Figure 1, KL·B) breaks a document into typed elements — narrative blocks, table rows, figure descriptions — so each chunk already knows what kind of thing it is and where it sits. Then each element's text is split by a semantic chunker into topic-coherent pieces, and each piece becomes one embedding (a numeric representation of its meaning, so related passages land near each other). We do this because fixed-size windows cut through the middle of an argument or a table; topic-aware splits keep a retrievable chunk on a single idea, which is what drives retrieval quality on long proposals and ~200-page selection reports.

We don't use one chunker for everything — we match the strategy to the content to control both quality and cost. Long-form prose goes through semantic chunking (the primary strategy); short, already-structured fields (table rows, short metadata snippets) use cheaper recursive or fixed-length splitting, because running the full semantic pass on a one-line field spends embedding calls for no retrieval gain. Each Vespa record is capped at five embeddings so no single record becomes a bag of unrelated vectors. The tooling is a custom semantic chunker (Greg Kamradt's percentile-breakpoint method, run on our own embeddings rather than a third-party wrapper), LangChain's recursive character splitter for structured content, and Vespa as the index.

Every chunk lands in our standard_document schema, which carries a deep metadata set on each record: document type and subtype; page number and on-page coordinates (for exact citation); section and hierarchy IDs; table position (row, column, is-table-root); figure pointers; created/updated dates and a version; folder path; and access scope (organization, owner, collaborators). The schema also includes typed key/value maps — string→string, string→int, string→double — so Viridon-specific dimensions like ISO/RTO, sponsor, project, or filing date attach to chunks without a schema change when a new dimension shows up. Maps are how we store arbitrary per-client metadata as dictionaries rather than hard-coding columns.

That metadata drives two kinds of filtering. Hard filtering uses the exact-match maps — "only chunks where ISO_RTO = CAISO and document_type = selection report" — applied before ranking to scope the search precisely. Soft filtering uses metadata that's full-text indexed (titles, filenames, folder path, short structured fields): it's folded into the hybrid score via BM25 — the standard keyword-matching algorithm — so a matching project name boosts a chunk's relevance without excluding anything. (Hybrid search blends this keyword signal with vector search, which matches on meaning rather than exact words.) And because we set this per field, metadata can be made BM25-searchable or kept filter-only by choice — ISO/RTO can be a hard filter, a soft ranking signal, or both, depending on how you want it to behave.

Metadata is created mostly automatically, with light human curation. The parser emits structure, coordinates, page, and hierarchy; file and source systems give dates, folder path, and access scope; a classifier assigns document type/subtype; and an LLM extraction pass pulls Viridon-specific values (ISO/RTO, sponsor, project, key dates) out of the content into the maps. Humans confirm the taxonomies and correct the occasional misclassification — corrections feed back rather than being re-done each time. This whole step is the Ingestion & Structuring component (KL·B) feeding the index (KL·A).

Additional detail: two-stage model, chunkers, full metadata schema, hard vs soft filtering

The two-stage model (built today)

Most file indexing (PDF, DOCX, TXT, PPTX, XLSX) follows the same shape: a structured element's text → SemanticChunker → a long_text_data[] array (each entry independently embedded) → break_element_arr_into_semi_equal_lengths (max 5) → one or more Vespa records. Semantic chunking decides how to split a block's text for embedding; the structural pass before it decides what each block is. The record-splitter is not a text splitter — it caps each record at five embeddings and distributes the chunks evenly, so retrieval granularity stays clean and no single record stores too many vectors.
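As an illustration of the record-splitting step, here is one way to distribute chunks across records so that no record holds more than five and sizes stay as even as possible. This is a sketch of the behaviour described, not the source of break_element_arr_into_semi_equal_lengths; the function name and the exact balancing rule are assumptions.

```python
import math


def split_into_semi_equal_records(chunks, max_per_record=5):
    """Use the fewest records that respect the cap, sized as evenly as possible."""
    if not chunks:
        return []
    n_records = math.ceil(len(chunks) / max_per_record)
    base, extra = divmod(len(chunks), n_records)
    records, start = [], 0
    for i in range(n_records):
        size = base + (1 if i < extra else 0)  # first `extra` records get one more
        records.append(chunks[start:start + size])
        start += size
    return records
```

With 11 chunks this yields records of 4, 4 and 3 rather than 5, 5 and 1, which avoids a near-empty trailing record.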

1 · Semantic chunking — primary strategy (built today)

The default for almost all document indexing, based on Greg Kamradt's percentile-breakpoint approach, customized to run on our own encode_many rather than a LangChain embedding wrapper. The algorithm:

1. Split text into sentences by regex (((?<=[.?!])\s+|\n)).
2. Combine each sentence with its neighbors (buffer size 1) into "combined sentences".
3. Embed the combined sentences in batches of 500.
4. Compute cosine distance between adjacent combined sentences; mark a breakpoint wherever distance exceeds the 95th-percentile threshold.
5. Group the sentences between breakpoints into chunks.
6. Post-process with a recursive character split to enforce ~32–512 char bounds, merge tiny chunks, and split oversized ones.

Used by PDF, DOCX, TXT, PPTX, XLSX and the shared embedding helper — i.e. the whole Viridon corpus.
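The percentile-breakpoint steps above can be sketched in plain Python. This is a minimal illustration, not the production chunker: embed_many is a hypothetical stand-in for the encode_many helper and can be any callable that maps a list of strings to a list of vectors. Batching (500 per call) and the final size-bound post-processing step are omitted, and the regex is written without the capturing group so re.split does not return the separators.

```python
import math
import re

SENTENCE_SPLIT = re.compile(r"(?<=[.?!])\s+|\n")


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def percentile(values, pct):
    """Linearly interpolated percentile of a non-empty list."""
    ordered = sorted(values)
    k = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(k), math.ceil(k)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (k - lo)


def semantic_chunks(text, embed_many, buffer_size=1, breakpoint_pct=95):
    # Step 1: sentence split.
    sentences = [s.strip() for s in SENTENCE_SPLIT.split(text) if s and s.strip()]
    if len(sentences) < 2:
        return sentences
    # Step 2: each sentence joined with its neighbours.
    combined = [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    # Steps 3-4: embed, then distance between adjacent combined sentences.
    vectors = embed_many(combined)
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, breakpoint_pct)
    # Step 5: start a new chunk wherever the distance exceeds the threshold.
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

Because the threshold is a percentile of this document's own distances, the number of breakpoints scales with document length rather than depending on an absolute distance cutoff.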

2 · Fallback splitters (built today)

Recursive character split (LangChain, splitting on paragraph → line → word → char to a 512-char cap) — for structured or already-sectioned content: heading/content blocks, table rows, short fields, where fixed sizes are predictable and embedding cost should stay low.
Fixed-length (20k char window, 200 overlap, no semantics) — only where content is already short structured snippets and the semantic pass would add cost without benefit.
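For comparison with the semantic pass, the fixed-length fallback is simple enough to sketch in a few lines. The defaults mirror the window and overlap quoted above; the function name fixed_length_chunks is hypothetical and this is not the LangChain splitter.

```python
def fixed_length_chunks(text, window=20_000, overlap=200):
    """Slide a fixed window over the text; consecutive chunks share `overlap` chars."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    if not text:
        return []
    step = window - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + window] for i in range(0, last_start, step)]
```

No embedding calls are made here at all, which is the cost argument for using it on short, already-structured snippets.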

Cost discipline: semantic chunking embeds every combined sentence for the distance calculation and then the final chunks — materially more embedding calls than recursive/fixed splitting. Reserving it for long-form prose and using cheaper splitters for short structured fields is a deliberate cost choice.

3 · Structural chunking first (built today)

Before any of the above, documents are chunked by structure: tables become row/column children (carrying row_num, col_num, is_table_root), images get VLM descriptions, and hierarchy is captured via group IDs. Only the resulting text (or image description) then goes through semantic chunking.

The metadata schema — standard_document (built today)

Each chunk is one standard_document record (inheriting a shared base schema). The fields map directly onto everything Viridon needs:

Need	Schema field(s)	How it's created	Indexing
Document type	document_type, document_subtype	Classifier at ingest (proposal, selection report, RFI, NDA, ISO/RTO study)	filter
ISO/RTO, sponsor, project, dates	string_string_hard_filter_map, string_int_hard_filter_map, string_double_hard_filter_map	LLM extraction + source metadata — no schema change for new dimensions	hard filter (exact)
Section / hierarchy	group_id, parent_group_id, hierarchy_group_ids	Structural parse	filter / scope
Page & location	page_number, coordinates (map<string,float>)	Parser (Figure 1)	filter / citation
Table position	row_num, col_num, is_table_root, is_number_only	Table parser	filter
Figures	image_uuid, image_aws_key, is_image_useful	Image + VLM pipeline	filter
Date / recency / version	document_created_at, created_at, updated_at, version	File & source metadata	attribute (filter + rank)
Location in tenant	folder_path, folder_path_ids	SharePoint tree	BM25 + filter
Access scope (RBAC — role-based access control)	organization_id(s), owner_id, collaborator_ids	Source / SSO	filter (→ KL·H)
Titles	name, filename	Source	BM25
Short structured fields	short_text_field_data (weightedset)	Extraction	BM25 (weighted)
The chunk text itself	long_text_data (array<string>) + long_text_embeddings	Semantic chunker	BM25 + semantic
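To make the shape of a record concrete, here is a partial sketch of the fields in the table as a Python dataclass. Field names come from the table; the Python types, the defaults, and the add_dimension helper are assumptions made for illustration, and the real schema lives in Vespa rather than in Python.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class StandardDocument:
    # Type and hierarchy
    document_type: str
    document_subtype: Optional[str] = None
    group_id: Optional[str] = None
    parent_group_id: Optional[str] = None
    hierarchy_group_ids: List[str] = field(default_factory=list)
    # Page and location, used for exact citation
    page_number: Optional[int] = None
    coordinates: Dict[str, float] = field(default_factory=dict)
    # Table position
    row_num: Optional[int] = None
    col_num: Optional[int] = None
    is_table_root: bool = False
    # Typed hard-filter maps for client-specific dimensions
    string_string_hard_filter_map: Dict[str, str] = field(default_factory=dict)
    string_int_hard_filter_map: Dict[str, int] = field(default_factory=dict)
    string_double_hard_filter_map: Dict[str, float] = field(default_factory=dict)
    # Chunk text and one embedding per chunk
    long_text_data: List[str] = field(default_factory=list)
    long_text_embeddings: List[List[float]] = field(default_factory=list)

    def add_dimension(self, key, value):
        """Route a new metadata dimension to the map matching its type."""
        if isinstance(value, bool):
            raise TypeError("bool is not a supported hard-filter value")
        if isinstance(value, str):
            self.string_string_hard_filter_map[key] = value
        elif isinstance(value, int):
            self.string_int_hard_filter_map[key] = value
        elif isinstance(value, float):
            self.string_double_hard_filter_map[key] = value
        else:
            raise TypeError(f"unsupported value type: {type(value).__name__}")
```

The helper shows why a new dimension is a data change: adding "ISO_RTO" or a filing year writes a new key into an existing map and leaves the field list untouched.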

Hard filtering — exact metadata scoping

The three typed maps (string→string, string→int, string→double) are stored with exact match on both key and value, as fast-search attributes. That's what lets a query say "restrict to ISO_RTO = CAISO, document_type = selection report, year ≥ 2023" and have Vespa narrow the candidate set before ranking. Because they're maps, adding a new metadata dimension is a data change, not a schema migration.
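The effect of hard filtering can be shown with a small in-memory sketch. This is not Vespa query syntax; it only illustrates the semantics of exact-match scoping applied before any ranking. The record layout (a single "filters" dict standing in for the three typed maps) and the names matches and hard_filter are assumptions.

```python
import operator

OPS = {
    "=": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}


def matches(record, conditions):
    """True only if the record satisfies every (key, op, value) condition."""
    for key, op, expected in conditions:
        actual = record["filters"].get(key)
        if actual is None or not OPS[op](actual, expected):
            return False
    return True


def hard_filter(records, conditions):
    """Narrow the candidate set; ranking would run only on what survives."""
    return [r for r in records if matches(r, conditions)]
```

A record that lacks a filtered key is excluded rather than matched, which is the behaviour that distinguishes a hard filter from a soft ranking signal.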

Soft filtering — metadata as a ranking signal, BM25-searchable by choice

Fields marked enable-bm25 (long_text_data, name, filename, folder_path, short_text_field_data) are full-text searchable and carry per-field weights in the rank profile (e.g. name/filename weighted ~2× body text), so matching metadata boosts relevance rather than excluding. Fields marked attribute-only (document type, page, the hard-filter maps) are not BM25-searched — they're pure filters. This is a per-field choice: any piece of metadata can be made a soft ranking signal (indexed), a hard filter (attribute), or both. There are also trigram (gram-size 3) variants of the indexed fields for fuzzy/typo-tolerant matching, and an empty-field discount so a chunk isn't unfairly penalized for missing an optional field. The full hybrid rank profile — BM25 / semantic / n-gram weighting, multi-vector closeness over long_text_embeddings, and reranking — is covered in Section 1.4 (Retrieval).

How metadata is created — automated, extracted, or curated

Automated from the pipeline — page, coordinates, hierarchy, table position, chunk IDs, figure pointers (the parser); dates, folder path, access scope (file & source systems).
LLM-extracted — ISO/RTO, sponsor, project, referenced dates and other Viridon dimensions, pulled from content into the maps and short fields during ingestion.
Human-curated (light) — confirming the document-type and ISO/RTO taxonomies and correcting misclassifications; corrections feed back so they're not repeated. The taxonomy is configured per client (configured for Viridon), on top of the mechanism that already exists.


Embedding

Diligence question 1.3

What is your approach to embedding, and which model(s) or platform(s) would you use and why?

For Viridon we'd run a privacy-first, model-agnostic embedding layer on Amazon Bedrock, so every document is embedded entirely within Viridon's own AWS environment. We treat the embedding model as a swappable component rather than a fixed dependency — the retrieval architecture doesn't change when the model does — but the deployment posture (in-tenant, no data egress) is the part we'd hold fixed, because it is what lets this system stand as an owned asset in your data room at exit.

The reason to anchor on Bedrock is data control. Accessed through an AWS PrivateLink VPC endpoint, embedding traffic stays on the AWS network within Viridon's chosen region and never crosses the public internet. Bedrock does not use inputs or outputs to train any model, and does not share them with model providers; data is encrypted in transit and at rest, optionally under Viridon's own KMS keys; and the service carries the compliance coverage a Blackstone-backed infrastructure company will be asked about (SOC 1/2/3, ISO 27001 and family, HIPAA-eligible, GDPR, FedRAMP). The practical consequence for the exit story: the embedding index is an owned asset with no third-party data exposure that could reprice a deal.

On model choice, the default for text is Amazon Titan Text Embeddings V2 — optimized for RAG (retrieval-augmented generation: answering from retrieved documents), multilingual, with selectable output dimensions (256 / 512 / 1024) and unit-normalized vectors. For the figure-heavy ISO/RTO and transmission material, we'd use a multimodal embedding model — Amazon Nova Multimodal Embeddings (a single model spanning text, documents, images, video and audio, with cross-modal retrieval) or Titan Multimodal Embeddings G1. This is what makes the multimodal retrieval we flagged in Parsing (1.1) real: a one-line diagram or deliverability chart is embedded into the same space as the surrounding text, so a text query can retrieve the figure directly, not only via its written description. If a privacy requirement or an eval result points elsewhere, Cohere embeddings on Bedrock — or self-hosted open-source models (e.g. BGE, E5, GTE) that run entirely inside the VPC — are also available; the choice is eval-gated and privacy-gated, not locked.
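For concreteness, a text embedding call to Titan Text Embeddings V2 through the Bedrock runtime can be sketched as below. The request fields (inputText, dimensions, normalize) and the embedding response field follow the Titan V2 request format; the wrapper embed_text is a hypothetical name. The client is passed in so the sketch runs without AWS access; in practice it would be a boto3 "bedrock-runtime" client created inside the VPC.

```python
import json


def embed_text(client, text, dimensions=512,
               model_id="amazon.titan-embed-text-v2:0"):
    """Return one unit-normalized embedding of the requested size."""
    body = json.dumps({
        "inputText": text,
        "dimensions": dimensions,   # 256, 512 or 1024 for Titan V2
        "normalize": True,
    })
    response = client.invoke_model(
        modelId=model_id,
        body=body,
        contentType="application/json",
        accept="application/json",
    )
    payload = json.loads(response["body"].read())
    embedding = payload["embedding"]
    if len(embedding) != dimensions:
        raise ValueError("embedding size does not match the requested dimension")
    return embedding
```

Keeping the model ID and dimension as parameters is what makes the model swappable: changing either is a configuration change on both the index and query side, not a code change.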

Two engineering invariants govern the layer. First, index and query must use the same model and the same dimension — both sides are wired to the chosen Bedrock model, and the vector index's tensor dimension is set to match. Second, we pick the output dimension to balance accuracy against storage and latency; these models use Matryoshka-style dimensions (embeddings trained so they can be safely truncated to a shorter length), so e.g. dropping from 1024 to 512 keeps ~99% of retrieval accuracy at half the storage. This is the detail behind the Indexing & Retrieval component (KL·A) in Figure 1.

Finally, embeddings are one signal of three, not the whole story. Retrieval blends semantic (vector closeness) with BM25 (lexical) and n-gram (fuzzy) matching, and the semantic weight is automatically zeroed for chunks that have no vector (e.g. number-only table cells) so lexical matching still works. That design is what lets the embedding model be swapped without rebuilding retrieval — the full ranker is covered in Section 1.4.
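The "one signal of three" blend, including the zeroed semantic weight for chunks without a vector, can be sketched as a weighted sum. The weights below are placeholders rather than the production rank profile, hybrid_score is a hypothetical name, and whether the remaining weights are renormalized when the semantic term drops out is not stated above; this sketch leaves them unchanged.

```python
def hybrid_score(bm25, ngram, closeness, weights=(0.4, 0.1, 0.5)):
    """Blend lexical, fuzzy and semantic signals into one relevance score.

    `closeness` is None for chunks that have no embedding, such as
    number-only table cells; their semantic weight is set to zero.
    """
    w_bm25, w_ngram, w_semantic = weights
    if closeness is None:
        w_semantic, closeness = 0.0, 0.0
    return w_bm25 * bm25 + w_ngram * ngram + w_semantic * closeness
```

A chunk with no vector still scores on its lexical and n-gram matches, so a number-only cell remains retrievable by an exact keyword query.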

Additional detail: Bedrock privacy, model options, dimensions, cross-modal retrieval

Why Amazon Bedrock — data privacy in depth (recommended for Viridon)

Network isolation — AWS PrivateLink creates an interface VPC endpoint in Viridon's subnets; embedding traffic stays on the AWS network within the chosen region and never traverses the public internet.
No training, no sharing, no retention — inputs and outputs are not used to train any foundation model, and not shared with model providers. Providers have no access to the isolated model-deployment accounts.
Encryption & keys — TLS in transit, KMS at rest, with the option of customer-managed keys.
Residency — data stays in the chosen region; no cross-region movement unless explicitly opted in.
Compliance — SOC 1/2/3, ISO 27001 / 27017 / 27018 / 27701, CSA STAR L2, HIPAA-eligible, GDPR, FedRAMP — the documentation a sponsor or acquirer's diligence will expect.

Recommended models

Model	Modality	Output dims	Why / when
Titan Text Embeddings V2 (titan-embed-text-v2:0)	Text (8k tokens, 100+ languages)	256 / 512 / 1024	Default for prose. RAG-tuned, normalized, flexible dimension for storage/latency control.
Amazon Nova Multimodal Embeddings	Text · document · image · video · audio (unified)	256 / 384 / 1024 / 3072	Best fit for figure-heavy ISO/RTO + transmission docs; cross-modal retrieval in one space.
Titan Multimodal Embeddings G1 (titan-embed-image-v1)	Text + image (shared space)	256 / 384 / 1024	Lighter multimodal option for image-by-text / image-by-image search.
Cohere (on Bedrock) · open models (BGE / E5 / GTE)	Text (model-dependent)	model-dependent	Eval/privacy-gated alternatives. Open models run fully self-hosted in-VPC.

Multimodal & cross-modal retrieval for Viridon

A large share of Viridon's signal lives in diagrams: one-lines, substation layouts, deliverability charts. In Parsing (1.1) we make those searchable by indexing a VLM-written description of each figure. A multimodal embedding model goes one step further: it embeds the figure itself into the same vector space as text, so a text query retrieves the image by visual-semantic similarity, not only through its description. Nova Multimodal Embeddings is designed for this case (searching documents that mix infographics and text), which is why it's our lead recommendation where figure retrieval matters.

Dimensions & the index invariant

The embedding dimension must match the vector index. Bedrock's models expose Matryoshka-style dimensions, so we choose a point on the accuracy/cost curve — typically 1024, or 512 where storage and latency matter and ~99% of accuracy is retained — and set the index tensor dimension to match. The one hard rule: the same model and dimension are used at index time and query time. Changing the model later means re-embedding the corpus and updating the index dimension, which is a deliberate, eval-gated migration rather than a silent swap.
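
A minimal sketch of the invariant above, in Python. The names `EmbeddingConfig`, `check_invariant` and `truncate_and_normalize` are illustrative and are not part of the platform or of Bedrock. Client-side truncation is shown only to illustrate what a Matryoshka-style embedding permits; with Titan Text Embeddings V2 the output dimension is normally requested from the model directly.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingConfig:
    """Model and output dimension used on one side (index or query)."""
    model_id: str
    dimension: int


def check_invariant(index_cfg: EmbeddingConfig, query_cfg: EmbeddingConfig,
                    index_tensor_dim: int) -> None:
    """Raise if index time and query time disagree on model or dimension."""
    if index_cfg != query_cfg:
        raise ValueError(f"index/query mismatch: {index_cfg} vs {query_cfg}")
    if index_cfg.dimension != index_tensor_dim:
        raise ValueError(
            f"embedding dimension {index_cfg.dimension} != "
            f"index tensor dimension {index_tensor_dim}")


def truncate_and_normalize(vector: list[float], dimension: int) -> list[float]:
    """Keep the first `dimension` components, then rescale to unit length."""
    if dimension > len(vector):
        raise ValueError("cannot truncate to a longer dimension")
    head = vector[:dimension]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


cfg = EmbeddingConfig("amazon.titan-embed-text-v2:0", 512)
check_invariant(cfg, cfg, index_tensor_dim=512)  # passes silently
short = truncate_and_normalize([3.0, 4.0, 12.0, 0.5], 2)
```

Running a check like this at deploy time, before any query is served, is one way to turn a silent model or dimension swap into a hard failure.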

Throughput

Bedrock offers asynchronous / batch embedding jobs for indexing large corpora (the 300+ page proposals) and a latency-optimized path for query-time embedding, which maps cleanly onto our parallel indexing design and keeps interactive search fast.

Hybrid retrieval recap · built today

Embeddings feed the semantic leg of a three-signal hybrid ranker (semantic + BM25 + n-gram, default blend ≈ 0.5 / 0.4 / 0.1). Semantic weight auto-zeroes for chunks with no vector so lexical and fuzzy matching still operate. Because retrieval is hybrid and the embedding model is abstracted behind a single encode interface, the model is genuinely swappable — the retrieval architecture, covered in 1.4, is unchanged by the choice.

01 — Ingestion · Indexing · Retrieval

Retrieval

Diligence question 1.4

Walk us through your retrieval approach, including the tools or platforms used and why. Specifically: is retrieval semantic-only or hybrid (keyword + semantic), and what signals drive ranking?

Retrieval is hybrid, not semantic-only, and it's a multi-stage pipeline rather than a single vector lookup. Every search fuses three signals inside one Vespa query — BM25 (lexical/keyword), semantic (vector nearest-neighbour over the chunk embeddings), and n-gram (character-trigram, typo-tolerant) — and several lightweight LLM steps wrap around that core to handle the messiness of real enterprise questions. Vespa (open-source, self-hosted in Viridon's VPC) is the engine because it does true hybrid retrieval and two-phase ranking in a single query; the query is embedded with the same Bedrock model used at index time (Section 1.3).

Before anything is searched, two things happen. Query distillation rewrites a conversational message into a standalone query — "what about their deliverability?" becomes "Sunrise project deliverability study outcome" using the recent conversation history, with explicit instructions not to over-interpret domain terms. Then multi-query expansion generates three additional phrasings (synonyms, full-forms/abbreviations like "CAISO" ↔ "California ISO", and different angles), so we typically search with four parallel queries. This lifts recall on under-specified questions, which is the common failure mode on a corpus this varied.

Each query then runs hybrid search in Vespa, scoped by organization, optional source selection, and hard metadata filters (the maps from 1.2 — e.g. ISO_RTO = CAISO). This is the Scoped Retrieval component (KL·H) in Figure 1. The lexical legs run over text plus high-value fields (name, filename, folder path); the semantic leg runs an approximate-nearest-neighbour search over the multi-vector chunk embeddings, with a candidate set of ~500 before ranking.

Ranking is where the signals combine, and it's fully configurable through Vespa rank profiles. The default profile blends semantic at 0.5, BM25 at 0.4, and n-gram at 0.1 in a first phase, then re-ranks the top 500 in a global phase with each signal linearly normalized (a master profile offers reciprocal-rank fusion — a standard method for merging several ranked lists — as an alternative). On top of that, field-level weights mean a match in a document's name or filename (weight 300) outranks the same match in body text (150), an empty-field discount keeps a chunk from being penalized for missing optional metadata, and the semantic leg auto-zeroes for chunks that have no embedding (number-only table cells) so lexical matching still surfaces them. A per-word significance step scores each query term HIGH / MEDIUM / LOW (1.0 / 0.5 / 0.01) so filler words don't pollute the lexical match while the full-query embedding stays intact.
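
The two global-phase options named above can be sketched as follows. This is an illustration in plain Python, not the Vespa rank-profile configuration itself: the function names are hypothetical, the raw scores are made up, `k = 60` is a commonly used constant for reciprocal-rank fusion rather than a value taken from this stack, and returning 0 when all scores are equal is a choice made for the sketch.

```python
def normalize_linear(scores: dict[str, float]) -> dict[str, float]:
    """Min-max scale one signal's scores across the candidate set to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def reciprocal_rank_fusion(signals: list[dict[str, float]],
                           k: int = 60) -> dict[str, float]:
    """Sum 1 / (k + rank) over signals; rank 1 is the best score in a signal."""
    fused: dict[str, float] = {}
    for scores in signals:
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, doc in enumerate(ranked, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    return fused


# Made-up raw scores for three candidate chunks.
bm25 = {"A": 12.0, "B": 7.0, "C": 2.0}
semantic = {"A": 0.91, "B": 0.74, "C": 0.80}

normalized_bm25 = normalize_linear(bm25)
fused = reciprocal_rank_fusion([bm25, semantic])
```

Linear normalization keeps the size of the gap between candidates, while reciprocal-rank fusion uses only their order, which is why the two can rank the same shortlist differently.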

After the parallel searches return, an application fusion layer merges and de-duplicates across the four queries and boosts documents that matched multiple phrasings (final = top score + 0.2 × second-best) — deliberate recall amplification for results that show up under several framings. Optional precision passes sit on top: Vespa's global-phase rerank is always on; a self-hosted open cross-encoder reranker (e.g. BGE-reranker) — a slower, more accurate model that re-scores the shortlist for precision — can be added; and an optional LLM source-picker can read the shortlist and return an ordered set of the most relevant sources before the answer is generated.

Finally, custom ranking is a first-class lever, not an afterthought — the rank profile is tuned per customer. The signal weights, the field weights, significance on/off, linear-norm vs reciprocal-rank fusion, scope defaults, and image inclusion are all configurable per request or per org without code changes, and we tune them against Viridon's eval set (Section 3). A worked retrieval trace on a representative document is below; we'd deliver the full trace on a Viridon document of your choosing as the requested artifact.

▸ Additional detail pipeline, rank profiles & weights, fusion, worked retrieval trace

The retrieval pipeline, end to end · built today

1 · Query distillation — an LLM rewrites the conversational message into a standalone query using the last several turns; streamed to the user as "Generating search query…".
2 · Multi-query expansion — an LLM produces 3 extra phrasings (synonyms, full-forms/abbreviations, alternate angles); the original is appended → ~4 parallel queries.
3 · Term significance — an LLM scores each word HIGH (1.0), MEDIUM (0.5) or LOW (0.01), seeing both the original question and the search term so it weights by actual intent.
4 · Hybrid Vespa search — BM25 + semantic + n-gram fused in one YQL query per term, scoped and filtered.
5 · Rank profile — two-phase weighted ranking inside Vespa.
6 · Cross-query fusion — merge, dedupe, multi-match boost.
7 · Enrichment — hits become multimodal context (text, images via presigned URLs, table structure) for the answer LLM; optional cross-encoder rerank / LLM source-pick.
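
The control flow of steps 1 to 6 can be sketched as a small function that takes each stage as a callable. Everything here is a stand-in: `retrieve`, `fake_search` and the lambdas are hypothetical, the LLM and Vespa calls are replaced by stubs, and term significance and ranking are assumed to happen inside `search`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Hit = dict  # e.g. {"id": "A", "score": 0.82}


def retrieve(message: str,
             history: list[str],
             distill: Callable[[str, list[str]], str],
             expand: Callable[[str], list[str]],
             search: Callable[[str], list[Hit]],
             fuse: Callable[[list[list[Hit]]], list[Hit]]) -> list[Hit]:
    """Distill, expand, search every phrasing in parallel, then fuse."""
    standalone = distill(message, history)        # step 1
    queries = expand(standalone) + [standalone]   # step 2: original appended
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        per_query = list(pool.map(search, queries))  # steps 3 to 5, per query
    return fuse(per_query)                        # step 6


def fake_search(query: str) -> list[Hit]:
    """Stub search: one hit for any phrasing that mentions deliverability."""
    return [{"id": "A", "score": 0.8}] if "deliverability" in query else []


hits = retrieve(
    "what about their deliverability?",
    ["Tell me about the Sunrise project"],
    distill=lambda msg, hist: "Sunrise project deliverability study outcome",
    expand=lambda q: [q + " CAISO", q + " California ISO",
                      "Sunrise interconnection result"],
    search=fake_search,
    fuse=lambda results: [hit for hits_ in results for hit in hits_],
)
```

Passing the stages in as callables mirrors the swappability described above: the embedding model, the search backend and the fusion rule can each change without touching the control flow.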

Hybrid search — three modes in one query

A single Vespa request combines all three retrieval modes, plus the per-word significance weights injected into the lexical legs:

… or userInput(@query)                                                            // BM25 (lexical)
… or ({defaultIndex:"grams"} userInput(@query))                                   // n-gram (fuzzy)
… or ({targetHits:500} nearestNeighbor(long_text_embeddings, prompt_embedding))   // semantic (ANN)

The semantic leg uses approximate nearest-neighbour over the multi-vector chunk embeddings with a ~500-candidate target; the number of hits finally returned is tunable (e.g. 30 for the API, up to 500 for the full RAG path).

Rank profiles & the signals that drive ranking · built today

Two phases. First-phase computes a weighted blend of the three signals; global-phase re-ranks the top 500 with each signal linearly normalized (the master profile swaps in reciprocal-rank fusion when use_reciprocal_rank is set). The defaults on standard_document:

Signal	Default weight	Mechanism
Semantic	0.5	closeness of query embedding vs the chunk's multi-vector long_text_embeddings
BM25 (lexical)	0.4	BM25 over long_text_data, name, filename, folder_path, short fields
N-gram	0.1	nativeRank over the trigram fields — typo / partial-token tolerance

Inside the BM25 leg, fields are weighted so identity matches win: name 300, filename 300, folder_path 200, body text 150, short fields 150. Two guards matter: an empty-field discount (×0.1) so a chunk isn't penalized for lacking an optional field, and a semantic auto-zero — if a chunk has no embedding (number-only cells), its semantic weight drops to 0 and the blend renormalizes over the lexical signals so the chunk is still retrievable. All weights are inputs, so they're overridable per request.
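
The blend and the semantic auto-zero can be written out in a few lines. This is a sketch of the arithmetic only, assuming per-signal scores that are already on a comparable 0 to 1 scale; `blend` and `DEFAULT_WEIGHTS` are illustrative names, not the rank-profile expression itself. The inputs are the per-signal scores of chunks A and C from the worked trace in this section.

```python
from typing import Optional

DEFAULT_WEIGHTS = {"semantic": 0.5, "bm25": 0.4, "ngram": 0.1}


def blend(bm25: float, semantic: Optional[float], ngram: float,
          weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted blend of the three signals.

    A chunk with no embedding passes semantic=None: its semantic weight
    drops out and the remaining weights are renormalized to sum to 1.
    """
    signals = {"bm25": bm25, "ngram": ngram}
    if semantic is not None:
        signals["semantic"] = semantic
    total = sum(weights[name] for name in signals)
    return sum(weights[name] * value for name, value in signals.items()) / total


chunk_a = blend(bm25=0.82, semantic=0.91, ngram=0.40)   # has an embedding
chunk_c = blend(bm25=0.55, semantic=None, ngram=0.65)   # number-only cell
```

Because the weights are an argument, a per-request override is just a different dictionary passed to the same function.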

Term significance — which words matter

Per-word weights are injected into the YQL so noisy lexical matches on filler words are suppressed while the full-query embedding is untouched:

default contains ({weight:1.0, significance:HIGH} "deliverability")
default contains ({weight:0.01, significance:LOW} "the")

HIGH (1.0) for names/entities, MEDIUM (0.5) for secondary context, LOW (0.01) for stopwords/generic terms. Significance can be disabled per org where it isn't helping.
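
A small sketch of the label-to-weight mapping, assuming the LLM step returns one label per word. `weight_terms` is a hypothetical helper, and treating "disabled" as full weight for every word is an assumption of the sketch. The labels are the ones used for query 1 in the worked trace in this section.

```python
SIGNIFICANCE_WEIGHTS = {"HIGH": 1.0, "MEDIUM": 0.5, "LOW": 0.01}


def weight_terms(labels: dict[str, str], enabled: bool = True) -> dict[str, float]:
    """Map each query word's label to its lexical weight.

    With significance disabled, every word keeps full weight (an assumption
    about what "disabled" means here).
    """
    if not enabled:
        return {term: 1.0 for term in labels}
    return {term: SIGNIFICANCE_WEIGHTS[label] for term, label in labels.items()}


labels = {"Sunrise": "HIGH", "project": "LOW", "deliverability": "HIGH",
          "study": "MEDIUM", "outcome": "MEDIUM", "CAISO": "HIGH"}
weights = weight_terms(labels)
flat_weights = weight_terms(labels, enabled=False)
```

Only the lexical legs consume these weights; the query embedding is still computed from the full, unweighted query text.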

Cross-query fusion · built today

Within each query's result set, signals are linearly normalized and re-scored with the per-hit weights; across the four queries, hits are de-duplicated by ID and a document that matched multiple phrasings is boosted: final = top score + 0.2 × second-highest score. This intentionally amplifies recall for documents that surface under several framings of the same question.
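
The fusion rule above can be sketched as follows. The function name `fuse`, the hit shape and the `matched` field are illustrative, and the sketch assumes each document appears at most once per query. The per-query scores are the ones used for chunks A, D and B in the worked trace in this section.

```python
def fuse(per_query_hits: list[list[dict]], boost: float = 0.2) -> list[dict]:
    """Dedupe by ID across queries; final = top score + boost * second-highest."""
    scores_by_id: dict[str, list[float]] = {}
    for hits in per_query_hits:
        for hit in hits:
            scores_by_id.setdefault(hit["id"], []).append(hit["score"])

    fused = []
    for doc_id, scores in scores_by_id.items():
        ordered = sorted(scores, reverse=True)
        second = ordered[1] if len(ordered) > 1 else 0.0  # single match: no boost
        fused.append({"id": doc_id,
                      "score": ordered[0] + boost * second,
                      "matched": len(ordered)})
    return sorted(fused, key=lambda hit: hit["score"], reverse=True)


per_query = [
    [{"id": "A", "score": 0.82}, {"id": "D", "score": 0.74}, {"id": "B", "score": 0.68}],
    [{"id": "A", "score": 0.80}, {"id": "D", "score": 0.71}],
]
ranking = fuse(per_query)
```

A document that matched only one phrasing keeps its top score unchanged, so the boost can only lift multi-match documents and never penalizes single matches.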

Reranking & per-customer tuning

Vespa global-phase rerank: top-500 re-score, always on. (built)
Cross-encoder reranker: a self-hosted open cross-encoder reranker (e.g. BGE-reranker / mxbai-rerank) for higher precision on the shortlist; addable to the main path and runs fully in-VPC. (optional)
LLM source-picker: optionally an LLM reads the shortlist and returns an ordered most-relevant set before answer generation. (optional · built)
Per-customer rank profile: signal weights, field weights, significance on/off, fusion method, scope and image-inclusion defaults, all tuned to Viridon against the eval set, no code change. (tuned for Viridon)

Worked retrieval trace (representative) · illustrative; full artifact on a Viridon doc

User message (with history)

"what was the deliverability outcome for Sunrise in CAISO?"

After distillation + expansion → 4 parallel queries

1 · Sunrise project deliverability study outcome CAISO
2 · Sunrise transmission deliverability assessment California ISO
3 · Sunrise full capacity deliverability status CAISO
4 · Sunrise interconnection deliverability result

Term significance (query 1)

Sunrise · 1.0 | project · 0.01 | deliverability · 1.0 | study · 0.5 | outcome · 0.5 | CAISO · 1.0

Hard filter applied: ISO_RTO = CAISO.

Candidate chunks — per-signal scores → blended (0.4·bm25 + 0.5·sem + 0.1·ngram)

Chunk	BM25	Semantic	N-gram	Blended	Matched queries
A · Selection report, "Deliverability" §, p.142	0.82	0.91	0.40	0.82	1, 2, 3
D · file Sunrise_Deliverability_Study.pdf	0.88 (filename wt 300)	0.66	0.55	0.74	1, 2, 4
B · Proposal exec summary mention	0.70	0.74	0.30	0.68	1
C · Table cell "FCD status: Conditional"	0.55	n/a — no vector	0.65	0.57	1, 3

Chunk C is a number-only cell: semantic auto-zeroes and the blend renormalizes over the lexical signals — (0.4·0.55 + 0.1·0.65) / 0.5 = 0.57 — so it's still retrieved.

Cross-query fusion (top + 0.2 × second) → final ranking

Rank	Chunk	Fusion	Final
1	A	0.82 + 0.2·0.80	0.98
2	D	0.74 + 0.2·0.71	0.88
3	C	0.57 + 0.2·0.54	0.68
4	B	0.68 (single match)	0.68

A wins on strong semantic + three-query match; D rises on the filename weight and multi-query match; C — a number-only cell with no embedding — still ranks via lexical signals. This is the shape of the trace artifact we'd deliver on a representative Viridon document.

Tools & platforms in the retrieval stack

Layer	Tool / platform	Why
Hybrid index & ranking	Vespa (open source, self-hosted in-VPC)	BM25 + semantic + n-gram fused in one query; two-phase rank profiles; ANN at scale.
Query embedding	Bedrock model (per 1.3)	Same model + dimension as index; in-VPC.
Query understanding	LLM — Bedrock or self-hosted open model (distill · expand · significance · optional source-pick)	Turns messy conversational input into high-recall, intent-weighted queries.
Precision rerank	Open cross-encoder (e.g. BGE-reranker), self-hosted	Optional shortlist reranking for higher precision.
Cross-query fusion	Application layer	Merge, dedupe, multi-match recall boost.

Edge vs. commodity

Diligence question 1.5

Looking across the pipeline above, which stages do you consider commoditized (off-the-shelf tooling), and where do you provide differentiated value?

Several stages of the pipeline are genuinely commoditized, and we deliberately use strong off-the-shelf components for them rather than reinventing them. Recognizing what's commodity is a feature — it's what keeps the build lean and lets us concentrate engineering where it actually differentiates. In Figure 1 terms, the commodity sits underneath the boxes; the boxes themselves — how they're composed, tuned, and extended — are where the value is.

The commoditized stages are the foundation models and retrieval primitives. Parsing leans on the Unstructured framework, OCR, and the base layout and table-structure models. Embedding is a commodity capability — Titan, Nova and Cohere on Bedrock and self-hosted open models (BGE, E5) are interchangeable, and the vector math is the same everywhere. Retrieval's core operations — BM25, approximate-nearest-neighbour vector search, n-gram matching — are off-the-shelf Vespa primitives, and basic character-based chunking is a solved problem. The platforms themselves (Unstructured, Vespa, Bedrock) are infrastructure we build on, not things we'd ever rebuild. Reinventing any of these would destroy value, not create it.

Our differentiated value is the orchestration around those commodity pieces, and the Viridon-specific layers on top of them. In parsing: strategy routing per document, the dual image-plus-structure table representation, VLM figure-description indexing, multi-page table reconciliation, and the graceful-degradation fallbacks that keep hard documents from failing silently. In chunking: the two-stage structural-then-semantic design, the tuned semantic chunker, and — the biggest piece — the metadata schema with its typed maps, hard/soft filtering, and BM25-by-choice fields. In embedding: the privacy-first in-VPC deployment posture and the model-agnostic abstraction. In retrieval: the multi-stage LLM-wrapped pipeline (distillation, expansion, term significance, cross-query fusion) and the per-customer rank profiles.

And then there's the layer with no off-the-shelf equivalent at all — the bespoke knowledge-layer modules built for Viridon: the "what changes" map (KL·D), public-doc enrichment (KL·I), the RFI Q&A + SME-delegation memory (KL·J), the standard-terms playbook (KL·K), and the onboarding glossary (KL·L), plus the app-specific tools. No general-purpose tool ships these because they encode Viridon's process, not a generic one. This is exactly the gap an off-the-shelf product leaves: it gives you the commodity retrieval box and stops — which is why, on the diagram, a tool like Glean covers only KL·A.

This division is what makes the platform both efficient and bespoke. We don't pay to rebuild commodity foundations, so the budget goes to the integration, tuning, and Viridon-specific modules that are genuinely differentiated — and those are the parts that sit in Viridon's environment as an owned asset.

▸ Additional detail stage-by-stage commodity vs. value, and the net-new layer

Commodity vs. differentiated, stage by stage

Stage	Commoditized (off-the-shelf)	Our differentiated value
Parsing	Unstructured framework; OCR (Tesseract); base layout (yolox) & table-structure (table-transformer) models; pdfminer; PyMuPDF	Strategy routing per document; dual image + structure table representation; VLM figure-description indexing; multi-page table reconciliation; graceful-degradation fallbacks
Chunking	Recursive / fixed-length character splitters (LangChain)	Two-stage structural → semantic → record-cap design; tuned semantic chunker (95th-pct, own embeddings); cost-aware strategy selection; the standard_document metadata schema (typed maps, hard/soft filtering, BM25-by-choice)
Embedding	The embedding model itself (Titan / Nova / Cohere on Bedrock, or self-hosted open models)	Privacy-first in-VPC Bedrock deployment; model-agnostic abstraction; index/query dimension-invariant management; multimodal cross-modal wiring
Indexing & retrieval	Vespa platform; BM25, ANN/vector search, n-gram primitives	Multi-stage LLM pipeline (distill, expand, significance); hybrid rank profiles (signal + field weights, empty-field discount, semantic auto-zero); cross-query fusion; per-customer rank tuning
Knowledge-layer modules	— no off-the-shelf equivalent —	Net-new for Viridon: KL·D "what changes" map · KL·I public-doc enrichment · KL·J RFI Q&A + SME delegation · KL·K standard-terms playbook · KL·L onboarding glossary · bespoke tools (T1, T3, T4, T6, T10, T12)

Why the division is deliberate

Commodity stays commodity — foundation models and search primitives improve rapidly and are interchangeable; building our value on them (not instead of them) means Viridon inherits those improvements for free and isn't locked to one vendor.
Value concentrates in composition and context — almost all of the differentiation is in how the commodity pieces are routed, tuned, and fused, and in the modules that encode Viridon's specific documents and process.
The off-the-shelf ceiling — a packaged tool delivers the commodity retrieval box and nothing above it; it can't be shaped to the proposal-writing workflow, and the bespoke knowledge-layer modules simply don't exist in it.

Artifacts

The two artifacts requested for this section: a retrieval trace on a representative document, and the third-party tools, platforms, and models in the proposed stack.

Artifact A

Retrieval trace — representative document

For a sample query, the chunks retrieved and how they were ranked. The bars decompose each chunk's relevance into the three weighted signals, so it's visible why each ranked where it did. This expands the inline trace from Section 1.4; we'd run the full version on a Viridon document of your choosing.

Representative doc · 2024 CAISO selection report · ≈198 pp · embedded tables + deliverability figures

Sample query (after distillation + 4-way expansion, ISO_RTO = CAISO filter applied)

"what was the deliverability outcome for Sunrise in CAISO?"

Per-query relevance — how each signal contributes to the blend (0.4·BM25 + 0.5·semantic + 0.1·n-gram)

Chunk	Location / type	Matched queries	Blended
Selection report, "Deliverability" section	p.142 · narrative	1, 2, 3	0.82
File: Sunrise_Deliverability_Study.pdf	filename match (field weight 300)	1, 2, 4	0.74
Proposal, executive summary mention	p.4 · narrative	1	0.68
Table cell, "FCD status: Conditional"	p.151 · number-only cell · no embedding	1, 3	0.57

Legend: Semantic (0.5) · BM25 (0.4) · N-gram (0.1)

Note the bottom result: a number-only table cell with no embedding — the semantic segment is absent (auto-zeroed) and the blend renormalizes over the lexical signals, so the cell is still retrieved.

Cross-query fusion (final = top score + 0.2 × second-best) → final order

Rank	Chunk	Fusion	Final
1	Selection report — Deliverability §	0.82 + 0.2·0.80	0.98
2	Sunrise_Deliverability_Study.pdf	0.74 + 0.2·0.71	0.88
3	Table cell — FCD status	0.57 + 0.2·0.54	0.68
4	Proposal exec-summary mention	0.68 (single match)	0.68

The table cell draws level with the single-match proposal mention (both 0.68 at two decimals) because it matched two query phrasings and earned the fusion boost: recall amplification for results that surface under multiple framings.

Artifact B

Third-party tools, platforms & models

The third-party stack the pipeline builds on. Every component is either open-source (self-hosted in Viridon's VPC) or a managed AWS service reached privately over PrivateLink — nothing requires data to leave Viridon's AWS account. Our differentiated value (Section 1.5) is the integration, tuning, and Viridon-specific modules around these — which are our own, not third-party.

Component	Type	Deployment	Role
Parsing & structuring
Unstructured	Library	Open-source · in-VPC	Typed-element parsing across PDF/DOCX/XLSX/HTML
yolox	Model	Open-source · in-VPC	Page layout / region detection
Table-transformer	Model	Open-source · in-VPC	Table row/column structure recovery
Tesseract	Engine	Open-source · in-VPC	OCR for scans & image regions
pdfminer	Library	Open-source · in-VPC	Native PDF text-layer extraction (FAST path)
PyMuPDF (fitz)	Library	Open-source · in-VPC	Hyperlink annotation extraction
pandas	Library	Open-source · in-VPC	Table HTML → per-cell records
VLM (vision-language model)	Model	Bedrock (PrivateLink) or self-hosted OSS	Figure & diagram description generation
Embedding
Amazon Bedrock	Platform	AWS · PrivateLink	In-VPC model access (no-train, KMS, no public egress)
Titan Text Embeddings V2	Model	Bedrock (PrivateLink)	Default text embeddings (256/512/1024-d)
Nova Multimodal Embeddings	Model	Bedrock (PrivateLink)	Cross-modal text + figure embeddings (lead for figures)
Open embedding models (BGE / E5 / GTE)	Model	Open-source · in-VPC	Self-hosted alternative; fully in-VPC
Cohere · Titan Multimodal G1	Model	Bedrock (PrivateLink)	Alternative managed embeddings (eval/privacy-gated)
Index & retrieval
Vespa	Platform	Open-source (Apache 2.0) · self-hosted in-VPC	Hybrid BM25 + semantic + n-gram index; two-phase rank profiles; ANN
LLM (query understanding)	Model	Bedrock (PrivateLink) or self-hosted OSS	Query distillation, expansion, term significance, optional source-picking
Cross-encoder reranker (BGE-reranker)	Model	Open-source · in-VPC	Optional precision reranking of the shortlist
Storage & security
Amazon S3	Platform	AWS · in-tenant	Original documents & extracted images
AWS KMS	Platform	AWS · in-tenant	Encryption at rest (customer-managed keys optional)
AWS PrivateLink	Platform	AWS · in-VPC	Private VPC connectivity; no public-internet egress

Specific model selections (embedding model, VLM, query/rerank models) are finalized per Viridon's data-privacy requirements and eval results; the architecture treats each as a swappable component, and an open-source self-hosted option exists for every model role.

02 — Orchestration

Orchestration design

Diligence question 2.1 · grounded in the demoed workflow

Walk us through how the demoed workflow is structured — the steps, how they connect, and the framework or major dependencies it is built on. More importantly: how do you think about orchestration design, and how do you decide between approaches (routing to specialists, a manager decomposing into parallel sub-tasks, or a single-agent flow)? What about the design you chose seems right for this workflow?

We'll ground this in what we demoed: the Proposal Writing Assistant — Setup workflow, end to end. In Figure 1 that's the Setup workflow in layer 4, marked deterministic, and it's the first phase of a larger AI teammate that runs Setup → Strategy → Drafting → Evaluation. The key design choice: Setup is a deterministic chain, not an agent improvising. The steps are known, ordered, and correctness-critical on a 300-page document — so we use the simplest control flow that does the job reliably, which here means a fixed sequence of tool calls rather than a free-roaming agent.

The demoed sequence: past winning proposals are ingested and broken into typed sections (KL·B, KL·C); a working template is auto-derived and the recurring variables are detected — project_name, sponsor, capacity, key dates (T8, KL·F); a "what changes" map flags the parts of the starting proposal that likely need to change this cycle — learned from how proposals have historically changed — versus the parts safe to keep (KL·D); the new bid's brief and documents are then used to fill the template variables and revise the flagged parts, touching only what likely needs to change so the rest is preserved (T4, KL·D); the ~200-page sponsor selection reports are ingested (KL·B); their win/loss themes are extracted and indexed as a retrievable advice module (KL·E); and finally, in AI editing, when the assistant recommends what to edit and how, it pulls that indexed selection-report advice (and prior sections) via retrieval, comments paragraph-by-paragraph, and proposes edits (T1, T5, drawing on KL·A / KL·E / KL·H). The full step-by-step is below.

On framework and dependencies: the orchestrator (Figure 1, foundation) chains these tools, and both the tools and the knowledge layer are exposed over MCP — the Model Context Protocol, an open standard for connecting AI assistants to tools and data. As the diagram says, the orchestrator can be an MCP client, your Claude / GPT desktop app, or a custom router — that's deliberate, because it decouples the control flow from the tools. Setup runs as a deterministic MCP tool chain; the later phases (Strategy, Drafting, Evaluation) move to orchestrated routing where the path depends on content.

The orchestrator also maintains structured memory across the loop, and how we model that memory is itself a design decision. Rather than carrying one ever-growing chat transcript — which conflates different concerns and quickly overflows the context window — we separate session memory into three kinds: conversation (the natural-language turns that capture user intent and constraints, like "don't delete anything"), working state (a structured scratchpad of the IDs and decisions the workflow has accumulated — the source of truth for where we are), and an episodic trace (an ordered log of every tool call, its arguments, and its outcome). Each planner step receives a deliberately bounded package assembled from these — truncated conversation, current working state, the last few trace steps — rather than everything every time. That separation is what keeps the agent within context limits, keeps machine state reliable across steps (the resumability in 2.2), and makes the whole run auditable (the execution trace in 2.3). This is session-scoped memory for the orchestration loop; cross-session institutional memory is a separate concern, covered in Section 4.
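
A minimal sketch of the three kinds of session memory and the bounded package built from them. The class and field names, and the limits of six turns and three trace steps, are hypothetical choices for the sketch; the truncation policy shown (keep the most recent items) is the simplest option and a real system might also pin standing constraints such as "don't delete anything".

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TraceStep:
    """One entry of the episodic trace: a tool call and what came back."""
    tool: str
    arguments: dict[str, Any]
    outcome: str


def _tail(items: list, n: int) -> list:
    """Last n items; safe for n == 0 and for n larger than the list."""
    return items[max(0, len(items) - n):]


@dataclass
class SessionMemory:
    conversation: list[str] = field(default_factory=list)        # natural-language turns
    working_state: dict[str, Any] = field(default_factory=dict)  # IDs and decisions so far
    trace: list[TraceStep] = field(default_factory=list)         # ordered tool-call log

    def planner_context(self, max_turns: int = 6, max_trace: int = 3) -> dict[str, Any]:
        """Assemble the bounded package handed to one planner step."""
        return {
            "conversation": _tail(self.conversation, max_turns),
            "working_state": dict(self.working_state),
            "recent_trace": _tail(self.trace, max_trace),
        }


memory = SessionMemory()
memory.conversation.extend(f"turn {i}" for i in range(10))
memory.working_state["template_id"] = "tpl-001"  # made-up ID
memory.trace.extend(TraceStep(f"tool_{i}", {}, "ok") for i in range(5))
context = memory.planner_context()
```

The full trace stays in `memory.trace` for auditing even though each planner step only sees its last few entries.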

Scaling Setup into a full "AI teammate for proposal writing" is, concretely, two things: implementing the knowledge-layer modules, and building a composable tool set (Read paragraph, Create comment, Draft a section, Identify opportunities, Flow updates across the document, Evaluate against criteria, Aggregate attachments, Web research, Build a template, Grounded Q&A). Those are T1–T8 and T·Q in Figure 1, plus a few document-editor primitives. The teammate itself is a conversational multi-agent loop living in the multiplayer editor: it routes each turn to the right tool or subagent, drafts / comments / researches like a colleague, and proposes changes a human approves.

How we think about orchestration design comes down to a few principles. Use the simplest control flow that works — deterministic where steps are known, agentic only where they aren't. MCP as the interface, so tools are swappable and the orchestrator is replaceable. Human-in-the-loop at the right gates — the AI proposes, the human disposes; no risky or irreversible action (editing the live document, flowing a change across 300 pages) happens without explicit approval. And guardrails, governance and security throughout: RBAC-scoped retrieval (KL·H), the open-source / in-VPC deployment from our deployment principle, per-tool permissions, and a full execution trace of every step (Section 2.3). We build for performance (parallelize what's parallelizable) and extensibility (tools compose and get reused across mini-apps).

We decide between orchestration approaches by the shape of the work, and the patterns nest rather than compete. A deterministic chain for known, ordered, correctness-critical steps (Setup). A router to specialists when requests are heterogeneous and each needs a different capability (the teammate picking T1 vs T2 vs T7 per turn). A manager that decomposes into parallel sub-tasks when the work splits into independent units (evaluating all sections at once, multi-query retrieval, flowing one change across 300 pages). And a multi-agent loop for interactive, open-ended work with a human present (live drafting). For proposal writing this combination is right because Setup demands reliability, drafting is inherently interactive, evaluation is embarrassingly parallel — and, critically, every specialist tool we build composes: the comment, Q&A, research and retrieval tools built here are reused by the RFI drafter, the legal screener, and the onboarding assistant. That reuse is the whole shared-foundation thesis of Figure 1, and it's why we optimize for composition.
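
Two of the patterns above, the router and the manager, can be sketched in a few lines. This is an illustration only: `route`, `manage` and the specialist names are hypothetical, the routing decision is a plain dictionary lookup where the real system would decide from content, and the workers are stubs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def route(kind: str, payload: str,
          specialists: dict[str, Callable[[str], str]]) -> str:
    """Router: send one request to the single specialist that handles it."""
    if kind not in specialists:
        raise KeyError(f"no specialist for {kind!r}")
    return specialists[kind](payload)


def manage(units: list[str], worker: Callable[[str], str],
           max_workers: int = 4) -> list[str]:
    """Manager: run independent sub-tasks in parallel, results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, units))


specialists = {
    "comment": lambda text: f"comment on: {text}",
    "draft": lambda text: f"draft of: {text}",
}
routed = route("draft", "Deliverability section", specialists)
evaluated = manage(["Section 1", "Section 2", "Section 3"],
                   worker=lambda section: f"evaluated {section}")
```

The patterns nest naturally: a specialist chosen by `route` can itself call `manage` to fan out over sections.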

The demoed Setup workflow — step by step

Upload past proposals → ingest & section-extract

Past winning proposals are parsed into typed sections and RFP questions, ready for templating and retrieval.

past proposals · KL·B ingestion · KL·C section extraction

Generate template + detect variables

Auto-derive a working template from prior wins and detect the variables that recur throughout a proposal — project_name, sponsor, capacity, dates.

T8 build template · KL·F template generation

Flag what likely needs to change

Look at the past proposal we're starting from and flag the parts that likely need to change this cycle — learned from how proposals have historically changed — versus the parts safe to keep.

KL·D "what changes" map

Fill variables & revise flagged parts

Use the new bid's brief and documents to fill the template variables and revise the parts the map flagged — touching only what likely needs to change, so the rest is preserved.

new brief + docs · T4 flow updates · KL·D

Upload selection reports → ingest

Ingest the ~200-page sponsor selection reports alongside the proposal corpus.

selection reports · KL·B ingestion

Extract & index advice

Mine win/loss themes and actionable advice from the selection reports and index it as a retrievable knowledge module.

KL·E selection-report advice

AI editing via RAG

When recommending what to edit and how, the assistant retrieves the indexed selection-report advice (and relevant prior sections), comments paragraph-by-paragraph, and proposes edits grounded in what has won before.

T1 read & comment · T5 evaluate · KL·A · KL·E · KL·H

Human approves

Every proposed change is surfaced for the human to approve or deny before it touches the document — the gate before any edit is applied.

human-in-the-loop gate

Legend: Input / gate · Tool · Knowledge-layer module · Human approval
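
The steps above can be sketched as a fixed sequence of tool calls followed by an approval gate. This is a simplified illustration, not the MCP tool chain itself: `run_chain`, the step names and the stub tools are hypothetical, and each tool is reduced to a function that reads the accumulated state and returns new keys.

```python
from typing import Any, Callable

Step = tuple[str, Callable[[dict[str, Any]], dict[str, Any]]]


def run_chain(steps: list[Step], state: dict[str, Any],
              approve: Callable[[dict[str, Any]], bool]) -> dict[str, Any]:
    """Run steps in fixed order, then apply proposed edits only if approved."""
    trace = []
    for name, tool in steps:
        state = {**state, **tool(state)}  # each tool adds keys to the state
        trace.append(name)
    state["applied"] = bool(state.get("proposed_edits")) and approve(state)
    state["trace"] = trace
    return state


steps: list[Step] = [
    ("ingest_and_extract_sections",
     lambda s: {"sections": ["exec summary", "deliverability"]}),
    ("build_template", lambda s: {"variables": ["project_name", "sponsor"]}),
    ("flag_what_changes", lambda s: {"flagged": ["deliverability"]}),
    ("propose_edits",
     lambda s: {"proposed_edits": [f"revise {name}" for name in s["flagged"]]}),
]

denied = run_chain(steps, {}, approve=lambda s: False)
approved = run_chain(steps, {}, approve=lambda s: True)
```

Because the order is fixed in data rather than chosen at run time, the same input always produces the same trace, which is what makes the run easy to audit and resume.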

▸ Additional detail the teammate tool set, design principles, choosing between patterns

Building the "AI teammate" — composable tools on the knowledge layer

The teammate is a conversational loop that calls a set of small, well-scoped tools (the same ones in Figure 1, plus editor primitives). Each is built once and reused across apps; that reuse is the point.

Tool	What it does	Knowledge layer it draws on	Reused by
T·Q · Grounded Q&A	Cited answers over the knowledge layer	KL·A, G, H	every mini-app
T1 · Read & comment	Suggests improvements vs. selection-report themes	KL·C, E, G	RFI, legal
T2 · Draft a section	From template + structured prior wins	KL·A, B, C, F	RFI drafter
T3 · Identify opportunities	Where to differentiate this bid	KL·A, E, G	—
T4 · Flow updates	Propagate a change across 300+ pages	KL·B, D	evaluation
T5 · Evaluate against criteria	Score a draft vs. what wins	KL·A, E, G, H	—
T6 · Aggregate attachments	SME reports into one narrative voice	KL·A, B, H	RFI drafter
T7 · Web research & scrape	Live external + public-doc context	KL·B, I	ISO/RTO, all
T8 · Build a template	Auto-derive from past proposals	KL·C, D, F	—
Editor primitives	Read paragraph · Create comment · Apply approved edit	—	drafting surface

Orchestration design — the principles

Simplest control flow that works — deterministic chains where steps are known and correctness matters; agentic routing only where the path genuinely depends on content. Determinism buys reliability, observability, and speed.
MCP as the interface — tools and the knowledge layer are exposed over MCP, so the orchestrator (MCP client / Claude / GPT desktop / custom router) is decoupled from the tools. Either side can be swapped without rewriting the other.
Human-in-the-loop at the right gates — the AI proposes; a human approves or denies before any consequential action. No edit to the live document, no change flowed across the proposal, no external action without explicit approval (Figure 1's Evaluation phase makes this gate explicit).
Guardrails, governance & security — RBAC-aware scoped retrieval (KL·H) so the agent only ever sees what the user may see; the open-source / in-VPC deployment so nothing leaves Viridon's environment; per-tool permissioning; and a full, inspectable execution trace (Section 2.3).
Performance — parallelize independent work (multi-query retrieval, section-parallel evaluation), keep deterministic chains to avoid wasted LLM round-trips, cache where safe.
Extensibility — new use cases reuse existing tools and add a few; the orchestration pattern stays the same. This is what makes each subsequent mini-app a fraction of the first.

Orchestration memory (session-scoped, three-way split; built today)

A single chat transcript doesn't scale and conflates three different concerns. We model the orchestration loop's memory (planner → tool calls → synthesizer) as three separate types on a per-session SessionMemory, so each stays clean and bounded.

Memory type	What it holds	Answers	How it's used
Conversation	Natural-language user/assistant turns	"What did they ask for?" — intent, constraints, tone	Fed to the planner, truncated by turn count + character budget
Working state	Structured scratchpad of IDs & decisions (list_id, task_id, last_search_query)	"Where are we right now?"	Patched / replaced explicitly; included in every (size-limited) planner package; the source of truth across steps
Episodic trace	Ordered tool-call log — name, redacted args, success/failure, result summary, timestamps	"What happened?"	Written on each tool call; recent summaries go to the planner; powers audit, debug, replay & synthesis
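
A minimal sketch of how the three memory types could sit side by side on one per-session object. Only the name SessionMemory and the example keys (task_id, last_search_query) come from the description above; the field names, the TraceEntry type, and the methods are illustrative assumptions, not the production schema.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TraceEntry:
    """One episodic-trace record for a single tool call (hypothetical shape)."""
    tool: str
    redacted_args: dict[str, Any]
    ok: bool
    summary: str
    timestamp: float


@dataclass
class SessionMemory:
    """Per-session memory split into intent, state, and history."""
    conversation: list[tuple[str, str]] = field(default_factory=list)  # (role, text)
    working_state: dict[str, Any] = field(default_factory=dict)
    trace: list[TraceEntry] = field(default_factory=list)

    def add_turn(self, role: str, text: str) -> None:
        self.conversation.append((role, text))

    def patch_state(self, **updates: Any) -> None:
        # Working state changes only through explicit patches.
        self.working_state.update(updates)

    def record_call(self, entry: TraceEntry) -> None:
        # The trace is append-only, so it preserves call order.
        self.trace.append(entry)


memory = SessionMemory()
memory.add_turn("user", "Tighten section 3.2")
memory.patch_state(task_id="t-1", last_search_query="deliverability")
memory.record_call(TraceEntry("search", {"q": "[redacted]"}, True, "4 chunks", 0.0))
```

Keeping the three stores as separate fields is what lets each one be truncated, patched, or summarized on its own terms.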

Two cross-cutting mechanisms, both built today, keep this within budget:

Compression & summarization — when a tool output exceeds a threshold (~4,000 chars), it's compressed (heuristic or optional LLM) and the salient IDs are merged into working state, so large MCP payloads don't blow the context window or drown the signal.
Bounded planner package — each planner step gets a deliberately size-limited view (truncated conversation + working state + last N trace steps + recent tool summaries), enforcing intentional context propagation rather than "send everything every time".
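
The two mechanisms above can be sketched roughly as follows. This is a heuristic illustration only: the `name_id=value` pattern used to find IDs, the default budgets, and both function names are assumptions made for the example, and conversation turns are plain strings here for brevity.

```python
import re

COMPRESS_THRESHOLD = 4000  # characters, matching the ~4,000-char threshold above


def compress_output(raw: str, limit: int = COMPRESS_THRESHOLD) -> tuple[str, dict[str, str]]:
    """Shorten an oversized tool output and pull out salient IDs."""
    # Hypothetical convention: IDs appear in the payload as `some_id=value`.
    ids = dict(re.findall(r"\b(\w+_id)=([\w-]+)", raw))
    if len(raw) <= limit:
        return raw, ids
    head = raw[: limit // 2]
    return f"{head} ...[truncated {len(raw) - len(head)} chars]", ids


def planner_package(conversation, working_state, trace_summaries,
                    max_turns=6, max_chars=2000, last_n=5):
    """Build the size-limited view handed to one planner step."""
    turns = conversation[-max_turns:]
    # Drop the oldest kept turns until the character budget is met.
    while turns and sum(len(t) for t in turns) > max_chars:
        turns = turns[1:]
    return {
        "conversation": turns,
        "working_state": dict(working_state),
        "recent_trace": trace_summaries[-last_n:],
    }


summary, ids = compress_output("list_id=L-9 " + "x" * 5000)
package = planner_package(["a" * 1500, "b" * 1500, "c" * 100], {"list_id": "L-9"}, ["s1", "s2"])
```

The extracted IDs would be merged into working state, so the planner keeps the handles it needs even when the payload itself is cut.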

The synthesizer then produces the final answer by reading the goal, the trace, and the observations — not raw MCP blobs — which is what lets the loop stay reliable and within limits while still answering well.

Memory — planned / partial

Long-term memory (planned): user preferences, standing instructions, and org facts retrieved by ID or embedding into the planner prompt, so the agent can remember across sessions ("always use workspace X", "never post to #general"). Documented but not yet built in the MCP client; it's the bridge to the institutional memory in Section 4.
Conversation summarization (planned): a rolling summary of older turns instead of dropping them. Truncation is built today; summarization preserves early context more cheaply and is the next step.
Subagent memory (partial): each subagent runs its own plan–act loop with its own memory and returns a bounded result; the parent merges the child's working state (under subagent_last), not the full child trace, so delegation stays scoped and doesn't pollute parent context.

The core idea: separating intent (conversation), state (working state) and history (episodic trace) lets the orchestrator stay within context limits, keep machine state reliable, and still synthesize good answers — the opposite of stuffing one growing transcript into every call.

Choosing between orchestration patterns

Pattern	Use when	In proposal writing
Deterministic chain	Steps are known, ordered, and correctness-critical	The Setup phase — fixed sequence, fully traceable
Router to specialists	Requests are heterogeneous; each needs a different capability	The drafting teammate routing each turn to T1 / T2 / T3 / T7
Manager → parallel sub-tasks	The task splits into independent units that aggregate	Evaluating every section at once; multi-query retrieval; flowing one change across 300 pages
Multi-agent loop	Interactive, open-ended, human present	Live drafting in the multiplayer editor

Why this design is right for the workflow

Reliability where it matters — Setup is deterministic, so a long, high-stakes document is processed the same way every time, with a clean trace and no agent drift.
Fit to the human reality of drafting — drafting is iterative and collaborative, so a single conversational teammate with approval gates matches how Erin actually works, rather than forcing an autonomous agent onto a human process.
Speed where the work parallelizes — evaluation and retrieval fan out, so review of a full proposal doesn't run serially.
Composition over a monolith — building specialist tools (not one giant agent) means the proposal work directly powers the RFI drafter, legal screener, and onboarding assistant. We optimize for extensibility because the second app should cost a fraction of the first.

02 — Orchestration

Reliability

Diligence question 2.2

How do you think about reliability in a multi-step workflow, and what tools or techniques do you use to achieve it? Specifically: what happens when a step fails or returns low-quality output, how do you validate output between steps, and how do you prevent the workflow from drifting off course?

Our first reliability technique is to minimize the surface area for failure: the most reliable step is a deterministic one, which is why Setup is a fixed chain (2.1) rather than an agent improvising. For the parts that genuinely need an LLM, we treat it as a fallible component and wrap it in four things — validated structured outputs, checkpointed state, risk-classified human approval, and an in-loop reflection step. Together those cover the three failure questions: what happens when a step fails, how we validate between steps, and how we keep the workflow from drifting.

When a step fails or returns low-quality output, we separate two cases. A hard failure (error, timeout, tool exception) triggers a bounded retry with backoff, then a fallback path where one exists. This is the same pattern as the parsing fallbacks in 1.1, where a failed table inference degrades to image + description rather than crashing. If the step still fails, we resume from the last checkpoint, and once retries are exhausted we escalate to a human rather than proceed on a broken step. A soft failure (the step runs but the output is malformed, low-quality, or unsupported) is caught by the validation and reflection gates below, then repaired, retried, or escalated. The principle throughout: never silently pass a bad result downstream.
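
As a rough sketch of the hard-failure path (bounded retry with backoff, then fallback, then escalation), assuming each step is a plain callable. The function name, the EscalateToHuman exception, and the default limits are hypothetical, and checkpoint resume is left out to keep the example short.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class EscalateToHuman(Exception):
    """Raised when retries and the fallback are exhausted."""


def run_step(primary: Callable[[], T],
             fallback: Optional[Callable[[], T]] = None,
             max_attempts: int = 3,
             base_delay: float = 0.5,
             sleep: Callable[[float], None] = time.sleep) -> T:
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return primary()
        except Exception as exc:  # hard failure: error, timeout, tool exception
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff
    if fallback is not None:
        try:
            return fallback()
        except Exception as exc:
            last_error = exc
    # Never proceed on a broken step: surface it instead.
    raise EscalateToHuman(f"step failed after {max_attempts} attempts") from last_error


def parse_table():
    raise TimeoutError("table inference timed out")


# The failed table inference degrades to the fallback rather than crashing.
result = run_step(parse_table, fallback=lambda: "image + description", sleep=lambda s: None)
```

The `sleep` parameter is injectable only so the example runs instantly; a real caller would leave the default.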

We validate output between steps by making every step emit a structured, typed output against an explicit schema — so we know exactly what the model produced and can check it programmatically before the next step consumes it. Schema validation deterministically catches malformed or hallucinated structure (missing fields, wrong types, out-of-range values); a grounding check verifies that claims which should be supported by retrieved sources actually are (the anti-hallucination contract that feeds our eval harness in Section 3). Each step has an explicit input/output contract, so a downstream step never has to guess what it received.
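
To make the schema check concrete, here is a small hand-rolled validator. The contract shown (section_id, score, verdict) is a hypothetical example of one step's output, not the actual schema, and a production system would more likely use a schema library than this dictionary format.

```python
from typing import Any

# Hypothetical contract for an evaluation step's output.
SCHEMA = {
    "section_id": {"type": str},
    "score": {"type": int, "min": 0, "max": 10},
    "verdict": {"type": str, "enum": {"keep", "revise"}},
}


def validate(output: dict[str, Any], schema: dict[str, dict[str, Any]]) -> list[str]:
    """Return a list of violations; an empty list means the output passes."""
    errors = []
    for name, rule in schema.items():
        if name not in output:
            errors.append(f"{name}: missing")
            continue
        value = output[name]
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        wrong_type = not isinstance(value, rule["type"])
        if wrong_type or (rule["type"] is int and isinstance(value, bool)):
            errors.append(f"{name}: expected {rule['type'].__name__}")
            continue
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{name}: not one of {sorted(rule['enum'])}")
        if "min" in rule and value < rule["min"]:
            errors.append(f"{name}: below {rule['min']}")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{name}: above {rule['max']}")
    return errors


good = validate({"section_id": "3.2", "score": 6, "verdict": "revise"}, SCHEMA)
bad = validate({"section_id": "3.2", "score": 14}, SCHEMA)
```

Returning every violation at once, rather than stopping at the first, gives a repair or re-prompt step the full picture in one pass.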

State management makes failure recoverable. Each step's inputs and outputs are checkpointed and steps are designed to be idempotent (safe to re-run without applying anything twice), so on any failure we know exactly where we left off and resume from the last good checkpoint — we don't re-parse a 300-page proposal or re-embed the corpus because a later step timed out. This matters most for the proposal workflow specifically, which runs over months, not minutes.
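
A minimal sketch of checkpoint-and-resume, assuming step outputs are JSON-serializable and that one file per step is an acceptable store. The run_chain name and the file layout are illustrative assumptions; a durable implementation would also need atomic writes and a key that includes the step's inputs.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable

Step = tuple[str, Callable[[Any], Any]]


def run_chain(steps: list[Step], initial: Any, store: Path) -> Any:
    """Run steps in order, skipping any whose output is already checkpointed."""
    store.mkdir(parents=True, exist_ok=True)
    value = initial
    for name, step in steps:
        checkpoint = store / f"{name}.json"
        if checkpoint.exists():
            # Resume: reuse the persisted output instead of redoing the work.
            value = json.loads(checkpoint.read_text())
            continue
        value = step(value)
        checkpoint.write_text(json.dumps(value))
    return value


calls: list[str] = []


def parse(doc):
    calls.append("parse")
    return doc + ["parsed"]


def embed(doc):
    calls.append("embed")
    if calls.count("embed") == 1:
        raise TimeoutError("simulated timeout on first attempt")
    return doc + ["embedded"]


steps = [("parse", parse), ("embed", embed)]
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    try:
        run_chain(steps, [], store)
    except TimeoutError:
        pass  # first run dies in 'embed'; the 'parse' checkpoint survives
    result = run_chain(steps, [], store)  # second run resumes after 'parse'
```

In the demo, the expensive `parse` step runs once even though the chain is started twice, which is the behavior the paragraph above describes.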

We prevent drift with a reflection step in the loop — a pattern we've already implemented in our agentic orchestration work. After a step (or on a cadence), a critic re-checks the work against the original objective and constraints, catches drift, and either re-anchors, re-plans, or halts. The goal and constraints are carried through every step so the agent never loses the thread, scoped tools limit how far it can wander, and explicit termination criteria plus bounded autonomy (caps on tool calls, recursion, and cost) stop a runaway loop.

Plan → Act · call tool → Validate output → Reflect vs. goal ↻
Reflect outcomes: on track → continue · drifting → re-anchor / re-plan · off-track or budget exceeded → halt & escalate to human
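
The loop above could be sketched like this, with the planner, tool call, validator, and critic passed in as plain functions. The verdict strings, the status values, and the step budget are assumptions chosen for the example, not the real interface.

```python
from typing import Callable


def run_loop(plan: Callable[[str, list[str]], str],
             act: Callable[[str], str],
             validate: Callable[[str], bool],
             reflect: Callable[[str, list[str]], str],
             goal: str,
             max_steps: int = 8) -> tuple[str, list[str]]:
    """Return (status, observations); status is 'done', 'halted', or 'budget_exceeded'."""
    observations: list[str] = []
    for _ in range(max_steps):
        action = plan(goal, observations)      # the goal travels with every step
        output = act(action)
        if not validate(output):
            continue                           # a rejected output is never passed on
        observations.append(output)
        verdict = reflect(goal, observations)  # critic: continue / drifting / done / halt
        if verdict == "done":
            return "done", observations
        if verdict == "halt":
            return "halted", observations      # off track: escalate to a human
        if verdict == "drifting":
            observations.pop()                 # re-anchor: discard and re-plan
    return "budget_exceeded", observations     # runaway loop stopped by the cap


def toy_plan(goal, obs):
    return f"step {len(obs) + 1}"


status, obs = run_loop(toy_plan, str.upper, bool,
                       lambda goal, o: "done" if len(o) == 2 else "continue", goal="demo")
runaway_status, runaway_obs = run_loop(toy_plan, str.upper, bool,
                                       lambda goal, o: "continue", goal="demo", max_steps=3)
```

A failed validation still consumes one unit of the step budget, so a step that keeps producing bad output cannot loop forever.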

Finally, the human gate is itself a reliability control. We classify actions by risk and reversibility: read-only and reversible-in-draft actions (search, comment, propose an edit) run autonomously, while consequential or irreversible actions — applying an edit to the live document, flowing a change across 300 pages, anything external — require explicit human approval. And because applied edits are versioned (the document carries a version field), even an approved change is reversible. All of this sits on a full execution trace (Section 2.3), because you can't make reliable what you can't see.

▸ Additional detail: failure handling, validation, state, reflection, risk-gating

Failure handling — by failure mode

Failure mode	How we detect it	Response
Hard failure (error / timeout)	Exception, timeout, tool error	Bounded retry with backoff → fallback path → resume from last checkpoint → escalate if exhausted
Malformed output	Schema validation fails	Repair / re-prompt → bounded retries → escalate
Low-quality / unsupported output	Critic + grounding check fail	Reflection re-do with feedback → escalate to human if it doesn't converge
Drift from the goal	Reflection step vs. objective	Re-anchor / re-plan; halt if it can't get back on track
Runaway loop	Step / cost / recursion budget exceeded	Hard stop → surface the partial result and the reason

Inter-step validation

Structured outputs — every step returns typed data against a schema, so the output is machine-checkable, not free text that the next step has to parse hopefully.
Schema validation — required fields, types, enums and ranges are enforced deterministically; a violation is a caught failure, not a downstream surprise.
Grounding checks — outputs that should be source-supported are verified against the retrieved evidence; unsupported claims are flagged before they propagate.
Explicit contracts — each step declares its input and output shape, so steps compose safely and a change to one can't quietly break the next.

State & resumability

Checkpointing — each step's I/O is persisted; a failure resumes from the last good step rather than restarting expensive work (parsing, embedding).
Idempotency — steps are safe to re-run, so retries and resumes don't duplicate or corrupt work.
Long-running by design — the proposal workflow spans months; durable state is what makes that survivable across interruptions.

Reflection & bounded autonomy

Critic in the loop — an evaluator re-checks each step's work against the goal and constraints (already implemented in our agentic orchestration products).
Goal anchoring — the original objective and constraints travel with the workflow so the agent doesn't lose the thread across many steps.
Scoped tools + termination criteria — limited tool surface and explicit stop conditions keep loops convergent.
Budgets — caps on tool calls, recursion depth and cost catch a runaway before it does damage or burns spend.
Abstention — a step can report low confidence and escalate rather than guess; "I'm not sure, here's why" beats a confident hallucination.

Risk-classified human approval

Action class	Examples	Autonomy
Read-only / retrieval	Search, read a paragraph, grounded Q&A	Autonomous
Generative · reversible in draft	Draft a section, propose an edit, generate a template, leave a comment	Autonomous (proposed, not applied)
Consequential / irreversible	Apply an edit to the live document, flow a change across the proposal, any external action (send, export)	Human approval required

Applied edits are versioned, so an approved change can still be rolled back — reversibility is a backstop even past the approval gate.
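
One way to express the table above in code, as a sketch. The three risk classes follow the table; the action identifiers and the may_run function are hypothetical names introduced for illustration.

```python
from enum import Enum


class Risk(Enum):
    READ_ONLY = "read_only"
    REVERSIBLE_DRAFT = "reversible_draft"
    CONSEQUENTIAL = "consequential"


# Hypothetical action names, classified as in the table above.
ACTION_RISK = {
    "search": Risk.READ_ONLY,
    "read_paragraph": Risk.READ_ONLY,
    "propose_edit": Risk.REVERSIBLE_DRAFT,
    "create_comment": Risk.REVERSIBLE_DRAFT,
    "apply_edit": Risk.CONSEQUENTIAL,
    "flow_change": Risk.CONSEQUENTIAL,
    "export": Risk.CONSEQUENTIAL,
}


def may_run(action: str, approved_by_human: bool = False) -> bool:
    """Allow autonomous actions; gate consequential ones on explicit approval."""
    # Unknown actions default to the most restrictive class.
    risk = ACTION_RISK.get(action, Risk.CONSEQUENTIAL)
    if risk is Risk.CONSEQUENTIAL:
        return approved_by_human
    return True
```

Defaulting unclassified actions to the consequential class is a deliberate choice in this sketch: a newly added tool is gated until someone classifies it.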

How it connects

Reliability isn't a single feature — it's the combination of a deterministic backbone (2.1), validated structured contracts between steps, durable state, an in-loop critic, risk-gated approval, and full observability (2.3). The same eval harness that measures quality (Section 3) doubles as regression protection: when a prompt or model changes, it confirms existing behavior didn't break before the change ships.

02 — Orchestration

Observability

Diligence question 2.3

How do you think about observability, and what tooling do you use for it? Specifically: can we and our technical advisor see the full execution trace of a workflow — what each step retrieved, decided, and passed downstream?

Yes — fully, top to bottom. The shift that makes this real is treating a workflow run as spans, not log lines: a run is one trace, each step is a span, and nested tool calls and sub-agents are child spans. That tree is exactly what answers "what each step retrieved, decided, and passed downstream," because those relationships are a hierarchy, not a flat stream. We build it on OpenTelemetry (the open industry standard for tracing software) with LLM-specific semantic conventions, and the backend — Langfuse or Phoenix — is open-source and self-hosted, so the trace store lives inside Viridon's VPC alongside everything else. No traces of Viridon's proposals go to a third-party SaaS.
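
To illustrate why a tree answers the question better than a flat log, here is a plain-Python stand-in for a span tree. It is deliberately not the OpenTelemetry API; the Span class and the span names are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Span:
    """One unit of work; nested tool calls and sub-agents become children."""
    name: str
    attributes: dict[str, Any] = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)

    def child(self, name: str, **attributes: Any) -> "Span":
        span = Span(name, attributes)
        self.children.append(span)
        return span

    def render(self, depth: int = 0) -> list[str]:
        # Indentation mirrors the parent/child hierarchy.
        lines = ["  " * depth + self.name]
        for c in self.children:
            lines.extend(c.render(depth + 1))
        return lines


run = Span("setup_workflow")
step = run.child("ai_editing", tools=["T1", "T5"])
step.child("retrieve", top_chunk="selection report p.142")
step.child("read_and_comment")
print("\n".join(run.render()))
```

What a step retrieved and what it passed on are attributes of a span, and which step consumed them is given by the tree's shape.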

We design for two audiences with two surfaces over the same captured data. Your technical advisor gets the full span tree and audit logs — every tool call, every retrieval, every decision, validation result and hand-off, with token, cost and latency per span and the exact prompt-template and model version that produced each output, plus replay. Erin and end users get explainability instead of raw internals: a "why did it suggest this?" view that traces any AI recommendation back through the advice it used to the source page. Same data underneath; the advisor sees the engine, the user sees the reason.

Mapping directly to your three words: retrieved is the retrieval trace from Section 1.4 captured on each search span — the distilled and expanded queries, the candidate set, the per-signal scores, and what was filtered out, not just what came back. Decided is the planner's tool choice and the alternatives it weighed, the working-state diff for that step, and the validation/reflection verdict. Passed downstream is the typed output and the working-state delta — the bounded planner package handed to the next step.

The substrate already exists. The episodic trace, working state, and bounded planner package from our memory model (2.1) already record what each step did, what changed, and what was passed on. Observability is mostly turning that into spans and a UI — productionizing what the orchestration loop already captures, not bolting on a parallel logging system.

The feature that matters most for proposal writing is provenance lineage: for each AI claim or proposed edit, we record which retrieved chunk supported it and link that chunk back to its source page. So a recommendation traces cleanly as edit → selection-report advice (KL·E) → chunk on p.142 of the 2023 report. End-to-end answer-to-source lineage is what makes the user-facing explainability trustworthy — and it doubles as a clean artifact for a future data room.

Two things worth flagging for a technical reviewer. Logged "reasoning" is the model's stated rationale — an honest record of what it reported, not a proof of the true cause — so "decided" means the recorded decision plus its stated reasoning. And exact reproducibility is bounded by model nondeterminism: we pin prompt-template and model versions and set seeds where the provider allows, so replaying a trace is fully reliable, but re-generating a hosted model's identical output is not guaranteed. Because traces contain document content, redaction (arguments are already redacted), access control on the trace UI, retention limits, and sampling are part of the design, not afterthoughts.

▸ Additional detail: tooling, per-span capture, the two surfaces, provenance, limits

Tooling (self-hosted in-VPC)

OpenTelemetry + LLM semantic conventions (OpenInference / OpenLLMetry) — an open standard, so we're not locked to one vendor's trace format.
Langfuse or Phoenix as the trace backend & UI — both open-source and self-hostable, deployed inside Viridon's VPC per the deployment principle.
Replay — because state is checkpointed (2.2), a run can be re-executed from any step for time-travel debugging.

What's captured per span

Captured	Detail
Identity & timing	Span name, parent, start/end, duration
Inputs	Tool name, redacted args, the bounded package the step received
Retrieval	Distilled + expanded queries, candidate set, per-signal scores, what was dropped, rank profile used
Decision	Planner tool choice + alternatives weighed; validation / reflection verdict
Output	Typed result + working-state delta passed downstream
Cost & version	Tokens, cost, latency; prompt-template + model version
Grounding	Which sources supported which claims — the provenance link

Two audiences, two surfaces

Audience	Surface	What they see
You + technical advisor	Full span tree + audit logs (Langfuse / Phoenix, self-hosted)	Every step's retrieval, decision, validation and hand-off; cost / latency; prompt + model versions; replay
Erin / end users	In-product explainability view	"Why did it suggest this?" — recommendation → advice used → source page; no raw internals

Observability → evaluation loop

Traces aren't only for debugging. We sample production traces into the eval set (Section 3) and monitor online quality signals — grounding-failure rate, retrieval-hit-rate, drift / halt events — not just latency and cost. That's the difference between "we have logs" and "we know it's working", and it's what turns the regression story in 2.2 into a live signal.

Honest limits

Stated rationale ≠ proof — we record the model's reported decision and reasoning; it's a faithful record of what it said, not a guarantee of the underlying cause.
Bounded reproducibility — version-pinning and seeds make trace replay reliable; a hosted model's exact output isn't guaranteed identical on re-run.
Sensitive content — traces hold document text, so redaction, RBAC on the trace UI, retention limits, and sampling are designed in from the start.

Artifact

End-to-end execution trace — demoed Setup workflow

The full run as an expandable span tree. Click any step to see what it retrieved, decided, and passed downstream. This is a representative render of what the advisor sees in the self-hosted trace UI.

▸ TRACE · Proposal Writing Assistant (Setup) · deterministic · 8 steps · 41.6s

▸ 1 · Ingest & section-extract · KL·B / C · 18.2s

Retrieved

12 past proposals + 3 selection reports from SharePoint (source scope applied)

Decided

HI_RES parse strategy (docs < 999 pp, image indexing on); table inference enabled

Passed downstream

1,840 typed sections + 312 RFP questions → working_state.section_index

▸ 2 · Generate template + detect variables · T8 · KL·F · 6.1s

Retrieved

12 prior winning proposals, ranked by selection outcome (KL·F)

Decided

Template derived from highest-scoring wins; 47 recurring variables detected — project_name, sponsor, capacity_mw, cod_date …

Passed downstream

Template + variable manifest → working_state.template

▸ 3 · Flag what likely needs to change · KL·D · 3.4s

Retrieved

The starting proposal + change patterns learned across historical proposals

Decided

Flagged the parts of the starting proposal that likely need to change this cycle (e.g. project specifics, deliverability sections) vs. the parts safe to keep

Passed downstream

"What likely needs to change" map → working_state.change_map

▸ 4 · Fill variables for the new bid · T4 · KL·D · 5.0s

Retrieved

New project brief + 4 supporting documents

Decided

Revise only the flagged parts and fill the template variables; the rest left untouched

Passed downstream

Filled draft v0 → working_state.draft_id (version 1)

▸ 5 · Ingest selection reports · KL·B · 9.7s

Retrieved

3 sponsor selection reports (~200 pp each)

Decided

HI_RES parse; table + figure extraction

Passed downstream

Parsed selection-report records → index

▸ 6 · Extract & index advice · KL·E · 7.3s

Retrieved

Parsed selection-report records

Decided

Mined 64 win/loss advice entries; indexed as a retrievable advice module

Passed downstream

Advice module → KL·E (now retrievable by the editor)

▸ 7 · AI editing via RAG · T1 · T5 · 11.2s

Retrieved

Selection-report advice + prior winning sections — top chunk: 2023 selection report, p.142 (final score 0.98). See child span for the full retrieval trace.

Decided

Propose 1 comment + 1 edit to §3.2, strengthening deliverability evidence — grounded in KL·E theme "deliverability evidence under-stated vs. winning bids"

Passed downstream

Proposed change-set {comment_1, edit_1} → working_state.pending_changes

▸ 7.1 · Retrieve (hybrid) · KL·A / E / H · 1.4s

Retrieved

Query "deliverability outcome for Sunrise in CAISO" → 4 expanded queries → 4 chunks. Ranked: selection-report §Deliverability p.142 (0.98) · Sunrise_Deliverability_Study.pdf (0.88) · FCD-status table cell (0.68) · exec-summary mention (0.68). Full per-signal breakdown in Section 1 · Artifact A.

Passed downstream

Top 4 ranked chunks → comment + evaluate tools

▸ 7.2 · T1 · Read & comment · 4.6s

Decided

Flag §3.2 paragraph; rationale: selection-report advice (KL·E) says deliverability outcomes win on quantified evidence — current draft asserts without figures

Passed downstream

1 proposed comment, with provenance link → p.142

▸ 7.3 · T5 · Evaluate vs. criteria · 3.8s

Decided

§3.2 scores 6/10 against winning bids; gap = quantified deliverability outcome

Passed downstream

Score + gap note attached to the change-set

▸ 7.4 · Validate & reflect · guard · 1.4s

Decided

Grounding check passed — the comment cites a real retrieved source (p.142). On-track vs. goal; no drift. Proceed to human gate.

▸ 8 · Human approval gate · human · awaiting

Retrieved

Pending change-set {comment_1, edit_1} with provenance

Decided

Surface to Erin for approve / deny — no autonomous application of edits

Passed downstream

Awaiting human; nothing applied to the live document yet

Legend: Retrieved · Decided · Passed downstream

03 — Evaluation

What we measure

Diligence question 3.1

What do you measure to know the system is working, and how do you define each metric? Specifically: how do you treat the distinct failure types — the wrong source being retrieved, an output claim not supported by the retrieved source (hallucination), and low output quality?

We measure each stage of the pipeline separately, on purpose — because a bad final answer is a symptom, and what makes an eval useful is being able to say why it was bad. The three failure types you name aren't interchangeable: they live in different stages and have different fixes, so we attribute every failure to a stage rather than scoring only the end result. That localization is the whole design of the eval.

The wrong source retrieved is a retrieval-stage failure, scored against a labeled set of which chunks are relevant per query. The headline metric is recall@k / hit-rate (did a relevant chunk make the top-k at all), because if it wasn't retrieved, the generator simply can't use it. Around that we track context recall (did we get all the chunks needed) versus context precision (are the relevant ones ranked above the noise), plus MRR for how high the first relevant chunk landed and nDCG for the rank-weighted position of all relevant chunks. We split a recall miss (relevant chunk absent, usually fatal) from a precision miss (irrelevant chunk ranked high, which dilutes context) because they have different fixes. For Viridon we add filter correctness (did a scope like ISO_RTO = CAISO actually apply), because the dangerous failure here is cross-project contamination, which scoped retrieval (KL·H) exists to prevent.
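
For reference, the per-query versions of these retrieval metrics can be computed as below, assuming binary relevance labels. Hit-rate and MRR are then the averages of hit_at_k and reciprocal_rank over all labeled queries. The chunk IDs and function names are made up for the example.

```python
import math


def hit_at_k(ranked: list[str], relevant: set[str], k: int) -> bool:
    """Did any relevant chunk make the top k at all?"""
    return any(c in relevant for c in ranked[:k])


def context_recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the needed chunks that appear in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / position of the first relevant chunk, or 0 if none was retrieved."""
    for position, chunk in enumerate(ranked, start=1):
        if chunk in relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: rank-weighted credit for every relevant chunk."""
    dcg = sum(1.0 / math.log2(pos + 1)
              for pos, chunk in enumerate(ranked[:k], start=1) if chunk in relevant)
    ideal = sum(1.0 / math.log2(pos + 1) for pos in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


ranked = ["c7", "c2", "c9", "c4"]   # retriever output, best first
relevant = {"c2", "c4", "c5"}       # labeled ground truth for this query

print(hit_at_k(ranked, relevant, 3))             # True
print(context_recall_at_k(ranked, relevant, 3))  # 1 of 3 needed chunks found
print(reciprocal_rank(ranked, relevant))         # 0.5, first relevant at rank 2
```

In this example the query counts as a hit but has low context recall, which is the kind of gap the separate metrics are meant to expose.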

An unsupported claim — a hallucination — is a generation-stage failure, measured as faithfulness / groundedness. The key distinction: we don't score "is it true in the world," we score "is every claim entailed by what we retrieved" — the right contract for RAG, because we control the sources. Mechanically we decompose the output into atomic claims and check each against the retrieved context (supported / unsupported / contradicted), plus citation accuracy — does the cited source actually support the claim, which the provenance lineage from 2.3 makes directly checkable. One subtlety: a hallucination is often a retrieval failure in disguise. If context-recall was low, the model filled the gap — so we only call it a generation bug when recall was high and it still invented something. That's why we measure retrieval separately rather than scoring the final answer alone.
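
A small sketch of the scoring and attribution logic just described, assuming each atomic claim has already been labeled supported, unsupported, or contradicted by an upstream checker. The 0.8 recall floor is an arbitrary illustrative threshold, not a tuned value, and the function names are hypothetical.

```python
def faithfulness(verdicts: list[str]) -> float:
    """Fraction of atomic claims labelled 'supported'."""
    if not verdicts:
        return 1.0  # no claims made, so nothing is unsupported
    return sum(v == "supported" for v in verdicts) / len(verdicts)


def attribute_failure(context_recall: float, verdicts: list[str],
                      recall_floor: float = 0.8) -> str:
    """Decide which stage owns an unsupported claim."""
    if all(v == "supported" for v in verdicts):
        return "pass"
    if context_recall < recall_floor:
        return "retrieval"   # the model filled a gap that retrieval left
    return "generation"      # the evidence was there and it still invented something


score = faithfulness(["supported", "unsupported", "supported", "contradicted"])
stage_low_recall = attribute_failure(0.4, ["supported", "unsupported"])
stage_high_recall = attribute_failure(0.95, ["supported", "unsupported"])
```

The same pair of verdicts is attributed to different stages depending on context recall, which is why retrieval has to be measured on its own.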

Low output quality is the fuzziest and most domain-specific. The generic dimensions are answer relevance and completeness, instruction-following (did it respect "don't touch the boilerplate", length, tone), coherence, and format / schema validity (already enforced by the structured-output validation in 2.2). But the differentiated quality metric for proposal writing is "winning-ness" — scoring whether an edit makes a section more like the sections that have won, built from the selection-report advice in KL·E. That's a quality rubric grounded in what Viridon actually cares about, which no off-the-shelf eval framework gives you.

Around those three sit a broader taxonomy. Because this is agentic, not only RAG (Section 2), we also measure task success / completion, tool-selection accuracy, and trajectory correctness (did it reach the answer for the right reasons, not by luck), alongside operational signals — drift / escalation rate (2.2), latency, and cost. And the single best real-world quality signal for the assistant is human edit-distance / acceptance: how much Erin changes a proposed edit before accepting it. Low edit distance is high quality, measured on live usage with no labeling. The full taxonomy is in the detail below.
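
As one possible way to quantify the edit-distance signal, the sketch below uses Python's standard difflib to compare the proposed text with what the user finally accepted, word by word. It is a similarity-based proxy rather than a true Levenshtein distance, and the function name and sample sentences are illustrative.

```python
from difflib import SequenceMatcher


def edit_fraction(proposed: str, accepted: str) -> float:
    """0.0 means accepted verbatim; values near 1.0 mean little of the proposal survived."""
    proposed_words, accepted_words = proposed.split(), accepted.split()
    return 1.0 - SequenceMatcher(None, proposed_words, accepted_words).ratio()


verbatim = edit_fraction("Deliverability is fully secured.", "Deliverability is fully secured.")
revised = edit_fraction(
    "Deliverability is fully secured.",
    "Deliverability is fully secured, per the 2023 study.",
)
```

Comparing on words rather than characters keeps small punctuation changes from dominating the score; either granularity would fit the metric as described.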

Two limitations to be upfront about. Several of these metrics use an LLM as judge, which is itself fallible — so we calibrate it against human labels, reserve it for scale, and keep humans on the high-stakes and subjective calls. And all of it is only as good as the ground truth it's scored against, which is the next question (3.2). Tooling stays in-VPC per the deployment principle: RAGAS-style metric computation (RAGAS is an open-source toolkit for evaluating RAG systems) plus a self-hosted judge model on Bedrock, fed by the execution traces from 2.3.

Where it breaks — failure localization

Each example is checked at three points; the first ✕ is where it breaks, which points to a specific fix. Downstream checks are moot once an upstream stage fails.

Example query	① Right source?	② Faithful to source?	③ Quality output?	Where it breaks → fix
"Sunrise deliverability outcome (CAISO)"	✓	✓	✓	Pass
"Interconnection cost, Project Atlas"	✕ recall miss	n/a	n/a	Retrieval: relevant chunk absent → tune rank profile / embeddings
"Deliverability evidence requirements"	✓	✕ unsupported	n/a	Generation: recall was high, claim invented → tighten grounding / prompt
"Summarize selection feedback"	✓	✓	✕ format	Quality: verbose, ignored format → tune prompt / schema
"Costs for Project X" (scoped)	✕ wrong filter	n/a	n/a	Scoping: pulled another project → fix filter (KL·H)

▸ Additional detail: full metrics taxonomy, definitions, judge calibration

The metrics taxonomy

Tier	Metric	Definition	Targets
Retrieval — "did we find the right thing?"
Retrieval	Recall@k / hit-rate	Fraction of queries where a relevant chunk is in the top-k	Wrong source
Retrieval	Context recall	Did we retrieve all the chunks needed to answer	Wrong source (recall)
Retrieval	Context precision	Are relevant chunks ranked above irrelevant ones	Wrong source (precision)
Retrieval	MRR / nDCG	Position of the first / all relevant chunks (rank-weighted)	Wrong source
Retrieval	Filter correctness	Did hard filters (e.g. ISO/RTO) scope correctly	Cross-project contamination
Generation — "did it use what it found honestly?"
Generation	Faithfulness / groundedness	Fraction of output claims entailed by the retrieved context	Hallucination
Generation	Citation accuracy	Does the cited source actually support its claim	Hallucination
Quality — "is the output good?"
Quality	Answer relevance / completeness	Does it answer the question, fully	Low quality
Quality	Instruction-following	Respects constraints — boilerplate, length, tone	Low quality
Quality	Format / schema validity	Structured output is well-formed (ties to 2.2)	Low quality
Quality	"Winning-ness"	Does an edit make a section more like winning sections (from KL·E)	Low quality (domain)
Agentic & operational — "did the workflow behave?"
Agentic	Task success / completion	Did the workflow achieve the goal end-to-end	End-to-end
Agentic	Tool-selection accuracy	Was the right tool chosen at each step	Process
Agentic	Trajectory correctness	Right steps for the right reasons, not luck	Process
Operational	Drift / escalation rate	How often it goes off-track or needs a human (2.2)	Reliability
Operational	Latency / cost	Speed, tokens, spend per run	Efficiency
Real-world	Human edit-distance / acceptance	How much a user changes a proposed edit before accepting it	Live quality

Localization — three checkpoints, three fixes

① Right source retrieved? → if not, it's a retrieval problem; fix the rank profile, embeddings, or filters. No downstream metric matters until this passes.
② Used faithfully? → relevant only once retrieval passed. An unsupported claim with high context-recall is a genuine generation bug; with low recall it's really a retrieval miss.
③ Quality output? → evaluated last, because a faithful-but-poorly-written or off-format answer is a generation/prompt problem, not a data problem.

LLM-as-judge — calibration & honesty

Calibrated against humans — the judge model is validated on a human-labeled sample so we know its agreement rate before we trust it at scale.
Reserved for scale — automated judging runs the volume; humans handle high-stakes, subjective, and "winning-ness" calls.
Self-hosted — the judge runs on Bedrock in-VPC, fed by the observability traces (2.3), so eval data never leaves Viridon's environment.
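
Calibration against humans could be measured along these lines. Raw agreement is what the text above refers to; Cohen's kappa is added here as a common chance-corrected companion and is an assumption of this sketch, not something the design specifies. The sample labels are invented.

```python
from collections import Counter


def agreement_rate(judge: list[str], human: list[str]) -> float:
    """Share of items where the judge model matches the human label."""
    if len(judge) != len(human) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(human)


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Agreement corrected for what two raters would match on by chance."""
    n = len(human)
    observed = agreement_rate(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# s = supported, u = unsupported (invented sample of 8 judged claims)
judge = ["s", "s", "u", "s", "u", "s", "s", "u"]
human = ["s", "s", "u", "u", "u", "s", "s", "s"]
rate = agreement_rate(judge, human)
kappa = cohens_kappa(judge, human)
```

Here the judge agrees with the human on 6 of 8 items, but kappa is noticeably lower, because some of that agreement would be expected by chance given how often both use the same label.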

Every metric here needs a labeled "right answer" to score against — how we build that ground truth, and how we minimize the SME time it takes, is Section 3.2.

03 — Evaluation

Ground truth

Diligence question 3.2

How do you establish ground truth — the labeled "right answers" evals are scored against — and who builds that set? Where the answer depends on our subject-matter experts, how do you minimize the time required from them?

Ground truth is the real bottleneck in enterprise RAG evaluation, so we treat SME time as the scarce resource we engineer around, not an afterthought. The first move is to stop thinking of "ground truth" as one thing: it has three layers — retrieval truth (which chunks are relevant to a query), answer truth (the correct answer text), and preference / rubric truth (which of two outputs is better, or how it scores on a rubric). They cost very different amounts to label, so matching the cheapest viable label type to each metric is already a major saver — most retrieval and faithfulness checks need no authored answer at all.

The single biggest unlock for Viridon is that your archive is already a labeled dataset. A library of won proposals and ~200-page selection reports isn't just source material — a winning section is a gold answer for "how should this section read," and a selection report is labeled feedback on what was strong and weak. So a large share of the "right answers" already exist in your corpus; the work is extraction, not authoring. That turns ground truth from a cost you'd carry into an asset you already own.

Around that, we draw labels from the cheapest sources first (the ladder below). Reference-free metrics need zero SME input — faithfulness is checked against the retrieved context, not a gold answer, and schema validity is deterministic. Implicit labels from real usage are free and compounding: every time Erin accepts, edits, or rejects a proposed change, that's a label, and the edit diff tells us how it was wrong. Synthetic generation with human verification handles the rest — an LLM drafts (question, answer, source-chunk) triples from your documents, and the SME's job collapses from authoring to approving or correcting, which is several times faster.
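
As an illustration of how an accept / edit / reject event could become a label, the sketch below compares the proposed text with what the user finally kept, using Python's standard `difflib`. The function name, the word-level comparison, and the 0.98 acceptance threshold are assumptions made for the example.

```python
import difflib
from typing import Optional


def implicit_label(proposed: str, final: Optional[str],
                   accept_threshold: float = 0.98) -> dict:
    """Turn one user interaction into a label plus the diff that explains it.

    final is None when the user rejected the proposed change outright.
    """
    if final is None:
        return {"label": "reject", "similarity": 0.0, "changes": []}
    a, b = proposed.split(), final.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    # Keep only the spans that differ: they show how the suggestion was wrong.
    changes = [(tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
               for tag, i1, i2, j1, j2 in matcher.get_opcodes()
               if tag != "equal"]
    similarity = matcher.ratio()
    label = "accept" if similarity >= accept_threshold else "edit"
    return {"label": label, "similarity": similarity, "changes": changes}


result = implicit_label("We deliver 100 MW by 2027", "We deliver 120 MW by 2027")
```

The `changes` list is the part that doubles as a gold label: the corrected span is the "right answer" for that location.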

Where SMEs are needed, we minimize their time deliberately: approve, don't author (review LLM-drafted labels rather than writing from scratch — the biggest single lever); active learning (we surface the highest-value cases — where the system is uncertain or where the judge and a human disagree — instead of asking them to label at random); a small, stratified golden set plus a large auto-graded set (a few hundred carefully chosen, human-verified examples anchor a calibrated LLM judge that handles the volume); and capture in the natural workflow (a thumbs-up, an accepted edit, or a "this source is wrong" flag in the product is a label given without extra effort). The result is that SME involvement is bounded and front-loaded, and trends toward near-zero as usage-based labels compound.

On who builds it: we build the harness, generate the synthetic candidates, mine the historical corpus, and run and calibrate the judge; your SMEs spend bounded, high-leverage time approving and correcting the golden set and resolving the contested cases; and the product harvests implicit labels continuously. Two caveats. Ground truth isn't static or singular — SMEs disagree and what "wins" shifts as sponsors change — so we measure inter-annotator agreement, version the golden set, and treat it as living rather than a one-time deliverable. And synthetic labels carry a bias risk — an auto-generated test set can be easy in the same ways the system is good, flattering the scores — which we counter by seeding from your real artifacts and keeping a human-authored slice as the hard anchor.

Where labels come from — by SME cost

Reference-free metrics

Faithfulness (vs. retrieved context) & schema validity — no gold answer needed

SME cost · none

Implicit from usage

Accept / edit / reject on proposed changes — free, compounding labels; the diff shows how it was wrong

SME cost · none

Mined from history

Won proposals = gold answers · selection reports = labeled feedback — already in the archive

SME cost · low

Synthetic + SME verify

LLM drafts (question, answer, source) triples; SME approves or corrects rather than authoring

SME cost · medium

SME-authored golden slice

The hard anchor + contested cases — small, stratified, deliberately bounded

SME cost · high

Most coverage comes from the cheap and free sources at the top; the expensive, SME-authored slice is kept small and high-leverage. As usage grows, the free implicit labels compound and the SME share shrinks further.

▸ Additional detail: the three layers, division of labor, SME-minimization, caveats

Three layers of ground truth

Layer	What's labeled	Typical cost	How we get it
Retrieval truth	Which chunks are relevant to a query	Low	Confirm the source, or bootstrap from a known answer's source chunk
Answer truth	The correct answer text	High	Mine from won proposals; synthetic + verify; small SME-authored anchor
Preference / rubric truth	Which output is better, or its rubric score	Low–medium	A/B preference or rubric — far cheaper than authoring gold; "winning-ness" derived from selection reports (KL·E)

Who builds it — division of labor

Who	Does what
BetterBrain	Builds the eval harness; generates synthetic candidates; mines the historical corpus; runs and calibrates the judge
Viridon SMEs	Bounded, high-leverage time: approve / correct the golden set, resolve contested cases, set the "winning" rubric
The product	Harvests implicit labels continuously (accept / edit / reject) — zero added effort

Minimizing SME time — the techniques

Approve, don't author — SMEs review and correct LLM-drafted labels instead of writing from scratch. The biggest single lever.
Active learning / prioritization — label the contested cases (system uncertain, or judge-vs-human disagreement), not random samples, for more eval signal per SME-minute.
Small golden set + large auto-graded set — a few hundred stratified, human-verified examples calibrate an LLM judge that grades the volume; SME effort is front-loaded and bounded.
Capture in the workflow — thumbs, accepted edits, and "wrong source" flags in the product produce labels as a by-product of normal use.
Structured elicitation — when SMEs are needed, they get a tight review UI (answer + source + ✓ / ✕ / fix), so it's minutes per item, not hours.
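
The active-learning step in the list above can be sketched as a simple priority score: cases where the judge is near its decision boundary, or where it disagrees with an existing human label, go to the top of the SME queue. The field names (`judge_score`, `human_label`) and the scoring formula are hypothetical.

```python
def prioritize_for_review(cases, budget):
    """Return the ids of the `budget` cases most worth an SME's time.

    cases: list of dicts with 'id', 'judge_score' in [0, 1], and optionally
    'human_label' (True = good) from an earlier review.
    """
    def priority(case):
        # Uncertainty peaks at 1.0 when the judge score sits at 0.5.
        uncertainty = 1.0 - abs(case["judge_score"] - 0.5) * 2
        judge_says_good = case["judge_score"] >= 0.5
        disagrees = "human_label" in case and case["human_label"] != judge_says_good
        return uncertainty + (1.0 if disagrees else 0.0)

    ranked = sorted(cases, key=priority, reverse=True)
    return [case["id"] for case in ranked[:budget]]


queue = prioritize_for_review(
    [{"id": "a", "judge_score": 0.95},
     {"id": "b", "judge_score": 0.52},
     {"id": "c", "judge_score": 0.90, "human_label": False}],
    budget=2,
)
```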

Honest caveats

Not static or singular — annotators disagree and "winning" drifts as sponsors change; we track inter-annotator agreement, version the golden set, and treat it as living.
Synthetic bias — auto-generated tests can flatter the system; we seed from real artifacts and keep a human-authored hard anchor to counter it.

The implicit-from-usage labels are also the input to the self-learning loop in Section 4, and they're the same signal as the human edit-distance metric in 3.1 — ground truth and the feedback loop are two views of the same data.

How evals are run

Diligence question 3.3

How are evals run operationally — an automated pipeline, your team, our team, or a hybrid? What is the division of labor, and what ongoing time commitment would you expect from us? Specifically: when a prompt or model changes, how do you confirm the change did not break existing behavior (regression testing)?

Evals run as a hybrid at three cadences, not one — so the answer to "automated, your team, or ours" is: all three, at different speeds. A fast automated suite runs in CI on every prompt or model change and blocks the merge if it regresses (the regression gate). A comprehensive batch runs nightly and on-demand against the full golden set for the thorough scorecard. And continuous online monitoring scores sampled production traffic on reference-free metrics. Automation does the volume; humans do the judgment.

On division of labor: BetterBrain builds and maintains the pipeline, writes the metrics, owns the CI gate, triages regressions, and calibrates the judge. The automated system does the bulk of the work — CI on every change, the nightly batch, and live monitoring — with no human in the loop. Your SMEs touch only the irreducibly human part: periodic review of the golden set and adjudicating the handful of borderline cases CI surfaces.

On your time commitment — concretely, because you asked: almost all of it is upfront, establishing and ratifying the golden set and the "winning" rubric — on the order of 15–20 SME-hours, front-loaded over the first few weeks, and mostly approve-not-author (per 3.2). After that there is no standing commitment: ongoing involvement is ad-hoc only — when the golden set needs an update because the corpus or sponsor criteria changed — averaging under 30 minutes a month, and trending down further as implicit usage labels compound.

On regression testing: the locked, versioned golden set is the regression suite. On any prompt, model, embedding/index, or tool change, we re-run it and compare to the previous baseline. Because outputs are non-deterministic we don't assert string equality — we gate on metric thresholds and no-regression deltas ("no metric dropped more than N% vs. baseline"). The most actionable technique is A/B diffing: surface only the examples that flipped pass↔fail, so a reviewer looks at the handful that changed, not all 500. And we slice and gate per segment (doc type, ISO/RTO, question type), because a sub-segment can tank while the average stays flat — the silent-degradation trap that aggregate-only eval misses.
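
A minimal sketch of the gate described above, assuming scores are stored per segment and per metric. It applies both an absolute floor and a no-regression delta, slice by slice. The function name, data shapes, and the 0.02 default are illustrative; the delta here is in absolute score points rather than percent.

```python
def regression_gate(baseline, candidate, floors, max_drop=0.02):
    """Return {segment: [failure messages]} for every segment that fails.

    baseline / candidate: {segment: {metric: score}}
    floors: {metric: absolute minimum score}
    max_drop: largest tolerated drop vs. baseline, in absolute score points
    """
    failures = {}
    for segment, new_scores in candidate.items():
        problems = []
        for metric, floor in floors.items():
            new = new_scores[metric]
            old = baseline[segment][metric]
            if new < floor:
                problems.append(f"{metric} {new:.2f} is below floor {floor:.2f}")
            if old - new > max_drop:
                problems.append(f"{metric} dropped {old - new:.2f} vs. baseline")
        if problems:
            failures[segment] = problems
    return failures


baseline = {"CAISO": {"faithfulness": 0.96}, "PJM": {"faithfulness": 0.97}}
candidate = {"CAISO": {"faithfulness": 0.99}, "PJM": {"faithfulness": 0.90}}
blocked = regression_gate(baseline, candidate, floors={"faithfulness": 0.95})
```

In this example the average across the two segments barely moves, yet the PJM slice fails both checks. An empty result means the change is clean to ship.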

Two operational notes: The judge model is itself non-deterministic, so a "regression" can be judge noise — we pin the judge's model and prompt versions, average over runs on the golden set, and route borderline flips to a human. And we treat eval cases as version-controlled code — the golden set lives in the repo and changes via review, so the suite evolves with the same rigor as the system. The loop closes with 2.3 and 3.2: production failures caught by monitoring are promoted into the golden set, so the regression suite gets harder exactly where the system is weak. Tooling — Promptfoo, Langfuse or Phoenix — is self-hosted in-VPC per the deployment principle. The full operating model is the plan below.
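
The judge-noise control can be sketched as averaging several judge runs per example and refusing to call anything that lands too close to the threshold. The 0.03 margin and the function name are assumptions for illustration.

```python
from statistics import mean


def judged_verdict(run_scores, threshold, margin=0.03):
    """Average repeated judge scores; send near-threshold cases to a human."""
    average = mean(run_scores)
    if abs(average - threshold) <= margin:
        return "borderline"  # too close to call: route to a reviewer
    return "pass" if average > threshold else "fail"
```

With a threshold of 0.80, three runs scoring 0.79, 0.83 and 0.81 average close enough to the bar to be flagged as borderline instead of being counted as a pass or a regression.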

Artifact

Eval plan — Proposal Writing Assistant

The operating model for the proposal-writing use case: the three cadences, the regression gate, and the SME time budget in one view.

Cadence 1 · automated

CI / pre-merge

Trigger: Prompt, model, embedding/index, or tool change

Runs: Fast regression subset of the golden set, per stage

Metrics: Per-stage thresholds + A/B diff + per-segment slice

Who: Automated, in the pipeline

Blocking — blocks merge on regression

Cadence 2 · scheduled

Offline batch

Trigger: Nightly + on-demand

Runs: Full golden set, all metrics

Metrics: Full taxonomy (3.1) + trend tracking

Who: Automated + BetterBrain review

Scorecard — tracks trends over time

Cadence 3 · continuous

Production monitoring

Trigger: Live, on sampled real traffic

Runs: Reference-free metrics on real queries

Metrics: Faithfulness · retrieval-hit · drift/escalation · latency · cost

Who: Automated, with alerting

Monitor — promotes failures to golden set

Regression gate, on every change: change (prompt / model / index / tool) → re-run golden set → compare vs. baseline (thresholds · A/B diff · per-segment) → regression: block & review · clean: ship

Your time · upfront

~15–20 SME-hours

Front-loaded over the first few weeks — establish & ratify the golden set and "winning" rubric (mostly approve-not-author).

Your time · ongoing

< 30 min / month

Ad-hoc only — golden-set updates when the corpus or sponsor criteria change. No standing commitment.

What we measure — proposal-writing scorecard

What we measure	How we measure it · ground truth	Target (illustrative)
Setup — template, variables, change detection
Template coverage	Generated template vs. SME-ratified structure from prior wins	≥ 95% required sections present
Variable detection (precision / recall)	Detected variables (project_name, sponsor, capacity…) vs. SME-labeled set on held-out proposals	P ≥ 0.95 · R ≥ 0.90
Change-flagging (precision / recall)	Parts flagged "likely to change" vs. SME-labeled actual changes across historical proposal pairs	R ≥ 0.90 · P ≥ 0.80
Variable-fill accuracy	Filled field values vs. the new project brief	≥ 0.95
Update propagation (precision / recall)	All correct locations updated across 300+ pages, nothing else, vs. labeled change-set	R = 1.0 · P ≥ 0.95
Advice & AI editing
[now] Advice retrieval (recall@k)	Relevant selection-report advice retrieved, vs. labeled edit-context → advice pairs	Recall@5 ≥ 0.90
[now] Scoping / no contamination	Adversarial cross-project queries: does it ever pull another project's data?	0 cross-project leaks
[now] Recommendation grounding (faithfulness + citation)	Every comment's claim checked against its cited retrieved source	Faithful ≥ 0.95 · Citation ≥ 0.98
Drafted-content grounding	Drafted-section claims checked vs. brief + prior wins	≥ 0.95
"Winning-ness" of edits	LLM-judge rubric from selection-report advice (KL·E) + SME preference on a sample	Edit improves score in ≥ 80%
[now] Boilerplate preservation	Diff of changed text vs. the change-map: only flagged parts touched	≥ 0.99 untouched
Real-world & operational
[now] Human acceptance + edit-distance	Live accept / edit / reject on proposed changes (implicit labels)	Acceptance ≥ 70% & rising
Workflow completion	Trace status (2.3): valid template + filled draft, no failed step	≥ 0.98
[now] Latency · cost · drift	Production monitoring (2.3)	Within budget · escalation only at human gate

Targets are illustrative starting points; the actual thresholds are set from the baseline once the golden set is established (3.2) and become the no-regression bar in CI.
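
For reference, the precision / recall and recall@k figures in the scorecard follow the standard definitions. A minimal sketch, with hypothetical function names and example values:

```python
def precision_recall(predicted: set, actual: set):
    """Precision and recall of a predicted set against a labeled set."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(actual) if actual else 1.0
    return precision, recall


def recall_at_k(ranked: list, relevant: set, k: int = 5) -> float:
    """Share of the relevant items that appear in the top-k results."""
    if not relevant:
        return 1.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


# Variable detection: what the system found vs. what an SME labeled.
p, r = precision_recall({"sponsor", "capacity", "cod"},
                        {"sponsor", "capacity", "project_name", "poi"})
```

The asymmetric targets in the table (for example recall above precision for change-flagging) reflect which error is costlier: a missed change is worse than an extra flag a reviewer can dismiss.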

Phased rollout — where we start

We don't stand all of this up at once. The highlighted rows (marked "now" in the scorecard) are our initial focus: the metrics that most directly safeguard your data, prevent unsupported claims, and reflect real-world usefulness, and that we can put in place quickly. The rest layer in as the golden set and live usage data mature.

▸ Additional detail: regression mechanics, evals-as-code, the flywheel

Regression gate — mechanics

Threshold + delta gating — pass requires both an absolute floor (e.g. faithfulness ≥ target) and no-regression vs. baseline (no metric down more than a set delta). No string-equality assertions on stochastic output.
A/B pass-flip diff — the report shows only the examples that changed verdict between old and new, in both directions, so review is targeted at what moved.
Per-segment gating — metrics are sliced by doc type, ISO/RTO, and question type and gated per slice, so a regression in one segment can't hide behind a flat average.
Component + end-to-end — retrieval, generation and the full workflow are regression-tested independently (mirroring the 3.1 localization), so a failure points to a stage.
Judge-noise control — judge model and prompt are version-pinned; golden-set scores are averaged over runs; borderline flips go to a human, so noise isn't mistaken for regression.
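
The A/B pass-flip diff from the list above reduces to comparing two verdict maps and keeping only the examples that changed. A sketch, assuming verdicts are stored as `{example_id: passed}` (a hypothetical shape):

```python
def pass_flips(old: dict, new: dict) -> dict:
    """List examples whose verdict changed between two runs, in both directions."""
    shared = old.keys() & new.keys()
    return {
        "regressed": sorted(k for k in shared if old[k] and not new[k]),
        "fixed": sorted(k for k in shared if not old[k] and new[k]),
    }


flips = pass_flips(
    old={"q1": True, "q2": True, "q3": False},
    new={"q1": True, "q2": False, "q3": True},
)
```

A reviewer then reads only `flips["regressed"]` and `flips["fixed"]`, not the examples whose verdict did not move.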

Evals as code, and the flywheel

Version-controlled — the golden set and metric definitions live in the repo and change via review, evolving with the same rigor as the system.
Triggered, not just scheduled — the relevant suite fires on the event that matters (prompt edit, model bump, index change, tool change), tied to the change type.
Self-reinforcing — production failures caught by online monitoring (2.3) are promoted into the golden set (3.2), so the regression suite keeps getting harder where the system is weakest.
In-VPC tooling — Promptfoo / Langfuse / Phoenix, self-hosted, so eval data never leaves Viridon's environment.

04 — Self-learning & institutional memory

Feedback loop

Diligence question 4.1

How does the system learn from use over time, and what tools or techniques support this? Specifically: what signal is captured (accept/reject, edits to drafts, explicit corrections); is learning applied live/in-session or through a batch "reflection" process (e.g. nightly), and why that cadence; and does this same feedback feed into how you evaluate the system?

The system learns primarily in the knowledge layer, not in the model weights. When Erin edits or rejects a proposed change, we capture the before→after diff and its context, generate a piece of reusable advice from it ("for CAISO deliverability sections, lead with the quantified outcome"), and index that advice so future retrievals surface it. It's the same machinery as the selection-report advice module (KL·E), and the result is inspectable and correctable — you can read, edit, or delete what the system has "learned" (Section 4.3). And it isn't only advice that improves: the same feedback updates other knowledge-layer components in Figure 1 — the entity & concept map / knowledge graph (KL·G), the "what changes" patterns (KL·D), and the glossary (KL·L) — so a correction can fix an entity or a relationship in the graph, not just a piece of guidance.

On cadence, it's both — split by mechanism. Anything that's just retrieval-time context applies live: a correction becomes an indexed advice entry the very next retrieval can pull, with nothing retrained, and the in-session working memory (2.1) already adapts within a task. Anything that involves synthesis or judgment is deferred to a batch / nightly "reflection": clustering many edits into one durable advice entry, resolving contradictions when SMEs disagree, promoting a pattern only once it's been seen several times, and re-ranking which advice is trusted. Why that cadence: a single edit is noisy, and you don't want one idiosyncratic correction to immediately reshape behavior for everyone — the batch step is deliberate noise control, and it's where the degradation guardrail lives (4.5).
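
The batch "promote only once it's been seen several times" step can be sketched as grouping edit-derived lessons and applying a minimum-support rule. This assumes each edit has already been reduced to a normalized `lesson` string by an upstream step not shown here; the field names and the default of 3 are hypothetical, and a real consolidation would cluster similar lessons instead of requiring exact matches.

```python
from collections import defaultdict


def consolidate(edits, min_support=3):
    """Group edit-derived lessons; promote only those seen often enough.

    edits: list of dicts with 'edit_id', 'scope' and 'lesson' keys.
    Returns (promoted, pending) lists of candidate advice entries.
    """
    groups = defaultdict(list)
    for edit in edits:
        groups[(edit["scope"], edit["lesson"])].append(edit["edit_id"])

    promoted, pending = [], []
    for (scope, lesson), edit_ids in groups.items():
        entry = {"scope": scope, "statement": lesson,
                 "provenance": edit_ids, "seen": len(edit_ids)}
        (promoted if len(edit_ids) >= min_support else pending).append(entry)
    return promoted, pending
```

Entries left in `pending` are not discarded; they wait for more supporting edits, which is how a single idiosyncratic correction is kept from reshaping behavior for everyone.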

The signal we capture is richer than accept/reject. The most valuable is the edit diff itself — the corrected text tells us not just that a suggestion was wrong but how, and it doubles as a free gold label. Around it: explicit accept / reject / thumbs, behavioral signals (used, ignored, asked a follow-up, re-ran the search), and explicit corrections ("this source is wrong", "this advice doesn't apply here") — the highest-value, lowest-volume signal.

On reinforcement learning, we use the framing deliberately, not the heavy machinery. We treat acceptance as a reward signal and optimize the policy that decides which advice and which retrieval configuration to surface — not the model weights. The genuine reinforcement-style mechanisms on our roadmap are (1) a system that automatically learns which advice and which result-ranking to surface — it tries different options, watches which ones lead Erin to accept the suggestion, and shifts toward the ones that work, essentially self-tuning A/B testing that runs continuously, fully in-VPC, with no model weights touched — and (2) automatic tuning of the prompts and examples against the eval metrics, so they're optimized by measurement rather than by hand. We explicitly do not fine-tune model weights: it would break the inspectability we're selling, complicate the in-VPC deployment, make evals harder, and is the wrong investment at this corpus size. What we do is accumulate the accept/reject preference pairs as an asset — the dataset that would make fine-tuning possible later, to be spent only if the eval gain ever warrants it.
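
One simple way to realize the "try options, watch acceptance, shift toward what works" idea is an epsilon-greedy bandit. This is a sketch of that general technique under our own assumptions, not a description of the roadmap implementation; the class name, arm names, and epsilon value are hypothetical.

```python
import random


class AcceptanceBandit:
    """Epsilon-greedy choice among advice / ranking configurations ("arms")."""

    def __init__(self, arms, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.shown = {arm: 0 for arm in arms}
        self.accepted = {arm: 0 for arm in arms}

    def rate(self, arm):
        """Observed acceptance rate for one arm (0.0 if never shown)."""
        return self.accepted[arm] / self.shown[arm] if self.shown[arm] else 0.0

    def choose(self):
        untried = [arm for arm in self.shown if self.shown[arm] == 0]
        if untried:
            return untried[0]                          # try every arm once
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.shown))   # explore
        return max(self.shown, key=self.rate)          # exploit the best so far

    def record(self, arm, was_accepted: bool):
        self.shown[arm] += 1
        self.accepted[arm] += int(was_accepted)
```

Only the counters change as feedback arrives, so the learned state stays a small, readable table of acceptance rates and no model weights are involved.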

And yes — the same signal feeds evaluation. Every accept / edit / reject is simultaneously a learning signal and an eval label (the human-edit-distance metric in 3.1, the implicit labels in 3.2). That coupling is also the safeguard — a new or updated advice entry is promoted only if it does not regress the golden set, so the feedback loop and the eval loop are the same flywheel: usage → advice → eval-gated promotion → better retrieval → more usage. High-impact advice can require human approval before it goes live, so what the system learns stays governed.

Use · accept / edit / reject → Capture · diff + context → Live · index advice → Batch · consolidate + eval-gate → Better retrieval ↻

Live advice is available to the very next retrieval; batch promotion is eval-gated, so errors aren't reinforced. The feedback loop and the eval loop are one flywheel.

▸ Additional detail: signal taxonomy, live vs. batch, the RL spectrum

Signal captured

Signal	What it tells us	Type
Edit diff (before → after)	How a suggestion was wrong; the corrected text is a free gold label	Implicit · richest
Accept / reject / thumbs	Coarse good / bad on a proposed change	Explicit
Behavioral	Used, ignored, asked a follow-up, re-ran the search	Implicit
Explicit correction	"This source is wrong" · "this advice doesn't apply here"	Explicit · highest-value

Live vs. batch — split by mechanism

Mechanism	Cadence	Why
Index a correction as advice	Live (in-session)	Retrieval-time context, no retraining — next retrieval can use it
Same-session adaptation (working memory)	Live	Within-task only; resets per run (2.1)
Consolidate many edits into durable advice	Batch (nightly)	Dedup + synthesis; one edit is noisy
Resolve contradictions / promote after N	Batch	Judgment + noise control
Re-rank which advice is trusted	Batch	Needs aggregate signal
Promote a new/updated entry	Batch	Eval-gated — must not regress the golden set (4.5)

The reinforcement-learning spectrum — what we do and don't

Approach	Stance
Knowledge-layer learning (edit → advice → index)	Core · live: the primary mechanism; inspectable and correctable
Acceptance as a reward signal over the retrieval / advice policy	Yes: reinforcement-style, no weight changes
Auto-learn which advice / ranking to surface	Roadmap: tries options and favors the ones that get accepted (continuous, self-tuning A/B testing); in-VPC, no weights touched
Auto-tune the prompts and examples vs. eval metrics	Roadmap: optimized by measurement instead of by hand
Accumulate accept/reject preference pairs	Yes: banked as an asset; spent only if evals justify
Fine-tune model weights (RLHF)	Not planned: breaks inspectability + in-VPC simplicity; wrong investment at this scale

What changes when it learns

Diligence question 4.2

When the system "learns," what concretely changes — retrieval ranking, prompts, a memory/concept store, model weights, or something else?

In one line: what changes is data and configuration, not the model. The core mechanism is the concept / advice store — learning adds, updates, and re-ranks indexed advice entries and the entity & concept map (the knowledge graph), along with the other knowledge-layer modules in Figure 1, which together are the institutional memory we cover in 4.3. Everything else follows from that: because new advice is indexed, retrieval surfaces different, better-grounded context, and on the roadmap the ranking itself self-tunes toward what gets accepted. Prompts and examples are auto-tuned against the eval metrics (roadmap), not edited per interaction. Session memory adapts live within a task and resets per run (2.1). And the eval golden set itself grows as production failures are promoted into it — so the system's measurement improves alongside its behavior. Model weights do not change, by design.

Candidate	Changes?	What changes
Memory / concept store	Yes — primary	Advice entries added / updated / re-ranked — the institutional memory (4.3)
Concept / entity graph (KL·G)	Yes	New or corrected entities & links — projects, customers, ISO/RTOs, terms (Figure 1)
Retrieval ranking	Yes	New advice changes what's surfaced; ranking self-tunes by acceptance (roadmap)
Prompts / examples	Yes (roadmap)	Auto-tuned against eval metrics — not edited per interaction
Session / working memory	Yes — live	Within-task adaptation; resets per run (2.1)
Eval golden set	Yes	Grows as production failures are promoted (3.2 / 4.5)
Model weights	No	Unchanged by design — inspectability, in-VPC, eval simplicity

Concept layer

Diligence question 4.3

What accumulates as institutional memory, and in what form? Specifically: is it human-readable, auditable, and correctable — can we inspect and fix what the system "knows"?

What accumulates is structured, human-readable knowledge — not opaque vectors or model weights. The institutional memory is a set of concept and advice entries in the knowledge layer: the advice mined from edits and selection reports (KL·E), the entity and concept map of customers, projects, ISO/RTOs and terms and how they link (KL·G), the "what changes" patterns (KL·D), and the onboarding glossary (KL·L). Each is a record you can read in plain language, not a number in a tensor.

In form, every entry is a structured record: a plain-language statement of the knowledge, the scope it applies to, its provenance (which edits and sources produced it), confidence and usage stats, a status, and a version history. The anatomy is below — it reads like a note with an audit trail.

To the heart of your question — yes, all of it is human-readable, auditable, and correctable. Everything the system learns is exposed to you in plain language and is fully editable: you can inspect any entry and trace it back to the edits and sources that produced it (the provenance from 2.3), correct or rewrite it, and disable, delete, or pin it. Because learning lives in the concept store and not in the weights (4.2), the entire learned state is open to inspection and repair — there is no hidden knowledge baked into a model you can't read. A curation view makes this a first-class surface (4.4).

Auditability comes from the same structure: each entry carries provenance (what created it, when), version history (what changed), and usage stats (how often it's retrieved and applied, and how often accepted) — so you can audit not just what it knows but why it knows it and whether it's actually being used. A correction simply becomes the new entry (eval-gated before it's trusted, per 4.5). And because the whole memory is a readable, ownable artifact rather than a black box, it transfers with the company — the same owned-asset principle that runs through the architecture.

Anatomy of a concept entry

Field	What it holds	Illustrative
Statement	The knowledge, in plain language	"For CAISO deliverability sections, lead with the quantified outcome"
Scope	Where it applies	ISO_RTO = CAISO · proposal §deliverability
Provenance	What produced it	Derived from 3 accepted edits + selection report p.142
Confidence / usage	How trusted, how often used	Seen 7× · applied 23× · 91% accepted
Priority	Optional manual weight — overrides auto-confidence to set precedence	Normal · High · Critical (e.g. pin a must-follow rule)
Status	Active / disabled / pinned	Active
Version history	What changed & when	v2 — broadened from "Sunrise" to all CAISO (Apr 2026)
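
To show how the fields in the table could map onto a record, here is a sketch of a concept entry as a Python dataclass. The class, field names, and the `rewrite` helper are illustrative assumptions, not the actual storage schema; the example values come from the "Illustrative" column, except the accepted count of 21, which is chosen to match the 91% figure.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptEntry:
    statement: str                 # the knowledge, in plain language
    scope: dict                    # where it applies
    provenance: list               # edits and sources that produced it
    seen: int = 0                  # times the pattern was observed
    applied: int = 0               # times it was surfaced and used
    accepted: int = 0              # times the resulting suggestion was accepted
    priority: str = "Normal"       # Normal | High | Critical
    status: str = "Active"         # Active | Disabled | Pinned
    version: int = 1
    history: list = field(default_factory=list)

    @property
    def acceptance_rate(self) -> float:
        return self.accepted / self.applied if self.applied else 0.0

    def rewrite(self, new_statement: str, note: str) -> None:
        """Correct the entry while keeping the previous wording in the audit trail."""
        self.history.append(f"v{self.version}: {self.statement} ({note})")
        self.statement = new_statement
        self.version += 1


entry = ConceptEntry(
    statement="For CAISO deliverability sections, lead with the quantified outcome",
    scope={"ISO_RTO": "CAISO", "section": "deliverability"},
    provenance=["3 accepted edits", "selection report p.142"],
    seen=7, applied=23, accepted=21,
)
```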

A fully populated example entry and the end-to-end learning loop are provided as the Section 4 artifact.

▸ Additional detail: what accumulates, and what you can do to any entry

What accumulates as institutional memory

Memory	Form	Source
Advice entries (KL·E)	Plain-language guidance with scope & provenance	Mined from accepted / rejected edits + selection reports
Concept & entity map (KL·G)	Entities (customers, projects, ISO/RTOs, terms) and the links between them	Ingestion, usage + corrections
"What changes" patterns (KL·D)	Which parts of a proposal historically need changing	Cross-proposal history + accepted edits
Onboarding glossary (KL·L)	Company context, terminology, how concepts connect	Corpus + curation

What you can do to any entry

Inspect — read the entry and its provenance; trace it to the edits and sources behind it.
Correct / rewrite — fix the statement or narrow / broaden its scope; the correction becomes the new entry.
Disable / delete — turn off or remove anything that's wrong or no longer applies.
Prioritize / pin — set an entry's priority (Normal / High / Critical) or pin it as authoritative so it's preferred in retrieval.
Govern — high-impact entries can require approval before they go live (4.1), and changes are eval-gated before they're trusted (4.5).

All of this runs through a curation surface — the upkeep model is Section 4.4.

Upkeep

Diligence question 4.4

How much of this is automated versus requiring human curation, and by whom?

Upkeep is almost entirely automated. The loop in 4.1 does the work with no person in it — generating advice and entity/graph updates from edits, indexing, consolidating and de-duplicating them, scoring confidence, promoting a pattern only once it's been seen enough times, and re-ranking which knowledge is trusted. Humans do not author or maintain the memory by hand.

The human role is oversight by exception. The one thing worth watching for is an incorrect deduction — the system over-generalizing a one-off edit into a rule that shouldn't apply broadly. When that happens, a reviewer corrects, narrows, disables, or deletes the entry (or adds one directly to teach something) through the curation surface (4.3). It's review-and-fix when something looks off, not continuous curation.

By whom, and how little: the curation is done by an SME or power user (e.g. Erin) for the proposal domain, with BetterBrain monitoring the memory's overall health and tuning the loop. The burden stays low because the guardrails catch most bad deductions before they ever reach a human — confidence thresholds, promote-after-N, eval-gating (4.5), and approval on high-impact entries (4.1) — so what reaches manual review is the genuine exceptions. This is consistent with the eval upkeep budget in 3.3: bounded and ad-hoc, well below a standing commitment.

Task	Owner
Generate advice + entity/graph updates from edits & corrections	Automated
Index, consolidate, dedup, re-rank; score confidence; promote-after-N	Automated
Eval-gate changes before they're trusted (4.5)	Automated
Surface low-confidence / contested entries for review	Automated → flags to humans
Review flagged entries; correct / narrow over-general deductions	Human — SME / power user
Add, edit, disable, delete, prioritize entries as needed	Human — SME / power user
Monitor memory health; tune the loop	BetterBrain

Preventing degradation

Diligence question 4.5

How do you prevent a feedback loop from reinforcing errors or degrading the system over time?

The core risk in any feedback loop is an echo chamber: the system learns from its own outputs and reinforces its own mistakes until errors quietly become "truth." We prevent that with defense-in-depth — multiple independent safeguards, so that if one misses a problem the next one catches it — and one structural choice does most of the work.

That choice: learning is measured against an independent anchor, not against the model's own recent behavior. A new or updated entry is promoted only if it doesn't regress the golden set (3.3), and the golden set is anchored in external truth — selection reports plus a human-authored slice (3.2) — not in whatever the model did lately. So the loop can't drift to merely agree with itself.

On top of that, noise control and negative signal: we consolidate in batch and promote only after a pattern recurs (promote-after-N), so one idiosyncratic edit can't reshape behavior, and contradictions are resolved rather than stacked (4.1). And we learn from rejections and heavy edits, not just acceptances — so the loop isn't one-sidedly reinforcing what it already does.

Then catch and contain: per-segment regression gating and online monitoring (2.3, 3.3) catch degradation even when the aggregate looks fine. And because the memory is data, not model weights (4.1), with full provenance and version history (4.3), a bad deduction is traceable, reversible, and deletable — the blast radius is bounded and nothing compounds silently. High-impact changes can be rolled out gradually — applied to a small slice first (a "canary") and widened only if it holds — before they're fully trusted.
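
A canary rollout of a high-impact entry can be sketched as deterministic bucketing: hash a stable key (here a project id) together with the entry id, and expose the entry only to a fixed percentage of buckets. The function and identifier names are hypothetical.

```python
import hashlib


def in_canary(project_id: str, entry_id: str, percent: int) -> bool:
    """Deterministically decide whether a project sees a canaried entry."""
    digest = hashlib.sha256(f"{entry_id}:{project_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < percent
```

Because the bucket depends only on the two ids, the same project sees a consistent result across sessions, and widening the rollout is just a matter of raising `percent`.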

Finally, knowledge also degrades by going stale — a sponsor's criteria shift, an ISO/RTO rule changes. Entries carry recency and versioning, older ones decay or get re-validated, and periodic re-evaluation keeps the memory current. Taken together, errors can't quietly become truth, drift is caught against an external reference, and the whole memory stays inspectable and reversible.
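One way to picture decay and re-validation is a half-life on confidence. The text above does not specify the decay rule, so the exponential form, the 180-day half-life, the 0.5 threshold, and the function names are all assumptions.

```python
from datetime import date, timedelta


def decayed_confidence(
    confidence: float,
    last_validated: date,
    today: date,
    half_life_days: float = 180.0,  # assumed half-life
) -> float:
    """Halve confidence for every half_life_days since the last validation."""
    age_days = max((today - last_validated).days, 0)
    return confidence * 0.5 ** (age_days / half_life_days)


def needs_revalidation(
    confidence: float,
    last_validated: date,
    today: date,
    threshold: float = 0.5,  # assumed cut-off for queuing a re-check
) -> bool:
    """Flag an entry once its decayed confidence drops below the threshold."""
    return decayed_confidence(confidence, last_validated, today) < threshold


validated = date(2026, 1, 1)
fresh = decayed_confidence(0.8, validated, validated)
aged = decayed_confidence(0.8, validated, validated + timedelta(days=180))
stale = needs_revalidation(0.8, validated, validated + timedelta(days=180))
```

Re-validating an entry would simply reset `last_validated`, which restores its full confidence without changing its content.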

Failure mode	Safeguard
Errors reinforced as "truth"	Promotion eval-gated against an independent golden set; promote-after-N + confidence thresholds (3.3, 4.1)
Echo chamber (learns only from its own outputs)	Anchored to external truth — selection reports + human-authored slice (3.2); rejections & corrections weighted, not just acceptances
One noisy edit reshapes behavior	Batch consolidation; promote-after-N; contradictions resolved, not stacked (4.1)
Silent / per-segment degradation	Per-segment regression gating + online monitoring (3.3, 2.3)
A bad entry compounds invisibly	Full provenance + version history; inspect / disable / delete (4.3) — bounded blast radius
Degradation baked in irreversibly	Learning is data, not weights (4.1) — reversible and roll-back-able
A bad change ships widely	CI eval-gate + gradual / canary rollout before full trust
Stale knowledge (criteria change)	Recency / versioning; decay or re-validation of old entries; periodic re-eval

Artifact

The learning loop & a memory entry

What is stored, where it lives, and on what cadence it updates — followed by a fully populated example of a single concept/memory entry.

1 · Use

Erin works

Accepts, edits, or rejects a proposed change in the editor

→

2 · Capture

Signal logged

The edit diff + context is recorded — the richest signal

accept · edit diff · reject · explicit correction

→

3 · Update memory

Live path: index the correction as advice (KL·E). Usable by the very next retrieval: minutes, not hours.

┊ both feed ┊

Nightly path: consolidate · dedup · promote-after-N · re-rank trust (KL·E · KL·G · KL·D · KL·L)

Eval-gate · must not regress golden set

Low-confidence / contested → flagged for SME review (correct · disable · pin)

→

4 · Propose

Better-grounded suggestion

Improved memory surfaces in the next comment or draft — assistant proposes, Erin disposes

External anchor: golden set grounded in selection reports + human-authored slice (3.2) — the loop measures against independent truth, not its own recent outputs

Rejections & heavy edits weighted, not just acceptances · production failures grow the golden set · feedback loop and eval loop are one flywheel

What's stored, where, and on what cadence

What's stored	Where (Figure 1)	Cadence
Raw signal — accept / edit / reject + diff + context	Signal log	Live, on every action
Advice entries (guidance from edits + selection reports)	KL·E advice store	Live to index · nightly to consolidate
Concept & entity links	KL·G knowledge graph	Nightly (+ live for direct corrections)
"What changes" patterns	KL·D	Nightly
Glossary / terms	KL·L	Nightly / on curation
Trust & ranking of advice	KL·E + reranker (KL·H)	Nightly, eval-gated
Preference pairs (accept / reject)	Eval + training-data bank	Live append; spent only if evals justify
Eval golden set	Eval store	Grows as production failures are promoted

Example — a single concept / memory entry

↳ Produced by step 3 (nightly consolidation of 3 edits + selection report p.142), eval-gated to v2 — this is what the loop outputs into KL·E

ADV-0427 · "For CAISO deliverability sections, lead with the quantified deliverability outcome before the methodology." · Priority: High · Active

Scope

ISO_RTO = CAISO · doc_type = proposal · section = Deliverability

Confidence / usage

Seen 7× · applied 23× · 91% accepted

Provenance

Derived from 3 accepted edits (Sunrise, Aspen, Vega bids) + selection report p.142 — "evaluators rewarded proposals that quantified the deliverability outcome upfront."

Priority

High — set manually by Erin; overrides auto-confidence so it's always preferred

Status

Active · v2 · updated Apr 2026

Version history

v1 (Feb 2026) — scoped to the Sunrise bid only · v2 (Apr 2026) — broadened to all CAISO after the pattern recurred across 3 bids (nightly consolidation, eval-gated)

Actions: Edit · Narrow scope · Disable · Delete · View provenance

Every field is editable: this is what the system "knows", in full.

This is the institutional memory in concrete form: human-readable, traceable to the edits and sources that produced it, prioritizable, and correctable — and it lives in your platform as an owned asset, not a black box.
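As a rough illustration of an entry stored as plain data, the example above could be held in a record like the one below. The class, field names, and ranking rule are hypothetical, and `accepted=21` is inferred from "91% accepted" of 23 applications; the example itself gives only the percentage.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MemoryEntry:
    entry_id: str
    advice: str
    scope: Dict[str, str]
    seen: int
    applied: int
    accepted: int
    provenance: List[str]
    manual_priority: Optional[str] = None  # set by a human, e.g. "High"
    status: str = "Active"
    version: int = 1

    @property
    def acceptance_rate(self) -> float:
        return self.accepted / self.applied if self.applied else 0.0

    def matches(self, context: Dict[str, str]) -> bool:
        """An entry applies only when every scope field matches the context."""
        return all(context.get(key) == value for key, value in self.scope.items())


def rank_key(entry: MemoryEntry):
    # Assumed rule: a manual "High" priority outranks any automatic confidence.
    return (entry.manual_priority == "High", entry.acceptance_rate)


adv_0427 = MemoryEntry(
    entry_id="ADV-0427",
    advice="For CAISO deliverability sections, lead with the quantified "
           "deliverability outcome before the methodology.",
    scope={"ISO_RTO": "CAISO", "doc_type": "proposal", "section": "Deliverability"},
    seen=7,
    applied=23,
    accepted=21,  # inferred from 91% of 23, not stated in the example
    provenance=["Sunrise bid edit", "Aspen bid edit", "Vega bid edit",
                "selection report p.142"],
    manual_priority="High",
    version=2,
)
```

Because the entry is an ordinary record, narrowing its scope, disabling it, or rolling it back to an earlier version is a data edit rather than a model change.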