Wvlet Architecture: Compiler Internals
This document is a single‑page, high‑level walkthrough of Wvlet’s compiler and internal representation. It is intentionally practical and links to concrete code locations so new contributors can orient quickly. Wvlet is still evolving; details here may change as phases are added or rearranged.
Audience: contributors building or debugging the language, compiler phases, and codegen.
What this is not: a user tutorial or syntax reference (see the main docs for those).
Quick Map
At a glance, a compilation unit flows through these stages:
- Parse source → LogicalPlan + Expressions
- Label symbols (unique IDs) for definitions and queries
- Resolve types (DataType and RelationType) and rewrite unresolved nodes
- Optimize/normalize via tree rewrites (ongoing)
- Plan execution (tests/debug/save/query tasks)
- Generate text (pretty‑print Wvlet; SQL dialects in future work)
Key packages and files (links go to GitHub, labels show file names only):
- Parsing: WvletParser.scala, SqlParser.scala
- Tokens: WvletToken.scala, SqlToken.scala
- Scanners: WvletScanner.scala, SqlScanner.scala, Scanner.scala
- Trees (expr): Expression.scala, exprs.scala, Attribute.scala
- Trees (plan): LogicalPlan.scala, relation.scala, plan.scala, ddl.scala, sqlPlan.scala, execution.scala
- Types: DataType.scala, TreeNode.scala
- Symbols: Symbol.scala, SymbolInfo.scala, SymbolLabeler.scala
- Type resolution: TypeResolver.scala
- Rewriting infra: RewriteRule.scala
- Execution planning: ExecutionPlanner.scala
- Pretty printing: WvletGenerator.scala, LogicalPlanPrinter.scala
Useful commands:
- Compile: `./sbt compile`
- Tests: `./sbt test` (or per‑module, e.g., `./sbt langJVM/test`)
- Format: `./sbt scalafmtAll`
CompilationUnit
The compiler processes one source file per CompilationUnit and records all intermediate artifacts for that file.
- Type: CompilationUnit.scala
- Holds:
  - `unresolvedPlan`: parsed tree before symbol/type resolution.
  - `resolvedPlan`: fully typed `LogicalPlan` after resolver passes.
  - `executionPlan`: tasks produced by `ExecutionPlanner`.
  - `modelDependencies`: DAG of model‑to‑model references.
  - `knownSymbols`: symbols discovered/entered during analysis.
  - `lastError`, `finishedPhases`, `lastCompiledAt` (book‑keeping for incremental builds).
- Source: `sourceFile` (tracks file path, content, timestamps) and helpers like `text(span)` and `toSourceLocation` for diagnostics.
- Lifecycle:
  - Created from files or strings via `CompilationUnit.fromPath|fromFile|fromWvletString|fromSqlString`.
  - Standard library and presets load as units with `isPreset = true`.
  - `reload()` clears phase markers and errors when the file changes; `needsRecompile` compares timestamps.
- Context integration: Every `Context` carries a `compilationUnit`; phases log and look up symbols using it (e.g., `ctx.compilationUnit.enter(sym)`, error locations via `sourceLocationAt(span)`).
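For orientation, here is a deliberately simplified sketch of the bookkeeping described above. It is not the real definition in CompilationUnit.scala (which holds actual `LogicalPlan`/`ExecutionPlan` trees, known symbols, and span helpers); the placeholder fields and the `CompilationUnitSketch` name are illustrative only.

```scala
import java.time.Instant

final case class SourceFile(path: String, content: String, lastUpdatedAt: Instant)

// Illustrative only: the real CompilationUnit holds plan trees, symbols, and span helpers.
final class CompilationUnitSketch(val sourceFile: SourceFile, val isPreset: Boolean = false) {
  // Phase outputs accumulate on the unit as compilation proceeds
  var unresolvedPlan: Option[String] = None // placeholder for the parsed LogicalPlan
  var resolvedPlan: Option[String]   = None // placeholder for the typed LogicalPlan
  var executionPlan: Option[String]  = None // placeholder for the ExecutionPlan DAG
  var lastError: Option[Throwable]   = None
  var finishedPhases: Set[String]    = Set.empty // book-keeping for incremental builds
  var lastCompiledAt: Instant        = Instant.EPOCH

  // Recompile when the source was modified after the last successful compile
  def needsRecompile: Boolean = sourceFile.lastUpdatedAt.isAfter(lastCompiledAt)

  // Clear phase markers and errors when the file changes
  def reload(): Unit = {
    finishedPhases = Set.empty
    lastError = None
  }
}
```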
Trees: LogicalPlan and Expressions
The compiler uses two primary node families:
- Expression (values, names, operators): see Expression.scala. Each expression exposes `def dataType: DataType`, often computed structurally from its children.
- LogicalPlan (relations/statements): see LogicalPlan.scala. Every plan exposes `def relationType: RelationType` (a subtype of `DataType`) describing its output schema.
Common patterns:
- Immutability: Trees are immutable products; transformation utilities (`transform`, `transformUp`, `transformExpressions`, etc.) rebuild nodes when children change.
- Metadata preservation: `SyntaxTreeNode.copyInstance` preserves metadata (e.g., attached symbol) on rewritten copies.
- Schema computation: Plan nodes compute `relationType` functionally from inputs. Leaf/source nodes may carry a `schema: RelationType` (e.g., scans/files/values) used to compute `relationType`.
Terminology:
- DataType.scala defines scalar/compound types (e.g., int, varchar(10), array[t]).
- RelationType extends `DataType` to represent table‑like schemas. Variants include `SchemaType`, `ProjectedType`, `AggregationType`, `ConcatType`, `RelationTypeList`, `UnresolvedRelationType`, and `EmptyRelationType` (all in DataType.scala).
- `NamedType(name, dataType)` pairs a field name with a `DataType` and is used to enumerate columns (see DataType.scala).
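The relationship between these types can be pictured with a much‑reduced sketch. The real hierarchy in DataType.scala has many more variants and parameters (`varchar(n)`, optional and generic types, etc.), so treat the names and shapes below as illustrative only.

```scala
// Reduced sketch of the type hierarchy; not the real DataType.scala definitions.
sealed trait DataType { def typeName: String }

case object IntType extends DataType { def typeName = "int" }
case object StringType extends DataType { def typeName = "string" }
final case class ArrayType(elem: DataType) extends DataType {
  def typeName = s"array[${elem.typeName}]"
}

// A (name, type) pair used to enumerate columns
final case class NamedType(name: String, dataType: DataType) extends DataType {
  def typeName = s"$name:${dataType.typeName}"
}

// RelationType is itself a DataType describing a table-like schema
sealed trait RelationType extends DataType { def fields: Seq[NamedType] }

final case class SchemaType(name: String, fields: Seq[NamedType]) extends RelationType {
  def typeName = name
}
case object EmptyRelationType extends RelationType {
  def fields   = Nil
  def typeName = "empty"
}
```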
Symbols and SymbolInfo
Symbols provide stable identifiers for definitions and queries across phases.
- Symbol: unique ID + back‑pointer to the owning tree (set during labeling and updated on copies). See Symbol.scala.
- SymbolInfo: phase‑mutable info attached to a symbol (see SymbolInfo.scala):
  - `symbolType`: `Package | Import | ModelDef | TypeDef | MethodDef | ValDef | Relation | Query | Expression`
  - `tpe`/`dataType`: the resolved `Type` (`DataType` for most symbols)
  - scope/owner: used for name resolution
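A minimal sketch of this symbol/denotation split is shown below: a stable identifier that survives tree rewrites, plus per‑phase mutable info. The real Symbol.scala and SymbolInfo.scala carry owners, scopes, and richer symbol kinds; the trimmed classes here are illustrative only.

```scala
// Illustrative sketch, not the real Symbol/SymbolInfo definitions.
final class Symbol(val id: Int) {
  var symbolInfo: SymbolInfo = NoSymbolInfo
  var tree: Option[AnyRef]   = None // back-pointer to the owning tree node
}

// Phase-mutable denotation attached to a symbol
sealed abstract class SymbolInfo(val name: String) {
  var dataType: String = "?" // placeholder for the resolved DataType
}
case object NoSymbolInfo extends SymbolInfo("<none>")
final class ModelSymbolInfo(name: String) extends SymbolInfo(name)
final class ValSymbolInfo(name: String) extends SymbolInfo(name)
```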
Where symbols appear today
- Definitions: PackageDef, Import, TypeDef, TopLevelFunctionDef, ModelDef, ValDef, and the top‑level Query get symbols in SymbolLabeler.scala.
- Other nodes may not have symbols yet; they rely on structural `relationType`/`dataType` until we broaden symbol coverage.
Invariants to keep in mind
- Trees are immutable; “type updates” happen by mutating `SymbolInfo` or by emitting a rewritten tree with new fields (e.g., a `FileScan` with a concrete `schema`).
- When a node has a symbol and also a stable schema (e.g., `ModelDef`, `TableScan`, `Values`), the symbol’s `dataType` should match the node’s `relationType`.
Compilation Pipeline (Phases)
1) Parsing → Trees
WvletParser (for `.wv`) and SqlParser (for `.sql`) turn source into `LogicalPlan`/`Expression` trees. Both parsers are driven by scanners that produce a stream of token records.
Lexer/Scanner basics
- Token enums: WvletToken and SqlToken classify keywords, identifiers, operators, literals, quotes, and control tokens (comments, whitespace, EOF). They also encode reserved vs non‑reserved keywords.
- Scanners: WvletScanner and SqlScanner extend a shared ScannerBase, which yields `TokenData` via `nextToken()`; parsers inspect tokens with `lookAhead()` and `consume(...)`.
- Comments/doc: scanners surface `COMMENT` and `DOC_COMMENT` tokens; the parser attaches them to nodes so printers/SQL can preserve them.
- Strings/interpolation: `WvletScanner` handles triple‑quoted and back‑quoted interpolation (`STRING_INTERPOLATION_PREFIX`, `BACKQUOTE_INTERPOLATION_PREFIX`) and emits `STRING_PART` tokens; `SqlScanner` recognizes double‑quoted identifiers and string literals.
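To make the lookAhead/consume interplay concrete, here is a self‑contained sketch of the pattern using hypothetical, simplified token types; the real parsers work on the richer `TokenData` records produced by `WvletScanner`/`SqlScanner`.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical mini token set; the real WvletToken/SqlToken enums are much larger.
enum Tok { case SELECT, IDENTIFIER, COMMA, EOF }
final case class TokenData(token: Tok, text: String)

final class MiniScanner(tokens: List[TokenData]) {
  private var rest = tokens
  // Peek at the next token without consuming it
  def lookAhead(): TokenData = rest.headOption.getOrElse(TokenData(Tok.EOF, ""))
  // Consume and return the next token
  def nextToken(): TokenData = { val t = lookAhead(); rest = rest.drop(1); t }
}

final class MiniParser(scanner: MiniScanner) {
  private def consume(expected: Tok): TokenData = {
    val t = scanner.nextToken()
    require(t.token == expected, s"expected $expected but found ${t.token}")
    t
  }

  // select <id> (, <id>)*  ->  list of selected column names
  def parseSelectItems(): List[String] = {
    consume(Tok.SELECT)
    val items = ListBuffer(consume(Tok.IDENTIFIER).text)
    while (scanner.lookAhead().token == Tok.COMMA) {
      consume(Tok.COMMA)
      items += consume(Tok.IDENTIFIER).text
    }
    items.toList
  }
}
```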
Design philosophy
- Unlike the textbook split of “lexer → parser → AST” (e.g., the Dragon Book) where comments are discarded, Wvlet’s parser and tree model are designed to also serve as a syntax formatter.
- Trees preserve source information for formatting and diagnostics:
  - Comments are retained on every `SyntaxTreeNode` (see TreeNode.scala), then propagated in printers (e.g., `code(n)(d)` in SqlGenerator.scala and LogicalPlanPrinter.scala).
  - Every node carries a precise `Span` and maps to a `SourceLocation` via `Context.sourceLocationAt`, used in error messages and headers. See Context.scala and CompilationUnit.scala.
- The result is a “format‑aware” LogicalPlan/Expression tree that can regenerate readable Wvlet/SQL while preserving user comments where possible.
A few special cases:
- ValDef: the parser sets `dataType` from explicit annotations, table column syntax (`val t(a,b) = [[...]]`), or the expression’s type. This is later reflected into the symbol.
- Table/file references are parsed as TableRef/FileRef with `UnresolvedRelationType`.
2) Symbol Labeling
`analyzer/SymbolLabeler.scala` assigns symbols and seeds `SymbolInfo`:
- `ModelDef`: `ModelSymbolInfo` seeded with `givenRelationType.getOrElse(child.relationType)`.
- `ValDef`: `ValSymbolInfo` seeded from the parsed `v.dataType` and expression.
- Top‑level queries/relations get `QuerySymbol`/`RelationAliasSymbolInfo` as appropriate.
This phase also establishes scopes for packages, types, and methods so later passes can resolve names.
3) Type Resolution and Tree Rewrites
TypeResolver.scala resolves types and replaces unresolved leaves with concrete scans/values:
- TableRef → TableScan via catalog lookup or known type definitions.
- FileRef → FileScan by inspecting JSON/Parquet to obtain a `schema`.
- ModelScan binds to the referenced ModelDef’s `relationType`.
- ValDef table value constants: when referenced as a table, the resolver detects `SchemaType` on the `ValDef` and rewrites to a Values relation with that schema.
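The overall shape of such a rewrite is sketched below with hypothetical, simplified plan nodes and a catalog represented as a plain `Map`; the real rule lives in TypeResolver.scala and operates on full `LogicalPlan` trees and catalog metadata.

```scala
// Hypothetical mini plan nodes; illustrative only.
sealed trait Plan
final case class TableRef(name: String) extends Plan                        // unresolved leaf
final case class TableScan(name: String, schema: List[String]) extends Plan // resolved source
final case class Filter(child: Plan, predicate: String) extends Plan

// Rewrite unresolved TableRef leaves into TableScan nodes with a concrete schema
def resolveTableRefs(plan: Plan, catalog: Map[String, List[String]]): Plan =
  plan match {
    case TableRef(name) =>
      catalog.get(name) match {
        case Some(schema) => TableScan(name, schema) // rewrite to a concrete scan
        case None         => plan                    // leave unresolved; reported later
      }
    case Filter(child, pred) => Filter(resolveTableRefs(child, catalog), pred)
    case other               => other
  }

// resolveTableRefs(Filter(TableRef("users"), "age > 21"),
//                  Map("users" -> List("id", "name", "age")))
// => Filter(TableScan("users", List("id", "name", "age")), "age > 21")
```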
Where types live today
- Expressions compute `def dataType` on demand.
- Plans compute `def relationType`. Leaf/source plans also carry `schema: RelationType` in constructors.
- Definitions store type on their `SymbolInfo` (authoritative for `ModelDef`, seeded for `ValDef`).
Ongoing work: moving toward a symbol‑centric typing model (see “Typing Model” below and issue #1175).
4) Normalization / Optimizations (ongoing)
Tree rewrites are expressed via `RewriteRule` and applied with the `transform*` APIs on plans/exprs. Typical examples:
- Pruning debug/test nodes from execution paths
- Pushing projections/filters (future work)
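A pruning rewrite of the first kind can be sketched as follows, again with hypothetical mini plan nodes rather than the real `RewriteRule`/`transform*` APIs; the point is the bottom‑up rewrite shape.

```scala
// Hypothetical mini plan; illustrative only.
sealed trait Plan
final case class Scan(table: String) extends Plan
final case class Debug(child: Plan) extends Plan
final case class Project(child: Plan, cols: List[String]) extends Plan

// Bottom-up rewrite: transform children first, then apply the rule to the rebuilt node
def transformUp(plan: Plan)(rule: PartialFunction[Plan, Plan]): Plan = {
  val rebuilt = plan match {
    case Debug(c)         => Debug(transformUp(c)(rule))
    case Project(c, cols) => Project(transformUp(c)(rule), cols)
    case leaf             => leaf
  }
  rule.applyOrElse(rebuilt, identity[Plan])
}

// Rule: drop Debug wrappers from the evaluated query path
val pruneDebug: PartialFunction[Plan, Plan] = { case Debug(child) => child }

// transformUp(Project(Debug(Scan("t")), List("a")))(pruneDebug)
// => Project(Scan("t"), List("a"))
```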
5) Execution Planning
`ExecutionPlanner.scala` turns a `LogicalPlan` into an `ExecutionPlan` DAG used by runners:
- Removes `TestRelation`/`Debug` from evaluated queries while generating separate tasks for them.
- Emits tasks such as `ExecuteQuery`, `ExecuteSave`, `ExecuteValDef`.
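Conceptually, the result is a small task tree that runners walk. The enum below is a hypothetical, simplified rendering of that idea; the real tasks in execution.scala carry plan nodes and metadata rather than plain SQL strings.

```scala
// Hypothetical, simplified execution-task tree; illustrative only.
enum ExecutionTask {
  case ExecuteQuery(sql: String)                // run a SELECT
  case ExecuteSave(sql: String)                 // run generated CTAS/INSERT/COPY
  case ExecuteTasks(tasks: List[ExecutionTask]) // a sequence of sub-tasks
}

import ExecutionTask.*

// A runner walks the task tree and hands each leaf to the engine
def run(task: ExecutionTask, execSQL: String => Unit): Unit = task match {
  case ExecuteQuery(sql)   => execSQL(sql)
  case ExecuteSave(sql)    => execSQL(sql)
  case ExecuteTasks(tasks) => tasks.foreach(run(_, execSQL))
}
```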
6) Code Generation / Pretty Printing
`codegen/WvletGenerator.scala` and `LogicalPlanPrinter.scala` render plans/expressions back to Wvlet syntax for `explain` and logs. For SQL output and layout details, see:
- GenSQL: SQL Generation — end‑to‑end SQL statement generation and task handling.
- SQL Formatting (Wadler Doc) — how the pretty printer builds compact/expanded SQL.
Typing Model
Types appear in three places today:
- Expression nodes: expose `def dataType: DataType` (structural).
- Plan nodes: expose `def relationType: RelationType` (structural/derived); leaf sources carry a `schema`.
- Definitions’ symbols: `symbol.symbolInfo.dataType` (phase‑mutable denotation).
Short‑term plan (tracked in #1175):
- Authoritative definitions: Treat `SymbolInfo.dataType` as authoritative for definitions (`ModelDef`, `ValDef`).
- Seed source relations: For leaf sources (e.g., `TableScan`, `FileScan`, `Values`), keep `schema` on the node, and seed `symbol.dataType` from it so downstream code can read a uniform view via symbols.
- Functional operators: Keep intermediate operators purely functional: compute `relationType` from children; do not introduce new type fields.
Traversal and Transformation
Common utilities you’ll use while writing rules
- `LogicalPlan.transform`/`transformUp`/`transformOnce` for plan rewrites
- `LogicalPlan.transformExpressions` to operate on embedded expressions
- `Expression.transformExpression` and the `traverse*` helpers for inspection
Important details
- Always return the original node if nothing changes (structural sharing reduces churn).
- When replacing a node with a new instance, the symbol (if any) is preserved and its `tree` back‑pointer is updated by `copyInstance`.
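The first rule above (“return the original node if nothing changes”) is worth illustrating. The sketch below uses a hypothetical expression mini‑tree, not the real transform utilities; the key idea is to rebuild a node only when a child reference actually changed.

```scala
// Hypothetical expression mini-tree; illustrative only.
sealed trait Expr
final case class Lit(value: Int) extends Expr
final case class Add(left: Expr, right: Expr) extends Expr

def transformExpr(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
  val withNewChildren = e match {
    case Add(l, r) =>
      val (nl, nr) = (transformExpr(l)(rule), transformExpr(r)(rule))
      // Structural sharing: only allocate a new node when a child actually changed
      if ((nl eq l) && (nr eq r)) e else Add(nl, nr)
    case leaf => leaf
  }
  rule.applyOrElse(withNewChildren, identity[Expr])
}

// Example rule: constant-fold additions of two literals
val foldAdd: PartialFunction[Expr, Expr] = {
  case Add(Lit(a), Lit(b)) => Lit(a + b)
}
// transformExpr(Add(Lit(1), Add(Lit(2), Lit(3))))(foldAdd)  =>  Lit(6)
```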
Catalog and External Schemas
Resolving `TableRef` and `FileRef` requires external metadata:
- Catalog lookup: provides `SchemaType` for known tables.
- File inspection: JSON/Parquet analyzers infer `RelationType` for files.
The resolver rewrites leaves to `TableScan`/`FileScan` with an explicit `schema`, from which `relationType` is derived (or projected if `columns` are specified).
Debugging and Developer Ergonomics
Printing
- `plan.pp` prints a readable tree with nested indentation.
- `WvletGenerator` renders a Wvlet program; useful in `explain` outputs.
Logging
- Many phases respect debug flags; append `-- -l debug` to test invocations to see detailed logs, e.g., `./sbt "runner/testOnly *BasicSpec -- -l debug"`.
Targeted testing
- Parser/typing focused tests live under `wvlet-lang/src/test/scala/...` and can be run per class with `testOnly`.
GenSQL: SQL Generation
GenSQL converts a resolved `LogicalPlan` (and its `ExecutionPlan` tasks) into executable SQL text for a target SQL engine. It orchestrates statement generation, handles dialect differences, and preserves useful metadata in header comments.
- Entrypoint: GenSQL.scala
- SQL printer: SqlGenerator.scala
- Dialects and flags: DBType.scala, `SQLDialect` enums
- Formatting: CodeFormatter.scala
What GenSQL does
- Plans tasks: Uses `ExecutionPlanner.plan(...)` to obtain an `ExecutionPlan`, then walks tasks:
  - `ExecuteQuery(Relation)`: generates a single SELECT using `generateSQLFromRelation`.
  - `ExecuteSave(Save, ...)`: emits DDL/DML strings via `generateSaveSQL` (CTAS, INSERT INTO, COPY, DELETE). Honors dialect flags such as `supportCreateOrReplace`, `supportCreateTableWithOption`, `supportSaveAsFile`.
  - `ExecuteValDef`: evaluates native expressions (e.g., `NativeExpression`) and updates the `ValDef` symbol.
  - `ExecuteCommand(ExecuteExpr)`: prints an expression to SQL via the shared `SqlGenerator`.
- Concatenates multiple statements with `;` and prepends a header comment containing version and source location.
Generating a SELECT
`generateSQLFromRelation(relation)` performs:
- Expand: Inline `ModelScan` bodies and function arguments; evaluate backquoted identifier interpolations; re-run `TypeResolver` on the inlined tree.
- Print: Hand the expanded `Relation` to `SqlGenerator.print` with a `CodeFormatterConfig(sqlDBType = ctx.dbType)`.
- Wrap: Optionally add a header comment with compiler version and source position.
Expansion details (inline models and locals)
- Model binding: Looks up the referenced `ModelDef` and creates a child `Context` where model parameters are bound as `ValSymbolInfo` with their argument expressions.
- Rewrite & evaluate: Rewrites the model body with those bindings and evaluates backquote‑interpolated identifiers and simple native expressions.
- Re‑resolve types: Resolves the result again through `TypeResolver` to ensure types and schemas are consistent post‑inlining.
SqlGenerator (SELECT/DDL printer)
- Purpose: Works on `LogicalPlan` and `Expression` to produce a pretty‑printed SQL `Doc` using the Wadler‑inspired formatter.
- Merges operators into SELECT (`SQLBlock`):
  - Projection/aggregation → SELECT list (with `distinct` when present).
  - `Filter` before `GroupBy` → HAVING; otherwise → WHERE.
  - `OrderBy`, `Limit`, `Offset` are positioned correctly and compacted when possible.
  - `WithQuery` emits `WITH ... AS (...)` CTEs before the main query.
  - Set operations (`UNION`, `INTERSECT`, `EXCEPT`) and pivots are lowered into supported SQL shapes.
  - DDL/updates like `CreateTableAs` and `InsertInto` are printed for engines that support them; VALUES handling respects `requireParenForValues`.
Dialect handling
- The current dialect (`ctx.dbType`) influences both printing and DDL choices:
  - `supportCreateOrReplace`, `supportCreateTableWithOption`, `supportSaveAsFile`, `supportDescribeSubQuery` change emitted statements.
  - Expression syntax variations (e.g., array/map constructors, row/struct support) are toggled by `SQLDialect` flags in DBType.scala.
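As a rough illustration of how such a capability flag can shape the emitted DDL, here is a hedged sketch: the flag name mirrors the list above, but the function, its fallback statement shape, and the plain‑string API are hypothetical (the real code builds `Doc` values and consults the full `SQLDialect`).

```scala
// Hypothetical sketch of dialect-dependent CTAS emission; not the real generateSaveSQL.
def createTableAs(table: String, selectSQL: String, supportCreateOrReplace: Boolean): String = {
  val create =
    if (supportCreateOrReplace) s"CREATE OR REPLACE TABLE $table"
    else s"CREATE TABLE IF NOT EXISTS $table" // assumed fallback for engines without OR REPLACE
  s"$create AS\n$selectSQL"
}

// createTableAs("t1", "SELECT * FROM src", supportCreateOrReplace = true)
// => "CREATE OR REPLACE TABLE t1 AS\nSELECT * FROM src"
```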
Outputs
- Single SELECT: one SQL block with a header.
- Multiple tasks: concatenated statements separated by semicolons; a trailing semicolon is added for multi‑statement outputs.
Notes and TODOs
- Show models currently prints via `SqlGenerator` by introspecting known symbols; it may be moved out of SQL generation later.
- Dialect coverage: Some database dialects aren’t fully covered yet; adding support typically requires small DBType/SQLDialect flag updates plus minor printer branches.
SQL Formatting (Wadler Doc)
Wvlet uses a Wadler‑style pretty printer to format SQL predictably. The generator builds a tree of small layout tokens (Docs), then flattens a Group to one line if it fits the configured width; otherwise it expands with newlines and indentation.
- Doc algebra: Text, NewLine, MaybeNewLine, WhiteSpaceOrNewLine (wsOrNL), LineBreak, HList (`+`), VList (`/`), Nest, Block, Group. See CodeFormatter.scala.
- Groups: Try a single‑line layout by flattening soft breaks; if the flattened length is `<= maxLineWidth`, keep it; else render with real breaks and indentation.
- Soft breaks: `wsOrNL` becomes a space when flat and a newline when expanded; `MaybeNewLine` may elide entirely when flat. Use `LineBreak` for hard, non‑flattenable breaks.
- Helpers:
  - `cl(a, b, c)`: comma list with `,` + `wsOrNL` as a separator (compact when it fits; one item per line when not).
  - `wl(x, y, z)`: join with spaces; good for keywords and short fragments.
  - `group(...)`, `nest(...)`, `paren(...)`, `indentedParen(...)`: wrap subqueries/blocks so they compact or expand consistently.
Example (SELECT list)
-- Code pattern in generator
group(wl("select", cl(selectItems.map(expr))))
-- Compact (fits width)
select a, b, c
-- Expanded (too long; breaks at cl separator)
select
a,
b,
c
Example (FROM with subquery)
-- Code pattern in generator
wl("from", indentedParen(query(child)))
-- Compact
from (select a from t)
-- Expanded
from (
select a
from t
)
Tuning knobs live in `CodeFormatterConfig` (indent width, max line width, dialect for printing expressions). `SqlGenerator` uses these building blocks, so adding new operators mostly means placing soft breaks (`wsOrNL`) at legal SQL boundaries and wrapping groups appropriately.
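The core fits‑or‑breaks decision can be captured in a few dozen lines. The sketch below is a deliberately minimal, self‑contained rendition of the idea (no `Nest`/indentation, no `MaybeNewLine`, plain strings instead of the real `Doc` type in CodeFormatter.scala), so treat names like `WsOrNL` and `selectList` as illustrative only.

```scala
// Minimal Wadler-style layout sketch: flatten a Group if it fits, else keep breaks.
sealed trait Doc
final case class Text(s: String) extends Doc
case object WsOrNL extends Doc                       // space when flat, newline when expanded
final case class HList(docs: List[Doc]) extends Doc  // horizontal concatenation
final case class Group(doc: Doc) extends Doc         // try one line, else expand

def flatWidth(d: Doc): Int = d match {
  case Text(s)      => s.length
  case WsOrNL       => 1
  case HList(ds)    => ds.map(flatWidth).sum
  case Group(inner) => flatWidth(inner)
}

def renderFlat(d: Doc): String = d match {
  case Text(s)      => s
  case WsOrNL       => " "
  case HList(ds)    => ds.map(renderFlat).mkString
  case Group(inner) => renderFlat(inner)
}

def render(d: Doc, maxLineWidth: Int): String = d match {
  case Text(s)      => s
  case WsOrNL       => "\n"
  case HList(ds)    => ds.map(render(_, maxLineWidth)).mkString
  case Group(inner) =>
    if (flatWidth(inner) <= maxLineWidth) renderFlat(inner) // fits: keep on one line
    else render(inner, maxLineWidth)                        // too long: expand soft breaks
}

// "select a, b, c" with soft breaks after each comma (no indentation in this sketch)
def selectList(items: List[String]): Doc = {
  val itemDocs: List[Doc] = items.map(Text(_))
  val separated           = itemDocs.flatMap(t => List(t, Text(","), WsOrNL)).dropRight(2)
  Group(HList(Text("select") :: WsOrNL :: separated))
}

// render(selectList(List("a", "b", "c")), maxLineWidth = 40) => "select a, b, c"
// render(selectList(List("a", "b", "c")), maxLineWidth = 8)  => "select\na,\nb,\nc"
```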
Glossary
- DataType: type of an expression or column (e.g., `int`, `varchar(20)`, `array[int]`).
- RelationType: schema of a relation (table‑like value) with named fields.
- NamedType: a `(name, DataType)` pair describing a field.
- Symbol: stable identifier for a tree; anchors phase‑mutable `SymbolInfo`.
- Denotation (`SymbolInfo`): the symbol’s type and other metadata at a phase.
Questions or proposals? Open an issue and tag “compiler” or send a PR updating this page alongside your change.
Invariants and Guidelines (WIP)
- Avoid duplication: Trees are immutable; types should not be stored twice when avoidable.
- Consistency: If a plan node has both a symbol and an intrinsic schema, keep them consistent (`symbol.dataType == relationType`).
- Single source of truth: Prefer computing types from children; use `SymbolInfo` as the single source of truth for definitions and named entities.
- Locality: Keep rewrites local and explicit; avoid hidden global state.
Roadmap and Open Questions
- Symbol‑centric typing (definitions first, then source relations). See #1175.
- Expand optimization passes (projection/filter pushdown, join key normalization).
- SQL dialect backends.
If you’re about to change phase ordering, add a short ADR (Architecture Decision Record) and link it here.