more on "Abstract syntax trees";
authorwenzelm
Mon, 18 Jun 2012 21:17:34 +0200
changeset 48114 428e55887f24
parent 48113 1c4500446ba4
child 48115 d868e4f7905b
more on "Abstract syntax trees";
doc-src/IsarRef/Thy/Inner_Syntax.thy
doc-src/IsarRef/Thy/document/Inner_Syntax.tex
doc-src/Ref/syntax.tex
--- a/doc-src/IsarRef/Thy/Inner_Syntax.thy	Mon Jun 18 16:30:20 2012 +0200
+++ b/doc-src/IsarRef/Thy/Inner_Syntax.thy	Mon Jun 18 21:17:34 2012 +0200
@@ -1062,11 +1062,123 @@
 
 subsection {* Abstract syntax trees *}
 
-text {*
-  FIXME
+text {* The ML datatype @{ML_type Ast.ast} explicitly represents the
+  intermediate AST format that is used for syntax rewriting
+  (\secref{sec:syn-trans}).  It is defined in ML as follows:
+  \begin{ttbox}
+  datatype ast =
+    Constant of string |
+    Variable of string |
+    Appl of ast list
+  \end{ttbox}
+
+  An AST is either an atom (constant or variable) or a list of (at
+  least two) subtrees.  Occasional diagnostic output of ASTs uses
+  notation that resembles S-expression of LISP.  Constant atoms are
+  shown as quoted strings, variable atoms as non-quoted strings and
+  applications as a parenthesized list of subtrees.  For example, the
+  AST
+  @{ML [display] "Ast.Appl
+  [Ast.Constant \"_abs\", Ast.Variable \"x\", Ast.Variable \"t\"]"}
+  is pretty-printed as @{verbatim "(\"_abs\" x t)"}.  Note that
+  @{verbatim "()"} and @{verbatim "(x)"} are excluded as ASTs, because
+  they have too few subtrees.
+
+  \medskip AST application is merely a pro-forma mechanism to indicate
+  certain syntactic structures.  Thus @{verbatim "(c a b)"} could mean
+  either term application or type application, depending on the
+  syntactic context.
+
+  Nested application like @{verbatim "((\"_abs\" x t) u)"} is also
+  possible, but ASTs are definitely first-order: the syntax constant
+  @{verbatim "\"_abs\""} does not bind the @{verbatim x} in any way.
+  Proper bindings are introduced in later stages of the term syntax,
+  where @{verbatim "(\"_abs\" x t)"} becomes an @{ML Abs} node and
+  occurrences of @{verbatim x} in @{verbatim t} are replaced by bound
+  variables (represented as de-Bruijn indices).
 *}
 
 
+subsubsection {* Ast constants versus variables *}
+
+text {* Depending on the situation --- input syntax, output syntax,
+  translation patterns --- the distinction of atomic asts as @{ML
+  Ast.Constant} versus @{ML Ast.Variable} serves slightly different
+  purposes.
+
+  Input syntax of a term such as @{text "f a b = c"} does not yet
+  indicate the scopes of atomic entities @{text "f, a, b, c"}: they
+  could be global constants or local variables, even bound ones
+  depending on the context of the term.  @{ML Ast.Variable} leaves
+  this choice still open: later syntax layers (or translation
+  functions) may capture such a variable to determine its role
+  specifically, to make it a constant, bound variable, free variable
+  etc.  In contrast, syntax translations that introduce already known
+  constants would rather do it via @{ML Ast.Constant} to prevent
+  accidental re-interpretation later on.
+
+  Output syntax turns term constants into @{ML Ast.Constant} and
+  variables (free or schematic) into @{ML Ast.Variable}.  This
+  information is precise when printing fully formal @{text "\<lambda>"}-terms.
+
+  In AST translation patterns (\secref{sec:syn-trans}) the system
+  guesses from the current theory context which atoms should be
+  treated as constant versus variable for the matching process.
+  Sometimes this needs to be indicated more explicitly using @{text
+  "CONST c"} inside the term language.  It is also possible to use
+  @{command syntax} declarations (without mixfix annotation) to
+  enforce that certain unqualified names are always treated as
+  constant within the syntax machinery.
+
+  \medskip For ASTs that represent the language of types or sorts, the
+  situation is much simpler, since the concrete syntax already
+  distinguishes type variables from type constants (constructors).  So
+  @{text "('a, 'b) foo"} corresponds to an AST application of some
+  constant for @{text foo} and variable arguments for @{text "'a"} and
+  @{text "'b"}.  Note that the postfix application is merely a feature
+  of the concrete syntax, while in the AST the constructor occurs in
+  head position.  *}
+
+
+subsubsection {* Authentic syntax names *}
+
+text {* Naming constant entities within ASTs is another delicate
+  issue.  Unqualified names are looked up in the name space tables in
+  the last stage of parsing, after all translations have been applied.
+  Since syntax transformations do not know about this later name
+  resolution yet, there can be surprises in boundary cases.
+
+  \emph{Authentic syntax names} for @{ML Ast.Constant} avoid this
+  problem: the fully-qualified constant name with a special prefix for
+  its formal category (@{text "class"}, @{text "type"}, @{text
+  "const"}, @{text "fixed"}) represents the information faithfully
+  within the untyped AST format.  Accidental overlap with free or
+  bound variables is excluded as well.  Authentic syntax names work
+  implicitly in the following situations:
+
+  \begin{itemize}
+
+  \item Input of term constants (or fixed variables) that are
+  introduced by concrete syntax via @{command notation}: the
+  correspondence of a particular grammar production to some known term
+  entity is preserved.
+
+  \item Input type constants (constructors) and type classes ---
+  thanks to explicit syntactic distinction independently on the
+  context.
+
+  \item Output of term constants, type constants, type classes ---
+  this information is already available from the internal term to be
+  printed.
+
+  \end{itemize}
+
+  In other words, syntax transformations that operate on input terms
+  written as prefix applications are difficult to achieve.  Luckily,
+  this case rarely occurs in practice, because syntax forms to be
+  translated usually correspond to some bits of concrete notation. *}
+
+
 subsection {* Raw syntax and translations \label{sec:syn-trans} *}
 
 text {*
--- a/doc-src/IsarRef/Thy/document/Inner_Syntax.tex	Mon Jun 18 16:30:20 2012 +0200
+++ b/doc-src/IsarRef/Thy/document/Inner_Syntax.tex	Mon Jun 18 21:17:34 2012 +0200
@@ -1244,7 +1244,127 @@
 \isamarkuptrue%
 %
 \begin{isamarkuptext}%
-FIXME%
+The ML datatype \verb|Ast.ast| explicitly represents the
+  intermediate AST format that is used for syntax rewriting
+  (\secref{sec:syn-trans}).  It is defined in ML as follows:
+  \begin{ttbox}
+  datatype ast =
+    Constant of string |
+    Variable of string |
+    Appl of ast list
+  \end{ttbox}
+
+  An AST is either an atom (constant or variable) or a list of (at
+  least two) subtrees.  Occasional diagnostic output of ASTs uses
+  notation that resembles S-expression of LISP.  Constant atoms are
+  shown as quoted strings, variable atoms as non-quoted strings and
+  applications as a parenthesized list of subtrees.  For example, the
+  AST
+  \begin{verbatim}
+Ast.Appl
+  [Ast.Constant "_abs", Ast.Variable "x", Ast.Variable "t"]
+\end{verbatim}
+  is pretty-printed as \verb|("_abs" x t)|.  Note that
+  \verb|()| and \verb|(x)| are excluded as ASTs, because
+  they have too few subtrees.
+
+  \medskip AST application is merely a pro-forma mechanism to indicate
+  certain syntactic structures.  Thus \verb|(c a b)| could mean
+  either term application or type application, depending on the
+  syntactic context.
+
+  Nested application like \verb|(("_abs" x t) u)| is also
+  possible, but ASTs are definitely first-order: the syntax constant
+  \verb|"_abs"| does not bind the \verb|x| in any way.
+  Proper bindings are introduced in later stages of the term syntax,
+  where \verb|("_abs" x t)| becomes an \verb|Abs| node and
+  occurrences of \verb|x| in \verb|t| are replaced by bound
+  variables (represented as de-Bruijn indices).%
+\end{isamarkuptext}%
+\isamarkuptrue%
+%
+\isamarkupsubsubsection{Ast constants versus variables%
+}
+\isamarkuptrue%
+%
+\begin{isamarkuptext}%
+Depending on the situation --- input syntax, output syntax,
+  translation patterns --- the distinction of atomic asts as \verb|Ast.Constant| versus \verb|Ast.Variable| serves slightly different
+  purposes.
+
+  Input syntax of a term such as \isa{{\isaliteral{22}{\isachardoublequote}}f\ a\ b\ {\isaliteral{3D}{\isacharequal}}\ c{\isaliteral{22}{\isachardoublequote}}} does not yet
+  indicate the scopes of atomic entities \isa{{\isaliteral{22}{\isachardoublequote}}f{\isaliteral{2C}{\isacharcomma}}\ a{\isaliteral{2C}{\isacharcomma}}\ b{\isaliteral{2C}{\isacharcomma}}\ c{\isaliteral{22}{\isachardoublequote}}}: they
+  could be global constants or local variables, even bound ones
+  depending on the context of the term.  \verb|Ast.Variable| leaves
+  this choice still open: later syntax layers (or translation
+  functions) may capture such a variable to determine its role
+  specifically, to make it a constant, bound variable, free variable
+  etc.  In contrast, syntax translations that introduce already known
+  constants would rather do it via \verb|Ast.Constant| to prevent
+  accidental re-interpretation later on.
+
+  Output syntax turns term constants into \verb|Ast.Constant| and
+  variables (free or schematic) into \verb|Ast.Variable|.  This
+  information is precise when printing fully formal \isa{{\isaliteral{22}{\isachardoublequote}}{\isaliteral{5C3C6C616D6264613E}{\isasymlambda}}{\isaliteral{22}{\isachardoublequote}}}-terms.
+
+  In AST translation patterns (\secref{sec:syn-trans}) the system
+  guesses from the current theory context which atoms should be
+  treated as constant versus variable for the matching process.
+  Sometimes this needs to be indicated more explicitly using \isa{{\isaliteral{22}{\isachardoublequote}}CONST\ c{\isaliteral{22}{\isachardoublequote}}} inside the term language.  It is also possible to use
+  \hyperlink{command.syntax}{\mbox{\isa{\isacommand{syntax}}}} declarations (without mixfix annotation) to
+  enforce that certain unqualified names are always treated as
+  constant within the syntax machinery.
+
+  \medskip For ASTs that represent the language of types or sorts, the
+  situation is much simpler, since the concrete syntax already
+  distinguishes type variables from type constants (constructors).  So
+  \isa{{\isaliteral{22}{\isachardoublequote}}{\isaliteral{28}{\isacharparenleft}}{\isaliteral{27}{\isacharprime}}a{\isaliteral{2C}{\isacharcomma}}\ {\isaliteral{27}{\isacharprime}}b{\isaliteral{29}{\isacharparenright}}\ foo{\isaliteral{22}{\isachardoublequote}}} corresponds to an AST application of some
+  constant for \isa{foo} and variable arguments for \isa{{\isaliteral{22}{\isachardoublequote}}{\isaliteral{27}{\isacharprime}}a{\isaliteral{22}{\isachardoublequote}}} and
+  \isa{{\isaliteral{22}{\isachardoublequote}}{\isaliteral{27}{\isacharprime}}b{\isaliteral{22}{\isachardoublequote}}}.  Note that the postfix application is merely a feature
+  of the concrete syntax, while in the AST the constructor occurs in
+  head position.%
+\end{isamarkuptext}%
+\isamarkuptrue%
+%
+\isamarkupsubsubsection{Authentic syntax names%
+}
+\isamarkuptrue%
+%
+\begin{isamarkuptext}%
+Naming constant entities within ASTs is another delicate
+  issue.  Unqualified names are looked up in the name space tables in
+  the last stage of parsing, after all translations have been applied.
+  Since syntax transformations do not know about this later name
+  resolution yet, there can be surprises in boundary cases.
+
+  \emph{Authentic syntax names} for \verb|Ast.Constant| avoid this
+  problem: the fully-qualified constant name with a special prefix for
+  its formal category (\isa{{\isaliteral{22}{\isachardoublequote}}class{\isaliteral{22}{\isachardoublequote}}}, \isa{{\isaliteral{22}{\isachardoublequote}}type{\isaliteral{22}{\isachardoublequote}}}, \isa{{\isaliteral{22}{\isachardoublequote}}const{\isaliteral{22}{\isachardoublequote}}}, \isa{{\isaliteral{22}{\isachardoublequote}}fixed{\isaliteral{22}{\isachardoublequote}}}) represents the information faithfully
+  within the untyped AST format.  Accidental overlap with free or
+  bound variables is excluded as well.  Authentic syntax names work
+  implicitly in the following situations:
+
+  \begin{itemize}
+
+  \item Input of term constants (or fixed variables) that are
+  introduced by concrete syntax via \hyperlink{command.notation}{\mbox{\isa{\isacommand{notation}}}}: the
+  correspondence of a particular grammar production to some known term
+  entity is preserved.
+
+  \item Input type constants (constructors) and type classes ---
+  thanks to explicit syntactic distinction independently on the
+  context.
+
+  \item Output of term constants, type constants, type classes ---
+  this information is already available from the internal term to be
+  printed.
+
+  \end{itemize}
+
+  In other words, syntax transformations that operate on input terms
+  written as prefix applications are difficult to achieve.  Luckily,
+  this case rarely occurs in practice, because syntax forms to be
+  translated usually correspond to some bits of concrete notation.%
 \end{isamarkuptext}%
 \isamarkuptrue%
 %
--- a/doc-src/Ref/syntax.tex	Mon Jun 18 16:30:20 2012 +0200
+++ b/doc-src/Ref/syntax.tex	Mon Jun 18 21:17:34 2012 +0200
@@ -8,50 +8,6 @@
 \newcommand\Appl[1]{\ttfct{Appl}\,[#1]}
 \index{syntax!transformations|(}
 
-\section{Abstract syntax trees} \label{sec:asts}
-\index{ASTs|(}
-
-Abstract syntax trees are an intermediate form between the raw parse trees
-and the typed $\lambda$-terms.  An \AST{} is either an atom (constant or
-variable) or a list of {\em at least two\/} subtrees.  Internally, they
-have type \mltydx{Syntax.ast}: \index{*Constant} \index{*Variable}
-\index{*Appl}
-\begin{ttbox}
-datatype ast = Constant of string
-             | Variable of string
-             | Appl of ast list
-\end{ttbox}
-%
-Isabelle uses an S-expression syntax for abstract syntax trees.  Constant
-atoms are shown as quoted strings, variable atoms as non-quoted strings and
-applications as a parenthesised list of subtrees.  For example, the \AST
-\begin{ttbox}
-Appl [Constant "_constrain",
-      Appl [Constant "_abs", Variable "x", Variable "t"],
-      Appl [Constant "fun", Variable "'a", Variable "'b"]]
-\end{ttbox}
-is shown as {\tt ("_constrain" ("_abs" x t) ("fun" 'a 'b))}.
-Both {\tt ()} and {\tt (f)} are illegal because they have too few
-subtrees.
-
-The resemblance to Lisp's S-expressions is intentional, but there are two
-kinds of atomic symbols: $\Constant x$ and $\Variable x$.  Do not take the
-names {\tt Constant} and {\tt Variable} too literally; in the later
-translation to terms, $\Variable x$ may become a constant, free or bound
-variable, even a type constructor or class name; the actual outcome depends
-on the context.
-
-Similarly, you can think of ${\tt (} f~x@1~\ldots~x@n{\tt )}$ as the
-application of~$f$ to the arguments $x@1, \ldots, x@n$.  But the kind of
-application is determined later by context; it could be a type constructor
-applied to types.
-
-Forms like {\tt (("_abs" x $t$) $u$)} are legal, but \AST{}s are
-first-order: the {\tt "_abs"} does not bind the {\tt x} in any way.  Later
-at the term level, {\tt ("_abs" x $t$)} will become an {\tt Abs} node and
-occurrences of {\tt x} in $t$ will be replaced by bound variables (the term
-constructor \ttindex{Bound}).
-
 
 \section{Transforming parse trees to ASTs}\label{sec:astofpt}
 \index{ASTs!made from parse trees}