isabelle: doc-src/Ref/syntax.tex@fd773574cb81



\chapter{Syntax Transformations} \label{chap:syntax}
\newcommand\ttapp{\mathrel{\hbox{\tt\$}}}
\newcommand\mtt[1]{\mbox{\tt #1}}
\newcommand\ttfct[1]{\mathop{\mtt{#1}}\nolimits}
\newcommand\Constant{\ttfct{Constant}}
\newcommand\Variable{\ttfct{Variable}}
\newcommand\Appl[1]{\ttfct{Appl}\,[#1]}
\index{syntax!transformations|(}


\section{Transforming parse trees to ASTs}\label{sec:astofpt}
\index{ASTs!made from parse trees}
\newcommand\astofpt[1]{\lbrakk#1\rbrakk}

The parse tree is the raw output of the parser.  Translation functions,
called {\bf parse AST translations}\indexbold{translations!parse AST},
transform the parse tree into an abstract syntax tree.

The parse tree is constructed by nesting the right-hand sides of the
productions used to recognize the input.  Such parse trees are simply lists
of tokens and constituent parse trees, the latter representing the
nonterminals of the productions.  Let us refer to the actual productions in
the form displayed by {\tt print_syntax} (see \S\ref{sec:inspct-thy} for an
example).

Ignoring parse \AST{} translations, parse trees are transformed to \AST{}s
by stripping out delimiters and copy productions.  More precisely, the
mapping $\astofpt{-}$ is derived from the productions as follows:
\begin{itemize}
\item Name tokens: $\astofpt{t} = \Variable s$, where $t$ is an \ndx{id},
  \ndx{var}, \ndx{tid}, \ndx{tvar}, \ndx{num}, \ndx{xnum} or \ndx{xstr} token,
  and $s$ its associated string.  Note that for {\tt xstr} this does not
  include the quotes.

\item Copy productions:\index{productions!copy}
  $\astofpt{\ldots P \ldots} = \astofpt{P}$.  Here $\ldots$ stands for
  strings of delimiters, which are discarded.  $P$ stands for the single
  constituent that is not a delimiter; it is either a nonterminal symbol or
  a name token.

  \item 0-ary productions: $\astofpt{\ldots \mtt{=>} c} = \Constant c$.
    Here there are no constituents other than delimiters, which are
    discarded.

  \item $n$-ary productions, where $n \ge 1$: delimiters are discarded and
    the remaining constituents $P@1$, \ldots, $P@n$ are built into an
    application whose head constant is~$c$:
    \[ \astofpt{\ldots P@1 \ldots P@n \ldots \mtt{=>} c} =
       \Appl{\Constant c, \astofpt{P@1}, \ldots, \astofpt{P@n}}
    \]
\end{itemize}
Figure~\ref{fig:parse_ast} presents some simple examples, where {\tt ==},
{\tt _appl}, {\tt _args}, and so forth name productions of the Pure syntax.
These examples illustrate the need for further translations to make \AST{}s
closer to the typed $\lambda$-calculus.  The Pure syntax provides
predefined parse \AST{} translations\index{translations!parse AST} for
ordinary applications, type applications, nested abstractions, meta
implications and function types.  Figure~\ref{fig:parse_ast_tr} shows their
effect on some representative input strings.


\begin{figure}
\begin{center}
\tt\begin{tabular}{ll}
\rm input string    & \rm \AST \\\hline
"f"                 & f \\
"'a"                & 'a \\
"t == u"            & ("==" t u) \\
"f(x)"              & ("_appl" f x) \\
"f(x, y)"           & ("_appl" f ("_args" x y)) \\
"f(x, y, z)"        & ("_appl" f ("_args" x ("_args" y z))) \\
"\%x y.\ t"         & ("_lambda" ("_idts" x y) t) \\
\end{tabular}
\end{center}
\caption{Parsing examples using the Pure syntax}\label{fig:parse_ast}
\end{figure}

\begin{figure}
\begin{center}
\tt\begin{tabular}{ll}
\rm input string            & \rm \AST{} \\\hline
"f(x, y, z)"                & (f x y z) \\
"'a ty"                     & (ty 'a) \\
"('a, 'b) ty"               & (ty 'a 'b) \\
"\%x y z.\ t"               & ("_abs" x ("_abs" y ("_abs" z t))) \\
"\%x ::\ 'a.\ t"            & ("_abs" ("_constrain" x 'a) t) \\
"[| P; Q; R |] => S"        & ("==>" P ("==>" Q ("==>" R S))) \\
"['a, 'b, 'c] => 'd"        & ("fun" 'a ("fun" 'b ("fun" 'c 'd)))
\end{tabular}
\end{center}
\caption{Built-in parse \AST{} translations}\label{fig:parse_ast_tr}
\end{figure}

The names of constant heads in the \AST{} control the translation process.
The list of constants invoking parse \AST{} translations appears in the
output of {\tt print_syntax} under {\tt parse_ast_translation}.


\section{Transforming ASTs to terms}\label{sec:termofast}
\index{terms!made from ASTs}
\newcommand\termofast[1]{\lbrakk#1\rbrakk}

The \AST{}, after application of macros (see \S\ref{sec:macros}), is
transformed into a term.  This term is probably ill-typed since type
inference has not occurred yet.  The term may contain type constraints
consisting of applications with head {\tt "_constrain"}; the second
argument is a type encoded as a term.  Type inference later introduces
correct types or rejects the input.

Another set of translation functions, namely parse
translations\index{translations!parse}, may affect this process.  If we
ignore parse translations for the time being, then \AST{}s are transformed
to terms by mapping \AST{} constants to constants, \AST{} variables to
schematic or free variables and \AST{} applications to applications.

More precisely, the mapping $\termofast{-}$ is defined by
\begin{itemize}
\item Constants: $\termofast{\Constant x} = \ttfct{Const} (x,
  \mtt{dummyT})$.

\item Schematic variables: $\termofast{\Variable \mtt{"?}xi\mtt"} =
  \ttfct{Var} ((x, i), \mtt{dummyT})$, where $x$ is the base name and $i$
  the index extracted from~$xi$.

\item Free variables: $\termofast{\Variable x} = \ttfct{Free} (x,
  \mtt{dummyT})$.

\item Function applications with $n$ arguments:
    \[ \termofast{\Appl{f, x@1, \ldots, x@n}} =
       \termofast{f} \ttapp
         \termofast{x@1} \ttapp \ldots \ttapp \termofast{x@n}
    \]
\end{itemize}
Here \ttindex{Const}, \ttindex{Var}, \ttindex{Free} and
\verb|$|\index{$@{\tt\$}} are constructors of the datatype \mltydx{term},
while \ttindex{dummyT} stands for some dummy type that is ignored during
type inference.

So far the outcome is still a first-order term.  Abstractions and bound
variables (constructors \ttindex{Abs} and \ttindex{Bound}) are introduced
by parse translations.  Such translations are attached to {\tt "_abs"},
{\tt "!!"} and user-defined binders.


\section{Printing of terms}
\newcommand\astofterm[1]{\lbrakk#1\rbrakk}\index{ASTs!made from terms}

The output phase is essentially the inverse of the input phase.  Terms are
translated via abstract syntax trees into strings.  Finally the strings are
pretty printed.

Print translations (\S\ref{sec:tr_funs}) may affect the transformation of
terms into \AST{}s.  Ignoring those, the transformation maps
term constants, variables and applications to the corresponding constructs
on \AST{}s.  Abstractions are mapped to applications of the special
constant {\tt _abs}.

More precisely, the mapping $\astofterm{-}$ is defined as follows:
\begin{itemize}
  \item $\astofterm{\ttfct{Const} (x, \tau)} = \Constant x$.

  \item $\astofterm{\ttfct{Free} (x, \tau)} = constrain (\Variable x,
    \tau)$.

  \item $\astofterm{\ttfct{Var} ((x, i), \tau)} = constrain (\Variable
    \mtt{"?}xi\mtt", \tau)$, where $\mtt?xi$ is the string representation of
    the {\tt indexname} $(x, i)$.

  \item For the abstraction $\lambda x::\tau.t$, let $x'$ be a variant
    of~$x$ renamed to differ from all names occurring in~$t$, and let $t'$
    be obtained from~$t$ by replacing all bound occurrences of~$x$ by
    the free variable $x'$.  This replaces corresponding occurrences of the
    constructor \ttindex{Bound} by the term $\ttfct{Free} (x',
    \mtt{dummyT})$:
   \[ \astofterm{\ttfct{Abs} (x, \tau, t)} =
      \Appl{\Constant \mtt{"_abs"},
        constrain(\Variable x', \tau), \astofterm{t'}}
    \]

  \item $\astofterm{\ttfct{Bound} i} = \Variable \mtt{"B.}i\mtt"$.
    The occurrence of constructor \ttindex{Bound} should never happen
    when printing well-typed terms; it indicates a de Bruijn index with no
    matching abstraction.

  \item Where $f$ is not an application,
    \[ \astofterm{f \ttapp x@1 \ttapp \ldots \ttapp x@n} =
       \Appl{\astofterm{f}, \astofterm{x@1}, \ldots,\astofterm{x@n}}
    \]
\end{itemize}
%
Type constraints\index{type constraints} are inserted to allow the printing
of types.  This is governed by the boolean variable \ttindex{show_types}:
\begin{itemize}
  \item $constrain(x, \tau) = x$ \ if $\tau = \mtt{dummyT}$ \index{*dummyT} or
    \ttindex{show_types} is set to {\tt false}.

  \item $constrain(x, \tau) = \Appl{\Constant \mtt{"_constrain"}, x,
         \astofterm{\tau}}$ \ otherwise.

    Here, $\astofterm{\tau}$ is the \AST{} encoding of $\tau$: type
    constructors go to {\tt Constant}s; type identifiers go to {\tt
      Variable}s; type applications go to {\tt Appl}s with the type
    constructor as the first element.  If \ttindex{show_sorts} is set to
    {\tt true}, some type variables are decorated with an \AST{} encoding
    of their sort.
\end{itemize}
%
The \AST{}, after application of macros (see \S\ref{sec:macros}), is
transformed into the final output string.  The built-in {\bf print AST
  translations}\indexbold{translations!print AST} reverse the
parse \AST{} translations of Fig.\ts\ref{fig:parse_ast_tr}.

For the actual printing process, the names attached to productions
of the form $\ldots A^{(p@1)}@1 \ldots A^{(p@n)}@n \ldots \mtt{=>} c$ play
a vital role.  Each \AST{} with constant head $c$, namely $\mtt"c\mtt"$ or
$(\mtt"c\mtt"~ x@1 \ldots x@n)$, is printed according to the production
for~$c$.  Each argument~$x@i$ is converted to a string, and put in
parentheses if its priority~$(p@i)$ requires this.  The resulting strings
and their syntactic sugar (denoted by \dots{} above) are joined to make a
single string.

If an application $(\mtt"c\mtt"~ x@1 \ldots x@m)$ has more arguments
than the corresponding production, it is first split into
$((\mtt"c\mtt"~ x@1 \ldots x@n) ~ x@{n+1} \ldots x@m)$.  Applications
with too few arguments or with non-constant head or without a
corresponding production are printed as $f(x@1, \ldots, x@l)$ or
$(\alpha@1, \ldots, \alpha@l) ty$.  Multiple productions associated
with some name $c$ are tried in order of appearance.  An occurrence of
$\Variable x$ is simply printed as~$x$.

Blanks are {\em not\/} inserted automatically.  If blanks are required to
separate tokens, specify them in the mixfix declaration, possibly preceded
by a slash~({\tt/}) to allow a line break.
\index{ASTs|)}



\section{Macros: syntactic rewriting} \label{sec:macros}
\index{macros|(}\index{rewriting!syntactic|(}

\subsection{Specifying macros}

\begin{warn}
If a macro rule works incorrectly, inspect its internal form as
shown above, recalling that constants appear as quoted strings and
variables without quotes.
\end{warn}

\begin{warn}
If \ttindex{eta_contract} is set to {\tt true}, terms will be
$\eta$-contracted {\em before\/} the \AST{} rewriter sees them.  Thus some
abstraction nodes needed for print rules to match may vanish.  For example,
\verb|Ball(A, %x. P(x))| contracts to {\tt Ball(A, P)}; the print rule does
not apply and the output will be {\tt Ball(A, P)}.  This problem would not
occur if \ML{} translation functions were used instead of macros (as is
done for binder declarations).
\end{warn}


\begin{warn}
Another trap concerns type constraints.  If \ttindex{show_types} is set to
{\tt true}, bound variables will be decorated by their meta types at the
binding place (but not at occurrences in the body).  Matching with
\verb|Collect(A, %x. P)| binds {\tt x} to something like {\tt ("_constrain" y
"i")} rather than only {\tt y}.  \AST{} rewriting will cause the constraint to
appear in the external form, say \verb|{y::i:A::i. P::o}|.

To allow such constraints to be re-read, your syntax should specify bound
variables using the nonterminal~\ndx{idt}.  This is the case in our
example.  Choosing {\tt id} instead of {\tt idt} is a common error.
\end{warn}



\subsection{Applying rules}
As a term is being parsed or printed, an \AST{} is generated as an
intermediate form (recall Fig.\ts\ref{fig:parse_print}).  The \AST{} is
normalised by applying macro rules in the manner of a traditional term
rewriting system.  We first examine how a single rule is applied.

Let $t$ be the abstract syntax tree to be normalised and $(l, r)$ some
translation rule.  A subtree~$u$ of $t$ is a {\bf redex} if it is an
instance of~$l$; in this case $l$ is said to {\bf match}~$u$.  A redex
matched by $l$ may be replaced by the corresponding instance of~$r$, thus
{\bf rewriting} the \AST~$t$.  Matching requires some notion of {\bf
  place-holders} that may occur in rule patterns but not in ordinary
\AST{}s; {\tt Variable} atoms serve this purpose.

The matching of the object~$u$ by the pattern~$l$ is performed as follows:
\begin{itemize}
  \item Every constant matches itself.

  \item $\Variable x$ in the object matches $\Constant x$ in the pattern.
    This point is discussed further below.

  \item Every \AST{} in the object matches $\Variable x$ in the pattern,
    binding~$x$ to~$u$.

  \item One application matches another if they have the same number of
    subtrees and corresponding subtrees match.

  \item In every other case, matching fails.  In particular, {\tt
      Constant}~$x$ can only match itself.
\end{itemize}
A successful match yields a substitution that is applied to~$r$, generating
the instance that replaces~$u$.

The second case above may look odd.  This is where {\tt Variable}s of
non-rule \AST{}s behave like {\tt Constant}s.  Recall that \AST{}s are not
far removed from parse trees; at this level it is not yet known which
identifiers will become constants, bounds, frees, types or classes.  As
\S\ref{sec:asts} describes, former parse tree heads appear in \AST{}s as
{\tt Constant}s, while the name tokens \ndx{id}, \ndx{var}, \ndx{tid},
\ndx{tvar}, \ndx{num}, \ndx{xnum} and \ndx{xstr} become {\tt Variable}s.  On the other
hand, when \AST{}s generated from terms for printing, all constants and type
constructors become {\tt Constant}s; see \S\ref{sec:asts}.  Thus \AST{}s may
contain a messy mixture of {\tt Variable}s and {\tt Constant}s.  This is
insignificant at macro level because matching treats them alike.

Normalizing an \AST{} involves repeatedly applying macro rules until none are
applicable.  Macro rules are chosen in order of appearance in the theory
definitions.  You can watch the normalization of \AST{}s during parsing and
printing by setting \ttindex{Syntax.trace_ast} to {\tt true}.\index{tracing!of
  macros} The information displayed when tracing includes the \AST{} before
normalization ({\tt pre}), redexes with results ({\tt rewrote}), the normal
form finally reached ({\tt post}) and some statistics ({\tt normalize}).


\section{Translation functions} \label{sec:tr_funs}
\index{translations|(}
%
This section describes the translation function mechanism.  By writing \ML{}
functions, you can do almost everything to terms or \AST{}s during parsing and
printing.  The logic LK is a good example of sophisticated transformations
between internal and external representations of sequents; here, macros would
be useless.

A full understanding of translations requires some familiarity
with Isabelle's internals, especially the datatypes {\tt term}, {\tt typ},
{\tt Syntax.ast} and the encodings of types and terms as such at the various
stages of the parsing or printing process.  Most users should never need to
use translation functions.

\subsection{Declaring translation functions}
There are four kinds of translation functions, with one of these
coming in two variants.  Each such function is associated with a name,
which triggers calls to it.  Such names can be constants (logical or
syntactic) or type constructors.

Function {\tt print_syntax} displays the sets of names associated with the
translation functions of a theory under \texttt{parse_ast_translation}, etc.
You can add new ones via the {\tt ML} section\index{*ML section} of a theory
definition file.  Even though the {\tt ML} section is the very last part of
the file, newly installed translation functions are already effective when
processing all of the preceding sections.

The {\tt ML} section's contents are simply copied verbatim near the
beginning of the \ML\ file generated from a theory definition file.
Definitions made here are accessible as components of an \ML\ 
structure; to make some parts private, use an \ML{} {\tt local}
declaration.  The {\ML} code may install translation functions by
declaring any of the following identifiers:
\begin{ttbox}
val parse_ast_translation   : (string * (ast list -> ast)) list
val print_ast_translation   : (string * (ast list -> ast)) list
val parse_translation       : (string * (term list -> term)) list
val print_translation       : (string * (term list -> term)) list
val typed_print_translation :
    (string * (bool -> typ -> term list -> term)) list
\end{ttbox}

\subsection{The translation strategy}
The different kinds of translation functions are called during the
transformations between parse trees, \AST{}s and terms (recall
Fig.\ts\ref{fig:parse_print}).  Whenever a combination of the form
$(\mtt"c\mtt"~x@1 \ldots x@n)$ is encountered, and a translation
function $f$ of appropriate kind exists for $c$, the result is
computed by the \ML{} function call $f \mtt[ x@1, \ldots, x@n \mtt]$.

For \AST{} translations, the arguments $x@1, \ldots, x@n$ are \AST{}s.
A combination has the form $\Constant c$ or $\Appl{\Constant c, x@1,
  \ldots, x@n}$.  For term translations, the arguments are terms and a
combination has the form $\ttfct{Const} (c, \tau)$ or $\ttfct{Const}
(c, \tau) \ttapp x@1 \ttapp \ldots \ttapp x@n$.  Terms allow more
sophisticated transformations than \AST{}s do, typically involving
abstractions and bound variables. {\em Typed} print translations may
even peek at the type $\tau$ of the constant they are invoked on; they
are also passed the current value of the \ttindex{show_sorts} flag.

Regardless of whether they act on terms or \AST{}s, translation
functions called during the parsing process differ from those for
printing more fundamentally in their overall behaviour:
\begin{description}
\item[Parse translations] are applied bottom-up.  The arguments are already in
  translated form.  The translations must not fail; exceptions trigger an
  error message.  There may never be more than one function associated with
  any syntactic name.
  
\item[Print translations] are applied top-down.  They are supplied with
  arguments that are partly still in internal form.  The result again
  undergoes translation; therefore a print translation should not introduce as
  head the very constant that invoked it.  The function may raise exception
  \xdx{Match} to indicate failure; in this event it has no effect.  Multiple
  functions associated with some syntactic name are tried in an unspecified
  order.
\end{description}

Only constant atoms --- constructor \ttindex{Constant} for \AST{}s and
\ttindex{Const} for terms --- can invoke translation functions.  This
causes another difference between parsing and printing.

Parsing starts with a string and the constants are not yet identified.
Only parse tree heads create {\tt Constant}s in the resulting \AST, as
described in \S\ref{sec:astofpt}.  Macros and parse \AST{} translations may
introduce further {\tt Constant}s.  When the final \AST{} is converted to a
term, all {\tt Constant}s become {\tt Const}s, as described in
\S\ref{sec:termofast}.

Printing starts with a well-typed term and all the constants are known.  So
all logical constants and type constructors may invoke print translations.
These, and macros, may introduce further constants.

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "ref"
%%% End:
author	wenzelm
	Tue, 19 Jun 2012 20:51:15 +0200
changeset 48116	fd773574cb81
parent 48115	d868e4f7905b
child 48117	e21f4d5b9636
permissions	-rw-r--r--