src/Doc/Codegen/Foundations.thy
author wenzelm
Fri, 11 Apr 2014 12:40:12 +0200
changeset 56534 3ff16a7f0b2e
parent 54890 cb892d835803
child 59377 056945909f60
permissions -rw-r--r--
more formal dependencies via 'document_files';

theory Foundations
imports Introduction
begin

section {* Code generation foundations \label{sec:foundations} *}

subsection {* Code generator architecture \label{sec:architecture} *}

text {*
  The code generator is actually a framework consisting of different
  components which can be customised individually.

  Conceptually all components operate on Isabelle's logic framework
  @{theory Pure}.  Practically, the object logic @{theory HOL}
  provides the necessary facilities to make use of the code generator,
  mainly since it is an extension of @{theory Pure}.

  The constellation of the different components is visualized in the
  following picture.

  \begin{figure}[h]
    \def\sys#1{\emph{#1}}
    \begin{tikzpicture}[x = 4cm, y = 1cm]
      \tikzstyle positive=[color = black, fill = white];
      \tikzstyle negative=[color = white, fill = black];
      \tikzstyle entity=[rounded corners, draw, thick];
      \tikzstyle process=[ellipse, draw, thick];
      \tikzstyle arrow=[-stealth, semithick];
      \node (spec) at (0, 3) [entity, positive] {specification tools};
      \node (user) at (1, 3) [entity, positive] {user proofs};
      \node (spec_user_join) at (0.5, 3) [shape=coordinate] {};
      \node (raw) at (0.5, 4) [entity, positive] {raw code equations};
      \node (pre) at (1.5, 4) [process, positive] {preprocessing};
      \node (eqn) at (2.5, 4) [entity, positive] {code equations};
      \node (iml) at (0.5, 0) [entity, positive] {intermediate program};
      \node (seri) at (1.5, 0) [process, positive] {serialisation};
      \node (SML) at (2.5, 3) [entity, positive] {\sys{SML}};
      \node (OCaml) at (2.5, 2) [entity, positive] {\sys{OCaml}};
      \node (Haskell) at (2.5, 1) [entity, positive] {\sys{Haskell}};
      \node (Scala) at (2.5, 0) [entity, positive] {\sys{Scala}};
      \draw [semithick] (spec) -- (spec_user_join);
      \draw [semithick] (user) -- (spec_user_join);
      \draw [-diamond, semithick] (spec_user_join) -- (raw);
      \draw [arrow] (raw) -- (pre);
      \draw [arrow] (pre) -- (eqn);
      \draw [arrow] (eqn) -- node (transl) [process, positive] {translation} (iml);
      \draw [arrow] (iml) -- (seri);
      \draw [arrow] (seri) -- (SML);
      \draw [arrow] (seri) -- (OCaml);
      \draw [arrow] (seri) -- (Haskell);
      \draw [arrow] (seri) -- (Scala);
    \end{tikzpicture}
    \caption{Code generator architecture}
    \label{fig:arch}
  \end{figure}

  \noindent Central to code generation is the notion of \emph{code
  equations}.  A code equation as a first approximation is a theorem
  of the form @{text "f t\<^sub>1 t\<^sub>2 \<dots> t\<^sub>n \<equiv> t"} (an equation headed by a
  constant @{text f} with arguments @{text "t\<^sub>1 t\<^sub>2 \<dots> t\<^sub>n"} and right
  hand side @{text t}).

  \begin{itemize}

    \item Starting point of code generation is a collection of (raw)
      code equations in a theory. It is not relevant where they stem
      from, but typically they were either produced by specification
      tools or proved explicitly by the user.
      
    \item These raw code equations can be subjected to theorem
      transformations.  This \qn{preprocessor} (see
      \secref{sec:preproc}) can apply the full expressiveness of
      ML-based theorem transformations to code generation.  The result
      of preprocessing is a structured collection of code equations.

    \item These code equations are \qn{translated} to a program in an
      abstract intermediate language.  Think of it as a kind of
      \qt{Mini-Haskell} with four \qn{statements}: @{text data} (for
      datatypes), @{text fun} (stemming from code equations), also
      @{text class} and @{text inst} (for type classes).

    \item Finally, the abstract program is \qn{serialised} into
      concrete source code of a target language.  This step only
      produces concrete syntax but does not change the program in
      essence; all conceptual transformations occur in the translation
      step.

  \end{itemize}

  \noindent From these steps, only the last two are carried out
  outside the logic; by keeping this layer as thin as possible, the
  amount of code to trust is kept to a minimum.
*}


subsection {* The pre- and postprocessor \label{sec:preproc} *}

text {*
  Before selected function theorems are turned into abstract code, a
  chain of definitional transformation steps is carried out:
  \emph{preprocessing}.  The preprocessor consists of two
  components: a \emph{simpset} and \emph{function transformers}.

  The preprocessor simpset has a disparate brother, the
  \emph{postprocessor simpset}.
  In the theory-to-code scenario depicted in the picture 
  above, it plays no role.  But if generated code is used
  to evaluate expressions (cf.~\secref{sec:evaluation}), the
  postprocessor simpset is applied to the resulting expression before this
  is turned back.

  The pre- and postprocessor \emph{simpsets} can apply the
  full generality of the Isabelle
  simplifier.  Due to the interpretation of theorems as code
  equations, rewrites are applied to the right hand side and the
  arguments of the left hand side of an equation, but never to the
  constant heading the left hand side.

  Pre- and postprocessor can be setup to broker between
  expressions suitable for logical reasoning and expressions 
  suitable for execution.  As example, take list membership; logically
  is is just expressed as @{term "x \<in> set xs"}.  But for execution
  the intermediate set is not desirable.  Hence the following
  specification:
*}

definition %quote member :: "'a list \<Rightarrow> 'a \<Rightarrow> bool"
where
  [code_abbrev]: "member xs x \<longleftrightarrow> x \<in> set xs"

text {*
  \noindent The \emph{@{attribute code_abbrev} attribute} declares
  its theorem a rewrite rule for the postprocessor and the symmetric
  of its theorem as rewrite rule for the preprocessor.  Together,
  this has the effect that expressions @{thm (rhs) member_def [no_vars]}
  are replaced by @{thm (lhs) member_def [no_vars]} in generated code, but
  are turned back into @{thm (rhs) member_def [no_vars]} if generated
  code is used for evaluation.

  Rewrite rules for pre- or postprocessor may be
  declared independently using \emph{@{attribute code_unfold}}
  or \emph{@{attribute code_post}} respectively.

  \emph{Function transformers} provide a very general
  interface, transforming a list of function theorems to another list
  of function theorems, provided that neither the heading constant nor
  its type change.  The @{term "0\<Colon>nat"} / @{const Suc} pattern
  used in theory @{text Code_Abstract_Nat} (see \secref{abstract_nat})
  uses this interface.

  \noindent The current setup of the pre- and postprocessor may be inspected
  using the @{command_def print_codeproc} command.  @{command_def
  code_thms} (see \secref{sec:equations}) provides a convenient
  mechanism to inspect the impact of a preprocessor setup on code
  equations.
*}


subsection {* Understanding code equations \label{sec:equations} *}

text {*
  As told in \secref{sec:principle}, the notion of code equations is
  vital to code generation.  Indeed most problems which occur in
  practice can be resolved by an inspection of the underlying code
  equations.

  It is possible to exchange the default code equations for constants
  by explicitly proving alternative ones:
*}

lemma %quote [code]:
  "dequeue (AQueue xs []) =
     (if xs = [] then (None, AQueue [] [])
       else dequeue (AQueue [] (rev xs)))"
  "dequeue (AQueue xs (y # ys)) =
     (Some y, AQueue xs ys)"
  by (cases xs, simp_all) (cases "rev xs", simp_all)

text {*
  \noindent The annotation @{text "[code]"} is an @{text attribute}
  which states that the given theorems should be considered as code
  equations for a @{text fun} statement -- the corresponding constant
  is determined syntactically.  The resulting code:
*}

text %quotetypewriter {*
  @{code_stmts dequeue (consts) dequeue (Haskell)}
*}

text {*
  \noindent You may note that the equality test @{term "xs = []"} has
  been replaced by the predicate @{term "List.null xs"}.  This is due
  to the default setup of the \qn{preprocessor}.

  This possibility to select arbitrary code equations is the key
  technique for program and datatype refinement (see
  \secref{sec:refinement}).

  Due to the preprocessor, there is the distinction of raw code
  equations (before preprocessing) and code equations (after
  preprocessing).

  The first can be listed (among other data) using the @{command_def
  print_codesetup} command.

  The code equations after preprocessing are already are blueprint of
  the generated program and can be inspected using the @{command
  code_thms} command:
*}

code_thms %quote dequeue

text {*
  \noindent This prints a table with the code equations for @{const
  dequeue}, including \emph{all} code equations those equations depend
  on recursively.  These dependencies themselves can be visualized using
  the @{command_def code_deps} command.
*}


subsection {* Equality *}

text {*
  Implementation of equality deserves some attention.  Here an example
  function involving polymorphic equality:
*}

primrec %quote collect_duplicates :: "'a list \<Rightarrow> 'a list \<Rightarrow> 'a list \<Rightarrow> 'a list" where
  "collect_duplicates xs ys [] = xs"
| "collect_duplicates xs ys (z#zs) = (if z \<in> set xs
    then if z \<in> set ys
      then collect_duplicates xs ys zs
      else collect_duplicates xs (z#ys) zs
    else collect_duplicates (z#xs) (z#ys) zs)"

text {*
  \noindent During preprocessing, the membership test is rewritten,
  resulting in @{const List.member}, which itself performs an explicit
  equality check, as can be seen in the corresponding @{text SML} code:
*}

text %quotetypewriter {*
  @{code_stmts collect_duplicates (SML)}
*}

text {*
  \noindent Obviously, polymorphic equality is implemented the Haskell
  way using a type class.  How is this achieved?  HOL introduces an
  explicit class @{class equal} with a corresponding operation @{const
  HOL.equal} such that @{thm equal [no_vars]}.  The preprocessing
  framework does the rest by propagating the @{class equal} constraints
  through all dependent code equations.  For datatypes, instances of
  @{class equal} are implicitly derived when possible.  For other types,
  you may instantiate @{text equal} manually like any other type class.
*}


subsection {* Explicit partiality \label{sec:partiality} *}

text {*
  Partiality usually enters the game by partial patterns, as
  in the following example, again for amortised queues:
*}

definition %quote strict_dequeue :: "'a queue \<Rightarrow> 'a \<times> 'a queue" where
  "strict_dequeue q = (case dequeue q
    of (Some x, q') \<Rightarrow> (x, q'))"

lemma %quote strict_dequeue_AQueue [code]:
  "strict_dequeue (AQueue xs (y # ys)) = (y, AQueue xs ys)"
  "strict_dequeue (AQueue xs []) =
    (case rev xs of y # ys \<Rightarrow> (y, AQueue [] ys))"
  by (simp_all add: strict_dequeue_def) (cases xs, simp_all split: list.split)

text {*
  \noindent In the corresponding code, there is no equation
  for the pattern @{term "AQueue [] []"}:
*}

text %quotetypewriter {*
  @{code_stmts strict_dequeue (consts) strict_dequeue (Haskell)}
*}

text {*
  \noindent In some cases it is desirable to have this
  pseudo-\qt{partiality} more explicitly, e.g.~as follows:
*}

axiomatization %quote empty_queue :: 'a

definition %quote strict_dequeue' :: "'a queue \<Rightarrow> 'a \<times> 'a queue" where
  "strict_dequeue' q = (case dequeue q of (Some x, q') \<Rightarrow> (x, q') | _ \<Rightarrow> empty_queue)"

lemma %quote strict_dequeue'_AQueue [code]:
  "strict_dequeue' (AQueue xs []) = (if xs = [] then empty_queue
     else strict_dequeue' (AQueue [] (rev xs)))"
  "strict_dequeue' (AQueue xs (y # ys)) =
     (y, AQueue xs ys)"
  by (simp_all add: strict_dequeue'_def split: list.splits)

text {*
  Observe that on the right hand side of the definition of @{const
  "strict_dequeue'"}, the unspecified constant @{const empty_queue} occurs.

  Normally, if constants without any code equations occur in a
  program, the code generator complains (since in most cases this is
  indeed an error).  But such constants can also be thought
  of as function definitions which always fail,
  since there is never a successful pattern match on the left hand
  side.  In order to categorise a constant into that category
  explicitly, use the @{attribute code} attribute with
  @{text abort}:
*}

declare %quote [[code abort: empty_queue]]

text {*
  \noindent Then the code generator will just insert an error or
  exception at the appropriate position:
*}

text %quotetypewriter {*
  @{code_stmts strict_dequeue' (consts) empty_queue strict_dequeue' (Haskell)}
*}

text {*
  \noindent This feature however is rarely needed in practice.  Note
  also that the HOL default setup already declares @{const undefined},
  which is most likely to be used in such situations, as
  @{text "code abort"}.
*}


subsection {* If something goes utterly wrong \label{sec:utterly_wrong} *}

text {*
  Under certain circumstances, the code generator fails to produce
  code entirely.  To debug these, the following hints may prove
  helpful:

  \begin{description}

    \ditem{\emph{Check with a different target language}.}  Sometimes
      the situation gets more clear if you switch to another target
      language; the code generated there might give some hints what
      prevents the code generator to produce code for the desired
      language.

    \ditem{\emph{Inspect code equations}.}  Code equations are the central
      carrier of code generation.  Most problems occurring while generating
      code can be traced to single equations which are printed as part of
      the error message.  A closer inspection of those may offer the key
      for solving issues (cf.~\secref{sec:equations}).

    \ditem{\emph{Inspect preprocessor setup}.}  The preprocessor might
      transform code equations unexpectedly; to understand an
      inspection of its setup is necessary (cf.~\secref{sec:preproc}).

    \ditem{\emph{Generate exceptions}.}  If the code generator
      complains about missing code equations, in can be helpful to
      implement the offending constants as exceptions
      (cf.~\secref{sec:partiality}); this allows at least for a formal
      generation of code, whose inspection may then give clues what is
      wrong.

    \ditem{\emph{Remove offending code equations}.}  If code
      generation is prevented by just a single equation, this can be
      removed (cf.~\secref{sec:equations}) to allow formal code
      generation, whose result in turn can be used to trace the
      problem.  The most prominent case here are mismatches in type
      class signatures (\qt{wellsortedness error}).

  \end{description}
*}

end