src/Doc/Codegen/Foundations.thy
author blanchet
Tue Nov 07 15:16:42 2017 +0100 (20 months ago)
changeset 67022 49309fe530fd
parent 61781 e1e6bb36b27a
child 68254 3a7f257dcac7
permissions -rw-r--r--
more robust parsing for THF proofs (esp. polymorphic Leo-III proofs)
     1 theory Foundations
     2 imports Introduction
     3 begin
     4 
     5 section \<open>Code generation foundations \label{sec:foundations}\<close>
     6 
     7 subsection \<open>Code generator architecture \label{sec:architecture}\<close>
     8 
     9 text \<open>
    10   The code generator is actually a framework consisting of different
    11   components which can be customised individually.
    12 
    13   Conceptually all components operate on Isabelle's logic framework
    14   @{theory Pure}.  Practically, the object logic @{theory HOL}
    15   provides the necessary facilities to make use of the code generator,
    16   mainly since it is an extension of @{theory Pure}.
    17 
    18   The constellation of the different components is visualized in the
    19   following picture.
    20 
    21   \begin{figure}[h]
    22     \def\sys#1{\emph{#1}}
    23     \begin{tikzpicture}[x = 4cm, y = 1cm]
    24       \tikzstyle positive=[color = black, fill = white];
    25       \tikzstyle negative=[color = white, fill = black];
    26       \tikzstyle entity=[rounded corners, draw, thick];
    27       \tikzstyle process=[ellipse, draw, thick];
    28       \tikzstyle arrow=[-stealth, semithick];
    29       \node (spec) at (0, 3) [entity, positive] {specification tools};
    30       \node (user) at (1, 3) [entity, positive] {user proofs};
    31       \node (spec_user_join) at (0.5, 3) [shape=coordinate] {};
    32       \node (raw) at (0.5, 4) [entity, positive] {raw code equations};
    33       \node (pre) at (1.5, 4) [process, positive] {preprocessing};
    34       \node (eqn) at (2.5, 4) [entity, positive] {code equations};
    35       \node (iml) at (0.5, 0) [entity, positive] {intermediate program};
    36       \node (seri) at (1.5, 0) [process, positive] {serialisation};
    37       \node (SML) at (2.5, 3) [entity, positive] {\sys{SML}};
    38       \node (OCaml) at (2.5, 2) [entity, positive] {\sys{OCaml}};
    39       \node (Haskell) at (2.5, 1) [entity, positive] {\sys{Haskell}};
    40       \node (Scala) at (2.5, 0) [entity, positive] {\sys{Scala}};
    41       \draw [semithick] (spec) -- (spec_user_join);
    42       \draw [semithick] (user) -- (spec_user_join);
    43       \draw [-diamond, semithick] (spec_user_join) -- (raw);
    44       \draw [arrow] (raw) -- (pre);
    45       \draw [arrow] (pre) -- (eqn);
    46       \draw [arrow] (eqn) -- node (transl) [process, positive] {translation} (iml);
    47       \draw [arrow] (iml) -- (seri);
    48       \draw [arrow] (seri) -- (SML);
    49       \draw [arrow] (seri) -- (OCaml);
    50       \draw [arrow] (seri) -- (Haskell);
    51       \draw [arrow] (seri) -- (Scala);
    52     \end{tikzpicture}
    53     \caption{Code generator architecture}
    54     \label{fig:arch}
    55   \end{figure}
    56 
    57   \noindent Central to code generation is the notion of \emph{code
    58   equations}.  A code equation as a first approximation is a theorem
    59   of the form @{text "f t\<^sub>1 t\<^sub>2 \<dots> t\<^sub>n \<equiv> t"} (an equation headed by a
    60   constant @{text f} with arguments @{text "t\<^sub>1 t\<^sub>2 \<dots> t\<^sub>n"} and right
    61   hand side @{text t}).
    62 
    63   \begin{itemize}
    64 
    65     \item Starting point of code generation is a collection of (raw)
    66       code equations in a theory. It is not relevant where they stem
    67       from, but typically they were either produced by specification
    68       tools or proved explicitly by the user.
    69       
    70     \item These raw code equations can be subjected to theorem
    71       transformations.  This \qn{preprocessor} (see
    72       \secref{sec:preproc}) can apply the full expressiveness of
    73       ML-based theorem transformations to code generation.  The result
    74       of preprocessing is a structured collection of code equations.
    75 
    76     \item These code equations are \qn{translated} to a program in an
    77       abstract intermediate language.  Think of it as a kind of
    78       \qt{Mini-Haskell} with four \qn{statements}: @{text data} (for
    79       datatypes), @{text fun} (stemming from code equations), also
    80       @{text class} and @{text inst} (for type classes).
    81 
    82     \item Finally, the abstract program is \qn{serialised} into
    83       concrete source code of a target language.  This step only
    84       produces concrete syntax but does not change the program in
    85       essence; all conceptual transformations occur in the translation
    86       step.
    87 
    88   \end{itemize}
    89 
    90   \noindent From these steps, only the last two are carried out
    91   outside the logic; by keeping this layer as thin as possible, the
    92   amount of code to trust is kept to a minimum.
    93 \<close>
    94 
    95 
    96 subsection \<open>The pre- and postprocessor \label{sec:preproc}\<close>
    97 
    98 text \<open>
    99   Before selected function theorems are turned into abstract code, a
   100   chain of definitional transformation steps is carried out:
   101   \emph{preprocessing}.  The preprocessor consists of two
   102   components: a \emph{simpset} and \emph{function transformers}.
   103 
   104   The preprocessor simpset has a disparate brother, the
   105   \emph{postprocessor simpset}.
   106   In the theory-to-code scenario depicted in the picture 
   107   above, it plays no role.  But if generated code is used
   108   to evaluate expressions (cf.~\secref{sec:evaluation}), the
   109   postprocessor simpset is applied to the resulting expression before this
   110   is turned back.
   111 
   112   The pre- and postprocessor \emph{simpsets} can apply the
   113   full generality of the Isabelle
   114   simplifier.  Due to the interpretation of theorems as code
   115   equations, rewrites are applied to the right hand side and the
   116   arguments of the left hand side of an equation, but never to the
   117   constant heading the left hand side.
   118 
   119   Pre- and postprocessor can be setup to transfer between
   120   expressions suitable for logical reasoning and expressions 
   121   suitable for execution.  As example, take list membership; logically
   122   it is expressed as @{term "x \<in> set xs"}.  But for execution
   123   the intermediate set is not desirable.  Hence the following
   124   specification:
   125 \<close>
   126 
   127 definition %quote member :: "'a list \<Rightarrow> 'a \<Rightarrow> bool"
   128 where
   129   [code_abbrev]: "member xs x \<longleftrightarrow> x \<in> set xs"
   130 
   131 text \<open>
   132   \noindent The \emph{@{attribute code_abbrev} attribute} declares
   133   its theorem a rewrite rule for the postprocessor and the symmetric
   134   of its theorem as rewrite rule for the preprocessor.  Together,
   135   this has the effect that expressions @{thm (rhs) member_def [no_vars]}
   136   are replaced by @{thm (lhs) member_def [no_vars]} in generated code, but
   137   are turned back into @{thm (rhs) member_def [no_vars]} if generated
   138   code is used for evaluation.
   139 
   140   Rewrite rules for pre- or postprocessor may be
   141   declared independently using \emph{@{attribute code_unfold}}
   142   or \emph{@{attribute code_post}} respectively.
   143 
   144   \emph{Function transformers} provide a very general
   145   interface, transforming a list of function theorems to another list
   146   of function theorems, provided that neither the heading constant nor
   147   its type change.  The @{term "0::nat"} / @{const Suc} pattern
   148   used in theory @{text Code_Abstract_Nat} (see \secref{abstract_nat})
   149   uses this interface.
   150 
   151   \noindent The current setup of the pre- and postprocessor may be inspected
   152   using the @{command_def print_codeproc} command.  @{command_def
   153   code_thms} (see \secref{sec:equations}) provides a convenient
   154   mechanism to inspect the impact of a preprocessor setup on code
   155   equations.
   156 \<close>
   157 
   158 
   159 subsection \<open>Understanding code equations \label{sec:equations}\<close>
   160 
   161 text \<open>
   162   As told in \secref{sec:principle}, the notion of code equations is
   163   vital to code generation.  Indeed most problems which occur in
   164   practice can be resolved by an inspection of the underlying code
   165   equations.
   166 
   167   It is possible to exchange the default code equations for constants
   168   by explicitly proving alternative ones:
   169 \<close>
   170 
   171 lemma %quote [code]:
   172   "dequeue (AQueue xs []) =
   173      (if xs = [] then (None, AQueue [] [])
   174        else dequeue (AQueue [] (rev xs)))"
   175   "dequeue (AQueue xs (y # ys)) =
   176      (Some y, AQueue xs ys)"
   177   by (cases xs, simp_all) (cases "rev xs", simp_all)
   178 
   179 text \<open>
   180   \noindent The annotation @{text "[code]"} is an @{text attribute}
   181   which states that the given theorems should be considered as code
   182   equations for a @{text fun} statement -- the corresponding constant
   183   is determined syntactically.  The resulting code:
   184 \<close>
   185 
   186 text %quotetypewriter \<open>
   187   @{code_stmts dequeue (consts) dequeue (Haskell)}
   188 \<close>
   189 
   190 text \<open>
   191   \noindent You may note that the equality test @{term "xs = []"} has
   192   been replaced by the predicate @{term "List.null xs"}.  This is due
   193   to the default setup of the \qn{preprocessor}.
   194 
   195   This possibility to select arbitrary code equations is the key
   196   technique for program and datatype refinement (see
   197   \secref{sec:refinement}).
   198 
   199   Due to the preprocessor, there is the distinction of raw code
   200   equations (before preprocessing) and code equations (after
   201   preprocessing).
   202 
   203   The first can be listed (among other data) using the @{command_def
   204   print_codesetup} command.
   205 
   206   The code equations after preprocessing are already are blueprint of
   207   the generated program and can be inspected using the @{command
   208   code_thms} command:
   209 \<close>
   210 
   211 code_thms %quote dequeue
   212 
   213 text \<open>
   214   \noindent This prints a table with the code equations for @{const
   215   dequeue}, including \emph{all} code equations those equations depend
   216   on recursively.  These dependencies themselves can be visualized using
   217   the @{command_def code_deps} command.
   218 \<close>
   219 
   220 
   221 subsection \<open>Equality\<close>
   222 
   223 text \<open>
   224   Implementation of equality deserves some attention.  Here an example
   225   function involving polymorphic equality:
   226 \<close>
   227 
   228 primrec %quote collect_duplicates :: "'a list \<Rightarrow> 'a list \<Rightarrow> 'a list \<Rightarrow> 'a list" where
   229   "collect_duplicates xs ys [] = xs"
   230 | "collect_duplicates xs ys (z#zs) = (if z \<in> set xs
   231     then if z \<in> set ys
   232       then collect_duplicates xs ys zs
   233       else collect_duplicates xs (z#ys) zs
   234     else collect_duplicates (z#xs) (z#ys) zs)"
   235 
   236 text \<open>
   237   \noindent During preprocessing, the membership test is rewritten,
   238   resulting in @{const List.member}, which itself performs an explicit
   239   equality check, as can be seen in the corresponding @{text SML} code:
   240 \<close>
   241 
   242 text %quotetypewriter \<open>
   243   @{code_stmts collect_duplicates (SML)}
   244 \<close>
   245 
   246 text \<open>
   247   \noindent Obviously, polymorphic equality is implemented the Haskell
   248   way using a type class.  How is this achieved?  HOL introduces an
   249   explicit class @{class equal} with a corresponding operation @{const
   250   HOL.equal} such that @{thm equal [no_vars]}.  The preprocessing
   251   framework does the rest by propagating the @{class equal} constraints
   252   through all dependent code equations.  For datatypes, instances of
   253   @{class equal} are implicitly derived when possible.  For other types,
   254   you may instantiate @{text equal} manually like any other type class.
   255 \<close>
   256 
   257 
   258 subsection \<open>Explicit partiality \label{sec:partiality}\<close>
   259 
   260 text \<open>
   261   Partiality usually enters the game by partial patterns, as
   262   in the following example, again for amortised queues:
   263 \<close>
   264 
   265 definition %quote strict_dequeue :: "'a queue \<Rightarrow> 'a \<times> 'a queue" where
   266   "strict_dequeue q = (case dequeue q
   267     of (Some x, q') \<Rightarrow> (x, q'))"
   268 
   269 lemma %quote strict_dequeue_AQueue [code]:
   270   "strict_dequeue (AQueue xs (y # ys)) = (y, AQueue xs ys)"
   271   "strict_dequeue (AQueue xs []) =
   272     (case rev xs of y # ys \<Rightarrow> (y, AQueue [] ys))"
   273   by (simp_all add: strict_dequeue_def) (cases xs, simp_all split: list.split)
   274 
   275 text \<open>
   276   \noindent In the corresponding code, there is no equation
   277   for the pattern @{term "AQueue [] []"}:
   278 \<close>
   279 
   280 text %quotetypewriter \<open>
   281   @{code_stmts strict_dequeue (consts) strict_dequeue (Haskell)}
   282 \<close>
   283 
   284 text \<open>
   285   \noindent In some cases it is desirable to have this
   286   pseudo-\qt{partiality} more explicitly, e.g.~as follows:
   287 \<close>
   288 
   289 axiomatization %quote empty_queue :: 'a
   290 
   291 definition %quote strict_dequeue' :: "'a queue \<Rightarrow> 'a \<times> 'a queue" where
   292   "strict_dequeue' q = (case dequeue q of (Some x, q') \<Rightarrow> (x, q') | _ \<Rightarrow> empty_queue)"
   293 
   294 lemma %quote strict_dequeue'_AQueue [code]:
   295   "strict_dequeue' (AQueue xs []) = (if xs = [] then empty_queue
   296      else strict_dequeue' (AQueue [] (rev xs)))"
   297   "strict_dequeue' (AQueue xs (y # ys)) =
   298      (y, AQueue xs ys)"
   299   by (simp_all add: strict_dequeue'_def split: list.splits)
   300 
   301 text \<open>
   302   Observe that on the right hand side of the definition of @{const
   303   "strict_dequeue'"}, the unspecified constant @{const empty_queue} occurs.
   304 
   305   Normally, if constants without any code equations occur in a
   306   program, the code generator complains (since in most cases this is
   307   indeed an error).  But such constants can also be thought
   308   of as function definitions which always fail,
   309   since there is never a successful pattern match on the left hand
   310   side.  In order to categorise a constant into that category
   311   explicitly, use the @{attribute code} attribute with
   312   @{text abort}:
   313 \<close>
   314 
   315 declare %quote [[code abort: empty_queue]]
   316 
   317 text \<open>
   318   \noindent Then the code generator will just insert an error or
   319   exception at the appropriate position:
   320 \<close>
   321 
   322 text %quotetypewriter \<open>
   323   @{code_stmts strict_dequeue' (consts) empty_queue strict_dequeue' (Haskell)}
   324 \<close>
   325 
   326 text \<open>
   327   \noindent This feature however is rarely needed in practice.  Note
   328   also that the HOL default setup already declares @{const undefined},
   329   which is most likely to be used in such situations, as
   330   @{text "code abort"}.
   331 \<close>
   332 
   333 
   334 subsection \<open>If something goes utterly wrong \label{sec:utterly_wrong}\<close>
   335 
   336 text \<open>
   337   Under certain circumstances, the code generator fails to produce
   338   code entirely.  To debug these, the following hints may prove
   339   helpful:
   340 
   341   \begin{description}
   342 
   343     \ditem{\emph{Check with a different target language}.}  Sometimes
   344       the situation gets more clear if you switch to another target
   345       language; the code generated there might give some hints what
   346       prevents the code generator to produce code for the desired
   347       language.
   348 
   349     \ditem{\emph{Inspect code equations}.}  Code equations are the central
   350       carrier of code generation.  Most problems occurring while generating
   351       code can be traced to single equations which are printed as part of
   352       the error message.  A closer inspection of those may offer the key
   353       for solving issues (cf.~\secref{sec:equations}).
   354 
   355     \ditem{\emph{Inspect preprocessor setup}.}  The preprocessor might
   356       transform code equations unexpectedly; to understand an
   357       inspection of its setup is necessary (cf.~\secref{sec:preproc}).
   358 
   359     \ditem{\emph{Generate exceptions}.}  If the code generator
   360       complains about missing code equations, in can be helpful to
   361       implement the offending constants as exceptions
   362       (cf.~\secref{sec:partiality}); this allows at least for a formal
   363       generation of code, whose inspection may then give clues what is
   364       wrong.
   365 
   366     \ditem{\emph{Remove offending code equations}.}  If code
   367       generation is prevented by just a single equation, this can be
   368       removed (cf.~\secref{sec:equations}) to allow formal code
   369       generation, whose result in turn can be used to trace the
   370       problem.  The most prominent case here are mismatches in type
   371       class signatures (\qt{wellsortedness error}).
   372 
   373   \end{description}
   374 \<close>
   375 
   376 end