theory Foundations imports Introduction begin section ‹Code generation foundations \label{sec:foundations}› subsection ‹Code generator architecture \label{sec:architecture}› text ‹ The code generator is actually a framework consisting of different components which can be customised individually. Conceptually all components operate on Isabelle's logic framework @{theory Pure}. Practically, the object logic @{theory HOL} provides the necessary facilities to make use of the code generator, mainly since it is an extension of @{theory Pure}. The constellation of the different components is visualized in the following picture. \begin{figure}[h] \def\sys#1{\emph{#1}} \begin{tikzpicture}[x = 4cm, y = 1cm] \tikzstyle positive=[color = black, fill = white]; \tikzstyle negative=[color = white, fill = black]; \tikzstyle entity=[rounded corners, draw, thick]; \tikzstyle process=[ellipse, draw, thick]; \tikzstyle arrow=[-stealth, semithick]; \node (spec) at (0, 3) [entity, positive] {specification tools}; \node (user) at (1, 3) [entity, positive] {user proofs}; \node (spec_user_join) at (0.5, 3) [shape=coordinate] {}; \node (raw) at (0.5, 4) [entity, positive] {raw code equations}; \node (pre) at (1.5, 4) [process, positive] {preprocessing}; \node (eqn) at (2.5, 4) [entity, positive] {code equations}; \node (iml) at (0.5, 0) [entity, positive] {intermediate program}; \node (seri) at (1.5, 0) [process, positive] {serialisation}; \node (SML) at (2.5, 3) [entity, positive] {\sys{SML}}; \node (OCaml) at (2.5, 2) [entity, positive] {\sys{OCaml}}; \node (Haskell) at (2.5, 1) [entity, positive] {\sys{Haskell}}; \node (Scala) at (2.5, 0) [entity, positive] {\sys{Scala}}; \draw [semithick] (spec) -- (spec_user_join); \draw [semithick] (user) -- (spec_user_join); \draw [-diamond, semithick] (spec_user_join) -- (raw); \draw [arrow] (raw) -- (pre); \draw [arrow] (pre) -- (eqn); \draw [arrow] (eqn) -- node (transl) [process, positive] {translation} (iml); \draw [arrow] (iml) -- (seri); \draw [arrow] (seri) -- (SML); \draw [arrow] (seri) -- (OCaml); \draw [arrow] (seri) -- (Haskell); \draw [arrow] (seri) -- (Scala); \end{tikzpicture} \caption{Code generator architecture} \label{fig:arch} \end{figure} \noindent Central to code generation is the notion of \emph{code equations}. A code equation as a first approximation is a theorem of the form @{text "f t⇩1 t⇩2 … t⇩n ≡ t"} (an equation headed by a constant @{text f} with arguments @{text "t⇩1 t⇩2 … t⇩n"} and right hand side @{text t}). \begin{itemize} \item Starting point of code generation is a collection of (raw) code equations in a theory. It is not relevant where they stem from, but typically they were either produced by specification tools or proved explicitly by the user. \item These raw code equations can be subjected to theorem transformations. This \qn{preprocessor} (see \secref{sec:preproc}) can apply the full expressiveness of ML-based theorem transformations to code generation. The result of preprocessing is a structured collection of code equations. \item These code equations are \qn{translated} to a program in an abstract intermediate language. Think of it as a kind of \qt{Mini-Haskell} with four \qn{statements}: @{text data} (for datatypes), @{text fun} (stemming from code equations), also @{text class} and @{text inst} (for type classes). \item Finally, the abstract program is \qn{serialised} into concrete source code of a target language. This step only produces concrete syntax but does not change the program in essence; all conceptual transformations occur in the translation step. \end{itemize} \noindent From these steps, only the last two are carried out outside the logic; by keeping this layer as thin as possible, the amount of code to trust is kept to a minimum. › subsection ‹The pre- and postprocessor \label{sec:preproc}› text ‹ Before selected function theorems are turned into abstract code, a chain of definitional transformation steps is carried out: \emph{preprocessing}. The preprocessor consists of two components: a \emph{simpset} and \emph{function transformers}. The preprocessor simpset has a disparate brother, the \emph{postprocessor simpset}. In the theory-to-code scenario depicted in the picture above, it plays no role. But if generated code is used to evaluate expressions (cf.~\secref{sec:evaluation}), the postprocessor simpset is applied to the resulting expression before this is turned back. The pre- and postprocessor \emph{simpsets} can apply the full generality of the Isabelle simplifier. Due to the interpretation of theorems as code equations, rewrites are applied to the right hand side and the arguments of the left hand side of an equation, but never to the constant heading the left hand side. Pre- and postprocessor can be setup to transfer between expressions suitable for logical reasoning and expressions suitable for execution. As example, take list membership; logically it is expressed as @{term "x ∈ set xs"}. But for execution the intermediate set is not desirable. Hence the following specification: › definition %quote member :: "'a list ⇒ 'a ⇒ bool" where [code_abbrev]: "member xs x ⟷ x ∈ set xs" text ‹ \noindent The \emph{@{attribute code_abbrev} attribute} declares its theorem a rewrite rule for the postprocessor and the symmetric of its theorem as rewrite rule for the preprocessor. Together, this has the effect that expressions @{thm (rhs) member_def [no_vars]} are replaced by @{thm (lhs) member_def [no_vars]} in generated code, but are turned back into @{thm (rhs) member_def [no_vars]} if generated code is used for evaluation. Rewrite rules for pre- or postprocessor may be declared independently using \emph{@{attribute code_unfold}} or \emph{@{attribute code_post}} respectively. \emph{Function transformers} provide a very general interface, transforming a list of function theorems to another list of function theorems, provided that neither the heading constant nor its type change. The @{term "0::nat"} / @{const Suc} pattern used in theory @{text Code_Abstract_Nat} (see \secref{abstract_nat}) uses this interface. \noindent The current setup of the pre- and postprocessor may be inspected using the @{command_def print_codeproc} command. @{command_def code_thms} (see \secref{sec:equations}) provides a convenient mechanism to inspect the impact of a preprocessor setup on code equations. › subsection ‹Understanding code equations \label{sec:equations}› text ‹ As told in \secref{sec:principle}, the notion of code equations is vital to code generation. Indeed most problems which occur in practice can be resolved by an inspection of the underlying code equations. It is possible to exchange the default code equations for constants by explicitly proving alternative ones: › lemma %quote [code]: "dequeue (AQueue xs []) = (if xs = [] then (None, AQueue [] []) else dequeue (AQueue [] (rev xs)))" "dequeue (AQueue xs (y # ys)) = (Some y, AQueue xs ys)" by (cases xs, simp_all) (cases "rev xs", simp_all) text ‹ \noindent The annotation @{text "[code]"} is an @{text attribute} which states that the given theorems should be considered as code equations for a @{text fun} statement -- the corresponding constant is determined syntactically. The resulting code: › text %quotetypewriter ‹ @{code_stmts dequeue (consts) dequeue (Haskell)} › text ‹ \noindent You may note that the equality test @{term "xs = []"} has been replaced by the predicate @{term "List.null xs"}. This is due to the default setup of the \qn{preprocessor}. This possibility to select arbitrary code equations is the key technique for program and datatype refinement (see \secref{sec:refinement}). Due to the preprocessor, there is the distinction of raw code equations (before preprocessing) and code equations (after preprocessing). The first can be listed (among other data) using the @{command_def print_codesetup} command. The code equations after preprocessing are already are blueprint of the generated program and can be inspected using the @{command code_thms} command: › code_thms %quote dequeue text ‹ \noindent This prints a table with the code equations for @{const dequeue}, including \emph{all} code equations those equations depend on recursively. These dependencies themselves can be visualized using the @{command_def code_deps} command. › subsection ‹Equality› text ‹ Implementation of equality deserves some attention. Here an example function involving polymorphic equality: › primrec %quote collect_duplicates :: "'a list ⇒ 'a list ⇒ 'a list ⇒ 'a list" where "collect_duplicates xs ys [] = xs" | "collect_duplicates xs ys (z#zs) = (if z ∈ set xs then if z ∈ set ys then collect_duplicates xs ys zs else collect_duplicates xs (z#ys) zs else collect_duplicates (z#xs) (z#ys) zs)" text ‹ \noindent During preprocessing, the membership test is rewritten, resulting in @{const List.member}, which itself performs an explicit equality check, as can be seen in the corresponding @{text SML} code: › text %quotetypewriter ‹ @{code_stmts collect_duplicates (SML)} › text ‹ \noindent Obviously, polymorphic equality is implemented the Haskell way using a type class. How is this achieved? HOL introduces an explicit class @{class equal} with a corresponding operation @{const HOL.equal} such that @{thm equal [no_vars]}. The preprocessing framework does the rest by propagating the @{class equal} constraints through all dependent code equations. For datatypes, instances of @{class equal} are implicitly derived when possible. For other types, you may instantiate @{text equal} manually like any other type class. › subsection ‹Explicit partiality \label{sec:partiality}› text ‹ Partiality usually enters the game by partial patterns, as in the following example, again for amortised queues: › definition %quote strict_dequeue :: "'a queue ⇒ 'a × 'a queue" where "strict_dequeue q = (case dequeue q of (Some x, q') ⇒ (x, q'))" lemma %quote strict_dequeue_AQueue [code]: "strict_dequeue (AQueue xs (y # ys)) = (y, AQueue xs ys)" "strict_dequeue (AQueue xs []) = (case rev xs of y # ys ⇒ (y, AQueue [] ys))" by (simp_all add: strict_dequeue_def) (cases xs, simp_all split: list.split) text ‹ \noindent In the corresponding code, there is no equation for the pattern @{term "AQueue [] []"}: › text %quotetypewriter ‹ @{code_stmts strict_dequeue (consts) strict_dequeue (Haskell)} › text ‹ \noindent In some cases it is desirable to have this pseudo-\qt{partiality} more explicitly, e.g.~as follows: › axiomatization %quote empty_queue :: 'a definition %quote strict_dequeue' :: "'a queue ⇒ 'a × 'a queue" where "strict_dequeue' q = (case dequeue q of (Some x, q') ⇒ (x, q') | _ ⇒ empty_queue)" lemma %quote strict_dequeue'_AQueue [code]: "strict_dequeue' (AQueue xs []) = (if xs = [] then empty_queue else strict_dequeue' (AQueue [] (rev xs)))" "strict_dequeue' (AQueue xs (y # ys)) = (y, AQueue xs ys)" by (simp_all add: strict_dequeue'_def split: list.splits) text ‹ Observe that on the right hand side of the definition of @{const "strict_dequeue'"}, the unspecified constant @{const empty_queue} occurs. Normally, if constants without any code equations occur in a program, the code generator complains (since in most cases this is indeed an error). But such constants can also be thought of as function definitions which always fail, since there is never a successful pattern match on the left hand side. In order to categorise a constant into that category explicitly, use the @{attribute code} attribute with @{text abort}: › declare %quote [[code abort: empty_queue]] text ‹ \noindent Then the code generator will just insert an error or exception at the appropriate position: › text %quotetypewriter ‹ @{code_stmts strict_dequeue' (consts) empty_queue strict_dequeue' (Haskell)} › text ‹ \noindent This feature however is rarely needed in practice. Note also that the HOL default setup already declares @{const undefined}, which is most likely to be used in such situations, as @{text "code abort"}. › subsection ‹If something goes utterly wrong \label{sec:utterly_wrong}› text ‹ Under certain circumstances, the code generator fails to produce code entirely. To debug these, the following hints may prove helpful: \begin{description} \ditem{\emph{Check with a different target language}.} Sometimes the situation gets more clear if you switch to another target language; the code generated there might give some hints what prevents the code generator to produce code for the desired language. \ditem{\emph{Inspect code equations}.} Code equations are the central carrier of code generation. Most problems occurring while generating code can be traced to single equations which are printed as part of the error message. A closer inspection of those may offer the key for solving issues (cf.~\secref{sec:equations}). \ditem{\emph{Inspect preprocessor setup}.} The preprocessor might transform code equations unexpectedly; to understand an inspection of its setup is necessary (cf.~\secref{sec:preproc}). \ditem{\emph{Generate exceptions}.} If the code generator complains about missing code equations, in can be helpful to implement the offending constants as exceptions (cf.~\secref{sec:partiality}); this allows at least for a formal generation of code, whose inspection may then give clues what is wrong. \ditem{\emph{Remove offending code equations}.} If code generation is prevented by just a single equation, this can be removed (cf.~\secref{sec:equations}) to allow formal code generation, whose result in turn can be used to trace the problem. The most prominent case here are mismatches in type class signatures (\qt{wellsortedness error}). \end{description} › end