src/Tools/WWW_Find/doc/design.tex
changeset 33817 f6a4da31f2f1
child 36862 952b2b102a0a
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/src/Tools/WWW_Find/doc/design.tex	Fri Nov 20 18:36:44 2009 +1100
@@ -0,0 +1,330 @@
+% $Id$
+%
+% vim:nojs: tw=76 sw=4 sts=4 fo=awn fdm=marker
+%
+% 20090406 T. Bourke
+%	Original document.
+%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\documentclass[a4paper,draft]{article} % XXX
+%\documentclass[a4paper,final]{article}
+% Preamble
+\usepackage[T1]{fontenc}
+\usepackage{textcomp}
+\usepackage{ifdraft}
+
+\bibliographystyle{abbrv} % alpha
+
+\newcommand{\refsec}[1]{Section~\ref{sec:#1}}
+\newcommand{\reffig}[1]{Figure~\ref{fig:#1}}
+\newcommand{\reftab}[1]{Table~\ref{tab:#1}}
+
+%
+\usepackage{acronym}
+\usepackage{pst-all}
+\usepackage{url}
+
+\title{Isabelle find\_theorems on the web}
+\author{T. Bourke\thanks{NICTA}}
+
+\special{!/pdfmark where
+ {pop} {userdict /pdfmark /cleartomark load put} ifelse
+ [ /Author   (T. Bourke, NICTA)
+   /Title    (IsabelleWWW: find_theorems
+	      ($Date: 2008-10-03 16:09:02 +1000 (Fri, 03 Oct 2008) $))
+   /Subject  (Web interface to find_theorems)
+   /Keywords (isabelle, sml, http, www)
+   /DOCINFO pdfmark}
+
+\begin{document}
+% Title page and abstract
+\maketitle
+
+\begin{abstract}
+The Isabelle find\_theorems command processes queries against a theory 
+database and returns a list of matching theorems.
+It is usually given from the Proof General or command-line interface.
+This document describes a web server implementation.
+Two design alternatives are presented and an overview of an implementation 
+of one is described.
+\end{abstract}
+
+\section{Introduction}\label{sec:intro}
+
+This document describes the design and implementation of a web interface for 
+the Isabelle find\_theorems command.
+The design requirements and their consequences are detailed in \refsec{req}.
+Two architectures were considered: \begin{enumerate}
+
+\item \emph{one process}: integrate a web server into Isabelle.
+
+\item \emph{two processes}: run Isabelle as a web server module.
+
+\end{enumerate}
+A brief evaluation of the one process architecture is presented in 
+\refsec{oneproc}.
+An implementation of the two process is presented in \refsec{twoproc}.
+
+\section{Design requirements}\label{sec:req}
+
+The main requirements are:\begin{enumerate}
+\item The system will allow users to search for theorems from a web browser.
+\item It will allow queries across disparate Isabelle theories.
+\item It will, at a minimum, handle theories from the L4.verified project.
+\item It will run on a secure network.
+\item There will be at most ten simultaneous users.
+\end{enumerate}
+
+\noindent
+Several \emph{a priori} choices are fixed:\begin{enumerate}
+\item The search will run against an Isabelle heap.
+\item A single heap will be built from the theories to be searched.
+\item The system must be persistent, for two reasons: \begin{enumerate}
+    \item Isabelle is slow to start against large heaps.
+    \item Later enhancements may require tracking states at the server.
+\end{enumerate}
+\end{enumerate}
+
+\section{Evaluation: Isabelle web server}\label{sec:oneproc}
+
+The advantage of integrating a web server into Isabelle is that the 
+find\_theorems service could be provided by a single process, which, in 
+principle, would simplify administration.
+Implementing even a simple \ac{HTTP} service from scratch is an unacceptable 
+cost and fraught with potential problems and limitations.
+It is preferable to adapt an existing system.
+
+As Isabelle is written in \ac{SML}, only \ac{HTTP} services also written in 
+\ac{SML} can realistically be considered.
+In principle Isabelle compiles on both Poly/ML and \ac{SML/NJ}, but in 
+practice the former is faster, more widely used, and better supported.
+In particular, the L4.verified project does not build effectively on 
+\ac{SML/NJ}.
+This further limits the potential to adapt an existing system.
+
+There are three \ac{SML} web server projects:\\
+\centerline{\begin{tabular}{ll}
+SMLServer &
+    \url{http://www.smlserver.org}\\
+FoxNet web server &
+    \url{http://www.cs.cmu.edu/~fox/}\\
+Swerve web server &
+    \url{http://mlton.org/Swerve}
+\end{tabular}}
+
+Unfortunately, none of these projects is suitable.
+
+The SMLServer is an Apache web server plugin.
+It runs \ac{SML} programs that generate dynamic web pages.
+SMLServer is based on the MLKit compiler.
+It is unlikely that Isabelle and the l4.verified heaps could be compiled in 
+MLKit, at least not without significant effort.
+
+The FoxNet web server was developed as part of the Fox project at \ac{CMU}.
+The source is not readily available.
+
+The Swerve web server is part of an unpublished book on systems programming 
+in \ac{SML}.
+The source code is available and it is readily patched to run under the 
+latest version of SML/NJ (110.67).
+Adapting it to Poly/ML would require non-trivial effort because it is based 
+on \ac{CML}, whose implementation on SML/NJ makes use of continuations 
+(SMLofNJ.cont).
+
+\section{Implementation: Isabelle web module}\label{sec:twoproc}
+
+The description of the two process solution is divided into two subsections.
+The first contains an overview of the architecture and web server specifics.
+The second contains a summary of an \ac{SML} implementation of the web 
+module in Isabelle.
+
+\subsection{Architecture and web server}\label{sec:oneproc:arch}
+
+\newcommand{\component}[1]{%
+    \rput(0,0){\psframe(-.8,-.6)(.8,.6)}%
+    \rput(0,0){\parbox{4.3em}{\centering{#1}}}}
+
+\begin{figure}
+\centering%
+\begin{pspicture}(-4.8,0)(3.3,4)%\psgrid
+    \newpsstyle{conn}{arrows=->}%
+    %
+    \rput(-2.2,3.3){\component{web server}}%
+    \rput( 2.2,3.3){\component{web module}}%
+    \rput(-2.2,0.7){\component{web client}}%
+    \rput(-4.2,3.4){%
+	\psellipse(0,-.2)(.4,.2)%
+	\psframe[linestyle=none,fillstyle=solid,fillcolor=white]%
+		(-.4,-.2)(.4,.1)%
+	\psellipse(0,.1)(.4,.2)%
+	\psline(-.38,-.2)(-.38,.1)\psline(.38,-.2)(.38,.1)%
+    }%
+    \psline[style=conn,arrows=<->](-3.0,3.3)(-3.8,3.3)%
+    %
+    \rput[rB](3.3,2.15){server}%
+    \psline[linestyle=dashed](-4.8,2)(3.3,2)%
+    \rput[rt](3.3,1.90){network}%
+    %
+    \rput[B](0.0,3.55){\psframebox*{module protocol}}%
+    \psline[style=conn](-1.4,3.4)(1.4,3.4)%
+    \psline[style=conn](1.4,3.2)(-1.4,3.2)%
+    %
+    \rput[B]{90}(-2.4,2.0){\psframebox*{\ac{HTTP}}}%
+    \psline[style=conn](-2.1,2.7)(-2.1,1.3)%
+    \psline[style=conn](-2.3,1.3)(-2.3,2.7)%
+\end{pspicture}
+\caption{Overview of web module architecture\label{fig:modulearch}}
+\end{figure}
+
+An overview of a simple web server architecture is presented in 
+\reffig{modulearch}.
+A \emph{web client} requests a \ac{URL} from a \emph{web server} over 
+\ac{HTTP}.
+The web server processes the request by fetching static elements from its 
+local file system and communicating with \emph{web modules} to dynamically 
+generate other elements.
+The elements are sent back across the network to the web client.
+
+The web server communicates with web modules over a \emph{module protocol}, 
+which dictates a means of passing requests and receiving responses.
+There are several common module protocols.
+
+In the \ac{CGI}, the web server executes processes to service dynamic 
+\acp{URL}.
+Request details are placed in environment variables and sent on the standard 
+input channels of processes, responses are read from the standard output 
+channels of processes and transmitted back to web clients.
+
+The chief disadvantage of \ac{CGI} is that it creates a new process for 
+every request.
+Fast \ac{CGI} web modules, on the other hand, run persistently in a loop.
+They receive and respond to web server requests over a duplex socket.
+The Fast \ac{CGI} protocol is quite complicated.
+There are, however, alternatives like \ac{SCGI} that are easier for web 
+modules to support.
+This is important when programming in \ac{SML} because both the number of 
+developers and available libraries are small.
+
+An \ac{SCGI} web module listens on a socket for requests from a web server.
+Requests are sent as stream of bytes.
+The first part of the request is a null-delimited sequence of name and value 
+pairs.
+The second part is unparsed text sent from the web client.
+The web module responds by sending text, usually \ac{HTTP} headers followed 
+by \ac{HTML} data, back over the socket.
+The whole protocol can be described in two 
+pages.\footnote{\url{http://python.ca/scgi/protocol.txt}}
+
+Both the Apache and Lighttpd web servers support the \ac{SCGI} protocol.
+Lighttpd was chosen because it seemed to be small, fast, and easy to 
+configure.
+Two settings are required to connect lighttpd to an \ac{SCGI} web module: 
+\begin{verbatim}
+server.modules = ( "mod_scgi" )
+scgi.server = ("/isabelle" => ((
+		  "host" => "127.0.0.1",
+		  "port" => 64000,
+		  "check-local" => "disable")))
+\end{verbatim}
+In this example, the \texttt{scgi.server} entry directs the web server to 
+pass all \acp{URL} beginning with \texttt{/isabelle} to the web module 
+listening on port \texttt{64000}.
+
+\subsection{Implementation in \acs{SML}}\label{sec:oneproc:impl}
+
+\begin{table}
+\begin{tabular}{lp{.70\textwidth}}
+\textbf{Module}	     &	\textbf{Description}\\\hline
+Mime		     &	Rudimentary support for mime types.\\
+HttpStatus	     &	\ac{HTTP} header status codes.\\
+HttpUtil	     &	Produce \ac{HTTP} headers and parse query strings.\\
+Xhtml		     &	Rudimentary support for generating \ac{HTML}.\\
+SocketUtil	     &	Routines from The Standard ML Basis Library
+			book.\footnote{Chapter 10, Gansner and Reppy, 
+			Cambridge University Press.}\\
+\textbf{ScgiReq}     &	Parse \ac{SCGI} requests.\\
+\textbf{ScgiServer}  &	\ac{SCGI} server event loop and dispatch.\\
+\textbf{FindTheorems}&	\ac{HTML} interface to find\_theorems.
+\end{tabular}
+\caption{Web module implementation\label{tab:moduleimpl}}
+\end{table}
+
+The find\_theorems web module is implemented in \ac{SML}.
+It uses elements of the Pure library in Isabelle.
+The program comprises the nine modules shown in \reftab{moduleimpl}.
+The three most important ones are shown in bold: ScgiReq, ScgiServer, and 
+FindTheorems.
+
+ScgiReq implements the client-side of the \ac{SCGI} protocol.
+It maps a binary input stream into a request data type.
+
+ScgiServer is a generic, multi-threaded \ac{SCGI} request dispatcher.
+It accepts \ac{SCGI} requests on a designated socket, selects an appropriate 
+handler from an internal list, and calls the handler in a new thread.
+
+FindTheorems registers a handler with ScgiServer.
+It parses \ac{SCGI}/\ac{HTTP} requests, calls the Isabelle find\_theorems 
+function, and returns the results as \ac{HTML}.
+The \ac{HTML} generation of individual theorems is handled by the \ac{HTML} 
+print mode of Isabelle, but the form fields and page structure were manually 
+implemented.
+
+The module is built by using a \texttt{ROOT.ML} file inside a heap that 
+contains the theories to be searched.
+The server is started by calling \texttt{ScgiServer.server}.
+Scripts have been created to automate this process.
+
+\subsection{Handling symbols}\label{sec:unicode}
+
+Isabelle theorems are usually written in mathematical notation.
+Internally, however, Isabelle only manipulates \acs{ASCII} strings.
+Symbols are encoded by strings that begin with a backslash and contain a 
+symbol name between angle braces, for example, the symbol~$\longrightarrow$ 
+becomes~\verb+\<longrightarrow>+.
+The existing Thy/Html module in the Isabelle Pure theory converts many of 
+these symbols to \ac{HTML} entities.
+Custom routines are required to convert the missing symbols to \ac{HTML} 
+\emph{numeric character references}, which are the Unicode codepoints of 
+symbols printed in decimal between \verb+&#+ and \verb+;+.
+Further, other routines were required for converting \acs{UTF-8} encoded 
+strings sent from web browsers into Isabelle's symbol encoding.
+
+Isabelle is distributed with a text file that maps Isabelle symbols to 
+Unicode codepoints.
+A module was written to parse this file into symbol tables that map back and 
+forth between Isabelle symbols and Unicode codepoints, and also between 
+Isabelle \acs{ASCII} abbreviations (like \verb+-->+ for $\longrightarrow$) 
+and Unicode codepoints.
+
+The conversion from Isabelle symbols to \ac{HTML} numeric character 
+references is handled by a new printing mode, which is based in large part 
+on the existing \ac{HTML} printing mode.
+The new mode is used in combination with the existing \texttt{xsymbol} mode, 
+to ensure that Isabelle symbols are used instead of \acs{ASCII} 
+abbreviations.
+
+The conversion from \acs{UTF-8} is handled by a custom routine.
+Additionally, there is a JavaScript routine that converts from Isabelle 
+symbol encodings to \acs{UTF-8}, so that users can conveniently view 
+manually-entered or pasted mathematical characters in the web browser 
+interface.
+
+\section{Abbreviations}\label{sec:abbr}
+
+\begin{acronym}[SML/NJ] % longest acronym here
+    \acro{ASCII}{American Standard Code for Information Interchange}
+    \acro{CGI}{Common Gateway Interface}
+    \acro{CML}{Concurrent ML}
+    \acro{CMU}{Carnegie Mellon University}
+    \acro{HTML}{Hyper Text Markup Language}
+    \acro{HTTP}{Hyper Text Transfer Protocol}
+    \acro{ML}{Meta Language}
+    \acro{SCGI}{Simple CGI}
+    \acro{SML}{Standard ML}
+    \acro{SML/NJ}{Standard ML of New Jersey}
+    \acro{URL}{Universal Resource Locator}
+    \acro{UTF-8}{8-bit Unicode Transformation Format}
+    \acro{WWW}{World Wide Web}
+\end{acronym}
+
+\end{document}