src/Tools/WWW_Find/doc/design.tex
author blanchet
Mon, 26 Nov 2012 13:35:05 +0100
changeset 50222 40e3c3be6bca
parent 36862 952b2b102a0a
child 51085 d90218288d51
permissions -rw-r--r--
added file headers

% vim:nojs: tw=76 sw=4 sts=4 fo=awn fdm=marker
%
% 20090406 T. Bourke
%	Original document.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass[a4paper,draft]{article} % XXX
%\documentclass[a4paper,final]{article}
% Preamble
\usepackage[T1]{fontenc}
\usepackage{textcomp}
\usepackage{ifdraft}

\bibliographystyle{abbrv} % alpha

\newcommand{\refsec}[1]{Section~\ref{sec:#1}}
\newcommand{\reffig}[1]{Figure~\ref{fig:#1}}
\newcommand{\reftab}[1]{Table~\ref{tab:#1}}

%
\usepackage{acronym}
\usepackage{pst-all}
\usepackage{url}

\title{Isabelle find\_theorems on the web}
\author{T. Bourke\thanks{NICTA}}

\special{!/pdfmark where
 {pop} {userdict /pdfmark /cleartomark load put} ifelse
 [ /Author   (T. Bourke, NICTA)
   /Title    (IsabelleWWW: find_theorems
	      ($Date: 2008-10-03 16:09:02 +1000 (Fri, 03 Oct 2008) $))
   /Subject  (Web interface to find_theorems)
   /Keywords (isabelle, sml, http, www)
   /DOCINFO pdfmark}

\begin{document}
% Title page and abstract
\maketitle

\begin{abstract}
The Isabelle find\_theorems command processes queries against a theory 
database and returns a list of matching theorems.
It is usually given from the Proof General or command-line interface.
This document describes a web server implementation.
Two design alternatives are presented and an overview of an implementation 
of one is described.
\end{abstract}

\section{Introduction}\label{sec:intro}

This document describes the design and implementation of a web interface for 
the Isabelle find\_theorems command.
The design requirements and their consequences are detailed in \refsec{req}.
Two architectures were considered: \begin{enumerate}

\item \emph{one process}: integrate a web server into Isabelle.

\item \emph{two processes}: run Isabelle as a web server module.

\end{enumerate}
A brief evaluation of the one process architecture is presented in 
\refsec{oneproc}.
An implementation of the two process is presented in \refsec{twoproc}.

\section{Design requirements}\label{sec:req}

The main requirements are:\begin{enumerate}
\item The system will allow users to search for theorems from a web browser.
\item It will allow queries across disparate Isabelle theories.
\item It will, at a minimum, handle theories from the L4.verified project.
\item It will run on a secure network.
\item There will be at most ten simultaneous users.
\end{enumerate}

\noindent
Several \emph{a priori} choices are fixed:\begin{enumerate}
\item The search will run against an Isabelle heap.
\item A single heap will be built from the theories to be searched.
\item The system must be persistent, for two reasons: \begin{enumerate}
    \item Isabelle is slow to start against large heaps.
    \item Later enhancements may require tracking states at the server.
\end{enumerate}
\end{enumerate}

\section{Evaluation: Isabelle web server}\label{sec:oneproc}

The advantage of integrating a web server into Isabelle is that the 
find\_theorems service could be provided by a single process, which, in 
principle, would simplify administration.
Implementing even a simple \ac{HTTP} service from scratch is an unacceptable 
cost and fraught with potential problems and limitations.
It is preferable to adapt an existing system.

As Isabelle is written in \ac{SML}, only \ac{HTTP} services also written in 
\ac{SML} can realistically be considered.
In principle Isabelle compiles on both Poly/ML and \ac{SML/NJ}, but in 
practice the former is faster, more widely used, and better supported.
In particular, the L4.verified project does not build effectively on 
\ac{SML/NJ}.
This further limits the potential to adapt an existing system.

There are three \ac{SML} web server projects:\\
\centerline{\begin{tabular}{ll}
SMLServer &
    \url{http://www.smlserver.org}\\
FoxNet web server &
    \url{http://www.cs.cmu.edu/~fox/}\\
Swerve web server &
    \url{http://mlton.org/Swerve}
\end{tabular}}

Unfortunately, none of these projects is suitable.

The SMLServer is an Apache web server plugin.
It runs \ac{SML} programs that generate dynamic web pages.
SMLServer is based on the MLKit compiler.
It is unlikely that Isabelle and the l4.verified heaps could be compiled in 
MLKit, at least not without significant effort.

The FoxNet web server was developed as part of the Fox project at \ac{CMU}.
The source is not readily available.

The Swerve web server is part of an unpublished book on systems programming 
in \ac{SML}.
The source code is available and it is readily patched to run under the 
latest version of SML/NJ (110.67).
Adapting it to Poly/ML would require non-trivial effort because it is based 
on \ac{CML}, whose implementation on SML/NJ makes use of continuations 
(SMLofNJ.cont).

\section{Implementation: Isabelle web module}\label{sec:twoproc}

The description of the two process solution is divided into two subsections.
The first contains an overview of the architecture and web server specifics.
The second contains a summary of an \ac{SML} implementation of the web 
module in Isabelle.

\subsection{Architecture and web server}\label{sec:oneproc:arch}

\newcommand{\component}[1]{%
    \rput(0,0){\psframe(-.8,-.6)(.8,.6)}%
    \rput(0,0){\parbox{4.3em}{\centering{#1}}}}

\begin{figure}
\centering%
\begin{pspicture}(-4.8,0)(3.3,4)%\psgrid
    \newpsstyle{conn}{arrows=->}%
    %
    \rput(-2.2,3.3){\component{web server}}%
    \rput( 2.2,3.3){\component{web module}}%
    \rput(-2.2,0.7){\component{web client}}%
    \rput(-4.2,3.4){%
	\psellipse(0,-.2)(.4,.2)%
	\psframe[linestyle=none,fillstyle=solid,fillcolor=white]%
		(-.4,-.2)(.4,.1)%
	\psellipse(0,.1)(.4,.2)%
	\psline(-.38,-.2)(-.38,.1)\psline(.38,-.2)(.38,.1)%
    }%
    \psline[style=conn,arrows=<->](-3.0,3.3)(-3.8,3.3)%
    %
    \rput[rB](3.3,2.15){server}%
    \psline[linestyle=dashed](-4.8,2)(3.3,2)%
    \rput[rt](3.3,1.90){network}%
    %
    \rput[B](0.0,3.55){\psframebox*{module protocol}}%
    \psline[style=conn](-1.4,3.4)(1.4,3.4)%
    \psline[style=conn](1.4,3.2)(-1.4,3.2)%
    %
    \rput[B]{90}(-2.4,2.0){\psframebox*{\ac{HTTP}}}%
    \psline[style=conn](-2.1,2.7)(-2.1,1.3)%
    \psline[style=conn](-2.3,1.3)(-2.3,2.7)%
\end{pspicture}
\caption{Overview of web module architecture\label{fig:modulearch}}
\end{figure}

An overview of a simple web server architecture is presented in 
\reffig{modulearch}.
A \emph{web client} requests a \ac{URL} from a \emph{web server} over 
\ac{HTTP}.
The web server processes the request by fetching static elements from its 
local file system and communicating with \emph{web modules} to dynamically 
generate other elements.
The elements are sent back across the network to the web client.

The web server communicates with web modules over a \emph{module protocol}, 
which dictates a means of passing requests and receiving responses.
There are several common module protocols.

In the \ac{CGI}, the web server executes processes to service dynamic 
\acp{URL}.
Request details are placed in environment variables and sent on the standard 
input channels of processes, responses are read from the standard output 
channels of processes and transmitted back to web clients.

The chief disadvantage of \ac{CGI} is that it creates a new process for 
every request.
Fast \ac{CGI} web modules, on the other hand, run persistently in a loop.
They receive and respond to web server requests over a duplex socket.
The Fast \ac{CGI} protocol is quite complicated.
There are, however, alternatives like \ac{SCGI} that are easier for web 
modules to support.
This is important when programming in \ac{SML} because both the number of 
developers and available libraries are small.

An \ac{SCGI} web module listens on a socket for requests from a web server.
Requests are sent as stream of bytes.
The first part of the request is a null-delimited sequence of name and value 
pairs.
The second part is unparsed text sent from the web client.
The web module responds by sending text, usually \ac{HTTP} headers followed 
by \ac{HTML} data, back over the socket.
The whole protocol can be described in two 
pages.\footnote{\url{http://python.ca/scgi/protocol.txt}}

Both the Apache and Lighttpd web servers support the \ac{SCGI} protocol.
Lighttpd was chosen because it seemed to be small, fast, and easy to 
configure.
Two settings are required to connect lighttpd to an \ac{SCGI} web module: 
\begin{verbatim}
server.modules = ( "mod_scgi" )
scgi.server = ("/isabelle" => ((
		  "host" => "127.0.0.1",
		  "port" => 64000,
		  "check-local" => "disable")))
\end{verbatim}
In this example, the \texttt{scgi.server} entry directs the web server to 
pass all \acp{URL} beginning with \texttt{/isabelle} to the web module 
listening on port \texttt{64000}.

\subsection{Implementation in \acs{SML}}\label{sec:oneproc:impl}

\begin{table}
\begin{tabular}{lp{.70\textwidth}}
\textbf{Module}	     &	\textbf{Description}\\\hline
Mime		     &	Rudimentary support for mime types.\\
HttpStatus	     &	\ac{HTTP} header status codes.\\
HttpUtil	     &	Produce \ac{HTTP} headers and parse query strings.\\
Xhtml		     &	Rudimentary support for generating \ac{HTML}.\\
SocketUtil	     &	Routines from The Standard ML Basis Library
			book.\footnote{Chapter 10, Gansner and Reppy, 
			Cambridge University Press.}\\
\textbf{ScgiReq}     &	Parse \ac{SCGI} requests.\\
\textbf{ScgiServer}  &	\ac{SCGI} server event loop and dispatch.\\
\textbf{FindTheorems}&	\ac{HTML} interface to find\_theorems.
\end{tabular}
\caption{Web module implementation\label{tab:moduleimpl}}
\end{table}

The find\_theorems web module is implemented in \ac{SML}.
It uses elements of the Pure library in Isabelle.
The program comprises the nine modules shown in \reftab{moduleimpl}.
The three most important ones are shown in bold: ScgiReq, ScgiServer, and 
FindTheorems.

ScgiReq implements the client-side of the \ac{SCGI} protocol.
It maps a binary input stream into a request data type.

ScgiServer is a generic, multi-threaded \ac{SCGI} request dispatcher.
It accepts \ac{SCGI} requests on a designated socket, selects an appropriate 
handler from an internal list, and calls the handler in a new thread.

FindTheorems registers a handler with ScgiServer.
It parses \ac{SCGI}/\ac{HTTP} requests, calls the Isabelle find\_theorems 
function, and returns the results as \ac{HTML}.
The \ac{HTML} generation of individual theorems is handled by the \ac{HTML} 
print mode of Isabelle, but the form fields and page structure were manually 
implemented.

The module is built by using a \texttt{ROOT.ML} file inside a heap that 
contains the theories to be searched.
The server is started by calling \texttt{ScgiServer.server}.
Scripts have been created to automate this process.

\subsection{Handling symbols}\label{sec:unicode}

Isabelle theorems are usually written in mathematical notation.
Internally, however, Isabelle only manipulates \acs{ASCII} strings.
Symbols are encoded by strings that begin with a backslash and contain a 
symbol name between angle braces, for example, the symbol~$\longrightarrow$ 
becomes~\verb+\<longrightarrow>+.
The existing Thy/Html module in the Isabelle Pure theory converts many of 
these symbols to \ac{HTML} entities.
Custom routines are required to convert the missing symbols to \ac{HTML} 
\emph{numeric character references}, which are the Unicode codepoints of 
symbols printed in decimal between \verb+&#+ and \verb+;+.
Further, other routines were required for converting \acs{UTF-8} encoded 
strings sent from web browsers into Isabelle's symbol encoding.

Isabelle is distributed with a text file that maps Isabelle symbols to 
Unicode codepoints.
A module was written to parse this file into symbol tables that map back and 
forth between Isabelle symbols and Unicode codepoints, and also between 
Isabelle \acs{ASCII} abbreviations (like \verb+-->+ for $\longrightarrow$) 
and Unicode codepoints.

The conversion from Isabelle symbols to \ac{HTML} numeric character 
references is handled by a new printing mode, which is based in large part 
on the existing \ac{HTML} printing mode.
The new mode is used in combination with the existing \texttt{xsymbol} mode, 
to ensure that Isabelle symbols are used instead of \acs{ASCII} 
abbreviations.

The conversion from \acs{UTF-8} is handled by a custom routine.
Additionally, there is a JavaScript routine that converts from Isabelle 
symbol encodings to \acs{UTF-8}, so that users can conveniently view 
manually-entered or pasted mathematical characters in the web browser 
interface.

\section{Abbreviations}\label{sec:abbr}

\begin{acronym}[SML/NJ] % longest acronym here
    \acro{ASCII}{American Standard Code for Information Interchange}
    \acro{CGI}{Common Gateway Interface}
    \acro{CML}{Concurrent ML}
    \acro{CMU}{Carnegie Mellon University}
    \acro{HTML}{Hyper Text Markup Language}
    \acro{HTTP}{Hyper Text Transfer Protocol}
    \acro{ML}{Meta Language}
    \acro{SCGI}{Simple CGI}
    \acro{SML}{Standard ML}
    \acro{SML/NJ}{Standard ML of New Jersey}
    \acro{URL}{Universal Resource Locator}
    \acro{UTF-8}{8-bit Unicode Transformation Format}
    \acro{WWW}{World Wide Web}
\end{acronym}

\end{document}