doc-src/IsarImplementation/Thy/Prelim.thy
changeset 34925 38a44d813a3c
parent 34924 520727474bbe
child 34926 19294b07e445
--- a/doc-src/IsarImplementation/Thy/Prelim.thy	Fri Jan 29 23:59:03 2010 +0100
+++ b/doc-src/IsarImplementation/Thy/Prelim.thy	Sun Jan 31 21:40:44 2010 +0100
@@ -490,29 +490,27 @@
 
 section {* Names \label{sec:names} *}
 
-text {*
-  In principle, a name is just a string, but there are various
-  convention for encoding additional structure.  For example, ``@{text
-  "Foo.bar.baz"}'' is considered as a qualified name consisting of
-  three basic name components.  The individual constituents of a name
-  may have further substructure, e.g.\ the string
-  ``\verb,\,\verb,<alpha>,'' encodes as a single symbol.
+text {* In principle, a name is just a string, but there are various
+  conventions for representing additional structure.  For example,
+  ``@{text "Foo.bar.baz"}'' is considered as a qualified name
+  consisting of three basic name components.  The individual
+  constituents of a name may have further substructure, e.g.\ the
+  string ``\verb,\,\verb,<alpha>,'' encodes as a single symbol.
 *}
 
 
 subsection {* Strings of symbols *}
 
-text {*
-  A \emph{symbol} constitutes the smallest textual unit in Isabelle
-  --- raw characters are normally not encountered at all.  Isabelle
-  strings consist of a sequence of symbols, represented as a packed
-  string or a list of strings.  Each symbol is in itself a small
-  string, which has either one of the following forms:
+text {* A \emph{symbol} constitutes the smallest textual unit in
+  Isabelle --- raw ML characters are normally not encountered at all!
+  Isabelle strings consist of a sequence of symbols, represented as a
+  packed string or an exploded list of strings.  Each symbol is in
+  itself a small string, which has either one of the following forms:
 
   \begin{enumerate}
 
-  \item a single ASCII character ``@{text "c"}'', for example
-  ``\verb,a,'',
+  \item a single ASCII character ``@{text "c"}'' or raw byte in the
+  range of 128\dots 255, for example ``\verb,a,'',
 
   \item a regular symbol ``\verb,\,\verb,<,@{text "ident"}\verb,>,'',
   for example ``\verb,\,\verb,<alpha>,'',
@@ -521,7 +519,7 @@
   for example ``\verb,\,\verb,<^bold>,'',
 
   \item a raw symbol ``\verb,\,\verb,<^raw:,@{text text}\verb,>,''
-  where @{text text} constists of printable characters excluding
+  where @{text text} consists of printable characters excluding
   ``\verb,.,'' and ``\verb,>,'', for example
   ``\verb,\,\verb,<^raw:$\sum_{i = 1}^n$>,'',
 
@@ -540,17 +538,26 @@
   may occur within regular Isabelle identifiers.
 
   Since the character set underlying Isabelle symbols is 7-bit ASCII
-  and 8-bit characters are passed through transparently, Isabelle may
-  also process Unicode/UCS data in UTF-8 encoding.  Unicode provides
-  its own collection of mathematical symbols, but there is no built-in
-  link to the standard collection of Isabelle.
+  and 8-bit characters are passed through transparently, Isabelle can
+  also process Unicode/UCS data in UTF-8 encoding.\footnote{When
+  counting precise source positions internally, bytes in the range of
+  128\dots 191 are ignored.  In UTF-8 encoding, this interval covers
+  the additional trailer bytes, so Isabelle happens to count Unicode
+  characters here, not bytes in memory.  In ISO-Latin encoding, the
+  ignored range merely includes some extra punctuation characters that
+  even have replacements within the standard collection of Isabelle
+  symbols; the accented letters range is counted properly.} Unicode
+  provides its own collection of mathematical symbols, but within the
+  core Isabelle/ML world there is no link to the standard collection
+  of Isabelle regular symbols.
 
   \medskip Output of Isabelle symbols depends on the print mode
   (\secref{print-mode}).  For example, the standard {\LaTeX} setup of
   the Isabelle document preparation system would present
   ``\verb,\,\verb,<alpha>,'' as @{text "\<alpha>"}, and
   ``\verb,\,\verb,<^bold>,\verb,\,\verb,<alpha>,'' as @{text
-  "\<^bold>\<alpha>"}.
+  "\<^bold>\<alpha>"}.  On-screen rendering usually works by mapping a finite
+  subset of Isabelle symbols to suitable Unicode characters.
 *}
 
 text %mlref {*
@@ -575,7 +582,10 @@
   \item @{ML "Symbol.explode"}~@{text "str"} produces a symbol list
   from the packed form.  This function supercedes @{ML
   "String.explode"} for virtually all purposes of manipulating text in
-  Isabelle!
+  Isabelle!\footnote{The runtime overhead for exploded strings is
+  mainly that of the list structure: individual symbols that happen to
+  be a singleton string --- which is the most common case --- do not
+  require extra memory in Poly/ML.}
 
   \item @{ML "Symbol.is_letter"}, @{ML "Symbol.is_digit"}, @{ML
   "Symbol.is_quasi"}, @{ML "Symbol.is_blank"} classify standard
@@ -591,6 +601,16 @@
   symbol into the datatype version.
 
   \end{description}
+
+  \paragraph{Historical note.} In the original SML90 standard the
+  primitive ML type @{ML_type char} did not exists, and the basic @{ML
+  "explode: string -> string list"} operation would produce a list of
+  singleton strings as in Isabelle/ML today.  When SML97 came out,
+  Isabelle ignored its slightly anachronistic 8-bit characters, but
+  the idea of exploding a string into a list of small strings was
+  extended to ``symbols'' as explained above.  Thus Isabelle sources
+  can refer to an infinite store of user-defined symbols, without
+  having to worry about the multitude of Unicode encodings.
 *}