--- a/doc-src/IsarImplementation/Thy/Prelim.thy Fri Jan 29 23:59:03 2010 +0100
+++ b/doc-src/IsarImplementation/Thy/Prelim.thy Sun Jan 31 21:40:44 2010 +0100
@@ -490,29 +490,27 @@
section {* Names \label{sec:names} *}
-text {*
- In principle, a name is just a string, but there are various
- convention for encoding additional structure. For example, ``@{text
- "Foo.bar.baz"}'' is considered as a qualified name consisting of
- three basic name components. The individual constituents of a name
- may have further substructure, e.g.\ the string
- ``\verb,\,\verb,<alpha>,'' encodes as a single symbol.
+text {* In principle, a name is just a string, but there are various
+ conventions for representing additional structure. For example,
+ ``@{text "Foo.bar.baz"}'' is considered as a qualified name
+ consisting of three basic name components. The individual
+ constituents of a name may have further substructure, e.g.\ the
+ string ``\verb,\,\verb,<alpha>,'' encodes as a single symbol.
*}
subsection {* Strings of symbols *}
-text {*
- A \emph{symbol} constitutes the smallest textual unit in Isabelle
- --- raw characters are normally not encountered at all. Isabelle
- strings consist of a sequence of symbols, represented as a packed
- string or a list of strings. Each symbol is in itself a small
- string, which has either one of the following forms:
+text {* A \emph{symbol} constitutes the smallest textual unit in
+ Isabelle --- raw ML characters are normally not encountered at all!
+ Isabelle strings consist of a sequence of symbols, represented as a
+ packed string or an exploded list of strings. Each symbol is in
+ itself a small string, which has either one of the following forms:
\begin{enumerate}
- \item a single ASCII character ``@{text "c"}'', for example
- ``\verb,a,'',
+ \item a single ASCII character ``@{text "c"}'' or raw byte in the
+ range of 128\dots 255, for example ``\verb,a,'',
\item a regular symbol ``\verb,\,\verb,<,@{text "ident"}\verb,>,'',
for example ``\verb,\,\verb,<alpha>,'',
@@ -521,7 +519,7 @@
for example ``\verb,\,\verb,<^bold>,'',
\item a raw symbol ``\verb,\,\verb,<^raw:,@{text text}\verb,>,''
- where @{text text} constists of printable characters excluding
+ where @{text text} consists of printable characters excluding
``\verb,.,'' and ``\verb,>,'', for example
``\verb,\,\verb,<^raw:$\sum_{i = 1}^n$>,'',
@@ -540,17 +538,26 @@
may occur within regular Isabelle identifiers.
Since the character set underlying Isabelle symbols is 7-bit ASCII
- and 8-bit characters are passed through transparently, Isabelle may
- also process Unicode/UCS data in UTF-8 encoding. Unicode provides
- its own collection of mathematical symbols, but there is no built-in
- link to the standard collection of Isabelle.
+ and 8-bit characters are passed through transparently, Isabelle can
+ also process Unicode/UCS data in UTF-8 encoding.\footnote{When
+ counting precise source positions internally, bytes in the range of
+ 128\dots 191 are ignored. In UTF-8 encoding, this interval covers
+ the additional trailer bytes, so Isabelle happens to count Unicode
+ characters here, not bytes in memory. In ISO-Latin encoding, the
+ ignored range merely includes some extra punctuation characters that
+ even have replacements within the standard collection of Isabelle
+ symbols; the accented letters range is counted properly.} Unicode
+ provides its own collection of mathematical symbols, but within the
+ core Isabelle/ML world there is no link to the standard collection
+ of Isabelle regular symbols.
\medskip Output of Isabelle symbols depends on the print mode
(\secref{print-mode}). For example, the standard {\LaTeX} setup of
the Isabelle document preparation system would present
``\verb,\,\verb,<alpha>,'' as @{text "\<alpha>"}, and
``\verb,\,\verb,<^bold>,\verb,\,\verb,<alpha>,'' as @{text
- "\<^bold>\<alpha>"}.
+ "\<^bold>\<alpha>"}. On-screen rendering usually works by mapping a finite
+ subset of Isabelle symbols to suitable Unicode characters.
*}
text %mlref {*
@@ -575,7 +582,10 @@
\item @{ML "Symbol.explode"}~@{text "str"} produces a symbol list
from the packed form. This function supercedes @{ML
"String.explode"} for virtually all purposes of manipulating text in
- Isabelle!
+ Isabelle!\footnote{The runtime overhead for exploded strings is
+ mainly that of the list structure: individual symbols that happen to
+ be a singleton string --- which is the most common case --- do not
+ require extra memory in Poly/ML.}
\item @{ML "Symbol.is_letter"}, @{ML "Symbol.is_digit"}, @{ML
"Symbol.is_quasi"}, @{ML "Symbol.is_blank"} classify standard
@@ -591,6 +601,16 @@
symbol into the datatype version.
\end{description}
+
+ \paragraph{Historical note.} In the original SML90 standard the
+ primitive ML type @{ML_type char} did not exists, and the basic @{ML
+ "explode: string -> string list"} operation would produce a list of
+ singleton strings as in Isabelle/ML today. When SML97 came out,
+ Isabelle ignored its slightly anachronistic 8-bit characters, but
+ the idea of exploding a string into a list of small strings was
+ extended to ``symbols'' as explained above. Thus Isabelle sources
+ can refer to an infinite store of user-defined symbols, without
+ having to worry about the multitude of Unicode encodings.
*}