# HG changeset patch # User wenzelm # Date 1264970444 -3600 # Node ID 38a44d813a3c7136aed4622f8bc72034d929c7ec # Parent 520727474bbe52702488e5d6b7ed8a24fbd912bc more details on Isabelle symbols; diff -r 520727474bbe -r 38a44d813a3c doc-src/IsarImplementation/Thy/Prelim.thy --- a/doc-src/IsarImplementation/Thy/Prelim.thy Fri Jan 29 23:59:03 2010 +0100 +++ b/doc-src/IsarImplementation/Thy/Prelim.thy Sun Jan 31 21:40:44 2010 +0100 @@ -490,29 +490,27 @@ section {* Names \label{sec:names} *} -text {* - In principle, a name is just a string, but there are various - convention for encoding additional structure. For example, ``@{text - "Foo.bar.baz"}'' is considered as a qualified name consisting of - three basic name components. The individual constituents of a name - may have further substructure, e.g.\ the string - ``\verb,\,\verb,,'' encodes as a single symbol. +text {* In principle, a name is just a string, but there are various + conventions for representing additional structure. For example, + ``@{text "Foo.bar.baz"}'' is considered as a qualified name + consisting of three basic name components. The individual + constituents of a name may have further substructure, e.g.\ the + string ``\verb,\,\verb,,'' encodes as a single symbol. *} subsection {* Strings of symbols *} -text {* - A \emph{symbol} constitutes the smallest textual unit in Isabelle - --- raw characters are normally not encountered at all. Isabelle - strings consist of a sequence of symbols, represented as a packed - string or a list of strings. Each symbol is in itself a small - string, which has either one of the following forms: +text {* A \emph{symbol} constitutes the smallest textual unit in + Isabelle --- raw ML characters are normally not encountered at all! + Isabelle strings consist of a sequence of symbols, represented as a + packed string or an exploded list of strings. Each symbol is in + itself a small string, which has either one of the following forms: \begin{enumerate} - \item a single ASCII character ``@{text "c"}'', for example - ``\verb,a,'', + \item a single ASCII character ``@{text "c"}'' or raw byte in the + range of 128\dots 255, for example ``\verb,a,'', \item a regular symbol ``\verb,\,\verb,<,@{text "ident"}\verb,>,'', for example ``\verb,\,\verb,,'', @@ -521,7 +519,7 @@ for example ``\verb,\,\verb,<^bold>,'', \item a raw symbol ``\verb,\,\verb,<^raw:,@{text text}\verb,>,'' - where @{text text} constists of printable characters excluding + where @{text text} consists of printable characters excluding ``\verb,.,'' and ``\verb,>,'', for example ``\verb,\,\verb,<^raw:$\sum_{i = 1}^n$>,'', @@ -540,17 +538,26 @@ may occur within regular Isabelle identifiers. Since the character set underlying Isabelle symbols is 7-bit ASCII - and 8-bit characters are passed through transparently, Isabelle may - also process Unicode/UCS data in UTF-8 encoding. Unicode provides - its own collection of mathematical symbols, but there is no built-in - link to the standard collection of Isabelle. + and 8-bit characters are passed through transparently, Isabelle can + also process Unicode/UCS data in UTF-8 encoding.\footnote{When + counting precise source positions internally, bytes in the range of + 128\dots 191 are ignored. In UTF-8 encoding, this interval covers + the additional trailer bytes, so Isabelle happens to count Unicode + characters here, not bytes in memory. In ISO-Latin encoding, the + ignored range merely includes some extra punctuation characters that + even have replacements within the standard collection of Isabelle + symbols; the accented letters range is counted properly.} Unicode + provides its own collection of mathematical symbols, but within the + core Isabelle/ML world there is no link to the standard collection + of Isabelle regular symbols. \medskip Output of Isabelle symbols depends on the print mode (\secref{print-mode}). For example, the standard {\LaTeX} setup of the Isabelle document preparation system would present ``\verb,\,\verb,,'' as @{text "\"}, and ``\verb,\,\verb,<^bold>,\verb,\,\verb,,'' as @{text - "\<^bold>\"}. + "\<^bold>\"}. On-screen rendering usually works by mapping a finite + subset of Isabelle symbols to suitable Unicode characters. *} text %mlref {* @@ -575,7 +582,10 @@ \item @{ML "Symbol.explode"}~@{text "str"} produces a symbol list from the packed form. This function supercedes @{ML "String.explode"} for virtually all purposes of manipulating text in - Isabelle! + Isabelle!\footnote{The runtime overhead for exploded strings is + mainly that of the list structure: individual symbols that happen to + be a singleton string --- which is the most common case --- do not + require extra memory in Poly/ML.} \item @{ML "Symbol.is_letter"}, @{ML "Symbol.is_digit"}, @{ML "Symbol.is_quasi"}, @{ML "Symbol.is_blank"} classify standard @@ -591,6 +601,16 @@ symbol into the datatype version. \end{description} + + \paragraph{Historical note.} In the original SML90 standard the + primitive ML type @{ML_type char} did not exists, and the basic @{ML + "explode: string -> string list"} operation would produce a list of + singleton strings as in Isabelle/ML today. When SML97 came out, + Isabelle ignored its slightly anachronistic 8-bit characters, but + the idea of exploding a string into a list of small strings was + extended to ``symbols'' as explained above. Thus Isabelle sources + can refer to an infinite store of user-defined symbols, without + having to worry about the multitude of Unicode encodings. *}