doc-src/IsarImplementation/Thy/document/Prelim.tex
changeset 37533 d775bd70f571
parent 36611 b0c047d03208
child 39688 a6e703a1fb4f
--- a/doc-src/IsarImplementation/Thy/document/Prelim.tex	Thu Jun 24 16:27:40 2010 +0100
+++ b/doc-src/IsarImplementation/Thy/document/Prelim.tex	Thu Jun 24 21:57:18 2010 +0200
@@ -648,8 +648,10 @@
 
   \begin{enumerate}
 
-  \item a single ASCII character ``\isa{c}'' or raw byte in the
-  range of 128\dots 255, for example ``\verb,a,'',
+  \item a single ASCII character ``\isa{c}'', for example
+  ``\verb,a,'',
+
+  \item a codepoint according to UTF8 (non-ASCII byte sequence),
 
   \item a regular symbol ``\verb,\,\verb,<,\isa{ident}\verb,>,'',
   for example ``\verb,\,\verb,<alpha>,'',
@@ -673,19 +675,17 @@
   ``\verb,\,\verb,<alpha>,'' is classified as a letter, which means it
   may occur within regular Isabelle identifiers.
 
-  Since the character set underlying Isabelle symbols is 7-bit ASCII
-  and 8-bit characters are passed through transparently, Isabelle can
-  also process Unicode/UCS data in UTF-8 encoding.\footnote{When
-  counting precise source positions internally, bytes in the range of
-  128\dots 191 are ignored.  In UTF-8 encoding, this interval covers
-  the additional trailer bytes, so Isabelle happens to count Unicode
-  characters here, not bytes in memory.  In ISO-Latin encoding, the
-  ignored range merely includes some extra punctuation characters that
-  even have replacements within the standard collection of Isabelle
-  symbols; the accented letters range is counted properly.} Unicode
-  provides its own collection of mathematical symbols, but within the
-  core Isabelle/ML world there is no link to the standard collection
-  of Isabelle regular symbols.
+  The character set underlying Isabelle symbols is 7-bit ASCII, but
+  8-bit character sequences are passed-through unchanged.  Unicode/UCS
+  data in UTF-8 encoding is processed in a non-strict fashion, such
+  that well-formed code sequences are recognized
+  accordingly.\footnote{Note that ISO-Latin-1 differs from UTF-8 only
+  in some special punctuation characters that even have replacements
+  within the standard collection of Isabelle symbols.  Text consisting
+  of ASCII plus accented letters can be processed in either encoding.}
+  Unicode provides its own collection of mathematical symbols, but
+  within the core Isabelle/ML world there is no link to the standard
+  collection of Isabelle regular symbols.
 
   \medskip Output of Isabelle symbols depends on the print mode
   (\secref{print-mode}).  For example, the standard {\LaTeX} setup of
@@ -733,7 +733,7 @@
   \cite{isabelle-isar-ref}.
 
   \item \verb|Symbol.sym| is a concrete datatype that represents
-  the different kinds of symbols explicitly, with constructors \verb|Symbol.Char|, \verb|Symbol.Sym|, \verb|Symbol.Ctrl|, \verb|Symbol.Raw|.
+  the different kinds of symbols explicitly, with constructors \verb|Symbol.Char|, \verb|Symbol.Sym|, \verb|Symbol.UTF8|, \verb|Symbol.Ctrl|, \verb|Symbol.Raw|.
 
   \item \verb|Symbol.decode| converts the string representation of a
   symbol into the datatype version.