| LMLML |
@author YAMATODANI Kiyoshi
@version $Id: Overview.html,v 1.3 2007/04/19 04:10:32 kiyoshiy Exp $
'LMLML' is the Library of MultiLingualization for ML, which aims to support writing multilingualized programs in ML. The current version supports multi-byte string processing only.
The string manipulation modules of existing ML compilers and of the ML Basis Library in effect assume a codec that encodes each character in a single byte. They do not expect codecs that encode a character in multiple bytes. It is therefore hard for ML programmers to write applications that must handle text encoded in various codecs. The SML# project developed LMLML to support the development of such multi-byte string applications.
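As a small illustration of the single-byte assumption, the following sketch uses only the Basis Library. The two bytes are the Shift_JIS encoding of '剣' (0wx8C, 0wx95) used in the example later in this document; the names kenRaw, byteCount and pieces are made up for the illustration.

```sml
(* "\140\149" is the byte pair 0x8C 0x95, i.e. '剣' in Shift_JIS. *)
val kenRaw = "\140\149"

(* The Basis String structure counts bytes, not characters. *)
val byteCount = String.size kenRaw

(* explode likewise yields one Char.char per byte. *)
val pieces = List.map Char.ord (String.explode kenRaw)
```

Here byteCount is 2 and pieces is [140, 149], although the text holds a single character.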
With LMLML, you can select the codec dynamically for each string, and you can manipulate strings encoded in different codecs as values of the same type. You can therefore isolate the code that depends on specific codecs from the codec-independent code.
You can select the encoding method dynamically with the MultiByteString structure.
The decode functions of MultiByteString take a string that specifies the encoding method to use.
signature MULTI_BYTE_STRING =
sig
  structure Char :
  sig
    type char
    val decodeBytesSlice
        : String.string -> Word8VectorSlice.slice -> char option
    val decodeBytes : String.string -> Word8Vector.vector -> char option
    val decodeString : String.string -> String.string -> char option
    :
  end
  structure String :
  sig
    type string
    val decodeBytesSlice : String.string -> Word8VectorSlice.slice -> string
    val decodeBytes : String.string -> Word8Vector.vector -> string
    val decodeString : String.string -> String.string -> string
    :
  end
end
structure MultiByteString : MULTI_BYTE_STRING
For example, you can decode a byteVector : Word8Vector.vector in UTF-16 encoding as follows.
# val s1 = MultiByteString.String.decodeBytes "UTF-16" byteVector;
val s1 = - : MultiByteString.String.string
Available encoding methods are listed by MultiByteString.getCodecNames.
# MultiByteString.getCodecNames();
val it =
[
"UTF-16",
"UTF-16LE",
"UTF-16BE",
"UTF-8",
"SHIFT_JIS",
"MS_KANJI",
"CSSHIFTJIS",
"ISO-2022-JP",
"CSISO2022JP",
"GB2312",
"CSGB2312",
"GBK",
"CP936",
"MS936",
"WINDOWS-936",
"EXTENDED_UNIX_CODE_PACKED_FORMAT_FOR_JAPANESE",
"CSEUCPKDFMTJAPANESE",
"EUC-JP",
"ANSI_X3.4-1968",
"ISO-IR-6",
"ANSI_X3.4-1986",
"ISO_646.IRV:1991",
"ASCII",
"ISO646-US",
"US-ASCII",
"US",
"IBM367",
"CP367",
"CSASCII"
]
: string list
As described below, new encoding methods can be added.
If the encoding method to use is statically fixed, it is more efficient to use a structure that implements the encoding method.
# val s2 = UTF16Codec.String.fromBytes byteVector;
val s2 = - : UTF16Codec.String.string
However, strings obtained in these two ways are not compatible with each other.
# MultiByteString.String.size s2;
stdIn:5.1-5.30 Error:
operator and operand don't agree
operator domain: MultiByteString.String.string
operand: UTF16Codec.String.string
MultiByteString and the encoding-specific modules, such as UTF16Codec, provide interfaces almost compatible with Char and String of the Basis Library.
An existing ML program can therefore be upgraded to support multi-byte string codecs with only minor changes.
signature MB_CHAR =
sig
  type char
  val compare : char * char -> order
  val isAscii : char -> bool
  :
end
signature MB_STRING =
sig
  type string
  type char
  val sub : string * int -> char
  val explode : string -> char list
  :
end
signature MULTI_BYTE_STRING =
sig
  structure Char : MB_CHAR
  structure String : MB_STRING
  sharing type Char.string = String.string
  sharing type Char.char = String.char
end
structure MultiByteString : MULTI_BYTE_STRING
signature CODEC =
sig
  structure Char : MB_CHAR
  structure String : MB_STRING
  sharing type Char.string = String.string
  sharing type Char.char = String.char
end
structure UTF16Codec : CODEC
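Because these interfaces mirror the Basis Library, codec-independent code can be written once against a signature and instantiated per codec. The following is a sketch under assumptions: MINI_CHAR, MINI_STRING and CountNonAscii are hypothetical names, the signatures contain only members shown in the excerpts above, and the functor is instantiated with the Basis Char and String so that it runs without LMLML.

```sml
(* Hypothetical cut-down signatures: only members shown in the excerpts. *)
signature MINI_CHAR =
sig
  type char
  val isAscii : char -> bool
end

signature MINI_STRING =
sig
  type string
  type char
  val explode : string -> char list
end

(* Codec-independent code: counts the non-ASCII characters of a string. *)
functor CountNonAscii (structure C : MINI_CHAR
                       structure S : MINI_STRING
                       sharing type C.char = S.char) =
struct
  fun count (s : S.string) : int =
    List.length (List.filter (not o C.isAscii) (S.explode s))
end

(* Instantiated with the Basis structures, every byte is one "character". *)
structure ByteLevel = CountNonAscii (structure C = Char
                                     structure S = String)

(* "abc" followed by the Shift_JIS bytes 0x8C 0x95. *)
val byteLevelCount = ByteLevel.count "abc\140\149"
```

With the Basis structures, byteLevelCount is 2, because each byte of the two-byte character is counted separately. Instantiating the same functor with a codec structure such as UTF16Codec would presumably count characters instead, provided its Char and String match these cut-down signatures; that is not verified here.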
LMLML also provides functors that generate multibyte-string versions of Substring, StringCvt and ParserComb.
functor SubstringBase
functor StringConverterBase
functor ParserCombinatorBase
Note: In the current version, these functors are not loaded in the prelude. To use them, load "LMLML/extension.sml" as follows.
# use "LMLML/extension.sml";
For example, you can obtain a multibyte-string version of Substring for UTF-16 encoding as follows.
local
  structure MBS = UTF16Codec.String
  structure MBC = UTF16Codec.Char
  structure P =
  struct
    type char = MBS.char
    type string = MBS.string
    val sub = MBS.sub
    val substring = MBS.substring
    val size = MBS.size
    val concat = MBS.concat
    val compare = MBS.compare
    val compareChar = MBC.compare
  end
in
  structure UTF16Substring : MB_SUBSTRING = SubstringBase(P)
end
LMLML already supports major codecs, such as Shift_JIS and UTF-16. You can also extend LMLML, without changing LMLML itself, by adding a new module that supports the encoding method you need.
First, define a structure that implements the PRIM_CODEC signature.
For example, to support an encoding "foo", define a structure as follows.
structure FooCodecPrim : PRIM_CODEC =
struct
  val names = ["foo"]
  :
end
Then apply the Codec functor to it.
structure FooCodec = Codec(FooCodecPrim);
This code registers the foo codec with MultiByteString.
You can then decode with the foo codec as follows.
val mbs1 = MultiByteString.String.decodeBytes "foo" byteVector;
Of course, you can use FooCodec directly.
val mbs2 = FooCodec.String.fromBytes byteVector;
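One way a functor application can register a codec as a side effect is sketched below. This illustrates the general technique only and is not taken from LMLML's implementation; Registry, MINI_PRIM_CODEC, MiniCodec and countChars are hypothetical names, and only the names field mirrors the PRIM_CODEC excerpt above.

```sml
(* Hypothetical registry: maps codec names to a character-counting function. *)
structure Registry =
struct
  val table : (string * (Word8Vector.vector -> int)) list ref = ref []

  (* Record the same function under every alias in names. *)
  fun register (names : string list, f : Word8Vector.vector -> int) : unit =
    table := List.map (fn n => (n, f)) names @ !table

  fun lookup (name : string) : (Word8Vector.vector -> int) option =
    case List.find (fn (n, _) => n = name) (!table) of
      SOME (_, f) => SOME f
    | NONE => NONE
end

(* Hypothetical stand-in for PRIM_CODEC. *)
signature MINI_PRIM_CODEC =
sig
  val names : string list
  val countChars : Word8Vector.vector -> int
end

(* Applying the functor runs the registration as a side effect. *)
functor MiniCodec (P : MINI_PRIM_CODEC) =
struct
  open P
  val () = Registry.register (names, countChars)
end

structure FooPrim : MINI_PRIM_CODEC =
struct
  val names = ["foo"]
  val countChars = Word8Vector.length  (* toy codec: one byte per character *)
end

structure Foo = MiniCodec (FooPrim)

(* After the functor application, "foo" can be looked up by name. *)
val fooCount =
  case Registry.lookup "foo" of
    SOME f => SOME (f (Word8Vector.fromList [0wx61, 0wx62, 0wx63]))
  | NONE => NONE
val barKnown = Option.isSome (Registry.lookup "bar")
```

In this sketch fooCount is SOME 3 and barKnown is false: the codec becomes reachable by name only because the functor was applied.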
The major features of LMLML are implemented without depending on SML#-specific features. LMLML can be used with any compiler that conforms to the Definition of Standard ML, including of course SML#, but probably also SML/NJ, MLton, and many others.
LMLML is installed with the SML# system, and its core modules are loaded in the prelude.
In the current version of SML#, the Codec functor is not loaded in the prelude for an implementation reason. To use the Codec functor to extend LMLML with a new codec, you have to load "LMLML/extension.sml" as follows.
# use "LMLML/extension.sml";
Use sources.cm with SML/NJ CM.
Use sources.mlb with MLton Basis system.
As an example of programming with LMLML, we search for the character '剣' in the string "白血病abc剣道".
In Shift_JIS, "白血病abc剣道" is encoded into the following byte vector:
0wx94, 0wx92, 0wx8C, 0wx8C, 0wx95, 0wx61, (* 白血病 *)
0wx61, 0wx62, 0wx63, (* abc *)
0wx8C, 0wx95, 0wx93, 0wxB9 (* 剣道 *)
The second byte of '血' followed by the first byte of '病' is (0wx8C, 0wx95), which equals the encoding of '剣'. Therefore, if we search this byte vector for '剣' (0wx8C, 0wx95), we incorrectly find the byte sequence spanning the second and third characters of "白血病".
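The false match can be reproduced with the Basis Library alone. The following is a minimal sketch, not part of LMLML; rawBytes, kenBytes and naiveFind are hypothetical names introduced for the illustration.

```sml
(* Shift_JIS bytes of "白血病abc剣道", copied from the text above. *)
val rawBytes : Word8Vector.vector =
  Word8Vector.fromList
    [0wx94, 0wx92, 0wx8C, 0wx8C, 0wx95, 0wx61,
     0wx61, 0wx62, 0wx63,
     0wx8C, 0wx95, 0wx93, 0wxB9]

(* Shift_JIS bytes of '剣'. *)
val kenBytes : Word8Vector.vector = Word8Vector.fromList [0wx8C, 0wx95]

(* Byte-wise search: first byte offset at which pat occurs in src. *)
fun naiveFind (pat : Word8Vector.vector) (src : Word8Vector.vector)
    : int option =
  let
    val n = Word8Vector.length src
    val m = Word8Vector.length pat
    (* true if pat matches src starting at byte offset i *)
    fun matchesAt i =
      let
        fun loop j =
          j >= m
          orelse (Word8Vector.sub (src, i + j) = Word8Vector.sub (pat, j)
                  andalso loop (j + 1))
      in
        loop 0
      end
    fun scan i =
      if i + m > n then NONE
      else if matchesAt i then SOME i
      else scan (i + 1)
  in
    scan 0
  end

val naiveHit = naiveFind kenBytes rawBytes
```

naiveHit is SOME 3, a byte offset inside "白血病", rather than offset 9, where '剣' actually starts.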
With LMLML, we can obtain the correct result.
First, we decode a Shift_JIS string from the byte vector.
(* "白血病abc剣道" *)
val bytes =
  Word8Vector.fromList
    [
      0wx94, 0wx92, 0wx8C, 0wx8C, 0wx95, 0wx61, (* 白血病 *)
      0wx61, 0wx62, 0wx63, (* abc *)
      0wx8C, 0wx95, 0wx93, 0wxB9 (* 剣道 *)
    ];
val string = MultiByteString.String.decodeBytes "Shift_JIS" bytes;
(* "剣" *)
val KenBytes = Word8Vector.fromList [0wx8C, 0wx95]; (* 剣 *)
val KenString = MultiByteString.String.decodeBytes "Shift_JIS" KenBytes;
Then, search for "剣" in "白血病abc剣道".
val (leftSS, rightSS) =
  MBSubstring.position KenString (MBSubstring.full string);
The resulting leftSS holds the first 6 characters, "白血病abc", which shows that "剣" is correctly found as the 7th character.
# MBSubstring.size leftSS;
val it = 6 : int
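To see why the answer is 6, it can help to split the byte vector on character boundaries by hand. The sketch below uses only the Basis Library and assumes the usual Shift_JIS rule that a byte in 0x81-0x9F or 0xE0-0xFC starts a two-byte character. sjisBytes, isLead, splitChars and findChar are hypothetical helpers, and nothing here is taken from LMLML's implementation.

```sml
(* Shift_JIS bytes of "白血病abc剣道", as in the example above. *)
val sjisBytes : Word8Vector.vector =
  Word8Vector.fromList
    [0wx94, 0wx92, 0wx8C, 0wx8C, 0wx95, 0wx61,
     0wx61, 0wx62, 0wx63,
     0wx8C, 0wx95, 0wx93, 0wxB9]

(* Assumed lead-byte ranges of two-byte Shift_JIS characters. *)
fun isLead (b : Word8.word) : bool =
  (b >= 0wx81 andalso b <= 0wx9F) orelse (b >= 0wxE0 andalso b <= 0wxFC)

(* Split a byte vector into one byte list per character. *)
fun splitChars (v : Word8Vector.vector) : Word8.word list list =
  let
    val n = Word8Vector.length v
    fun go (i, acc) =
      if i >= n then List.rev acc
      else
        let
          val b = Word8Vector.sub (v, i)
        in
          if isLead b andalso i + 1 < n
          then go (i + 2, [b, Word8Vector.sub (v, i + 1)] :: acc)
          else go (i + 1, [b] :: acc)
        end
  in
    go (0, [])
  end

(* Index of the first character equal to target, counted in characters. *)
fun findChar (target : Word8.word list) (chars : Word8.word list list)
    : int option =
  let
    fun go (_, []) = NONE
      | go (i, c :: rest) = if c = target then SOME i else go (i + 1, rest)
  in
    go (0, chars)
  end

val charCount = List.length (splitChars sjisBytes)
val kenIndex = findChar [0wx8C, 0wx95] (splitChars sjisBytes)
```

Under these assumptions charCount is 8 and kenIndex is SOME 6: six characters precede '剣', matching the size of leftSS.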
Because the codec is fixed to Shift_JIS in this example, you can also write it with the Shift_JIS-specific module as follows.
val ShiftJISString = ShiftJISCodec.String.fromBytes bytes;
val ShiftJISKenString = ShiftJISCodec.String.fromBytes KenBytes;
structure ShiftJISSubstring =
  SubstringBase
    (struct
       open ShiftJISCodec.String
       val compareChar = ShiftJISCodec.Char.compare
     end);
val (leftSS, rightSS) =
  ShiftJISSubstring.position
    ShiftJISKenString (ShiftJISSubstring.full ShiftJISString);
val len = ShiftJISSubstring.size leftSS;