
- Library of MultiLingualization for ML -

@author YAMATODANI Kiyoshi
@version $Id: Overview.html,v 1.3 2007/04/19 04:10:32 kiyoshiy Exp $

'LMLML' is the Library of MultiLingualization for ML. It aims to support writing multilingualized programs in ML. The current version supports multi-byte string processing only.

The string manipulation modules of existing ML compilers and of the ML Basis Library effectively assume a codec that encodes each character in a single byte. They do not anticipate codecs that encode a character in multiple bytes. As a result, it is hard for ML programmers to write applications that must handle text encoded in various codecs. The SML# project developed LMLML to support the development of such multi-byte string applications.
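The one-byte-per-character assumption can be observed with the Basis Library alone. The following minimal sketch does not use LMLML; the UTF-8 bytes for "あ" (E3 81 82) are written as decimal escapes, and the value names are chosen here for illustration only.

```sml
(* "あ" in UTF-8 is the three bytes E3 81 82 (decimal 227 129 130). *)
val a = "\227\129\130"

(* Basis String counts bytes, so one character is reported as size 3. *)
val byteCount = String.size a

(* explode yields three one-byte chars rather than one character. *)
val unitCount = List.length (String.explode a)
```

Both byteCount and unitCount are 3 here, although the text contains a single character.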


1, isolating codec-specific code.

With LMLML, you can select the codec to use dynamically for each string, and you can manipulate strings encoded in different codecs as instances of the same type. This lets you isolate the code that depends on specific codecs from the code that is codec-independent.

You can select encoding method dynamically with MultiByteString structure.

The decode functions of MultiByteString take a string that specifies the encoding method to use.

  signature MULTI_BYTE_STRING =
  sig
    structure Char :
    sig
      type char
      val decodeBytesSlice
          : String.string -> Word8VectorSlice.slice -> char option
      val decodeBytes : String.string -> Word8Vector.vector -> char option
      val decodeString : String.string -> String.string -> char option
      (* other members omitted *)
    end
    structure String :
    sig
      type string
      val decodeBytesSlice : String.string -> Word8VectorSlice.slice -> string
      val decodeBytes : String.string -> Word8Vector.vector -> string
      val decodeString : String.string -> String.string -> string
      (* other members omitted *)
    end
  end
  structure MultiByteString : MULTI_BYTE_STRING

For example, you can decode a byteVector : Word8Vector.vector in UTF-16 encoding as follows.

  # val s1 = MultiByteString.String.decodeBytes "UTF-16" byteVector;
  val s1 = - : MultiByteString.String.string

Available encoding methods are listed by MultiByteString.getCodecNames.

  # MultiByteString.getCodecNames();
  val it =
      : string list

As described below, new encoding methods can be added.

If the encoding method to use is statically fixed, it is more efficient to use a structure that implements the encoding method.

  # val s2 = UTF16Codec.String.fromBytes byteVector;
  val s2 = - : UTF16Codec.String.string

However, strings obtained in these two ways are not compatible with each other.

  # MultiByteString.String.size s2;
  stdIn:5.1-5.30 Error:
    operator and operand don't agree
    operator domain: MultiByteString.String.string
    operand: UTF16Codec.String.string

2, near compatibility with SML Basis

MultiByteString and encoding-specific modules, such as UTF16Codec, provide interfaces that are almost compatible with Char and String of the Basis Library. Existing ML programs can therefore be upgraded to support multi-byte string codecs with only minor changes.

  signature MB_CHAR =
  sig
    type char

    val compare : char * char -> order
    val isAscii : char -> bool
    (* other members omitted *)
  end

  signature MB_STRING =
  sig
    type string
    type char

    val sub : string * int -> char
    val explode : string -> char list
    (* other members omitted *)
  end

  signature MULTI_BYTE_STRING =
  sig
    structure Char : MB_CHAR
    structure String : MB_STRING
    sharing type Char.string = String.string
    sharing type Char.char = String.char
  end
  structure MultiByteString : MULTI_BYTE_STRING

  signature CODEC =
  sig
    structure Char : MB_CHAR
    structure String : MB_STRING
    sharing type Char.string = String.string
    sharing type Char.char = String.char
  end
  structure UTF16Codec : CODEC

LMLML also provides functors that generate multibyte-string versions of Substring, StringCvt and ParserComb.

  functor SubstringBase
  functor StringConverterBase
  functor ParserCombinatorBase

Note: For the current version, functors are not loaded in the prelude. You should load "LMLML/extension.sml" as follows to use these functors.

# use "LMLML/extension.sml";

For example, you can obtain a multibyte-string version of Substring for UTF-16 encoding as follows.

    structure MBS = UTF16Codec.String
    structure MBC = UTF16Codec.Char
    structure P =
    struct
      type char = MBS.char
      type string = MBS.string
      val sub = MBS.sub
      val substring = MBS.substring
      val size = MBS.size
      val concat = MBS.concat
      val compare = MBS.compare
      val compareChar = MBC.compare
    end
    structure UTF16Substring : MB_SUBSTRING = SubstringBase(P)

3, support of new encoding method

LMLML already supports major codecs, such as ShiftJIS and UTF-16. You can extend LMLML with a new module that supports an encoding method you need, without changing LMLML itself.

First, you have to define a structure that implements PRIM_CODEC signature. For example, to support an encoding "foo", define a structure as follows.

  structure FooCodecPrim : PRIM_CODEC =
  struct
    val names = ["foo"]  (* remaining PRIM_CODEC members omitted *)
  end

Then, apply Codec functor to it.

  structure FooCodec = Codec(FooCodecPrim);

Applying the Codec functor registers the foo codec with MultiByteString.
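How a functor application can register a codec as a side effect may be sketched as follows. This is a simplified illustration under stated assumptions, not LMLML's actual implementation: the names Registry, PRIM and Register are hypothetical, and a codec is reduced to a single character-counting function.

```sml
(* Hypothetical registry keyed by codec name; LMLML's internals may differ. *)
structure Registry =
struct
  val codecs : (string * (Word8Vector.vector -> int)) list ref = ref []
  fun register (name, countChars) = codecs := (name, countChars) :: !codecs
  fun names () = map (fn (n, _) => n) (!codecs)
  fun lookup name =
      Option.map (fn (_, f) => f)
                 (List.find (fn (n, _) => n = name) (!codecs))
end

(* Hypothetical, heavily reduced stand-in for a primitive codec signature. *)
signature PRIM =
sig
  val names : string list
  val countChars : Word8Vector.vector -> int
end

(* Elaborating the functor body runs the registration once per application. *)
functor Register (P : PRIM) =
struct
  val () = List.app (fn n => Registry.register (n, P.countChars)) P.names
  val countChars = P.countChars
end

structure Foo =
  Register (struct
              val names = ["foo"]
              val countChars = Word8Vector.length  (* one byte per character *)
            end)

(* The codec is now reachable by name as well as through Foo directly. *)
val registered = Registry.names ()
```

In this sketch the top-level `val () = ...` inside the functor body is what makes the codec visible to name-based lookup, while direct use through the resulting structure avoids the lookup.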

You can decode in foo codec as follows.

  val mbs1 = MultiByteString.String.decodeBytes "foo" byteVector;

Of course, you can use FooCodec directly.

  val mbs2 = FooCodec.String.fromBytes byteVector;

4, Standard ML compliance.

The major features of LMLML are implemented without depending on SML#-specific features. LMLML can be used with any compiler that conforms to the Definition of Standard ML, including of course SML#, but probably also SML/NJ, MLton, and many others.


LMLML is installed with the SML# system, and its core modules are loaded in the prelude.

In the current version of SML#, the Codec functor is not loaded in the prelude for an implementation reason. To use the Codec functor to extend LMLML with a new codec, you have to load "LMLML/extension.sml" as follows.

# use "LMLML/extension.sml";


Use sources.cm with SML/NJ CM.


Use sources.mlb with MLton Basis system.


As an example of programming with LMLML, we search for the character '剣' in the string "白血病abc剣道".

In Shift_JIS, "白血病abc剣道" is encoded into the following byte vector:

 0wx94, 0wx92, 0wx8C, 0wx8C, 0wx95, 0wx61, (* 白血病 *)
 0wx61, 0wx62, 0wx63, (* abc *)
 0wx8C, 0wx95, 0wx93, 0wxB9 (* 剣道 *)

The second byte of '血' followed by the first byte of '病' is ( 0wx8C, 0wx95 ), which equals the encoding of '剣'. Therefore, if we search for '剣' (0wx8C, 0wx95) in this byte vector, we incorrectly find the byte sequence that spans the second and third characters of "白血病".
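The false match can be reproduced with the Basis Library alone. The sketch below treats the Shift_JIS bytes as an ordinary byte string and searches it with Substring.position; the value names are chosen here for illustration.

```sml
(* "白血病abc剣道" in Shift_JIS, handled as a plain byte string. *)
val rawBytes =
    Word8Vector.fromList
      [0wx94, 0wx92, 0wx8C, 0wx8C, 0wx95, 0wx61,  (* 白血病 *)
       0wx61, 0wx62, 0wx63,                        (* abc *)
       0wx8C, 0wx95, 0wx93, 0wxB9]                 (* 剣道 *)
val rawText = Byte.bytesToString rawBytes
val rawKen = Byte.bytesToString (Word8Vector.fromList [0wx8C, 0wx95])

(* Byte-wise search: the prefix before the first match is 3 bytes long. *)
val (rawLeft, _) = Substring.position rawKen (Substring.full rawText)
val falseOffset = Substring.size rawLeft
```

Here falseOffset is 3, which points into the middle of '血', whereas the real '剣' starts at byte offset 9.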

With LMLML, we can obtain the correct result.

At first, we decode a Shift_JIS string from the byte vector.

 (* "白血病abc剣道" *)
 val bytes =
     Word8Vector.fromList
       [0wx94, 0wx92, 0wx8C, 0wx8C, 0wx95, 0wx61, (* 白血病 *)
        0wx61, 0wx62, 0wx63, (* abc *)
        0wx8C, 0wx95, 0wx93, 0wxB9]; (* 剣道 *)
 val string = MultiByteString.String.decodeBytes "Shift_JIS" bytes;

 (* "剣" *)
 val KenBytes = Word8Vector.fromList [0wx8C, 0wx95]; (* 剣 *)
 val KenString = MultiByteString.String.decodeBytes "Shift_JIS" KenBytes;

Then, search for "剣" in "白血病abc剣道".

 val (leftSS, rightSS) =
     MBSubstring.position KenString (MBSubstring.full string);

The obtained leftSS is the first 6 characters, "白血病abc", which indicates that the 7th character "剣" is found correctly.

 # MBSubstring.size leftSS;
 val it = 6 : int
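Why a character-aware search yields 6 can be sketched without LMLML. The code below is not LMLML's decoder; it is a minimal illustration that assumes the standard Shift_JIS lead-byte ranges (0x81 to 0x9F and 0xE0 to 0xFC), performs no validation, and uses the hypothetical helper names isLeadByte, splitChars and indexOf.

```sml
(* A byte in these ranges starts a two-byte Shift_JIS character. *)
fun isLeadByte (b : Word8.word) =
    (0wx81 <= b andalso b <= 0wx9F) orelse (0wxE0 <= b andalso b <= 0wxFC)

(* Split a byte list into characters, each kept as the list of its bytes. *)
fun splitChars [] = []
  | splitChars (b1 :: b2 :: rest) =
    if isLeadByte b1
    then [b1, b2] :: splitChars rest
    else [b1] :: splitChars (b2 :: rest)
  | splitChars [b] = [[b]]

(* Index of the first character equal to target, counted in characters. *)
fun indexOf target chars =
    let
      fun loop (_, []) = NONE
        | loop (i, c :: cs) = if c = target then SOME i else loop (i + 1, cs)
    in
      loop (0, chars)
    end

val textBytes : Word8.word list =
    [0wx94, 0wx92, 0wx8C, 0wx8C, 0wx95, 0wx61,  (* 白血病 *)
     0wx61, 0wx62, 0wx63,                        (* abc *)
     0wx8C, 0wx95, 0wx93, 0wxB9]                 (* 剣道 *)
val kenBytes : Word8.word list = [0wx8C, 0wx95]  (* 剣 *)

val charCount = List.length (splitChars textBytes)
val kenIndex = indexOf kenBytes (splitChars textBytes)
```

Because comparison happens only at character boundaries, the pair (0wx8C, 0wx95) inside "血病" is never considered: charCount is 8 and kenIndex is SOME 6.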

In this example the codec is fixed at Shift_JIS, so you can also write it with the Shift_JIS-specific module as follows.

 val ShiftJISString = ShiftJISCodec.String.fromBytes bytes;
 val ShiftJISKenString = ShiftJISCodec.String.fromBytes KenBytes;
 structure ShiftJISSubstring =
     SubstringBase
       (struct
          open ShiftJISCodec.String
          val compareChar = ShiftJISCodec.Char.compare
        end);
 val (leftSS, rightSS) =
     ShiftJISSubstring.position
       ShiftJISKenString (ShiftJISSubstring.full ShiftJISString);
 val len = ShiftJISSubstring.size leftSS;

Inner Signature summary

signature CODEC

signature CODECS
           This signature specifies the interface of the Codecs structure.

           MULTI_BYTE_CHAR signature specifies manipulations on multibyte character encoded in a particular encoding.

           A multibyte string version of PARSER_COMB in SML/NJ library ((c) 1996 AT&T Research).

           MULTI_BYTE_STRING signature specifies manipulations on multibyte strings which encode sequences of multibyte characters in a particular encoding.


           SUBSTRING signature specifies manipulations on an abstract representation of a sequence of contiguous characters in a multibyte string.

           The main user interface of the multibyte text library.

Inner Structure summary

structure ASCIICodec
           fundamental functions to access ASCII encoded characters.

structure Codecs
           This module packages definitions shared by codecs.

structure EUCJPCodec
           fundamental functions to access EUCJP encoded characters.

structure GB2312Codec
           fundamental functions to access GB2312 encoded characters.

structure GBKCodec
           fundamental functions to access GBK encoded characters.

structure ISO2022JPCodec
           fundamental functions to access ISO-2022-JP encoded characters.

structure MultiByteText
           The main user interface of the multibyte text library.

structure ShiftJISCodec
           fundamental functions to access ShiftJIS encoded characters.

structure UTF16BECodec
           fundamental functions to access UTF-16BE encoded characters.

structure UTF16Codec
           fundamental functions to access UTF16 encoded characters.

structure UTF16LECodec
           fundamental functions to access UTF-16LE encoded characters.

structure UTF8Codec
           fundamental functions to access UTF-8 encoded characters.

Inner Functor summary

functor UTF16CodecPrimArgBase
           implementation of variations of UTF-16 codec: UTF-16LE, UTF-16BE, UTF-16.


