Thalassa CMS

Scriptpp library: string manipulations

Contents:

  1. Overview
  2. ScriptVariable: the string representation
  3. ScriptVector class and string tokenization
  4. Macroprocessor

Overview

The scriptpp (Script Plus Plus) is a relatively small C++ class library written by Andrey V. Stolyarov, initially intended to replace the well-known (and generally unusable) std::string class; as years passed, the library became much more than just another string class implementation. The official web home page for scriptpp is http://www.croco.net/software/scriptpp/.

Despite its name, the library generally has nothing to do with scripting. The name was based on a (wrong) assumption that script programming, as a paradigm, consists of two basic things: using text strings as the representation for all kinds of data, and extensively relying on external programs and text streams. The library was expected to turn C++ into a kind of scripting language; that is, not to provide a scripting language to be used from within C++ programs but, in contrast, to enable people to use C++ itself for scripting. Well, as the author of this concept, I admit it was all wrong; in reality, scripting is not about strings or external programs, it is about small and primitive programs (scripts) that control bigger and more complicated programs, either gluing them together (as Bourne Shell and various other shells do), or running inside them, which is, by the way, how Tcl was initially supposed to be used. In both cases, a scripting language is about interpreted execution, so C++, by its very nature, cannot play this role. However, I, the author of scriptpp, realized all this some 15 years after the first release of the library, and decided not to rename anything.

Thalassa CMS uses the library for:

Unfortunately, there's almost no documentation for the library; only some doxygen-style commentary inside its header files can explain something about it. The page you're now reading is in no way a replacement for such documentation (which is still to be written); it is only here to provide some clue on what the hell is going on and how to deal with it.

ScriptVariable: the string representation

To get the ScriptVariable class available, one must #include the <scriptpp/scrvar.hpp> header file, which may be considered the main header of the library, as all the other features provided by the library depend on this class anyway.

The ScriptVariable class is the replacement for that std::string, so basically it has similar functionality and even some compatibility features, such as the c_str() method, which returns the respective const char * value. A converting constructor from the const char * type is also available, as well as an assignment operator accepting const char * as the argument; the default constructor creates an object holding an empty string.

The current length of the string can be obtained with the Length() method, which is named according to the style generally used in the library, but there's a compatibility alias length() as well.

Just like these “standard” strings, the strings represented by ScriptVariable objects can be concatenated with operator+. The index operator (operator[]) is also available, giving access to the chars of the string in the obvious way. The operator+= is available for chars and C strings (that is, const char *), as well as for other ScriptVariable objects.

Objects of the class have strictly the size of a single pointer, which allows passing them by value, returning them from functions and assigning them here and there with no harm to efficiency. Unlike the “modern” versions of std::string, the ScriptVariable class implements the copy-on-write strategy and always will. Yes, this is fundamentally incompatible with multithreading, and that's good: be good and never ever use multithreading in your programs, because only bad and evil people do so.
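To make the idea more tangible, here is a minimal stand-alone sketch of a copy-on-write string whose only data member is one pointer. It is not the scriptpp implementation: the CowString class, its Rep structure and the methods get, set and shares_with are all invented for this illustration, and the real ScriptVariable internals may be organized quite differently.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical illustration only; NOT the scriptpp implementation.
class CowString {
    struct Rep {                 // shared, reference-counted buffer
        std::size_t refs;
        std::size_t len;
        char *data;
    };
    Rep *rep;                    // the only data member of the object

    static Rep *make(const char *s, std::size_t n) {
        Rep *r = new Rep;
        r->refs = 1;
        r->len = n;
        r->data = new char[n + 1];
        std::memcpy(r->data, s, n);
        r->data[n] = '\0';
        return r;
    }
    void release() {
        if (--rep->refs == 0) {
            delete[] rep->data;
            delete rep;
        }
    }
    void detach() {              // make a private copy before writing
        if (rep->refs > 1) {
            Rep *copy = make(rep->data, rep->len);
            --rep->refs;
            rep = copy;
        }
    }
public:
    CowString(const char *s = "") : rep(make(s, std::strlen(s))) {}
    CowString(const CowString &o) : rep(o.rep) { ++rep->refs; }
    ~CowString() { release(); }
    CowString &operator=(const CowString &o) {
        ++o.rep->refs;           // increment first: safe for self-assignment
        release();
        rep = o.rep;
        return *this;
    }
    const char *c_str() const { return rep->data; }
    std::size_t length() const { return rep->len; }
    char get(std::size_t i) const { return rep->data[i]; }
    void set(std::size_t i, char c) { detach(); rep->data[i] = c; }
    bool shares_with(const CowString &o) const { return rep == o.rep; }
};

static_assert(sizeof(CowString) == sizeof(void *), "one pointer wide");
```

Copying such an object only bumps a counter; the buffer is duplicated on the first write to a shared string. The counter here is a plain integer with no locking, which is exactly the kind of design that does not mix with threads.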

Another feature worth mentioning is that a ScriptVariable object can be invalid. It can be constructed in the invalid state by passing the (char*)0 value to the converting constructor, and there's a special (tiny) derived class named ScriptVariableInv whose default constructor creates the object in the invalid state. An object can also be invalidated by calling the Invalidate() method, and can be made valid again by assigning a valid string to it. An object can be queried for its validity with the methods IsValid() and IsInvalid().

There's a nested helper class ScriptVariable::Substring, which represents a substring of a given string. Objects of this class are created in several cases, the most important of which is an explicit call to the Range(index, length) method. The range selected this way can then be erased or replaced with another string, e.g.:

      str.Range(0, 5).Erase();
      str.Range(7, -1).Replace("foobar");

(well, passing -1 as the length means “all the rest”). These Erase() and Replace(string) are methods of the ScriptVariable::Substring, which is the return type for the Range method.
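For readers more used to std::string, the same two operations can be approximated as follows. This is only an analogy: erase_range and replace_range are hypothetical helpers written for this page, and the assumption that Range behaves like the std::string erase and replace calls is mine, not something taken from the scriptpp headers.

```cpp
#include <string>

// Hypothetical helpers on std::string, not scriptpp code. A negative
// length means "everything up to the end of the string".
void erase_range(std::string &s, std::size_t index, long length)
{
    s.erase(index, length < 0 ? std::string::npos
                              : static_cast<std::size_t>(length));
}

void replace_range(std::string &s, std::size_t index, long length,
                   const std::string &with)
{
    s.replace(index, length < 0 ? std::string::npos
                                : static_cast<std::size_t>(length), with);
}
```

Starting with "Hello, wonderful world", erase_range(s, 0, 5) leaves ", wonderful world", and a subsequent replace_range(s, 7, -1, "foobar") gives ", wondefoobar".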

It is useful to remember that the substring's method Get() makes a copy of the selected substring and returns it as a new ScriptVariable object.

Another important feature of the ScriptVariable class is to convert a string representing a number to the represented numeric value. Here are the four methods for this:

    bool GetLong(long &l, int radix = 0) const;
    bool GetLongLong(long long &l, int radix = 0) const;
    bool GetDouble(double &d) const;
    bool GetRational(long &p, long &q) const;

All methods return true in case the conversion is successful, otherwise they return false and leave the arguments untouched.

Please be warned that the zero value for radix, which is the default, means to convert like in the C language, that is, "0x25" will get converted as hexadecimal (so the decimal value will be 37), and "025" will be considered octal (decimal 21). In most cases the thing you really want is achieved by specifying 10 as the radix value for GetLong and GetLongLong.

The opposite conversion, from numbers to strings, is done by the ScriptNumber class, which is a direct descendant of ScriptVariable. It has converting constructors for all existing base numeric types.
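The effect of the radix argument can be tried out without the library, since the standard strtol function follows the same C rules when its base is zero. The parse_long function below is a hypothetical stand-in written for this page, not the code of GetLong; in particular, rejecting trailing garbage is an assumption of this sketch.

```cpp
#include <cerrno>
#include <cstdlib>

// Hypothetical helper, not part of scriptpp: converts the whole string s
// to long using the given radix; on failure returns false and leaves
// result untouched.
bool parse_long(const char *s, long &result, int radix = 0)
{
    char *end = 0;
    errno = 0;
    long v = std::strtol(s, &end, radix);
    if (end == s || *end != '\0' || errno == ERANGE)
        return false;            // nothing parsed, trailing junk, or overflow
    result = v;
    return true;
}
```

With the default radix, parse_long("0x25", n) stores 37 and parse_long("025", n) stores 21, while parse_long("025", n, 10) stores 25, which is usually what is wanted for user input.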

ScriptVector class and string tokenization

The ScriptVector class generally represents a (resizeable) vector of ScriptVariable objects. However, its main intended use is not to store these objects, but rather to break a given string down to words or tokens, which is performed by its constructors. A vector of strings is perhaps the most natural representation for a result of such operations, hence the class.

To get the ScriptVector class available, one must #include the <scriptpp/scrvect.hpp> header file.

It is important to understand the difference between words and tokens, in the terms used in the scriptpp library. Generally, words in a string can be separated by any non-zero amount of whitespace chars (or any other chars chosen to be used as separators), and hence words can't be empty, while tokens are expected to be separated by exactly one delimiter, and in case there are several delimiters in a row, it is assumed there are empty tokens between them. For example, if we for some reason decide to use the “#” char as the only separator/delimiter, then the string "#abra###cadabra#" will break down to two words "abra" and "cadabra", but to six tokens: "", "abra", "", "", "cadabra", "".

Tokens (in the given sense), by their nature, can optionally have leading and trailing whitespace trimmed off (or whatever chars we decide to trim off as if they were whitespace). Words, by contrast, should not need any trimming: we just tell the tokenizer to consider all unwanted chars as separators. This leads to a (kinda non-intuitive) rule: if we specify the string, the set of delimiter chars and the set of whitespace chars to be trimmed off (even an empty one), then we are requesting tokens, but if we only specify the string and the separator (whitespace) chars, then we are asking for words. For example:

      ScriptVariable s("#abra###cadabra#");
      ScriptVector v1(s, "#");
      ScriptVector v2(s, "#", " \t\r\n");
      ScriptVector v3(s, "#", "");

Once constructed this way, v1 will contain the two words, and both v2 and v3 will (each) hold the six tokens.
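The two splitting rules can also be written down with plain standard C++, which may help when checking what result to expect. The functions split_words, split_tokens and trim below are hypothetical helpers invented for this page; they follow the rules described above but are not how ScriptVector is implemented.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-alone helpers; they mirror the rules described
// above but are NOT the scriptpp implementation.

// Words: any non-empty run of separator chars separates two words,
// so a word is never empty.
std::vector<std::string> split_words(const std::string &s,
                                     const std::string &seps)
{
    std::vector<std::string> res;
    std::size_t pos = s.find_first_not_of(seps);
    while (pos != std::string::npos) {
        std::size_t end = s.find_first_of(seps, pos);
        res.push_back(s.substr(pos, end - pos));   // count is clamped
        pos = s.find_first_not_of(seps, end);
    }
    return res;
}

// Removes leading and trailing chars that belong to the set ws.
static std::string trim(const std::string &s, const std::string &ws)
{
    std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos)
        return std::string();
    std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Tokens: every single delimiter ends a token, so adjacent delimiters
// produce empty tokens; each token is trimmed of the chars in ws.
std::vector<std::string> split_tokens(const std::string &s,
                                      const std::string &delims,
                                      const std::string &ws)
{
    std::vector<std::string> res;
    std::size_t start = 0;
    for (;;) {
        std::size_t end = s.find_first_of(delims, start);
        res.push_back(trim(s.substr(start, end - start), ws));
        if (end == std::string::npos)
            break;
        start = end + 1;
    }
    return res;
}
```

For "#abra###cadabra#" with "#" as the only separator, split_words returns the two words, and split_tokens returns the six tokens listed earlier, whether the trimming set is " \t\r\n" or empty.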

The default constructor creates an empty vector (that is, a vector of zero length); there's operator[], which gives access to the elements of the vector (and the vector automatically resizes in case the index is out of range). Some other important methods are Length(), which returns the current number of elements in the vector, and AddItem(string), which adds another string to the vector's end (well... like push_back in STL).

For more methods, take a look at the header file.

Macroprocessor

The macroprocessor is implemented by the ScriptMacroprocessor class, #include <scriptpp/scrmacro.hpp> to access it. In the header file there's a relatively long commentary describing the class.

From the user's point of view the macroprocessor is explained on the Macroprocessor introduction page. Be sure to read it before going further.

The class allows redefining the escape char as well as all the special chars used by the macroprocessor, but this possibility is not used in Thalassa; we use the defaults for all the five chars %[]{}.

Being constructed by default, the macroprocessor class contains no macros; the macros are expected to be added by the user (the module or program which uses the macroprocessor). There's an abstract class named ScriptMacroprocessorMacro intended to represent the notion of a generic macro with a name; all macros are implemented by its subclasses.

The macroexpansion is supposed to be done by the Expand methods; there are two of them, one accepts a ScriptVector containing arguments, and the other accepts no arguments. The version that accepts the vector of parameters is pure virtual in the base class, while the no-args version has a default implementation which creates an empty ScriptVector object and calls the other version of the method. The no-args version can be overridden, too, to gain efficiency for macros that don't accept arguments.
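The relationship between the two Expand versions can be sketched in isolation like this. All names here (Macro, Hello, Version) are hypothetical and std::string with std::vector stands in for the library types; the actual signatures are in scrmacro.hpp and may well differ.

```cpp
#include <string>
#include <vector>

// Hypothetical names throughout; this only shows the pattern and is
// NOT the declaration found in scrmacro.hpp.
class Macro {
public:
    virtual ~Macro() {}
    // The version with arguments must be provided by every subclass.
    virtual std::string Expand(const std::vector<std::string> &args) const = 0;
    // The no-args version falls back to the other one by default.
    virtual std::string Expand() const {
        std::vector<std::string> empty;
        return Expand(empty);
    }
};

// Overrides only the mandatory version.
class Hello : public Macro {
public:
    std::string Expand(const std::vector<std::string> &args) const {
        if (args.empty())
            return std::string("hello");
        return "hello, " + args[0];
    }
};

// Overrides both, so the no-args call skips the temporary vector.
class Version : public Macro {
public:
    std::string Expand(const std::vector<std::string> &) const {
        return std::string("1.0");
    }
    std::string Expand() const { return std::string("1.0"); }
};
```

Note that in C++ a subclass overriding only one of the overloads hides the other by name, so in this sketch the no-args call is expected to go through a reference or pointer to the base class.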

The library provides several convenient predefined subclasses of the ScriptMacroprocessorMacro class. We'll describe some of them.

The (abstract) ScriptMacroVariable implements the generic notion of a macro accepting no arguments. It introduces the pure virtual method Text() which is supposed to return the string to which the macro is to be expanded.

Two heavily used classes are derived from ScriptMacroVariable: the ScriptMacroConst class implements a macro which always expands to the same string, passed to the object as its constructor argument, and the ScriptMacroScrVar class represents a macro that expands to the current value of a ScriptVariable object, which itself exists somewhere else and whose address is given to the ScriptMacroScrVar class as its constructor argument.

A slightly more complicated thing is implemented by the ScriptMacroDictionary class. The macro implemented with this class accepts exactly one argument, which is a dictionary key; the call expands to the value associated with the key. The dictionary is represented by a ScriptVector of even length, the 0th, 2nd, 4th... items being the keys, and the 1st, 3rd, 5th... representing values to expand to. The object can either make a copy of your vector, or (if you wish so) assume the vector remains existing somewhere else, and there's no need for a copy.
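The even-length layout is easy to get wrong by one, so here is a small stand-alone sketch of a lookup over such a vector. The dict_lookup function is hypothetical and uses std::vector of std::string instead of ScriptVector; what the real class does for a missing key is not covered here.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not scriptpp code: dict holds key, value, key,
// value, ...; returns true and stores the value if key is found.
bool dict_lookup(const std::vector<std::string> &dict,
                 const std::string &key, std::string &value)
{
    // Step by two so that only the even positions are treated as keys.
    for (std::size_t i = 0; i + 1 < dict.size(); i += 2) {
        if (dict[i] == key) {
            value = dict[i + 1];
            return true;
        }
    }
    return false;
}
```

With the vector "color", "red", "size", "large", looking up "size" yields "large", while looking up "red" fails, because "red" sits in a value position.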

For anything not covered by these classes, one has to derive one's own class from the abstract ScriptMacroprocessorMacro class, overriding the Expand method as appropriate.

All these objects representing macros are to be passed to the macroprocessor object using the AddMacro method. Please note the class assumes ownership over all these objects, so they must be created in the dynamic memory (with operator new); once passed to the macroprocessor, they start to belong to it in the sense the macroprocessor actually deletes all macro objects from its destructor.
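The ownership rule can be pictured with the following stand-alone sketch. Item and Registry are hypothetical classes made up for this page, and the alive counter exists only to make the deletion observable; this is not the code of ScriptMacroprocessor.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical classes illustrating the ownership rule; NOT scriptpp code.
struct Item {
    static int alive;            // number of currently existing Items
    Item() { ++alive; }
    virtual ~Item() { --alive; }
};
int Item::alive = 0;

class Registry {
    std::vector<Item *> items;
    Registry(const Registry &);              // no copying: it would lead
    Registry &operator=(const Registry &);   // to deleting items twice
public:
    Registry() {}
    ~Registry() {
        for (std::size_t i = 0; i < items.size(); ++i)
            delete items[i];     // the owner deletes what it was given
    }
    void Add(Item *p) { items.push_back(p); }   // takes ownership of p
};
```

An object passed to Add must come from operator new and must not be deleted by the caller; passing the address of a local or global object to such an owner would make its destructor call delete on memory that was never allocated dynamically.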

TODO: explain positionals

TODO: describe the Process methods

© Andrey V. Stolyarov, 2023, 2024