Text files with a clearly separated header are used by Thalassa CMS as a source for the following types of content: pages (items of page sets) and user comments.
Every file of this type serves as a source for exactly one page, or exactly one comment. In contrast with ini files, a headed file can have its own format and may specify its own character encoding. Macroprocessing is NOT performed on the content of these files, and sometimes this may be critically important.
The idea behind these files is that they are intended to be created
programmatically rather than written manually, although it is easy to
write them by hand as well. In particular, this documentation was
prepared as a set of headed text files, written in a text editor (what
editor? well, vim) and processed as a page set. On the other hand,
comments are received by the Thalassa CGI program (thalcgi.cgi) and
stored as headed text files, so this is an obvious case where they are
created by a program.
Furthermore, Thalassa CMS was initially written with the task of saving a particular site in mind. That site was created in 2009 with Drupal (version 5.*, which was declared end-of-life just a year later). After several years, the version of MySQL used for that site was declared unsupported as well, and the ancient version of Drupal refused to work with later versions of MySQL or MariaDB. Even newer versions of the PHP interpreter started to output warnings when trying to run that old Drupal. After the site had worked for more than a decade in a virtual box containing older versions of all the software involved, it became clear to the site owner that the only way to keep the site running was to break out of this software slavery, built by people who don't care about backward compatibility.
This is how Thalassa was born; the site from this story is http://stolyarov.info. It is in Russian, so unfortunately most people reading this text will not be able to rate it for themselves, but at the moment of migration to Thalassa it had 200+ pages and 6000+ user comments, and to give up all the content would be a catastrophe. What was actually done is this: all pages and comments were fetched from the MySQL database and stored in files, while all the blocks, menus and the like, together with the page layout and color scheme, were recreated manually.
This makes another example of a set of headed text files created programmatically, this time including the source files for pages, not only for comments. And here's why Thalassa allows these files to be in a character encoding different from the one used for ini files (and for the generated HTML files, too): the author prefers koi8-r as THE Cyrillic encoding, while Drupal since ancient times has only been capable of utf8, and the author preferred to store the source information in files exactly as it was fetched from the database (in order not to lose anything accidentally).
The real objective of these files is to work instead of another damn database table, holding both the content (that is, the text of a page or a comment, possibly not in the final form) and additional attributes such as title, date, tags and the like. And yes, they perform well. Actually, they work far better than the author expected.
The format is much like the email message format well known since RFC 822: headers come first, and every header consists of a name, a colon char, a space and a value. A header line can be continued, thus making it longer, by placing a line that starts with whitespace right after it. The body is separated from the header by an empty line (strictly speaking, a line either of zero length, or consisting of whitespace chars only). The most notable difference from RFC 822 is that header names are case-sensitive, and for Thalassa CMS they are usually written all-lowercase.
To get the idea, here's an example:
    id: testpg
    title: Test page
    unixdate: 1681890000
    encoding: us-ascii
    format: texbreaks

    This is an example of an HTML page created from a source stored
    as a headed text file.

    BTW, thanks to the format, this phrase will go into a separate
    paragraph.
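To make the layout concrete, here is a minimal Python sketch of a reader for such files. It is not Thalassa's own parser (Thalassa is not written in Python); the function name parse_headed_text is hypothetical, and joining continuation lines with a single space is an assumption, since the text above does not say how continued lines are joined.

```python
def parse_headed_text(text):
    """Split a headed text file into (headers, body).

    Assumptions of this sketch: continuation lines are joined to the
    previous value with a single space, and whitespace around a value
    is trimmed.  Header names are kept case-sensitive.
    """
    headers = {}
    lines = text.split("\n")
    last_name = None
    i = 0
    while i < len(lines):
        line = lines[i]
        i += 1
        if line.strip() == "":
            # An empty or whitespace-only line ends the header.
            break
        if line[0] in " \t":
            # A line starting with whitespace continues the previous header.
            if last_name is None:
                raise ValueError("continuation line before any header")
            headers[last_name] += " " + line.strip()
            continue
        name, colon, value = line.partition(":")
        if not colon:
            raise ValueError("header line without a colon: %r" % line)
        last_name = name
        headers[name] = value.strip()
    body = "\n".join(lines[i:])
    return headers, body


sample = (
    "id: testpg\n"
    "title: Test\n"
    " page\n"
    "unixdate: 1681890000\n"
    "\n"
    "Body line\n"
)
headers, body = parse_headed_text(sample)
```

With this sample, the continued title header comes out as a single value and everything after the first empty line is treated as the body.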
The header fields id, encoding and format are used internally and are
not accessible directly with macros. Actually, the id (both for a page
and for a comment) is accessible, but it must be set to the same value
as determined by the file name (or, for set items represented with a
directory, by the name of the directory). It is unspecified what happens
in case they don't match.
Actually, it is very possible nothing will happen at all. Please don't rely on that. In future versions, checks are likely to be added.
The encoding and format fields determine what filters must be applied to
the body and some of the header fields before the content is made
available as macro values. Both can be omitted or left blank. If
encoding is not specified explicitly, it is assumed to be the same as
for the ini files (and for the site), so no recoding is needed.
Character encodings are discussed later.
If format is not specified, it is assumed to be “verbatim”, so, again,
no conversion is needed. It is notably inconvenient to write raw HTML
manually; see the Formats section below for information on available
formats.
Some of the fields are recognized by Thalassa; their values are
available via macros, directly or indirectly. All information from
unrecognized fields is extracted as well and is available directly by
the names of the fields (via the %[li: ] macro for pages, via the
%[cmt: ] macro for comments), so the user can add any fields as needed
and use their values for site generation.
Some of the recognized fields are not passed through filters at all
(e.g., the unixdate: field, as it is just a number). Some are passed
through the same set of filters as the body, as determined by format:
and encoding: (actually, only the descr: field for pages is converted
this way in the present version). For the rest of the fields, including
any unrecognized fields, only the encoding conversion is applied as
appropriate, but no format conversion is done.
The exact list of recognized fields and their roles differ for pages and comments, so they are discussed along with page sets and comment sections, respectively.
Thalassa mostly acts in an encoding-agnostic manner. In particular, it assumes ini files are written in the same encoding in which the site pages should be generated, so it never performs any encoding transformations on content that comes from ini files.
Unlike that, headed text files are intended to be supplied by other programs, and the author of Thalassa had at least one task at hand for which encoding conversion was desired. Hence Thalassa supports encoding conversions for the content of headed text files. To activate this feature, the following conditions must be met:
- the site's encoding is specified by the encoding parameter of the [format] configuration section;
- the file's encoding is specified by the encoding: header field.

If any of these conditions is not met, Thalassa will silently decide to do no encoding conversion.
Currently, only the following encodings are supported: utf8, ascii, koi8-r, cp1251.
Thalassa recognizes the following synonyms for the supported encodings:
utf-8, us-ascii, koi8r, koi8, 1251, win1251, win-1251, windows-1251.
Names are case-insensitive.
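The name matching can be pictured as a simple lookup table. The Python sketch below is only an illustration: the function name canonical_encoding is hypothetical, and the pairing of each synonym with a supported encoding is the obvious guess rather than something stated explicitly above.

```python
# Supported encodings and their synonyms, as listed in the text.
# Which synonym belongs to which encoding is assumed here.
ENCODING_NAMES = {
    "utf8": "utf8", "utf-8": "utf8",
    "ascii": "ascii", "us-ascii": "ascii",
    "koi8-r": "koi8-r", "koi8r": "koi8-r", "koi8": "koi8-r",
    "cp1251": "cp1251", "1251": "cp1251", "win1251": "cp1251",
    "win-1251": "cp1251", "windows-1251": "cp1251",
}


def canonical_encoding(name):
    """Return the canonical encoding name, or None if it is unsupported."""
    return ENCODING_NAMES.get(name.strip().lower())
```

An unsupported name yields None, which matches the behaviour of silently doing no conversion.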
It is relatively easy to add more encodings; this involves making tables
to convert from that encoding to unicode and back, and adding them to
the source code of the stfilter library used by Thalassa. If you need
support for an encoding not listed above, please contact the author.
When a character encountered in the source being converted is not
representable in the target encoding, Thalassa replaces it with the
appropriate HTML4 entity, such as &laquo; or &mdash;, if there is such
an entity for the particular character in HTML4 (not HTML5). If no
appropriate entity exists, the character is replaced with an HTML
hexadecimal unicode entity, like &#x42B;.
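The fallback rule can be sketched in Python as follows. This does not use the stfilter library; it relies on Python's own codecs and on the standard html.entities.codepoint2name table, which holds the HTML 4 entity names. The function name recode_with_entities is hypothetical.

```python
from html.entities import codepoint2name  # HTML 4 entity names by code point


def recode_with_entities(text, target):
    """Encode text into target, replacing unrepresentable characters.

    A character that the target encoding cannot represent becomes a
    named HTML 4 entity if one exists, otherwise a hexadecimal
    character reference.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(target)
        except UnicodeEncodeError:
            name = codepoint2name.get(ord(ch))
            if name is not None:
                ref = "&%s;" % name
            else:
                ref = "&#x%X;" % ord(ch)
            out += ref.encode("ascii")
    return bytes(out)
```

For example, converting Cyrillic text to ascii turns every Cyrillic letter into a hexadecimal reference, while converting the same text to koi8-r leaves the letters alone and only replaces characters such as guillemets, which koi8-r lacks.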
Thalassa is designed to pass the body of a headed text file through
filters that adjust the markup, that is, convert the content from one
format to another. The format of the body (and possibly some of the
header fields) is indicated by the format: header field; the value of
this field is expected to be a comma-separated list of tokens
designating certain aspects of the format, or, strictly speaking, of the
desired conversion.
The default format is verbatim, which means the body is already in HTML
and needs no conversion at all. No conversion will be performed as well
in case the format: header is missing, empty, or contains no recognized
tokens; it is, however, strongly recommended to specify the verbatim
format explicitly as the defaults may change in the future.
The recognized format tokens are verbatim, breaks, texbreaks, tags and web.
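As an illustration of how such a field value might be interpreted, here is a small Python sketch. The function name parse_format is hypothetical; trimming whitespace around tokens and matching them case-sensitively are assumptions, since the text above only says that the value is a comma-separated list.

```python
RECOGNIZED_TOKENS = {"verbatim", "breaks", "texbreaks", "tags", "web"}


def parse_format(value):
    """Return the recognized tokens of a format: value, in order.

    A missing or empty value, or a value with no recognized tokens,
    falls back to the default, verbatim.
    """
    tokens = [token.strip() for token in (value or "").split(",")]
    found = [token for token in tokens if token in RECOGNIZED_TOKENS]
    return found or ["verbatim"]
```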
The web token was initially intended to indicate that URLs and email
addresses encountered in the text should be automatically turned into
clickable web links. At present, this conversion is not implemented, and
the token is silently ignored by Thalassa.
The tags token means that HTML tags must be stripped from the text being
converted, except for the tags explicitly allowed for user-supplied
content. The list of allowed tags is set by the tags parameter in the
[format] configuration section.
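The idea of stripping everything but an allowed set of tags can be shown with a rough Python sketch. It is not Thalassa's filter and not a safe HTML sanitizer: it uses a simple regular expression, and it passes the attributes of allowed tags through unchanged. The function name strip_tags and the example set of allowed tags are hypothetical.

```python
import re

# Matches an opening or closing tag and captures the tag name.
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)\b[^>]*>")


def strip_tags(text, allowed):
    """Remove every tag whose name is not in allowed; keep the text."""
    def keep_or_drop(match):
        if match.group(1).lower() in allowed:
            return match.group(0)
        return ""
    return TAG_RE.sub(keep_or_drop, text)


cleaned = strip_tags("<b>hi</b> <script>x</script>", {"b", "i", "a"})
```

Note that only the tags themselves are removed; the text between a disallowed opening tag and its closing tag stays in place.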
Both breaks and texbreaks turn on conversion of newline characters into
paragraph breaks, but in different ways. The filter enabled by texbreaks
works in TeX/LaTeX style: it takes empty lines (or, strictly speaking,
lines that contain nothing but possibly whitespace) as paragraph breaks,
and wraps continuous text fragments (those not containing empty lines)
with <p> and </p>. This mode is suitable if you use a text editor that
wraps long lines, such as vim. In particular, the texbreaks token is
used in all source files of this documentation.
The breaks token does essentially the same, but in addition it replaces
every lone newline char (that is, a newline char which is not followed
by an empty line) with <br />. This mode is inconvenient for files
prepared manually in a text editor, but is useful for texts received
through web forms, such as user comments.
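The difference between the two modes is easy to see in a simplified Python sketch. It deliberately ignores any special handling of HTML tags inside the text, so it only illustrates the paragraph logic and is not a reimplementation of the Thalassa filter. The function name to_paragraphs is hypothetical.

```python
def to_paragraphs(text, line_breaks=False):
    """Wrap runs of non-empty lines in <p>...</p>.

    With line_breaks=False (texbreaks style) newlines inside a
    paragraph are kept as they are; with line_breaks=True (breaks
    style) each of them is replaced with <br />.
    """
    paragraphs = []
    current = []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # A whitespace-only line closes the current paragraph.
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)
    joiner = "<br />" if line_breaks else "\n"
    return "\n".join("<p>" + joiner.join(p) + "</p>" for p in paragraphs)


source = "first line\nsecond line\n\nnew paragraph\n"
tex_style = to_paragraphs(source)
breaks_style = to_paragraphs(source, line_breaks=True)
```

Here tex_style keeps "first line" and "second line" in one paragraph separated by a plain newline, while breaks_style puts a <br /> between them.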
In both modes the filter is aware of some HTML/XHTML basics (but only at
a very basic level): it knows that some tags are not suitable inside
paragraphs, and that the paragraph conversion shouldn't be done within
some tags. However, the filter is not a real (X)HTML parser and has no
idea about all these DTD schemes. It simply uses a list of “special”
tags: pre, ul, ol, table, p, blockquote, h1, h2, h3, h4, h5, h6. When
any of these tags starts, the filter closes the current paragraph, if it
is open, and doesn't start any new paragraphs until the tag closes. The
list is hard-coded. It is very possible this will change in future
versions, as all this is, well, just a hack which solves certain
problems that arose here and now.
Anyway, tags that are “special” in any way with respect to paragraphs, and are not on the list above, perhaps shouldn't be allowed for user-supplied content. As for content created by the site's owner or administrator, the results of the conversion can always be checked.