Thalassa CMS logo

Thalassa CMS

Techniques, methods and other things banned at the project

Before joining the project, please note it is highly likely that some (if not all) tools you might be used to, are considered harmful here and hence are banned. Here is our local ban-list:


External dependencies, including interpreters and ecosystems

External dependencies can be build-time and run-time. So here's the rule:

The idea behind the rule is obvious: the less your program needs to build and run, the better. However, there's also another idea. If you, as the author of a program, by simply choosing what to use and what not to use, can solve in advance any problem which either a user of the program, or a package maintainer may experience, then you must do so for any program you're going to let anyone use.

First of all, forget about interpreted languages that have their “ecosystems”, such as Perl, Python, Ruby and so on. Interpreted execution is only acceptable for scripting and built-in DSL languages, but in both cases there must be no damn ecosystem, as the ecosystem will become run-time dependency even if the interpreter itself is built-in. Smaller languages such as Tcl are good for built-in interpreters. Often it even makes sense to implement your own language, like another small dialect of Lisp or whatever. So, languages like Perl, Python, Ruby and so on, are not acceptable as general-purpose languages because of their interpreted execution model (so there's obvious run-time dependency: the interpreter itself), and they are not acceptable as built-in languages because of their ecosystems. Actually they have no valid purpose at all, and learning such languages is simply waste of time.

All these semi-interpreted languages like Java or C# should not be considered for any task as well, because their “run-time environments”, such as JVM, dot-net, mono etc., are run-time external dependencies, too.

Besides that, you must always make sure your program can be built as a statically-linked binary. It is okay to keep dynamic linkage possible for cases when the program is transferred to the target machine as a source tarball and gets built there, but for all cases of binary distribution, including distribution as a package for a particular package manager (such as deb, yum, rpm and the like), all executable binaries must depend on nothing but the OS kernel. And no, packing all the “shared” libs together with the binary is NOT a solution.

No external libraries are tolerable, at all; the exception we unwillingly make for some parts of the C standard library is temporary. If you think you really need a certain library, please consider it carefully: you will, first of all, need to put it into the main source tree of the project, which effectively means it becomes another piece of code we (you!) have to maintain. And to build every time we decide to do a complete rebuild.

Next thing to remember is that we limit the language subsets used in the project, and all parts of the code within the project must obey, including the libraries. Also this ban list is in effect, too; chances are that your beloved library uses some crap like cmake, autoconf or something like that. You must remove all that crap and make the library build without these in order it to be acceptable for the project.

And, last but not least, hardly we agree to tolerate a huge library for one or two functions you needed from it. In most cases it is necessary to prepare a subset of a library, which often involves brutally hacking its modules.

From the other hand, if the library you want only consists of a single small module with a clear interface, written in C90 or K&R C, perhaps it is not a problem to add it.

If your program uses some binary data not intended to be supplied by the user, such as icons or other images, you must make sure they all are built into the executable. The most obvious way to do so is to convert every such binary into a C module containing a single initialized char array, and simply use such generated module in your program.

There are some points to mention about compile-time dependencies, too. First of all, forget about autoconf/autotools and the like. They tend to produce problems with portability, instead of solving them. And don't even think about cmake. Yes, there are systems around where cmake is not available by default, so a package maintainer (or, worse, a user) will have to build cmake itself in order to build your program. Furthermore, cmake is easy to use for programmers, like, “take this bunch of code and create a binary for me”, but is a real nightmare for package makers.

Another thing is not so obvious: be extremely discriminating with features provided by the so-called “standard library”. Some of them may even ruin your capability of building statically (like Glibc's getpwnam), some other, like locales, will silently make your binary dependent on external data files. And a lot of them may accidentally make your code less portable (this is specially true for GNU extensions and functions invented by “standards”). Remember one thing: the less you depend on, the better.

In particular, please do your best not to suck in all these locale-related stuff. E.g., don't use functions from <ctype.h>, such as isspace, isalpha, isdigit and so on. Single thoughtless

    if (isspace(c)) {

instead of just

    if (c == ' ' || c == '\t' || c == '\n' || c == '\r' ) {

will ruin everything.

In fact, it is highly likely we'll get rid of the standard library one day, completely. The less parts of it our program will depend, the less effort we'll have to spend doing it, and this should be always kept in mind.

Client-side execution

The situation when a code is downloaded from the Internet and then immediately executed, is too usual nowadays. It is not only JavaScript executed within the user's browser; automatic software updates are effectively the same sh*t. And all this is not only inappropriate, it is horrible. When you get your code executed on someone else's device, it means you take over the control, and this is not the thing to do without explicit consent of the device's owner. In fact, “download-and-execute” is an anti-pattern; well, ideally this should be considered criminal and punished by several years in prison; if it is not yet so, it's only because all these law-makers don't understand how computers work.

But we do understand, right?

So, as a rule, any program can be executed on someone's computer only after explicit installation performed by the user, and upon explicitly given user's well-informed consent.

Once again: the consent must be given by the user explicitly, for every particular case when any new program code is transferred to the user's device, before it is actually executed. All these “implicit consents”, and even some “explicit” cases like a paragraph in a EULA which noone reads, don't work here.

For the particular project — Thalassa CMS — this primarily means one thing: forget about JavaScript (and, generally speaking, about the very idea of executing anything within the browser), once and forever.

Some devil's advocates are likely to ask one question they're sure to be smart: hey, HTML is a code, too, and it is interpreted at the user's computer, and perhaps formats like jpeg or mp3 are not much different from programs, too. And if these are not programs but simply data, then why don't consider JavaScript being just data, not a program, as well? Where is the exact border between programs and data?

The obvious answer here is Turing-completeness of the interpreter, but unfortunately this doesn't work against devil's advocates. Once they hear about Turing machine, they always come up with the argument they tend to consider killing: all the real-world computers are finite, so they are always Turing-incomplete. Computer scientists often try to argue somehow, but the problem is that devil's advocates are not willing to gain any knowledge, they merely wish to convince the public (not the opponent) that the “download-and-execute” crime is not really a crime nor even a problem, that's all.

So, Turing-completeness doesn't work for us, and here's our answer on where's the border: a format must be considered representing programs, not data, in case it allows to make loops of any kind, be it explicit loop constructions, an implicit loop made by a jump backwards, or a recursion.

HTML, fortunately, doesn't have anything of these. Okay, we know about these XML entity bombs, but the possibility to create new XML entities is disabled in browsers.

Once again: if the data format has features to let a certain portion of data to be interpreted more than once, then it's no data, it's a program code.

Yes, PDF is no data. PDF “documents” are actually programs. And it is a real problem, but fortunately, as long as we discuss Thalassa CMS, it is not our problem.

Text-based formats with recursive nesting

All these “semi-structured” formats, primarily SGML and its descenders XML and HTML, were initially intended for native language text markup. The mere fact people come up with other markup languages, such as ReST, BBcode, various wiki languages, Gemtext etc., clearly demonstrates one thing: HTML is a failure. These “angle-bracketed tags” are too bad even for the purpose they were intended: for native-language document formatting.

As of using XML for machine-readable data (which is not a native-language text), it is simply a nonsense, and for healthy-minded people it was clear from the very start. But healthy-minded people are a dying-out species nowadays, as clearly shown by the mere fact such an ugly beast as JSON appeared long after it became clear nothing like this has any right nor even a reason to exist.

So here's the rule: data formats with recursive nesting are forbidden.

Primarily this applies to text-based formats, namely SGML-based languages (XML and HTML) and JSON, but in case you encounter binary format defined to have recursive nesting, please remember it is in no way better.

NB: if you think you really need recursive nesting, it means you just don't realize something important. Look, there's theory of databases. It is a well-known fact that relational model is able to solve any database-related task. And the First Normal Form (1NF) — the simplest one among the nine known normal forms — is, simply speaking, just a requirement for all values (elements in any column of any row) to be atomic.

Someone might argue that “modern” database engines allow non-atomic data and even JSON; the answer here is that the modern IT industry is driven by idiots who don't bother learning any theory, and, furthermore, never bother to think what they do.

Okay, there's one obvious exception from the rule, which is a forced measure — without it, we wouldn't be able to create any CMSes. Recursively nesting data formats are allowed for interoperability with external software, provided that they are not used internally. So, our program can both analyse and compose data in such formats, but must never store its own data in these formats.

Well, templates and snippets used to compose HTML data should be stored as they are, but this is no violation of our rule: all these snippets and templates are just strings from which a larger text is composed.

The last thing to mention specially for devil's advocates: most of programming languages allow unlimited recursive nesting. Well, not for everything, e.g., you can't place one program inside another, and for statically-compiled languages, you shouldn't try to place one function or procedure inside another one, even if it is supported by the compiler. Anyway, at least arithmetic equations and control statements allow for (formally unlimited) nesting.

Well, programs are not data, they are programs. Bye.

By the way, MIME is recursively nesting, as you can put one MIME container into another. We have to analyse MIME-formatted stream in exactly one situation: for file-uploading webforms. We never generate anything like MIME containers, messages etc; in particular, if we send emails, they are in plain text. Actually, this is not the only (and even not the main) reason to ban everything related to MIME.

Locales

As we already mentioned, locales should be avoided with the best possible effort. Once we need an international program, we should use our own system of replaceable message sets, created specifically for the task at hands.

Unicode (limitations)

Briefly speaking, Unicode is another committee-made bastard. However, as of now, there's no other numbered list of all printable characters known to the mankind, so we use it in this role. However, it is no reason to let commitee-made nonsense to slip into our software, hence there are some certain limits.

Using utf8 as the only encoding

If you don't want to bother with different character encodings, you are at your right, but once your program only supports a single character encoding, that encoding must be ASCII (or US-ASCII, if you prefer). If you decide to support utf8, then you must support other ASCII extensions as well, such as latin1, koi8r and the like.

It is hard to have recoding tables for each and every encoding that ever was in use in history, so this is not necessary. What is obligatory is to support at least some (two? three?) such encodings and provide a clear way to add more of them.

BTW, you can even omit the support for utf8 in preference for single-byte encodings, if you're brave enough. But it is strictly prohibited to support utf8 as the only encoding.

Byte-order mark (BOM) in utf8

BOM is a nonsense for utf8 as utf8 doesn't depend on byte order in any way. Hence it is prohibited to emit BOM as a part of utf8 text, and your program should (although not required to) produce a warning message if the BOM is encountered in utf8 input.

Unicode-based encodings other than utf8

Multibyte character encodings other than utf8, such as UCS-4, UTF-16, UTF-32 and their variants, well, don't exist. Period.

Unicode diacritical marks

All the code points from the so-called Combining Diacritical Marks Unicode block must either be ignored, or displayed and/or otherwise handled as completely separate characters, or even produce an error. Your program must not even try to handle them according to what damn Unicode demands.

Don't waste your time trying to comply to what all these irresponsible committees voted for. For all real-world combinations of a “main” glyph with a diacritical mark, separate code points exist.

Unicode emoji

These must either be ignored, or filtered off, or rejected, or otherwise refused. All this crap is far beyond all possible limits.

Internationalized domain names

Internationalized domain names (IDNs) are to be considered non-existing. No specific support for them must ever be provided.

It is more or less clear now that this particular nonsense has failed and is being slowly pushed out of everyday use. So, let it just die in peace. Don't invest your effort into making this death longer.

Multithreading (and shared memory)

Yes, multithreading is not allowed. Furthermore, it is not allowed to create any mutable shared memory (there's no problem with sharing a memory section in case it is read only for all processes; actually, it is exactly what happens with executable code when several processes run the same executable binary).

If you're going to explain how efficient it is to use shared memory, then don't. In reality, nobody bothers with any practical demonstration of the fact multithreading is really needed for efficiency reasons, and in most cases, if not all, from the user's point of view, multithreaded programs freeze every now and then, primarily because of cancellation points.

So no, don't even think about multithreading.

© Andrey V. Stolyarov, 2023, 2024