
Thalassa CMS

Coding style guide


Code formatting, indentation, spaces and the like

First of all, no auto formatters, such as the well-known GNU indent program, are allowed. The rules from this section must be obeyed continuously, which means that your code must be rules-compliant at any given moment. Once you have done something to the code so that it is no longer compliant, you must do nothing else until you have made it compliant again.

The sacred 80 column rule

Thou shalt not cross 80 columns in thy file.

Once again: Thou shalt not cross 80 columns in thy file.

If you use tabs for indentation (which is not recommended, but still allowed), the 80 columns rule must be obeyed for 8-column tabs.

In fact, it is recommended to keep the lines no longer than 75 columns, but if you really need to, 78 is still okay. Even 79 is still okay. 80 is not okay, but, well, tolerable. For 81-column and longer lines, a zero-tolerance policy is in effect.

If your line doesn't want to fit into this limit, see the section devoted to long lines for further instructions (spoiler: no, there's no exception for the sacred 80 column rule).
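A minimal sketch of how the width of a line can be measured under this rule, assuming the line is given as a NUL-terminated string and tabs advance to the next multiple of 8. The names tab_width, max_columns, line_columns and line_fits are made up for this illustration and are not part of Thalassa CMS:

```c
enum { tab_width = 8, max_columns = 80 };

/* returns the number of screen columns the line occupies */
int line_columns(const char *line)
{
    int col = 0;
    int i;
    for (i = 0; line[i] && line[i] != '\n'; i++) {
        if (line[i] == '\t')
            col = (col / tab_width + 1) * tab_width;
        else
            col++;
    }
    return col;
}

/* returns 1 if the line obeys the 80 column rule, 0 otherwise */
int line_fits(const char *line)
{
    return line_columns(line) <= max_columns;
}
```

For instance, a line consisting of one tab followed by "int x;" occupies 14 columns: 8 for the tab and 6 for the text.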

People often argue there's no real reason to maintain the 80-column rule nowadays, when monitors are wide and so on. Some even recall that the figure of 80 in fact came from a punch card width; those people would tell you the punch card epoch is over so traditions should be revised.

Damn all the crap like this. To understand how misleading it is, just go to your bookshelf (well, you do have some books printed on paper, don't you? if you don't, then visit a local library or one of your friends who still has books), take any arbitrary book, printed in any year from, say, the XVIII century to the present time, in any place in the world, in any language, in any alphabet (well, not hieroglyphic, so a book in Japanese, Chinese or Korean will not fit; but any of English, Spanish, Russian, Armenian, Arabic will work, and it doesn't matter that Arabic is written right to left), open it on a random page, pick a line from somewhere in the middle of the page, and count letters, spaces and punctuation marks on that line.

The result will be 40 to 75. With 40 to 50 letters per line, books are often printed in a two-column layout; for single-column typesetting, typical line lengths are from 58 to 67 “symbols” (including spaces), 73 is rare enough, and it is absolutely predictable that you will never see a book with lines longer than 75. This is because lines longer than 75 symbols are hard for a human to read, and book publishers have known this for centuries. That's why the well-known 80-column punch card was so popular; other formats existed, but were rarely used. The first four columns were usually occupied by the line number, one was left blank, and the rest (75 columns, you see) contained the actual text. The width of 80 columns was not in any way arbitrary, and nothing has actually changed in the real reasons behind the 80-column rule since punch cards became ancient history.

So we repeat it once again: Thou shalt not cross 80 columns in thy file. It is unfair to others to force them to keep their terminal windows wider than the traditional 80 columns.

Basic indentation

The recommended indentation is four spaces, but we consider it acceptable to use two spaces, three spaces or one tab for indentation. It is prohibited to use single-space indentation, as well as more than four spaces or more than one tab.

Also the following rules are to be strictly obeyed:

Curly braces placement

Curly braces that delimit a function body are placed like this:

int f(int x)
{
    /* ... */
}

and not like this:

int f(int x) {

The only exception to this rule is made for C++ methods in case the body is placed right inside the class (or structure). Such a body must be short enough (one line, maybe two, but never more than three), and for the sake of compactness of the class header itself, it may be formatted in other ways.

Curly braces within control statements are placed like this:

    while (a != b) {
        /* ... */
    }

    if (a == b) {
        /* ... */
    } else {
        /* ... */
    }

    do {
        /* ... */
    } while (a != b);

There's one important exception to this rule, which will be discussed in the section devoted to breaking up long lines.

Breaking up long lines

There are a lot of cases where something doesn't fit on a single line of code. One of the most important of them is when the head of a statement like if, while, for or even switch becomes too long because of the conditional expression. In this situation we do the following:

Together it looks like this:

   while (!the_collection->known_set->first &&
       the_collection->to_parse->first &&
       the_collection->to_parse->first->s == ' ')
   {
       skip_space(the_collection);
   }

What we explicitly disallow here are things like the following:

   while (!the_collection->known_set->first &&
       the_collection->to_parse->first &&
       the_collection->to_parse->first->s == ' ') {
       skip_space(the_collection);
   }

or like this:

   while (!the_collection->known_set->first &&
       the_collection->to_parse->first &&
       the_collection->to_parse->first->s == ' ')
       skip_space(the_collection);

or like this:

   while (!the_collection->known_set->first &&
     the_collection->to_parse->first &&
     the_collection->to_parse->first->s == ' ')
       skip_space(the_collection);

Alphabet and language

ASCII only

For most programmers around the world, this is obvious, but unfortunately not for all; otherwise, all these “wide strings” would never have slipped into language specifications.

So here is the rule: any source file for any programming language, not only for C and C++, must only contain chars from ASCII alphabet. See 'man 7 ascii' for what ASCII alphabet is.

Non-ASCII chars, such as Latin letters with diacritics, letters from non-Latin alphabets (be it Cyrillic, Greek or whatever else), hieroglyphs, math operators and so on, are not allowed in source code. Not only are they prohibited in identifiers, where most interpreters and compilers reject them anyway; they must also never appear in string constants and even in comments.

As a rule of a dumb: a correct source file must, first of all, be considered as a sequence of 8-bit bytes, one byte per character (fortunately, not many compilers agree to work with committee-invented “encodings” such as ucs32, but things are bad enough that this has to be mentioned explicitly), and these bytes can only have the following values: 9 (tab), 10 (newline), 13 (carriage return, not recommended but still acceptable), 32 (space), 33–126 (printable ASCII chars). That's all.
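The list of allowed byte values translates directly into a check. The following is only a sketch, assuming the file contents are already in a memory buffer; the function names allowed_byte and first_bad_byte are invented here for illustration:

```c
/* returns 1 if the byte value may appear in a source file:
   9, 10, 13, or anything from 32 to 126; returns 0 otherwise */
int allowed_byte(int c)
{
    return c == 9 || c == 10 || c == 13 || (c >= 32 && c <= 126);
}

/* returns the offset of the first offending byte, or -1 if none */
int first_bad_byte(const char *buf, int len)
{
    int i;
    for (i = 0; i < len; i++) {
        if (!allowed_byte((unsigned char)buf[i]))
            return i;
    }
    return -1;
}
```

The cast to unsigned char matters on platforms where plain char is signed: without it, bytes above 127 would show up as negative values.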

English only

There's only one native language to be used within source code, and that language is English. Identifiers must be derived from English words, not Spanish, not German, not Russian, not French, not Arabic — English. Comments must be written in English, or not written at all.

Damn, this has nothing to do with American or British chauvinism even if such chauvinism really exists (which is doubtful). The original author of this text (and of Thalassa CMS) is not a native English speaker, and this fact must be obvious to any native English speaker reading this text, heh (sorry guys, I realize how disgusting it is to read a text written in your native language by a non-native author).

The mere fact is that all programmers around the world understand English at least to some extent, so English is THE language in which we programmers can communicate with each other. It isn't so bad, as, among all more or less popular native languages, English is the simplest to learn.

Identifiers

In plain C, all identifiers but macro names are written lowercase, optionally using underscores to separate the words, like this: i, namelen, name_length and so on.

Please note all but macro names means exactly this: all but macro names. Hence, enum constants are written in lowercase. So, this is okay:

    enum traffic_lights { tl_red, tl_yellow, tl_green };

But the following is NOT okay, even though you might be used to it:

    enum traffic_lights { TL_RED, TL_YELLOW, TL_GREEN };

Macro names, and only macro names, are written all-uppercase, with optional underscores, and must never be shorter than five chars. So both of these are okay:

    #define MYMESSAGE "This is a message"
    #define MY_MESSAGE "This is a message"

but all the following are NOT:

    #define MSG "This is a message"
    #define MyMessage "This is a message"
    #define mymessage "This is a message"

In plain C, mixed case in identifiers is never used, and never means never.

In C++, we use CamelCase for everything related to object-oriented programming and abstract data types (BTW, you don't confuse these two completely different paradigms, do you?). This means, effectively, that CamelCase (okay, every word starts with a capital, all the other letters are lowercase... hence, the first letter is always uppercase, do we make it clear?) is used for names of classes and methods. And that's all.

Structure names are written in CamelCase only when they are not, actually, structures as they are in plain C. E.g., if your structure has methods, or if it has some private members, then it is no longer a structure. It is up to you whether to use the class keyword for all such structures, or stick with struct sometimes, but they are no longer structures, so please name them in CamelCase.

Everything else, including

— is named all-lowercase.

Please note we never use identifiers such as isEmpty, getValue, feedTheCat and the like — that is, mixed case starting with lowercase.

Furthermore, we never use underscores in mixed-case identifiers.

And one more thing: all globally-visible identifiers must be reasonably long and as meaningful as possible. On the other hand, local variables should have short names, with rare exceptions. For example, if you're going to write a for loop with an integer loop variable that just increments or decrements (maybe with an inc/dec step other than 1), it would look stupid to give that variable a name any longer than just i, j, n and so on. However, it is strictly prohibited to use the 1-char identifiers l, o, I and O, because they can be confused with digits (yes, even the lowercase o, and yes, there are a lot of people around who don't use syntax highlighting), as well as any multichar identifiers that consist of only these four chars, such as Ill, IO, loo and so on.
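The ban on identifiers built only from the chars l, o, I and O is easy to express in code. A possible sketch follows; the function name confusable_name is hypothetical, and treating the empty string as not confusable is an assumption made here:

```c
#include <string.h>

/* returns 1 if the identifier consists only of the chars
   l, o, I and O (and is not empty), 0 otherwise */
int confusable_name(const char *id)
{
    if (!*id)
        return 0;
    return id[strspn(id, "loIO")] == '\0';
}
```

With this check, Ill, IO, loo and a lone l are all rejected, while i, len and old pass.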

More restrictions

No committee-invented typedefs

Are you already used to all these size_t, off_t, time_t, uint32_t and the like? Now (at least if you work on Thalassa CMS code) please start avoiding these as far as possible.

Unfortunately, it is not always possible. For example, if you use a syscall or a standard library function which accepts or returns a pointer to such a type, you can blame the committee that invented it, but you actually have to obey. Fortunately, it is unlikely you'll need such calls (getgroups, accept, recvfrom and the like) in Thalassa CMS.

The well-known time syscall gives a perfect example of a situation where you can avoid these idiotic type names. Instead of

  time_t tm;
  time(&tm);

please write

  long long tm;
  tm = time(0);

(replace the 0 with NULL for plain C code; in C++, keep the zero as it is the representation for a null pointer).

Side effects

There are two rules for side effects, each with one exception. The rules are:

  1. no more than one side effect per expression statement;
  2. no side effects in conditional expressions.

The first rule means it is not good to write, e.g.,

  x = v[n++];

Instead, two statements must be written:

  x = v[n];
  n++;

BTW, this means we never make use of the difference between i++ and ++i, so we always write i++. These STL addicts may argue we should definitely always write ++i instead, but the fact is that we don't use STL, so their reasoning isn't valid for us.

The obvious exception is when you need to call a function which has a side effect but nonetheless returns something important as its return value. In most cases we shouldn't ignore such values, and sometimes attempts to ignore them effectively make our program obviously wrong, like with the read syscall. So, the very minimum we have to do is to assign the value to a variable, and assignment is a side effect, too. So, a statement like

  res = func(arg1, arg2);

is considered valid, even though there are two side effects in it, but the expression in such a statement must only consist of the function call and the assignment operator. No additional operators are allowed, and no side effects are allowed for the function arguments, so the following (provided that func, foo and bar all have side effects):

  res = func(arg1) + 1;
  res = foo(bar(arg2));

are both not allowed.
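One way to bring the two rejected statements above into compliance is to split each of them into two statements with a temporary. This is only a sketch: add_to_total is a made-up stand-in for any function with a side effect, and total is hypothetical global state that it modifies:

```c
static int total = 0;   /* state changed by the side effect */

/* hypothetical function with a side effect: adds x to total */
int add_to_total(int x)
{
    total = total + x;
    return total;
}

int compliant_sum(int a)
{
    int res;
    /* not allowed:  res = add_to_total(a) + 1;  */
    res = add_to_total(a);
    res = res + 1;
    return res;
}

int compliant_nested(int a)
{
    int tmp, res;
    /* not allowed:  res = add_to_total(add_to_total(a));  */
    tmp = add_to_total(a);
    res = add_to_total(tmp);
    return res;
}
```

Each statement now contains either one call plus one assignment, or a single assignment with no calls at all.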

The second rule means you must not write anything like this:

  if (close(fd) == -1) {

nor like this:

  if (-1 == close(fd)) {

Although the latter is better than the former, it is still bad enough, because close has a side effect (actually, this side effect is what it exists for, heh...), and there must be no side effects in conditional expressions. However, for this rule there's one exception, too.
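For the close case itself, a compliant form could look like the following sketch: the call goes into its own statement, and the condition only inspects the stored result. The wrapper name close_and_report and the choice to report through perror are assumptions of this example:

```c
#include <stdio.h>
#include <unistd.h>

/* returns 1 if the descriptor was closed, 0 on failure */
int close_and_report(int fd)
{
    int res;
    res = close(fd);
    if (res == -1) {
        perror("close");
        return 0;
    }
    return 1;
}
```
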

In practice, we often need to construct a loop according to the "get, check, handle" model. Examples for such a loop are reading from a stream and the main loop in an event-driven application; well, other examples exist, too.

The problem is that the check has to be placed between getting and handling, which means “in the middle of the loop”. Programming languages don't provide us with a statement for this; in the best case they provide loops with a precondition and a postcondition, but not with an “in-the-middle-condition”. So, what is better, this?

  n = 0;
  c = getchar();
  while (c != EOF) {
      if(c == '\n') {
          printf("%d\n", n);
          n = 0;
      } else {
          n++;
      }
      c = getchar();
  }

Or, maybe, this?

  n = 0;
  for (;;) {
      c = getchar();
      if(c == EOF)
          break;
      if(c == '\n') {
          printf("%d\n", n);
          n = 0;
      } else {
          n++;
      }
  }

Or, well, finally this?

  n = 0;
  while ((c = getchar()) != EOF) {
      if(c == '\n') {
          printf("%d\n", n);
          n = 0;
      } else {
          n++;
      }
  }

Honestly speaking, all three are ugly. But the first version involves duplication of the “get” in “get, check, handle”. We are lucky if it is only a getchar, but consider the well-known select syscall with all the preparations (such as filling in the sets, computing the timeout until the closest time-based event, all that), and you will no longer be happy with duplicating such an amount of code.

The second version might look better, but when an average reader of your program sees the for (;;) (or while (1), no matter), (s)he expects a real endless loop. It is okay for a main event loop in an event-driven program, because in that case the loop only ends together with the program itself, but for simple stream reading or the like, it might look misleading.

So, here is the exception to our second rule: it is only acceptable to have a side effect within the conditional expression of a while loop (but not do-while, nor for) in case the loop is built according to the “get, check, handle” scheme and the side effect corresponds to the “get”.

Please note that there are no similar exceptions for if, switch, for and do-while. Side effects are NOT allowed in their conditional expressions.
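The same “get, check, handle” shape appears when reading from a file descriptor with the read syscall. The sketch below is an illustration only: count_lines and the buffer size are invented for this example, and storing the result of read in a plain int is an assumption that is safe here because the buffer is small:

```c
#include <unistd.h>

/* returns the number of newline chars read from fd
   before end of file or a read error */
long long count_lines(int fd)
{
    char buf[4096];
    long long count = 0;
    int n, i;
    /* the only side effect in the condition is the "get" */
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        for (i = 0; i < n; i++) {
            if (buf[i] == '\n')
                count++;
        }
    }
    return count;
}
```
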

Goto is only allowed in two situations

Many people argue goto must never be used at all. Some say exactly the opposite: that there's nothing wrong with goto (well, at all). BTW, Linus Torvalds often says this in his interviews.

Okay, they are wrong. Even Linus Torvalds.

It is really easy to turn a piece of code into a complete mess, and goto is an efficient tool for that (although, surely, other tools exist for the same purpose).

However, those who prefer to reject goto once and for all seem to be missing one important thing. The final goal is to make the code as clear as possible. Once again, the goal is not to make the code free of gotos or whatever else, it is to make the code clear.

There are exactly two situations when goto obviously makes the code easier to read, and attempts to write the same code without gotos surprisingly complicate the code. Always remember what the final goal is; whenever we see we're doing something that moves us away from the goal, it means we're doing wrong.

The first of the two situations is simple: it is when we need to bail out from inside several nested statements, such as loops and the switch statement. With only a single statement, we can use break, but it doesn't work for more than one statement.

Certainly, some obvious measures must be taken in order not to let the code become messy. The label must have a meaningful and self-descriptive name, and it must be placed right after the outermost of the loops (or, well, loops and switches) we're jumping out of. But if we do so, everything will be fine.

Some people will tell you it is easy to go without goto here. Yes, it really is. We can isolate the nested statements into a separate function and do a return from it; we can add a flag checked in the outer loops, set it in the innermost loop and do a break; we can invent other things as well. But the truth is that in this situation the code with goto will be the clearest one. Try it yourself if you don't believe it.
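A small sketch of the first situation, assuming a matrix stored row by row in a flat array; the function name find_cell and the label name search_done are made up for this example. The label sits right after the outermost loop, and the jump goes forward and outward:

```c
/* searches the rows x cols matrix m for key; on success stores
   the flat index of the cell in *pos and returns 1, else 0 */
int find_cell(const int *m, int rows, int cols, int key, int *pos)
{
    int i, j;
    int found = 0;
    for (i = 0; i < rows; i++) {
        for (j = 0; j < cols; j++) {
            if (m[i * cols + j] == key) {
                *pos = i * cols + j;
                found = 1;
                goto search_done;
            }
        }
    }
search_done:
    return found;
}
```
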

The second situation is simple, too. Suppose you grab something valuable at the start of your function, and you need to, well, ungrab it before you return. The role of “something valuable” is most often played by dynamic memory, but it can also be, e.g., an open file (okay... it could be a mutex as well if we didn't ban multithreading, but we did). Anyway, you've got to do something right before you're done, no matter how your function finishes. And now you need to... guess what? quit your function from its middle.

Okay, you can duplicate all your cleanup code from the end of the function into every place where you're going to place another return. Please don't. Better write exactly one return as the last line of your function, place all the cleanup right before it, and mark the cleanup code with a label. The label should have a short and meaningful name; quit or cleanup may be good choices, just to name a couple. To quit the function “from the middle”, use goto quit instead of return.
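The second situation might look like the sketch below, where a buffer is grabbed at the start and released at the single exit. The function first_line_length, the constant line_buf_size and the choice of -1 as the failure value are all assumptions of this illustration:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { line_buf_size = 256 };

/* returns the length of the first line of the file (without
   the newline, at most line_buf_size - 1), or -1 on failure */
int first_line_length(const char *path)
{
    int res = -1;
    char *buf;
    char *line;
    FILE *f;

    buf = malloc(line_buf_size);
    if (!buf)
        goto quit;
    f = fopen(path, "r");
    if (!f)
        goto quit;
    line = fgets(buf, line_buf_size, f);
    fclose(f);
    if (!line)
        goto quit;
    res = strcspn(buf, "\n");
quit:
    free(buf);      /* free(NULL) is a no-op, so this is safe */
    return res;
}
```

There is exactly one return and one call to free, no matter which of the three checks fails.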

Please note that in both cases goto is to be used to jump forward in the code, and at least one level from inner to outer code constructions. If you feel like doing goto in the backward direction, please recall there are three different loop statements both in C and C++ (namely while, do-while and for), so please don't invent another one with jumps. Please also don't jump from one point to another when they are at the same nesting level — this is exactly how goto turns your code into a snake wedding.

© Andrey V. Stolyarov, 2023, 2024