First of all, no auto formatters, such as the well-known GNU indent
program, are allowed. The rules from this section must be obeyed
continuously, which means that your code must be rules-compliant at any
given moment. Once you have done something to the code so that it is no
longer compliant, you must do nothing else until you have made it
compliant again.
Thou shalt not cross 80 columns in thy file.
Once again: Thou shalt not cross 80 columns in thy file.
If you use tabs for indentation (which is not recommended, but still allowed), the 80 columns rule must be obeyed for 8-column tabs.
In fact, it is recommended to keep lines no longer than 75 columns, but if you really need to, 78 is still okay. Even 79 is still okay. 80 is not okay, but, well, tolerable. For lines of 81 columns and longer, a zero-tolerance policy is in effect.
If your line doesn't want to fit into this limit, see the section devoted to long lines for further instructions (spoiler: no, there's no exception for the sacred 80 column rule).
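The tab rule is easy to check mechanically. Below is a minimal sketch in C, assuming that a tab advances to the next multiple of 8 columns; the names visual_width, fits_limit, TAB_WIDTH and MAX_COLUMNS are made up for this illustration and do not belong to any existing tool.

```c
#define TAB_WIDTH 8
#define MAX_COLUMNS 80

/* Returns the number of columns the line occupies on screen; the line
   ends at '\n' or at the terminating zero byte, whichever comes first. */
int visual_width(const char *line)
{
    int col = 0;
    int i;
    for (i = 0; line[i] && line[i] != '\n'; i++) {
        if (line[i] == '\t')
            col = (col / TAB_WIDTH + 1) * TAB_WIDTH;
        else
            col++;
    }
    return col;
}

/* Returns 1 if the line stays within the hard limit, 0 otherwise. */
int fits_limit(const char *line)
{
    return visual_width(line) <= MAX_COLUMNS;
}
```

With these assumptions, a line that consists of one tab followed by 73 ordinary characters is already 81 columns wide, so it does not fit.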
People often argue there's no real reason to maintain the 80-column rule nowadays, when monitors are wide and so on. Some even recall that the figure of 80 in fact came from a punch card width; those people would tell you the punch card epoch is over so traditions should be revised.
Damn all the crap like this. To understand how misleading it is, just go to your bookshelf (well, you do have some books printed on paper, don't you? if you don't, then visit a local library or one of your friends who still has books), take any arbitrary book, printed in any year from, say, the XVIII century to the present time, in any place in the world, in any language, in any alphabet (well, not hieroglyphic, so a book in Japanese, Chinese or Korean will not fit; but any of English, Spanish, Russian, Armenian, Arabic will work, and it doesn't matter that Arabic is written right to left), open it on a random page, pick a line from somewhere in the middle of the page, and count the letters, spaces and punctuation marks on that line.
The result will be 40 to 75. With 40 to 50 letters per line, books are often printed in a two-column layout; for single-column typesetting, typical line lengths are from 58 to 67 “symbols” (including spaces), 73 is rare enough, and it is absolutely predictable that you will never see a book with lines longer than 75. This is because lines longer than 75 symbols are hard for a human to read, and book publishers have known this for centuries. That's why the well-known 80-column punch card was so popular; other formats existed, but were rarely used. The first four columns were usually occupied by the line number, one was left blank, and the remaining 75 columns, you see, contained the actual text. The width of 80 columns was not in any way arbitrary, and nothing has actually changed in the real reasons behind the 80-column rule since punch cards became ancient history.
So we repeat it once again: Thou shalt not cross 80 columns in thy file. It is unfair to others to force them to keep their terminal windows wider than the traditional 80 columns.
The recommended indentation is four spaces, but we consider it acceptable to use two spaces, three spaces or one tab for indentation. It is prohibited to use single-space indentation, as well as more than four spaces or more than one tab.
Also the following rules are to be strictly obeyed:
Curly braces that delimit a function body are placed like this:
int f(int x)
{
    /* ... */
}
and not like this:
int f(int x) {
The only exception to this rule is made for C++ methods in case the body is placed right inside the class (or structure). Such a body must be short enough (one line, maybe two, but never more than three), and for the sake of compactness of the class header itself, it may be formatted in other ways.
Curly braces within control statements are placed like this:
while (a != b) {
    /* ... */
}

if (a == b) {
    /* ... */
} else {
    /* ... */
}

do {
    /* ... */
} while (a != b);
There's one important exception to this rule, which will be discussed in the section devoted to breaking up long lines.
There are a lot of cases where something doesn't fit on a single code
line. One of the most important of them is when the head of a statement
like if, while, for or even switch becomes too long because of the
conditional expression.
In this situation we split the head across several lines and place the
“{” on a separate line, precisely under the first char of the
statement's name. This is an exception to the general rule that
prescribes writing the “{” on the same line as the statement's head.
Together it looks like this:
while (!the_collection->known_set->first &&
    the_collection->to_parse->first &&
    the_collection->to_parse->first->s == ' ')
{
    skip_space(the_collection);
}
What we explicitly disallow here are things like the following:
while (!the_collection->known_set->first &&
    the_collection->to_parse->first &&
    the_collection->to_parse->first->s == ' ') {
    skip_space(the_collection);
}
or like this:
while (!the_collection->known_set->first &&
    the_collection->to_parse->first &&
    the_collection->to_parse->first->s == ' ')
    skip_space(the_collection);
or like this:
while (!the_collection->known_set->first &&
       the_collection->to_parse->first &&
       the_collection->to_parse->first->s == ' ')
    skip_space(the_collection);
For most programmers around the world, this is obvious, but unfortunately not for all; otherwise, all these “wide strings” would never have slipped into language specifications.
So here is the rule: any source file for any programming language, not
only for C and C++, must only contain chars from the ASCII alphabet. See
'man 7 ascii' for what the ASCII alphabet is.
Non-ASCII chars, such as Latin letters with diacritics, letters from non-Latin alphabets (be it Cyrillic, Greek or whatever else), hieroglyphs, math operators and so on, are not allowed in source code. Not only are they prohibited in identifiers, where most interpreters and compilers reject them anyway; they must also never appear in string constants or even in comments.
As a rule of thumb: a correct source file must, first of all, be considered a sequence of 8-bit bytes, one byte per character (fortunately, not many compilers agree to work with committee-invented “encodings” such as ucs32, but things are bad enough that this has to be mentioned explicitly), and these bytes can only have the following values: 9 (tab), 10 (newline), 13 (carriage return, not recommended but still acceptable), 32 (space), 33–126 (printable ASCII chars). That's all.
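The list of permitted byte values translates directly into a check. The following is only a sketch; the function names is_allowed_byte and first_bad_byte are invented for this example and are not part of Thalassa CMS.

```c
/* Returns 1 for the byte values permitted in a source file:
   9 (tab), 10 (line feed), 13 (carriage return), 32..126. */
int is_allowed_byte(int c)
{
    return c == 9 || c == 10 || c == 13 || (c >= 32 && c <= 126);
}

/* Returns the offset of the first byte that is not permitted,
   or -1 if the whole buffer is fine. */
long first_bad_byte(const unsigned char *buf, long len)
{
    long i;
    for (i = 0; i < len; i++) {
        if (!is_allowed_byte(buf[i]))
            return i;
    }
    return -1;
}
```

Run over the contents of a file, first_bad_byte would point at the first UTF-8 sequence, NUL byte or other stray byte, if there is one.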
There's only one native language to be used within source code, and that language is English. Identifiers must be derived from English words, not Spanish, not German, not Russian, not French, not Arabic — English. Comments must be written in English, or not written at all.
Damn, this has nothing to do with American or British chauvinism, even if such chauvinism really exists (which is doubtful). The original author of this text (and of Thalassa CMS) is not a native English speaker, and this fact must be obvious to any native English speaker reading this text, heh (sorry guys, I realize how disgusting it is to read a text written in your native language by a non-native author).
The mere fact is that all programmers around the world understand English at least to some extent, so English is THE language we programmers can use to communicate with each other. It isn't so bad, as, among all more or less popular native languages, English is the simplest to learn.
In plain C, all identifiers but macro names are written lowercase,
optionally using underscores to separate the words, like this:
i, namelen, name_length and so on.
Please note all but macro names means exactly this: all but macro names. Hence, enum constants are written in lowercase. So, this is okay:
enum traffic_lights { tl_red, tl_yellow, tl_green };
But the following is NOT okay, even though you might be used to it:
enum traffic_lights { TL_RED, TL_YELLOW, TL_GREEN };
Macro names, and only macro names, are written all-uppercase, with optional underscores, and must never be shorter than five chars. So both of these are okay:
#define MYMESSAGE "This is a message"
#define MY_MESSAGE "This is a message"
but all the following are NOT:
#define MSG "This is a message"
#define MyMessage "This is a message"
#define mymessage "This is a message"
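The macro naming rule can be expressed as a small predicate. This is a sketch only: the function name is_valid_macro_name is made up, and it assumes that a macro name consists solely of the letters A to Z and underscores, since the text above says nothing about digits.

```c
/* Returns 1 if the name is all-uppercase (underscores permitted)
   and at least five chars long, 0 otherwise.  Digits are rejected
   here; that is an assumption of this sketch. */
int is_valid_macro_name(const char *name)
{
    int i;
    for (i = 0; name[i]; i++) {
        if (name[i] != '_' && (name[i] < 'A' || name[i] > 'Z'))
            return 0;
    }
    return i >= 5;
}
```

Under these assumptions it accepts MYMESSAGE and MY_MESSAGE and rejects MSG, MyMessage and mymessage.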
In plain C, mixed case in identifiers is never used, and never means never.
In C++, we use CamelCase for everything related to object-oriented programming and abstract data types (BTW, you don't confuse these two completely different paradigms, do you?). This means, effectively, that CamelCase (okay, every word starts with a capital, all the other letters are lowercase... hence, the first letter is always uppercase, do we make it clear?) is used for names of classes and methods. And that's all.
Structure names are written in CamelCase only when they are not,
actually, structures as they are in plain C; e.g., if your structure has
methods, or if it has some private members, then it is no longer a
structure. It is up to you whether to use the class keyword for all such
structures, or stick with struct sometimes, but they are no longer
structures, so please name them in MixedCase.
Everything else is named all-lowercase.
Please note we never use identifiers such as isEmpty, getValue,
feedTheCat and the like; that is, mixed case starting with lowercase.
Furthermore, we never use underscores in mixed-case identifiers.
And one more thing: all globally-visible identifiers must be reasonably
long and as meaningful as possible. On the other hand, local variables
should have short names, with rare exceptions. For example, if you're
going to write a for loop with an integer loop variable that just
increments or decrements (maybe with an inc/dec step other than 1), it
would look stupid to name that variable anything longer than just i, j,
n and so on. However, it is strictly prohibited to use the 1-char
identifiers l, o, I and O, because they can be confused with digits
(yes, even the lowercase o, and yes, there are a lot of people around
who don't use syntax highlighting), as well as any multichar identifiers
that consist of only these four chars, such as Ill, IO, loo and so on.
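The ban on identifiers built only from the chars l, o, I and O is simple enough to test in code. The sketch below uses a made-up function name, is_confusable_name, and treats an empty string as not confusable.

```c
#include <string.h>

/* Returns 1 if the name consists only of the chars l, o, I and O
   (and is not empty), 0 otherwise. */
int is_confusable_name(const char *name)
{
    int i;
    if (!name[0])
        return 0;
    for (i = 0; name[i]; i++) {
        if (!strchr("loIO", name[i]))
            return 0;
    }
    return 1;
}
```

It flags l, Ill, IO and loo, while i, n and low pass.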
Are you already used to all these size_t, off_t, time_t, uint32_t and
the like? Now (at least if you work on Thalassa CMS code) please start
avoiding these as far as possible.
Unfortunately, it is not always possible. For example, if you use a syscall or a standard library function which accepts or returns a pointer to such a type, you can blame the committee that invented it, but you actually have to obey. Fortunately, it is unlikely you'll need such calls (getgroups, accept, recvfrom and the like) in Thalassa CMS.
The well-known time syscall gives a perfect example of a situation where
you can avoid these idiotic type names. Instead of
time_t tm;
time(&tm);
please write
long long tm;
tm = time(0);
(replace the 0 with NULL for plain C code; in C++, keep the zero as it
is the representation for a null pointer).
There are two rules for side effects, each with one exception. The rules are:
The first rule means it is not good to write, e.g.,
x = v[n++];
Instead, two statements must be written:
x = v[n];
n++;
BTW, this means we never make use of the difference between i++ and ++i,
so we always write i++.
Those STL addicts may argue we should definitely always write ++i
instead, but the fact is that we don't use STL, so their reasoning isn't
valid for us.
The obvious exception is when you need to call a function which has a
side effect but nonetheless returns something important as its return
value. In most cases we shouldn't ignore such values, and sometimes
attempts to ignore them effectively make our program obviously wrong,
like with the read syscall. So, the very minimum we have to do is to
assign the value to a variable, and assignment is a side effect, too.
So, a statement like
res = func(arg1, arg2);
is considered valid, even though there are two side effects in it, but
the expression in such a statement must only consist of the function
call and the assignment operator. No additional operators are allowed,
and no side effects are allowed for the function arguments, so the
following statements (provided that func, foo and bar all have side
effects):
res = func(arg1) + 1;
res = foo(bar(arg2));
are both not allowed.
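A compliant rewrite of such statements simply gives every side effect a statement of its own. The sketch below assumes two hypothetical functions with side effects, next_ticket and log_ticket; neither exists anywhere outside this example.

```c
static int tickets_issued = 0;
static int last_logged = 0;

/* Hypothetical function with a side effect: advances a counter. */
int next_ticket(void)
{
    tickets_issued++;
    return tickets_issued;
}

/* Hypothetical function with a side effect: remembers its argument. */
int log_ticket(int t)
{
    last_logged = t;
    return t * 10;
}

int issue_and_log(void)
{
    int t, res;
    /* instead of: res = log_ticket(next_ticket()) + 1; */
    t = next_ticket();
    res = log_ticket(t);
    res = res + 1;
    return res;
}
```

Each of the three statements in issue_and_log is either a bare "call and assign" or has a single side effect, the assignment itself.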
The second rule means you must not write anything like this:
if (close(fd) == -1) {
nor like this:
if (-1 == close(fd)) {
Although the latter is better than the former, it is still bad enough,
because close has a side effect (actually, this side effect is what it
exists for, heh...), and there must be no side effects in conditional
expressions. However, for this rule there's one exception, too.
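For the ordinary case, a version that obeys the rule could look like the sketch below: the call goes into a statement of its own, and the condition only inspects the stored result. The wrapper name close_reporting is invented for this illustration.

```c
#include <stdio.h>
#include <unistd.h>

/* Closes the descriptor and reports a failure on stderr;
   returns whatever close returned. */
int close_reporting(int fd)
{
    int res;
    res = close(fd);
    if (res == -1) {
        perror("close");
    }
    return res;
}
```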
In practice, we often need to construct a loop according to the "get, check, handle" model. Examples for such a loop are reading from a stream and the main loop in an event-driven application; well, other examples exist, too.
The problem is that the check has to be placed between getting and handling, which means “in the middle of the loop”. Programming languages don't provide us with a statement for this; in the best case they provide loops with a precondition and a postcondition, but not with an “in-the-middle condition”. So, what is better, this?
n = 0;
c = getchar();
while (c != EOF) {
    if(c == '\n') {
        printf("%d\n", n);
        n = 0;
    } else {
        n++;
    }
    c = getchar();
}
Or, maybe, this?
n = 0;
for (;;) {
    c = getchar();
    if(c == EOF)
        break;
    if(c == '\n') {
        printf("%d\n", n);
        n = 0;
    } else {
        n++;
    }
}
Or, well, finally this?
n = 0;
while ((c = getchar()) != EOF) {
    if(c == '\n') {
        printf("%d\n", n);
        n = 0;
    } else {
        n++;
    }
}
Honestly speaking, all three are ugly. But the first version involves
duplication of the “get” in “get, check, handle”. Lucky we are if it is
only a getchar, but consider the well-known select syscall with all the
preparations (such as filling in the sets, computing the timeout until
the closest time-based event, all that), and you won't be happy any
longer with duplicating such an amount of code.
The second version might look better, but when an average reader of your
program sees the for (;;) (or while (1), no matter), (s)he expects a
real endless loop. It is okay for a main event loop in an event-driven
program, because in that case the loop only ends together with the
program itself, but for simple stream reading or the like, it might look
misleading.
So, here is the exception to our second rule: it is acceptable to have a
side effect within the conditional expression of a while loop (but not
do-while, nor for) only in case the loop is built according to the “get,
check, handle” scheme and the side effect corresponds to the “get”.
Please note that there are no similar exceptions for if, switch, for and
do-while. Side effects are NOT allowed in their conditional expressions.
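The same “get, check, handle” scheme applies to the read syscall mentioned earlier. The following is a sketch under a few assumptions: the function name count_bytes is made up, the buffer size of 4096 is arbitrary, and the result of read is stored in a plain int, which is safe here only because a single call never returns more than the buffer size.

```c
#include <unistd.h>

/* Reads the descriptor until end of file; returns the total number
   of bytes read, or -1 if read reported an error. */
long count_bytes(int fd)
{
    char buf[4096];
    long total = 0;
    int rc;
    /* the only side effect in the condition is the "get" */
    while ((rc = read(fd, buf, sizeof(buf))) > 0)
        total += rc;
    if (rc == -1)
        return -1;
    return total;
}
```

Note that the if after the loop only examines rc; the call itself stays inside the while head, where the exception permits it.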
Many people argue goto must never be used at all. Some say exactly the
opposite: that there's nothing wrong with goto (well, at all). BTW,
Linus Torvalds often says this in his interviews.
Okay, they are wrong. Even Linus Torvalds.
It is really easy to turn a piece of code into a complete mess, and goto
is an efficient tool for that (although, surely, other tools exist for
the same purpose).
However, those who prefer to ban goto once and forever seem to be
missing one important thing. The final goal is to make the code as clear
as possible. Once again, the goal is not to make the code free of gotos
or whatever else, it is to make the code clear.
There are exactly two situations when goto obviously makes the code
easier to read, and attempts to write the same code without gotos
surprisingly complicate the code. Always remember what the final goal
is; whenever we see we're doing something that moves us away from the
goal, it means we're doing wrong.
The first of the two situations is simple: it is when we need to bail
out from inside several nested statements, such as loops and the switch
statement. With only a single statement, we can use break, but it
doesn't work for more than one statement.
Certainly, some obvious measures must be taken in order not to let the code become messy. The label must have a meaningful and self-descriptive name, and it must be placed right after the outermost of the loops (or, well, loops and switches) we're jumping out of. But if we do so, everything will be fine.
Some people will tell you it is easy to go without goto here. Yes, it
really is. We can isolate the nested statements into a separate function
and do a return from it; we can add a flag checked in the outer loops,
set it in the innermost loop and do a break; we can invent other things
as well. But the truth is that in this situation the code with goto will
be the clearest one. Try it yourself if you don't believe it.
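As an illustration of the first situation, here is a sketch with a hypothetical function, sum_before_negative, that walks a matrix stored row by row and has to leave both loops at once. The label sits right after the outermost loop, and the code following it runs no matter how the loops ended.

```c
/* Sums the elements of a rows-by-cols matrix in row-major order and
   stops at the first negative one; *used receives the number of
   elements that were summed. */
long sum_before_negative(const int *m, int rows, int cols, int *used)
{
    long sum = 0;
    int cnt = 0;
    int i, j;
    for (i = 0; i < rows; i++) {
        for (j = 0; j < cols; j++) {
            if (m[i * cols + j] < 0)
                goto summing_done;
            sum += m[i * cols + j];
            cnt++;
        }
    }
summing_done:
    *used = cnt;
    return sum;
}
```

A flag-based version would need an extra variable and an extra check in the outer loop to do the same.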
The second situation is simple, too. Suppose you grab something valuable at the start of your function, and you need to, well, ungrab it before you return. The role of “something valuable” is most often played by dynamic memory, but it can also be, e.g., an open file (okay... it could be a mutex as well if we didn't ban multithreading, but we did). Anyway, you've got to do something right before you're done, no matter how your function finishes. And now you need to... guess what? quit your function from its middle.
Okay, you can duplicate all your cleanup code from the end of the
function into every place where you're going to place another return.
Please don't. Better write exactly one return as the last line of your
function, place all the cleanup right before it, and mark the cleanup
code with a label. The label should have a short and meaningful name;
quit or cleanup may be good choices, just to name a couple. To quit the
function “from the middle”, use goto quit instead of return.
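The cleanup pattern might look like the sketch below. The function file_starts_with is hypothetical and serves only as an example of a function that grabs two resources, an open file and a block of dynamic memory, and has to release both on every path.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if the file starts with the given prefix, 0 if it
   doesn't, -1 on any error. */
int file_starts_with(const char *path, const char *prefix)
{
    int res = -1;
    int len, cnt;
    char *buf = NULL;
    FILE *f = NULL;

    len = strlen(prefix);
    f = fopen(path, "rb");
    if (!f)
        goto quit;
    buf = malloc(len + 1);
    if (!buf)
        goto quit;
    cnt = fread(buf, 1, len, f);
    if (cnt < len && ferror(f))
        goto quit;
    res = (cnt == len && memcmp(buf, prefix, len) == 0);
quit:
    free(buf);
    if (f)
        fclose(f);
    return res;
}
```

The cleanup code relies on both pointers being initialized to NULL, so it is safe to reach the label from any point of the function.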
Please note that in both cases goto is to be used to jump forward in the
code, and at least one level outward, from inner to outer code
constructions. If you feel like doing a goto in the backward direction,
please recall there are three different loop statements in both C and
C++ (namely while, do-while and for), so please don't invent another one
with jumps. Please also don't jump from one point to another when they
are at the same nesting level; this is exactly how goto turns your code
into a snake wedding.