Work sessions and the CAPTCHA test

Contents:

What is a work session, actually?
Work sessions in Thalassa
How sessions are implemented
Configuring CAPTCHA infrastructure

The base parameters
CAPTCHA solution form
The No Cookie special page
The Retry Captcha special page

The %[sess: ] macro

What is a work session, actually?

Originally, HTTP is request-based, which means that the client (browser) establishes a connection with the server, sends exactly one request, receives the server's response and closes the connection. Recent implementations of both browsers and servers are able to keep the connection open to save some traffic in case several requests from the same client to the same server are needed in a row, which is a typical situation, as a page can contain images, and downloading each of them takes another request. However, even if several requests are performed over one TCP connection, each of them is still considered self-contained; that is, from the server's point of view, individual requests have no relationship between them.

From the user's point of view, the picture is very different. Web sites tend to “remember” what the user did, and even if they don't, it is rare for the user to fill in a single web form and get what (s)he wants from a single request. A more usual situation is when the site offers the user several steps of a dialog.

Well, the most common situation nowadays is that the page containing the form changes right under the user's hands, and often it does so even though the user didn't request anything like that. Okay, all this is done with client-side scripting, which we not only don't use; we actually believe that publishing sites with client-side scripts should be considered criminal and punished with several years in prison.

What's even more important is that sometimes users somehow identify themselves to a site, and all subsequent actions are assumed to be done by the same user. There are also other good reasons (in contrast with bad reasons, like tracking) for a site to remember something about the user; for instance, as is the case for Thalassa, once the user has solved a CAPTCHA test, it looks fair not to force the user to do it again and again.

So, it is obvious that web sites sometimes need to identify requests coming from the same user, or, to say it another way, to separate requests coming from different users. A series of such requests, known to come from the same client, is exactly what a “work session” actually is. There are two ways to maintain sessions: the site needs either to use a cookie to store the session ID, or to embed the ID into the URL of each page.

The solution with the session ID within the URL has several fundamental flaws. First, users tend to share URLs of what they see, and an average user will not bother stripping any session IDs from what (s)he copies and pastes into a messenger to send to a friend. The server may get confused seeing the same session continuing from two different locations, and, what's more serious, the “friend” who received the URL containing the session ID may perform some actions on behalf of the user who sent the URL (and the sender will hardly realize that (s)he shared not only the URL but also an unintended “power of attorney”).

Another fundamental flaw of keeping the session ID in the URL is that it simply doesn't work for static HTML pages. Unlike various dynamically generated content, static pages are, well, static. They can “accept” (and ignore) query parameters in a URL, but once the user follows a link from a static page, that link obviously won't contain these query parameters, so the session ID, if any, will be lost. This is not a problem for those “modern” sites that hardly contain a single static page, but for sites made with Thalassa, which consist primarily of static pages (generated beforehand, not when a request is received), this makes URL-based session control effectively impossible.

So, the only thing left to us is a cookie. When a web server sends a response to its client, it can add one or more cookies to the headers of the response. A cookie is effectively a pair of strings: the first is the name, and the second is the value of the cookie. When sending another request to the same web site, the client (unless instructed otherwise by the user) adds to the request's headers all cookies set earlier by this site. Actually, cookies are intended to make longer sessions out of otherwise-isolated requests.
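The round trip of a cookie can be sketched with Python's standard http.cookies module. This is only an illustration of the mechanism, not of Thalassa's code; the cookie name example_id and its value are made up for the example.

```python
from http.cookies import SimpleCookie

# Server side: build the value of a Set-Cookie response header.
# The name and the value are purely illustrative.
outgoing = SimpleCookie()
outgoing["example_id"] = "abc123"
set_cookie_header = outgoing["example_id"].OutputString()

# Client side: the browser remembers the pair and sends it back
# in the Cookie header of its next request to the same site.
cookie_header = "example_id=abc123"

# Server side again: parse the Cookie header of the incoming request.
incoming = SimpleCookie()
incoming.load(cookie_header)
received_value = incoming["example_id"].value
```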

Unfortunately, cookies often trigger some paranoia, and it has its reasons, as cookies can also be used for tracking and other bad things. It remains unclear, though, why all these people are afraid of cookies but don't give a damn about JavaScript, which is deadly dangerous, much more dangerous and harmful than cookies. Anyway, there's no other option. At least Thalassa only sets a cookie when explicitly requested to do so by the user.

Work sessions in Thalassa

The thalassa program itself is a static content generator, so obviously neither the program itself, nor the content it generates, need any sessions. Hence, whenever we talk about work sessions in Thalassa, it must be clear that we only mean the CGI program, thalcgi.cgi.

The CGI program can do some things without a session, too. A session must already be established for any POST request, unless it is the request that actually establishes a new session. A GET request will proceed normally without a session unless the respective virtual page is configured to require a session. In case the session_required parameter is set to yes for the requested page and there is no active session, the CGI program will not send the page to the user; instead, it will send a special page configured with the [nocookiepage] section.
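The rules above can be condensed into a small decision function. This is a sketch of the described behaviour only; the function name, its arguments and the returned labels are hypothetical and do not come from the CGI program itself.

```python
def dispatch(method: str, has_session: bool,
             establishes_session: bool, session_required: bool) -> str:
    """Decide what to do with a request (labels are illustrative)."""
    if method == "POST":
        # A POST needs a session, unless it is the very request
        # that establishes one.
        if has_session or establishes_session:
            return "proceed"
        return "no_session_error"
    # A GET proceeds unless the page demands a session we don't have.
    if session_required and not has_session:
        return "nocookiepage"
    return "proceed"
```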

The intention here is to display a web form on a page, and to direct the POST request that contains the data the user entered into the form to the same URL. In such a configuration it makes sense to display the form only when the session is already established; otherwise a user may waste some time filling in the form just to get a “no session” error right after pressing the submit button. Furthermore, it is an intended (by design) scenario when a user follows a link that points to a web form, gets the [nocookiepage] special page instead of the form, reads the text about the cookie, decides to set the cookie (and establish the session), does so by solving the CAPTCHA test, and finally gets to the desired web form because all this time the user stayed at the same virtual location (that is, the URL in the browser's address line didn't change).

Establishing a session requires solving the CAPTCHA test; in the present version the CAPTCHA can't be disabled, and it's unlikely such an option will appear in the future. This is because technically every session is a file within the user database directory, so there's an obvious risk of a DoS attack; the CAPTCHA test reduces this risk, because the session is actually created only after the test is successfully passed.

The good news for users is that, at least in the present version, establishing the working session is the only situation when the CAPTCHA is to be solved. Once the session is active, the user is assumed to be the same (human).

As we mentioned already, the [nocookiepage] is a special page, which means it has no dedicated path (URI); to see it, a user with no active session must point the browser to any of the pages configured to require a session. After that, if everything is configured properly, the user receives the special page (with a CAPTCHA form), solves the CAPTCHA, submits the solution, optionally (in case either the solution is not correct, or something else went wrong) gets the [retrycaptchapage], which contains an explanation of what went wrong and another CAPTCHA picture, finally submits the correct solution, and after that receives the page (s)he initially requested, as the URL didn't change. But there's one thing to mention here: this is the moment when the session actually starts to exist, and together with the initially requested page, the user receives the cookie that contains the session ID.

How sessions are implemented

From the client's point of view, things look very simple: there's exactly one cookie, named thalassa_sessid (the name is hardcoded in the present version, which is likely to change). The cookie has a value that looks somewhat like PJBKANFHFJGBNJNM_PPIGKEIGKBOFHDKL, and some attentive users may even notice that the first part of the value (the chars before the underscore) remains the same, while the second part (after the underscore) changes every time another page served by the CGI is loaded by the browser.

Let's now explain something. First of all, in case these characters confuse you: these Latin letters are actually hexadecimal digits, but instead of the traditional 0..9, A..F, just the first 16 letters of the Latin alphabet (A..P) are used, A for zero, P for 15. Well, actually this doesn't really matter; it is only the way chosen for Thalassa CMS to generate random identifier strings: first, the desired number of random bytes is acquired, and then these bytes are translated, each to two letters, as we've just explained. Thalassa doesn't use the fact that these letters are hex digits, e.g., it never converts them back to numbers; they are just random strings of a certain length, that's all.
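The translation of random bytes to letters can be sketched as follows. The function names are hypothetical, and the order of the two letters within a byte (high half first) is an assumption; the text above only says that each byte becomes two letters from A..P.

```python
import os

ALPHABET = "ABCDEFGHIJKLMNOP"  # A stands for 0, P stands for 15


def bytes_to_letters(data: bytes) -> str:
    # Each byte gives two letters; "high half first" is an assumption.
    return "".join(ALPHABET[b >> 4] + ALPHABET[b & 0x0F] for b in data)


def random_id(nbytes: int = 8) -> str:
    # 8 random bytes give 16 letters, the length seen in the example above.
    return bytes_to_letters(os.urandom(nbytes))
```

For instance, the bytes 0x00, 0xFF and 0x1A turn into AA, PP and BK respectively.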

What's more important is that the real session identifier is the part of the cookie's value which doesn't change — that part of the string which is before the underscore (PJBKANFHFJGBNJNM in our example). The part after the underscore is a so-called token; Thalassa uses it for an additional check against session identifier interception.
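Splitting a cookie value into these two parts is straightforward. A minimal sketch, with a hypothetical function name:

```python
from typing import Tuple


def split_cookie(value: str) -> Tuple[str, str]:
    """Split a cookie value into (session ID, token) at the underscore."""
    session_id, separator, token = value.partition("_")
    if not separator:
        raise ValueError("no underscore in the cookie value")
    return session_id, token
```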

The active sessions are stored in the subdirectory named _sessions under the user database directory, where exactly one file is created for each session, and the session ID is used as the file name.

Don't worry much about using a client-supplied string as a file name. Before trying to access the file in any way, the CGI program checks whether the name is acceptable. In the present version, acceptability means that the name consists only of uppercase Latin letters, digits and the underscore char, and is exactly 16 chars long; so it is hardly possible to perform any injection here.
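The acceptability rule, as stated above, fits into a single regular expression. This is a sketch of the rule itself, not the code the CGI program uses; the names are hypothetical.

```python
import re

# Uppercase Latin letters, digits and the underscore; exactly 16 chars.
_ACCEPTABLE = re.compile(r"[A-Z0-9_]{16}")


def is_acceptable_session_id(name: str) -> bool:
    # fullmatch() rejects anything before or after the 16 allowed chars,
    # so slashes, dots and newlines never get near the file system.
    return _ACCEPTABLE.fullmatch(name) is not None
```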

The session file is a text file containing several NAME = VALUE pairs, like this:

  token = PPIGKEIGKBOFHDKL
  oldtoken = ADHFMBOOBGDOIDMB
  created = 1674939409
  expire = 1675256475

If the user logs in or at least tells his/her username without actually logging in (e.g., when requesting new passwords), more pairs get added to the session file, so it can look like this:

  token = OMBBNIGHNPJKPOOF
  oldtoken = JGDGEIINIAPHKHAF
  created = 1674939409
  expire = 1675256475
  user = charlie
  logged_in = yes
  login_time = 1675038541

Furthermore, if the site is configured to allow anonymous comments, the user, while remaining not logged in and even nameless (in the sense of login name), still may use some arbitrary string as his/her “visible name” when posting a comment; this name is stored in the session file as an additional name-val pair, like this:

  realname = Johnny the Great

It is more or less clear that date/time is stored as a “unix date” value here. Expiration time is the last request time plus 72 hours, which is hardcoded, too (yes, yes, it definitely will change in future versions). The token is, well, the token, that is, the part of the cookie value to the right of the underscore, the part that changes. And now the oldtoken: it's the previous value of the token. Thalassa checks the token validity as follows: the token from the cookie is accepted if it is equal either to the token or to the oldtoken field of the session file. This allows the session to remain active in case the script generated the content to be sent to the client (together with a new token), but something went wrong with the connection between the client and the server, so the client never received it (the CGI will never know about it).
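The session file format and the token check can be sketched as follows. The function names are hypothetical, and the parser is a simplification that only handles plain NAME = VALUE lines like those shown above.

```python
from typing import Dict

SESSION_LIFETIME = 72 * 3600  # 72 hours, in seconds


def parse_session(text: str) -> Dict[str, str]:
    """Read NAME = VALUE pairs, one per line."""
    fields = {}
    for line in text.splitlines():
        name, separator, value = line.partition("=")
        if separator:
            fields[name.strip()] = value.strip()
    return fields


def token_matches(fields: Dict[str, str], cookie_token: str) -> bool:
    # Either the current or the previous token is accepted.
    return cookie_token in (fields["token"], fields["oldtoken"])


def is_expired(fields: Dict[str, str], now: int) -> bool:
    return now > int(fields["expire"])


def next_expire(last_request_time: int) -> int:
    return last_request_time + SESSION_LIFETIME
```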

Configuring CAPTCHA infrastructure

The base parameters

The CAPTCHA implementation in Thalassa uses a simple cryptographic trick explained here, so the CGI doesn't need to store any information locally on the server until the CAPTCHA test is actually passed. What it does need is an arbitrary but not easily guessable string which is kept private (the secret). If unsure, issue a command like the following:

  dd if=/dev/urandom bs=16 count=1 status=none | shasum -b

and use the sum as your secret.
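If no suitable shell tools are at hand, a string of the same shape can be produced with Python's standard secrets module. This is just an alternative way to get a hard-to-guess string; the function name is hypothetical.

```python
import secrets


def make_secret() -> str:
    # 20 random bytes as 40 hexadecimal digits.
    return secrets.token_hex(20)
```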

Besides the secret string, there's one more parameter to set for the CAPTCHA subsystem: the timeout value. The user can't change the time value sent to the client inside the cookie response form, because otherwise the MD5 hash will not match, so the CGI can check if it took too long for the user to solve the captcha.

Both the secret and the timeout are configured with the [captcha] ini section, in which two parameters are recognized: secret for the secret string and expire for the timeout value, in seconds. For example:

  [captcha]
  secret = 11e61c749de5944b74770b2206c2bfd97472c9d3
  expire = 300

The timeout in this example is 5 minutes (300 seconds).

CAPTCHA solution form

Suppose the user wants to establish a session, so (s)he solves the CAPTCHA and submits the solution. It is important to understand that the Thalassa CGI program detects this situation by checking that the following conditions are all met at once:

it is a POST request, with the application/x-www-form-urlencoded content type;
there's no active session (the cookie isn't set or doesn't correspond to an existing session);
there's a “form input value” (intended to come from a hidden input) named command, with the value setcookie.

It is worth mentioning that the “form input” named command is not used anywhere else in the Thalassa CGI's work, and it is never used with any other value. So the pair command=setcookie should be considered a kind of magic word to detect this (very special) type of request.
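The three conditions can be sketched as a single predicate. The function names and the way the request is represented (method, content type, raw body) are assumptions made for this illustration.

```python
from typing import Dict
from urllib.parse import parse_qs


def parse_form(body: str) -> Dict[str, str]:
    # Keep the first value of every field of an urlencoded body.
    return {name: values[0] for name, values in parse_qs(body).items()}


def is_setcookie_request(method: str, content_type: str,
                         has_active_session: bool,
                         form: Dict[str, str]) -> bool:
    # Ignore optional parameters such as "; charset=UTF-8".
    media_type = content_type.split(";")[0].strip().lower()
    return (method == "POST"
            and media_type == "application/x-www-form-urlencoded"
            and not has_active_session
            and form.get("command") == "setcookie")
```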

Besides the command input value, the following inputs are expected for a setcookie request:

captcha_ip — the IP address of the client, in the most traditional decimal notation, like 198.51.100.173;
captcha_time — the time when the CAPTCHA was issued, in the form of a unix date represented as a decimal, like 1685300179;
captcha_nonce — the randomly-generated single-use number, also known as nonce, represented as a hexadecimal number, like C34A30B848CC2FC1;
captcha_token — the MD5 hash of the string built as a concatenation (in some unspecified order) of the IP, time, the correct CAPTCHA answer and the configured secret value; the hash is represented in base64, not in hex as you might be used to;
captcha_response — the response to the CAPTCHA challenge entered by the user.
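How such a stateless check may work is sketched below. Several things here are assumptions: the concatenation order (the text leaves it unspecified), the lowercasing of the answer, the order of the checks, and all function names. The returned strings reuse the rejection specifiers of the Retry Captcha page.

```python
import base64
import hashlib
import hmac
from typing import Dict, Optional


def captcha_token(ip: str, issued: int, answer: str, secret: str) -> str:
    # The concatenation order is an assumption.
    material = f"{ip}{issued}{answer}{secret}".encode("utf-8")
    return base64.b64encode(hashlib.md5(material).digest()).decode("ascii")


def check_submission(form: Dict[str, str], client_ip: str, now: int,
                     secret: str, expire: int) -> Optional[str]:
    """Return None if the solution is accepted, else a rejection reason."""
    try:
        ip = form["captcha_ip"]
        issued = int(form["captcha_time"])
        token = form["captcha_token"]
        response = form["captcha_response"]
    except (KeyError, ValueError):
        return "broken_data"
    if ip != client_ip:
        return "ip_mismatch"
    if now - issued > expire:
        return "expired"
    # Recompute the hash from the user's answer; a tampered IP or time
    # makes it differ just like a wrong answer does.
    expected = captcha_token(ip, issued, response.lower(), secret)
    if not hmac.compare_digest(expected.encode(), token.encode()):
        return "wrong_answer"
    return None
```

Since the secret never leaves the server, the client can't produce a matching hash for a changed IP address, time or answer.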

Of all these “inputs”, only the captcha_response is intended to be a real input field for the user to fill. All the others clearly should be hidden inputs; indeed, a user can hardly fill any of them. Instead, the CGI program must “fill” them when a CAPTCHA-containing page is generated. All pages the Thalassa CGI outputs (with the exception of a default error page) are derived from templates set in the configuration file, and this is true for the CAPTCHA pages, too; so one needs some means to access the respective values, as well as the CAPTCHA image. And here the %[captcha: ] macro enters the game.

The %[captcha: ] macro has several functions; when it is expanded for the first time during the particular run of the Thalassa CGI program, no matter which of its functions is used, the macro generates a brand new CAPTCHA challenge, the correct response for it, memorizes the current time and the client's IP address; after that, all functions of the macro simply return the memorized values. The macro has five functions:

%[captcha:image] — the CAPTCHA challenge image, as a PNG picture, encoded with base64;
%[captcha:ip] — the IP address of the client;
%[captcha:time] — the unix time value;
%[captcha:nonce] — the nonce value;
%[captcha:token] — the MD5 hash to be the value for the captcha_token form input.
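The “generate on first use, then return the memorized values” behaviour can be sketched like this. The class and its method are hypothetical; only the ip, time and nonce functions are modeled, and the 8-byte nonce length is an assumption taken from the 16-digit example above.

```python
import os
import time


class CaptchaState:
    """Values of one CAPTCHA challenge, created lazily, once per run."""

    def __init__(self, client_ip: str):
        self._client_ip = client_ip
        self._values = None

    def get(self, function: str) -> str:
        if self._values is None:
            # First expansion, whichever function it is: fix everything.
            self._values = {
                "ip": self._client_ip,
                "time": str(int(time.time())),
                "nonce": os.urandom(8).hex().upper(),
            }
        return self._values[function]
```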

For the present version of Thalassa, a CAPTCHA form should appear on two different pages, both “special”: [nocookiepage] and [retrycaptchapage]. The content of these pages is (well, should be) different, so it is recommended to prepare the CAPTCHA response form as a snippet within the [html] ini section. For example, the following may work:

  [html]
  cookieform =
  +<img alt="captcha" style="float:right;"
  +     src="data:image/png;base64,%[captcha:image]" />
  +<form name="captcha" action="%[req:script]%[req:path]" method="POST">
  +<input type="hidden" name="captcha_ip" value=%[q:%[captcha:ip]] />
  +<input type="hidden" name="captcha_time" value=%[q:%[captcha:time]] />
  +<input type="hidden" name="captcha_nonce" value=%[q:%[captcha:nonce]] />
  +<input type="hidden" name="captcha_token" value=%[q:%[captcha:token]] />
  +<label for="captcha_input">Please enter the string made by swapping
  + letters as shown at the picture to the right.  There are digits and
  + latin letters only, case is ignored:</label><br/>
  +<input type="text" id="captcha_input" name="captcha_response" /><br/>
  +<input type="hidden" name="command" value="setcookie" />
  +<input type="submit" value="Set cookie and create session" />
  +</form>
  +

The No Cookie special page

The No Cookie special page is the content to be displayed to the user in case the user requested a page which is configured as requiring an active session, and there's no active session.

The page is configured with a [nocookiepage] ini section which is presently supposed to contain only one parameter, named template. The parameter's value is passed through the macroprocessor to build the actual HTML document to be sent to the user.

Certainly the No Cookie page must contain the CAPTCHA solution form. The rest is generally up to you, but perhaps the page should display a brief text explaining what's going on, stating that we're going to set a cookie, that all the site's features, except for the interactive ones, should work without cookies, and that the cookie can be removed from the user's browser at any time.

The Retry Captcha special page

The Retry Captcha special page is the content displayed to the user in case the user just tried to solve the CAPTCHA (that is, submitted the CAPTCHA form in a request that meets the conditions listed above; this implies that the Retry Captcha page can only be displayed in response to a POST request, which may be important), but for any reason the solution is rejected. Actually there are five different reasons why it can be rejected, and it seems a good idea to tell the user what went wrong. So the [retrycaptchapage] ini section, which configures the Retry Captcha page, is a bit more complicated than the sections for the two other special pages.

Just like other sections defining pages, this section has the template parameter, whose value is passed through the macroprocessor to build the content to send to the user; only for this parameter's value there's an additional content-specific macro, %errmessage%, which expands to a short piece of text explaining what's the actual reason the CGI rejected the user's submission.

The error messages themselves are configurable, too; for that, there's the second parameter in the same section, named errmessage. The parameter should be set separately for each of the five predefined specifiers, which are pretty self-explanatory: ip_mismatch, expired, broken_data, wrong_answer and unknown. Setting the values for this parameter, you can choose your own wording and, e.g., translate the messages to a language other than English; for English, the following may make a good start:

  [retrycaptchapage]
  errmessage:ip_mismatch = <strong>ip address mismatch</strong>.
     It is possible you reconnected to the Internet between
     receiving the captcha and submitting the answer.
  errmessage:expired = <strong>time is over</strong> (5 minutes).
  errmessage:broken_data = <strong>broken data in the request</strong>.
     Most likely it is a server-side problem; if it repeats,
     please contact the site owner.
  errmessage:wrong_answer = <strong>wrong captcha answer</strong>.
     Please try again.
  errmessage:unknown = <strong>unknown reason</strong>.
     This must never happen in normal circumstances.  Please report
     this to the site administration.

For errmessage:expired, in case you set the [captcha]/expire parameter to something different from 300, be sure to replace “5 minutes” with whatever explains the expiration period you choose.
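The way a rejection reason ends up in the page can be sketched as a plain text substitution. This only illustrates the idea; the function name is hypothetical, the real template goes through the macroprocessor, and falling back to the unknown message for an unrecognized specifier is an assumption.

```python
# Messages keyed by the five specifiers; the wording is up to you.
ERRMESSAGES = {
    "ip_mismatch": "<strong>ip address mismatch</strong>.",
    "expired": "<strong>time is over</strong> (5 minutes).",
    "broken_data": "<strong>broken data in the request</strong>.",
    "wrong_answer": "<strong>wrong captcha answer</strong>.",
    "unknown": "<strong>unknown reason</strong>.",
}


def render_retry_page(template: str, reason: str) -> str:
    # Unrecognized specifiers fall back to "unknown" (an assumption).
    message = ERRMESSAGES.get(reason, ERRMESSAGES["unknown"])
    return template.replace("%errmessage%", message)
```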

Remember that the page must contain the CAPTCHA solution form; the rest is up to you, even explanations are not necessary as the user can only end up seeing this page after reading the No Cookie page.

The %[sess: ] macro

The session status (including absence of a session), as well as all data associated with the session in case it exists, can be examined with the %[sess: ] macro. When the session is associated with a user account (whether logged in or unverified), the macro provides access to the user account's properties as well. Furthermore, the same macro is used to access the moderation queue.

As usual, the macro accepts at least one argument which must be the function name, and for some functions it accepts more arguments. We'll be documenting the sess macro functions along with the features they are intended to be used with. As of now, we'll only discuss the functions related to work sessions as such.

%[sess:cookie] expands to the cookie value, if the session (and the cookie value) exists, otherwise it returns an empty string. It is important to note that this is the cookie's value going to be sent to the client in the HTTP headers — which means, from the client's point of view, that this value is the same as the new value of the cookie. The old value of the cookie — the one which was sent to the server along with the request for this particular page — can be taken with %[req:cookie:thalassa_sessid].

The ifvalid, ifhasuser and ifloggedin functions are conditional checkers; they all take two additional arguments: the then value and the else value, and return the former in case the condition is true and the latter if it is false. For ifvalid, the condition is if there is an established work session, so it can be used like this:

  %[sess:ifvalid:The session is active:You've got no session]

The ifhasuser's condition is that the session has an associated user name, and for ifloggedin the condition is that not only is there an associated user name, but the user has also entered a correct password, thus passing authentication; that is, the user's identity is confirmed. If this doesn't seem to be clear, please wait for the detailed description of user accounts and associated procedures.
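The three conditions can be sketched in terms of the session file fields shown earlier. The mapping of the conditions to the user and logged_in fields is an assumption made for this illustration, and the function name is hypothetical; None stands for “no session”.

```python
from typing import Dict, Optional


def sess_if(function: str, fields: Optional[Dict[str, str]],
            then_value: str, else_value: str) -> str:
    valid = fields is not None
    has_user = valid and "user" in fields
    logged_in = has_user and fields.get("logged_in") == "yes"
    conditions = {
        "ifvalid": valid,
        "ifhasuser": has_user,
        "ifloggedin": logged_in,
    }
    return then_value if conditions[function] else else_value
```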

Functions related to user accounts are described together with user accounts. Functions related to the moderation queue are described along with comments.

Thalassa CMS