Making section indices and blogs

Last update: 2019-12-05

This website and the blog are made with soupault, quite obviously. How exactly is the blog made? While soupault doesn't include any blog functionality, it allows you to extract metadata from existing pages and either render it using a built-in index generator or feed it to an external script.

Please note that if you want a full-fledged blog with tags, archives searchable by date and so on, you are better off using a specialized blog generator like Pelican. You can reimplement that functionality in a soupault workflow, but it may be more trouble than it's worth.

The indexing feature is there to simplify creating and maintaining lists of all pages in a site section, and it's also good for simple blogs, but it only gives you the index data, and what you do with that data is up to you.

How does indexing work in soupault?

Since soupault works on the element tree of the page rather than its text, it can extract content from specific elements. For example, the first <h1> of a page is usually its title. The first paragraph is often (though not always) written to give the reader a summary of the page, and makes a good abstract/excerpt. Their content can be extracted using CSS selectors like h1 and p (or something more specific like main p, if you want the first paragraph inside the <main> element).
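To make the idea concrete, here is a rough Python sketch of "take the text of the first <h1> and the first <p>". It uses only the standard library's html.parser and is not how soupault itself is implemented; the FirstElementText class and the first_text helper are names made up for this illustration, and the sample page is shortened.

```python
from html.parser import HTMLParser


class FirstElementText(HTMLParser):
    """Collect the text of the first element with the given tag name."""

    def __init__(self, wanted_tag):
        super().__init__()
        self.wanted_tag = wanted_tag
        self.inside = False   # True while we are inside the first match
        self.done = False     # True once the first match has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.wanted_tag and not self.done:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == self.wanted_tag and self.inside:
            self.inside = False
            self.done = True

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


def first_text(html, tag):
    """Return the whitespace-normalized text of the first <tag> in html."""
    parser = FirstElementText(tag)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


page = """
<h1>Why soupault?</h1>
<p>There are so many static site generators already.</p>
<p>Second paragraph.</p>
"""

title = first_text(page, "h1")
excerpt = first_text(page, "p")
print(title)
print(excerpt)
```

This sketch keeps only text, while a real excerpt may contain markup such as links, so treat it as an illustration of the principle rather than a replacement for the built-in extraction.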

This way you can make use of your existing site structure to automatically create an index for it, rather than have to extract and duplicate all metadata.

In a classic static site generator, a post would be preceded by “front matter”: a metadata section that the generator uses to create both the final page and its blog feed entry, something like:

---
title: Why soupault?
date: 2019-10-10
---
There are many...

There is nothing like it in my blog. The key idea of soupault is to work on existing HTML, so those posts are valid bodies of complete pages, not just raw material for the generator. Other SSGs generate HTML, while soupault transforms and enhances it.

Let's examine the beginning of the first blog entry:

<h1 id="post-title">Why soupault?</h1>

<span>Date: <time id="post-date">2019-10-10</time> </span>

<p id="post-excerpt">
There are so many static site generators already that another one needs a pretty good justification.
I mainly made soupault for my own use, in fact it has grown out of a set of custom scripts that used to power
the <a href="https://baturin.org">baturin.org</a> website. Still, if I'm making it public, I may as well want to
explain the decisions behind it and reasons why anyone else may want to use it.
</p>

That's just normal HTML. However, you can see that some element ids are not strictly necessary, like id="post-date". On my website, there are no special styles for those elements. They serve as a microformat that tells soupault what exactly to extract from the page to create its blog feed entry. Of course, they could also be used for styling or as page anchors.

Now let's see how the blog feed is created from those pages. First, we'll examine the [index] section from my soupault.conf. This is a slightly simplified version of it that you can easily copy into your config:

[index]
  index = true
  index_selector = "#blog-index"

  index_title_selector = ["h1#post-title", "h1"]
  index_date_selector = ["time#post-date", "time"]
  index_excerpt_selector = ["p#post-excerpt", "p"]

  newest_entries_first = true

  index_item_template = """
<h2><a href="{{url}}">{{title}}</a></h2>
<p><strong>Last update:</strong> {{date}}.</p>
<p>{{{excerpt}}}</p>
<a href="{{url}}">Read more</a>
"""

In short, it tells soupault what to extract, how to render the index, and where to insert it.

First of all, don't forget the index = true option. Index data extraction is disabled by default, so you need to enable it if you want to use it.

The index_selector option is a CSS selector of the element where the generated index is inserted. In my site/blog/index.html it's a <div id="blog-index">, so it can be uniquely identified with #blog-index. This is equivalent to $("#blog-index") in jQuery or document.getElementById("blog-index") in plain JS.

The index_title_selector, index_date_selector, and index_excerpt_selector options tell soupault what to extract. Notice that there are multiple selectors for each of them. For example, index_excerpt_selector = ["p#post-excerpt", "p"] means “extract data from a <p id="post-excerpt"> if it exists, otherwise, extract the first paragraph”. This makes it possible to use something other than the first paragraph for the post excerpt, without duplicating the data. I can just mark the excerpt paragraph explicitly with an id attribute.
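The fallback logic can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it understands just two selector forms, "tag#id" and "tag", while soupault accepts real CSS selectors, and the ElementCollector, matches, and extract names are invented for the example.

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Record tag, id and text for every element, in document order."""

    def __init__(self):
        super().__init__()
        self.elements = []     # every element seen so far
        self.open_stack = []   # elements that are currently open

    def handle_starttag(self, tag, attrs):
        element = {"tag": tag, "id": dict(attrs).get("id"), "text": ""}
        self.elements.append(element)
        self.open_stack.append(element)

    def handle_endtag(self, tag):
        # Ignore stray end tags; otherwise close everything up to the match
        if any(e["tag"] == tag for e in self.open_stack):
            while self.open_stack.pop()["tag"] != tag:
                pass

    def handle_data(self, data):
        for element in self.open_stack:
            element["text"] += data


def matches(element, selector):
    """Support only the 'tag' and 'tag#id' selector forms."""
    tag, _, wanted_id = selector.partition("#")
    if element["tag"] != tag:
        return False
    return not wanted_id or element["id"] == wanted_id


def extract(html, selectors):
    """Try selectors in order and return the text of the first match."""
    collector = ElementCollector()
    collector.feed(html)
    collector.close()
    for selector in selectors:
        for element in collector.elements:
            if matches(element, selector):
                return " ".join(element["text"].split())
    return None


excerpt_selectors = ["p#post-excerpt", "p"]

marked = '<p>Intro note.</p><p id="post-excerpt">The real summary.</p>'
unmarked = "<p>First paragraph.</p><p>Second paragraph.</p>"

print(extract(marked, excerpt_selectors))    # the explicitly marked paragraph
print(extract(unmarked, excerpt_selectors))  # falls back to the first <p>
```

The point is the order of the two loops: a more specific selector earlier in the list wins over a generic one even when the generic match appears earlier in the page.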

The newest_entries_first option tells soupault to sort entries by date in descending order, using the date extracted by the index_date_selector option. The default date format is YYYY-MM-DD, though it's configurable.
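In Python terms, the sorting step amounts to something like the sketch below. The entries list is shortened sample data, and parse_date is a helper name I made up; how soupault treats entries with missing or malformed dates is not covered here.

```python
from datetime import datetime

# Shortened sample entries, deliberately listed oldest first
entries = [
    {"title": "Why soupault?", "date": "2019-10-10"},
    {"title": "Making section indices and blogs", "date": "2019-10-11"},
]


def parse_date(entry, date_format="%Y-%m-%d"):
    """Parse the entry's date string using the YYYY-MM-DD format."""
    return datetime.strptime(entry["date"], date_format)


# reverse=True gives descending order, so the newest entry comes first
newest_first = sorted(entries, key=parse_date, reverse=True)
titles = [entry["title"] for entry in newest_first]
print(titles)
```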

Finally, there's a Mustache template used for rendering index entries. Templates are supported since version 1.6. Mustache is a simple, logic-less template language that should cover the basics. You can read about available fields and other details in the reference manual.

Using external index processors

The built-in index generator is fast and easy to use, but it's not very flexible. However, soupault can export extracted data and feed it to an external script, then include its output back in the page. This way you can do literally anything with that data, though it requires some programming skill (on the other hand, it can be a good first project for learning programming).

Index data is encoded in JSON and sent to the script's input (stdin):

[
  {
    "nav_path": ["blog"],
    "url": "/blog/blogs-and-section-indices",
    "title": "Making section indices and blogs",
    "date": "2019-10-11",
    "author": null,
    "excerpt": "\nThis website and the blog are made with soupault, quite obviously..."
  },
  {
    "nav_path": ["blog"],
    "url": "/blog/why-soupault",
    "title": "Why soupault?",
    "date": "2019-10-10",
    "author": null,
    "excerpt": "\nThere are so many static site generators already..."
  }
]

I reformatted it for better readability and shortened the excerpts. In reality it's sent as a single line with a newline character as an end-of-input marker, and the full text of each excerpt is included.

This is how we could reimplement what my setup does using a Python script:

#!/usr/bin/env python3
import sys
import json
import pystache  # third-party Mustache implementation, not in the standard library

template = """
<h2><a href="{{url}}">{{title}}</a></h2>
<p>Last update: {{date}}</p>
<p>{{{excerpt}}}</p>
<a href="{{url}}">Read more</a>
"""

renderer = pystache.Renderer()

# soupault sends the whole index as a single line of JSON
index_json = sys.stdin.readline()
index_entries = json.loads(index_json)

for entry in index_entries:
    print(renderer.render(template, entry))

How do you make soupault run that script? Suppose you saved it to scripts/index.py. Then instead of the index_item_template option, you need to add index_processor = "scripts/index.py".

On UNIX-like systems, don't forget to make it executable (chmod +x scripts/index.py). On Windows, you should make sure .py files are associated with Python and you also need to adjust the path (see below).
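An external processor doesn't have to mimic the built-in templates. As a further illustration, here is a sketch of a processor that groups posts under year headings using only the Python standard library. The render_index function and the markup it emits are my own choices for the example, not something soupault prescribes; it assumes the JSON fields shown above and silently skips entries without a date.

```python
#!/usr/bin/env python3
import html
import json
import sys
from itertools import groupby


def render_index(entries):
    """Return an HTML fragment with entries grouped under year headings."""
    # Dates in YYYY-MM-DD format sort correctly as plain strings
    dated = sorted(
        (e for e in entries if e.get("date")),
        key=lambda e: e["date"],
        reverse=True,
    )
    parts = []
    for year, group in groupby(dated, key=lambda e: e["date"][:4]):
        parts.append(f"<h2>{year}</h2>")
        parts.append("<ul>")
        for entry in group:
            url = html.escape(entry["url"], quote=True)
            title = html.escape(entry.get("title") or "")
            parts.append(f'  <li><a href="{url}">{title}</a> ({entry["date"]})</li>')
        parts.append("</ul>")
    return "\n".join(parts)


if __name__ == "__main__":
    # The index arrives as one line of JSON; do nothing on empty input
    line = sys.stdin.readline()
    if line.strip():
        print(render_index(json.loads(line)))
```

It is wired in the same way as the previous script: save it somewhere under your project and point index_processor at it.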

A note for Windows users

Soupault uses native file path syntax on every OS, so for you the script option will be index_processor = 'scripts\index.py'. Note the single quotes! Inside double quotes, the backslash is an escape character that can be used for inserting special characters inside your strings (like "\n", the newline character). Inside single quotes it has no special meaning.

If that is not enough

This is a simple blog recipe that is easy to copy. However, there are more features that add flexibility. First, soupault is not limited to those built-in fields: you can define your own fields with their names and selectors, and they will be available to the built-in templates and external scripts alike. Second, it can dump the complete site index data to a JSON file that you can use to create taxonomies and so on. That's for another time though.