Soupault 1.5 (it can work with unmodified websites as an HTML processor now)

Date: 2019-11-01

One hard limitation of classic website generators is that pages with unique layouts are excluded from any processing. You can add them as static assets, but you can't have template tags inside them.

Soupault used to have a soft version of that limitation. You could make unique pages by using a bare-minimum template like <html> <body> </body> </html> and keeping most of the layout inside page content files, but on sites where most pages are not unique, that would create lots of duplicate code. That clearly ran contrary to two of soupault's goals: being friendly to Web 1.0 style sites that feature unique pages, and making it easy to automate just what you want to automate.

Another problem was that you'd have to edit every page and strip it to the content just to give soupault a try. You could not take advantage of soupault's HTML processing capabilities without switching to the “empty page plus page bodies” workflow first. However, functionality like tables of contents, footnotes, file/snippet/script output insertion and so on could easily work with unmodified pages, so that limitation was artificial and unjustified. In this release I focused on removing that limitation.

Soupault 1.5 is easy to try out without modifying any existing page. It can now detect whether a file is a page body or a complete page. Page bodies are inserted into the template, while complete pages are just processed by widgets/plugins. Moreover, there's now an “HTML processor mode” in which it needs no template at all. This makes it much easier to make websites where many pages have a unique layout, or to use soupault to automatically enhance an existing site, e.g. inject a viewport meta tag or a table of contents into every page, or add an autogenerated list of all pages.

You can download the executables from files.baturin.org/software/soupault/1.5/.

Complete pages vs page bodies

Earlier versions assumed that everything in site/ is a page body that should be inserted into the template. If you tried to put a complete page there, it would cheerfully insert it into the template and create a page with one <html> nested inside another, which is patent nonsense.

Now it checks if a page has an <html> tag in it. If it does, it runs the page through the widgets and saves to disk. If it doesn't, then the page is first inserted in the template, then processed by widgets and saved to disk.

This way you can easily try soupault on your existing website. Run soupault --init, copy your existing pages to site/, tweak the soupault.conf config, run soupault, and check out the output in build/.

Then you can either gradually migrate to a workflow with a shared template file for non-unique pages, or keep everything as is. The choice is yours.

The element whose presence tells soupault that a file is a complete page is configurable. You can change it from <html> to <body> or something else if you want.

[settings]
  complete_page_selector = "html"
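
As a sketch of the change described above, switching the detection element to <body> would look like this. It assumes that none of your page body files contain a <body> element of their own.

```toml
[settings]
  # Treat any file that contains a <body> element as a complete page
  complete_page_selector = "body"
```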

HTML processor mode

A website where every page is unique (or where there's just one page), or where the HTML is generated by something else, is also a valid use case. With the complete page detection feature, nothing prevents you from using soupault as a configurable/programmable HTML processor rather than a website generator in the usual sense.

Now you can switch it to the HTML processor mode with this option:

[settings]
  generator_mode = false

In that mode, it doesn't require the default_template (usually templates/main.html) to exist and doesn't use it. It treats every file in site/ as a complete page instead: reads it, runs it through the widgets, and saves to the build directory.
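
For illustration, a minimal config for this mode might pair that option with a single widget. This is only a sketch: it assumes the built-in insert_html widget with its html and selector options, and the widget name inject-viewport and the meta tag value are made up for the example.

```toml
[settings]
  # No template is used; every file in site/ is treated as a complete page
  generator_mode = false

# Hypothetical widget name; adds a viewport meta tag to the <head> of each page
[widgets.inject-viewport]
  widget = "insert_html"
  selector = "head"
  html = "<meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">"
```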

Overriding directory locations from the command line

If you use soupault as an HTML processor, the concept of a “project directory” becomes meaningless. Even if you are using it as a website generator, there may also be reasons to change the output directory on the fly, e.g. to deploy your website over WebDAV.

Now you can override it all from the command line. The config file location can be set using the SOUPAULT_CONFIG environment variable. The locations of the input and output directories can be set with the --site-dir and --build-dir options. An example of overriding it all at once:

SOUPAULT_CONFIG="something.conf" soupault --site-dir some-input-dir --build-dir some-other-dir
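
For comparison, the directories that these options override normally come from the config file. The sketch below assumes the setting names site_dir and build_dir in [settings], with the default values site/ and build/ mentioned earlier; check the config generated by soupault --init for the exact names in your version.

```toml
[settings]
  # Assumed setting names; --site-dir and --build-dir would take precedence
  site_dir = "site"
  build_dir = "build"
```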

Better control over content insertion

Before 1.5, soupault would always append widget output to the existing children of the container (i.e. insert it after the last child element). If you are making a new website with soupault, you'll probably create designated places for widget output in your template. However, if you want to use it with pages not written with that in mind, you need better control over the position where the output is inserted. Now there's an option for it.

For example, this way you can insert a table of contents right before the first <h1> heading (if the page has one):

[widgets.table-of-contents]
  widget = "toc"
  selector = "h1"
  action = "insert_before"

The action option works for all built-in widgets except title and delete_element, where it makes no sense. Its possible values are: append_child, prepend_child, insert_before, insert_after, replace_element, replace_content.
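
As another sketch of the same option, the value names suggest that placing the table of contents right after the first <h1> instead only requires changing action. The widget name is arbitrary.

```toml
[widgets.table-of-contents]
  widget = "toc"
  selector = "h1"
  # Place the table of contents after the <h1> element rather than before it
  action = "insert_after"
```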

Other improvements

The delete_element widget now deletes all elements matching its selector, unless the delete_all option is false.

The autogenerated index is now inserted before widgets run, so that widgets can modify it.

One edge case where this is relevant is footnotes inside paragraphs used as blog post excerpts. Index data extraction happens after all widgets are processed, so you'd end up with a blog index page full of dangling footnote references. With these changes in place, you can solve that problem with something like:

[widgets.remove-footnotes-from-excerpts]
  widget = "delete_element"
  selector = "a.footnote"
  page = "blog/index.html"

Also, the default config generated by soupault --init is now more illustrative and well-commented.