GTK SourceView Highlighting

Robert Crowther Jan 2022

Last Modified: Mar 2025

I wasn’t going to write about this. But the scattered information on the web convinces me to.

Note that,

There is documentation, if not really enough
The results are good
The configuration system is for a limited set of GTK editors
The job is easy but tricks get in the way. If you are not reading these notes, avoid.

What we will do

You would like certain files to be coloured automatically when they are opened, as happens when you open a file of computer code.

What gadgets it works on

You have a Gtk desktop. This usually means a Debian derivative Linux, Ubuntu or the like. And gadgets that use GtkSourceView for displaying text. That usually means Gedit and its forks, e.g. Pluma and Xed.

image of gtk_sourceview_highlighting — Take it for granted, but you can do this too

Howto

Pick a MIME type

The highlighting will work on files of this type. Maybe it exists. Maybe you want to create a new MIME type.
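
If the type does not exist yet, one common way to register it is a small shared-mime-info definition. This is a minimal sketch, not the only route. It assumes a hypothetical type ‘text/x-garbage’ for files ending in ‘.garbage’, and a hypothetical file name ‘garbage.xml’ saved under ~/.local/share/mime/packages/,

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical file: ~/.local/share/mime/packages/garbage.xml -->
<mime-info xmlns="http://www.freedesktop.org/standards/shared-mime-info">
  <!-- The type name is an assumption, chosen to match the examples in this article -->
  <mime-type type="text/x-garbage">
    <comment>Garbage file</comment>
    <!-- Treat it as a kind of plain text, so text editors offer to open it -->
    <sub-class-of type="text/plain"/>
    <!-- Recognise the type by file extension -->
    <glob pattern="*.garbage"/>
  </mime-type>
</mime-info>
```

After saving, run ‘update-mime-database ~/.local/share/mime’ so the desktop picks up the new type.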

Find and choose an XDG path

This is where we put the highlighting definition. Run this to find configured Freedesktop paths,

$ printenv XDG_DATA_DIRS

You will see several paths. If you configure at a /home relative path, the changes will only work for the given user. If you configure at a /usr relative path, then the changes will appear systemwide. If you want a home relative path, but there are none in XDG_DATA_DIRS, follow the instructions in extending Freedesktop paths.

By copying, setup a new highlighting configuration

The examples will create a highlighter for files of type ‘garbage’.

The highlighting definitions built into the system are in,

/usr/share/gtksourceview-[version number]/language-specs

There’s a lot of them, on my system maybe 120. Most are computer languages, but some are for configuration files (‘changelog.lang’, ‘mallard.lang’) and some are for markup (‘markdown.lang’, ‘latex.lang’ etc.).

Pick one with some features like, or close to, the results you want to create. Check what version of GtkSourceView you have on your system (use Synaptic or apt‐get). Then make this directory, with the appropriate GtkSourceView version name,

mkdir -p ~/.local/share/gtksourceview-3.0/language-specs/

Copy your starter file into the directory, then rename,

cp /usr/share/gtksourceview-3.0/language-specs/python.lang ~/.local/share/gtksourceview-3.0/language-specs/garbage.lang

Now we need to ask this to recognise files of type ‘garbage’.

Ask the copied highlighter to recognise the target MIME

Edit the new file at,

~/.local/share/gtksourceview-3.0/language-specs/garbage.lang

and alter these XML elements. A note about the _section attribute. This is where the language highlighter will be classed in the GUIs that use ‘lang’ files. The documentation glosses this as ‘just put in Script’. Me, I have found I have uses for other classifications, especially ‘Source’ and ‘Markup’,

<!--
Set the id and name attributes. They will be used in the text editor
-->
<language id="garbage" name="Garbage" version="2.0" _section="Script">
  <metadata>
    <!--
    change or comment these out to recognise the targeted files
    -->
    <property name="mimetypes">text/x-garbage</property>
    <property name="globs">*.garbage</property>
    <!--
    comment out any other elements in metadata
    <property name="line-comment-start">#</property>
    <property name="block-comment-start">##</property>
    <property name="block-comment-end">#</property>
    -->
  </metadata>
  ...

This copied, renamed and slightly modified file should now highlight a ‘xxx.garbage’ file immediately. Let’s test.

Test

Somewhere, make a text file to use as a test,

touch test.garbage

Now you need to shut down every running instance of your text editor. All instances. Then open a terminal and launch the text editor. The reason for using a terminal to launch is that it will display any errors,

xed test.garbage

If all is good, the text editor will pop up the test file, recognise it has the extension ‘.garbage’, then find the XML ‘lang’ file. The text editor should show the language as ‘Garbage’ in the display somewhere. It should apply the contents of the configuration in the file. Currently that is a copy of some other language description. But hey, if all that happened, that is a remarkably easy start.

Edit the new file for highlighting

Ok, now you need to edit the file to highlight your way. Use the two weblinks below, and start flipping between them,

GTK language file reference: https://gitlab.gnome.org/GNOME/gtksourceview/-/blob/master/docs/lang-reference.md
GTK tutorial for the XML file: https://gitlab.gnome.org/GNOME/gtksourceview/-/blob/master/docs/lang-tutorial.md

Now you get to adapt this file to target what you want in the file, and colour it (or otherwise modify text styles). I’d like to think that’s the end of this article, but experience has told me there’s more to say about how this system works. You may want to read the notes below, too.
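
If the copied file feels too big to start from, it can help to see how little a ‘lang’ file needs. This is a minimal sketch, not a finished highlighter. It assumes a made-up ‘garbage’ syntax with ‘#’ line comments and two invented keywords, and it uses the same attribute spelling as the language element shown earlier (older versions of the builtin files spell ‘name’ as ‘_name’, so compare with the files on your system),

```xml
<?xml version="1.0" encoding="UTF-8"?>
<language id="garbage" name="Garbage" version="2.0" _section="Script">
  <metadata>
    <property name="mimetypes">text/x-garbage</property>
    <property name="globs">*.garbage</property>
  </metadata>

  <!-- Local style ids, each mapped to an abstract style from def.lang -->
  <styles>
    <style id="comment" name="Comment" map-to="def:comment"/>
    <style id="keyword" name="Keyword" map-to="def:keyword"/>
  </styles>

  <definitions>
    <!-- From a '#' to the end of the line (hypothetical comment syntax) -->
    <context id="line-comment" style-ref="comment" end-at-line-end="true">
      <start>#</start>
    </context>

    <!-- Invented keywords, for illustration only -->
    <context id="keywords" style-ref="keyword">
      <keyword>bin</keyword>
      <keyword>tip</keyword>
    </context>

    <!-- The main context: its id is the same as the language id -->
    <context id="garbage">
      <include>
        <context ref="line-comment"/>
        <context ref="keywords"/>
      </include>
    </context>
  </definitions>
</language>
```

Everything the highlighter does hangs off the main context, so a context that is defined but never referenced from it (directly or through another context) will not highlight anything.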

Styles

Did you think you would set a few style definitions like foreground colour and bold, then apply them? I did. That, in the general ambience of how Gtk is designed, was naive of me.

The overall system works nothing like this.

Outline of the styling system

Within a language highlighter, the only possible reference to style is an abstracted style definition like ‘Style this as an XML Element Name’ or ‘Style this as a default keyword’. Which would look like,

<styles>
  <style id="tag" name="Tag" map-to="xml:element-name"/>
  <style id="keyword" name="Keyword" map-to="def:keyword"/>
  ...
</styles>

Note how the prepended part of ‘map‐to’ refers to a language highlighter. By using this prefix, you can refer to the styles in any builtin language.

There is a reason for these abstracted styles. It’s so a user can switch themes. Most every gadget that uses GtkSourceView can do this, switch the color scheme from, say, a classic ‘Gtk’ look, to a ‘Tango’‐like or ‘dark’ theme.

The style definitions themselves are mainly contained in themes, which can be found at,

/usr/share/gtksourceview-3.0/styles/

I’d have a couple of questions about this but first, what was this I said about a ‘default’ lang?

The def.lang file

Well, documentation doesn’t cover this, but it’s important. I think ‘def’ means ‘default’, though it could be ‘base definition’.

The def.lang file sits in the same folder as the other builtin language files, so go take a look,

/usr/share/gtksourceview-3.0/language-specs/

Overall, the code has two purposes. First, it takes the very basic style definitions of the theme files and extends them into theme definitions more useful for highlighting, especially, computer languages. So ‘floating point’ is styled as the basic ‘number’,

<style id="floating-point"      name="Floating point number" map-to="def:number"/>

Here are some of the styles defined in ‘def.lang’, with my comments on where and for what the styles are used,

def:comment: Really is, in most highlighters, a comment
def:shebang: Gets reused for all kinds of definitions
def:doc‐comment‐element: Usually italic. Rarely used, usually for versioning info like ‘deprecated’. In Mediawiki, ‘light‐emphasis’
def:constant: People seem to use this as defined, and not otherwise
def:special‐char: Used for escaped chars/codepoints in strings
def:identifier: Think I’ve seen this in markup languages reused for titles
def:statement: An oddity, in naming anyway. Used for list and emphasis marks in markdown.lang, ID selectors in css.lang, builtin targets in automake.lang, properties in javascript.lang. Maybe described as “for important modifying marks?”
def:type: XML uses for attribute names. YAML uses for Alias, TOML for variables…
def:preprocessor: Def uses as a line continue! Ruby uses for module handling. XML for entities
def:error: You wouldn’t think a simple regex could detect errors like a parser, but this style is used to indicate some found errors, such as incorrect closures
def:warning: Like def:error, but used to indicate warnings, such as legacy code
def:note: barely used. Markdown.lang uses for linebreaks, Forth.lang for alerts
def:underlined: markup languages use it for web addresses/urls

Second, ‘def.lang’ provides some basic highlighting code, especially for computer languages, for example, ‘shebang’ and ‘octal’ highlighters, also an ‘email‐address’ highlighter.

You should go look at this file. You do not want to be repeating the work of others.
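
Borrowing from ‘def.lang’ is done with a reference context and the ‘def:’ prefix. A sketch of how that might look inside the ‘definitions’ of the ‘garbage’ highlighter. The two ids used here, ‘def:shebang’ and ‘def:shell-like-comment’, are the ones I would expect to find, but check them against the ‘def.lang’ on your own system,

```xml
<definitions>
  <!-- The main context has the same id as the language -->
  <context id="garbage">
    <include>
      <!-- Borrowed from def.lang rather than rewritten here -->
      <context ref="def:shebang"/>
      <context ref="def:shell-like-comment"/>
    </include>
  </context>
</definitions>
```

The borrowed contexts bring their own styles with them, so no entries are needed in ‘styles’ for these two.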

Warning: some of the styles do nothing!

Some of the default styles don’t style at all. For example, these are in ‘def.lang’ and seem only to be placeholders, maybe for user customisation or the future,

  <!-- Heading styles, uncomment to enable -->
  <!--
  <style name="def:heading0"                scale="5.0"/>
  <style name="def:heading1"                scale="2.5"/>
  <style name="def:heading2"                scale="2.0"/>
  <style name="def:heading3"                scale="1.7"/>
  <style name="def:heading4"                scale="1.5"/>
  <style name="def:heading5"                scale="1.3"/>
  <style name="def:heading6"                scale="1.2"/>
  -->

‘def:special‐char’ is the same. If you find nothing happening, try temporarily switching your style to a known reliable style like ‘def:comment’ or ‘def:identifier’.

But I have entities unlike other languages I want to style!

I understand. I do too. Here are some strategies.

There is a way, I think to make a new ‘style’ file to supplement what is available. I’d avoid this forever, it probably means recompiling with the new info. But if you were that keen…

Or try rooting around in the ‘Theme’ code. The base ‘Classic Theme’ contains some rum ‘language specific’ definitions, so clearly a compiling developer felt the need e.g.

<style name="xml:tags"                    foreground="cyan"/>
<style name="latex:command"               foreground="#2E8B57" bold="true"/>

Another way is to radically redefine what is available. Here are two styles from ‘markdown.lang’,

<style id="blockquote-marker" name="Blockquote Marker" map-to="def:shebang"/>
<style id="label" name="Label" map-to="def:preprocessor"/>

These are far‐flung associations, yes? A ‘blockquote marker’ is only a ‘shebang’ in that it’s a meta‐text marker, and a ‘label’ is only a ‘preprocessor’ command in that it’s a non‐content reference. Look, the further you distort the style definitions from their source, the more likely your highlighting will look unhappy in different themes. However, Markdown texts can benefit clearly from many different colours of text, so the author is borrowing all he can from what he has. You could do that, too.

Highlighting definitions

XML editing is notoriously gruelling and will make you psychotic. If you use a terminal to launch a test file, the error reports from the files are not bad, for XML. They provide a line number and misleading advice.

I found matching the simplest regex is the place to start. Building the matching is a dispiriting and not at all robust process.

Contexts

The configuration of ‘lang’ files is not conventional regex. Instead of searching globally, XML ‘contexts’ are defined. These search for a stretch of code. The ‘context’ stretch can have other searches within that stretch. And so on, recursively.

This has advantages. It breaks the regex up considerably, then labels the expressions. The ‘lang’ files are easy to read. Background code handles regex composition and compilation. And the recursive ability is quite powerful. Not as obviously powerful as a parser, it doesn’t make regex more or less expressive. But, for highlighting purposes, it’s a tidy system.

Take a highlighting problem that many of the xxx.lang files address. You would like to identify strings. Fine, look for opening and closing quotes.

<start>"</start>
<end>"</end>

But now you want to highlight escaped chars. Escaped chars are important to your readers, and are hard to spot in the middle of a string. So you make a regex to spot escaped chars.

<context id="escaped-strings" style-ref="def:special-char">
    <match>(\\[nst])</match>
</context>

Crude, but you get the idea?

The problem with this is that a general (‘global’) regex will highlight escaped chars everywhere in the source data. It would be unusual to find escaped chars free in data but, if they are there, this regex will highlight them. Worse, it will highlight escaped chars in comments and, if you are highlighting computer code, docstrings. But contexts solve that. Define like this,

<context>
  <start>"</start>
  <end>"</end>
  <include>
    <context ref="escaped-strings"/>
  </include>
</context>

Now escaped chars will only be highlighted within a string type.
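
A slightly fuller sketch of the same idea, with the style plumbing included. The ids (‘string’, ‘escape’, ‘escaped-char’) are my own invention, and the escape regex is widened to ‘a backslash and any one char’. That widening has a useful side effect: because the child context gets to match before the end of its parent, an escaped quote inside the string is swallowed by the child and does not close the string early,

```xml
<!-- Inside <styles> -->
<style id="string" name="String" map-to="def:string"/>
<style id="escape" name="Escaped Character" map-to="def:special-char"/>

<!-- Inside <definitions> -->
<context id="escaped-char" style-ref="escape">
  <!-- A backslash followed by any single character, including a quote -->
  <match>\\.</match>
</context>

<context id="string" style-ref="string" end-at-line-end="true">
  <start>"</start>
  <end>"</end>
  <include>
    <!-- Only searched for between the opening and closing quotes -->
    <context ref="escaped-char"/>
  </include>
</context>
```

The ‘end-at-line-end’ attribute is a safety net, so an unclosed quote colours one line, not the rest of the file.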

Types of context

The XML code allows several types of context. I’ve reordered the presentation here to give examples/templates for the most common constructions. These examples do not cover many attributes—look at the reference page for full details.

This is a reduced form of the ‘simple’ context. It’s the most basic match/highlight,

        <!--
        All attributes are optional
        id, for further references
        style-ref, apply a style to the match
        -->
        <context id="function" style-ref="def:statement">
            <match>^([^\n]*)$</match>
        </context>

The larger form of the ‘simple’ context can highlight, but also allows highlighting within captured groups.

        <!--
        All attributes are optional
        id, for further references
        style-ref, apply a style to the match
        -->
        <context id="function" style-ref="def:statement">
            <!-- two capture groups, so sub-patterns 1 and 2 both exist -->
            <match>^(\w+)\(([^)\n]*)\)</match>
            <!--
            Include is optional
            Only subpattern matches in the include
            -->
            <include>
                <context id="func-name" sub-pattern="1" class="builtin-functions" />
                <context sub-pattern="2" class="args" />
            </include>
        </context>

Note that this is somewhat limited. The Include element can only contain subpattern matches, which means matching can not be further nested. When I grepped the builtin xxx.lang files, I found that most of the definitions that used subpattern matches were for markup languages. The definitions were highlighting string matches. The only computer language definition that used subpattern matches extensively was Ruby. It used them to highlight Ruby’s sophisticated string manipulation facilities.
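
Subpattern matches suit one-line, fixed-shape content. A sketch for a hypothetical ‘key = value’ configuration line, colouring the key and the value differently. The ids and the choice of ‘def:type’ and ‘def:string’ are assumptions for illustration,

```xml
<!-- Inside <styles> -->
<style id="key" name="Key" map-to="def:type"/>
<style id="value" name="Value" map-to="def:string"/>

<!-- Inside <definitions> -->
<context id="assignment">
  <!-- Group 1 is the key, group 2 is everything after the equals sign -->
  <match>^(\w+)\s*=\s*(.*)$</match>
  <include>
    <context sub-pattern="1" style-ref="key"/>
    <context sub-pattern="2" style-ref="value"/>
  </include>
</context>
```

The outer context carries no ‘style-ref’, so the equals sign and surrounding space stay unstyled.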

The container context is a slightly different way of capturing a stretch. It captures the begin and maybe end of the stretch (the underlying code matches the contents),

        <!--
        Id or start are mandatory.
        id, for further references
        end-at-line-end, the attempt to match stops at line-end
        style-ref, apply a style to the match
        -->
        <context id="function" end-at-line-end="true" style-ref="def:statement">
            <start>&lt;</start>
            <end>&gt;</end>
            <!--
            Include is optional
            Any kind of context construction in the include
            -->
            <include>
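                <!-- in a container, a sub-pattern must say whether it refers to the start or the end regex -->
                <context id="func-name" sub-pattern="0" where="start" class="builtin-functions" />
                <!-- and any other kind of context can be nested too -->
                <context ref="escaped-strings" />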
            </include>
        </context>

You’ve seen them above, subpattern matches match group captures in a regex,

<context id="func-name" sub-pattern="1" class="builtin-functions" />

Keyword contexts,

    <context id="tree" style-ref="def:preprocessor">
      <keyword>Oak</keyword>
      <keyword>Ash</keyword>
      <keyword>Beech</keyword>
      <keyword>Pine</keyword>
    </context>

Optionally, contexts have ids. These contexts can then be used in different places. This becomes a ‘reference’ context,

<!--
Ref is mandatory.
ref, id of another context to use here
style-ref, apply a style which overrides style on the original context
-->
<context ref="comment-line" style-ref="def:type"/>

Notes on Contexts and regex

Matching is in order

In any context, first match wins. This can be important. As the documentation points out, you need your most particular matches first, or you will never reach later contexts. For the following two rules, one or more pound signs will always match the first context, or ‘rule’, so never reach the second,

<match>#</match>
<match>##</match>

These two contexts need to be the other way up,

<match>##</match>
<match>#</match>

If a successful match suddenly disappears, this is the first suspect.

What is the ‘class’ attribute used for?

Good question. It lets you switch spell‐checking on and off for any given context.

<context id="script" class="no-spell-check">

Most xx.lang files don’t use it for anything else. It can also be used, together with the ‘class-disabled’ attribute, to enable or disable context classes.
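
A sketch of the usual pattern, as far as I have seen it in the builtin files: the main context switches spell‐checking off for the whole language, then a comment context switches it back on. The ids and the ‘#’ comment syntax are assumptions, and a style with the id ‘comment’ is assumed to be defined in ‘styles’,

```xml
<!-- Spell-checking is off for the language as a whole -->
<context id="garbage" class="no-spell-check">
  <include>
    <!-- Inside comments the class is disabled again, so prose gets checked -->
    <context id="comment" style-ref="comment" end-at-line-end="true"
             class="comment" class-disabled="no-spell-check">
      <start>#</start>
    </context>
  </include>
</context>
```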

Errors use fallthrough

First attempt the match. If that fails, capture something close then highlight as an error. This example is not working code, it only shows the code structure,

<!-- matches URLs -->
<context ref="urls"/>

<!-- matches something like a URL, now available because the previous match failed -->
<context>
  <match>((?:http|https|ftp)\:)</match>
  <include>
    <context sub-pattern="1" style-ref="def:error"/>
  </include>
</context>
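
Here is a smaller sketch of the same fallthrough that should be closer to working. It assumes a made-up language where only the escapes \n, \s and \t are legal inside a string. The legal escapes are tried first. Any other backslash pair falls through to the second context and is marked as an error. All ids are hypothetical,

```xml
<!-- Inside <styles> -->
<style id="escape" name="Escaped Character" map-to="def:special-char"/>
<style id="error" name="Error" map-to="def:error"/>

<!-- Inside <definitions> -->
<context id="string" end-at-line-end="true">
  <start>"</start>
  <end>"</end>
  <include>
    <!-- First: the escapes the language allows -->
    <context id="good-escape" style-ref="escape">
      <match>\\[nst]</match>
    </context>
    <!-- Second: any other backslash pair, reached only if the first failed -->
    <context id="bad-escape" style-ref="error">
      <match>\\.</match>
    </context>
  </include>
</context>
```

Swap the two inner contexts and every escape, legal or not, will be marked as an error, which is the ordering problem described above.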

Contexts are for scattered items

…like finding escape codes in a string. There is a ‘once‐only’ attribute, if that’s what you are after. But if your content is ordered then it’s time for a big regex and some subpattern matching (same as it ever was).
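
For reference, a sketch of the ‘once‐only’ attribute on a hypothetical ‘Title:’ line that should be highlighted the first time it appears in its parent and not again. The id, the regex and the mapping to ‘def:statement’ are assumptions, so check the attribute against the reference page before relying on it,

```xml
<!-- Inside <styles> -->
<style id="title" name="Title" map-to="def:statement"/>

<!-- Inside <definitions> -->
<context id="title" style-ref="title" once-only="true">
  <!-- A whole line beginning with the literal text 'Title:' -->
  <match>^Title:.*$</match>
</context>
```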

Thoughts

First, I’d point out that auto‐colouring has more use than for computer code. Would you like some configuration files to be coloured for easy reading? Do you have some rough text templates? Custom HTML files? I do, several.

I can grumble. For me, XML configuration is a black mark. Though I understand the need for themes, inline styles would help. And in no way can this system match the power of an Emacs mode: it can’t change bindings, represent text content, or do partial‐parsing. And the system only covers a few editors for GTK desktops.

But I’m not going to complain. Far from it. The system uses one file, placed in an easily‐locatable and edited position. Albeit in GTK’s odd we‐are‐getting‐there way, it is documented. The configuration uses readily understood components and, considering these, makes good results. Maybe only one in 2000 users will have any interest, but I’d say this chunk of thinking is under‐loved.

Refs

Gtk used to have a noble aim of publishing all documentation as proper webpages. This seems to be lost somewhere, at least for now. So these links point to the Gitlab source code. No guarantee they’ll stay put,

Introduction: https://gitlab.gnome.org/GNOME/gtksourceview/-/blob/master/docs/lang-intro.md
Reference: https://gitlab.gnome.org/GNOME/gtksourceview/-/blob/master/docs/lang-reference.md
Tutorial: https://gitlab.gnome.org/GNOME/gtksourceview/-/blob/master/docs/lang-tutorial.md