Structured data

Robert Crowther Sep 2022
Last Modified: Feb 2023

That’s right, this post is packed with grumble. About the much‐touted ‘structured data’.

Summary

Structured data is sold with high‐flown language and ideals. These ideals may be necessary for establishing standards. In practice, the only benefits I find documented (ignoring Internet Of Things activity) are jazzier search engine listings. Which are a limited and rum (sorry ‘exciting’/‘inspiring’) set.

The only people who need consider implementing structured data markup are web‐builders who deal with permanently‐available products, libraries of some items, or event listings. If you are delivering data from those categories, structured data will generate a visual style for search results that users have come to expect. However, to achieve the visual style the markup must be in a form that Google’s parser accepts. There is little help available about this, so the structure of the markup is best copied from other sites with working examples.

After that, structured data appears to be of only theoretical benefit, can confuse pages conceptually, is a drain on site‐building resources, and may slow performance.

Structured data, or microformats, a discussion

Straight up,

Google Search works hard to understand the content of a page. You can help us by providing explicit clues about the meaning of a page to Google by including structured data on the page.

Structured Data, JSON‐LD, Microdata, RDFa

‘Structured data’ is an imprecise and clumsy name. From here on, I call it ‘structural markup’. Which is more precise, still clumsy. JSON‐LD, Microdata and RDFa are all forms of the idea. Google recommend JSON‐LD.

What’s not structured about HTML?

HTML is text with structure marks. So I don’t know where that name came from—it’s lousy. ‘Schema.org’ seeks a way through,

Usually, HTML tags tell the browser how to display the information included in the tag… However, the HTML tag doesn’t give any information about what that text string means—”Avatar” could refer to the hugely successful 3D movie, or it could refer to a type of profile picture—and this can make it more difficult for search engines to intelligently display relevant content to a user.

Ermmm, remember the old shorthand about HTML for structure, CSS for styling (‘how to display…’)?

Best that can be made of this is to say that HTML tags are not expressive enough. They can show the difference between an aside, a time and a headline. But these categories are a high level of abstraction which leans towards typographical structure. Also, the HTML standard is deliberately limited (as I recall it’s only a hundred or so elements, and new ones are debated for years). That said, while I don’t get the argument conceptually, I do get that there is a precision issue. It would help search‐engines if they knew more about what site‐authors intend.

How structural markup can/could help webpage visitors

Though I haven’t seen it said anywhere, I suggest that supplementary information would be of two kinds,

Information that categorises webpage data

For example, telling visitors, as in the Schema.org example, not only that a page contains text in an ‘Article’, but that the text is a movie review

Information that extends references in the page

So, for example, add details about a movie director

Schema.org does hint at this split, someplace talks about ‘rich information’.

About these cases, the first case, or the way I’ve pitched it, is categorisation of material. This is the business of library cataloguing—Dewey reference system and so forth. The second case, of extra information, is scattered. Librarians do grapple with extended information—all published books have an ISBN number. But this case also embraces the idea, for example, of an academic citation. So the citation can be verified by a reader, academic citations are always tagged with a mass of information. Worth noting that, compared to paper records, the abstract form of the web offers a more consistent approach.

Anyway, let’s get on with what we have. Information about structural markup is mainly at Google and ‘schema.org’. First question,

What can structural markup do in practice?

Theoretically, structural markup can provide catalogue information for a webpage. Now I ’fess up. I’ve found no information anywhere that talks about this: whether it works, or how it works. None, nada, zilch. The only deductive possibility I’ve found is by looking at the recognised formats page on Google. About which more in the next paragraph. But, on our current subject, structural markup does nothing documented to help or provoke search engine cataloguing. If that is what you need, it will help custom scrapers look at your webpages. Or maybe you can live without that.

Right, and what can be done with the extended information? If a search engine recognises the data it will use it to draw a jazzy search engine listing. This look has come to be accepted, in Bing and Google especially, so if you trade in these specific kinds of data you will want to use structural markup. But it’s a rum list of target data: ‘FAQ’? ‘Estimated salary’? The useless‐looking ‘Article’ form, which can be expressed in HTML anyway? Also, a small addition: Google publish a short list of further data types that benefit from structural markup. Google can work from this data to make ‘enriched search results’, which means not only a jazzy search listing, but some search categorising and filtering. What data does this apply to? Here’s the list,

Short, isn’t it? This scene is a comedown, isn’t it? You ask, “Is that what it amounts to—a few jazzier search engine listings?” Mostly, yes.

Pause! Facebook Open Graph markup

If you are looking for one of Google’s fancy listings, and have found examples in the Google links provided, then read down. However, I note that, besides the ‘Schema.org’ definition and its various formats, there is another form of meta‐data. Which is not from that universe.

Open Graph is a Facebook initiative. It involves adding meta tags to a webpage. The definition is far smaller than ‘Schema.org’. What will Open Graph do for you? It will, mostly, target a title, image and description that will be used for social media links. The data given, like structural markup, may be entirely different to that displayed in the webpage (may have smaller photos, shorter descriptions… this is a different article…).

Open Graph structural markup is used not only by Facebook, but most other social media. Twitter went so far as to extend the specification. Most sizeable or much visited websites include Open Graph markup. Google seems to take note of Open Graph markup, at least for displaying images. Is this what you were looking for? You can read about Open Graph data on this site.
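
As a rough illustration of the above, a minimal set of Open Graph tags might look like the sketch below. The property names come from the Open Graph protocol, and ‘twitter:card’ is Twitter’s own extension. The site, URLs and text are invented for the example.

```html
<head>
  <title>Lord Of The Flies, a review</title>
  <!-- Open Graph basics: title, type, canonical URL and preview image -->
  <meta property="og:title" content="Lord Of The Flies, a review">
  <meta property="og:type" content="article">
  <meta property="og:url" content="https://example.com/reviews/lord-of-the-flies">
  <meta property="og:image" content="https://example.com/images/lord-of-the-flies-cover.jpg">
  <!-- Optional: a short description, which may differ from the text on the page -->
  <meta property="og:description" content="Schoolboys, an island, and not much hope.">
  <!-- Twitter's extension; note it uses 'name', not 'property' -->
  <meta name="twitter:card" content="summary_large_image">
</head>
```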

And now, JSON‐LD

Because Google recommends this form.

The basic JSON‐LD structure

Nobody answers this. I’m badly educated, so needed to figure. Example, with some (illegal) comments in the JSON,

<script type="application/ld+json">
{
  // This stays the same, in this example and your code. It defines the schema used
  "@context": "https://schema.org",

  // The at sign (@) states an extra step of abstraction. This property/key defines which of the allowed definitions on schema.org is being used
  "@type": "Painting",
  "name": "The Rye Marshes",
  "description": "Semi-surreal and abstracted landscape of Rye Harbour",

  // Another entity can be embedded in the top level. Here, a 'Person' description of the 'author'
  "author": {
    "@type": "Person",
    "givenName": "Paul",
    "familyName": "Nash",
    "birthDate": "1889-05-11",

    // JSON parsers can be picky about commas. Last property in a list must not have one
    "birthPlace": {
        "@type": "Place",
        "address": "London, England"
    }
  },

  // JSON parsers can be picky about commas. Last property in a list must not have one
  "dateCreated": "1932"
}
</script>

Thanks to nobody in any post or article for explaining.

Also note: properties must have the correct data type—an ‘author’ must have an embedded ‘Person’ definition (a string containing a name is invalid).

Further note: Google and Schema.org are both clear that structural markup will not accept properties not listed on Schema.org. Moreover, some properties are required. I have no idea which. ‘@context’? Other properties are optional.

Where to place JSON‐LD

Yes, where? I’ve not seen anything on Google, or elsewhere for that matter. Common advice says, on webpages, keep non‐essential material away from the critical. Some people load all non‐essential script at the bottom of the document. Some load at top, but set the ‘defer’ attribute.

All I can tell you is that when I have found JSON‐LD data on the web, it has been head‐loaded. The independent JSON‐LD website recommends a head‐load. Which makes sense to make data available to search‐engine crawlers, but not for page performance. The exception, so far, is Wikipedia, which tail‐loads.

How can JSON‐LD target elements?

Fine, if you are using other vocabularies you can, inside the HTML, place them in scope. But JSON‐LD in a head applies to the entire webpage. I’ve not found anything yet that explains how page‐loaded JSON‐LD can target/scope an item within a page. However, see the discussion of IDs further down. That’s how it is done—but nobody explains.
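
For comparison, this is roughly what in‐page scoping looks like with Microdata, one of the other vocabularies. A sketch only: the attributes (‘itemscope’, ‘itemtype’, ‘itemprop’) are standard Microdata, the page and image names are made up, and I make no claim about how any search engine treats it.

```html
<article>
  <h1>A page with two images</h1>
  <!-- itemscope opens a scope; everything inside describes this one ImageObject -->
  <figure itemscope itemtype="https://schema.org/ImageObject">
    <img itemprop="contentUrl" src="/images/main-event.jpg" alt="The main event">
    <figcaption itemprop="caption">The main event</figcaption>
  </figure>
  <!-- Outside the figure, so outside the scope: no structured data attached -->
  <img src="/images/other.jpg" alt="">
</article>
```

The point of the comparison is that the Microdata sits on the element it describes, whereas a JSON‐LD block has no such position in the page.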

Conceptual issues with structural markup

As I explained above, I don’t see a divide between structural markup and HTML. Which blur creates the following challenges. These issues are not specific to JSON‐LD, but in JSON‐LD they look obvious.

Structural markup lacks categorisation terms

I don’t have to start here, do I? It would be useful to tell a search‐engine if my generic ‘Article’ is a,

  • Short story

  • Play script

  • Poem

  • Gadget review

    I can’t. Current definitions only let me say that it’s an ‘Article’. Or ‘NewsArticle’. On Schema.org, let alone Google, you’ll not find ‘Poem’, ‘PlayScript’ etc.

    Structural markup can duplicate information

    Structural markup without categorisation or extra information duplicates what is in the page,

    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "Article",
      "headline": "Hit me with your rhythm stick"
    }
    </script>
    <article>
      <h1>Hit me with your rhythm stick</h1>
      ...
    </article>
    

    Interesting feature. Programmers call this a ‘code smell’. There is also a well‐known coding principle, “Don’t Repeat Yourself”. Me, I think ‘maintenance‐stench’.

    Structural markup may create unanchored information

    Maybe I think I’ll help a search engine with information about a special image?

    <script type="application/ld+json">
    {
      "@context": "https://schema.org/",
      "@type": "ImageObject",
      "contentUrl": "https://forever.com/images/showdown.jpg",
      "caption" : "The main event",
      "author": {
        "@type": "Person",
        "givenName": "Will",
        "familyName" : "Love"
      },
      "license": "https://example.com/license",
      "acquireLicensePage": "https://example.com/how-to-use-my-images"
    }
    </script>
    <article>
      <h1>Hit me with your rhythm stick</h1>
      <img src="/images/blue_sky.jpg">
      <img src="/images/turn_to_stone.jpg">
      <img src="/images/showdown.jpg">
      <img src="/images/overture.jpg">
    </article>
    

    Yeah, but I’ve found nothing that says where head‐placed JSON‐LD, especially, starts referring from. There are hints on Google that crawlers gather the structural markup then associate later. I don’t seem to be helping search engines much here. First they must gather the info. Then they need to scan the page, with no hierarchy guiding them. Then, if they stumble across the image, they need to cross‐check the URL (which I could also get wrong, so losing the connection). Which is not good.

    This, from Schema.org, may be a hint,

    Every web page is implicitly assumed to be declared to be of type WebPage,

    Ummmm. Structured markup could be used beyond computers, or to refer to non‐computer items. But, let’s be practical, structured markup is mainly for use on computers. And the definitions have specified that it will use a computer‐based form of identifying code, the URL. I guess not all URLs point to a webpage, but most do. So this is saying that, I think, if a page with no markup is processed by a markup processor, it is assumed to be markup of type ‘WebPage’. It may, though not for certain, assume that on a webpage with snippets of markup, those snippets are gathered into a base type of ‘WebPage’. But I don’t know that.

    Answers, or not

    The only hints I have found to these issues are in the fragment of a topic, Google advice on webpages containing multiple items. No direct answers, but suggestions that unrelated snippets of JSON‐LD can be strung together, as well as embedded/nested,

    Google Search understands multiple items on a page, whether you nest the items or specify each item individually:

    For items that are strung together,

    If there are items that are more helpful when they are linked together (for example, a recipe and a video), use @id
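
A sketch of what that linking might look like, going by how ‘@id’ works in JSON‐LD generally rather than by anything Google spells out. The URLs and names are invented. The recipe points at the video by ‘@id’, and the video is described separately under that same ‘@id’.

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Recipe",
  "@id": "https://example.com/recipes/flapjack#recipe",
  "name": "Flapjack",
  "video": { "@id": "https://example.com/recipes/flapjack#video" }
}
</script>
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "VideoObject",
  "@id": "https://example.com/recipes/flapjack#video",
  "name": "Making flapjack",
  "contentUrl": "https://example.com/video/flapjack.mp4"
}
</script>
```

Whether a given search engine follows the reference across the two blocks is a separate question, and one I can’t answer.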

    So head‐placed JSON‐LD data is regarded as unanchored? Google help is not helping,

    …include the main type of structural markup that reflects the main focus of the page. For example, if a page is mainly about a recipe, make sure to include Recipe structural markup…

    yet that is all we have.

    Structuring structural markup

    Right, let’s have a go. With an example; a web‐page that reviews a book. The page includes data on the book (author, publisher, ISBN and so forth), a photo of the cover, and an independent review. What would the structural markup look like?

    Using nesting

    Google says they do not recognise type ‘Review’. Then again, others say that Google will recognise ‘Review’. To me, typing this page as a ‘Review’ is wrong. I will explain why below. For now, I make this a review,

    <script type="application/ld+json">
    {
      "@context": "https://schema.org/",
      "@type": "Review",
      "isPartOf" : "https://onbooks.com",
      "headline": "Lord Of The Flies",
      "image": "http://www.example.com/lord-of-the-flies-cover.jpg",
      "about" : {
        "@type": "Book",
        "name": "Lord Of The Flies",
        "author": {
          "@type": "Person",
          "givenName": "William",
          "familyName" : "Golding"
        },
        "publisher": {
            "@type": "Organization",
            "name": "Faber and Faber"
        },
        "inLanguage": "English",
        "isbn": "00000000"
      },
      "reviewRating": {
        "@type": "Rating",
        "ratingValue": "5"
      }
    }
    </script>
    

    That is valid JSON‐LD—if it can affect a search‐engine, I do not know. It’s the approach used on jsonld.com. Anyway, that’s what a nested approach would look like.

    Using IDs

    Link pieces of data with IDs. The ‘@id’ value. Which Google barely talks about. And JSON‐LD sources rant about, without context or example (yeh, the so‐called help here is bad). Anyway,

    <script type="application/ld+json">
    [
      {
        "@context": "https://schema.org/",
        "@type": "Book",
        "@id" : "https://example.com/reviews/lord-of-the-flies",
        "name": "Lord Of The Flies",
        "author": {
          "@type": "Person",
          "givenName": "William",
          "familyName" : "Golding"
        },
        "publisher": {
            "@type": "Organization",
            "name": "Faber and Faber"
        },
        "inLanguage": "English",
        "isbn": "00000000"
      },
      {
        "@context": "https://schema.org/",
        "@type": "ImageObject",
        "@id" : "https://example.com/reviews/lord-of-the-flies",
        "url": "http://www.example.com/lord-of-the-flies-cover.jpg"
      },
      {
        "@context": "https://schema.org/",
        "@type": "Review",
        "@id" : "https://example.com/reviews/lord-of-the-flies",
        "reviewRating": {
          "@type": "Rating",
          "ratingValue": "5"
        }
      }
    ]
    </script>
    

    Again, valid JSON‐LD. Effect unknown. The separate pieces of the code use the same identifier. The identifier is the webpage, so all the pieces are, hopefully, applied. Course, there is a problem: URLs shouldn’t change, but they sometimes do. However, this seems solid enough to work as given, and follows the suggestion in Google help.
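
An alternative sketch, offered with the same caveat that the effect on search engines is unknown. In JSON‐LD, objects that share an ‘@id’ describe one and the same thing, so here each piece gets its own fragment identifier and the review points at the book with ‘itemReviewed’. The ‘@graph’ keyword is standard JSON‐LD for listing several items under one ‘@context’. The URLs are invented.

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "Book",
      "@id": "https://example.com/reviews/lord-of-the-flies#book",
      "name": "Lord Of The Flies",
      "author": { "@type": "Person", "givenName": "William", "familyName": "Golding" },
      "image": { "@id": "https://example.com/reviews/lord-of-the-flies#cover" }
    },
    {
      "@type": "ImageObject",
      "@id": "https://example.com/reviews/lord-of-the-flies#cover",
      "contentUrl": "https://example.com/lord-of-the-flies-cover.jpg"
    },
    {
      "@type": "Review",
      "@id": "https://example.com/reviews/lord-of-the-flies#review",
      "itemReviewed": { "@id": "https://example.com/reviews/lord-of-the-flies#book" },
      "reviewRating": { "@type": "Rating", "ratingValue": "5" }
    }
  ]
}
</script>
```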

    Further issues

    A note on snippets/teasers

    ‘Snippets’ are Google’s word for the teaser text used underneath web‐links in a search. Most mature search‐engines seem to attempt to provide some teaser text.

    Much as I dislike it, the Google structured markup parser seems to like finding descriptions in markup.

    Google can return errors on formally correct markup

    Direct reportage. I set data on webpages. The aim was to tell search engines that the pages were an article in a known category of subject (could be all kinds of things: electronic gadget, movie, activity…), which contained general information and a review (note that an article could also contain other things such as citation lists, images…). I used code with this structure,

    Article
    ├─ headline
    ├─ about
    │  ├─ type
    │  ├─ name
    │  ├─ description
    │  └─ potentialAction
    └─ review
       └─ type ('Review')
    

    I didn’t test this with Google themselves—you can at the link in the references. To me this is useful extra information, without excess duplication.

    Google disagreed. Reported errors on the review field,

    Missing: field 'author'
    Item does not support 'reviews'
    Missing: reviewed item name
    

    And for ‘article’, I was given warnings,

    Missing: field 'Author' (optional)
    Missing: field 'image' (optional)
    

    What? Schema.org says ‘Article’ supports a ‘reviews’ property! For more on this, look further down. And why must an article have an image?

    Not the only time I’ve had these kinds of errors. The following example was markup for some reviews (as noted, I’m unhappy that ‘review’ is not decisive between ‘review text’ or as a derivative of ‘Article’). As richer information, and formally correct, this is good,

    Article
    ├─ headline
    ├─ isPartOf
    └─ about
       ├─ ProductModel
       ├─ brand
       ├─ model
       └─ category
    

    But these errors were reported, under a title ‘Product snippets’,

    Either 'offers', 'review' or 'aggregateRating' should be specified
    Missing field 'name'
    

    Well, I suppose something can be gained from this. The Google parse of the structured markup,

    seems to assume ProductModel type is a sales item

    Why else would ‘offers’ and ‘aggregateRating’ be listed? Which the limited Google help seems to suggest.

    appears to be ignoring the ‘about’ property

    Ask me, if markup is correctly formed under ‘about’, then there can be no ‘error’ in the information given. It’s extra information. Whereas the parse is taking this as primary information, lacking in data required for a special Google display

    again wants a review to contain a rating

    Yes, this will contribute to a special list display, probably visual stars. But why is the lack of rating a step towards an error?

    can’t construct a ‘name’ from a brand/model combination

    I suppose there is a possibility of extended information here. The site may sell an item named ‘Rubber Gloves’, but internally reference as brand ‘WilsonTech’, model ‘WT‐09’. Still, it’s a shame the parse can not default a ‘name’ as ‘WilsonTech WT‐09’.

    You could be surprised by the number of ‘schema.org’ ‘Things’ that return,

    Item does not support reviews
    

    Places, artwork… you may make faster decisions by looking at what Google will accept reviews about.

    With the lack of help information, a sitebuilder must guess/experiment with structured data markup until it returns as correct. And, if and when you find the product review sample, it is, as far as I’m concerned, inside out: it declares a Product then inserts the ‘review’ property. I tried testing with Google’s snippets test, and found that, if I inverted the data using the ‘itemReviewed’ property (‘Product’ inside ‘Review’), the parser was unable to find the ‘author’ field, wherever it was placed. This is time‐consuming and discouraging for website builders with eccentric needs.

    Wider, on the one hand, this is useful feedback. Let’s say, in a limited way, that Google can make something from markup it recognises. This information tells me Google, at least, is able to recognise the type of my markup, but properties of use are missing. On the other hand, this is a depressing comment on the use of structured markup. These markups are formally correct and adhere to Schema.org guidelines. They provide useful comment about categorisation and other facets of the pages. Yet return errors.

    Now, Google says somewhere that markup errors are ignored, will not affect search results. Still, it is also a kind of error to report errors when there are none. This is an ‘I don’t get it, so it’s a fail, even if my comprehension is faulty’ report. But there’s no gain in arguing here. What we can get is that Google, at least, only accepts a small subset of structured markup forms from the possibilities. And will error on markup that others could use.

    Duplication continues all the way up

    Back to duplicate information, if you start duplicating at the base of a tree structure like HTML, you’ll be duplicating all the way up. Consider this. If the structural markup replicates, or even extends, webpage data, then it can replace the page. There are in fact structural markup properties that carry contents, for example ‘text’ (for ‘Article’), ‘articleBody’ and ‘reviewBody’! At this point, the only item missing from the structural markup is links for styling consideration. The HTML is unneeded. As code, it stinks.

    Worse, it can result in unstructured markup. Let’s take page descriptions. HTML meta‐descriptions: Google is keen on them. But in the Book Review example, did we put the description in the meta tag? Or should we put it in the structural markup? And where in the structural markup? The data for the review, or for the book? In practice, these two ideas could be separate. The description of the book, ‘A masterpiece by acknowledged master of fear, Zebedee Dunbar’, is different to the description for the review page, ‘Dunbar’s new book horrifies, but unconvincingly’. We could end with many descriptions spread across two different data structures, the structured markup and the HTML. Then, for search‐engines, which one wins? And where would the selection apply: in listings, in page display? There are comments buried in websites that suggest nobody knows.
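
To make the overlap concrete, here is a sketch of one page carrying three descriptions at once. The texts are the ones from the paragraph above. The layout is my own guess at how such a page might end up, not a recommendation.

```html
<head>
  <!-- Description 1: the HTML meta description, for the page -->
  <meta name="description" content="Dunbar's new book horrifies, but unconvincingly">
  <!-- Descriptions 2 and 3: one on the review, one on the book it reviews -->
  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "Review",
    "description": "Dunbar's new book horrifies, but unconvincingly",
    "itemReviewed": {
      "@type": "Book",
      "description": "A masterpiece by acknowledged master of fear, Zebedee Dunbar"
    }
  }
  </script>
</head>
```

Nothing in the markup itself says which of the three a search listing should use.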

    Structural markup is uncertain in structure

    I’m sure this is for flexibility, and the result of a great deal of consultation. Standards specifications can not be made any other way. But look at the above example again, and you’ll see I’ve been exploiting an ambiguity. The ‘review’ will contain an image of the cover, details about the book, and the text of a main review. But is that the ‘review’? Or is the ‘review’ the specific text posted by the website author? Which is wrapped in an ‘Article’?

    This flexibility may be intended. The issue is, some search engines may respond to one form, but not another. If you want to high‐mindedly tag your data with information, no problem. If you want to generate effects in search‐engine listings, you may get no effect, then wonder why.

    Takeaways

    Mine, anyway,

    After…

    If I find anything else, I’ll let you know, ok? But don’t expect anything. I’m less than impressed, and will likely only follow the directions in my summary.

    Refs

    Schema.org. Gathers common standards for three vocabularies of structured text,

    https://schema.org

    Google enthuse about their neat events listings,

    https://developers.google.com/search/docs/advanced/structured-data/search-gallery

    Wikipedia on JSON‐LD, a vocabulary for microformat data,

    https://en.wikipedia.org/wiki/JSON-LD

    You’d hope documentation would help. You’d hope…

    https://www.w3.org/TR/2014/REC-json-ld-20140116/#advanced-concepts

    No answers, shallow, a video tutorial, and takes five minutes to get going. However, it’s a good tutorial, short, and covers basic concepts of what JSON‐LD thinks it is,

    https://www.youtube.com/watch?v=vioCbTo3C-4

    JSON‐LD verifier only parses for form, but that’s better than lunging onto the web with invalid form,

    https://json-ld.org/playground/

    Google ‘rich results’ test. May require a Google logon,

    https://search.google.com/test/rich-results

    Google’s product review example,

    https://developers.google.com/search/docs/appearance/structured-data/product

    This site gives pre‐made models of common JSON‐LD. One way through this mess,

    https://jsonld.com/article/

    Article on Facebook’s Open Graph Meta Tags,

    https://ahrefs.com/blog/open-graph-meta-tags/

    Facebook on Open Graph best practices,

    https://developers.facebook.com/docs/sharing/best-practices#tags