Beautiful Soup

Robert Crowther Apr 2022
Last Modified: Feb 2023

Real‐world example: you have an SVG, and you’d like to add to it. And clean it up, ummm, launder it, a little. Assuming use of a Debian‐based system such as Ubuntu:

#!/usr/bin/env python3

# Above, the shebang. Need to add execute permission also

# BS4, Beautiful Soup's new abbreviated name, is
# pre-installed on Ubuntus. As is the libxml2
# wrap, 'lxml'. So you shouldn't need to apt-get
# anything. But see the note below.
from bs4 import BeautifulSoup


# At first, my Ubuntu did not recognise lxml was
# installed. No idea why. I tried an import,
# then it started to work regardless. Go figure.
# As for 're', always useful here.
#import lxml
import re


# Instructions usually say use 'lxml' as a
# parameter to summon the tolerant-but-accurate
# libxml2 parser.
# However, Beautiful Soup works by default as an
# HTML parser. One effect is it will wrap output
# in compliant HTML tags like BODY and HTML.
# This is not wanted for SVG, so state 'xml' as
# a style, not a parser. Beautiful Soup will
# look for an XML parser. If 'lxml' is present
# Beautiful Soup will use that.
with open("funkyweb.svg") as fp:
    soup = BeautifulSoup(fp,  "xml")
    #soup = BeautifulSoup(fp,  "lxml")

# Not necessary. A counter of wrapped paths,
# printed at the end as a sign if all is right
# or wrong.
count = 0

# Let's jam a lowlight filter in.
# Now, Beautiful Soup works as an
# object-orientated API. This is slow and
# clumsy for extensive building. Also, the
# example that follows has an attribute named
# 'type', which shadows a Python builtin, and
# names like 'class' are reserved outright, so
# keyword arguments can't always express
# attributes. But another feature saves this
# script. Two Beautiful Soup object-trees can
# be linked. So XML can be parsed in.
soupFilter = BeautifulSoup(
'''<filter id="lo-light" x="0" y="0" width="100%" height="100%">
     <feComponentTransfer>
       <feFuncR type="gamma" amplitude="0.5" exponent="1" offset="0"></feFuncR>
       <feFuncG type="gamma" amplitude="0.5" exponent="1" offset="0"></feFuncG>
       <feFuncB type="gamma" amplitude="0.5" exponent="1" offset="0"></feFuncB>
     </feComponentTransfer>
   </filter>''',
"xml"
)

# However, can use the object-orientated API to
# find the 'defs' section of the SVG, replace it
# with the filter (or append the filter)
defsTag = soup.find("defs")
defsTag.replace_with(soupFilter)

# Right, let's add links round several of the
# paths in the SVG. This is the kind of job that
# is a waste of time manually, inefficient and
# with erratic, non-replicable results.

# Need to add the web-link namespace.
# From the attribute API, there is no way to
# generate namespaced attributes
# ('svgTag.xmlns:xlink' ?) . However, the
# consistent but clunky dictionary syntax works
# as presented. The root tag must be found
# first.
svgTag = soup.find("svg")
svgTag['xmlns:xlink'] = "http://www.w3.org/1999/xlink"

# Let's wrap paths in links. Assuming the paths
# have a meaningful 'id'.
for tag in soup.findAll("path"):
        # Skip paths without an 'id'. There is
        # nothing to build a URL from
        pathId = tag.get("id")
        if pathId is None:
            continue
        urlId = pathId.lower()
        urlId = urlId.replace("_", "-")

        # With the Beautiful Soup API, you need
        # to make an element from parts, not
        # plaster in text (but see the XML
        # solution above)
        newTag = soup.new_tag("a", title=pathId)

        # Namespaced attribute issue again. See
        # above.
        newTag['xlink:href'] = f"/area/{urlId}"
        tag.wrap(newTag)
        count = count + 1

# Right, let's talk about XML editors. Like
# Inkscape. Inkscape generates much of its own
# data to fill SVGs. For example, it knows how
# and what size it must open an SVG (all SVG
# editors will do this). It records this
# information in the XML in an SVG file. However,
# this information is surplus to data needed to
# render the SVG. This script is cleaning, so
# let's get rid of that data.
# NB: Inkscape began as a fork of Sodipodi, and
# still writes 'sodipodi:' names into its files.
# This code needs to remove those references too.

# There is a tag, 'sodipodi:namedview', unused
# for final rendering. Remove, if present.
editorTag = soup.find("sodipodi:namedview")
if editorTag is not None:
    editorTag.decompose()

# At base, this section of code compresses, and
# tidies, SVG code. Not going far with this. Why?
# Because there are programs designed specially.
# And those programs tackle ideas that can make a
# big difference. For example, cutting down on
# decimal places in SVG paths. That can
# change the look of the SVG, so you'll need to
# try parameters. However, reduction to one
# decimal place can reduce SVG sizes by half.

# Specifically, remove Inkscape/Sodipodi editor
# attributes
# Can reduce size by 1/10. More to the point,
# makes the final SVG much cleaner.

# I know XML users are self-important, but this
# is the only way I could think of to iterate
# the Soup structure
# Only need the key (of the attribute), not
# the value
for tag in soup.findAll(recursive=True):
    # Collect the keys first, then delete. A
    # dict can't change size while it is being
    # iterated
    removeAttrs = [attrName for attrName in tag.attrs if attrName.startswith(("sodipodi:", "inkscape:"))]
    for attrName in removeAttrs:
        del tag.attrs[attrName]

# One more compression. Editors heap stacks of
# attributes on text elements. These
# attributes are CSS-style lengthy e.g.
# "font-style:normal;font-variant:normal;font-weight:normal;..." etc.
# You can see, all of this, and maybe more,
# can in some circumstances be reduced to a
# simple CSS-like style="" on the main element,
# to be inherited. So strip all attributes from
# text tags. Is there anything to preserve? No,
# not even x and y attributes, because text
# elements wrap TSPAN elements that position.
# That said, check afterwards.
# Could save 1/10 file size. More to the point,
# is better code.

# Put in the new overall style attributes
svgTag = soup.find('svg')

# display:inline is a basic. May be there already
svgTag['style'] = "display:inline; font-size:12px;"

# Strip all attributes from TEXT elements by
# replacement
for tag in soup.findAll('text', recursive=True):
    tag.attrs = {}


# Viewbox fix. Move the 'width' and 'height'
# sizing values into a viewBox. Will stop an SVG
# falling out of view, or exploding across an
# HTML page. Assumes width and height are plain
# numbers. A viewBox takes no units like 'mm'.
width = soup.svg['width']
height = soup.svg['height']
soup.svg['viewBox'] = f"0 0 {width} {height}"
del soup.svg['width']
del soup.svg['height']

# Right, that's it, enough. Write the result out
with open("funkyweb_clean.svg", "w") as fp:
    # BS turns all text into UTF8. This will
    # destroy HTML entities. So those must be
    # re-encoded on save. Hence, the 'minimal'
    # formatter parameter.
    # Also, Beautiful Soup usually outputs as
    # pretty-print. But the output is tidy
    # anyway, and compact results are easy to
    # cut and paste. Use a simple 'decode'
    #fp.write(soup.prettify(formatter="minimal"))
    fp.write(soup.decode(formatter="minimal"))

# Sanity check: how many paths were wrapped
print(f"count: {count}")
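
The script's comments mention that dedicated tools cut decimal places in path data, and leave it there. For a feel of what that involves, here is a rough sketch using only 're'. The function name round_path_numbers is made up for this example, and it assumes numbers are separated by spaces or commas (as Inkscape writes them) with no exponent notation. It is not a replacement for a real optimiser.

```python
import re

# Matches a number with a fractional part, e.g. "12.3456" or "-0.5678".
# Assumption: no exponent notation ("1e-3") and no run-together
# numbers such as "1.5.5" in the path data.
_DECIMAL = re.compile(r"-?\d*\.\d+")


def round_path_numbers(d, places=1):
    """Round each decimal number in an SVG path 'd' string (hypothetical helper)."""
    def shorten(match):
        rounded = f"{float(match.group()):.{places}f}"
        # Drop trailing zeros and a dangling point: "21.0" becomes "21"
        rounded = rounded.rstrip("0").rstrip(".")
        # A negative that rounded away to nothing is plain zero
        return "0" if rounded == "-0" else rounded
    return _DECIMAL.sub(shorten, d)


print(round_path_numbers("M 10.123456,20.987654 L 30.04,-0.04"))
```

In the script it could be run over each path before saving, something like tag['d'] = round_path_numbers(tag['d']). Check the rendered result afterwards, as rounding can shift shapes.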

Discussion

It’s clear from the posts I came across on the web (I tried maybe six) that the writers have never run or tested their own guides. Plus, none of them tackle the subjects that people are likely to ask about. The difficult stuff. No surprise. Only the crazy and the committed deal with XML.
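
One piece of difficult stuff the script itself skips: the viewBox step copies width and height as they are, but editors may write those with units ("210mm"), and a viewBox takes plain numbers. A hedged sketch of a check follows. The names split_length and viewbox_from are invented for this example, and it only handles the easy case where the lengths are unitless or in px, which map one-to-one onto user units. Anything else returns None, so the caller can leave the SVG alone.

```python
import re

# A number followed by an optional unit, e.g. "210mm", "744.09px", "300".
_LENGTH = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*([a-z%]*)\s*$", re.IGNORECASE)


def split_length(value):
    """Return (number_text, unit) for an SVG length, or None if unparseable."""
    match = _LENGTH.match(value)
    if match is None:
        return None
    return match.group(1), match.group(2).lower()


def viewbox_from(width, height):
    """Build a viewBox string only when both lengths are unitless or px."""
    w, h = split_length(width), split_length(height)
    if w is None or h is None:
        return None
    if w[1] not in ("", "px") or h[1] not in ("", "px"):
        # Units like mm or % need a real conversion. Not attempted here.
        return None
    return f"0 0 {w[0]} {h[0]}"


print(viewbox_from("744.09px", "1052.36"))
print(viewbox_from("210mm", "297mm"))
```

Used in the script, the width and height attributes would only be deleted when viewbox_from returns a string.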

Refs

Beautiful Soup documentation. One page of Python writing at its best:

https://beautiful-soup-4.readthedocs.io/en/latest