Source: Author

There are many projects that a data scientist or software engineer encounter that require passing in a configuration dictionary to a program to configure how that program functions. Often, such situations arise when full automation is impossible, yet many of the details can be abstracted away to a finite set of parameters.

Common Examples:

  • Designing a web scraper with both common and site-specific functionality.
  • Submitting a job to a compute cluster to train or test a machine learning model.
  • Automating a business process (web actions, dropshipping, etc)

Often when writing an application that requires a configuration to be loaded or passed from one process to another, the following issues are encountered:

  1. Complex parsing logic of JSON configurations.
  2. Unable to serialize and deserialize configurations with complex objects (tools like pickle can help but produce an unreadable binary file).
  3. Difficulty protecting secrets (access keys, auth tokens, etc.) in configurations

One powerful way to overcome these issues while also improving the scalability and quality of your code is YAML tags and PyYAML to parse and load the YAML configuration file.

Let’s jump in!

What is YAML/PyYAML?

As a quick brief, YAML is a cross language unicode based serialization language designed to help define configuration files. As the CI/CD industry standard, you will often see YAML files used to define CI/CD pipelines or infrastructure through code.

As an aside, this article is for advanced usage of YAML (tags) and will not cover basic syntax (see this primer). An easy way to begin to understand YAML is to draw parallels with more common JSON configuration files. YAML format, however, is more concise and extendable than JSON and supports advanced features such as custom tags.

PyYAML is an installable Python package that implements the YAML 1.1 parser specification to load and dump YAML files. Most importantly for this tutorial, PyYAML supports Python-specific tags that allow you to represent an arbitrary Python object in YAML or construct a Python object from a YAML definition.

You can install PyYAML by running the following in command prompt or bash:

pip install PyYAML

By the end of this tutorial, you will understand how to load a YAMl file and insert complex Python objects dynamically and/or write a Python dictionary containing complex Python objects to a YAML file.

Constructors and Representers

At the core of using PyYAML is the concept of constructors, representers, and tags.

From a high-level, a constructor allows you to take a YAML node and return a class instance; a representer allows you to serialize a class instance into a YAML node; and a tag helps PyYaml know which constructor or representer to call! A tag uses the special character ! preceding the tag name to label a YAML node.

Let’s take a look at an example YAML file that does not have any tags.

example.yml (without YAML tags)

name: MyBusiness
locations:
  - "Hawaii"
  - "India"
  - "Japan"
employees:
  - name: Matthew Burruss
    id: 1
  - name: John Doe
    id: 2

We would like to load the example.yml file above into the Python dictionary below:

class Employee:
  """Employee class."""
  def __init__(self, name, id):
    self._name, self._id = name, id

config = {
  "name": "MyBusiness",
  "locations": ["Hawaii", "India", "Japan"],
  "employees": [
    Employee("Matthew Burruss", 1),
    Employee("John Doe", 2)
  ]
}

In this simple, case one could easily load and parse the YAML into the desired Python dictionary, instantiating the Employee class where needed.

However, imagine if the employees list was deeper in the configuration tree or if instantiating it relied on other complex Python objects. The construction could quickly become a headache and the length of the code, unittests, and overall complexity would increase as our configuration gets longer and longer!

Luckily, PyYAML constructors can come to the rescue.

Defining PyYAML Constructors (Going from YAML to Python)

As stated previously, the constructor allows you to take a YAML node and output a constructed Python object. Let’s see how we can define a constructor to construct an Employee object whenever the !Employee tag is found.

Again, as a reminder the ! indicates a tag in YAML. Let’s look at a new YAML file that uses the custom !Employee tag.

example.yml (with YAML tags)

name: MyBusiness
locations:
  - "Hawaii"
  - "India"
  - "Japan"
employees:
  - !Employee
    name: Matthew Burruss
    id: 1
  - !Employee
    name: John Doe
    id: 2

We can write a constructor below to resolve the !Employee tag while loading the YAML file at runtime.

Employee PyYAML Constructor

import yaml

class Employee:
  """Employee class."""
  def __init__(self, name, id):
    self._name, self._id = name, id

def employee_constructor(loader: yaml.SafeLoader, node: yaml.nodes.MappingNode) -> Employee:
  """Construct an employee."""
  return Employee(**loader.construct_mapping(node))

def get_loader():
  """Add constructors to PyYAML loader."""
  loader = yaml.SafeLoader
  loader.add_constructor("!Employee", employee_constructor)
  return loader

yaml.load(open("config.yml", "rb"), Loader=get_loader())
"""
{
  'name': 'MyBusiness',
  'locations': ['Hawaii', 'India', 'Japan'],
  'employees': [
    <__main__.Employee object at 0x7f0ea2694d10>,
    <__main__.Employee object at 0x7f0ea2694d90>
  ]
}
"""

In the above example, the !Employee tag defines a mapping node; however, constructors also support scalar nodes. For example, if your YAML has the following definition:

greeting.yml

greeting: !Greeting "world"

We can define the following scalar constructor:

Scalar constructor

import yaml

def greeting_constructor(loader: yaml.SafeLoader, node: yaml.nodes.ScalarNode) -> str:
  """Construct a greeting."""
  return f"Hello {loader.construct_scalar(node)}"

def get_loader():
  """Add constructors to PyYAML loader."""
  loader = yaml.SafeLoader
  loader.add_constructor("!Greeting", greeting_constructor)
  return loader

yaml.load(open("greeting.yml", "rb"), Loader=get_loader())
"""
{
  "greeting": "Hello world"
}
"""

Whether to use a scalar instructor or a mapping constructor obviously depends on the use-case. It is even possible to define a multi-constructor that instantiates a Python object whenever a tag contains a prefix (although note, tags cannot clash). This can be used, for example, if you require instantiating a related group of classes with a single tag.

Defining PyYAML Representers (Going from Python to YAML)

Now that constructors have been covered, let’s take a look at representers. A use-case may be one Python process dynamically constructing a configuration dictionary to pass to another Python process.

In order to store the configuration for later use or modification, we would need to serialize the YAML file. However what if this Python dictionary contains any complex Python objects (e.g. functions, classes, etc.). How can we overcome this?

Well with YAML tags of course!

Let’s say we want to write the array of Employees back to a YAML file.

Example Python code

list_of_employees = [
  Employee("Matthew Burruss", id=0),
  Employee("John Doe", id=1),
  Employee("John Doe's brother", id=2),
]

We can define the following representer to take the Employee instance and write it as a YAML mapping node with the appropriate tag !Employee.

import yaml

def employee_representer(dumper: yaml.SafeDumper, emp: Employee) -> yaml.nodes.MappingNode:
  """Represent an employee instance as a YAML mapping node."""
  return dumper.represent_mapping("!Employee", {
    "name": emp._name,
    "id": emp._id,
  })

def get_dumper():
  """Add representers to a YAML seriailizer."""
  safe_dumper = yaml.SafeDumper
  safe_dumper.add_representer(Employee, employee_representer)
  return safe_dumper

with open("output.yml", "w") as stream:
  stream.write(yaml.dump(list_of_employees, Dumper=get_dumper()))
"""
output.yml
- !Employee
  id: 0
  name: Matthew Burruss
- !Employee
  id: 1
  name: John Doe
- !Employee
  id: 2
  name: John Doe's brother
"""

Like the constructor, you can also define a scalar representer. However, it is often useful to write a wrapping class around the scalar type so that you can pass a class type to the add_representer function. Otherwise, if that scalar type (e.g. a string) is used anywhere else in the YAML file, it won’t get represented by the custom function.

Now that we have looked at the basics of PyYAML constructors, representers, and tags, let’s see how these can be used in practice with a few advanced examples.

Advanced Examples

What’s a tutorial without examples! The following sections will look at specific advanced usages of YAML tags and PyYAML constructors which tend to be more complicated than representers. For the example representers, please see the Appendix section.

Protecting Secrets

Frequently in CS projects, we have secrets we need to protect. This could be an S3 or ADLS access key used to load a dataset from storage, a client secret for a service principal to auth into a service, or an access token to perform an HTTP API request.

Often, server code leverages an .env file or some DevOps feature to allow us to access the secret through an environment variable. If the secret becomes part of our configuration dictionary, this can be a big problem if we would like to eventually write the configuration dictionary to a file or print the configuration dictionary for debugging. In general, it’s never good to expose secrets!

Let’s take a simple example where we would like to load a secret MY_SECRET from the environment and set it to the attribute my_secret in our configuration dictionary.

my_secret_config.json

{
  "my_secret": {
    "env_key": "MY_SECRET"
  }
}

Let’s say we have the following Python code.

Python snippet (not good!)

import json
import os

# Load the config and retrieve secret from environment
config = json.load(open("my_secret_config.json", "rb"))

config["my_secret"] = os.environ[config["my_secret"]["env_key"]]  # Issue 1: Need to load each secret and be aware of the structure of the JSON

print(config) # Issue 2: Whoops! We just printed a secret!

config["my_secret"] = None  # Issue 3: We have to manually remove the secret before writing.
json.dump(config, open("out_config.json", "wb"))

In the above code, we could wrap the secret value in a Python class that overrides the __str__ method to safely print, but we would still have the issue of needing to be aware of the structure of the JSON and instantiate the secrets wherever located (which may be deep in the configuration).

Also what if we need to insert the secret into a string which my be the case for an HTTP access token? We would need to properly format the JSON configuration dictionary to tell us what the secret is and how to format the string.

Instead, let’s take a look at a cleaner, more scalable solution using a custom !Env YAML tag and a PyYAML constructor.

my_secret_config.yml

my_secret: !Env "${MY_SECRET}"
my_secret2: !Env "https://mywebsite.com?access_token=${ACCESS_TOKEN}"  # We can even insert into a string!

We can then write Python code to load this YAML and even dynamically insert the secrets into a formatted string. In the example below, we also wrap the secret in a class in case we want to protect it from printing to stdout.

!Env YAML tag and PyYAML Constructor

import re
import os
import yaml

class Secret:
  def __init__(self, name, secret):
    """Initialize."""
    self._name, self._secret = name, secret
  def __str__(self):
    """Override to string method."""
    return f"Secret(name={self._name}, secret=***)"
  def get(self):
    """Get secret value."""
    return self._secret

def env_constructor(loader, node):
  """Load !Env tag"""
  value = str(loader.construct_scalar(node)) # get the string value next to !Env
  match = re.compile(".*?\\${(\\w+)}.*?").findall(value)
  if match:
    for key in match:
      value = value.replace(f'$}', os.environ[key])
    return Secret(key, value)
  return Secret(key, value)

def get_loader():
  """Get custom loaders."""
  loader = yaml.SafeLoader
  loader.add_constructor("!Env", env_constructor)
  return loader

config = yaml.load(open("my_secret_config.yml", "rb"), Loader=get_loader())
print(config)
"""
{
  'my_secret': <__main__.Secret object at 0x7f3247a75e10>,
  'my_secret2': <__main__.Secret object at 0x7f3247a75d10>
}
"""
for key, secret in config.items():
  print(f"{key}={secret}")
"""
my_secret=Secret(name=MY_SECRET, secret=***)
my_secret2=Secret(name=ACCESS_TOKEN, secret=***)
"""

Replace if-else parsing logic

You can even use YAML tags to help reduce the amount of if-else logic in your code by replacing parameterized tags with handlers at runtime.

Let’s take the example of a scraping tool where you want to collect fan review of Harry Potter and the Sorcerer’s Stone media by aggregating IMDB and Goodreads data using the configuration JSON below:

harry_potter_config.json

[
  {
    "url": "https://www.imdb.com/title/tt0241527/",
    "provider": "imdb"
  },
  {
    "url": "https://www.goodreads.com/book/show/3.Harry_Potter_and_the_Sorcerer s_Stone",
    "provider": "goodreads"
  }
]

To perform this task, one could write functions to scrape IMDB pages and another for Goodreads pages, load the configuration JSON, and call the appropriate handler following an if-else block. An example is shown below:

Python code (not good!)

import json
from my_scraper import handle_imdb, handle_goodreads

for scrape_config in json.load(open("harry_potter_config.json", "rb")):
  if scrape_config["provider"] == "imbd":
    handle_imdb(scrape_config["url"])
  elif scrape_config["provider"] == "goodreads":
    handle_goodreads(scrape_config["url"])
  raise ValueError(f"Unknown provider {scrape_config['provider']}")

Although the code above would work, it may not be the best solution. As the configuration dictionary gets larger and more complicated, the code will not scale because parsing and handling all the conditions becomes more complicated.

Think of the situation where an if-else determined function takes another if-else determined function’s output as input and that function relies on another if-else determined function’s output and so on. The logic does not come out cleanly.

While you could use some design pattern to help improve the parsing quality (e.g. Builder or Factory method) imagine this cleaner situation leveraging YAML tags in the configuration to insert the handler at runtime:

harry_potter_config.yml

- url: https://www.imdb.com/title/tt0241527/
  handler: !Func
    module: my_scraper
    name: handle_imdb
- url: https://www.goodreads.com/book/show/3.Harry_Potter_and_the_Sorcerer s_Stone
  handler: !Func
    module: my_scraper
    name: handle_goodreads

In Python we can leverage PyYAML to dynamically insert the appropriate handlers wherever the !Func tag is used, using mapping nodes to tell our program what function to load.

!Func YAML tag and PyYAML constructor

import yaml

def func_loader(loader, node):
  """A loader for functions."""
  params = loader.construct_mapping(node) # get node mappings
  module = __import__(params["module"], fromlist=[params["name"]]) # load Python module
  return getattr(module, params["name"]) # get function from module

def get_loader():
  """Return a yaml loader."""
  loader = yaml.SafeLoader
  loader.add_constructor("!Func", func_loader)
  return loader

config = yaml.load(open("harry_potter_config.yml", "rb"), Loader=get_loader())
for scrape_config in config:
  scrape_config["handler"](scrape_config["url"])

"""
config
[
  {
    'url': 'https://www.imdb.com/title/tt0241527/',
    'handler': <function handle_imdb at 0x7f3247a76320>
  },
  {
    'url': 'https://www.goodreads.com/book/show/3.Harry_Potter_and_the_Sorcerer s_Stone',
    'handler': <function handle_goodreads at 0x7f3247a764d0>
  }
]
"""

The return value of yaml.load contains a Python dictionary with the function handlers handle_imdb and handle_goodreads loaded to the handler attribute, improving the scalability and maintainability of the code.

Earlier it was also mentioned that as the configuration dictionary gets larger and more complicated, so does the parsing logic. YAML can help with that too as it resolves dependencies anywhere found in the dictionary. Let’s take a look at an example below.

deep_config.yml

class_a: !MyClass
  param1: 1
  order: 1
class_b: !MyClass
    param1: !MyClass
      param1: 2
      order: 2
    order: 3

!MyClass tag and PyYAML constructor

import yaml

class MyClass:
  def __init__(self, param1, order):
    print(f"Initializing with name={param1} order={order}")

def my_class_loader(loader, node):
  """A loader for functions."""
  return MyClass(**loader.construct_mapping(node))

def get_loader():
  """Return a yaml loader."""
  loader = yaml.SafeLoader
  loader.add_constructor("!MyClass", my_class_loader)
  return loader

yaml.load(open("deep_config.yml", "rb"), Loader=get_loader())
"""
Initializing with name=1 order=1
Initializing with name=2 order=2
Initializing with name=<__main__.MyClass object at 0x7f3247a7aa50> order=3
{
  'class_a': <__main__.MyClass object at 0x7f3247a7a710>,
  'class_b': <__main__.MyClass object at 0x7f3247a7a750>
}
"""

Conclusion

That’s the end of the tutorial! Next time you are writing a Python program and you see the need for a configuration dictionary to improve the codes re-usability and functionality, you can leverage YAML and PyYAML to reduce the overhead of parsing and help with serializing/de-serializing the configuration dictionary.

Any questions or comments, please leave below. Thanks for reading!

Appendix

The PyYAML representers for the advanced examples in this tutorial are shown below.

Representer for !Env

def _env_representer(dumper, env):
  return dumper.represent_scalar("!Env", env.secret)

Representer for !Func

def _func_representer(dumper, func):
  return dumper.represent_mapping("!Func", {
    "module": func.__module__,
    "name": func.__name__
  })

Comments