Custom format rules

When designing a data schema, you’re not only choosing what data to collect but also how that data should be structured. Format rules help ensure consistency by defining the expected structure for specific types of data and are especially useful for data verification.

For example, a date might follow the YYYY-MM-DD format, an email address should look like name@example.com, and a DNA sequence may only use the letters A, T, G, and C. These rules are often enforced using regular expressions or standardized format types to validate entries and prevent errors. Using the Semantic Engine, we have already described how users can select format rules for data input. Now we introduce the ability to add custom format rules for your data verification.

Format rules are in Regex

Format rules that are understood by the Semantic Engine are written in a language called Regex.

Regex—short for regular expressions—is a powerful pattern-matching language used to define the format that input data must follow. It allows schema designers to enforce specific rules on strings, such as requiring that a postal code follow a certain structure or ensuring that a genetic code only includes valid base characters.

For example, a simple regex for a 4-digit year would be:
^\d{4}$
This means the value must consist of exactly four digits.

Flavours of Regex

While regex is a widely adopted standard, it comes in different flavours depending on the programming language or system you’re using. Common flavours include:

  • PCRE (Perl Compatible Regular Expressions) – used in PHP and many other systems

  • JavaScript Regex – the flavour used in browsers and front-end validation

  • Python Regex (re module) – similar to PCRE with some minor syntax differences

  • POSIX – a more limited, traditional regex used in Unix tools like grep and awk

The Semantic Engine uses a flavor of regex aligned with JavaScript-style regular expressions, so it’s important to test your patterns using tools or environments that support this style.

Help writing Regex

Regex is notoriously powerful—but also notoriously easy to get wrong. A misplaced symbol or an overly broad pattern can lead to incorrect validation or unexpected matches. You can use online tools such as ChatGPT and other AI assistants to help start writing and understanding your regex. You can also put in unknown Regex expressions and get explanations using an AI agent.

You can also use other online tools such as:

The Need for Testing

It’s essential to test your regular expressions before using them in your schema. Always test your regex with both expected inputs and edge cases to ensure your data validation is reliable and robust. With the Semantic Engine you can export a schema with your custom regex and then use it with a dataset with the data verification tool to test your regex.

By using regex effectively, the Semantic Engine ensures that your data conforms to the exact formats you need, improving data quality, interoperability, and trust in your datasets.

 

Written by Carly Huitema