Create a Sensitive Information Type (SIT) from scratch in Microsoft Purview – Part 9 – Cloud Build

Reading Time: 10 minutes

Thank you for continuing to follow the Microsoft Purview blog post series. If you missed any previous posts, they are listed below:

Part 1: Introduction to Microsoft Purview
Part 2: Microsoft Purview Portal
Part 3: Microsoft Purview Roles and Scopes
Part 4: Turn on audit logs in Microsoft Purview
Part 5: Microsoft Purview Device Onboarding
Part 6: Enable Insider Risk Analytics in Microsoft Purview
Part 7: Microsoft Purview Information Protection
Part 8: Exploring a built-in Sensitive Info Type in Microsoft Purview

In Part 8, we explored how Microsoft Purview uses built-in Sensitive Information Types (SITs) to identify and classify sensitive data. In this part (Part 9), we’ll walk through the process of creating a custom Sensitive Information Type from scratch, giving you the flexibility to tailor data classification to your organisation’s unique needs.

So, why would you want to create a custom Sensitive Information Type (SIT)?
While Microsoft Purview offers a wide range of built-in SITs to help identify common types of sensitive data, such as credit card numbers, national IDs, and medical information, there may be scenarios where your organisation handles unique or proprietary data that isn’t covered by the default options.

In such cases, creating a custom SIT allows you to define specific patterns, keywords, or validation logic tailored to your business needs, ensuring that even your most niche data types are properly classified and protected. Although you cannot directly edit built-in SITs, you can copy one and then modify the copy to suit your requirements.

In this part, I’ll create a custom SIT designed to detect internal employee ID formats within documents and emails. Since employee ID formats vary across organisations, a custom SIT allows for tailored detection. For example, one organisation might use a format like HRX849201357, which consists of three uppercase letters followed by nine digits.

Let’s begin by exploring how to create a custom SIT for this scenario.

1. Access the Microsoft Purview portal at purview.microsoft.com

2. Click Solutions in the left pane, then select Information Protection.

Imagesit1

3. In the left pane, expand Classifiers and click Sensitive info types.

Imagesit2

4. Click “Create sensitive info type”

Custom sit1

5. Enter your custom SIT Details. Provide a name and description for your custom Sensitive Information Type (SIT):

Name: Cloud Build Employee IDs
Description: This is a pattern for detecting Cloud Build employee IDs

Custom sit2

6. Click Next to proceed

7. Click Create Pattern
Sensitive Information Types work by identifying specific patterns in content. Each pattern must include a primary element (shown in the image below), such as a regular expression, keyword list, or built-in function, along with a confidence level that tells Purview how certain it should be when flagging a match. You can also add supporting elements, like keywords, to improve accuracy and reduce false positives.

Let’s explore further.

Custom sit3

8. Confidence Level
We briefly touched on this topic in the previous post while discussing built-in Sensitive Information Types. Now, let’s take a closer look.

Custom sit4

What are confidence levels in Sensitive Info Types (SITs)?
Microsoft Purview uses three confidence levels when detecting matches: High, Medium, and Low. These levels indicate how much supporting evidence is found alongside the primary pattern. The more supporting clues, such as nearby keywords or related phrases, a detected item contains, the higher the confidence that it includes the sensitive information you’re trying to identify.

High confidence: The match must strictly follow the defined pattern. A high confidence level means Purview is very certain the item contains sensitive information. This reduces false positives but may miss valid matches, resulting in false negatives.

Medium confidence: The match criteria are moderately strict, balancing accuracy and flexibility.

Low confidence: The match criteria are broad, increasing the chance of detecting relevant items but also raising the risk of false positives.

When testing your custom SIT, you might start with Medium confidence. If detection is too broad, raise it to High. If it misses too many valid items, lower it to Low and add supporting elements to keep accuracy up.

We’ll discuss supporting elements shortly.

Primary Element

Moving on, let’s explore the concept of the primary element.

A primary element in Microsoft Purview is the main pattern used to identify sensitive information, such as a credit card number, national ID, employee ID, and others. It forms the core of a Sensitive Information Type (SIT) and is typically defined using a regular expression, keyword, or function.

This element is what Purview looks for first when scanning content. Without a primary element, a SIT cannot function. Supporting elements and confidence levels build around it to improve accuracy and reduce false positives.

When you click Add a Primary Element (as shown in the image below), Microsoft Purview presents you with four options:

– Regular expression
– Keyword list
– Keyword dictionary
– Functions

Let’s explore each of these options in more detail.

Custom sit7

Regular expression:

To detect employee IDs, I’ll use a regular expression. Regular expressions (RegEx) help find patterns in text. Let’s say your organisation uses employee IDs that consist of 3 letters followed by 8 digits (e.g., ABC12345678). The regular expression to detect this format is: [A-Za-z]{3}\d{8}

[A-Za-z]{3} matches exactly 3 letters, uppercase (A-Z) or lowercase (a-z).
\d{8} matches exactly 8 digits (numbers from 0 to 9).
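To see how this pattern behaves before entering it in the portal, here is a minimal sketch using Python's built-in re module. It only illustrates the regular expression itself; it is not how Purview evaluates content, and the sample strings are made up for the test.

```python
import re

# Pattern from the text: three letters followed by eight digits.
EMPLOYEE_ID = re.compile(r"[A-Za-z]{3}\d{8}")

# Hypothetical sample values, chosen to show matches and near misses.
samples = ["ABC12345678", "abc12345678", "AB12345678", "ABC1234567"]

for text in samples:
    # fullmatch succeeds only if the whole string fits the pattern.
    print(text, EMPLOYEE_ID.fullmatch(text) is not None)
```

The first two samples fit the pattern, while the last two fail because they have too few letters or too few digits.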

Custom sit8

The image above displays two options: Word Match and String Match.

Word Match: The rule will only match if the pattern stands alone, separated by spaces or punctuation. For example, "ID:ABC12345678Details" would not match because the ID is embedded within other text and not isolated.

String Match: A match will occur even if the pattern is embedded within other text. For example, the rule would match "ID:ABC12345678Details" because it doesn’t require the ID to be separated.

I have listed a few examples below:

Custom sit13
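The difference between the two options can be approximated outside Purview with a short Python sketch. This is my own approximation, assuming "word match" means no letter or digit may touch either end of the ID; Purview's exact boundary rules may differ. The helper name is hypothetical.

```python
import re

ID_PATTERN = r"[A-Za-z]{3}\d{8}"

# Approximation of String Match: the ID may sit inside longer text.
string_match = re.compile(ID_PATTERN)

# Approximation of Word Match: no letter or digit directly before or after the ID.
word_match = re.compile(r"(?<![A-Za-z0-9])" + ID_PATTERN + r"(?![A-Za-z0-9])")


def is_match(pattern: re.Pattern, text: str) -> bool:
    """Return True if the pattern is found anywhere in the text."""
    return pattern.search(text) is not None


embedded = "ID:ABC12345678Details"
isolated = "ID: ABC12345678, Details"

print(is_match(string_match, embedded), is_match(word_match, embedded))
print(is_match(string_match, isolated), is_match(word_match, isolated))
```

In this sketch the embedded ID is found only by the string-style pattern, while the isolated ID is found by both.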

Under the options Word Match and String Match, there is also an option to add validators. These are functions used to perform additional validations on the regular expression pattern. Validators are built-in logic that verify the correctness of a matched pattern. They’re especially useful when your pattern matches something that looks like sensitive data, but you want to confirm it’s valid. For example, the Checksum Validator verifies numbers using algorithms like Luhn, a simple checksum formula used to validate identification numbers, such as credit card numbers. Date Validator ensures matched dates are in a valid format and range.
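To make the idea of a checksum concrete, here is a minimal sketch of the Luhn algorithm in Python. It illustrates the general technique only and is not Purview's internal implementation; the function name is my own, and 4111111111111111 is a widely used test card number.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Keep only decimal digits so spaces and hyphens are ignored.
    digits = [int(ch) for ch in number if ch.isdecimal()]
    if len(digits) < 2:
        return False

    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit

    # A valid number produces a total that is a multiple of 10.
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))
print(luhn_valid("4111 1111 1111 1112"))
```

A pattern match that fails this kind of check looks like a card number but cannot be a real one, which is why validators help cut false positives.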

Custom sit14

Keyword list:

A keyword list allows you to define specific words or phrases to look for in content. These could include terms like:

  • Confidential
  • Employee ID
  • Internal use only
  • Restricted
  • Staff number
    and more.

These keywords help to increase the confidence level when detecting sensitive information.

Custom sit9

Keyword dictionary:

A keyword dictionary lets you upload a CSV or TXT file containing a large number of custom keywords, such as project names, acronyms, or other organisation-specific terms. If you need to manage a large volume of keywords, the dictionary option is the recommended approach, as it allows for easier bulk management and better scalability.

Custom sit10

Functions:

Purview includes built-in functions to detect specific types of sensitive data. For example:

func_credit_card – detects credit card numbers using pattern matching and validation.

Other functions exist for detecting things like Social Security Numbers, driver’s license numbers, and more.

Custom sit11
Custom sit12

9. I have configured my custom SIT as shown in the image below. I have only used a regular expression and have not added any additional elements. Click Done.

Custom sit15

10. Next, let’s explore character proximity

Character proximity lets you require that the primary and supporting elements appear within a specified number of characters of each other. Alternatively, you can configure the Sensitive Information Type (SIT) to detect the elements anywhere in the document, regardless of proximity. Let’s take a closer look.

Custom sit16

Understanding Proximity in Sensitive Information Types (SITs)

As we have already learned, a Sensitive Information Type (SIT) is a rule-based definition that helps identify specific types of sensitive information, such as Social Security Numbers (SSNs), credit card numbers, or health records, within your data estate. However, just finding a number that looks like an SSN isn’t enough. To reduce false positives, Purview checks whether that number appears near other supporting elements, such as a name, date of birth, or account number. This is where proximity comes into play.

Proximity defines how close these supporting elements must be to the primary element (e.g., SSN) for a match to be considered valid. For example, if the SIT rule specifies a proximity of 300 characters, then the supporting element must appear within 300 characters before or after the SSN in the document or data file.

Let’s take a closer look at a use case scenario.

Imagine you’re working with a dataset that contains various types of employee information. One of the fields is Employee ID, which is considered sensitive and needs to be protected. To accurately detect this data using Microsoft Purview, you define a Sensitive Information Type (SIT) where Employee ID is the primary element.

However, not every number that looks like an Employee ID should be flagged. To reduce false positives, Purview uses proximity rules. This means the Employee ID must appear within a certain number of characters (e.g., 300 characters) of one or more supporting elements, such as Full Name, Job Title, or Department.

For example, if the SIT is configured with a proximity of 300 characters and supporting elements include Name and Department, then the following would be a valid match:

Employee ID: AZF12345678
Name: Jane Doe
Department: Finance

Because the Employee ID appears within 300 characters of both a name and a department, Purview considers this a valid detection.
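The proximity idea can be sketched in a few lines of Python. This is an illustration under my own assumptions, not Purview's engine: the keyword list, the function name, and the rule that one nearby keyword is enough are all hypothetical choices for the example.

```python
import re

EMPLOYEE_ID = re.compile(r"[A-Za-z]{3}\d{8}")
SUPPORTING_KEYWORDS = ["name", "department"]  # hypothetical supporting elements
PROXIMITY = 300  # characters before or after the primary element


def supported_matches(text: str, proximity: int = PROXIMITY) -> list[str]:
    """Return employee IDs that have a supporting keyword within `proximity` characters."""
    results = []
    for match in EMPLOYEE_ID.finditer(text):
        # Build the window of text around the primary element.
        start = max(0, match.start() - proximity)
        end = match.end() + proximity
        window = text[start:end].lower()
        if any(keyword in window for keyword in SUPPORTING_KEYWORDS):
            results.append(match.group())
    return results


nearby = "Employee ID: AZF12345678\nName: Jane Doe\nDepartment: Finance"
far_away = "AZF12345678" + " " * 400 + "Name: Jane Doe"

print(supported_matches(nearby))
print(supported_matches(far_away))
```

In the second sample the keyword sits more than 300 characters after the ID, so the sketch returns no supported match.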

11. Next, let’s explore Supporting Elements

In Microsoft Purview, supporting elements play a crucial role in improving the accuracy of Sensitive Information Type (SIT) detection. While the primary element is the main piece of sensitive data you’re trying to identify, such as an Employee ID, credit card number, or passport number, supporting elements are additional pieces of information that help confirm the context and validity of a match.

For example, if you’re trying to detect an Employee ID, supporting elements might include a person’s name, job title, or department. These elements don’t need to be sensitive themselves, but their presence near the primary element increases the likelihood that the data is meaningful and should be protected. Purview uses these supporting elements in combination with proximity rules (e.g., within 300 characters) to determine whether a match is valid. This helps reduce false positives and ensures that only relevant, sensitive data is flagged.

By configuring supporting elements in your SIT definitions, you can fine-tune detection to match your organisation’s specific data patterns more accurately, making your data protection strategy smarter and more effective.

Custom sit17

As shown in the screenshot below, you can add supporting elements such as regular expressions, keyword lists, keyword dictionaries, and functions to enhance your Sensitive Information Type (SIT) definition. You also have the flexibility to group these elements using logical conditions like Any of these, All of these, or Not any of these. This allows you to fine-tune how supporting elements contribute to match detection. Simply click “+ Add supporting elements or group of elements” to begin customising your SIT.

Custom sit23
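One way to think about the three grouping conditions is as simple boolean logic over the supporting elements. The sketch below assumes that reading and uses plain keyword checks; the function names are hypothetical, and Purview's groups offer more options than this shows.

```python
def contains(text: str, keyword: str) -> bool:
    """Case-insensitive check for a keyword in the text."""
    return keyword.lower() in text.lower()


def any_of(text: str, keywords: list[str]) -> bool:
    # "Any of these": at least one supporting element is present.
    return any(contains(text, k) for k in keywords)


def all_of(text: str, keywords: list[str]) -> bool:
    # "All of these": every supporting element is present.
    return all(contains(text, k) for k in keywords)


def not_any_of(text: str, keywords: list[str]) -> bool:
    # "Not any of these": none of the listed elements is present.
    return not any_of(text, keywords)


sample = "Employee ID: AZF12345678 | Name: Jane Doe | Department: Finance"

print(any_of(sample, ["Job Title", "Department"]))
print(all_of(sample, ["Name", "Job Title"]))
print(not_any_of(sample, ["Test data", "Placeholder"]))
```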

12. Next, let’s explore the Additional Checks option

In addition, as shown in the second screenshot, you can apply text-based filters to further narrow down what qualifies as a match. These include:

  • Excluding specific values
  • Checking whether text starts or ends with certain characters
  • Including or excluding prefixes or suffixes
  • Filtering out duplicate characters

To apply these, simply click “+ Add additional checks” and select the conditions that best suit your SIT configuration. These advanced options give you the flexibility to tailor SIT detection to your organisation’s specific data patterns, making your data protection strategy smarter and more effective.

For example: A financial services company uses Microsoft Purview to detect customer account numbers in internal reports. However, during testing, developers often use placeholder values like “1234567890” or “TEST123”. To avoid these being flagged as real data, the team configures a SIT with additional checks to exclude specific values and patterns commonly used in test environments. This ensures only genuine customer data is detected and protected, reducing false positives and improving accuracy.
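The scenario above can be sketched as a post-filter on pattern matches. Everything here is an assumption for illustration: the ten-digit account format, the excluded values, and my reading of the duplicate-character filter as "every character is the same".

```python
import re

ACCOUNT_NUMBER = re.compile(r"\b\d{10}\b")  # hypothetical account number format
EXCLUDED_VALUES = {"1234567890"}  # placeholder values used in test data


def genuine_matches(text: str) -> list[str]:
    """Return account-number matches that survive the additional checks."""
    kept = []
    for value in ACCOUNT_NUMBER.findall(text):
        if value in EXCLUDED_VALUES:
            continue  # exclude specific values
        if len(set(value)) == 1:
            continue  # exclude values made of one repeated character
        kept.append(value)
    return kept


report = "Accounts: 1234567890, 0000000000, 4820193746"
print(genuine_matches(report))
```

Only the last value in the sample survives both checks, which mirrors how excluding known placeholders keeps test data out of the results.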

Custom sit18
Custom sit19

13. Click Create

14. You can add additional patterns to the same SIT if needed. Click Next

Custom sit20

15. Clicking Next takes you to the final review screen.

Custom sit21

16. Review the details and click Create. This process may take up to one minute to complete.

Custom sit22

17. Click Done

Custom sit23 last image

18. The newly created custom Sensitive Information Type (SIT) will appear in the list, as shown in the image below.

Custom sit24 last image

Detecting Sensitive Data: A Credit Card Example

Let’s take an example to see how primary elements, supporting elements, and character proximity work together. Suppose I want to detect credit card numbers in my data.

To determine whether the message below contains a credit card number, Purview would look for a 16-digit number, which is the primary element. If configured in your custom Sensitive Information Type (SIT), the system could also search for supporting elements, such as the keywords credit card or expiry date, within a certain character proximity. This helps confirm the presence of sensitive information.

In the message below, we have the presence of a credit card number, the phrase credit card, and an expiry date, all within close proximity.

Custom sit5

If the system detects only the primary element, a credit card number, the confidence level for the match is considered low. However, if the system identifies the primary element along with a supporting element, such as the phrase credit card or an expiry date, the confidence level is considered high.

Custom sit6
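Putting the pieces together, the credit card example can be sketched in Python as follows. This is a simplified illustration, not Purview's logic: the 16-digit pattern, the keyword list, the 300-character window, and the function name are all assumptions for the example.

```python
import re
from typing import Optional

# Primary element: 16 digits, optionally grouped in fours by spaces or hyphens.
CARD_NUMBER = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")
SUPPORTING_KEYWORDS = ["credit card", "expiry date"]  # supporting elements
PROXIMITY = 300


def confidence(text: str) -> Optional[str]:
    """Return 'high', 'low', or None if no primary element is found."""
    match = CARD_NUMBER.search(text)
    if match is None:
        return None  # no primary element, so nothing to report
    start = max(0, match.start() - PROXIMITY)
    window = text[start:match.end() + PROXIMITY].lower()
    if any(keyword in window for keyword in SUPPORTING_KEYWORDS):
        return "high"  # primary element plus nearby supporting element
    return "low"  # primary element only


print(confidence("Please charge my credit card 4111 1111 1111 1111, expiry date 12/27."))
print(confidence("Reference 4111111111111111"))
print(confidence("Nothing sensitive here"))
```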

I hope you found this post useful.

Stay tuned for the next one, and don’t forget to subscribe to get notified when new posts go live.
