Creating a Local AI Custom Vision API for Training and Matching Images


This is the third instalment of a miniseries where you will learn how to build an end-to-end, cross-platform AI home security system.

I recommend reading part 1 and part 2 before reading on.

To recap, the main requirements for this system are:

  • Detect motion and capture photo
  • Create message with photo attached
  • Send message to defined Telegram bot
  • Detect who is in the photo; if the person is me, do not invoke the Telegram bot with a message and image

 

In part 2, we created a locally running CLIP server.  This provided an API endpoint that could accept a single image.

The input image was then converted to vectors.  When in vector format, the image data is much easier to run comparisons against.

~

.NET API Capabilities

In this blog post, we implement a .NET API.  The .NET API will leverage the core CLIP functionality but will expose 2 endpoints:

  • /api/image/train
  • /api/image/match

 

The /train endpoint will let us train the system with existing images and associated labels.

The /match endpoint will let us supply an image for vector comparison against stored embeddings (image data).  Cosine similarity will be used to perform the comparison.

~

Why Create an Additional API

The purpose of this additional API is to provide a single entry point for all image processing and recognition logic.

It encapsulates all required endpoints and makes it easier to create an extensible architecture.

For example, providing a chat experience via Semantic Kernel, or adding new endpoints such as listing all training labels or resetting an AI agent’s memory.

Another point is that a reusable API can be used by other clients, projects, or products.  In particular, it can be used by the Raspberry Pi.

~

Core API Functionality

To implement both API endpoints, we need logic to handle the following:

  • Get vector embeddings for an image
  • Find the closest match to a given vector embedding from a list of existing vectors
  • Calculate cosine similarity

 

The following methods can be used to perform the above:

GetClipEmbeddingFromServer

We need a method that takes an image file from the local filesystem and then sends it to the local CLIP server running on localhost on port 5003.

Here, we can see the incoming image is converted to a byte array:

private async Task<List<float>> GetClipEmbeddingFromServer(string imagePath)
{
    using var client = new HttpClient();
    using var form = new MultipartFormDataContent();

    var fileBytes = await System.IO.File.ReadAllBytesAsync(imagePath);
    var byteContent = new ByteArrayContent(fileBytes);
    byteContent.Headers.ContentType = MediaTypeHeaderValue.Parse("application/octet-stream");

    form.Add(byteContent, "file", Path.GetFileName(imagePath));

    var response = await client.PostAsync("http://localhost:5003/embed", form);
    response.EnsureSuccessStatusCode();

    var responseJson = await response.Content.ReadAsStringAsync();
    using var doc = JsonDocument.Parse(responseJson);

    var embeddingArray = doc.RootElement
                            .GetProperty("embedding")
                            .EnumerateArray()
                            .Select(x => x.GetSingle())
                            .ToList();

    return embeddingArray;
}

 

The response contains an array of embeddings in the variable embeddingArray.
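
For reference, the parsing code above expects the CLIP server to respond with a JSON body shaped roughly like this (a hypothetical, truncated example; the real vector contains hundreds of values):

{ "embedding": [0.0132, -0.0457, 0.0981, ...] }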

CosineSimilarity

This method calculates the cosine similarity.  I should have paid more attention to my maths teacher in school.  I had to lean on AI to help me with this.

In essence, this method tells you how similar two sets of numbers are.

  • A larger number means they are more similar.
  • A smaller number means they are not very similar.

 

// This method compares two lists of numbers and returns how similar they are.
private double CosineSimilarity(List<float> a, List<float> b)
{
    // Step 1: Make sure both lists have the same number of items.
    if (a.Count != b.Count)
        throw new ArgumentException("Embeddings must have the same length.");

    // Step 2: Prepare variables for calculations.
    double dot = 0.0;    // This will store the sum of multiplied pairs.
    double magA = 0.0;   // This will store the squared sum for list a.
    double magB = 0.0;   // This will store the squared sum for list b.

    // Step 3: Go through each pair of numbers in the lists.
    for (int i = 0; i < a.Count; i++)
    {
        dot += a[i] * b[i];        // Multiply matching items and add to dot.
        magA += a[i] * a[i];       // Square a[i] and add to magA.
        magB += b[i] * b[i];       // Square b[i] and add to magB.
    }

    // Step 4: Calculate and return the similarity.
    // This divides the dot by the product of the magnitudes (after taking square roots).
    return dot / (Math.Sqrt(magA) * Math.Sqrt(magB));
}

 

It does this by determining if they “point” in the same direction.  The closer the result is to 1, the more similar they are.  If the result is 0, they have nothing in common.

Exactly what we need to compare a newly captured and vectorised image against a list of existing vectorised images.
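
As a quick sanity check, here is a small hypothetical usage of the method above, called from inside the same class (the vectors are made up; real CLIP embeddings contain hundreds of values):

// Vectors pointing in exactly the same direction score 1.0.
var identical = CosineSimilarity(new List<float> { 1f, 2f, 3f }, new List<float> { 2f, 4f, 6f });   // 1.00

// Perpendicular vectors score 0.0 - nothing in common.
var unrelated = CosineSimilarity(new List<float> { 1f, 0f }, new List<float> { 0f, 1f });           // 0.00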

FindBestMatch

This method loops through the stored TrainedEmbeddings, scores each one against the new embedding using CosineSimilarity, and only returns the best label if it reaches the similarity threshold (0.85 by default):

private (string Label, double Similarity)? FindBestMatch(List<float> newEmbedding, double threshold = 0.85)
{
    (string Label, double Similarity)? bestMatch = null;

    foreach (var entry in TrainedEmbeddings)
    {
        var similarity = CosineSimilarity(newEmbedding, entry.Embedding);
        if (similarity > (bestMatch?.Similarity ?? 0))
        {
            bestMatch = (entry.Label, similarity);
        }
    }

    return bestMatch != null && bestMatch.Value.Similarity >= threshold
        ? bestMatch
        : null;
}

 

Each of these methods is used by the API endpoints.

~

API Controller Endpoints

We need two controller endpoints.  One to add images for training the AI.  Another to match incoming image data against the training data.

Each controller endpoint will leverage the code we looked at in the prior sections.

Before digging into the controllers, we need some models to capture what the client sends and to represent the stored training data.

FileUploadRequest is used to capture image data from the client:

public class FileUploadRequest
{
    [Required]
    public IFormFile File { get; set; }
}

 

TrainedEmbedding is used to represent a vectorised image.  A string property Label is used to store a human readable description for the embedded image:

public class TrainedEmbedding
{
     public string Label { get; set; }
     public List<float> Embedding { get; set; }
}

 

With models defined, we can use them in the controllers.
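
The endpoint snippets below assume a standard API controller that holds the trained embeddings in an in-memory list and has IWebHostEnvironment injected (which is where the _env field comes from).  Here is a minimal sketch of that skeleton; the class name and field names are assumptions inferred from the code in this post:

[ApiController]
[Route("api/image")]
public class ImageController : ControllerBase
{
    // In-memory store of trained embeddings, shared across requests.
    private static readonly List<TrainedEmbedding> TrainedEmbeddings = new List<TrainedEmbedding>();

    // Used to resolve the content root so uploaded files can be written to disk.
    private readonly IWebHostEnvironment _env;

    public ImageController(IWebHostEnvironment env)
    {
        _env = env;
    }

    // GetClipEmbeddingFromServer, CosineSimilarity, FindBestMatch and the
    // /train and /match actions from this post all live inside this class.
}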

/api/image/train

This endpoint takes an input image (FileUploadRequest) and an associated label.  It then creates a local file.  The local file is passed to GetClipEmbeddingFromServer.  A vector is returned and added to the TrainedEmbeddings list.

    [HttpPost("train")]
    [Consumes("multipart/form-data")]
    public async Task<IActionResult> TrainEmbeddingAsync([FromForm] FileUploadRequest request, [FromForm] string label)
    {
        if (request.File == null || request.File.Length == 0)
            return BadRequest("No file uploaded.");


        if (string.IsNullOrWhiteSpace(label))
            return BadRequest("Label is required.");

        var uploads = Path.Combine(_env.ContentRootPath, "uploads");

        Directory.CreateDirectory(uploads);

        var filePath = Path.Combine(uploads, request.File.FileName);
        using (var stream = new FileStream(filePath, FileMode.Create))
        {
            await request.File.CopyToAsync(stream);
        }

        var embedding = await GetClipEmbeddingFromServer(filePath);
  
        // Store in memory for now
        TrainedEmbeddings.Add(new TrainedEmbedding
        {
            Label = label,
            Embedding = embedding
        });


        return Ok($"Trained embedding added with label: {label}");
    }

 

Data in the TrainedEmbeddings object is used as the master source of training data.
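
Note that TrainedEmbeddings only lives in memory, so anything added via /train is lost when the API restarts.  If you want the training data to survive a restart, one simple option (a hypothetical sketch using System.Text.Json, not something covered in this series) is to serialise the list to disk after each /train call and reload it at startup:

// Hypothetical helpers - persist the in-memory list to a JSON file.
private const string EmbeddingsFile = "trained-embeddings.json";

private void SaveEmbeddings() =>
    System.IO.File.WriteAllText(EmbeddingsFile, JsonSerializer.Serialize(TrainedEmbeddings));

private void LoadEmbeddings()
{
    if (!System.IO.File.Exists(EmbeddingsFile))
        return;

    var stored = JsonSerializer.Deserialize<List<TrainedEmbedding>>(
        System.IO.File.ReadAllText(EmbeddingsFile));

    if (stored != null)
    {
        TrainedEmbeddings.Clear();
        TrainedEmbeddings.AddRange(stored);
    }
}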

/api/image/match

This endpoint accepts an image as a parameter.  The input image is then checked against the list of known embeddings (training data) and a match is determined.

[HttpPost("match")]
[Consumes("multipart/form-data")]
public async Task PostAsync([FromForm] FileUploadRequest request)
 {

     if (request.File == null || request.File.Length == 0)

         return BadRequest("No file uploaded.");


     var uploads = Path.Combine(_env.ContentRootPath, "uploads");

     Directory.CreateDirectory(uploads);


     var filePath = Path.Combine(uploads, request.File.FileName);
     using (var stream = new FileStream(filePath, FileMode.Create))
     {
         await request.File.CopyToAsync(stream);
     }


     var newEmbedding = await GetClipEmbeddingFromServer(filePath);

     // Compare to trained embeddings
     var bestMatch = FindBestMatch(newEmbedding);


     if (bestMatch != null)
     {
         return Ok($"MATCH: {bestMatch.Value.Label} (Similarity: {bestMatch.Value.Similarity:F2})");
     }
     else
     {
         return Ok("NO MATCH FOUND");
     }
 }

 

Just like before, a local image is created from the input parameter.  A vector is created (newEmbedding) and a match is determined using the FindBestMatch method.

~

Testing the Controller and API

With the raw endpoints created, we can test them using Postman.  Before we do that, the CLIP server in VS Code must be started:

With the CLIP server running, we can invoke each of the endpoints.

Training and Labelling Images

Next, we can test the training and labelling of images.

We can access the /train endpoint by creating a POST request and sending it to:  http://localhost:5001/api/image/train

We supply an image from the file system and set the parameter name to File.  A parameter Label is set to “jamie”.  This is the human readable label for the associated image.
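
If you prefer code to Postman, the same request can be sent from a small C# console snippet.  This is only a sketch; it assumes the .NET API is listening on port 5001 and that a file called test.jpg exists next to the executable:

using System.Net.Http.Headers;

using var client = new HttpClient();
using var form = new MultipartFormDataContent();

var imageBytes = await File.ReadAllBytesAsync("test.jpg");
var imageContent = new ByteArrayContent(imageBytes);
imageContent.Headers.ContentType = MediaTypeHeaderValue.Parse("image/jpeg");

form.Add(imageContent, "File", "test.jpg");       // binds to FileUploadRequest.File
form.Add(new StringContent("jamie"), "Label");    // the human readable label

var response = await client.PostAsync("http://localhost:5001/api/image/train", form);
Console.WriteLine(await response.Content.ReadAsStringAsync());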

We send this request to the endpoint. The associated vector representation of the image and label are generated.   A message indicates success:

 

We can look under the hood and inspect the debugger to confirm this.  Here we can see 2 embeddings (I sent 2 POST requests):

 

We can expand the collection to examine the underlying vector embeddings:

 

And the associated label:

 

The /train endpoint has been successfully tested at this point.  Next, we can test the /match endpoint.

Matching Incoming Image Data with Training Data

The /match endpoint lets you supply and compare an image against the current stored list of vectorised images that were generated using the /train endpoint.

We’ll upload the following picture:

 

We set the key to File and value to the uploaded file in Postman and send the Request:

 

The first thing that happens is that our CLIP server running in VS Code receives the new image.  This is triggered by the following line:

var newEmbedding = await GetClipEmbeddingFromServer(filePath);

 

The method GetClipEmbeddingFromServer takes the file and returns vector embeddings for the supplied image.

We can inspect the CLIP server and embeddings that will be returned to the .NET API:

 

Next, the .NET API takes these embeddings:

 

The newly generated embeddings for the incoming image are compared against the training data:

Within FindBestMatch, we iterate through the current list of TrainedEmbeddings.

Cosine similarity is determined for each.  If the threshold (0.85) is not reached, a match is not found:

 

In this example, when comparing a picture of me to the Tony Stark image, the calculated similarity was 0.57:

 

The API returns “NO MATCH FOUND”:

 

Within the context of the wider solution, this would mean an alert would be sent to my cell phone.

We can supply a picture that is like the labelled training data (i.e. an existing picture of me):

 

We send the request and a match is found:

In the context of the wider solution, this would mean that notification would not be sent.  So far so good.

~

Demo

You can see the above in action in the following demo.  In this demo we:

  • Add and label an image to the list of trained embeddings using the /train endpoint
  • Invoke the /match endpoint testing for positive and negative signals

~

How Does This Fit Within the Wider Solution?

The .NET API makes it easier to consume the underlying CLIP and image classification AI that runs on the Raspberry Pi.

I don’t code much Python, so using .NET and C# makes the solution easier to build and maintain.

 

A recap of the end-to-end process

  1. When a photo is captured by the Raspberry Pi, it will be sent to the .NET API, which consumes the CLIP server.
  2. An embedding is generated and compared against a collection of existing training images that have already been converted to embeddings (images of me).
  3. The result of the embeddings comparison is a cosine similarity score.
  4. If the score meets the threshold against the existing images labelled “me”, the API will return a match with a high score and no Telegram message will be sent.
  5. If the score does not meet the threshold, no match is returned and the Telegram message notification will be invoked (a minimal sketch of this decision is shown below).
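
A hypothetical sketch of that decision on the Raspberry Pi side, assuming one helper that posts the captured photo to /api/image/match and another that sends the Telegram alert (both method names are illustrative only, not part of this series):

// Hypothetical client-side logic - helper methods are illustrative.
var result = await PostImageToMatchEndpointAsync(capturedPhotoPath);   // calls /api/image/match

if (result.StartsWith("MATCH"))
{
    // The person is me - stay quiet and send no Telegram notification.
}
else
{
    // Unknown person - forward the photo to the Telegram bot.
    await SendTelegramAlertAsync(capturedPhotoPath);
}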

~

Summary

In Part 4 of the series, we will extend the existing Telegram bot.  This will let us manage embeddings and invoke API endpoints directly from the cell phone.  Stay tuned.

~



