Tag Images with Elixir and YOLOv8

Marcelo Reichert · Published in DataDrivenInvestor
6 min read · Oct 26, 2023

Using Ortex to load an ONNX model

I wrote another article about how we can tag images using Elixir and ONNX. This time, let’s use the new Ortex library to load the YOLOv8 model.

Thanks to @TravisaurusPlex, who quickly updated Ortex to use {:nx, "~> 0.6"}. This let me use tensor vectorization and improve performance.

I used Livebook 0.11.3 for testing.

Mix.install([
  {:ortex, "0.1.8"},
  {:evision, "~> 0.1.33"},
  {:image, "~> 0.38.3"},
  {:kino, "0.11.0"},
  {:exla, "~> 0.6.1"}
])
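
Optionally, you can also make EXLA the default Nx backend for the whole notebook. This is just a convenience and not strictly required, since we pass EXLA.Backend explicitly where it matters below:

Nx.global_default_backend(EXLA.Backend)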

The YOLOv8 COCO model returns scores for 80 classes, so we register the name of each label, in index order.

labels = [
  "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat",
  "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat",
  "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack",
  "umbrella", "handbag", "tie", "suitcase", "frisbee", "skis", "snowboard", "sports ball",
  "kite", "baseball bat", "baseball glove", "skateboard", "surfboard", "tennis racket",
  "bottle", "wine glass", "cup", "fork", "knife", "spoon", "bowl", "banana", "apple",
  "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair",
  "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse",
  "remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator",
  "book", "clock", "vase", "scissors", "teddy bear", "hair drier", "toothbrush"
]

Then we load the model with Ortex.load/1 and wrap it in a serving with Nx.Serving.new/2.

## load model
model =
  Path.join(__DIR__, "yolov8n.onnx")
  |> Ortex.load()

serving = Nx.Serving.new(Ortex.Serving, model)
%Nx.Serving{
  module: Ortex.Serving,
  arg: #Ortex.Model<
    inputs: [{"images", "Float32", [1, 3, 640, 640]}]
    outputs: [{"output0", "Float32", [1, 84, 8400]}]>,
  client_preprocessing: nil,
  client_postprocessing: nil,
  streaming: nil,
  batch_size: nil,
  distributed_postprocessing: &Function.identity/1,
  process_options: [],
  defn_options: []
}
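
As an aside, the serving doesn’t have to live only in a notebook cell. In an application, Nx.Serving can be started under a supervisor and called with Nx.Serving.batched_run/2. A minimal sketch (YoloServing is a hypothetical name, not something from this article):

## optional: run the serving under a supervisor
children = [
  {Nx.Serving, serving: serving, name: YoloServing, batch_size: 1}
]

Supervisor.start_link(children, strategy: :one_for_one)

# Later, from any process:
# Nx.Serving.batched_run(YoloServing, image_batch)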

With the model defined, we can start working with the images.

The original image, ny.jpg:
# load image
alias Evision, as: Cv

image =
  Path.join(__DIR__, "ny.jpg")
  |> Cv.imread()
%Evision.Mat{
  channels: 3,
  dims: 2,
  type: {:u, 8},
  raw_type: 16,
  shape: {1944, 2592, 3},
  ref: #Reference<0.3437772413.3814064154.17065>
}

Remember that when we loaded the model, we received the model’s input definition: inputs: [{"images", "Float32", [1, 3, 640, 640]}].

This means we need to resize the image to {640, 640}. Cv.DNN.blobFromImage/2 does that, and also scales the pixel values to the 0..1 range (scalefactor: 1 / 255.0) and swaps BGR to RGB (swapRB: true).

image_evision =
  Cv.DNN.blobFromImage(
    image,
    scalefactor: 1 / 255.0,
    swapRB: true,
    size: {640, 640},
    crop: false
  )
%Evision.Mat{
  channels: 1,
  dims: 4,
  type: {:f, 32},
  raw_type: 5,
  shape: {1, 3, 640, 640},
  ref: #Reference<0.3437772413.3814064147.16910>
}

OK, we now have a {1, 3, 640, 640} blob of Float32 values. We can convert it to a tensor.

## image to tensor
image_tensor =
  image_evision
  |> Cv.Mat.to_nx(EXLA.Backend)
#Nx.Tensor<
f32[1][3][640][640]
EXLA.Backend<host:0, 0.3437772413.3814064147.16914>
[
[
[
[0.5411764979362488, 0.5372549295425415, 0.5882353186607361, 0.6117647290229797, 0.6196078658103943, 0.6000000238418579, 0.6039215922355652, 0.6352941393852234, 0.6235294342041016, 0.6078431606292725, 0.5960784554481506, 0.615686297416687, 0.5764706134796143, 0.6509804129600525, 0.5921568870544434, 0.5333333611488342, 0.3921568691730499, 0.545098066329956, 0.4941176474094391, 0.6196078658103943, 0.5254902243614197, 0.4901960790157318, 0.5411764979362488, 0.4901960790157318, 0.529411792755127, 0.5254902243614197, 0.5254902243614197, 0.5215686559677124, 0.45490196347236633, 0.29411765933036804, 0.3803921639919281, 0.5529412031173706, 0.4941176474094391, 0.3960784375667572, 0.30980393290519714, 0.37254902720451355, 0.48627451062202454, 0.37254902720451355, 0.21176470816135406, 0.1725490242242813, 0.25882354378700256, 0.16862745583057404, 0.24705882370471954, 0.25882354378700256, 0.24705882370471954, 0.09803921729326248, 0.125490203499794, 0.364705890417099, 0.23137255012989044, 0.3137255012989044, ...],
...
],
...
]
]
>

Now we can send the tensor to the serving to get our predictions.

## predictions
image_batch = Nx.Batch.stack([{image_tensor[0]}])

predictions =
  with {result} <- Nx.Serving.run(serving, image_batch) do
    result[0]
    |> Nx.backend_transfer()
    |> Nx.transpose(axes: [1, 0])
  end
#Nx.Tensor<
f32[8400][84]
[
[10.502091407775879, 11.621222496032715, 20.2751522064209, 23.181604385375977, 2.6613473892211914e-5, 1.6093254089355469e-6, 1.6185641288757324e-4, 9.238719940185547e-7, 1.3947486877441406e-5, 0.0020144283771514893, 6.368756294250488e-5, 9.599030017852783e-4, 2.7805566787719727e-5, 2.4774670600891113e-4, 6.556510925292969e-7, 1.0877847671508789e-5, 1.5795230865478516e-6, 3.361701965332031e-5, 1.1920928955078125e-7, -5.960464477539063e-8, -5.960464477539063e-8, 2.980232238769531e-7, 1.1920928955078125e-7, 1.1920928955078125e-7, 1.7881393432617188e-7, 8.940696716308594e-8, 8.940696716308594e-8, 1.4901161193847656e-7, 5.364418029785156e-7, 7.718801498413086e-6, 2.592802047729492e-6, 1.7881393432617188e-7, 9.47713851928711e-6, 1.7881393432617188e-7, 9.834766387939453e-7, 1.2814998626708984e-6, 2.384185791015625e-7, 5.21540641784668e-6, 1.2516975402832031e-6, 1.4901161193847656e-7, 8.940696716308594e-6, 1.5497207641601562e-6, 2.682209014892578e-7, 1.633167266845703e-5, 3.5762786865234375e-7, 3.993511199951172e-6, 2.9802322387695312e-8, 1.1920928955078125e-7, 8.940696716308594e-8, 2.682209014892578e-7, ...],
...
]
>

The YOLOv8 model returned 8400 rows of predictions, each with 84 columns. The first 4 columns define the box (center x, center y, width, height) and the other 80 are the scores for each class label (defined above).

So we can separate the boxes:

## Boxes
prediction_bboxes =
  predictions
  |> Nx.slice([0, 0], [8400, 4])

And scores:

## Scores
prediction_scores =
  predictions
  |> Nx.slice([0, 4], [8400, 80])

We vectorize the score tensor over its rows and use Nx.argmax/2 to identify which label has the highest score in each row.

## define classes
classesID =
  prediction_scores
  |> Nx.vectorize(:rows)
  |> Nx.argmax(keep_axis: false)
  |> Nx.devectorize()
  |> Nx.to_flat_list()
[5, 5, 5, 5, 9, 9, 5, 5, 5, 5, 9, 5, 9, 5, 5, 5, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 11,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, ...]
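
Each class ID is an index into the labels list above; for example, ID 5 maps to "bus" and ID 9 to "traffic light":

Enum.at(labels, 5)
#=> "bus"
Enum.at(labels, 9)
#=> "traffic light"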

And get the highest score for each row:

## Get scores
get_score = fn vec ->
  class = Nx.argmax(vec, keep_axis: false)
  Nx.take(vec, class)
end

scores =
  prediction_scores
  |> Nx.vectorize(:rows)
  |> get_score.()
  |> Nx.devectorize()
  |> Nx.to_list()
[0.0020144283771514893, 0.003671199083328247, 0.004642307758331299, 0.0036547183990478516,
0.0014902055263519287, 0.005525022745132446, 0.005158871412277222, 0.007003724575042725,
0.0034270286560058594, 0.002779841423034668, 0.0024441182613372803, 0.0022981464862823486,
8.35031270980835e-4, 9.03397798538208e-4, 7.626712322235107e-4, 4.7725439071655273e-4,
2.8580427169799805e-4, 0.0014012455940246582, 0.0024708807468414307, 7.229447364807129e-4,
3.1119585037231445e-4, 2.0712614059448242e-4, 3.6388635635375977e-5, 2.6792287826538086e-5,
3.8111209869384766e-4, 0.0029450654983520508, 0.003663182258605957, 0.0016193091869354248,
7.010400295257568e-4, 0.006528317928314209, 0.001441270112991333, 0.0019300878047943115,
0.0019274652004241943, 0.00556597113609314, 0.04098913073539734, 0.015473812818527222,
1.9416213035583496e-4, 3.2404065132141113e-4, 0.003769010305404663, 0.005686938762664795,
0.0030829012393951416, 5.040764808654785e-4, 6.439685821533203e-4, 9.150207042694092e-4,
6.031990051269531e-5, 1.3709068298339844e-5, 1.9729137420654297e-5, 1.3202428817749023e-5,
3.4362077713012695e-5, 5.474686622619629e-5, ...]
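
For comparison, the same class IDs and scores can also be computed without vectorization, by reducing over the class axis directly. This is only a sketch of an equivalent formulation, not what the article benchmarks:

## non-vectorized equivalent (sketch)
classesID = prediction_scores |> Nx.argmax(axis: 1) |> Nx.to_flat_list()
scores = prediction_scores |> Nx.reduce_max(axes: [1]) |> Nx.to_list()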

We save the boxes as tuples:

bboxes =
  prediction_bboxes
  |> Nx.to_list()
  |> Enum.map(&List.to_tuple/1)
[
{10.502091407775879, 11.621222496032715, 20.2751522064209, 23.181604385375977},
{10.708126068115234, 10.530062675476074, 18.33818817138672, 21.079179763793945},
{74.64244079589844, 11.722304344177246, 43.36077880859375, 23.6668758392334},
{85.36026000976562, 12.035582542419434, 18.913047790527344, 24.110872268676758},
{...},
...
]

With the boxes and scores defined, we pass them to the Cv.DNN.nmsBoxes/4 function. The last two arguments are the score threshold (0.5) and the NMS threshold (0.2).

Performs non-maximum suppression given boxes and corresponding scores.

nmsBoxes = Cv.DNN.nmsBoxes(bboxes, scores, 0.5, 0.2)
[8258, 7347, 8043, 7770, 7818, 6797, 7705, 8156, 6897, 7854, 7450, 7615, 6958, 6905, 8073, 7481,
6693, 7286, 2527, 7413, 6683, 6775]

Now we know which prediction rows contain relevant boxes, and we can draw those boxes on our image.

A function that rebuilds our image from the tensor:

make_image = fn ->
  {:ok, image} =
    image_tensor[0]
    |> Nx.transpose(axes: [1, 2, 0])
    |> Nx.multiply(255)
    |> Image.from_nx()

  image
end

A module that draws the boxes and labels:

defmodule Tagging do
  def draw_bbox_labels(object_boxes, image) do
    Enum.reduce(object_boxes, image, fn {boxes, class_name}, image ->
      Enum.reduce(boxes, image, fn [cx, cy, w, h | _probs], image ->
        {text_image, _alpha} =
          Image.split_alpha(Image.Text.text!(class_name, font_size: 20))

        {:ok, image} =
          image
          |> Image.Draw.rect!(round(cx - w / 2), round(cy - h / 2), round(w), round(h),
            fill: false,
            color: :blue
          )
          |> Image.Draw.image(
            text_image,
            min(max(round(cx - w / 2), 0), 640),
            min(max(round(cy - h / 2), 0), 640)
          )

        image
      end)
    end)
  end
end

Draw each box.

{:ok, evision_mat} =
  Enum.map(nmsBoxes, fn idx ->
    b =
      bboxes
      |> Enum.at(idx)
      |> Tuple.to_list()

    s = Enum.at(scores, idx)
    c = Enum.at(classesID, idx)
    l = Enum.at(labels, c)

    {[b ++ [s]], l}
  end)
  |> Tagging.draw_bbox_labels(make_image.())
  |> Image.to_evision()
{:ok,
%Evision.Mat{
channels: 3,
dims: 2,
type: {:f, 32},
raw_type: 21,
shape: {640, 640, 3},
ref: #Reference<0.3437772413.3814064147.16957>
}}

Finally, let’s resize the result back to the original size.

## Resize to original
{h, w, _} = image.shape
Cv.resize(evision_mat, {w, h})
%Evision.Mat{
channels: 3,
dims: 2,
type: {:f, 32},
raw_type: 21,
shape: {1944, 2592, 3},
ref: #Reference<0.3437772413.3814064147.16958>
}
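
If you want to persist the tagged image, Evision can write it back to disk. This is only a sketch, assuming Cv.Mat.from_nx_2d/1 and Cv.imwrite/2 behave as in the Evision docs, and the filename is just an example; imwrite expects 8-bit data, so we convert the f32 result back to u8 first:

## save to disk (optional sketch)
Cv.resize(evision_mat, {w, h})
|> Cv.Mat.to_nx(EXLA.Backend)
|> Nx.as_type({:u, 8})
|> Cv.Mat.from_nx_2d()
|> then(&Cv.imwrite("ny_tagged.jpg", &1))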

And here is the result.

The idea is to detect objects in video and in real time, matching the performance obtained with Python; a rough sketch of that loop follows.
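
The same serving could be driven by video frames through Evision’s VideoCapture bindings, with the same preprocessing as the single-image case. This is an untested outline: the file name is made up, and the capture/read behaviour is assumed to mirror OpenCV’s (read/1 returning a frame, or false at end of stream):

## hypothetical frame loop (sketch)
capture = Cv.VideoCapture.videoCapture("street.mp4")

defmodule FrameLoop do
  def run(capture, serving) do
    case Cv.VideoCapture.read(capture) do
      false ->
        :done

      frame ->
        blob =
          Cv.DNN.blobFromImage(frame,
            scalefactor: 1 / 255.0,
            swapRB: true,
            size: {640, 640},
            crop: false
          )

        tensor = Cv.Mat.to_nx(blob, EXLA.Backend)
        batch = Nx.Batch.stack([{tensor[0]}])
        {result} = Nx.Serving.run(serving, batch)
        # ... process `result` with transpose, NMS and drawing as above ...
        run(capture, serving)
    end
  end
end

FrameLoop.run(capture, serving)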

I hope to publish those results in the next article.
