Calculation rules for quality metrics

Depending on your task, the asset-level quality metrics are computed differently:

Quality metrics calculation for classification tasks

When assets are labeled with consensus or honeypot activated (or after predictions are uploaded), an agreement score is calculated.

The process for single-category classification tasks is as follows:

  1. Take all the selected categories. If the available categories were "red", "blue" and "green", and all the labelers selected "red", this value will be 1.
  2. Iterate through the selected categories:
    1. For each labeler who selected a given category, calculate 1 / number of labelers. For 3 labelers who selected "red", this value will be 1/3 for each of them.
    2. Sum up the scores per category. In our case: 1/3 + 1/3 + 1/3 = 1
  3. Compute the total score for the labeling task: add up all the per-category scores, then divide by the number of selected categories. In our case: 1 / 1 = 1. The score is 100%.

The process for a multi-category classification task is exactly the same (a sketch follows this list):

  1. Take all the selected categories. If the available categories were "red", "blue" and "green", and all the labelers selected "red" and "blue", this value will be 2.
  2. Iterate through the selected categories:
    1. For each labeler who selected a given category, calculate 1 / number of labelers. For 3 labelers who selected "red" and "blue", this value will be 1/3 for each of them.
    2. Sum up the scores per category. In our case: (1/3 + 1/3 + 1/3) + (1/3 + 1/3 + 1/3) = 2
  3. Compute the total score for the labeling task: add up all the per-category scores, then divide by the number of selected categories. In our case: 2 / 2 = 1. The score is 100%.
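
To make these rules concrete, here is a minimal Python sketch of the agreement score described above. It illustrates the calculation only; it is not the platform's implementation, and the `agreement_score` name and signature are ours.

```python
from typing import List, Set

def agreement_score(selections: List[Set[str]]) -> float:
    """Agreement score for a classification task (single- or multi-category).

    `selections` holds, for each labeler, the set of categories they selected.
    Illustrative sketch of the rule described above, not the platform's code.
    """
    n_labelers = len(selections)
    selected_categories = set().union(*selections)  # every category picked by at least one labeler
    # Each labeler contributes 1 / number of labelers to every category they selected.
    per_category = {
        category: sum(1 / n_labelers for labeler in selections if category in labeler)
        for category in selected_categories
    }
    # Total score: sum of the per-category scores divided by the number of selected categories.
    return sum(per_category.values()) / len(selected_categories)

print(agreement_score([{"red"}, {"red"}, {"red"}]))  # 1.0 -> 100%
print(agreement_score([{"red", "blue"}] * 3))        # 1.0 -> 100%
print(agreement_score([{"red"}, {"blue"}]))          # 0.5 -> 50%
```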

📘

  • For consensus, if all labelers select the same class, the consensus is evaluated to 1 (100% agreement). For honeypot, the score is 100% only if the labeler's choice exactly matches the gold standard.
  • You can think of honeypot as consensus between two labelers, one of whom (the gold standard) is always right.
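
Seen this way, a honeypot score can be read as the same agreement computation between the labeler and the gold standard. Continuing with the hypothetical `agreement_score` sketch above:

```python
# Labeler picks "blue" while the gold standard is "red": each class scores 1/2,
# and (1/2 + 1/2) / 2 = 0.5, the 50% floor mentioned in the note further below.
print(agreement_score([{"blue"}, {"red"}]))  # 0.5
```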

Quality metrics calculation examples for classification tasks

| Selected categories | Number of labelers | Total score |
| --- | --- | --- |
| 1/3 | 3 | Class A: (1/3 + 1/3 + 1/3) = 1; 1 / 1 = 100% |
| 2/3 | 3 | Class A: (1/3 + 1/3 + 1/3) + Class B: (1/3 + 1/3 + 1/3) = 1 + 1 = 2; 2 / 2 = 100% |
| 2/3 | 3 | Class A: (1/3 + 1/3 + 0/3) + Class B: (0/3 + 0/3 + 1/3) = 2/3 + 1/3 = 1; 1 / 2 = 50% |
| 3/3 | 3 | Class A: (1/3 + 0/3 + 0/3) + Class B: (0/3 + 1/3 + 0/3) + Class C: (0/3 + 0/3 + 1/3) = 3/3 = 1; 1 / 3 = 33% |
| 1/3 | 2 | Class A: (1/2 + 1/2) = 1; 1 / 1 = 100% |
| 2/3 | 2 | Class A: (1/2 + 0/2) + Class B: (0/2 + 1/2) = 1; 1 / 2 = 50% |

📘

With only two labelers (or for honeypot), 50% is the lowest score possible.


Quality metrics calculation for object detection tasks

📘

Quality metrics calculation for object detection is implemented for bounding-boxes, polygons and semantic segmentation. It is not implemented yet for points, lines and vectors.

Each image is treated as a set of pixels to be classified. A pixel can be classified into several non-exclusive categories, each representing a different object. The computation evaluates the intersection over the union of all annotations, so the score depends on the common area. Hence:

  • two perfectly overlapping bounding-boxes correspond to a metric of 100 %
  • two completely distinct shapes correspond to a metric of 0 %

The mathematical formula for calculations is:

$\text{IoU}(A, B) = \frac{\text{area}(A \cap B)}{\text{area}(A \cup B)}$

With imprecise labeling, the quality score decreases quickly.

One benefit of this method is that it accounts for the size of the shape through the union-area denominator, which keeps the score meaningful for shapes of all sizes.
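
As an illustration, here is a minimal IoU sketch for axis-aligned bounding boxes. It is our own helper, not the platform's implementation; polygons and segmentation masks follow the same area ratio but require a geometry library.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))    # 1.0: perfectly overlapping boxes
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # 0.0: completely distinct shapes
print(iou((0, 0, 10, 10), (1, 1, 11, 11)))    # ~0.68: a small shift already costs a third of the score
```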


Quality metrics calculation for transcription tasks (all asset types)

Quality metric calculations for transcription jobs use the Levenshtein distance, which quantifies the minimum number of single-character changes (insertions, deletions, and substitutions) needed to turn one string into the other.

For example, for a task to translate the sentence "Bryan is in the kitchen" into French, the two different translations "Bryan est dans la cuisine" and "Bryan est dans la salle de bain" score 76 % against each other.

🚧

With a completely irrelevant translation, the Levenshtein ratio can still be around 50 %, so it is best to monitor this closely, especially in the case of short transcriptions.
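
The sketch below shows the idea with a plain Levenshtein distance and a length-normalized similarity; the exact normalization used by the platform may differ slightly, so the numbers are only indicative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions turning `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution (free if characters match)
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Length-normalized Levenshtein similarity, between 0 and 1."""
    total = len(a) + len(b)
    return (total - levenshtein(a, b)) / total if total else 1.0

print(similarity("Bryan est dans la cuisine",
                 "Bryan est dans la salle de bain"))  # ~0.79 with this normalization, in the same range as the 76 % above
```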


Quality metrics calculation for Named Entity Recognition tasks (NER)

For NER tasks, quality scores are computed at the category level:

For each category, the score is the intersection over the union of all entities annotated by the labelers (taking the latest submitted label per labeler). Its maximum is 1 if the selected entities are the same (beginOffset and endOffset are in alignment), and its minimum is 0 if the entities do not overlap at all.

In the example below, if one annotator has labeled the yellow entity and another the two blue entities, and both used the same category, the final score will be 33%.

score = number of characters in common / number of characters in the union = 15/45 = 33%
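
A minimal sketch of that character-level intersection over union for one category, with entities given as (beginOffset, endOffset) pairs (hypothetical helper, not the platform's implementation):

```python
def ner_category_score(spans_a, spans_b):
    """Character-level IoU between two labelers' entities for a single category."""
    def covered(spans):
        chars = set()
        for begin, end in spans:
            chars.update(range(begin, end))  # characters covered by the entity spans
        return chars

    chars_a, chars_b = covered(spans_a), covered(spans_b)
    union = chars_a | chars_b
    return len(chars_a & chars_b) / len(union) if union else 1.0

# One labeler annotated a single 15-character entity, the other annotated two entities
# covering 45 characters in total that include the first span: 15 / 45 = 33%.
print(ner_category_score([(10, 25)], [(10, 40), (50, 65)]))  # 0.333...
```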


Quality metrics calculation for named entity relation tasks

For NER relation tasks, computation rules are almost the same as in Named Entity Recognition tasks. The only difference is that calculations are made at the category level and also at the relation level. For two 100% overlapping entities, if one labeler included both of them in a relation but the other didn't, the metric will be 0.


Quality metrics calculation for optical character recognition (OCR) tasks

An OCR task is the composition of an object detection task (selecting a box) and a text entry task (the text contained in the box).

For this task, the quality metric for each category is computed in the same way as for transcription, using the Levenshtein distance between the selected text items. To do this, for each label being compared, we concatenate the text items of that category's bounding boxes.

This is how the order of bounding boxes is established (a short sketch follows the list):

  • The document is divided into horizontal sections of equal height. The height is the average height of all bounding boxes in the considered category of the considered label.
  • Next, we take the bounding boxes, using their centers as references, from top to bottom. Boxes whose centers fall in the same horizontal section are taken from left to right; boxes in different horizontal sections are taken from top to bottom.
  • The horizontal section takes precedence over the left/right position.
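
A minimal sketch of this ordering, assuming boxes are given as (x_min, y_min, x_max, y_max) with y growing downwards (hypothetical helper, not the platform's implementation):

```python
def reading_order(boxes):
    """Sort boxes top-to-bottom by horizontal section, then left-to-right within a section."""
    if not boxes:
        return []
    # Section height = average height of the boxes in the considered category of the considered label.
    band_height = sum(y1 - y0 for _, y0, _, y1 in boxes) / len(boxes)

    def key(box):
        x0, y0, x1, y1 = box
        center_x, center_y = (x0 + x1) / 2, (y0 + y1) / 2
        band = int(center_y // band_height)  # horizontal section containing the box center
        return (band, center_x)              # section first, then left/right position

    return sorted(boxes, key=key)

# Two boxes on the same line and one below: same-section boxes are read left to right.
print(reading_order([(50, 0, 90, 10), (0, 0, 40, 10), (0, 20, 40, 30)]))
# [(0, 0, 40, 10), (50, 0, 90, 10), (0, 20, 40, 30)]
```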