
Calculation rules for quality metrics

Depending on your task and the specific quality metric type, the asset-level quality metrics are computed differently.

Calculation rules for Honeypot, Review score, and Human-model IoU

For Honeypot, Review score, and Human-model IoU, asset-level quality metrics depend on the specific labeling task. For more information, refer to the Calculation rules for classification tasks and Calculation rules for object detection tasks sections below.

For all the other job types, calculation rules for Consensus apply.

📘

When multiple job types exist in your ontology:

  • For same-level jobs (no nesting), the final score is a weighted average of the per-job scores, weighted by the number of annotations in each job. For example, if 20 objects were marked on an asset using object detection and then 5 transcriptions were added, the final score is skewed towards object detection.
  • For nested jobs, a geometric average is calculated. For example, for a bounding box with a nested classification, if the classification is wrong, the review score for the bounding box is 0%. Nesting itself does not affect the weight of the job in the calculation. Both rules are illustrated in the sketch below.
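
To make these two aggregation rules concrete, here is a minimal Python sketch; the function names and inputs are illustrative only and are not part of the Kili API:

```python
from math import prod

def combine_same_level(job_scores, job_counts):
    """Weighted average of per-job scores, weighted by annotation count per job."""
    total = sum(job_counts)
    return sum(score * count for score, count in zip(job_scores, job_counts)) / total

def combine_nested(parent_score, child_scores):
    """Geometric average of a parent job score and its nested job scores."""
    scores = [parent_score, *child_scores]
    return prod(scores) ** (1 / len(scores))

# 20 object-detection annotations at 90% and 5 transcriptions at 60%:
print(combine_same_level([0.9, 0.6], [20, 5]))  # 0.84 -> skewed towards object detection
# A bbox at 100% alignment with a wrong nested classification (0%):
print(combine_nested(1.0, [0.0]))               # 0.0
```

With the weighted average, larger jobs dominate the final score; with the geometric average, a single 0% nested score zeroes out its parent.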

Calculation rules for classification tasks

Single-choice classification jobs

For single-choice classification jobs, the result is always either 0 (0%) or 1 (100%).

Multiple-choice classification jobs

For multiple-choice classification jobs, Kili divides the number of classes selected by both users by the total number of distinct classes selected by either of them. For example, if "red", "green", and "blue" were the available options, the labeler selected "red", and the reviewer selected "blue" and "red", then one class out of two distinct selected classes matches, so the score is 1/2 = 50%.
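
A minimal sketch of this ratio, assuming the score is the number of classes chosen by both users divided by the number of distinct classes chosen by either (the helper name is hypothetical, not a Kili function):

```python
def multi_choice_score(labeler_classes, reviewer_classes):
    """Overlap ratio between two multiple-choice selections."""
    labeler, reviewer = set(labeler_classes), set(reviewer_classes)
    union = labeler | reviewer
    if not union:
        return 1.0  # assumption: two empty selections count as full agreement
    return len(labeler & reviewer) / len(union)

print(multi_choice_score({"red"}, {"blue", "red"}))  # 0.5 -> 50%
```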

Calculation rules for object detection tasks

For object detection tasks, Kili applies a matching algorithm: first, for each matched pair of bboxes, the app compares the alignment of the bboxes (their IoU) and the selected classes, and computes a per-pair score. The sum of these per-pair scores is then divided by the sum of:

  • The number of matched pairs of bboxes in this annotation
  • The number of bboxes that were added by the reviewer only
  • The number of bboxes that were added by the labeler only

For example:

| Alignment of bboxes | Classes matched | Pairs of bboxes | Calculation |
| --- | --- | --- | --- |
| 100% | 0% (different classes selected) | 1 (labeler and reviewer added the same number of bboxes) | 0.5/(1+0+0) = 50% |
| 95% | 100% (same classes selected) | 1 (labeler and reviewer added the same number of bboxes) | 0.95/(1+0+0) = 95% |
| 100% | 100% (same classes selected) | 1 (but the reviewer removed one bbox) | 1/(1+0+1) = 50% |
| 100% | 100% (same classes selected) | 1 (but the reviewer added one bbox) | 1/(1+1+0) = 50% |
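
The arithmetic in the table can be reproduced with the short sketch below. The per-pair scores and the counts of unmatched boxes are taken as given, since the matching step itself is not detailed on this page; the function is illustrative, not part of Kili:

```python
def object_detection_score(pair_scores, reviewer_only, labeler_only):
    """Sum of per-pair scores divided by (matched pairs + unmatched boxes).

    `pair_scores` holds one score per matched pair of bboxes (IoU, possibly
    penalized when the classes differ); `reviewer_only` and `labeler_only`
    count the bboxes added by only one of the two users.
    """
    denominator = len(pair_scores) + reviewer_only + labeler_only
    return sum(pair_scores) / denominator

print(object_detection_score([0.5], 0, 0))   # 0.50 -> aligned boxes, different classes
print(object_detection_score([0.95], 0, 0))  # 0.95 -> same classes, 95% alignment
print(object_detection_score([1.0], 0, 1))   # 0.50 -> reviewer removed one bbox
print(object_detection_score([1.0], 1, 0))   # 0.50 -> reviewer added one bbox
```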

Calculation rules for Consensus

📘

Consensus calculations may involve labels added by more than two users. This is why they're different from other quality metrics.

For consensus, asset-level quality metrics depend on the specific labeling task. For more information, refer to the consensus sections below, which cover classification, object detection, transcription, Named Entity Recognition, named entity relation, and OCR tasks.

📘

When multiple job types exist in your ontology (including nested jobs), Kili calculates an average score based on all the jobs. So if your ontology contains one classification job with a consensus score of 50% and one transcription job with a consensus score of 100%, the total consensus average score for the asset is 75%.


Consensus calculation rules for classification tasks

When assets are labeled with consensus activated (or after predictions are uploaded), an agreement score is calculated.

The process for single-category classification tasks is as follows:

  1. Count the distinct selected categories. If the available categories were "red", "blue", and "green", and all the labelers selected "red", this count is 1.
  2. Iterate through the selected categories:
    1. For each labeler who selected a given category, calculate 1 / number of labelers. For 3 labelers who selected "red", this value is 1/3 for each of them.
    2. Sum up the scores per category. In our case: 1/3 + 1/3 + 1/3 = 1.
  3. Sum up the total score for the labeling task: add all the per-category scores and divide by the number of distinct selected categories. In our case: 1 / 1 = 1, so the score is 100%.

The process for a multi-category classification task is exactly the same:

  1. Count the distinct selected categories. If the available categories were "red", "blue", and "green", and all the labelers selected "red" and "blue", this count is 2.
  2. Iterate through the selected categories:
    1. For each labeler who selected a given category, calculate 1 / number of labelers. For 3 labelers who selected "red" and "blue", this value is 1/3 for each of them, for each category.
    2. Sum up the scores per category. In our case: (1/3 + 1/3 + 1/3) + (1/3 + 1/3 + 1/3) = 2.
  3. Sum up the total score for the labeling task: add all the per-category scores and divide by the number of distinct selected categories. In our case: 2 / 2 = 1, so the score is 100%.

📘

If all labelers select the same class, the consensus will be evaluated to 1 (100% agreement).

Consensus calculation examples for classification tasks

| Selected categories | Number of labelers | Total score |
| --- | --- | --- |
| 1/3 | 3 | Class A: (1/3 + 1/3 + 1/3) = 1; 1 / 1 = 100% |
| 2/3 | 3 | Class A: (1/3 + 1/3 + 1/3) + Class B: (1/3 + 1/3 + 1/3) = 1 + 1 = 2; 2 / 2 = 100% |
| 2/3 | 3 | Class A: (1/3 + 1/3 + 0/3) + Class B: (0/3 + 0/3 + 1/3) = 2/3 + 1/3 = 1; 1 / 2 = 50% |
| 3/3 | 3 | Class A: (1/3 + 0/3 + 0/3) + Class B: (0/3 + 1/3 + 0/3) + Class C: (0/3 + 0/3 + 1/3) = 3/3 = 1; 1 / 3 = 33% |
| 1/3 | 2 | Class A: (1/2 + 1/2) = 1; 1 / 1 = 100% |
| 2/3 | 2 | Class A: (1/2 + 0/2) + Class B: (0/2 + 1/2) = 1; 1 / 2 = 50% |
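
Here is a minimal sketch of the classification consensus rule described above; it is illustrative only, but it reproduces the rows of the examples table:

```python
from collections import Counter

def classification_consensus(selections):
    """Consensus score for a classification job.

    `selections` is one set of chosen categories per labeler (latest label).
    Sketch of the rule described above, not Kili's internal code.
    """
    n_labelers = len(selections)
    votes = Counter(cat for chosen in selections for cat in chosen)
    distinct = len(votes)
    if distinct == 0:
        return 1.0  # assumption: no selections at all counts as agreement
    per_category = [count / n_labelers for count in votes.values()]
    return sum(per_category) / distinct

# Rows from the examples table above:
print(classification_consensus([{"A"}, {"A"}, {"A"}]))  # 1.0 -> 100%
print(classification_consensus([{"A", "B"}] * 3))        # 1.0 -> 100%
print(classification_consensus([{"A"}, {"A"}, {"B"}]))   # 0.5 -> 50%
print(classification_consensus([{"A"}, {"B"}, {"C"}]))   # 0.33... -> 33%
print(classification_consensus([{"A"}, {"B"}]))          # 0.5 -> 50%
```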

📘

With only two labelers, 50% is the lowest score possible.


Consensus calculation rules for object detection tasks

📘

Quality metrics calculation for object detection is implemented for bounding boxes, polygons, and semantic segmentation. It is not yet implemented for points, lines, and vectors.

Each image is considered a set of pixels to be classified. A pixel can be classified into several non-exclusive categories representing different objects. The computation evaluates the intersection over the union of all annotations, so the ratio depends on the common area. Hence:

  • two perfectly overlapping bounding boxes correspond to a metric of 100%
  • two completely distinct shapes correspond to a metric of 0%

The mathematical formula for calculations is:

$\mathrm{IoU}(A, B) = \dfrac{\mathrm{area}(A \cap B)}{\mathrm{area}(A \cup B)}$

With imprecise labeling, the quality score decreases quickly.

One benefit of this method is that it adapts to the size of the shape, through the union-area denominator, so the score stays meaningful for shapes of all sizes.
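
For axis-aligned bounding boxes, the IoU formula reduces to a few lines. This sketch is a simplified illustration; Kili applies the same area ratio to polygons and segmentation masks:

```python
def bbox_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

print(bbox_iou((0, 0, 10, 10), (0, 0, 10, 10)))    # 1.0  -> perfectly overlapping
print(bbox_iou((0, 0, 10, 10), (20, 20, 30, 30)))  # 0.0  -> completely distinct
print(bbox_iou((0, 0, 10, 10), (1, 1, 11, 11)))    # ~0.68 -> imprecise labeling is penalized
```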


Consensus calculation rules for transcription tasks (all asset types)

Quality metric calculations for transcription jobs use the Levenshtein distance, which quantifies the number of single-character edits (insertions, deletions, and substitutions) required to change one string into the other.

For example, a task to translate the sentence "Bryan is in the kitchen" into French with two different translations, "Bryan est dans la cuisine" and "Bryan est dans la salle de bain", scores at 76%.
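
A minimal sketch of the underlying distance is shown below. The exact ratio Kili uses to turn the distance into the reported percentage (76% in the example above) is not spelled out on this page, so the simple normalization in the sketch is only one plausible choice and yields a different number:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

a = "Bryan est dans la cuisine"
b = "Bryan est dans la salle de bain"
print(levenshtein(a, b))                            # 12 edits
# One simple normalization of the distance into a similarity score;
# Kili's own ratio (which yields the 76% above) may differ.
print(1 - levenshtein(a, b) / max(len(a), len(b)))  # ~0.61
```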

🚧

With a completely irrelevant translation, the Levenshtein ratio can still be around 50%, so it is best to monitor this metric closely, especially for short transcriptions.


Consensus calculation rules for Named Entity Recognition tasks (NER)

For NER tasks, quality scores are computed at the category level:

For each category, the score is the intersection over the union of all entities from all labelers (latest submitted label per labeler). Its maximum is 1 if the selected entities are identical (beginOffset and endOffset align), and its minimum is 0 if the entities do not overlap at all.

In the example below, if one annotator has labeled the yellow entity and another the two blue entities, and both used the same category, the final score will be 33%.

score = number of characters in common / number of characters union = 15/45 = 33%
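
A minimal sketch of this character-level intersection over union, assuming entities are given as (beginOffset, endOffset) pairs; the helper is illustrative, not a Kili function:

```python
def ner_category_score(entities_a, entities_b):
    """Character-level IoU for one category, comparing two labelers' entities."""
    chars_a = {i for begin, end in entities_a for i in range(begin, end)}
    chars_b = {i for begin, end in entities_b for i in range(begin, end)}
    union = chars_a | chars_b
    if not union:
        return 1.0  # assumption: nothing annotated on either side
    return len(chars_a & chars_b) / len(union)

# One 15-character entity vs. two entities covering 45 characters in total,
# overlapping on those same 15 characters:
print(ner_category_score([(10, 25)], [(10, 30), (35, 60)]))  # 0.33... -> 15/45 = 33%
```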


Consensus calculation rules for named entity relation tasks

For NER relation tasks, the computation rules are almost the same as for Named Entity Recognition tasks. The only difference is that calculations are made both at the category level and at the relation level. For two 100% overlapping entities, if one labeler included both of them in a relation and the other did not, the metric is 0.


Consensus calculation rules for optical character recognition (OCR) tasks

The OCR task is the composition of an object detection task (selecting a box) and a text entry task (transcribing the text contained in the box).

For this task, the quality metric for each category is computed in the same way as for transcription, using the Levenshtein distance between the selected text items. To do this, the text items for each category in each bounding box are concatenated for the labels being compared.

This is how the order of bounding boxes is established (a sketch follows the list):

  • The document is divided into horizontal sections of equal height. The height is the average height of all bounding boxes in the considered category of the considered label.
  • The bounding boxes are then ordered by their centers: boxes whose centers fall in different horizontal sections are taken from top to bottom, and boxes whose centers fall in the same horizontal section are taken from left to right.
  • The horizontal section takes precedence over the left/right position.
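
A minimal sketch of this ordering, assuming boxes are given as (x_min, y_min, x_max, y_max) tuples; the exact tie-breaking inside Kili may differ:

```python
def order_boxes(boxes):
    """Order OCR bounding boxes: top-to-bottom by horizontal section, then left-to-right."""
    # Section height = average box height, as described above.
    section_height = sum(y2 - y1 for _, y1, _, y2 in boxes) / len(boxes)

    def key(box):
        x1, y1, x2, y2 = box
        center_x, center_y = (x1 + x2) / 2, (y1 + y2) / 2
        section = int(center_y // section_height)  # horizontal band the center falls in
        return (section, center_x)                 # section first, then left-to-right

    return sorted(boxes, key=key)

boxes = [(120, 12, 180, 28), (10, 10, 90, 30), (15, 50, 95, 70)]
print(order_boxes(boxes))  # top row left-to-right first, then the lower box
```

The transcriptions of the ordered boxes are then concatenated per category before computing the Levenshtein-based score.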