The goal of this double-triangle annotation paradigm is to produce precise annotations by combining the power of LLMs (or other automatic labelling models) with human judgment. To reduce human effort while ensuring the quality of machine labelling, we design the paradigm as follows.
There are two machine annotators and one human jury in this part. The two machine annotators, which could be LLMs or traditional OCR systems, produce labels for our data. We then automatically compare the results of the two annotators. The human jury checks the data points on which the two annotators disagree, and only samples a small portion of the data points on which they agree.
The rationale behind this design is that, when two models are both strong and independent, they will label most data correctly, and when they do make mistakes, the probability that they make the same mistake (give the same wrong label) on the same data point is fairly low. This effectively reduces human labelling effort and cost, and decreases human labelling errors due to fatigue: the human only needs to check the data points with different labels plus a small sample of the data points with identical labels, which in total should be an amount a human annotator can handle attentively.
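The routing described above can be sketched as a small function. The annotator callables and the sampling rate below are hypothetical placeholders, not taken from the report; any two independent labelling backends could be plugged in.

```python
import random

def first_layer_triangle(data, annotate_m1, annotate_m2,
                         agree_sample_rate=0.1, seed=0):
    """Route each data point either to the human jury or to auto-acceptance.

    annotate_m1 / annotate_m2 are hypothetical callables wrapping the two
    machine annotators (e.g. an LLM call or an OCR pipeline); the sampling
    rate for spot-checking agreements is an assumed parameter.
    """
    rng = random.Random(seed)
    needs_human, auto_accepted = [], []
    for point in data:
        label1, label2 = annotate_m1(point), annotate_m2(point)
        if label1 != label2:
            # Disagreement: always send to the human jury H.
            needs_human.append((point, label1, label2))
        elif rng.random() < agree_sample_rate:
            # Agreement: spot-check only a small random sample.
            needs_human.append((point, label1, label2))
        else:
            auto_accepted.append((point, label1))
    return needs_human, auto_accepted
```

With two strong, independent annotators, `needs_human` should stay a small fraction of the data, which is exactly where the effort saving comes from.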
Since the two machine annotators (M1 and M2) offer their separate, independent labels for the human jury (H) to check, the structure looks like a triangle. For convenience in the later discussion, we call this triangle of two machine annotators and one human jury a system (S).
The annotation produced by a single system is exposed to the subjectivity of its human jury H. To control for the subjectivity of one jury, we introduce a second triangular structure, with two systems and one final reviewer (R).
First we have two systems (S1, S2) providing independent high-quality labels (L1, L2). L1 and L2 are then submitted to the final reviewer (R), who examines the differences between L1 and L2, resolves them manually, and produces the final label, or "golden truth" (G).
In the second-layer triangle, a human still acts as the final reviewer, so human subjectivity is not completely eliminated. However, with this second-layer mechanism, we believe we obtain more precise labels than with the first-layer triangle alone.
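The second-layer merge can be sketched the same way. The `review` callback below is a hypothetical stand-in for the final reviewer R; it is invoked only where the two systems disagree.

```python
def second_layer_triangle(labels_s1, labels_s2, review):
    """Merge labels from systems S1 and S2 into the golden truth G.

    labels_s1 / labels_s2 map data-point ids to labels (assumed to cover
    the same ids); `review` is a hypothetical callback representing the
    final reviewer R, called only on disagreements.
    """
    golden = {}
    for point_id in labels_s1:
        l1, l2 = labels_s1[point_id], labels_s2[point_id]
        # Agreement passes through unchanged; disagreement goes to R.
        golden[point_id] = l1 if l1 == l2 else review(point_id, l1, l2)
    return golden
```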
We apply a filter here (at most 70 fields to be corrected, IAA_Field > 0.7, IAA_Character > 0.6) to make the human correction easier. However, this filtering may leave the evaluation set with more short lists than the originally sampled ones.
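The filter is a simple conjunction of the three thresholds stated above; a minimal sketch, with the argument names chosen here for illustration:

```python
def passes_filter(num_fields_to_correct, iaa_field, iaa_character):
    """Keep a list for human correction only if it is easy enough to fix.

    Thresholds from the report: at most 70 fields to be corrected,
    field-level IAA above 0.7, character-level IAA above 0.6.
    """
    return (num_fields_to_correct <= 70
            and iaa_field > 0.7
            and iaa_character > 0.6)
```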
In the first-layer triangle, we need to choose the machine annotators. We want two strong, independent models within each system, and the models should also differ between systems, to ensure independence between the systems.
For the models, we have two choices:
- Use a strong multimodal LLM (GPT-5, Gemini 2.5 Pro, Claude Sonnet 4.5, Grok 4, DeepSeek OCR, Qwen3-Max). Be careful about which languages the model supports.
- Use traditional or commercial OCR (Google OCR, Microsoft OCR) plus a strong text-only LLM for correction.
In our real-world case, we use these two pairs:
- Claude Sonnet 4.5 vs Qwen 3 VL 235B
- Llama 4 Maverick vs Grok 4 0709
We choose an experienced domain expert as the final reviewer; in our case, the researcher himself.
For more detailed information, see the full report:
- Google Drive folder: https://drive.google.com/drive/folders/1p7XqrWZg4O1lyLGvgXd4mwM9ZVFZBUEy