Measuring OCR performance
Extracting data from images and scanned documents is a crucial part of what we do. We've deployed 3 OCR engines on our platform: the first 2 are deployed locally (Tesseract and Microsoft Azure OCR) and the third is accessed via an API (Google Vision). To illustrate how they perform, we measured them on 2 basic use-cases.

Use-case 1: a raw file mimicking a hastily scribbled driver's log:
  • Microsoft Azure OCR:
    Numbers: managed everything except for the number "ZH42651A", which it read as "ZH42659A"
    Dates: managed all dates except for 10012023
    Comments: managed all 3 handwritten comments
  • Google Vision API:
    Numbers: managed everything except for the number "ZH42651A", where it returned only the digits "42651"
    Dates: managed all dates except for the first 09072023
    Comments: managed all 3 handwritten comments
  • Tesseract (open source): all over the place, almost nothing correct
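To put a number on misreads like the ones above, one common option is the character error rate: the edit distance between the expected text and the OCR output, divided by the length of the expected text. The snippet below is a minimal sketch of that idea, not the scoring we used for this post. The function names are our own illustrative choices, and the only data in it is the "ZH42651A" example from the list above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(expected: str, recognized: str) -> float:
    """Edit distance normalized by the length of the expected text."""
    if not expected:
        raise ValueError("expected text must not be empty")
    return levenshtein(expected, recognized) / len(expected)


expected = "ZH42651A"
azure_output = "ZH42659A"   # one wrong digit
google_output = "42651"     # letters dropped

print(char_error_rate(expected, azure_output))   # 0.125
print(char_error_rate(expected, google_output))  # 0.375
```

A single score like this hides what kind of error occurred (a swapped digit versus dropped letters), so it works best alongside a field-by-field summary like the one above.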

Use-case 2: a screenshot of a dummy Excel file:
  • Microsoft Azure OCR: everything correct (numbers, headers, row numbers) with 1 exception: it returned 0:56 instead of 0.56
  • Google Vision API: everything read correctly without exception, but it misaligned the 3 bottom rows (the capital city and its data no longer lined up)
  • Tesseract (open source): read all the numbers typed into the cells correctly, but got 3 row numbers (the ones before the country name) wrong: 7, 31, 32. It also couldn't handle the color-inverted header row, where it returned only gibberish
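For tabular input, a position-by-position comparison catches both kinds of problem seen above: a misread value (0:56 instead of 0.56) and correct values that land in the wrong row or column. The sketch below assumes the OCR output has already been parsed into a grid of cell strings, which is a separate step not shown here. The function name and the sample rows are hypothetical; only the 0.56 / 0:56 pair comes from our test.

```python
from typing import List, Tuple

Grid = List[List[str]]


def compare_grids(expected: Grid, recognized: Grid) -> Tuple[List[Tuple[int, int, str, str]], int]:
    """Compare two grids cell by cell.

    Returns (mismatches, total), where each mismatch is
    (row, column, expected_value, recognized_value) and total is the
    number of cell positions compared. Missing cells count as "".
    """
    mismatches = []
    total = 0
    for r in range(max(len(expected), len(recognized))):
        exp_row = expected[r] if r < len(expected) else []
        rec_row = recognized[r] if r < len(recognized) else []
        for c in range(max(len(exp_row), len(rec_row))):
            exp = exp_row[c] if c < len(exp_row) else ""
            rec = rec_row[c] if c < len(rec_row) else ""
            total += 1
            if exp.strip() != rec.strip():
                mismatches.append((r, c, exp, rec))
    return mismatches, total


# Hypothetical sample data; only "0.56" vs "0:56" is taken from the post.
expected = [["Country", "Capital", "Share"],
            ["Exampleland", "Sample City", "0.56"]]
recognized = [["Country", "Capital", "Share"],
              ["Exampleland", "Sample City", "0:56"]]

mismatches, total = compare_grids(expected, recognized)
print(mismatches)                          # [(1, 2, '0.56', '0:56')]
print(f"{total - len(mismatches)}/{total} cells correct")
```

Because the comparison is positional, a shifted row shows up as several mismatched cells even if every individual value was read correctly.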

Your takeaway: if your use-case doesn't involve a lot of handwriting, then open-source solutions such as Tesseract will take you a long way. However, if you do have handwritten notes, then Azure and Google both offer similarly impressive performance and are well ahead of Tesseract. All 3 were similar in terms of speed.

Conclusion: we find these results impressive. OCR has come a long way in a short time, and even though we didn't go easy on the engines in our 2 use-cases, Google and Azure in particular did an incredible job of digitizing the data. We've deployed all 3 OCR engines and continue to experiment (different languages, different colors, different resolutions, etc.) in order to leverage their strengths. For customers with strong data security requirements, we tend to gravitate to Azure and Tesseract, as these are deployed locally and we haven't found an easy way to deploy Google Vision locally (yet).