Optical Character Recognition
- Transformers vs. OCR: an in-depth comparison for Information Extraction (2023)
- Anh Tuan: Tutorial: OCR with PaddleOCR (PP-OCR) (2022)
- Neuralearn: Extract Tables from PDF and convert to Excel sheet with Paddle OCR text detection and recognition (2023)
- G. Kim et al.: OCR-free Document Understanding Transformer (2022)
- Donut: GitHub, Hugging Face
- PaddleOCR
- Hacker News: Our search for the best OCR tool (2019) (2022)
- OCR in 2024: Benchmarking Text Extraction/Capture Accuracy
Process unstructured data into JSON
- https://jxnl.github.io/instructor/ - a thin wrapper over LLM APIs, plus a few additional parsing and extraction capabilities
- https://github.com/1rgs/jsonformer - applies constrained token generation over the logits, so it is more useful for Hugging Face-powered models, where the logits are accessible
- https://github.com/outlines-dev/outlines - claims more features than jsonformer
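
To make "constrained token generation over the logits" concrete, here is a minimal toy sketch in plain Python. It does not use jsonformer or outlines and is not how either library is implemented. The vocabulary, the `fake_logits` stand-in model, and the `ALLOWED_PER_STEP` template are all hypothetical, made up for illustration. The idea shown: at each step, tokens that would break the target JSON shape get their logit set to negative infinity, so the model can only pick a valid token.

```python
import math

# Toy vocabulary (hypothetical); a real model has tens of thousands of tokens.
VOCAB = ["{", "}", '"name"', ":", '"Ada"', "hello", "42", ","]


def fake_logits(step):
    """Stand-in for a language model: one score per vocabulary token.

    The scores deliberately favor "hello", which is never valid JSON here,
    to show that masking overrides the model's unconstrained preference.
    """
    scores = [0.1 * ((i + step) % len(VOCAB)) for i in range(len(VOCAB))]
    scores[VOCAB.index("hello")] = 5.0
    return scores


# Hypothetical fixed template: the set of tokens allowed at each step.
ALLOWED_PER_STEP = [
    {"{"},
    {'"name"'},
    {":"},
    {'"Ada"', "42"},  # the only step where the model has a real choice
    {"}"},
]


def constrained_decode():
    """Greedy decoding with disallowed tokens masked to -inf."""
    out = []
    for step, allowed in enumerate(ALLOWED_PER_STEP):
        logits = fake_logits(step)
        masked = [
            score if tok in allowed else -math.inf
            for tok, score in zip(VOCAB, logits)
        ]
        best = max(range(len(VOCAB)), key=masked.__getitem__)
        out.append(VOCAB[best])
    return "".join(out)


result = constrained_decode()
print(result)
```

This only works when the raw logits are available at every step, which is why the approach fits locally run models better than hosted APIs that return finished text.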