Process unstructured data into JSON
- https://jxnl.github.io/instructor/ - a thin wrapper over an LLM client, plus a few additional parsing and extraction capabilities
- https://github.com/1rgs/jsonformer - applies constrained token generation over the logits, so it is most useful for Hugging Face-powered models, where the logits are accessible
- https://github.com/outlines-dev/outlines - claims more features (for example, regex- and JSON-schema-constrained generation)
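The "thin wrapper plus parsing" pattern behind these tools can be sketched with the standard library alone. This is not the API of instructor, jsonformer, or outlines. The `Invoice` schema, the `extract` helper, and the `fake_llm_factory` stub are all hypothetical names made up for illustration, and a real setup would replace the stub with an actual model call.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Invoice:
    """Hypothetical target schema for the extracted JSON."""
    vendor: str
    total: float
    currency: str


# Expected JSON keys and the Python types accepted for each (illustrative only).
SCHEMA = {"vendor": str, "total": (int, float), "currency": str}


def parse_invoice(raw: str) -> Invoice:
    """Parse model output and validate it against SCHEMA."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, typ in SCHEMA.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        value = data[name]
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, typ):
            raise ValueError(f"wrong type for field: {name}")
    return Invoice(data["vendor"], float(data["total"]), data["currency"])


def extract(text: str, llm: Callable[[str], str], max_retries: int = 2) -> Invoice:
    """Ask the model for JSON, validate it, and re-prompt with the error on failure."""
    prompt = f"Return a JSON object with keys vendor, total, currency.\n\nText:\n{text}"
    for _ in range(max_retries + 1):
        raw = llm(prompt)
        try:
            return parse_invoice(raw)
        except ValueError as err:
            prompt += f"\n\nYour previous reply was invalid ({err}). Reply with JSON only."
    raise ValueError("no valid JSON after retries")


def fake_llm_factory() -> Callable[[str], str]:
    """Stand-in for a real model: first reply is malformed, second is valid."""
    replies = iter([
        'Sure! {"vendor": "Acme"',
        '{"vendor": "Acme", "total": 41.5, "currency": "EUR"}',
    ])
    return lambda prompt: next(replies)


if __name__ == "__main__":
    print(extract("Invoice from Acme, total 41.50 EUR", fake_llm_factory()))
```

Validate-and-retry works with any text-in, text-out model, including hosted APIs. Logit-level constraints, as in jsonformer, need access to the model's token probabilities but avoid malformed output in the first place.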
- Benchmarks
- SciTSR (Scientific Table Structure Recognition Dataset)
- Dataset Link
- Contains 15,000 tables from scientific papers in PDF format
- Focuses on complex table structures with merged cells and hierarchical headers
- Includes both images and ground truth annotations in JSON format
- TableBank
- Dataset Paper
- Dataset Link
- 417,234 tables from Word and LaTeX documents
- Labeled with both table detection and structure recognition annotations
- Automatic labeling process using digital document formats
- PubTabNet
- Dataset Link
- Paper
- 568,000 tables extracted from PubMed Central Open Access Subset
- HTML representation of table structures
- Includes both simple and complex table layouts
- ICDAR Competitions Datasets
- DocLayNet
- Focus on both modern and historical documents
- Includes table detection, structure recognition, and cell content extraction tasks
- ViDoRe Benchmark
- Dataset Link
- Comprehensive evaluation of document retrieval systems
- 100,000+ documents with rich visual elements
- Includes complex tables, figures, and layout structures
- TabStruct-Net
- Dataset Link
- 6,000+ tables from scientific papers
- Specialized in handling complex hierarchical table structures
- Includes cell relationship annotations
- FinTabNet
- Dataset Link
- 111,690 tables from financial documents (10-K reports)
- Annotated with table structure and cell semantic types
- Includes complex financial tables with nested headers and spanning cells
- Specifically designed for financial document understanding
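Several of the benchmarks above involve tables with merged or spanning cells, and PubTabNet represents table structure as HTML. As a rough illustration of turning such a table into JSON, the sketch below walks an HTML table with Python's standard `html.parser` and emits one record per cell with its grid position and spans. The output layout (`row`, `col`, `rowspan`, `colspan`, `header`, `text`) is an assumption made for this example, not the annotation format of any dataset listed here, and nested tables are not handled.

```python
import json
from html.parser import HTMLParser


class TableToCells(HTMLParser):
    """Collect table cells with grid coordinates, honoring rowspan/colspan."""

    def __init__(self):
        super().__init__()
        self.cells = []          # one dict per <td>/<th>
        self._occupied = set()   # (row, col) slots already covered by a cell
        self._row = -1
        self._col = 0
        self._current = None     # cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row += 1
            self._col = 0
        elif tag in ("td", "th"):
            a = dict(attrs)
            rowspan = int(a.get("rowspan") or 1)
            colspan = int(a.get("colspan") or 1)
            # Skip grid slots taken by cells spanning down from earlier rows.
            while (self._row, self._col) in self._occupied:
                self._col += 1
            for r in range(self._row, self._row + rowspan):
                for c in range(self._col, self._col + colspan):
                    self._occupied.add((r, c))
            self._current = {
                "row": self._row, "col": self._col,
                "rowspan": rowspan, "colspan": colspan,
                "header": tag == "th", "text": "",
            }
            self._col += colspan

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._current is not None:
            # Collapse internal whitespace before storing the cell.
            self._current["text"] = " ".join(self._current["text"].split())
            self.cells.append(self._current)
            self._current = None


def html_table_to_json(html: str) -> str:
    """Return the table's cells as a JSON array string."""
    parser = TableToCells()
    parser.feed(html)
    parser.close()
    return json.dumps(parser.cells, indent=2)


EXAMPLE = """
<table>
<tr><th rowspan="2">Year</th><th colspan="2">Revenue</th></tr>
<tr><th>Q1</th><th>Q2</th></tr>
<tr><td>2023</td><td>10</td><td>12</td></tr>
</table>
"""

if __name__ == "__main__":
    print(html_table_to_json(EXAMPLE))
```

Keeping explicit coordinates and spans, rather than flattening to rows of strings, preserves the nested-header structure that these datasets are designed to test.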
Companies
Other