Open Semantic Search is an open-source research and analytics platform for searching, analyzing, and exploring large document collections with semantic search technologies. It combines an integrated search server with a document processing pipeline that handles crawling, text extraction, and automated content analysis across many different sources. Its ETL framework ingests documents, runs them through a series of analysis steps, and enriches them with extracted information such as named entities and metadata. Optical character recognition extracts text from images and scanned documents, including images embedded in PDF files. Text mining and analytics features let users examine relationships, topics, and structured data within a collection.
Features
- Integrated semantic search server for indexing and querying large document collections
- ETL pipeline for document crawling, processing, and data enrichment
- Optical character recognition for extracting text from images, scanned documents, and images embedded in PDF files
- Named entity recognition for identifying people, organizations, and locations
- Full-text and faceted search with interactive filtering and exploration
- Metadata and semantic enrichment using thesauri, ontologies, and knowledge graphs
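
The ETL and enrichment flow described above can be pictured with a small, generic sketch. It is not Open Semantic Search's actual code or API: the `Document` class, the `THESAURUS` dictionary, and the functions `tag_entities`, `count_words`, `run_pipeline`, and `facet_counts` are all hypothetical names invented for illustration, and simple substring matching stands in for real named entity recognition.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Document:
    """A document moving through the pipeline (hypothetical structure)."""
    doc_id: str
    text: str
    fields: dict = field(default_factory=dict)


# Hypothetical thesaurus: preferred label -> lowercase terms that indicate it.
THESAURUS = {
    "Berlin": ["berlin"],
    "ACME Corp": ["acme corp", "acme corporation"],
}


def tag_entities(doc: Document) -> Document:
    """Enrichment step: attach every thesaurus label whose terms occur in the text."""
    lowered = doc.text.lower()
    doc.fields["entities"] = [
        label
        for label, terms in THESAURUS.items()
        if any(term in lowered for term in terms)
    ]
    return doc


def count_words(doc: Document) -> Document:
    """Enrichment step: store a whitespace-based word count as metadata."""
    doc.fields["word_count"] = len(doc.text.split())
    return doc


def run_pipeline(docs, steps):
    """Pass each document through every step in order and yield the result."""
    for doc in docs:
        for step in steps:
            doc = step(doc)
        yield doc


def facet_counts(docs, field_name: str) -> Counter:
    """Count how many documents carry each value of a multi-valued field."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.fields.get(field_name, []))
    return counts


if __name__ == "__main__":
    corpus = [
        Document("1", "ACME Corp opened an office in Berlin."),
        Document("2", "The Berlin archive was scanned."),
    ]
    enriched = list(run_pipeline(corpus, [tag_entities, count_words]))
    print(facet_counts(enriched, "entities"))
```

Each step takes a document and returns it with extra fields, so steps can be added or reordered freely. The facet counts are what a faceted search interface would show next to each filter value.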