Welcome
Minion is a production-quality search engine
written in Java, created by Sun Labs.
In addition to standard document retrieval
operations, it supports relational querying in conjunction with
boolean and proximity querying. It also provides document similarity
measures (e.g., "More like this"), result and document
clustering, and automatic document classification. The
engine is designed to be highly configurable and pluggable, and is intended for use
in research as well as production environments. We're still putting
documentation together, but here are some links to help you get started.
- Download Minion 1.0
- Getting started with Minion
- Minion JavaDoc
- Minion technical talk given at JavaOne 2008 (Also see the simple mail indexer referenced in the JavaOne talk)
- Documentation on basic and advanced Minion configuration
Summary of Features
Indexing & Retrieval
- Universal Tokenizer switches seamlessly between alphabetic words and character-based (CJK) words within a document
- True saved field types for integers, floats, dates, booleans, and strings
- Documents are automatically available for querying as soon as their index data is written to disk
- Range queries for saved fields
- Morphological expansion of query terms is available by default, or you can plug in your own custom knowledge source
- Default passage retrieval operator finds the passages that best match your query using a patented relaxation ranking retrieval algorithm
- Multiple query languages available, or use the programmatic query API
- Passage highlighting for results
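
As a rough illustration of the mixed-script tokenization mentioned in the list above, the following sketch emits runs of letters and digits as word tokens and emits each CJK character as its own token. It is not Minion's Universal Tokenizer: the class and method names are hypothetical, and treating every CJK character as a single token is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical names, not part of the Minion API.
class MixedScriptTokenizerSketch {

    // Assumption: Han, Hiragana, Katakana, and Hangul count as "character" scripts.
    static boolean isCjk(int codePoint) {
        Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
        return script == Character.UnicodeScript.HAN
            || script == Character.UnicodeScript.HIRAGANA
            || script == Character.UnicodeScript.KATAKANA
            || script == Character.UnicodeScript.HANGUL;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isCjk(cp)) {
                // A CJK character ends any pending alphabetic word and stands alone.
                flush(word, tokens);
                tokens.add(new String(Character.toChars(cp)));
            } else if (Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(cp);
            } else {
                // Whitespace and punctuation act as separators.
                flush(word, tokens);
            }
        }
        flush(word, tokens);
        return tokens;
    }

    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Minion indexes 東京 and Java"));
    }
}
```

A production tokenizer would typically also handle case variations, stop words, and alternative CJK strategies such as bigrams; those are left out to keep the sketch short.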
Classification
- Easily define classes from result sets based on full document text, or text of a specific field
- Documents are automatically classified in bulk
- Options for term clustering, cross-validation, feature selection, and feature set size optimization
- Manually assign documents to classes for training or to override determined values
Clustering
- Group documents in a result set into clusters on the fly
- Describe clusters using top terms
- Several clustering algorithms available
Configuration
- Highly configurable using a runtime configuration system
- Configure the indexing pipeline to control tokenization, case variations, stop words, stemming, postings types (doc ID only, ID & frequency, ID & frequency & position), or other custom stages
- Replace virtually any component of the indexer from the tokenizer to the dictionaries to the classifier without recompiling
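
To make the three postings types named above more concrete, here is a minimal in-memory sketch of what each level records per term and document. It does not use Minion's configuration system or its postings classes: `Level`, `Posting`, and `index` are hypothetical names, and whitespace splitting with lowercasing stands in for a real indexing pipeline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: hypothetical names, not Minion classes.
class PostingsLevelsSketch {

    enum Level { ID_ONLY, ID_FREQ, ID_FREQ_POS }

    static final class Posting {
        final int docId;
        int freq;                                           // stays 0 for ID_ONLY
        final List<Integer> positions = new ArrayList<>();  // filled only for ID_FREQ_POS

        Posting(int docId) { this.docId = docId; }

        @Override public String toString() {
            return "doc" + docId + " freq=" + freq + " pos=" + positions;
        }
    }

    // Builds term -> postings list, storing only what the chosen level asks for.
    static Map<String, List<Posting>> index(List<String> docs, Level level) {
        Map<String, List<Posting>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] terms = docs.get(docId).trim().toLowerCase().split("\\s+");
            for (int pos = 0; pos < terms.length; pos++) {
                if (terms[pos].isEmpty()) continue;
                List<Posting> list = postings.computeIfAbsent(terms[pos], k -> new ArrayList<>());
                Posting last = list.isEmpty() ? null : list.get(list.size() - 1);
                if (last == null || last.docId != docId) {
                    last = new Posting(docId);   // first occurrence of the term in this document
                    list.add(last);
                }
                if (level != Level.ID_ONLY) last.freq++;
                if (level == Level.ID_FREQ_POS) last.positions.add(pos);
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("to be or not to be", "to index is to know");
        for (Level level : Level.values()) {
            System.out.println(level + " -> " + index(docs, level).get("to"));
        }
    }
}
```

The trade-off is the usual one: positions are needed for proximity and passage operators, while dropping them stores less data per term occurrence.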
Performance
- In the default case-sensitive configuration, over 26 GB/hour when indexing HTML (the index is about 45% of the size of the data)
- Using a case-insensitive configuration that doesn't save position data, over 60 GB/hour when indexing HTML (the index is about 15% of the size of the data)
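
For planning purposes, the figures above can be turned into a back-of-the-envelope estimate. The sketch below is plain arithmetic, not a benchmark: the 100 GB corpus is a made-up example, the class and method names are hypothetical, and because the throughput figures are stated as "over" a value, the computed times are rough upper bounds under the assumption that your data and hardware resemble those behind the quoted numbers.

```java
// Illustrative only: simple arithmetic on the throughput and index-size figures quoted above.
class IndexingEstimateSketch {

    // Hours needed to index a corpus at a given throughput.
    static double hours(double corpusGb, double gbPerHour) {
        return corpusGb / gbPerHour;
    }

    // Expected index size given the index-to-data ratio.
    static double indexGb(double corpusGb, double indexRatio) {
        return corpusGb * indexRatio;
    }

    public static void main(String[] args) {
        double corpusGb = 100.0; // hypothetical corpus size
        System.out.println(String.format(
            "default config:      at most ~%.1f h, index ~%.0f GB",
            hours(corpusGb, 26.0), indexGb(corpusGb, 0.45)));
        System.out.println(String.format(
            "no-positions config: at most ~%.1f h, index ~%.0f GB",
            hours(corpusGb, 60.0), indexGb(corpusGb, 0.15)));
    }
}
```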