OBIE Research at Aimlab


OBCIE (Ontology-Based Components for Information Extraction) Approach

This page presents the datasets, executables and source code related to our work on the development of a comprehensive component-based approach for information extraction named OBCIE (Ontology-Based Components for Information Extraction). A paper based on this work has been accepted for publication by the CIKM 2010 conference. The full paper is available here.


Information Extraction (IE) has existed as a field for several decades and has produced some impressive systems in the recent past. Despite its success, widespread usage and commercialization remain elusive goals for this field. We identify the lack of effective mechanisms for reuse as one major reason behind this situation. Here, we mean not only the reuse of the same IE technique in different situations but also the reuse of information related to the application of IE techniques (e.g., features used for classification).

To address this situation, we have developed a comprehensive component-based approach for information extraction that promotes reuse. We designed this approach starting from our previous work on the use of multiple ontologies in information extraction. The key ideas of our approach are "information extractors," which are components of an IE system that make extractions with respect to particular components of an ontology, and "platforms for IE," which are domain- and corpus-independent implementations of IE techniques. A case study has shown that this component-based approach can be successfully applied in practical situations.
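The relationship between the two ideas can be illustrated with a minimal sketch. It is not taken from the OBCIE source code: every name in it (Platform, extraction_rules_platform, InformationExtractor, the "Attack.location" component and the sample rule) is hypothetical, and the regular-expression platform is only a toy stand-in for a real IE technique. The assumption is that a platform is a reusable routine that knows nothing about the domain, while an information extractor pairs a platform with the metadata needed to target one ontology component.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical names throughout; this is an illustration, not the OBCIE code.

# Assumption: a platform is a domain- and corpus-independent routine that takes
# extractor-specific metadata plus a text and returns the extracted values.
Platform = Callable[[List[str], str], List[str]]


def extraction_rules_platform(rules: List[str], text: str) -> List[str]:
    """Toy platform: apply each regex rule and collect its first capture group."""
    values: List[str] = []
    for rule in rules:
        values.extend(match.group(1) for match in re.finditer(rule, text))
    return values


@dataclass
class InformationExtractor:
    """Binds a platform to one ontology component through its metadata."""

    ontology_component: str  # class or property the extractor targets
    platform: Platform       # reusable implementation of an IE technique
    metadata: List[str]      # here: extraction rules for this component

    def extract(self, text: str) -> List[str]:
        return self.platform(self.metadata, text)


extractor = InformationExtractor(
    ontology_component="Attack.location",        # made-up component name
    platform=extraction_rules_platform,
    metadata=[r"attack in ([A-Z][a-z]+)"],       # made-up rule
)
print(extractor.extract("Officials reported an attack in Lima on Monday."))
```

In this sketch, reusing the platform for another ontology only requires new metadata, and reusing an information extractor only requires carrying its metadata over to a system that has the same platform.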

Platforms for Information Extraction:

  1. Extraction Rules
  2. Two-Phase Classification

Standard Schemata:

  1. XML Schema for Platforms
  2. XML Schema for Metadata of Information Extractors

Complete Z Specification:

Due to lack of space, our CIKM 2010 paper contains only parts of the Z specifications for generic OBIE systems and for systems operating under our component-based approach. The complete specification is available here.


Ontologies:

  1. OBIE Ontology (which defines the "hasInformationExtractor" annotation property)
  2. MUC4 Ontology
  3. MindSwap Terrorism Ontology
  4. Wikipedia Terrorism Ontology

Information Extractors:

Note that technically these are the metadata components of the information extractors.

  1. Two-Phase Classification Platform
    1. MUC4 Ontology
    2. Wikipedia Ontology
  2. Extraction Rules Platform
    1. MindSwap Ontology
    2. Wikipedia Ontology

Text Corpora:

  1. MUC4 Corpus
    1. Training set (documents, key files)
    2. Test set (documents, key files)
  2. Wikipedia Corpus
    1. Training set (documents, key files)
    2. Test set (documents, key files)

Lines of a key file should have the format "<Class>:<Property>:<Value>". An example file is provided here.
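
As an illustration of the key file format, the following sketch parses lines of the form "<Class>:<Property>:<Value>". The function and type names are hypothetical and are not part of the released code. It assumes that class and property names contain no colon, so only the first two colons are treated as separators and a value may itself contain colons; it also assumes that blank lines can be skipped.

```python
from typing import List, NamedTuple


class KeyEntry(NamedTuple):
    """One line of a key file (hypothetical type, for illustration only)."""

    cls: str
    prop: str
    value: str


def parse_key_line(line: str) -> KeyEntry:
    # Split on the first two colons only, so the value may contain ':'.
    parts = line.strip().split(":", 2)
    if len(parts) != 3:
        raise ValueError(f"expected <Class>:<Property>:<Value>, got {line!r}")
    return KeyEntry(*(part.strip() for part in parts))


def parse_key_file(text: str) -> List[KeyEntry]:
    # Assumption: blank lines carry no information and are ignored.
    return [parse_key_line(line) for line in text.splitlines() if line.strip()]


# Made-up sample content; the class and property names are not from the ontologies.
sample = "Attack:location:Lima\n\nAttack:time:10:30\n"
for entry in parse_key_file(sample):
    print(entry.cls, entry.prop, entry.value)
```

A line with fewer than two colons raises a ValueError, which makes malformed key files easy to spot before evaluation.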


If you need any other details regarding this work please contact Dejing Dou (dou AT