Sense of Wonder | Project Halo: Towards a Digital Aristotle

David Cerezo's Weblog

Tue 10-02-2004 04:21 PM

Project Halo: Towards a Digital Aristotle

     Project Halo is a multi-staged effort towards a digital Aristotle, an application that will encompass much of the world's scientific knowledge and be capable of answering novel questions and solving advanced problems. Vulcan sees two primary functions for the digital Aristotle: first, as a tutor capable of instructing students in the sciences; and second, as a research assistant with broad interdisciplinary skills to help scientists in their work. Phase I of Project Halo demonstrated that the current state of the art in knowledge representation and reasoning can produce question-answering technology capable of answering novel questions and providing domain-appropriate justifications in AP chemistry. The project demonstrated two drawbacks of the current state of the art:

  • knowledge formulation requires highly specialized and expensive personnel, pushing the cost of robust knowledge formulation to about $10,000 per page
  • most of the system failures were attributed to the fact that these specialized personnel lacked sufficient domain knowledge (in chemistry).
     Phase II will address these two issues head on by developing technology that will enable domain experts to formulate knowledge with ever increasing independence from knowledge engineers and be able to pose questions and problems to these systems. Vulcan believes that success in this objective will reduce the cost of knowledge formulation to levels comparable to textbook development and encourage scientists and educators to build an ever-expanding body of machine-processable knowledge that will form the basis for the digital Aristotle. The 30-month Phase II effort will involve teams with world class skills and technology in five primary areas: knowledge representation and reasoning (KRR), knowledge acquisition (KA), intelligent interfaces including natural language understanding, usability, and system integration. Comprehensive user-driven evaluations will provide guidance to and validation of our research.

     Description taken from an 85-minute video presentation by Noah Friedland on Project Halo, a research project of Paul Allen’s Vulcan Inc. There is also another video presentation, 58 minutes long.

     This project has brought back memories: when I was an 11-year-old kid, I programmed a rule-based system to help me with the chemical identification of compounds and minerals. In the Halo Pilot, three teams encoded 50 pages of a chemistry syllabus to answer an exam of 100 questions, graded by three separate chemistry professors. Surprisingly, the three teams performed remarkably well, and they have generated a vast amount of information, including copies of the systems used to answer the questions.

     The participants are leaders in the field of knowledge representation and reasoning (descriptions from Vulcan’s final report):
  • Cycorp has over 600 person-years invested in the development of the world’s largest knowledge base, featuring over a million entities and relationships and tens of thousands of axioms organized into several thousand microtheories. These constructs are tied together by an “upper ontology.” Cycorp tries to leverage its knowledge base extensively when constructing new knowledge. Its reasoning engine has over 500 computational modules and claims to be fundamentally non-monotonic in its reasoning approach. It employs a truth maintenance system (TMS) to verify that new knowledge does not corrupt existing knowledge. Its formal language, CycL, is an extremely expressive logic-based representation.
  • Ontoprise’s formal language is F-Logic. F-logic (“F” stands for “Frames”) combines the advantages of frame-based languages and the expressiveness, compact syntax, and well-defined semantics of logic programming languages such as Prolog. The original features of F-logic include signatures, object identity, complex objects, methods, classes, inheritance and rules. Their inference engine OntoBroker® provides means for efficient reasoning in F-Logic through a mixture of forward and backward chaining, based on a dynamic filtering algorithm to compute (the smallest possible) subset of the model for answering the query. The semantics for a set of F-Logic statements is then defined by a transformation process of F-Logic into normal logic (Horn logic with negation) and the well-founded semantics for the resulting set of facts and rules and axioms in normal logic. Their engine does not currently provide any framework for knowledge reuse, rather each knowledge base is constructed from scratch and customized for the defined requirements of the given project or problem at hand.
  • SRI’s approach to knowledge formulation relies on a component library consisting of several hundred concepts, which can be combined into more complex knowledge constructs. Their approach relies on the assumption that these fundamental building blocks can be easily extended and specialized for each new domain. SHAKEN, their KRR environment, features the KM formal language, an expressive frame-based representation. The engine supports monotonic reasoning only, with heuristics for handling identity, and does not, to date, employ a TMS. It does feature an automated entity classification capability.
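The descriptions above mention forward and backward chaining over Horn-style rules. As a rough illustration only, here is a minimal propositional sketch of both strategies in Python. It is not OntoBroker's algorithm (there is no dynamic filtering, no negation and no well-founded semantics), and the toy chemistry facts and rule names are hypothetical, made up for this example.

```python
from typing import FrozenSet, List, Set, Tuple

# A Horn rule: if every premise holds, the conclusion holds.
Rule = Tuple[FrozenSet[str], str]

def forward_chain(facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Data-driven: fire rules until no new fact can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in derived and premises <= derived:
                derived.add(conclusion)
                changed = True
    return derived

def backward_chain(goal: str, facts: Set[str], rules: List[Rule],
                   seen: FrozenSet[str] = frozenset()) -> bool:
    """Goal-driven: prove the goal by recursively proving rule premises."""
    if goal in facts:
        return True
    if goal in seen:  # guard against cyclic rules
        return False
    for premises, conclusion in rules:
        if conclusion == goal and all(
            backward_chain(p, facts, rules, seen | {goal}) for p in premises
        ):
            return True
    return False

# Hypothetical toy knowledge base (not taken from the Halo systems).
FACTS = {"strong_acid(HCl)", "strong_base(NaOH)"}
RULES: List[Rule] = [
    (frozenset({"strong_acid(HCl)", "strong_base(NaOH)"}), "neutralizes(HCl,NaOH)"),
    (frozenset({"neutralizes(HCl,NaOH)"}), "produces(NaCl)"),
    (frozenset({"neutralizes(HCl,NaOH)"}), "produces(H2O)"),
]

if __name__ == "__main__":
    print(sorted(forward_chain(FACTS, RULES)))
    print(backward_chain("produces(NaCl)", FACTS, RULES))
```

Forward chaining computes everything that follows from the facts, while backward chaining only explores the rules relevant to one query; an engine that mixes both, as described for OntoBroker, tries to get the benefits of each.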
Some notes:
  • Cycorp had the longest running time because its knowledge base is two orders of magnitude bigger; maybe it also uses some exponential-time algorithm. It also generated answers several pages long.
  • SRI was the winner of the "contest": they used professional chemists to encode the knowledge.
  • Ontoprise has the best language to encode the information.
  • The contest’s aim was to attract media attention and funding, because breaking any one of the following requirements would have rendered the experiment totally unsuccessful, and the real breakthrough would be to drop one of these requirements:
    • Chemistry has a highly formalized body of knowledge, compared to other disciplines like philosophy, psychology and other social sciences.
    • Most questions were multiple-choice, that is, the candidate answers were given: resolution by elimination is the best strategy. There were no real chemistry questions involving a scenario with some parameters and a goal.
    • No use of meta-reasoning capabilities, just instance-based solutions to prove generalized comprehension-oriented questions.
    • There were no really hard questions, like:
      • Enumerative questions, which would need enumerative algebraic geometry.
      • Any NP-related problem. See A compendium of NP optimization problems.
      • Logical inference that would need really big ordered binary decision diagrams.
      • Questions dealing with the quantum mechanical behaviour of the system, that is, calculations exponential in the number of atoms.
      • Questions dealing with uncomputability in physics.
    • Questions were encoded by humans; there was no natural language processing.
    • Experts in knowledge representation and even professional chemists worked on the project.
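To make the point about multiple-choice questions concrete, here is a small sketch of answering by elimination in Python. The question, the options and the "refuter" checks are all hypothetical (they are not taken from the Halo exam): the solver never needs to know that the right option is correct, only that the others are wrong.

```python
from typing import Callable, Dict, List, Optional

# A refuter returns True when it can prove an option is NOT the answer.
Refuter = Callable[[str], bool]

def answer_by_elimination(options: Dict[str, str],
                          refuters: List[Refuter]) -> Optional[str]:
    """Return the label of the only option no refuter can rule out, if any."""
    remaining = [
        label for label, text in options.items()
        if not any(refute(text) for refute in refuters)
    ]
    return remaining[0] if len(remaining) == 1 else None

# Hypothetical question: "Which of these compounds is a strong acid?"
OPTIONS = {"A": "NaOH", "B": "HCl", "C": "NaCl", "D": "CH4"}

# Toy background knowledge: nothing here says what a strong acid is.
KNOWN_BASES = {"NaOH"}
KNOWN_SALTS = {"NaCl"}
KNOWN_HYDROCARBONS = {"CH4"}

REFUTERS: List[Refuter] = [
    lambda text: text in KNOWN_BASES,
    lambda text: text in KNOWN_SALTS,
    lambda text: text in KNOWN_HYDROCARBONS,
]

if __name__ == "__main__":
    print(answer_by_elimination(OPTIONS, REFUTERS))
```

With an open question (a scenario, some parameters and a goal) there is no list of options to prune, so this shortcut is not available and the system has to derive the answer directly.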
     Although the results are somewhat impressive, experts in the field already knew that the systems would score high.

     Finally, let’s suppose some entity wastes a few billion dollars to encode the knowledge of undergraduate physics, chemistry, biology and medicine using a somewhat improved version of SRI’s code with Ontoprise’s modelling language (note that encoding mathematics would be a real waste of time). The biggest advantage here would be in cross-disciplinary answers to questions, because any expert in a particular field will answer field-related questions more precisely. Unfortunately, the experiment didn’t test this scenario, so analogies with Aristotle are out of place.