Class Ariel::Learner
In: lib/ariel/learner.rb
Parent: Object

Implements a fairly standard separate and conquer rule learning system. Using a list of labeled examples, candidate rules are generated. A rule is refined until it covers as many of the labeled examples as possible. This rule is recorded, the covered examples are removed, and the process repeats on the remaining examples. Once all examples are covered, the disjunction of all generated rules is returned.
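
The covering loop described above can be sketched as follows. This is only an illustration under simplifying assumptions: an example is a plain String, a "rule" is a one-character prefix, and the method names learn_one_rule, covers? and separate_and_conquer are hypothetical rather than part of Ariel's API.

```ruby
# Hypothetical stand-in for rule learning: the "rule" is simply the first
# character of the shortest remaining example (assumes non-empty strings).
def learn_one_rule(examples)
  seed = examples.min_by(&:length)
  seed[0]
end

# A rule covers an example when the example starts with it.
def covers?(rule, example)
  example.start_with?(rule)
end

# Separate and conquer: learn a rule, drop the examples it covers, repeat.
def separate_and_conquer(examples)
  remaining = examples.dup
  rules = []
  until remaining.empty?
    rule = learn_one_rule(remaining)
    rules << rule
    remaining = remaining.reject { |example| covers?(rule, example) }
  end
  rules
end

rules = separate_and_conquer(%w[apple avocado banana blueberry cherry])
puts rules.inspect
```

The loop always terminates here because each learned rule covers at least its own seed example, so the remaining set shrinks on every pass.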

Methods

Attributes

candidates  [RW] 
current_rule  [RW] 
current_seed  [RW] 
direction  [RW] 

Public Class methods

Takes a list of TokenStreams containing labels.

Public Instance methods

Implements topology refinements - new landmarks are added to the current rule.

  • Takes a landmark and its index in the current rule.
  • Applies the rule consisting of all landmarks up to and including the current landmark to find where it matches.
  • Only tokens between the label_index and the position at which the partial rule matches are considered.
  • Tokens before the rule match location will have no effect, as adding new landmarks before or after the current landmark will not make the rule match any earlier.
  • For every token in this slice of the TokenStream, a new potential rule is created by adding a new landmark consisting of that token. This is also done for each of that token’s matching wildcards.
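
A minimal sketch of the candidate generation step listed above, assuming a rule is an Array of landmarks and a landmark is an Array of matchers (String text or a wildcard Symbol). The wildcard table and the method names are hypothetical, and the new landmark is always inserted directly after the current one, which is a simplification.

```ruby
# Hypothetical wildcard table; Ariel's real wildcard set may differ.
def wildcard_patterns
  { numeric: /\A\d+\z/, html_tag: /\A<[^>]+>\z/ }
end

# Names of all wildcards that match the given token text.
def matching_wildcards(token)
  wildcard_patterns.select { |_name, pattern| token.match?(pattern) }.keys
end

# For every token in the slice, build a new rule with an extra single-token
# landmark after position `index`, once for the token text and once for each
# wildcard that matches it.
def topology_refinements(rule, index, slice)
  slice.flat_map do |token|
    ([token] + matching_wildcards(token)).map do |matcher|
      rule[0..index] + [[matcher]] + rule[(index + 1)..-1]
    end
  end
end

refined = topology_refinements([["<b>"]], 0, ["Price", "42"])
refined.each { |rule| puts rule.inspect }
```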

When learning list iteration rules, some examples may be unsuitable. For instance, if there is a list item at the start of an example with no tokens before it, a skip_to(nil) start rule would be generated, which wouldn’t make sense for exhaustive rules. Such an example should instead be caught by the corresponding end rule. This should only be run after the TokenStreams have been prepared (reversed based on whether a :forward or :back rule is being searched for). Only returns a valid conclusion if the examples are intended to be used for exhaustive rule learning.

Equivalent of LearnDisjunct in STALKER algorithm. Generates initial candidate rules, refines, and then returns a single rule.

Using the seed example passed to it, generates a list of initial rule candidates for further refinement and evaluation. The Token prior to the labeled token is considered, and separate rules are generated that skip_to that token’s text or any of its matching wildcards.
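
As a rough illustration of this step, the sketch below builds one single-landmark rule for the text of the token before the label and one for each wildcard matching it. The token representation (plain Strings), the wildcard table and the method names are assumptions made for the example, not Ariel's actual interface.

```ruby
# Hypothetical wildcard table; Ariel's real wildcard set may differ.
def seed_wildcard_patterns
  { numeric: /\A\d+\z/, html_tag: /\A<[^>]+>\z/ }
end

# tokens:      Array of token texts for the seed example
# label_index: position of the labeled token within tokens
# Returns an Array of rules; each rule holds a single one-matcher landmark.
def initial_candidates(tokens, label_index)
  return [] if label_index.zero? # no prior token to build a landmark from

  prior = tokens[label_index - 1]
  wildcards = seed_wildcard_patterns.select { |_n, pattern| prior.match?(pattern) }.keys
  ([prior] + wildcards).map { |matcher| [[matcher]] }
end

puts initial_candidates(["Price", ":", "<b>", "42"], 3).inspect
```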

Given a list of candidate rules, uses heuristics to determine a rule considered to be the best refiner. Prefers candidate rules that have:

  • Larger coverage = early + correct matches.
  • If equal, prefer more early matches - these can be made into fails or perfect matches. Intuitively, if there are more early matches the rule is finding features common to all documents.
  • If there is a tie, more failed matches wins - we want matches to fail rather than match incorrectly
  • Fewer wildcards - more specific, less likely to match by chance.
  • Shorter unconsumed prefixes - closer to matching correctly
  • Fewer tokens in SkipUntil() - the rationale is unclear; perhaps because skip_until relies on slot content rather than document structure.
  • longer end landmarks - prefer "local context" landmarks.
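
One way to express this kind of tie-breaking chain in Ruby is to compare Arrays of scores, since Arrays are compared element by element. The sketch below assumes each candidate has already been summarised as a Hash of counts; the keys and method names are hypothetical.

```ruby
# Build a comparison key in priority order. Values where smaller is better
# are negated so that a single max_by call can be used.
def refiner_key(candidate)
  [
    candidate[:early] + candidate[:correct], # coverage
    candidate[:early],
    candidate[:failed],
    -candidate[:wildcards],
    -candidate[:unconsumed_prefix],
    -candidate[:skip_until_tokens],
    candidate[:end_landmark_length]
  ]
end

def best_refiner(candidates)
  candidates.max_by { |candidate| refiner_key(candidate) }
end

base = { failed: 1, wildcards: 0, unconsumed_prefix: 0,
         skip_until_tokens: 0, end_landmark_length: 1 }
a = base.merge(name: :a, early: 2, correct: 3)
b = base.merge(name: :b, early: 3, correct: 2)
puts best_refiner([a, b])[:name]
```

Both candidates have the same coverage in this usage, so the second criterion (more early matches) decides between them.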

Given a list of candidate rules, use heuristics to determine the best solution. Prefers:

  • More correct matches
  • More failed matches if a tie - failed matches preferable to incorrect matches.
  • Fewer tokens in SkipUntil()
  • fewer wildcards
  • longer end landmarks
  • shorter unconsumed prefixes
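
The solution ordering differs from the refiner ordering mainly in that coverage and early matches no longer count. A short sketch, again assuming hypothetical Hash summaries of each candidate:

```ruby
# Priority order for picking a final solution; smaller-is-better values
# are negated so the largest key wins.
def solution_key(candidate)
  [
    candidate[:correct],
    candidate[:failed],
    -candidate[:skip_until_tokens],
    -candidate[:wildcards],
    candidate[:end_landmark_length],
    -candidate[:unconsumed_prefix]
  ]
end

def best_solution(candidates)
  candidates.max_by { |candidate| solution_key(candidate) }
end

x = { name: :x, correct: 4, failed: 1, skip_until_tokens: 2,
      wildcards: 0, end_landmark_length: 1, unconsumed_prefix: 0 }
y = x.merge(name: :y, skip_until_tokens: 1)
puts best_solution([x, y])[:name]
```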

Initiates and operates the whole rule induction process. Finds an example to use as its seed example, then finds a rule that matches the maximum number of examples correctly and fails on all others. All matched examples are then removed and the process is repeated considering all examples that remain. Returns an array of the rules found (in order). learn_rule will take care of reversing the given examples if necessary.

Implements landmark refinements. Landmarks are lengthened to make them more specific.

  • Takes a landmark and its index in the current rule.
  • Applies the rule consisting of all previous landmarks in the current rule, so the landmark can be considered in the context of the point from which it shall be applied.
  • Every point at which the landmark matches after the cur_loc is considered.
  • Two extended landmarks are generated - a landmark that includes the token before the match, and a landmark that includes the token after the match.
  • Rules are generated incorporating these extended landmarks, including alternative landmark extensions that use relevant wildcards.
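
The extension step above can be sketched as follows, assuming tokens and landmarks are Arrays of Strings. Wildcard alternatives are left out for brevity, and the method names are hypothetical.

```ruby
# Every index at or after `from` where the landmark matches the token stream.
def match_positions(tokens, landmark, from)
  (from..(tokens.size - landmark.size)).select do |i|
    tokens[i, landmark.size] == landmark
  end
end

# For each match, lengthen the landmark by one token on either side.
def landmark_refinements(tokens, landmark, cur_loc)
  extended = match_positions(tokens, landmark, cur_loc).flat_map do |i|
    candidates = []
    candidates << ([tokens[i - 1]] + landmark) if i > 0
    following = tokens[i + landmark.size]
    candidates << (landmark + [following]) if following
    candidates
  end
  extended.uniq
end

tokens = ["Name", ":", "<b>", "Price", ":", "<b>", "42"]
landmark_refinements(tokens, ["<b>"], 0).each { |l| puts l.inspect }
```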

A given rule is perfect if it successfully matches the label on at least one example and fails all others.
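
Expressed as code, the check might look like the sketch below. It assumes the outcome of applying a rule to each example has already been classified as one of the Symbols :correct, :early, :late or :fail; that classification and the method name are assumptions for illustration.

```ruby
# A rule is perfect when at least one example is matched correctly and every
# other example fails outright (no early or late matches).
def perfect_rule?(results)
  results.include?(:correct) &&
    results.all? { |result| result == :correct || result == :fail }
end

puts perfect_rule?([:correct, :fail, :fail])
puts perfect_rule?([:correct, :early])
```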

Oversees both landmark refinements (e.g. changing skip_to("<b>") into skip_to("Price","<b>")) and topology refinements (e.g. changing skip_to(:html_tag) into a chain of skip_to() commands). Takes the current rule being generated and the example against which it is being created (the current seed_rule) as arguments.

The seed example is chosen from the array of remaining examples. The LabeledStream with the fewest tokens before the labeled token is chosen.
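
A minimal sketch of this selection, assuming each example is represented as a Hash with a :label_index key giving the position of the labeled token (so the index equals the number of tokens before it). The representation and method name are hypothetical.

```ruby
# Pick the example whose labeled token has the fewest tokens in front of it.
def choose_seed(examples)
  examples.min_by { |example| example[:label_index] }
end

examples = [
  { id: 1, label_index: 7 },
  { id: 2, label_index: 3 },
  { id: 3, label_index: 5 }
]
puts choose_seed(examples)[:id]
```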
