Module Ariel
In: lib/ariel/node/structure.rb
lib/ariel/node/extracted.rb
lib/ariel/learner.rb
lib/ariel/rule_set.rb
lib/ariel/labeled_document_loader.rb
lib/ariel/rule.rb
lib/ariel/wildcards.rb
lib/ariel/token_stream.rb
lib/ariel/log.rb
lib/ariel/node.rb
lib/ariel/label_utils.rb
lib/ariel/token.rb
lib/ariel/candidate_refiner.rb
lib/ariel.rb

Ariel - A Ruby Information Extraction Library

Ariel helps extract information from semi-structured documents, including (but not limited to) web pages. Although you could use libraries such as Hpricot or Rubyful Soup, or even plain regular expressions, to achieve the same goal, Ariel approaches the problem very differently. The user labels examples of the data they want to extract, and Ariel then finds patterns across several such labeled examples to produce a set of general rules for extracting this information from any similar document.

When working with Ariel, your workflow might look something like this:

  1. Define a structure for the data you wish to extract. For example:
     @structure = Ariel::Node::Structure.new do |r|
       r.item :article do |a|
         a.item :title
         a.item :author
         a.item :date
         a.item :body
       end
       r.list :comments do |c|
         c.list_item :comment do |comment|
           comment.item :author
           comment.item :date
           comment.item :body
         end
       end
     end
    
  2. Label these fields in a few example documents (normally at least 3). Labels take the form <l:label_name>…</l:label_name>.
  3. Ariel will read these examples and try to generate suitable rules that can be used to extract this data from other similarly structured documents. Use Ariel#learn to initiate rule learning.
  4. A wrapper has been generated - we can now happily load documents with the same structure (normally documents generated by the same rules, so different pages from a single site perhaps) and query the extracted data. See Ariel#extract.
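
To make the label format from step 2 concrete, here is a minimal plain-Ruby sketch that pulls labeled fields out of a small document. It does not use Ariel and is not how Ariel parses labeled documents; the LABEL_PATTERN constant, the labeled_fields helper, and the sample text are all assumptions made for illustration.

```ruby
# Illustration of the <l:label_name>...</l:label_name> format only.
# LABEL_PATTERN and labeled_fields are hypothetical names, not Ariel's API.
LABEL_PATTERN = %r{<l:(\w+)>(.*?)</l:\1>}m

# Returns [label_name, labeled_text] pairs in document order.
# Labels nested inside another label are not unpacked by this sketch;
# only the outermost match is reported.
def labeled_fields(text)
  text.scan(LABEL_PATTERN).map { |name, body| [name.to_sym, body.strip] }
end

# A made-up labeled example document.
labeled = <<~DOC
  <h1><l:title>Extraction by example</l:title></h1>
  <p>By <l:author>A. Writer</l:author></p>
DOC

labeled_fields(labeled).each { |name, value| puts "#{name}: #{value}" }
```

The label names used in a document are expected to match the item names declared in the structure from step 1, such as :title and :author.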

Methods

extract   learn  

Classes and Modules

Module Ariel::LabelUtils
Module Ariel::Node
Class Ariel::CandidateRefiner
Class Ariel::LabeledDocumentLoader
Class Ariel::Learner
Class Ariel::Log
Class Ariel::Node
Class Ariel::Rule
Class Ariel::RuleSet
Class Ariel::Token
Class Ariel::TokenStream
Class Ariel::Wildcards

Public Class methods

Ariel.extract uses the given root Node::Structure to extract information from each of the given files. Each file may be any object that responds to read; a string is treated as a path and opened with File.read. If a block is given, each root Node::Extracted is yielded to it. An array containing each root extracted node is returned.

Ariel.extract structure, 'file1.txt', fileobj, 'file2.html' # => an array of 3 Node::Extracted objects
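
The calling convention described here (accept either readable objects or path strings, yield each result if a block is given, and return all results as an array) can be sketched in plain Ruby as follows. This is only an illustration of the pattern under stated assumptions, not Ariel's implementation; read_source and each_document are hypothetical names.

```ruby
require 'stringio'

# Hypothetical helper, not part of Ariel: returns the contents of +source+,
# calling #read on IO-like objects and treating anything else as a file path.
def read_source(source)
  source.respond_to?(:read) ? source.read : File.read(source)
end

# Hypothetical stand-in for the calling convention: reads every source,
# yields each result when a block is given, and returns all results.
def each_document(*sources)
  sources.map do |source|
    text = read_source(source)
    yield text if block_given?
    text
  end
end

docs = each_document(StringIO.new("first"), StringIO.new("second")) do |text|
  puts "read #{text.length} characters"
end
```

In this sketch docs holds one entry per source, in the order the sources were passed, whether or not a block was supplied.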

Ariel.learn takes a root Node::Structure and a list of labeled_files (either IO objects or strings naming files that can be opened with File.read) and learns rules from the labeled examples. The passed Node::Structure tree is returned with new RuleSets added containing the learnt rules. This structure can then be used with Ariel#extract on unlabeled documents.

Ariel.learn structure, 'file1.html', fileobj, 'file2.html'
