Ariel aims to assist in extracting information from semi-structured documents, including (but not limited to) web pages. Although you could use libraries such as Hpricot or Rubyful Soup, or even plain regular expressions, to achieve the same goal, Ariel approaches the problem very differently. It relies on the user labeling examples of the data they want to extract, then finds patterns across several such labeled examples to produce a set of general rules for extracting that information from any similar document.
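To make the idea of learning from labeled examples concrete, here is a toy sketch. It is not Ariel's algorithm, which this section does not describe; it only assumes that a "rule" can be as simple as the text that always appears immediately before the labeled value. The names Example, learn_prefix and extract_after are hypothetical and do not exist in Ariel.

```ruby
# Toy illustration only; this is NOT Ariel's learning algorithm.
# Each example pairs a document with the value a user labeled in it.
Example = Struct.new(:text, :value)

# Returns the longest string that immediately precedes the labeled value
# in every example (a shared "landmark"). Assumes each value occurs in
# its text.
def learn_prefix(examples)
  prefixes = examples.map { |ex| ex.text[0, ex.text.index(ex.value)] }
  shortest = prefixes.map(&:length).min
  length = 0
  # Grow the common suffix one character at a time, from the end backwards.
  while length < shortest &&
        prefixes.map { |prefix| prefix[-(length + 1)] }.uniq.size == 1
    length += 1
  end
  length.zero? ? "" : prefixes.first[-length, length]
end

# Applies the learned landmark: returns the text between the landmark and
# the terminator, or nil if the landmark does not occur in the text.
def extract_after(text, landmark, terminator)
  start = text.index(landmark)
  return nil unless start
  start += landmark.length
  finish = text.index(terminator, start) || text.length
  text[start...finish]
end

examples = [
  Example.new("<h1>Title: Ruby tips</h1>", "Ruby tips"),
  Example.new("<div>Title: Parsing HTML</div>", "Parsing HTML")
]
landmark = learn_prefix(examples)                            # => ">Title: "
extract_after("<p>Title: New post</p>", landmark, "<")       # => "New post"
```

The two labeled examples disagree on the surrounding tag but agree on ">Title: ", so that shared context becomes the rule applied to the unseen third document.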
When working with Ariel, your workflow might look something like this:
@structure = Ariel::StructureNode.new do |r|
  r.item :article do |a|
    a.item :title
    a.item :author
    a.item :date
    a.item :body
  end
  r.list :comments do |c|
    c.list_item :comment do |c|
      c.item :author
      c.item :date
      c.item :body
    end
  end
end
Ariel.extract uses the given root Node::Structure to extract information from each of the given files. Each file can be any object that responds to read; if a string is passed, it is treated as a path and opened using File.read. If a block is given, each root Node::Extracted is yielded. An array of the root extracted nodes is returned.
Ariel.extract structure, 'file1.txt', fileobj, 'file2.html' # => an array of 3 Node::Extracted objects
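The input handling described above (accept either a path string or anything that responds to read, yield each result to an optional block, and return everything as an array) can be sketched as follows. This is a minimal illustration of the calling convention and not Ariel's source; read_source and each_document are hypothetical helpers.

```ruby
require 'stringio'

# Hypothetical helper, not part of Ariel: returns the text of one input.
# Objects that respond to #read (File, StringIO, ...) are read directly;
# anything else is treated as a path and opened with File.read.
def read_source(source)
  source.respond_to?(:read) ? source.read : File.read(source.to_s)
end

# Hypothetical helper: reads every source, yields each text to the block
# if one is given, and returns all texts as an array.
def each_document(*sources)
  sources.map do |source|
    text = read_source(source)
    yield text if block_given?
    text
  end
end

# Usage with in-memory IO objects standing in for files.
texts = each_document(StringIO.new("first"), StringIO.new("second")) do |text|
  puts "read #{text.length} characters"
end
```

Checking respond_to?(:read) rather than the class keeps the interface open to any IO-like object, which matches the duck-typed behaviour the description implies.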
Given a root Node::Structure and a list of labeled files (either IO objects or strings naming files that can be opened with File.read), Ariel.learn learns rules from the labeled examples. The passed Node::Structure tree is returned with new RuleSets added containing the learnt rules. This structure can then be used with Ariel.extract on unlabeled documents.
Ariel.learn structure, 'file1.html', fileobj, 'file2.html'