| Class | Ariel::TokenStream |
| In: | lib/ariel/token_stream.rb |
| Parent: | Object |
A TokenStream instance stores a stream of Tokens once it has used its tokenization rules to extract them from a string. A TokenStream knows its current position (TokenStream#cur_pos), which is incremented when any of the Enumerable methods are used (due to the redefinition of TokenStream#each). As you advance through the stream, the current token is always returned and then consumed. A TokenStream also provides methods for finding patterns in a given stream much like StringScanner but for an array of tokens. For rule generation, a certain token can be marked as being the start point of a label. Finally, a TokenStream will record whether it is in a reversed or unreversed state so that when rules are applied, they are always applied from the front or end of the stream as required, whether it is reversed or not.
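The position tracking described above can be illustrated with a minimal sketch. `CountingStream` is a hypothetical stand-in written for this example, not Ariel's implementation; it only assumes that `each` is defined so that every yielded token advances `cur_pos`.

```ruby
# Hypothetical stand-in for illustration only, not Ariel::TokenStream.
class CountingStream
  include Enumerable
  attr_accessor :cur_pos

  def initialize(tokens)
    @tokens = tokens
    @cur_pos = 0
  end

  # Yield tokens from the current position onward. The position is advanced
  # before yielding, so a token counts as consumed even if the caller breaks
  # out of the iteration early.
  def each
    return enum_for(:each) unless block_given?
    while @cur_pos < @tokens.size
      token = @tokens[@cur_pos]
      @cur_pos += 1
      yield token
    end
    self
  end
end

stream = CountingStream.new(%w[a b c])
stream.first    # => "a"; cur_pos is now 1
stream.to_a     # => ["b", "c"]; cur_pos is now 3
```

Because every Enumerable method is built on `each`, they all consume tokens in the same way in this sketch.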
| TOKEN_REGEXEN | = | [ Wildcards.list[:html_tag], # Match html tags that don't have attributes /\d+/, # Match any numbers, probably good to make a split /\b\w+\b/, # Pick up words, will split at punctuation /\S/ |
| LABEL_TAG_REGEXEN | = | [LabelUtils.any_label_regex] |
| cur_pos | [RW] | |
| label_index | [RW] | |
| original_text | [RW] | |
| tokens | [RW] |
Used to ensure operations such as @tokens.reverse! in one instance won’t inadvertently affect another.
Returns all text represented by the instance’s stored tokens. It will not strip label tags even if the stream is marked to contain them. However, you should not expect to get the raw_text once any label_tags have been filtered (TokenStream#remove_label_tags).
Goes through all stored Token instances, removing each one for which Token#is_label_tag? is true. Called after a labeled document has been extracted to a tree, ready for the rule learning process.
Returns a copy of the current instance with a reversed set of tokens. If it is set, the label_index is adjusted accordingly to point to the correct token.
Converts the given position so that it points to the same token once the stream is reversed. The result is invalid when @tokens.size == 0.
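One plausible form of this conversion is sketched below. The arithmetic is the standard index mapping for a reversed array; the standalone helper and its `size` parameter are assumptions made for the example, since the real method works on the instance's own tokens.

```ruby
# Hypothetical helper, not Ariel's actual code: maps an index in a list of
# `size` tokens to the index of the same token after the list is reversed.
def reverse_pos(pos, size)
  raise ArgumentError, "an empty stream has no valid positions" if size.zero?
  size - (pos + 1)
end

tokens = %w[a b c d]
pos = 1                                # points at "b"
rev = reverse_pos(pos, tokens.size)    # index of "b" in the reversed list
puts tokens.reverse[rev]               # prints "b"
```

The sketch raises for an empty list to make the invalid case explicit; the original description only states that the result is invalid in that case.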
Set a label at a given offset in the original text. Searches for a token with a start_loc equal to the position passed as an argument, and raises an error if one is not found.
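The lookup might look roughly like the following. The `Token` struct, the helper name `label_index_for`, and the choice of `ArgumentError` are all assumptions for illustration; the description above only says that an error is raised.

```ruby
# Simplified token for the sketch; the real Token class carries more state.
Token = Struct.new(:text, :start_loc, :end_loc)

# Hypothetical helper: returns the index of the token that starts at `offset`
# in the original text, raising if no token starts exactly there.
def label_index_for(tokens, offset)
  index = tokens.index { |token| token.start_loc == offset }
  raise ArgumentError, "no token starts at offset #{offset}" if index.nil?
  index
end

tokens = [Token.new("Tel", 0, 3), Token.new("555", 4, 7)]
label_index_for(tokens, 4)    # => 1

begin
  label_index_for(tokens, 5)  # offset 5 falls inside "555"
rescue ArgumentError => e
  puts e.message              # prints "no token starts at offset 5"
end
```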
Takes a list of Strings and Symbols as its arguments, representing text to be matched in individual tokens and Wildcards respectively. For a match to succeed, all wildcards and strings must match a consecutive sequence of Tokens in the TokenStream. All matched Tokens are consumed, and the TokenStream’s current position is returned on success. On failure, the TokenStream is restored to its original state and nil is returned.
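The matching behaviour can be sketched over a plain array of token texts. Everything here is a hypothetical illustration: `MiniStream`, the `WILDCARDS` table, and the method name `skip_to` are assumptions for the example, and tokens skipped before the match are treated as consumed too.

```ruby
# Hypothetical wildcard table; Ariel keeps its own in Wildcards.list.
WILDCARDS = {
  numeric: /\A\d+\z/,
  anything: /.*/m
}.freeze

# Illustrative stand-in, not Ariel::TokenStream.
class MiniStream
  attr_reader :cur_pos

  def initialize(texts)
    @texts = texts   # token texts only, for brevity
    @cur_pos = 0
  end

  # Scan forward for the first run of consecutive tokens matching every
  # feature. Strings must equal the token text; Symbols name a wildcard.
  # Returns the new position on success; on failure the position is left
  # where it was and nil is returned.
  def skip_to(*features)
    original_pos = @cur_pos
    last_start = @texts.size - features.size
    (@cur_pos..last_start).each do |start|
      window = @texts[start, features.size]
      if features.zip(window).all? { |feature, text| match?(feature, text) }
        @cur_pos = start + features.size   # consume through the match
        return @cur_pos
      end
    end
    @cur_pos = original_pos
    nil
  end

  private

  def match?(feature, text)
    feature.is_a?(Symbol) ? WILDCARDS.fetch(feature).match?(text) : feature == text
  end
end

stream = MiniStream.new(%w[Price : 42 USD])
stream.skip_to(":", :numeric)   # => 3
stream.skip_to("EUR")           # => nil; cur_pos stays at 3
```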
Returns the slice of the current instance containing all the tokens between the token where the start_loc == the left parameter and the token where the end_loc == the right parameter.
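A rough sketch of slicing by text offsets follows. The helper name, the inclusive treatment of both boundary tokens, and returning nil when a boundary is not found are assumptions, since the description above does not specify them.

```ruby
# Simplified token for the sketch; the real Token class carries more state.
Token = Struct.new(:text, :start_loc, :end_loc)

# Hypothetical helper: returns the tokens from the one starting at `left`
# through the one ending at `right`, or nil if either boundary is missing.
def slice_by_offsets(tokens, left, right)
  first = tokens.index { |token| token.start_loc == left }
  last = tokens.index { |token| token.end_loc == right }
  return nil if first.nil? || last.nil? || last < first
  tokens[first..last]
end

tokens = [Token.new("Tel", 0, 3), Token.new("555", 4, 7), Token.new("!", 7, 8)]
slice_by_offsets(tokens, 4, 8).map(&:text)   # => ["555", "!"]
```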
Returns all text represented by the instance’s stored tokens, stripping any label tags if the stream was declared to contain them when it was initialized (this would only happen during the process of loading labeled examples). See also TokenStream#raw_text.
The tokenizer operates on a string by splitting it at every point where it finds a match to a regular expression. Each match is added as a token, and the strings between matches are stored along with their original offsets. The same is then done with the next regular expression on each of these split strings, and new tokens are created with the correct offset in the original text. Any characters left unmatched by all of the regular expressions in TokenStream::TOKEN_REGEXEN are discarded. This approach allows a hierarchy of regular expressions to work simply and easily: a simple regular expression to match HTML tags might operate first, and later expressions that pick up runs of word characters can then operate on what’s left. If contains_labels is set to true when calling tokenize, the tokenizer will first remove and discard any occurrences of label_tags (as defined by the Regex set in LabelUtils) before matching and adding tokens. Any label_tag tokens will be marked as such upon creation.
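The hierarchical splitting described above can be sketched as follows. This is a simplified illustration rather than Ariel's code: the `Token` struct, the `RULES` list (a rough stand-in for TOKEN_REGEXEN), and the standalone `tokenize` function are assumptions, and label tag handling is left out.

```ruby
# Simplified token for the sketch; the real Token class carries more state.
Token = Struct.new(:text, :start_loc, :end_loc)

# Hypothetical rule list, applied in order of priority.
RULES = [
  /<\/?\w+>/,   # attribute-free HTML tags
  /\d+/,        # runs of digits
  /\b\w+\b/,    # words
  /\S/          # any other non-whitespace character
].freeze

def tokenize(text, rules = RULES)
  tokens = []
  segments = [[text, 0]]   # unmatched pieces paired with their offset in text
  rules.each do |regex|
    remaining = []
    segments.each do |segment, offset|
      cursor = 0
      segment.scan(regex) do
        m = Regexp.last_match
        # Text before this match stays available to later rules.
        if m.begin(0) > cursor
          remaining << [segment[cursor...m.begin(0)], offset + cursor]
        end
        tokens << Token.new(m[0], offset + m.begin(0), offset + m.end(0))
        cursor = m.end(0)
      end
      # Text after the last match (or the whole segment if nothing matched).
      remaining << [segment[cursor..-1], offset + cursor] if cursor < segment.size
    end
    segments = remaining
  end
  # Whatever is still unmatched (here, only whitespace) is discarded.
  tokens.sort_by(&:start_loc)
end

tokenize("<b>Tel 555</b>!").each do |token|
  puts "#{token.text.inspect} #{token.start_loc}...#{token.end_loc}"
end
```

Because each rule only sees the text that earlier rules left unmatched, the tag rule claims `<b>` and `</b>` before the catch-all `/\S/` can break them into single characters.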