AI::DecisionTree - Automatically Learns Decision Trees
version 0.11
    use AI::DecisionTree;
    my $dtree = new AI::DecisionTree;

    # A set of training data for deciding whether to play tennis
    $dtree->add_instance
      (attributes => {outlook     => 'sunny',
                      temperature => 'hot',
                      humidity    => 'high'},
       result => 'no');

    $dtree->add_instance
      (attributes => {outlook     => 'overcast',
                      temperature => 'hot',
                      humidity    => 'normal'},
       result => 'yes');

    ... repeat for several more instances, then:
    $dtree->train;

    # Find results for unseen instances
    my $result = $dtree->get_result
      (attributes => {outlook     => 'sunny',
                      temperature => 'hot',
                      humidity    => 'normal'});
The "AI::DecisionTree" module automatically creates so-called "decision trees" to explain a set of training data. A decision tree is a kind of categorizer that use a flowchart-like process for categorizing new instances. For instance, a learned decision tree might look like the following, which classifies for the concept "play tennis":
                     OUTLOOK
                     /  |  \
                    /   |   \
                   /    |    \
             sunny/  overcast \rainy
                 /      |      \
           HUMIDITY     |      WIND
            /  \      *yes*    /  \
           /    \             /    \
      high/      \normal     /      \
         /        \   strong/        \weak
       *no*      *yes*      /          \
                          *no*        *yes*
(This example, and the inspiration for the "AI::DecisionTree" module, come directly from Tom Mitchell's excellent book "Machine Learning", available from McGraw Hill.)
A decision tree like this one can be learned from training data, and then applied to previously unseen data to obtain results that are consistent with the training data.
The usual goal of a decision tree is to somehow encapsulate the training data in the smallest possible tree. This is motivated by an "Occam's Razor" philosophy, in which the simplest possible explanation for a set of phenomena should be preferred over other explanations. Also, small trees will make decisions faster than large trees, and they are much easier for a human to look at and understand. One of the biggest reasons for using a decision tree instead of many other machine learning techniques is that a decision tree is a much more scrutable decision maker than, say, a neural network.
The current implementation of this module uses an extremely simple method for creating the decision tree based on the training instances. It uses an Information Gain metric (based on expected reduction in entropy) to select the "most informative" attribute at each node in the tree. This is essentially the ID3 algorithm, developed by J. R. Quinlan in 1986. The idea is that the attribute with the highest Information Gain will (probably) be the best attribute to split the tree on at each point if we're interested in making small trees.
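To make the metric concrete, here is a small stand-alone sketch (not code from the module itself) that computes entropy and the information gain of one attribute over a list of training instances; the instance layout and helper names are invented for illustration:

    use strict;
    use warnings;
    use List::Util qw(sum);

    # Entropy of a list of result labels: -sum_c p(c) * log2 p(c)
    sub entropy {
        my @labels = @_;
        my %count;
        $count{$_}++ for @labels;
        my $total = @labels;
        my $h = sum map { my $p = $_ / $total; $p * log($p) / log(2) } values %count;
        return -$h;
    }

    # Information gain of splitting @instances on $attr; each instance is
    # a hash ref of the form { attributes => {...}, result => '...' }.
    sub information_gain {
        my ($attr, @instances) = @_;
        my $base = entropy(map { $_->{result} } @instances);
        my %subset;
        push @{ $subset{ $_->{attributes}{$attr} } }, $_ for @instances;
        my $remainder = sum map {
            my $group = $_;
            @$group / @instances * entropy(map { $_->{result} } @$group);
        } values %subset;
        return $base - $remainder;
    }

With Mitchell's full fourteen-instance tennis data set, "outlook" comes out with the highest gain, which is why it sits at the root of the example tree above.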
If "noise_mode" is set to "fatal" (the default), the "train()" method will throw an exception (die). If "noise_mode" is set to "pick_best", the most frequent result at each noisy node will be selected.
An optional "name" parameter lets you give a unique name to each training instance. This can be used in coordination with the "set_results()" method below.
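A minimal sketch (the instance name "day1" is invented for illustration):

    $dtree->add_instance
      (attributes => {outlook => 'sunny', temperature => 'hot', humidity => 'high'},
       result     => 'no',
       name       => 'day1');   # 'day1' can later be referenced via set_results()

Once a tree has been trained, its structure can be reported as a nested Perl data structure; for the tennis example it looks like this: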
    [ 'outlook', {
        'rain' => [ 'wind', {
            'strong' => 'no',
            'weak'   => 'yes',
        } ],
        'sunny' => [ 'humidity', {
            'normal' => 'yes',
            'high'   => 'no',
        } ],
        'overcast' => 'yes',
    } ]
This is slightly reminiscent of how XML::Parser returns the parsed XML tree.
Note that while the ordering in the hashes is unpredictable, the nesting is in the order in which the criteria will be checked at decision-making time.
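To show how such a structure can be consumed, here is a hedged sketch of a recursive walker (not part of the module); the structure is written out literally rather than obtained from a method call:

    use strict;
    use warnings;

    my $tree = [ 'outlook', {
        'rain'     => [ 'wind',     { 'strong' => 'no',  'weak' => 'yes' } ],
        'sunny'    => [ 'humidity', { 'normal' => 'yes', 'high' => 'no'  } ],
        'overcast' => 'yes',
    } ];

    # Follow the [attribute, {value => subtree}] nesting for one instance
    sub classify {
        my ($node, $attrs) = @_;
        return $node unless ref $node;            # a plain string is a leaf result
        my ($attr, $branches) = @$node;
        return classify($branches->{ $attrs->{$attr} }, $attrs);
    }

    print classify($tree, { outlook => 'rain', wind => 'weak' }), "\n";   # prints "yes"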
    if outlook='rain' and wind='strong' -> 'no'
    if outlook='rain' and wind='weak' -> 'yes'
    if outlook='sunny' and humidity='normal' -> 'yes'
    if outlook='sunny' and humidity='high' -> 'no'
    if outlook='overcast' -> 'yes'
This can be helpful for scrutinizing the structure of a tree.
Note that while the order of the rules is unpredictable, the order of criteria within each rule reflects the order in which the criteria will be checked at decision-making time.
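If your version of the module provides the "rule_statements()" method (recent releases do), dumping the rules is a one-liner; a minimal sketch:

    print map "$_\n", $dtree->rule_statements;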
A "leaf_colors" argument can specify a fill color for each leaf node in the tree. The keys of the hash should be the same as the strings appearing as the "result" parameters given to "add_instance()", and the values should be any GraphViz-style color specification.
Any additional arguments given to "as_graphviz()" will be passed on to GraphViz's "new()" method. See the GraphViz docs for more info.
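For example, assuming "as_graphviz()" returns a GraphViz object (as the pass-through arguments suggest) and the GraphViz module is installed, something like the following should render the tree to a PNG file; the color names are arbitrary:

    my $graph = $dtree->as_graphviz(leaf_colors => { yes => 'palegreen',
                                                     no  => 'lightpink' });
    $graph->as_png('tennis_tree.png');   # any GraphViz output method works here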
A few limitations exist in the current version. All of them could be removed in future versions - especially with your help. =)
The usual way to deal with continuous-valued attributes (which the current version does not handle directly) is for the tree-building process to figure out how to place the continuous values into a set of bins (like "cool", "medium", and "hot") and then build the tree based on these bin values. Future versions of "AI::DecisionTree" may provide support for this. For now, you have to do it yourself, as sketched below.
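A hedged sketch of doing that discretization by hand before calling "add_instance()"; the cutoff values are invented for illustration:

    # Map a raw Fahrenheit temperature onto a discrete bin
    sub temperature_bin {
        my $deg_f = shift;
        return $deg_f < 60 ? 'cool'
             : $deg_f < 80 ? 'medium'
             :               'hot';
    }

    $dtree->add_instance
      (attributes => {outlook     => 'sunny',
                      temperature => temperature_bin(72),   # becomes 'medium'
                      humidity    => 'high'},
       result => 'no');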
All the stuff in the LIMITATIONS section. Also, revisit the pruning algorithm to see how it can be improved.
Ken Williams, ken@mathforum.org
Mitchell, Tom (1997). Machine Learning. McGraw-Hill. pp 52-80.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), pp 81-106.
perl, GraphViz