Modeling

This brief introduction to ChoiceMaker data analysis provides a high-level description of the process and tools for developing a ChoiceMaker matching model.  For a more detailed description of ChoiceMaker data analysis, please see the OSCMT Data Modeler's Guide and the OSCMT Wiki.

Design, Train, Test

ChoiceMaker models are developed and refined iteratively.  A model developer begins by writing clues that indicate which data correlations between two records point toward a match, hold, or differ decision.  Next, the candidate clues are trained against a set of record pairs (the training set) that have been marked by one or more data experts.  Finally, the new clue weights are tested against a different set of record pairs (the test set) that have also been marked by data experts.
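To make the idea of a clue concrete, here is a minimal sketch in plain Java; it is not actual ChoiceMaker clue syntax, and the Person record, its fields, and the three example clues are all hypothetical.  Each clue is a boolean predicate over a candidate pair of records, tagged with the decision it argues for; training then assigns each clue a weight according to how reliably it predicts the human markings.

    import java.util.function.BiPredicate;

    enum Decision { MATCH, HOLD, DIFFER }

    record Person(String firstName, String lastName, String ssn) {}

    // A clue: a named predicate over a pair of records, plus the decision
    // that the clue argues for when it fires.
    record Clue(String name, Decision pointsToward, BiPredicate<Person, Person> fires) {}

    public class ClueSketch {
        // Agreement on SSN is evidence for a match.
        static final Clue SAME_SSN = new Clue("sameSSN", Decision.MATCH,
                (q, m) -> q.ssn() != null && q.ssn().equals(m.ssn()));

        // A missing SSN on either record argues for human review (hold).
        static final Clue MISSING_SSN = new Clue("missingSSN", Decision.HOLD,
                (q, m) -> q.ssn() == null || m.ssn() == null);

        // Disagreement on last name is evidence for a differ.
        static final Clue DIFFERENT_LAST_NAME = new Clue("differentLastName", Decision.DIFFER,
                (q, m) -> q.lastName() != null && m.lastName() != null
                        && !q.lastName().equalsIgnoreCase(m.lastName()));

        public static void main(String[] args) {
            Person q = new Person("Ann", "Smith", "123-45-6789");
            Person m = new Person("Anne", "Smith", "123-45-6789");
            for (Clue c : new Clue[] { SAME_SSN, MISSING_SSN, DIFFERENT_LAST_NAME }) {
                if (c.fires().test(q, m)) {
                    System.out.println(c.name() + " fires, pointing toward " + c.pointsToward());
                }
            }
        }
    }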

This iterative process continues until the model under development achieves satisfactory accuracy, compared to the human markings, on the test pairs.  At that point, the model is ready to be deployed in a production application or to a production server.
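The stopping criterion can be sketched as follows, reusing the Person and Decision types from the sketch above and assuming a hypothetical Model interface; accuracy here is simple agreement between the model's decision and the human marking on each test pair.

    import java.util.List;

    // A record pair together with the decision assigned by a human expert.
    record MarkedPair(Person q, Person m, Decision humanMarking) {}

    // Hypothetical interface to the model under development.
    interface Model {
        Decision decide(Person q, Person m);       // score a pair with the current clue weights
        void train(List<MarkedPair> trainingSet);  // re-estimate the clue weights
    }

    class IterativeModeling {
        // Fraction of test pairs on which the model agrees with the human marking.
        static double accuracy(Model model, List<MarkedPair> testSet) {
            long agreed = testSet.stream()
                    .filter(p -> model.decide(p.q(), p.m()) == p.humanMarking())
                    .count();
            return (double) agreed / testSet.size();
        }

        static void developUntilAccurate(Model model,
                                         List<MarkedPair> trainingSet,
                                         List<MarkedPair> testSet,
                                         double target) {
            double acc;
            do {
                model.train(trainingSet);        // refit the clue weights
                acc = accuracy(model, testSet);  // evaluate on the held-out pairs
                // In practice the developer also revises clues and marks new
                // pairs between iterations; this loop elides those steps.
            } while (acc < target);
        }
    }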

[Figure: ChoiceMaker Iterative Modeling Process]

Bootstrapping

A critical part of the ChoiceMaker modeling process is the development of appropriate clues.  An equally critical part is the selection of good pairs for training and testing.

What constitutes a good pair for training or testing?  First and foremost, the training and test sets should be disjoint: a pair that is used for training should not be used for testing, and no pair should be used more than once.  Second, the pairs should be relatively ambiguous.  The accuracy of a ChoiceMaker model compared to human intuition is generally not improved by including large numbers of pairs that are obvious matches (in which all fields agree exactly), obvious holds (in which all fields are missing or invalid), or obvious differs (in which all fields are completely different).
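One way to honor these rules is sketched below, using the hypothetical MarkedPair type from the sketch above: duplicates are removed first so that no pair is used more than once, and a single shuffle-and-cut split guarantees that the training and test sets are disjoint.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Random;

    // Disjoint training and test sets of marked pairs.
    record Split(List<MarkedPair> training, List<MarkedPair> test) {}

    class PairSplitter {
        static Split split(List<MarkedPair> marked, double trainFraction, long seed) {
            // De-duplicate while preserving order, so each pair appears only once.
            List<MarkedPair> unique = new ArrayList<>(new LinkedHashSet<>(marked));
            // Shuffle, then cut: the two halves cannot share a pair.
            Collections.shuffle(unique, new Random(seed));
            int cut = (int) Math.round(unique.size() * trainFraction);
            return new Split(unique.subList(0, cut), unique.subList(cut, unique.size()));
        }
    }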

How does one find good pairs?  Several ChoiceMaker tools comb through a database of (single) records and pull out pairs of records with ambiguous ChoiceMaker probabilities.  Typically, these tools use the same model that is under development to find pairs that score ambiguously under that model.  Human data experts then review the new pairs and mark each one as a match, hold, or differ.

Early in the development process, pairs that score ambiguously under the nascent model typically do not appear at all ambiguous to human reviewers, so when the model is trained against the newly marked pairs, the model weights move toward values that push pair probabilities toward less ambiguous scores.  As the iterative process continues, human markings and ChoiceMaker probabilities begin to coalesce.
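The pair-selection step can be sketched as follows, assuming a hypothetical probability function in [0, 1] produced by the model under development, and two thresholds: pairs scoring at or below the differ threshold or at or above the match threshold are considered obvious, and only the pairs in between are kept for human review.

    import java.util.List;
    import java.util.function.ToDoubleBiFunction;

    // A candidate pair of records pulled from the database.
    record CandidatePair(Person q, Person m) {}

    class AmbiguousPairFinder {
        // Keep only the pairs whose score falls strictly between the thresholds.
        static List<CandidatePair> find(List<CandidatePair> candidates,
                                        ToDoubleBiFunction<Person, Person> probability,
                                        double differThreshold,
                                        double matchThreshold) {
            return candidates.stream()
                    .filter(p -> {
                        double score = probability.applyAsDouble(p.q(), p.m());
                        return score > differThreshold && score < matchThreshold;
                    })
                    .toList();
        }
    }

As the model improves, fewer candidate pairs fall inside this ambiguous band, which is one concrete sign that the human markings and the ChoiceMaker probabilities are coalescing.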