Two (3, 5, 8 or 100) wrongs do make it right! A Set Theory Approach to Data Self-Organization


Missing, incomplete, redundant, and conflicting data is a major problem in uniquely identifying members and patients. Customer centricity has risen to the top of nearly every business strategy: Communicating to, Connecting with, and Convincing a customer to take certain actions has become the gold standard for predicting business success.

The fact that plan members, patients, and other consumers of services (healthcare or otherwise) cannot always be uniquely identified makes it difficult to personalize services (the one-size-fits-one principle!). Other services, such as care and disease management, also become costly and ineffective.

Another challenge comes in the form of attempting to merge claims, clinical, biometric, or other personal data in an M&A event.

For a business to Connect, Communicate, and Convince a customer, it must first Know that customer. Easier said than done, in many cases!

In other cases, the certain or near-certain identification of an entity from fragmented, conflicting, or ambiguous data elements is of great interest to law enforcement agencies and national security efforts.

The difficulty of determining what uniquely identifies a customer has daunted information scientists for years. It stems from the fact that a ‘typical’ customer may have multiple variations in the spelling of their name, foreign and non-Latin-alphabet names, various sources for dates of birth, variances in Social Security numbers, biometric identifiers, ethnic identifiers, referral sources, various identification records, membership status, and so on.

Using Set Theory principles, I have been working on a method by which multiple ambiguous, incomplete, conflicting, and unassociated data files can be associated and reconciled into a first-degree (most plausible) record, as well as second- and third-degree alternatives.

The following is a pseudo-algorithm describing the method:

Consider a heterogeneous set of N data files for individuals or any other entity. These files could be of varying length and data structure, and could contain conflicting or incomplete data.

Following the steps below, we will show that it is possible to correct, correlate, and combine multiple files belonging to the same entity.
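For concreteness, heterogeneous fragments of this kind might look as follows in Python. The field names, file names, and values are my own illustrative assumptions; the method itself does not prescribe a schema:

```python
# Three hypothetical fragments that may all describe the same person.
# Field names, source names, and values are illustrative only.
records = [
    {"source": "claims.csv",  "first": "Jon",  "last": "Smyth",
     "ssn": "123-45-6789", "dob": "1970-01-02"},
    {"source": "clinic.json", "first": "John", "last": "Smith",
     "ssn": "123-45-6789"},                        # DOB missing entirely
    {"source": "crm.xml",     "first": "J.",   "last": "Smith",
     "ssn": "123-45",       "dob": "1970-02-01"},  # partial SSN, conflicting DOB
]
```

Note that no single field is complete and consistent across all three files, which is exactly the situation the steps below are designed to handle.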

1- Using a pre-defined dictionary, scan all files for keys

2- Index all keys

3- Group files by keys and by degree:

1.    Groups can contain 1 to n keys. These keys can be full or partial. For example, three files contain the same Social Security number, and a fourth contains only two of the three SSN segments

4- Create ‘Sets’ that contain related keys and rank them from highest to lowest

5- Within these sets, create ‘Assemblies’ based on the strength (score) of the associative keys. For example, an SSN can be assigned a strength of 7, while a first name may have a strength of 1.

6- Create criteria for identification confidence – How many keys? How strong is the association?

7- Invoke election criteria for the ‘right’ data element. For example, if three DOBs exist, which is the right one?

8- Re-assemble the ‘whole record’ / ‘prime record’

9- Create 2nd and 3rd order records.

10- Create an associative self-learning algorithm for future identification
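Taken together, steps 1 through 8 can be sketched in a few functions. This is a minimal illustration under my own assumptions, not the actual implementation: the strengths of 7 for SSN and 1 for first name come from step 5, while the DOB and last-name weights, the scoring rule, and all function names are mine.

```python
from collections import Counter, defaultdict

# Key dictionary with per-key strengths. SSN = 7 and first name = 1 follow
# the example in step 5; the DOB and last-name weights are assumptions.
KEY_STRENGTH = {"ssn": 7, "dob": 4, "last": 2, "first": 1}

def index_keys(records):
    """Steps 1-3: scan each record for dictionary keys and index which
    records share each (key, value) pair."""
    index = defaultdict(set)
    for i, rec in enumerate(records):
        for key in KEY_STRENGTH:
            if rec.get(key):
                index[(key, rec[key])].add(i)
    return index

def score_assemblies(index):
    """Steps 4-6: group records that share keys into candidate 'assemblies'
    and score each by the summed strength of its shared keys."""
    scores = Counter()
    for (key, _value), ids in index.items():
        if len(ids) > 1:
            scores[frozenset(ids)] += KEY_STRENGTH[key] * (len(ids) - 1)
    return scores

def elect(values):
    """Step 7: among conflicting candidates, elect the most common value."""
    return Counter(values).most_common(1)[0][0] if values else None

def prime_record(records, ids):
    """Step 8: re-assemble a 'prime' record by electing each field."""
    return {k: elect([records[i][k] for i in ids if records[i].get(k)])
            for k in KEY_STRENGTH}

records = [
    {"first": "Jon",  "last": "Smith", "ssn": "123-45-6789", "dob": "1970-01-02"},
    {"first": "John", "last": "Smith", "ssn": "123-45-6789", "dob": "1970-01-02"},
    {"first": "John", "last": "Smyth", "ssn": "123-45-6789", "dob": "1970-02-01"},
]
scores = score_assemblies(index_keys(records))
best, _ = scores.most_common(1)[0]   # highest-scoring assembly
merged = prime_record(records, best)
```

In this sketch, step 9's second- and third-order records would simply be the next-ranked assemblies in `scores`, and step 10 (the self-learning component) is left out entirely.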

The initial results have been very positive. I am working on improving the algorithm and the computational technique.
