How a European consumer credit bureau assembles millions of entities

Estimated reading time: 14min

Use this demo tool to submit data into a database that is shared among all demo users.
You start with around 20 random entities. Start small by adding a single data set, e.g. John Smith in Newtown. After that you might want to submit a data pair, e.g. John Smith in Oldtown and in Newtown, and see that all three data points have been connected.
If you have lost one of your entities or if you come back at a later point, you can simply search for that entity again.
You may also share this link with your friends and watch while they submit data.

USE ONLY FAKE DATA! OTHERS CAN SEE WHAT YOU ENTER!

Technical challenge

In technical terms, the issue Regis24 was facing is called "entity resolution" or "record linkage": finding the data sets that belong to the same real-world entity - in this case a person.

Besides the relatively simple task of identifying a person based on their name, address and possibly other information, Regis24 also has to maintain a history of previous addresses and even previous names. This results in the challenge of connecting the data sets of, e.g., John Smith who moved from Hamburg to Munich, changed his surname due to marriage, moved to Berlin, and sometimes uses his home address and other times his work address.

Example Graph of John:

Requirements

  • Scalability: There must be no limit on the number of data sets that can be added to the database, and the database must handle an unlimited number of requests per second.
  • Query Performance: Searching for any person in the database must be fast enough for realtime applications.
  • Reliability: All entities and their links must be plausible and reproducible.
  • Completeness: All data sets must be stored in the database and be fully queryable. No separate database for additional data is required.
  • Consistency: Each stored entity must be consistent at all times.
  • Compliance: Must be compliant with the GDPR, e.g. data sets must be deletable.
  • Costs: Low costs for idle time, ideally pay per use.

Numbers

  • Number of Persons: ~45 Million
  • Number of Data Sets: ~180 Million
  • Average Response Time (Query): <150ms
  • Average Response Time (Mutation/Async): 50ms
  • Average Assembly Time: 500ms


How it works

The whole solution was built using only serverless AWS technologies, mainly Lambda, S3, DynamoDB and Kinesis. All Lambda functions have been implemented in Go (Golang).

The API is provided as GraphQL via AWS API Gateway and Lambda. Adding data to the database works by calling the submit mutation, which in a very simplified form looks like this:

{
  submit(
    new: {
      firstName: "John"
      surName: "Smith"
      address: {
        street: "Augustinerstr."
        houseNumber: "1"
        city: "München"
      }
    }
    old: {
      firstName: "John"
      surName: "Smith"
      address: {
        street: "Jungfernstieg"
        houseNumber: "7"
        city: "Hamburg"
      }
    }
  )
}

The example represents John's old address as well as his new one. Other submissions might omit the old data when it is not known; the database will still link the new data with existing data using the configured rules.

The invoked GraphQL Lambda then takes this data and puts it into a Kinesis stream, which serves as a buffer for unexpectedly high loads and, most of all, ensures that the data will eventually be added to the database. This is relevant not only when downstream services are unavailable, but also when it comes to concurrent writes to the same entity.
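
As a rough illustration, the forwarding step might look like the following sketch. It assumes the AWS SDK for Go v2 and an illustrative stream name; neither is confirmed here, and the real resolver certainly does more.

package submit

import (
    "context"
    "encoding/json"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/kinesis"
)

// Address, DataSet and Submission mirror the (simplified) GraphQL input above.
type Address struct {
    Street      string `json:"street"`
    HouseNumber string `json:"houseNumber"`
    City        string `json:"city"`
}

type DataSet struct {
    FirstName string  `json:"firstName"`
    SurName   string  `json:"surName"`
    Address   Address `json:"address"`
}

type Submission struct {
    New DataSet  `json:"new"`
    Old *DataSet `json:"old,omitempty"` // may be omitted when unknown
}

// forwardToKinesis buffers a submission in the Kinesis stream so that the
// assembly Lambda can pick it up later, even under high load or after failures.
func forwardToKinesis(ctx context.Context, client *kinesis.Client, sub Submission) error {
    payload, err := json.Marshal(sub)
    if err != nil {
        return err
    }
    _, err = client.PutRecord(ctx, &kinesis.PutRecordInput{
        StreamName:   aws.String("entity-submissions"), // assumed stream name
        Data:         payload,
        PartitionKey: aws.String(sub.New.SurName), // illustrative partition key choice
    })
    return err
}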

The next Lambda function, the assembly, is triggered as soon as an item is put into the Kinesis stream. This is where the data is added to the database, and the process can roughly be separated into the following steps:

  • matching
  • deduplication
  • updating
  • indexing

Indexing

Any entity resolution must compare data sets with potential matches in order to find the candidates that belong to one entity. In this solution the index is additionally used when performing a search and when identifying possible duplicates. The goal is always to have as few comparisons between data sets as possible. Ideally, each resulting block contains only the correct results. However, for fuzzy matching, e.g. when using the Levenshtein distance, this is not possible.

Consider the following two rules that will be used for matching:

  • R1: matching if: first name, surname and city are lowercase equal, OR
  • R2: matching if: city and street are lowercase equal and the name parts have the same metaphone value and a maximum Levenshtein distance of 1

Note: These are just example rules and not the real rules of Regis24.

Creating an index key for R1 is simple. For the old and the new data of the previous submission, the resulting index keys can be calculated as:

  • R1:john:smith:münchen
  • R1:john:smith:hamburg

Any data sets sharing the same index key match without any further comparison.

For R2 the index key cannot encode the Levenshtein distance, as that requires another data set to compare with. Ignoring the Levenshtein distance for a moment, the resulting index keys would be:

  • R2:münchen:augustinerstr.:JN:SM0
  • R2:hamburg:jungfernstieg:JN:SM0

Everything sharing the same index key is a potential match.
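
For illustration, building these keys could look like the sketch below. The roughMetaphone helper is only a crude stand-in for a real metaphone implementation, which would produce codes such as JN for John and SM0 for Smith.

package index

import (
    "fmt"
    "strings"
)

// r1Key builds the blocking key for rule R1: lowercase first name, surname and city.
// r1Key("John", "Smith", "München") -> "R1:john:smith:münchen"
func r1Key(firstName, surName, city string) string {
    return fmt.Sprintf("R1:%s:%s:%s",
        strings.ToLower(firstName), strings.ToLower(surName), strings.ToLower(city))
}

// r2Key builds the blocking key for rule R2: lowercase city and street plus the
// phonetic codes of the name parts. The Levenshtein part of R2 cannot be encoded
// in the key and is checked later against the candidates.
func r2Key(firstName, surName, street, city string) string {
    return fmt.Sprintf("R2:%s:%s:%s:%s",
        strings.ToLower(city), strings.ToLower(street),
        roughMetaphone(firstName), roughMetaphone(surName))
}

// roughMetaphone is deliberately crude: it keeps the first letter and drops the
// remaining vowels, e.g. "John" -> "JHN". A real metaphone library would be used instead.
func roughMetaphone(s string) string {
    runes := []rune(strings.ToUpper(s))
    if len(runes) == 0 {
        return ""
    }
    out := []rune{runes[0]}
    for _, r := range runes[1:] {
        if !strings.ContainsRune("AEIOU", r) {
            out = append(out, r)
        }
    }
    return string(out)
}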

The indexes are then stored in an AWS DynamoDB table, together with some additional data.

Assuming the ID aaa was generated for the new data and bbb for the old data - the actual implementation uses random UUIDs - the index table would contain the following values:

index_key                         data
R1:john:smith:münchen             {"aaa"}
R1:john:smith:hamburg             {"bbb"}
R2:münchen:augustinerstr.:JN:SM0  {"aaa:john:smith"}
R2:hamburg:jungfernstieg:JN:SM0   {"bbb:john:smith"}

The data column is a string set that contains at least the IDs of the corresponding data set(s). For R2 it also contains the first name and surname, which are used for the final comparison. This has the advantage that the actual data does not have to be loaded from somewhere else.

Furthermore, this table structure has the advantage that data can be added very easily, whether under a completely new index key or to an already existing one. Both cases can be handled with the same DynamoDB update request:

  • Key: {"index_key": {S: "R1:john:smith:münchen"}}
  • UpdateExpression: "ADD data :data"
  • ExpressionAttributeValues: {":data": {SS: ["aaa"]}}

If the item already exists, aaa is added to the string set; otherwise a new index entry is created under the provided key. Although indexing is actually the last step of the assembly, it is important to understand how it works before looking at the other steps.
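
Expressed with the AWS SDK for Go v2, that upsert could look roughly like the sketch below. The table name is an assumption; "data" is a reserved word in DynamoDB expressions, which is why it is aliased here.

package index

import (
    "context"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// addToIndex upserts an index entry: if the key already exists, the value is added
// to its string set, otherwise a new item is created.
// Example: addToIndex(ctx, client, "R1:john:smith:münchen", "aaa")
func addToIndex(ctx context.Context, db *dynamodb.Client, indexKey, value string) error {
    _, err := db.UpdateItem(ctx, &dynamodb.UpdateItemInput{
        TableName: aws.String("entity_index"), // assumed table name
        Key: map[string]types.AttributeValue{
            "index_key": &types.AttributeValueMemberS{Value: indexKey},
        },
        // "data" is reserved in DynamoDB expressions, hence the #d alias.
        UpdateExpression:         aws.String("ADD #d :d"),
        ExpressionAttributeNames: map[string]string{"#d": "data"},
        ExpressionAttributeValues: map[string]types.AttributeValue{
            ":d": &types.AttributeValueMemberSS{Value: []string{value}},
        },
    })
    return err
}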


Matching

Assuming the two previous data sets have already been indexed and now the following data set ccc needs to be added:

{
  submit(
    new: {
      firstName: "John"
      surName: "Smith"
      address: {
        street: "Hofgraben"
        houseNumber: "3a"
        city: "München"
      }
    }
  )
}

When searching for possible matches, the index keys according to R1 and R2 would be:

  • R1:john:smith:münchen
  • R2:münchen:hofgraben:JN:SM0

Searching with these keys in the index table results in exactly one entry, {"aaa"}, which currently contains only the single data set aaa. Because rule R1 does not require any further checks, the matching is already done here; the new data set just needs to be added to the existing entity.
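
Such a lookup is a plain key read on the index table. A minimal sketch with the AWS SDK for Go v2, again with an assumed table name:

package index

import (
    "context"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// lookupCandidates returns the string set stored under an index key, e.g.
// ["aaa"] for "R1:john:smith:münchen". An empty result means no candidates.
func lookupCandidates(ctx context.Context, db *dynamodb.Client, indexKey string) ([]string, error) {
    out, err := db.GetItem(ctx, &dynamodb.GetItemInput{
        TableName: aws.String("entity_index"), // assumed table name
        Key: map[string]types.AttributeValue{
            "index_key": &types.AttributeValueMemberS{Value: indexKey},
        },
    })
    if err != nil {
        return nil, err
    }
    set, ok := out.Item["data"].(*types.AttributeValueMemberSS)
    if !ok {
        return nil, nil // index key does not exist yet
    }
    return set.Value, nil
}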

For another data set ddd

{
  submit(
    new: {
      firstName: "Johnn"
      surName: "Smith"
      address: {
        street: "Augustinerstr."
        houseNumber: "11"
        city: "München"
      }
    }
  )
}

the resulting index keys would be

  • R1:johnn:smith:münchen
  • R2:münchen:augustinerstr.:JN:SM0

resulting in a possible match for R2 on {"bbb:john:smith"}. Since everything required for the additional Levenshtein check is now present, no further data needs to be loaded. The Levenshtein distance between John and Johnn is 1, and 0 between Smith and Smith, therefore this data set should also be added to the existing entity.
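
The final R2 filter can therefore be done entirely with the values stored in the index entry. A minimal sketch of that check:

package match

import "strings"

// r2Matches applies the fuzzy part of rule R2: both name parts must be within a
// Levenshtein distance of 1 (the metaphone equality is already guaranteed by the
// shared index key). r2Matches("John", "Smith", "Johnn", "Smith") -> true
func r2Matches(firstA, surA, firstB, surB string) bool {
    return levenshtein(strings.ToLower(firstA), strings.ToLower(firstB)) <= 1 &&
        levenshtein(strings.ToLower(surA), strings.ToLower(surB)) <= 1
}

// levenshtein computes the classic edit distance with two rolling rows.
func levenshtein(a, b string) int {
    ra, rb := []rune(a), []rune(b)
    prev := make([]int, len(rb)+1)
    curr := make([]int, len(rb)+1)
    for j := range prev {
        prev[j] = j
    }
    for i := 1; i <= len(ra); i++ {
        curr[0] = i
        for j := 1; j <= len(rb); j++ {
            cost := 1
            if ra[i-1] == rb[j-1] {
                cost = 0
            }
            curr[j] = min(prev[j]+1, min(curr[j-1]+1, prev[j-1]+cost))
        }
        prev, curr = curr, prev
    }
    return prev[len(rb)]
}

func min(a, b int) int {
    if a < b {
        return a
    }
    return b
}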

An entity can be represented as a graph comprising nodes (data sets) and edges (matches). There are two types of edges so far: rule based and fact based. Every edge found during the matching is rule based. Everything that has been provided from the client side is more or less a fact, e.g. the client may have provided the information that aaa and bbb represent a relocation.

This results in the following edges so far:

  • aaa:bbb:RELOCATION
  • aaa:ccc:R1
  • bbb:ddd:R2


Deduplication

Let's look at data set eee

{
  submit(
    new: {
      firstName: "John"
      surName: "Smith"
      address: {
        street: "Hofgraben"
        houseNumber: "3"
        city: "München"
      }
    }
  )
}

which is similar, but not identical (note the house number), to ccc - a non-identical duplicate. This is quite typical: a different order transaction, a different date or anything else that prevents one from simply ignoring the data set.

When trying to match the data set in the same way, it would result in these new edges:

  • aaa:eee:R1
  • ccc:eee:R1
  • ccc:eee:R2

This may look fine at first, but what if there were just four non-identical duplicates?

These are already twelve edges, or n*(n-1)/2 per rule. For a frequent online shopper with only 100 orders, this would already be 9,900 edges (100*99/2 = 4,950 per rule, times two rules)!

To prevent this quadratic growth of the edges, a new rule can be added:

D1: matching if: first name, surname, city and street are lowercase equal

The deduplication rule must cover all matching-relevant fields from R1 and R2. The assumption here is that these fields do not change as frequently as, e.g., an order number across different data sets.

Finding duplicates using D1 works the same way as finding matches: generate the index key, search with it and, where fuzzy matching is required, filter the candidates.

Data set eee would then be recognized as a duplicate of ccc. It might simply be stored as a single edge in the form ccc:eee:D1 or slightly optimized as in:

{
  "ccc": ["eee"]
}

The important part is that duplicates do not create edges and will not be indexed, because they do not provide new relevant data for matching or searching.

When the original of a duplicate is removed, one of its duplicates becomes the new original, and all indexes and edges must be updated to use the new ID.

For the frequent shopper example above, this would reduce the number of edges from 9,900 to 99, resulting in linear rather than quadratic growth.


Updating

When adding a new data set, there are three potential scenarios:

  • no matching data sets were found -> a new entity must be created
  • one or more matching data sets for the same entity were found -> the entity needs to be updated
  • matching data sets from multiple entities were found -> the entities must be merged into one entity

For this use case, each entity is stored in a single AWS S3 object. Each object uses JSON lines (jsonl). The first line contains a header and all other lines contain one data set each.

The content for the example entity looks something like:

{"entityID":"157009fd-65f3-4ec6-ac29-7f9f81cafa72","edges":["aaa:bbb:RELOCATION","aaa:ccc:R1","bbb:ddd:R2"],"duplicates":{"ccc":["eee"]}}
{"id":"aaa","firstName":"John","surName":"Smith","address":{"street":"Augustinerstr.","houseNumber":"1","city":"München"}}
{"id":"bbb","firstName":"John","surName":"Smith","address":{"street":"Jungfernstieg","houseNumber":"7","city":"Hamburg"}}
{"id":"ccc","firstName":"John","surName":"Smith","address":{"street":"Hofgraben","houseNumber":"3a","city":"München"}}
{"id":"ddd","firstName":"Johnn","surName":"Smith","address":{"street":"Augustinerstr.","houseNumber":"11","city":"München"}}
{"id":"eee","firstName":"John","surName":"Smith","address":{"street":"Hofgraben","houseNumber":"3","city":"München"}}

In addition to this object, two AWS DynamoDB tables are populated. The first is the entity table:

entity_id s3_location ...
157009fd-65f3-4ec6-ac29-7f9f81cafa72 1570/157009fd-65f3-4ec6-ac29-7f9f81cafa72_10ff295f-8f4a-41d3-b35f-2d556b7cb2d1 ...

This table holds the S3 location of the entity object and other metadata.

The data set table:

data_set_id  entity_id                             ...
aaa          157009fd-65f3-4ec6-ac29-7f9f81cafa72  ...
bbb          157009fd-65f3-4ec6-ac29-7f9f81cafa72  ...
ccc          157009fd-65f3-4ec6-ac29-7f9f81cafa72  ...
ddd          157009fd-65f3-4ec6-ac29-7f9f81cafa72  ...
eee          157009fd-65f3-4ec6-ac29-7f9f81cafa72  ...

When adding a data set for a new entity, the S3 object is created and both the entity table and the data set table are populated.

When adding a data set to an existing entity, a new S3 object is created (writing the header, copying the existing data sets and appending the new data set), a new entry is added to the data set table, the location in the entity table is updated and the old S3 object is removed.

When adding a data set that merges two entities, a surviving entity is selected, a new S3 object is created (writing the header, copying the existing data sets from both old objects and appending the new data set), a new entry is added to the data set table (pointing to the survivor), the location of the surviving entity is updated and both old S3 objects are removed.
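
The S3 part of these three scenarios boils down to writing the new object and removing the old one(s). A sketch with the AWS SDK for Go v2 and an assumed bucket name:

package entity

import (
    "bytes"
    "context"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/s3"
)

// rotateEntityObject writes the freshly assembled jsonl content under a new key and
// removes the previous object(s). oldKeys is empty for a new entity, holds one key
// for an update and two keys for a merge. Updating the s3_location in the entity
// table belongs between the write and the deletes.
func rotateEntityObject(ctx context.Context, client *s3.Client, newKey string, oldKeys []string, content []byte) error {
    _, err := client.PutObject(ctx, &s3.PutObjectInput{
        Bucket: aws.String("entity-store"), // assumed bucket name
        Key:    aws.String(newKey),
        Body:   bytes.NewReader(content),
    })
    if err != nil {
        return err
    }
    // ... update the s3_location in the entity table here ...
    for _, oldKey := range oldKeys {
        if _, err := client.DeleteObject(ctx, &s3.DeleteObjectInput{
            Bucket: aws.String("entity-store"),
            Key:    aws.String(oldKey),
        }); err != nil {
            return err
        }
    }
    return nil
}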


Searching

Searching works similarly to finding matches or duplicates:

  • create index keys for the provided search
  • find possible candidates and filter them
  • find the entries in the data set table for these candidates
  • find the S3 location(s) from the entity table
  • load the S3 object(s)
  • perform any kind of data selection or aggregation on the entity data sets

It is worth mentioning the advantage of this approach: the linear time complexity, which is already better than some of the alternatives, in practice comes very close to constant response times. Since the recognized duplicates have not been indexed, they do not show up as possible candidates. The database lookups and S3 GetObject calls for the few remaining candidates can easily be parallelized - something Go makes straightforward. Processing the entity's data sets, e.g. aggregating certain fields, can already start as soon as the second line of an S3 object has been downloaded. However, this last part obviously cannot be faster than O(n).
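
A sketch of that parallel, streaming read path; it assumes the AWS SDK for Go v2, golang.org/x/sync/errgroup and a process callback that is safe for concurrent use:

package search

import (
    "bufio"
    "context"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/s3"
    "golang.org/x/sync/errgroup"
)

// loadEntities downloads all candidate entity objects in parallel and hands every
// jsonl line to process as soon as it has been read, header line included. process
// must copy the line if it keeps it, because the scanner reuses its buffer.
func loadEntities(ctx context.Context, client *s3.Client, keys []string, process func(line []byte) error) error {
    g, ctx := errgroup.WithContext(ctx)
    for _, key := range keys {
        key := key // capture the loop variable (required before Go 1.22)
        g.Go(func() error {
            out, err := client.GetObject(ctx, &s3.GetObjectInput{
                Bucket: aws.String("entity-store"), // assumed bucket name
                Key:    aws.String(key),
            })
            if err != nil {
                return err
            }
            defer out.Body.Close()
            scanner := bufio.NewScanner(out.Body) // raise scanner.Buffer for very long lines
            for scanner.Scan() {
                if err := process(scanner.Bytes()); err != nil {
                    return err
                }
            }
            return scanner.Err()
        })
    }
    return g.Wait()
}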


Further Notes

Some parts were not yet covered, but they are important for the overall process.

Parallelization and Locking

Data may be added in parallel. Each entity must therefore be locked to prevent conflicting writes on the same entity (entity table or S3 object). This is done using an UpdateItem request instead of a GetItem request when loading the data from the entity table:

  • ConditionExpression: "attribute_not_exists(#lockConstraint) OR #lockConstraint=:lockConstraint"
  • UpdateExpression: "SET #lockConstraint=:lockConstraint"
  • ReturnValues: "ALL_NEW"

However, if multiple entities are involved in an update, it must be ensured that all locks have been successfully acquired. If that is not the case, all acquired locks must be released again to prevent deadlocks.
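
Acquiring such a lock could look like the sketch below. The lockConstraint attribute comes from the expressions above; the table name, the lock value and the error mapping are assumptions.

package lock

import (
    "context"
    "errors"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

var ErrAlreadyLocked = errors.New("entity is locked by another process")

// acquireLock loads an entity record and locks it in a single request. The condition
// also allows re-acquiring a lock with the same value, which keeps retries idempotent.
func acquireLock(ctx context.Context, db *dynamodb.Client, entityID, lockValue string) (map[string]types.AttributeValue, error) {
    out, err := db.UpdateItem(ctx, &dynamodb.UpdateItemInput{
        TableName: aws.String("entity"), // assumed table name
        Key: map[string]types.AttributeValue{
            "entity_id": &types.AttributeValueMemberS{Value: entityID},
        },
        ConditionExpression:      aws.String("attribute_not_exists(#lockConstraint) OR #lockConstraint = :lockConstraint"),
        UpdateExpression:         aws.String("SET #lockConstraint = :lockConstraint"),
        ExpressionAttributeNames: map[string]string{"#lockConstraint": "lockConstraint"},
        ExpressionAttributeValues: map[string]types.AttributeValue{
            ":lockConstraint": &types.AttributeValueMemberS{Value: lockValue},
        },
        ReturnValues: types.ReturnValueAllNew,
    })
    var condErr *types.ConditionalCheckFailedException
    if errors.As(err, &condErr) {
        return nil, ErrAlreadyLocked
    }
    if err != nil {
        return nil, err
    }
    return out.Attributes, nil
}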

Failure Handling and Execution Plan

Any kind of error, including failure to acquire locks, should simply result in letting the Lambda fail (after a reasonable number of retries or amount of time). Once the Lambda invocation has failed, the Lambda service will, thanks to the Kinesis event source, automatically retry the invocation with the same data.

In order to prevent inconsistent entity states, a full execution plan must be created before any data modification takes place. Each step of that plan represents a small idempotent action, e.g. an update in DynamoDB. The plan is stored in an S3 bucket before being executed. If the process fails, the plan will be loaded and replayed from the beginning on the next Lambda invocation. Because each step is idempotent and the involved entities are locked, this is safe.

To ensure a consistent state during the execution of the plan, the order of the steps is important: e.g. updating the S3 location must happen after the S3 object has been written, and adding an index should be one of the very last steps.
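
The shape of such a plan might look like the following sketch. The step kinds and field names are purely illustrative; only the overall idea - small idempotent steps, stored before execution, replayed strictly in order - follows the description above.

package plan

import (
    "context"
    "fmt"
)

// Step is one small idempotent action, e.g. "write S3 object X" or "add ID Y to
// index key Z". Kind and Params are what gets serialized and stored in S3 before
// the first step runs.
type Step struct {
    Kind   string            `json:"kind"`
    Params map[string]string `json:"params"`
}

// Plan is the full, ordered list of steps for one assembly run.
type Plan struct {
    PlanID string `json:"planID"`
    Steps  []Step `json:"steps"`
}

// Executor maps a step kind to the function performing it. Every function must be
// idempotent so the whole plan can safely be replayed from the beginning after a
// failed Lambda invocation.
type Executor map[string]func(ctx context.Context, params map[string]string) error

// Run executes the steps strictly in order, e.g. the S3 location update after the
// S3 object write, index additions last.
func (e Executor) Run(ctx context.Context, p Plan) error {
    for _, step := range p.Steps {
        fn, ok := e[step.Kind]
        if !ok {
            return fmt.Errorf("unknown step kind %q", step.Kind)
        }
        if err := fn(ctx, step.Params); err != nil {
            return err
        }
    }
    return nil
}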

Matching and Search Rules

To keep this description (relatively) simple, only two rules were used for both matching and searching. However, things start to get interesting when the rules for matching and the rules for searching are different.

Take your time and think about the consequences of, for example, defining a rule S1 that is similar to R1 but does not include the first name, and then using S1 exclusively for searching.



Alternatives

There are quite a few alternatives to the solution presented here, but none of them performs really well.

Offline Entity Matching

The traditional approach to entity matching would be to precalculate all matches up front. For small numbers of data sets this works quite well; larger numbers require some kind of blocking approach similar to the index keys used here. However, even the slightest change requires a complete recalculation of all matches.

Relational Database

Using a recursive search in a relational database would also be possible. But there is no optimization for non-identical duplicates, and jumping from one data set to the next to the next to the next ... takes its time.

Graph Database

Graph databases are great - but not for this use case. They are great at answering questions like "who are the friends of my friends" or "how is A connected with B", but they struggle with the question "what are all the nodes and edges that belong to a certain sub-graph". The reason is the same as with relational databases: they need to jump from node to node. Besides, the handling of non-identical duplicates must happen outside of the graph database.


Copyright 2021, Tilo