When you do this kind of probabilistic linkage,
you use a set of available attributes for linking.
That could be personal information, names, addresses, date of birth, and so on.
And then you calculate match weights for attributes.
So this might be a little bit confusing, because I just said that for exact
linkage, you could create a unique key out of that string.
That's right: if I make a string out of the entire
set of variables, that string is eventually unique.
Here, the notion is more that you could link on names, on addresses,
on birthdates, and each of them should have a different weight.
Because you assume every one of these variables could have an error.
So my name could be typed wrong, my birthdate could be wrong,
the birthplace could be wrong, the address could be wrong.
And so you wouldn't want all of them to go equally strongly into the probabilistic
record linkage algorithm, because you would assume, well,
a name change is less likely to happen, right?
Whereas an address change is maybe more likely, right?
So if I have the same name and birthdate but
two different addresses, then maybe the address shouldn't have as much weight.
Or, depending on the data, you might decide differently here, right?
Say you think, okay, typos in names happen quite frequently,
but in the database I have, addresses are always validated, and
therefore the address is good.
Then you might have a different rationale here, right?
So you have to think ahead of time about what
these weights should be for the particular variables.
And then eventually you sum over all of them and
you create a score that you use to decide whether to merge.
I'm not going to go into the statistical details here.
This is not the right course for that.
But it gives you a sense of what's behind this technique.
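
To make the idea of per-attribute weights and a summed score a bit more concrete, here is a minimal sketch. The lecture does not specify how the weights are computed, so this assumes a Fellegi-Sunter-style log-ratio weight, which is one common formulation. The m and u probabilities, the threshold, and the example records are all made-up values for illustration, and the comparison is a plain equality check rather than the fuzzy string comparison you would probably want in practice.

```python
import math

# Hypothetical agreement probabilities per attribute (made-up numbers):
# m = P(attribute agrees | the two records are a true match)
# u = P(attribute agrees | the two records are not a match)
M_U = {
    "name": (0.95, 0.01),
    "birthdate": (0.97, 0.003),
    "address": (0.70, 0.02),  # addresses change often, so m is lower
}

THRESHOLD = 5.0  # hypothetical cutoff above which we treat a pair as a link


def attribute_weight(field, agrees):
    """Log-ratio weight: positive on agreement, negative on disagreement."""
    m, u = M_U[field]
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_score(rec_a, rec_b):
    """Sum the weights over all attributes to get one score per record pair."""
    return sum(attribute_weight(f, rec_a[f] == rec_b[f]) for f in M_U)


a = {"name": "Ann Lee", "birthdate": "1980-02-01", "address": "1 Main St"}
b = {"name": "Ann Lee", "birthdate": "1980-02-01", "address": "9 Oak Ave"}
c = {"name": "Bob Roy", "birthdate": "1975-11-30", "address": "4 Elm Rd"}

score_ab = match_score(a, b)  # same name and birthdate, different address
score_ac = match_score(a, c)  # nothing agrees

print(score_ab > THRESHOLD, score_ac > THRESHOLD)
```

With these assumed numbers, a disagreeing address costs less than a disagreeing name, so two records with the same name and birthdate but different addresses can still score above the threshold.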
And then likewise, the predictive linkage approaches
are mostly driven by machine learning techniques.
There are a lot of papers by Bill Winkler from the US Census Bureau,
that you can retrieve from the US Census Bureau website,
on using these techniques for various data linkage endeavors.
And as always in machine learning, you can use both kinds of techniques,
supervised learning and unsupervised learning.
Supervised learning is basically like regression techniques.
You predict something using training data: you have a subset of your data set
where you already have a link and you know, okay, these really are true matches.
And then you try to learn from these training data
which variables are good predictors of these matches, and you use those.
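
As a rough sketch of the supervised idea, the example below learns from a handful of labeled record pairs which agreement patterns predict a true match. It assumes a plain logistic regression fitted by gradient descent, which is just one regression technique you could pick and not necessarily the one used in the papers mentioned. The training pairs, the learning rate, and the number of epochs are hypothetical.

```python
import math

# Hypothetical labeled training pairs. Features are 0/1 agreement flags for
# (name, birthdate, address); label 1 means the pair is a known true match.
TRAINING_PAIRS = [
    ((1, 1, 1), 1),
    ((1, 1, 0), 1),
    ((1, 1, 0), 1),
    ((1, 0, 1), 0),
    ((0, 1, 1), 0),
    ((0, 0, 1), 0),
    ((0, 0, 0), 0),
    ((1, 0, 0), 0),
    ((0, 1, 0), 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Predicted probability that a pair with these agreement flags is a match."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def train(pairs, learning_rate=0.5, epochs=5000):
    """Fit logistic regression weights with batch gradient descent."""
    weights = [0.0, 0.0, 0.0]
    bias = 0.0
    n = len(pairs)
    for _ in range(epochs):
        grad_w = [0.0, 0.0, 0.0]
        grad_b = 0.0
        for features, label in pairs:
            error = predict(weights, bias, features) - label
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


weights, bias = train(TRAINING_PAIRS)

# Score two unlabeled pairs with the learned model.
p_match = predict(weights, bias, (1, 1, 0))     # name and birthdate agree
p_nonmatch = predict(weights, bias, (0, 0, 1))  # only the address agrees

print(p_match > 0.5, p_nonmatch > 0.5)
```

In this toy training set the true matches are exactly the pairs where name and birthdate both agree, so the fitted model ends up relying on those two variables much more than on the address.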