Data Ambiguity in Names

Terry Brooks, lecturing in INFO498, made an excellent point in his brief discussion towards the end of class today: the standardization of names still troubles those who are pursuing the representation of people on the Semantic Web.  This is a larger data problem, though: how do you represent names when there are so many different ways of referring to a person?

For instance, the Library of Congress uses the format “<Last Name>, <First Name> <Middle Name>”, but might also end up using “<Last Name>, <First Name> <Middle Initial>.”  Friend of a Friend (FOAF), one of the RDF vocabularies used to represent information about people, might use “<First Name> <Last Name>” or “<First Name> <Middle Name> <Last Name>” or “<First Name> <Middle Initial> <Last Name>”.  How about academic citation formats?  APA, used by the social sciences, lists authors as “<Last Name>, <First Initial>.”

So what does it take to be comprehensive?  Let’s use the name of the main test dummy on one of my favorite shows, Mythbusters, to demonstrate.  He is only known as “Buster”, but we’ll expand his name to “Buster Dee Myth”.

  • First Name: Buster
  • First Initial: B.
  • Middle Name: Dee
  • Middle Initial: D.
  • Last Name (Surname): Myth
  • Last Initial: M.
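
Here is a rough sketch of how those six pieces can be stored once and rendered into the formats mentioned earlier.  The class name `PersonName` and its method names are made up for illustration, and the output strings follow the formats as described in this post, not the official rules of any cataloging or citation standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonName:
    """Hypothetical container for the three name parts listed above."""
    first: str
    middle: str
    last: str

    @staticmethod
    def initial(part: str) -> str:
        # An initial is derived data: first letter plus a period.
        return f"{part[0]}."

    def last_first_middle(self) -> str:
        # "<Last Name>, <First Name> <Middle Name>"
        return f"{self.last}, {self.first} {self.middle}"

    def last_first_middle_initial(self) -> str:
        # "<Last Name>, <First Name> <Middle Initial>."
        return f"{self.last}, {self.first} {self.initial(self.middle)}"

    def first_last(self) -> str:
        # "<First Name> <Last Name>"
        return f"{self.first} {self.last}"

    def first_middle_last(self) -> str:
        # "<First Name> <Middle Name> <Last Name>"
        return f"{self.first} {self.middle} {self.last}"

    def last_first_initial(self) -> str:
        # "<Last Name>, <First Initial>."
        return f"{self.last}, {self.initial(self.first)}"


buster = PersonName(first="Buster", middle="Dee", last="Myth")

for rendering in (
    buster.last_first_middle(),
    buster.last_first_middle_initial(),
    buster.first_last(),
    buster.first_middle_last(),
    buster.last_first_initial(),
):
    print(rendering)
```

Going from the parts to any one format is easy.  Going the other way is the hard part: given only “Myth, B.”, there is no way to recover “Buster” or “Dee”.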

Ah, but wait – what about titles (Doctor, Sir) or numerations (the Third)?  Expand his name to “Sir Buster Dee Myth II”:

  • Title: Sir
  • Numeration: II (see the problems here?)

And this is just names.  What if there are multiple people named “Sir Buster Dee Myth II” (hopefully not)?

  • Birth date (see the problems here?)
  • Death date (see the problems here?)

Pushing the envelope still further: what if there are multiple people named “Sir Buster Dee Myth II” who were born on the same day and died on the same day?  Okay, this seems a bit unlikely (unless they’re clones).  We’ll stop with the bulleted list there.  Is this a comprehensive representation of a person?  Does it represent everything we might need to know to identify a person as a single unique entity?  No.  What’s missing?

  • Birth place (see the problems here?)
  • Current location (see the problems here?  Is this category really necessary to uniquely isolate a single person?  No.)
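
One way to see why each extra field gets added is to treat the fields as a composite key and check which records still collide.  This is only a sketch: `PersonRecord`, `find_collisions`, and every date and place below are invented for the example, not real data about anyone.

```python
from collections import defaultdict
from typing import List, NamedTuple, Optional, Sequence


class PersonRecord(NamedTuple):
    """Hypothetical record; None means the value is unknown or not applicable."""
    name: str
    birth_date: Optional[str]
    death_date: Optional[str]
    birth_place: Optional[str]


def find_collisions(
    records: Sequence[PersonRecord], fields: Sequence[str]
) -> List[List[PersonRecord]]:
    """Group records by the chosen fields and return groups with more than one member."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(getattr(record, field) for field in fields)
        groups[key].append(record)
    return [group for group in groups.values() if len(group) > 1]


# Invented sample data: three people who share the same full name.
records = [
    PersonRecord("Sir Buster Dee Myth II", "2003-01-23", None, "San Francisco"),
    PersonRecord("Sir Buster Dee Myth II", "2003-01-23", None, "Seattle"),
    PersonRecord("Sir Buster Dee Myth II", "1999-05-05", None, "Seattle"),
]

by_name = find_collisions(records, ["name"])
by_name_and_birth = find_collisions(records, ["name", "birth_date"])
by_everything = find_collisions(
    records, ["name", "birth_date", "death_date", "birth_place"]
)

print(len(by_name[0]))            # 3 records share the name
print(len(by_name_and_birth[0]))  # 2 still share name and birth date
print(len(by_everything))         # 0 collisions once birth place is included
```

Note that the key only works if every field is written the same way in every record, which is exactly where the trouble starts.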

We set out to standardize names, but we run into other standardization problems: how do we represent numeration (the Third, III)?  Or dates (MM-DD-YY, MM-DD-YYYY, DD-MM-YYYY)?  Or locations (latitude/longitude, country, state, city, zip code)?
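
To make the numeration and date problems concrete, here is a small sketch that normalizes both.  The helper names (`normalize_numeration`, `to_iso`) and the short ordinal word table are assumptions made for this example.  Using ISO 8601 (YYYY-MM-DD) as the target date format is also my choice here, not something the formats above dictate.

```python
from datetime import datetime

# Assumed lookup tables, deliberately tiny; a real system would need far more.
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}
ORDINAL_WORDS = {"first": 1, "second": 2, "third": 3, "fourth": 4, "fifth": 5}


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral such as 'III' or 'IV' to an integer."""
    values = [ROMAN_VALUES[char] for char in numeral.upper()]
    total = 0
    for index, value in enumerate(values):
        # A smaller value before a larger one is subtracted (IV = 4).
        if index + 1 < len(values) and value < values[index + 1]:
            total -= value
        else:
            total += value
    return total


def normalize_numeration(text: str) -> int:
    """Map 'the Third', 'Third', or 'III' to the same integer."""
    word = text.strip().lower()
    if word.startswith("the "):
        word = word[4:].strip()
    if word in ORDINAL_WORDS:
        return ORDINAL_WORDS[word]
    return roman_to_int(word)  # raises KeyError on anything unrecognized


def to_iso(text: str, fmt: str) -> str:
    """Parse a date with an explicitly stated format and return YYYY-MM-DD."""
    return datetime.strptime(text, fmt).date().isoformat()


print(normalize_numeration("the Third"))  # 3
print(normalize_numeration("III"))        # 3

# The same string means two different days depending on the assumed format.
print(to_iso("03-04-2005", "%m-%d-%Y"))   # 2005-03-04
print(to_iso("03-04-2005", "%d-%m-%Y"))   # 2005-04-03
```

The last two lines are the whole problem in miniature: the code cannot guess the format, so whoever records the date has to state it.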

The point: standardizing names is not easy, because it requires more than simply the standardization of the name itself.  This doesn’t even consider the relationships between names and, say, works referring to that particular person, or works authored by that particular person, or jobs performed by that particular person, and the list goes on.  The idea of semantic data is to describe relationships and context (this is a bit of an oversimplification); each element must be carefully crafted in order for this to happen.
