Data Ambiguity in Names

Terry Brooks, lecturing in INFO498, made an excellent point in his brief discussion towards the end of class today: the standardization of names is something that still troubles those who are pursuing the representation of people on the Semantic Web.  This is a larger data problem, though – how do you represent names when there are so many different methods of referring to a person?

For instance, the Library of Congress uses the format of “<Last Name>, <First Name> <Middle Name>”, but might also end up using “<Last Name>, <First Name> <Middle Initial>.”.  Friend of a Friend (FOAF), one of the RDF vocabularies used to represent information about people, might use “<First Name> <Last Name>” or “<First Name> <Middle Name> <Last Name>” or “<First Name> <Middle Initial> <Last Name>”.  How about academic citation formats?  APA, used by the social sciences, lists as “<Last Name>, <First Initial>.”.

So what does it take to be comprehensive?  Let’s use the name of the main test dummy on one of my favorite shows, Mythbusters, to demonstrate.  He is only known as “Buster”, but we’ll expand his name to “Buster Dee Myth”.

  • First Name: Buster
  • First Initial: B.
  • Middle Name: Dee
  • Middle Initial: D.
  • Last Name (Surname): Myth
  • Last Initial: M.

Ah, but wait – what about titles (Doctor, Sir) or numerations (the Third)?  Expand his name to “Sir Buster Dee Myth II”:

  • Title: Sir
  • Numeration: II (see the problems here?)
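To make the components above concrete, here is a minimal Python sketch that stores each piece once and renders it in the formats described earlier. The class name `PersonName` and its method names are hypothetical, made up for this illustration, and the formats are simplified from the descriptions in this post rather than taken from any official cataloging or citation rules.

```python
from dataclasses import dataclass


@dataclass
class PersonName:
    """Hypothetical container for the name components listed above."""
    first: str
    last: str
    middle: str = ""
    title: str = ""
    numeration: str = ""

    def last_first(self, middle_initial=False):
        """Render '<Last Name>, <First Name> <Middle Name>' (or middle initial)."""
        middle = self.middle
        if middle and middle_initial:
            middle = middle[0] + "."
        given = " ".join(part for part in (self.first, middle) if part)
        return f"{self.last}, {given}"

    def first_last(self, include_middle=True):
        """Render '<First Name> <Middle Name> <Last Name>', middle optional."""
        parts = [self.first, self.middle if include_middle else "", self.last]
        return " ".join(part for part in parts if part)

    def citation(self):
        """Render '<Last Name>, <First Initial>.'"""
        return f"{self.last}, {self.first[0]}."

    def full(self):
        """Render every component, including title and numeration."""
        parts = [self.title, self.first, self.middle, self.last, self.numeration]
        return " ".join(part for part in parts if part)


buster = PersonName(first="Buster", middle="Dee", last="Myth",
                    title="Sir", numeration="II")
print(buster.last_first())   # Myth, Buster Dee
print(buster.first_last())   # Buster Dee Myth
print(buster.citation())     # Myth, B.
print(buster.full())         # Sir Buster Dee Myth II
```

Storing the parts separately makes every display format derivable, but it does nothing to settle how each part itself should be written (is the numeration “II” or “the Second”?).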

And this is just names.  What if there are multiple people named “Sir Buster Dee Myth II” (hopefully not)?

  • Birth date (see the problems here?)
  • Death date (see the problems here?)

Pushing the envelope still further: what if there are multiple people named “Sir Buster Dee Myth II” who were both born on the same day and died on the same day?  Okay, this seems a bit unlikely (unless they’re clones).  We’ll stop with the bulleted list there.  Is this a comprehensive representation of a person?  Does it represent everything we might need to know to identify a person as a single unique entity?  No.  What’s missing?

  • Birth place (see the problems here?)
  • Current location (see the problems here?  Is this category really necessary to uniquely isolate a single person?  No.)

We set out to standardize names, but we run into other standardization problems: how do we represent numeration (the Third, III)?  Or dates (MM-DD-YY, MM-DD-YYYY, DD-MM-YYYY)?  Or locations (latitude/longitude, country, state, city, zip code)?
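A small Python sketch shows why the date and numeration questions matter in practice. The sample date string and the `NUMERATION` lookup table are assumptions invented for this illustration; only `datetime.strptime` and `date.isoformat` come from the standard library.

```python
from datetime import datetime

# The same string is a different day depending on the assumed format.
raw = "03-04-2007"
as_month_first = datetime.strptime(raw, "%m-%d-%Y").date()
as_day_first = datetime.strptime(raw, "%d-%m-%Y").date()

# ISO 8601 (YYYY-MM-DD) removes the ambiguity once the date is known.
print(as_month_first.isoformat())  # 2007-03-04
print(as_day_first.isoformat())    # 2007-04-03

# Hypothetical lookup that maps the written forms of a numeration
# onto a single integer so that "III" and "the Third" compare equal.
NUMERATION = {
    "ii": 2, "the second": 2,
    "iii": 3, "the third": 3,
}


def normalize_numeration(text):
    """Return the integer for a known numeration, or None if unrecognized."""
    return NUMERATION.get(text.strip().lower())


print(normalize_numeration("III"))        # 3
print(normalize_numeration("the Third"))  # 3
```

The parser cannot tell which reading of the date was intended; that has to come from knowing which convention the source used, which is exactly the standardization problem.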

The point: standardizing names is not easy, because it requires more than simply the standardization of the name itself.  This doesn’t even consider the relationships between names and, say, works referring to that particular person, or works authored by that particular person, or jobs performed by that particular person — and the list goes on.  The idea of semantic data is to describe relationships and context (this is a bit of an oversimplification); each element must be carefully crafted in order for this to happen.

Career Goals

Even though this is posted on my internal wiki, I figured I’d post it here for posterity.

This document outlines my personal career goals as they currently stand, as well as related academic goals that inform these goals.

General Goals

  1. Apply my personal mantra, “everything is interconnected”, to information management and sustainability and understand how these fields infiltrate and influence everyday decisions.
  2. Work in a collaborative rather than an isolated environment.
  3. When possible, incite change. When impossible, make possible.

Academic Goals

  1. Serve as teaching assistant for an undergraduate course.
  2. Assist in the learning process of my fellow students; learning is not competition.

Topic-Specific Goals: Information Management

  1. Understand how information is ethically and professionally handled and embody these standards in my own work.
  2. Understand the paradigms behind information organization.
  3. Actively consider issues of information fragmentation, information overload, and information sustainability.
  4. Place human use of information first.
  5. Promote information accessibility.
  6. Participate in relevant national professional associations.

Topic-Specific Goals: Environmental Sustainability

  1. Significantly contribute to thinking and dialog about environmental sustainability and environmental policy.
  2. Understand the relationship between information and sustainable action.
  3. Promote corporate and public environmental stewardship.
  4. Recognize that sustainability is not achieved in a void. Promote cross-political and interdisciplinary sustainable initiatives.

    “I never saw a Democratic mountain or a Republican glacier.” – Daniel J. Evans

  5. Influence organizational thinking and action around sustainable ideals.

A Shift in Philosophy

People may or may not be aware that my work at Evergreen made one thing abundantly obvious: everything is interconnected. I’ve been living by this mantra for quite some time (indeed, since somewhere around my freshman year at Evergreen), but lately, I’ve come to realize that, while it’s certainly sufficient to recognize this, there’s an extra layer to this idea that I hadn’t quite recognized. There are two ways that I can state this, and I haven’t quite decided which one I prefer yet, since they are two distinct expressions of the same set of ideas:

Everything is interconnected, given a particular context.

Or:

Everything is interconnected; context is king.

The word “context” is something that is repeated almost ad nauseam in a lot of the work that I’ve done so far in the MSIM program. A lot of user interaction design work depends on the context in which a solution will be used. How things are categorized depends on the context of that information in relation to other facets. The context in which a question is asked can affect the results of that question. Management styles differ depending upon how managers choose to contextualize different information in their environments.

There is one major thing missing at this point as well that I’ve actually chosen not to attempt to integrate: the centrality of the user (or, less technically, of people) in information management. The reason for this is that it’s already recognized in my personal statement of my career goals (which has not been posted to this blog – it exists on my personal wiki).

So what’s the difference between these two potential statements? “given a particular context” implies restrictions or limitations on what connections can be formed, and suggests to me that those limitations may not be surmountable. On the other hand, “context is king” recognizes the original spirit of the mantra of “everything is interconnected” – that everything, somehow, connects to something else, context or not. It also recognizes that context plays a central role in our accumulation of knowledge and information.

Which one I end up choosing will depend heavily on which of these interpretations I feel is more central to my work.

Information Management According to ERIC

As part of a class assignment for IMT 530, I’ve had to use some of the subject indexing resources at Suzzallo Library on campus – one of them is the Thesaurus of ERIC Descriptors.  While I was doing my indexing work, I ran across the following definition of information management:

Management of the acquisition, organization, storage, retrieval, and dissemination of information–can combine such traditional organizational functions as data processing, telecommunications, records control, and user services.

Now if the iSchool could make it that clear :)

Judging the Complexity of Processes

I’ve been downloading and installing various pieces of software lately, particularly in connection with my work for INFO 498.  One of those pieces of software, oXygen XML Editor, requires a 30-day trial key, and getting that key requires a form to be completed.  I want to demonstrate what’s horribly wrong with the image below:

[Image: videodemo]

The problem?  The “watch this video demonstration” link.  Why is this a problem?

  1. If your registration process is so difficult as to require a video, your registration process badly needs to be rethought.
  2. Why the video is required is not obvious; the registration process on the web page itself is self-explanatory, since it’s just completing the form and pressing “Submit”.

But wait – what does the video cover?  If you watch it, it walks you through two different forms of registration – via the program itself and via the program’s web site.  Both registration forms to request the e-mail are very, very easy.  The hard part (well – more like “the confusing part”) of the registration actually comes after you complete the form and the e-mail arrives; there are nine lines of e-mailed text to copy and paste into a licensing dialog.  Why nine lines?  Why not one?

  • They failed to consider their audience.  This is an XML editor geared towards developers.  If developers don’t know how to complete a registration form and copy/paste into a text box, we’re all in serious, serious trouble.
  • They failed to simplify their information entry process.  Why the hell do we need nine lines of licensing information to paste into the program?
  • The video restates the same facts twice.  It presents registering via the program and then entering the registration information into the dialog box provided with the application, then presents it via the web site and doing exactly the same process with the application.  There’s no difference between the two methods other than the point of initiation of the request for the trial license.

What did they do right?

  • At least with the online form, they marked required fields.
  • They left the default value of the “Please send me news about upgrades, discounts and special promotions” checkbox unchecked.
  • They only asked for what they needed.  They didn’t request your mailing address, birthdate, or any of the other extraneous information that can make people suspicious of a company’s true intent with the information you provide when registering.

Research Conversation: Personal Information for A World As We Want It to Be

William Jones, one of the professors at the iSchool, gave a really interesting talk about the idea of personal information management and how to improve our ability to find the information we need. Jones is one of the lead researchers for the Keeping Found Things Found project, which is a project that I’ve had some interest in since I discovered it through my research on the iSchool itself.

Some notes from the presentation:

  • Why do we have folders?
    • From the audience: to organize data.
      • Why do we organize data?
        • To find/locate information.
    • As a quick reference into the materials we need.
    • As content metadata
  • Search on our own machines gives us the ability to get stuff the same way as on the Web, so why would there be resistance to this?
  • Audience member observation: There’s a difference between finding things and finding new stuff
  • Folders are a part of our interaction with data
  • Why do people use folders in so many diverse ways?
  • The Web is becoming an extension of ourselves (and of our personal information)
  • Capturing information is now very easy
  • Storage is now very cheap
  • Search makes retrieval of information easy (if it is properly indexed and if there’s some form of version control – search does no good if we’re looking for old versions of things we already have)
  • Information fragmentation – the idea that our information is now incredibly spread out – is a more recent problem than that of information overload, which has existed, one could argue, for centuries
  • Keeping Found Things Found project did three major studies:
    • How people keep information
    • How people re-locate information they have
    • How people organize their information
  • There is a lot of diversity in the way that people organize their information – why is this?
  • An audience member gave an example of using e-mail instead of favorites or bookmarks to manage their web sites. When asked why, they explained that they didn’t want their favorites list to get too long or unmanageable.
  • What about the recall of information? KFTF participants were given a list of information they had accessed 3+ months ago and asked to relocate it quickly using whatever method they wanted. They were only given five minutes for the task. After those five minutes, it was found that there was a 95% success rate in finding that information based on a list of particular conditions (what those conditions were wasn’t discussed in the talk). However, there were some issues with people trying to remember where that information was stored. It was also noted that “Do nothing” methods – where people had made no prior note as to where the information was located (methods like Google searching) – won out over bookmarks and most other methods of information search and retrieval.
  • Fourteen participants were asked to give a tour of their folder/information organization on their computers. For every single participant, there was something where they said “this shouldn’t be here”, and a small number even had to stop the demonstration to move the information to the correct location.
  • An idea Jones suggested was that old information should slowly fade from view – it doesn’t get deleted, it just isn’t visible.
  • It’s easier to pay the small cost of not being able to find things immediately than to pay the larger cost of having to reorganize or clean out our information resources.
  • An audience member noted that economics can play a big role in how information is organized, especially in a work environment – if we get paid to do things quickly, our information organizational structure had better make things easy to find!

Question of the Day

A question that sparked from one of my IMT510 readings (Fisher, Theories of Information Behavior, ASIST Monograph Series, chapter 30):

Research is also needed on how information needs are expressed and recognized as information grounds . . . and how they can be used to facilitate information flow, including how employers can alleviate the stressors of unemployment by helping laid-off employees establish or identify replacement information grounds that can facilitate the availability of information required during times of transition (pp. 188-9).

The question: can companies become more competitive or successful by supporting employees even when they aren’t employees of that company any longer?

Simple as Pen and Paper

I sat down for a job interview back in June 2006 with Robbie Cape, CEO of the then-unlaunched Cozi, housed in the Smith Tower in downtown Seattle.  The interview was for a software development position, and thus I was grilled by a couple of members of the Cozi team on writing software code (I don’t recall doing particularly well on this).  What I remember most, though, was talking with Robbie, who described his product thus: putting down the pen he was taking notes with, he fluttered the top page of his notebook a bit and said that he wanted his product to be as transparent as pen and paper.  Lofty goals, to be sure, but for some reason, that very image has stuck with me, and it’s haunted me quite a bit lately.  Part of the reason for this is the titles of IMT 510 and IMT 540 this year: Human Aspects of Information Systems and Design Methods for Interaction and Systems, respectively.

As I’ve read a lot of the readings that have been assigned, particularly for 540, the idea of user-centered design – that the software should be written to suit the user’s purposes, rather than the user adapting to the software’s purposes – has been at the forefront.  There are various approaches to this, of course, but the central idea is that users should not be forced to accept whatever decisions the developers have made for them without any input into the process.  Ease of use, it is said, cannot be achieved without involving people who are somehow affected by the software – to borrow phrases from Value Sensitive Design and Hosmer, the direct and indirect stakeholders.  This makes me think quite a bit about the pen and paper metaphor.  The fact is that pen and paper is only easy to use because we, as a society, make it so; for the longest time, it was quill and paper.  The next advancement in technology could very well make it stylus and “ePaper”, some sort of electronic device that is as thin as paper but that remembers everything we write on it by storing it within a very large internal memory.  But I digress – the point of design is to ensure transparency.

Can the simplicity of pen and paper ever truly be matched by a computer program or an information system?  An open question, since many are attempting to do this.  In reality, it likely is only what it is – a metaphor.  But what if it were doable?  What kind of world would we have then?