Classifying is hard, tagging is worse

For years, my work for Mondeca has been to help build classification schemes, ontologies and the like for a variety of customers. Most of the time this means formalizing implicit ontologies they already have in their data. Nor do I have to make any decision about actually populating the schemes; this task is left to human editors or automatic text mining engines. I sometimes take care of automatic migration of legacy content, but following rules decided with the customer. I'm very happy with all that, because I'm not good at classification. I tend to see so many subjects in anything, and any interesting resource to classify seems so multi-dimensional, that choosing a category always brings me to the brink of indecision, and any decision I eventually make about it always seems arbitrary. This maybe comes from an old traumatic experience as an Open Directory editor.

Sound familiar? I already hear the folksonomy people crying: "Hey, of course, that's why tagging is so cool". As far as I am concerned, tagging is worse. It means more arbitrary decisions, because not only do I have to choose a category, I can choose more than one, or none at all, and I have to figure them out myself. Way too many decisions ... That's why my browser bookmarks and email folders are a mess, why I have no del.icio.us account, why my Technorati profile is so low, etc ...

Beyond my own difficulties with decisions, there is something to be added to this now long discussion about ontologies vs tagging. What I've learnt in science is that a good theory is a falsifiable one. What you assert using an ontology, in whatever language or framework with declared formal semantics, is falsifiable. No formal semantics, no notion of true and false, hence no falsifiability. In other words, and to make it simple, an RDF assertion can be declared or inferred true or false vs a given ontology, an OWL class can be proven unsatisfiable etc. Nothing of the like with tags. Assignment of a tag cannot be proven true or false, or inconsistent. Tags are not falsifiable.
By the way, the same distinction is to be made for RDF vs Topic Maps. Topic Maps are not falsifiable, because they have no formal semantics. Now the question is whether falsifiability, which has been proven to be critical in science, is also critical in information technologies.
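
To make the contrast concrete, here is a toy sketch in Python. It assumes a hypothetical mini-ontology with a single disjointness axiom; the class names and the function are invented for illustration and stand in for what a real reasoner would do far more thoroughly.

```python
# Toy illustration only: one hypothetical axiom saying that nothing
# can be both a Person and a Document.
DISJOINT = {frozenset({"Person", "Document"})}

def is_consistent(types_by_resource):
    """Return False if any resource is typed with two disjoint classes."""
    for types in types_by_resource.values():
        for pair in DISJOINT:
            if pair <= types:  # both disjoint classes asserted together
                return False
    return True

# A set of typed assertions can be refuted against the axiom.
print(is_consistent({"ex:thing": {"Person", "Document"}}))
print(is_consistent({"ex:thing": {"Person"}}))
```

A bag of free tags has no axiom to be checked against, so there is nothing comparable to write for it, which is the point made above.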

That said, since the new Blogger version enables easy tagging (maybe the older version did also, but I was never aware of it), and since there is now quite a bunch of posts on univers immedia, I decided to be brave and start tagging them, as thoughtlessly as possible. Starting with the most recent ones, I then shifted to the oldest, a good occasion to revisit them if nothing else. The result you see on the left under "What". First impression is of course that there are too many of them, but I will try to keep up that way throughout the blog just to see how it flies, then maybe keep only the most frequent ones if I end up with too long a list.


OWL ontology for identity on the web

This paper by Valentina Presutti and Aldo Gangemi at SWAP2006 begins with a clear introduction to the issue of resource identity, and the ambiguity of the term resource itself. Then it goes on with a very smart OWL model attempting to articulate the various aspects of this concept. Maybe too smart and conceptual to become really popular, but interestingly enough, it goes against the popular Semantic Web assumption that URIs can actually identify "non-addressable things", and is rather in the line of leaving the referent entities outside of the identification framework.

The definitions of resource that can be found in literature show ambiguity, making the issue of handling the identification of a web resource very problematic.

Our approach restricts the nature of the web resource to that of a computational object. This choice is motivated by the fact that a resource is something that has to be addressable, and things like cars and people are not addressable for their nature. Hence, it is wrong in principle to use the same mechanism of addressing for entities that have such different sorts.

Migration to new Blogger version

I took the opportunity of this migration to change the layout template. The new one seems more readable, and anyway I was fed up with all those dots. I like this little architecture piece on the top left corner. Actually I think it's a lighthouse, since the template is called "Harbor", but it also looks to me like an observatory platform opening on the sky, "useful by its opening and central emptiness" ... well in the spirit of univers immedia.

The list of contributors does not show anymore, they have to do something about their Blogger account to be able to post again.

A couple of things I've been about lately

I've been silent here for over two months now, my blogging time devoted to the Mondeca blog in French Leçons de Choses. But there are a couple of things I've been working on, worth mentioning.

I've exchanged views with Michel Biezunski on his Data Projection Model, and found out that its genericity and simplicity made it easy and straightforward to express the structure of Mondeca ITM, without the borderline hacking needed when using either OWL-RDF or XTM for the same task. Now open questions: What will happen with that model? Who will see the benefits over languages already in this space, and especially over RDF? Who will build tools supporting it?

Been wondering if a semiotic approach could shed some light on our thoughts on referents, and came out with an RDF semiotic triangle. The URI is the signifier, the RDF description is the formalisation of the signified concept associated with the URI. The referent is out of the language and signs realm, and should stay there. In this approach, attempting to achieve a representation of the referent, even using tricks such as blank resources or hubjects of any kind, is therefore a recursive trap and actually nonsense. So any declaration of sameness or identity of referents should be avoided. Only concepts bear identity, not their referents. From that point on, came to the idea that linking different concepts/signs (URI + RDF description) which humans consider to have more or less similar referents will take the form of processing rules, more than declarative semantics.

Thanks to Jakob Voss for this post in a long thread on the public-esw-thes list, which really triggered a kind of illumination about this. As an example, trying to say that my SKOS concept a:Restaurant has the same referent as your OWL class b:Restaurant through any RDF declarative relation between those two resources should be avoided. But I can set in my system a functional rule expressing that any document whose subject is an instance of your b:Restaurant class will be indexed against my a:Restaurant concept. The referent is represented nowhere, but it is acting at the core of this rule.
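
Such a functional rule could look something like the following Python sketch. It is only an illustration under invented assumptions: the documents, the instances b:ChezPaul and b:Louvre, and the class b:Museum are hypothetical, and this is not how any Mondeca application actually implements it.

```python
# Hypothetical data: each document has one subject, each subject one class.
SUBJECT_OF = {"doc1": "b:ChezPaul", "doc2": "b:Louvre"}
TYPE_OF = {"b:ChezPaul": "b:Restaurant", "b:Louvre": "b:Museum"}

# The rule: documents about instances of the class on the left
# are indexed against the concept on the right.
RULES = {"b:Restaurant": "a:Restaurant"}

def index_documents(subject_of, type_of, rules):
    """Apply the indexing rules and return {concept: [documents]}."""
    index = {}
    for doc, subject in subject_of.items():
        concept = rules.get(type_of.get(subject))
        if concept is not None:
            index.setdefault(concept, []).append(doc)
    return index

print(index_documents(SUBJECT_OF, TYPE_OF, RULES))
```

Nothing in the data states that a:Restaurant and b:Restaurant are the same thing. The link exists only in the rule, which is a processing instruction rather than a declaration.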

Actually we have this very indexing rule mechanism working in some Mondeca applications, and I have submitted a paper to XTech 2007 about it. More to come if ever the paper is selected.

Lately, got interested again in triggering some process to have languages available not only as tags to use in XML, but as proper RDF resources. This is an old story going back to the OASIS Published Subjects Technical Committees, and in particular PSI for languages. Track this topic on ESW Wiki, and see here for ongoing thread and more explanations. There again, my proposal is to forget absolute identification of a language by a URI. Concepts identified by URI are the properties and property values that can be declared for a language, and applications decide which properties are useful to them. No absolute rule saying that two descriptions refer to the same language.


Leçons de Choses

My blog activity lately has been dedicated to launching the new Mondeca Blog - yes it's in French. Sorry about that.


Geonames enters the Semantic Web

I'm quite happy about this. The collaboration with Marc at geonames started as a simple deal: provide the ontology, said he, and I'll provide the Web Service. So did I, and so did he. This is a fantastic Web 2.0 place to play in. It already had Web Services and wiki using Google Maps, and now he has the SW interface ...


Back to Earth

I've been quite silent here lately. Put more focus on actually working stuff than theoretical musing, and in particular on various geo-unameit technologies such as geocoding, geotagging, geo microformat, and the collaborative databases using them like geonames, wikimapia ...
Places are the most fascinating use cases when it comes to identification and description, and all those technologies and their ability to interface with each other tend to prove that information about some place can be aggregated without any consensus or explicit declaration of what this place (or a place, in general) actually is, but by providing hubs between data, and interfaces putting together data somehow relevant to this place (such as Google Earth, Google Maps API and the like). Geonames web services provide a good example of this.


In defence of 404

Following the announcement on the Semantic Web list of the publication of RDF geographical codes by the French INSEE, a lively debate with Dan Connolly about access vs description.


More thoughts on that Blue Glass

John Black commented on my previous post, providing links to interesting pieces of thoughts. Anatomy of a Reference and Ambiguity and Identity are certainly in the line of what we try to untangle here.
Let's put those pieces together. John's blue glass, physically tagged with its URI, is a clear example of the difference between access and naming as explained by Pat Hayes. What is accessed using this http URI is an image of the blue glass, but the URI is intended to name the glass itself. Physically tagging the blue glass with a URI sticker or a bar code is something very close to what Pat calls ostention, but it's not the silver bullet to kill ambiguity.
As John clearly points out, when you stick a tag somewhere in the physical world, you still need interpretation to know which part of the world is tagged this way, since the world is not "naturally" divided, things and their limits are always the result of some mental operation of division. So a common interpretation of the tag needs a common understanding of the limits of things, and having a common understanding of classes of things, and which kind of tags you put on which class of things is really useful. This is something well known in geographical maps. On ill-designed maps, names are not printed properly, and it's often unclear what (town, river, mountain ...) is named. In well-designed maps, there are a lot of implicit or explicit disambiguation tricks, such as fonts, colors, position of the name vs the named feature etc. One has to know all that to make sense of the map (actually many people are not able to make sense of a map properly). When you come to the ground, you have similar physical tags indicating towns, rivers, streets and house numbers, and you also need similar knowledge to construe those tags in context: this kind of sticker is for a town, that kind is for a street, and so on. And even so, what is the limit of a town, a street, a valley or a mountain remains basically undecidable.

[2016-06-20] John Black's Kashori archives have moved. The reference articles are now here:


Ambiguity, Ostention and Description

Pat Hayes' In Defence of Ambiguity, presented at IRW 2006, has been on my desktop for weeks, and I take it as the most challenging food for thought currently available about the Semantic Web. If you have not yet read it, you should now.
There is only one point on which I would argue. Pat holds that reference can be made by ostention (gesticulation showing what you are about) or description. All Pat writes thereafter about description being inherently ambiguous, I strongly agree with : disambiguation being a contextual process, the more precise the description, the more ambiguity you get, and so on.
But I would hold that ostention is as ambiguous as description, so that reference is ambiguous in nature whatever the way it's done.
Suppose I am holding a book and ask you : "Have you read this?". The reference to "this" is by ostention, since I seem to hold and show "this". But the "ostentatum" indicated by "this" is actually some copy of some edition of some book. Does "this" refer to this specific copy, which happens to be my own personal copy (maybe annotated in some way), or is the referent the particular edition of which this specific copy is a sample, or is it the abstract entity, the book independent of any physical support, of which what I am currently holding happens to be some physical avatar? Every one of those interpretations is meaningful, and only the context of the conversation might disambiguate. So even with ostention, there is ambiguity left.


Wikipedia's semantic cow paths

I've been quite silent here on this blog for two months, and meanwhile resumed a bit of my Wikipedian activity lately. Although I'd been an enthusiastic early adopter of Wikipedia back in 2001, I have a poor and episodic editing history so far. But every time I've been coming back to the editor dashboard after months or years of inactivity, I've been amazed by the tremendous qualitative growth of the toolkit made available to users. Wikipedia's growth has been stressed again and again in terms of quantity and quality of articles, languages, editors, popularity as a reference resource etc. But what has not been stressed enough is the parallel growth in terms of features supporting better search and editing. And many of those features are in fact adding a quality of information which makes it ready for semantic parsing, and easy RDF re-writing. The basis of it all is a sound use of URIs, names and namespace.
For example http://en.wikipedia.org/wiki/Volcano defines without ambiguity the unique page dedicated to this geological feature, while http://en.wikipedia.org/wiki/Volcano_(disambiguation) is a hub to potential homonyms such as http://en.wikipedia.org/wiki/Volcano_(film). Links are provided to similar resources in other languages such as http://fr.wikipedia.org/wiki/Volcan. One can reasonably use any of those URIs to identify the concept "Volcano". But what about a class Volcano? Here come Wikipedia categories. http://en.wikipedia.org/wiki/Category:Active_volcanoes is ready-made for a class, and parsing the page will easily give you the list of instances, with a link to the full description page. This description itself is formatted using templates such as "infoboxes", so that a page in a category "Active Volcanoes" will yield standard properties in a standard format. It is easy to turn this into a database and, if one interprets the infobox elements as so many properties, to turn it into an RDF description, and the infobox structure itself into an RDFS or OWL description of the matching class.
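
As a rough sketch of that last step, the Python below turns an already parsed infobox into triples. Everything in it is assumed for illustration: the function name, the ex: namespace, the page ex:SomeVolcano and its field values are invented, and actually fetching and parsing a Wikipedia page is left out.

```python
def infobox_to_triples(page_uri, category_uri, infobox, ns="ex:"):
    """Turn parsed infobox fields into (subject, predicate, object) triples."""
    # The category plays the role of the class of the page's subject.
    triples = [(page_uri, "rdf:type", category_uri)]
    for field, value in sorted(infobox.items()):
        # Each infobox field name becomes a property in a made-up namespace.
        predicate = ns + field.strip().lower().replace(" ", "_")
        triples.append((page_uri, predicate, value))
    return triples

# Invented example values, not real data about any volcano.
sample = {"Elevation": "1000 m", "Last eruption": "1900"}
for triple in infobox_to_triples("ex:SomeVolcano", "ex:ActiveVolcano", sample):
    print(triple)
```

The set of predicates collected over all pages of a category would then give a first draft of the properties of the matching class.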
What should we learn from that? That from a collaborative and mostly non-directed process, cow paths are emerging which look more and more like semantic markup, ready to be spidered by smart parsers and tools, either to improve Wikipedia content itself (there are already a bunch of bots and agents doing that), or to extract from this amazing knowledge base any kind of structured data in whatever format, including implicit ontology, like structure of categories, attributes used, and the like. And my hunch is that this process will quickly be much more effective for Semantic Web building than many costly academic ontologies nobody will ever use.

[2010-04-08] : What is described here is exactly what DBpedia started in 2007


More use cases for nondescript resources

Nondescript is a funny word. Quite sure I did not catch its correct meaning when I first heard it. It has no real equivalent in French. Indescriptible is too strong, it refers to something too extraordinary to be described, whereas nondescript is actually too ordinary to deserve a proper description. Or is it that any description would miss the point? The free online dictionary says: Lacking distinctive qualities; having no individual character or form. Maybe in fact nondescripts remain such because those features which make their individuality are too subtle to be captured, or this capture would be useless.
Anyway, let's take them as they are : no description. What does a nondescript look like in RDF? Well, easy, it's a resource with no description. No type, no identity, no properties, certainly not even worth a URI. A blank node with no property whatsoever. Rings a bell now? After the previous post, I've come out with more use cases for those nondescript resources, and how natural language uses them as binding points. Examples

John walks through the neighborhood where Ann is working.

:John :walksThrough _:n
:Ann :worksIn _:n

In The Rule of Four (currently reading) Vincent and Richard have different theses about Hypnerotomachia Poliphili. Paul prefers Richard's thesis.

:Vincent :hasThesis _:tV
:Richard :hasThesis _:tR
:Paul :prefers _:tR

Now the "Class or Concept" example comes naturally as just another case

a:Restaurant a owl:Class
b:Restaurant a skos:Concept

a:Restaurant :represents _:x
b:Restaurant :represents _:x
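
The binding role of those blank nodes can be shown with a small Python sketch over the first two examples. Triples are plain tuples here and the function name is invented; it is a toy, not an RDF library.

```python
from collections import defaultdict

TRIPLES = [
    (":John", ":walksThrough", "_:n"),
    (":Ann", ":worksIn", "_:n"),
    (":Vincent", ":hasThesis", "_:tV"),
    (":Richard", ":hasThesis", "_:tR"),
    (":Paul", ":prefers", "_:tR"),
]

def bound_by_blank_node(triples):
    """Group subjects by the blank node they point to; keep nodes shared by 2+."""
    groups = defaultdict(list)
    for subject, _predicate, obj in triples:
        if obj.startswith("_:"):
            groups[obj].append(subject)
    return {node: subs for node, subs in groups.items() if len(subs) > 1}

print(bound_by_blank_node(TRIPLES))
```

John and Ann end up bound through _:n, Richard and Paul through _:tR, while nothing at all is said about the neighborhood or the thesis themselves.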

Cloud hidden, whereabouts unknown. (Similar stuff in the early '70s)


Identifying things - blank nodes again

Still trying to figure out if the hubject seeds are likely to grow somewhere, so I dropped one today on Danny Ayers' blog, as a comment on a post itself commenting on another one from Jon Udell which introduces the cool notion of collaborative aliasing. "What a concept!" Jack will say. And actually, it's no more, no less than the idea that hubjects can be created out of aggregation of resources being "about the same thing". An aliasing service, which really looks like tagging.
So my suggestion again is here to use blank tagging, that is, allow users, in a simple way, to make all those resources point to the same blank node.
Now something is slowly coming from the back of my mind. I thought for a while we needed a specific and mysterious vocabulary to do that, hubjects and the like. Since this kind of stuff is far from being on the track of adoption, maybe using more popular and less exotic vocabulary, such as dc:subject or something similar, would make the whole thing more understandable. There seems to be no formal objection to declaring things like:

http://www.amazon.com/gp/product/B00006RCLH dc:subject _:b
http://labs.oclc.org/xisbn/068981836X dc:subject _:b
http://en.wikipedia.org/wiki/Call_of_the_Wild dc:subject _:b

And actually, any other property could be used as well, such as the following, to take the example from Jon Udell's post
http://upcoming.org/venue/3669/ a:venue _:x
http://eventful.com/venues/V0-001-000150985-3 a:venue _:x

This kind of declaration stays completely agnostic about what a venue in general, and this particular one, actually is. It simply says that the two resources are about the same one.
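
A blank tagging service of this kind might be sketched as follows in Python. The class and its methods are hypothetical, not an existing service or API; the blank node is never named, it only exists as the group that resources fall into.

```python
class AliasService:
    """Hypothetical 'blank tagging' service: resources declared to be about
    the same thing end up attached to the same anonymous node."""

    def __init__(self):
        self._parent = {}

    def _find(self, url):
        # Follow links up to the representative of the group (union-find).
        self._parent.setdefault(url, url)
        while self._parent[url] != url:
            self._parent[url] = self._parent[self._parent[url]]
            url = self._parent[url]
        return url

    def same_subject(self, url_a, url_b):
        """Record that two resources are about the same (undescribed) thing."""
        self._parent[self._find(url_a)] = self._find(url_b)

    def aliases(self, url):
        """All known resources sharing a blank node with the given one."""
        root = self._find(url)
        return sorted(u for u in list(self._parent) if self._find(u) == root)

service = AliasService()
service.same_subject("http://upcoming.org/venue/3669/",
                     "http://eventful.com/venues/V0-001-000150985-3")
print(service.aliases("http://upcoming.org/venue/3669/"))
```

Users only ever say "these two are about the same one"; the service never has to state what that one is.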


Identity vs Meaning

Under this really boring title "The Semantics Are Important", Seth Ladd in his Semergence blog is making a really good point :
The identity is singular. The meaning is relative.
In other words, identity and identifiers can be shared, global, universal, whereas semantics/meaning, such as expressed in a particular RDF graph, is local, relative, context-bound, perspective-defined. And therefore multiple, orthogonal, non-compatible, globally inconsistent.
I like it more and more. This goes along the same lines as Pat Hayes' recent post, and raises again the question of how to deal with context. I'm not sure now, munching over Pat's arguments, that the context always needs to be made explicit. I've been working these days on SKOS used to express simplified views of hierarchies (of any kind) in an OWL ontology. In some OWL ontology, one would find
a:SomeRegion a:partOf a:SomeCountry
In a simplified SKOS view
a:SomeRegion skos:broader a:SomeCountry
Reasoners and RDF stores are happy with the OWL version, search engines, taxonomy managers and the like are happy with the SKOS version. So one perspective per kind of tool or application. What would not make sense would be to merge them, and entail that a:SomeRegion is at the same time an instance of a:Region and skos:Concept. It is not at the same time, it is one in some application context, and the other one in another context. No problem with that. So what is a:SomeRegion in essence, to use this arrogant word I saw passing in the previous post? Well, it's neti, neti, neither this, nor that. No big deal. Who cares?
Asked this question about a week ago on the SKOS forum. An astounding silence has been the answer so far ...
[2013-08-01] : Still waiting for an answer ... more than 7 years later, the issue is still open.
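
The idea of one perspective per application, never merged, can be sketched in Python by keeping each view of a:SomeRegion in its own named context. The context names "owl" and "skos" and the helper function are assumptions made for the example.

```python
# Each context (perspective) holds its own triples; they are never merged.
CONTEXTS = {
    "owl": [("a:SomeRegion", "rdf:type", "a:Region"),
            ("a:SomeRegion", "a:partOf", "a:SomeCountry")],
    "skos": [("a:SomeRegion", "rdf:type", "skos:Concept"),
             ("a:SomeRegion", "skos:broader", "a:SomeCountry")],
}

def types_of(resource, context):
    """Types asserted for a resource within one named context only."""
    return {o for s, p, o in CONTEXTS[context]
            if s == resource and p == "rdf:type"}

print(types_of("a:SomeRegion", "owl"))
print(types_of("a:SomeRegion", "skos"))
```

The identifier is shared across both contexts, but a question about what a:SomeRegion is only gets an answer relative to one of them.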


Identity -- some philosophical musings

Here are some interesting words taken from the link. Follow the link for more. Following, I'll sketch what I am thinking. This is a bit disjoint, but, I think, necessary. It seems that our topic maps are becoming sophisticated enough that we are now able to push the boundaries of subject identities that motivated Bernard to start this blog in the first place. Maybe someone else, or something else (hubjects?) will help resolve some things dealing with subject identity. Possibly longish.

Crystals appear (on the scene of Reality) -- just like organisms -- always as individuals. Such an individual has a definite Identity that remains constant during its existence. It is, say, A, it is not B, not C, etc. A developing crystal of Salt (growing in a solution) can change its shape while its Identity remains the same. For organisms this applies even stronger. We ourselves (being an organism) seem to have direct experience of our Identity staying the same during all of our life in spite of the fact of the many changes we constantly undergo. Some insects undergo a strong metamorphosis (for example from caterpillar to butterfly) but nevertheless their Identity stays the same. So it seems for every entity, which is an intrinsic whole, that there is something that remains the same, and something else not remaining the same, but always changing. In Philosophy such changes are called "accidental" or "per accidens" in relation to the persistent Identity. This Identity is called the "intrinsic Essence" of the thing, so every real uniform being has such an Essence.


But what then is this Essence?
Where does it abide?
Does it abide outside the thing (as Plato assumed), or inside the thing (as his famous pupil Aristotle assumed)?
And if the Essence is located inside the thing (meaning that the Essence of every being abides in "our world", and not in some external immaterial world transcending the material world), which I consider the most probable position, where in the thing is it located and in what way? Could this Essence be a concrete part of the thing, the "heart" or "soul" of the thing, which implies that the Essence itself would also be a thing (and this thing should of course also have an Essence of its own........Oh my god, where are we going???), or is it in the thing in an abstract way (whatever that means), like a principle?

A background sketch: together with Joshua Levy, I am building a subject mapped social bookmarking application. We call it Tagomizer (tm). It's being fun. But, it's also causing (moi) brain pain. What is a subject? Let me translate. Someone bookmarks a webpage. This means that the URL of that page, and the page title, are sent to Tagomizer, which then paints a form in which the user can add tags (words or phrases for now, images and other objects later), and a body of text taken as a comment. A user can come in later and add more comments or more tags, or remove tags. Tags are a large part of Web 2.0, where folksonomies are breaking out everywhere.

What is a subject? When Tagomizer creates a bookmark, it creates several subject proxies in the subject map where those objects don't already exist. Tagomizer is a kind of TMA (topic maps application -- or SMA in the newspeak of the TMRM), so it is responsible for identification of its subjects, some of which might already have subject identity granted by other TMAs. What, then, is a subject? Consider the webpage itself. Tagomizer asks the core TMA to create a subject proxy for a webpage with a given URL. If that subject proxy already exists, it is returned. Otherwise, a new one is created and granted subject identity by way of a PSI associated with the core TMA. Tagomizer, as a different TMA then grants that subject proxy subject identity with a different PSI, one that says "this is a subject identified by Tagomizer." Other TMAs might grant an SIP (psi) of their own. This is necessary because each individual TMA will be adding other properties to the proxy, mostly assertions.
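
The get-or-create behaviour described above can be sketched in Python as follows. This is not Tagomizer's code: the class, the method and the PSI strings are invented, and a proxy is reduced to a dictionary holding a URL and a set of PSIs.

```python
class SubjectMap:
    """Hypothetical store of subject proxies keyed by resource URL."""

    def __init__(self):
        self._proxies = {}

    def get_or_create_proxy(self, url, psi):
        """Return the proxy for url, creating it if needed, and record
        the PSI granted by the calling application."""
        proxy = self._proxies.setdefault(url, {"url": url, "psis": set()})
        proxy["psis"].add(psi)
        return proxy

subject_map = SubjectMap()
core = subject_map.get_or_create_proxy("http://example.org/page", "psi:core")
tago = subject_map.get_or_create_proxy("http://example.org/page", "psi:tagomizer")
print(core is tago, sorted(tago["psis"]))
```

Both applications get the same proxy back, and each leaves its own identity marker on it before adding further properties.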

So, a webpage has granted to it subject identity. What is the subject? In this case, subject identity has been granted to a particular resource, a webpage. Nothing more than that. The resource exists, it is located on the web at a particular URL, and it has been granted subject identity based on that URL by one or more TMAs. Each TMA is going to confer other properties on that subject. We know from nothing about the subject itself other than those properties of location and object type. What is contained/presented at that webpage will be the subject(s) of other subject proxies, for which that resource becomes an instance of an occurrence.

Brain pain, for me (warning: admission of ignorance forthcoming), stems from notions of essence. Essence is mentioned in the quote above as an intrinsic issue. Now, we're deep into the same issues that come up from time to time in the OODB community, intrinsic vs. extrinsic properties. There's an interesting thread on web resource identity, not dissimilar to Bernard's previous post on URI ambiguity. That xml-dev thread starts here.
Intrinsic-extrinsic properties are discussed here.

Closure? Is closure possible? I post this because I am interested in looking for consensus reality related to interoperable ways in which subject identity can/should be conferred on the subjects of future topic/subject maps. My sense is that the inquiry I reveal in this post represents the, um, essence of this entire blog and of Bernard's inquiry. I'll take my answers anywhere I can find them.


Pat Hayes on URI ambiguity

In this excellent thread on the SWBPD list, Pat Hayes makes a good point again. URIs are ambiguous identifiers, and can be used for more than one referent, hence bearing different or conflicting semantics in different declaration or processing contexts. Just like any name.
Now, what I pushed lately here and there is that different URIs can have the same referent, but describe it differently from different perspectives. Hubjects are formalizations of referents as binding blank nodes. Those two points seem dual and complementary.
Where I differ with Pat is when he says that context, which indeed provides disambiguation of the referent, and is implicitly defined by rules of a given community, a given protocol, should stay this way. In other words, it can't, or should not, be formalized as an object, just like in ordinary conversation: the context is effective without being explicit. I think context (or perspective) could and should be declared formally ...