UMBC CMSC 491/691: 2009

Sunday, March 15, 2009

Ontology Summit 2009: Toward Ontology-based Standards

A two-day event, Ontology Summit 2009: Toward Ontology-based Standards, will be held 6-7 April 2009 at NIST in Gaithersburg, MD. The Summit is co-organized by NIST and a number of other organizations.

"This summit will address the intersection of two active communities, namely the technical standards world, and the community of ontology and semantic technologies. This intersection is long overdue because each has much to offer the other. Ontologies represent the best efforts of the technical community to unambiguously capture the definitions and interrelationships of concepts in a variety of domains. Standards -- specifically information standards -- are intended to provide unambiguous specifications of information, for the purpose of error-free access and exchange. If the standards community is indeed serious about specifying such information unambiguously to the best of its ability, then the use of ontologies as the vehicle for such specifications is the logical choice. Conversely, the standards world can provide a large market for the industrial use of ontologies, since ontologies are explicitly focused on the precise representation of information. This will be a boost to worldwide recognition of the utility and power of ontological models. The goal of this Ontology Summit 2009 is to articulate the power of synergizing these two communities in the form of a communique in which a number of concrete challenges can be laid out. These challenges could serve as a roadmap that will galvanize both communities and bring this promising technical area to the attention of others."

The meeting is free, but advance registration is required. You can also register to participate remotely.

Saturday, March 14, 2009

Video from Tim Berners-Lee 2009 TED talk on linked data

Here is the video of the talk that Tim Berners-Lee gave at the TED2009 conference on linked data.

You can see the slides that TBL used on the W3C site.

I may have missed it, but I don't think he used the phrase "Semantic Web" once during the 16-minute talk.

Monday, March 2, 2009

Ian Davis code{4}lib keynote: data outlasts code

Ian Davis, CTO of Talis, posted the slides from his code4lib2009 keynote talk on slideshare. If you love something... set it free gives a very nicely done description of the motivation behind and hopes for the Semantic Web.

Code{4}lib is a conference series and community focused on the intersection of libraries, technology, and the future. code4lib2009 was held this week in Providence, hosted by the Brown University Library.

Ian's talk contained three conjectures, the first of which I especially liked:

Conjecture 1: Data outlasts code
Conjecture 2: There is more structured data in the world than unstructured
Conjecture 3: Most of the value in our data will be unexpected and unintended

(h/t Danny Ayers)


Tuesday, February 17, 2009

Tim Berners-Lee's map of the Web world

Back in 2007, Tim Berners-Lee made this amusing map. It might be a useful reference as you make your way from the World Wide Web toward the Sea of Interoperability while trying to avoid some of the dark and dangerous places where giants and dragons are said to dwell.

Thursday, February 12, 2009

Yahoo BOSS exposes structured data in RDF

This could be a big step toward the "web of data" vision of the Semantic Web.

Yahoo announced (Accessing Structured Data using BOSS) that their BOSS (Build your Own Search System) service will now support structured data, including RDF.

"Yahoo! Search BOSS provides access to structured data acquired through SearchMonkey. Currently, we are only exposing data that has been semantically marked up and subsequently acquired by the Yahoo! Web Crawler. In the near future, we will also expose structured data shared with us in SearchMonkey data feeds. In both cases, we will respect site owner requests to opt-out of structured data sharing through BOSS."

[Image: Yahoo's BOSS to support RDF data]

Here's how it works:

Sites use microformats or RDF (encoded using RDFa or eRDF) to add structured data to their pages
Yahoo's web crawler encounters embedded markup and indexes the structured data along with the unstructured text
A BOSS developer specifies "view=searchmonkey_rdf" or "view=searchmonkey_feed" in API requests
BOSS's response returns the structured data via either XML or JSON
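
The request in the third step might look roughly like the sketch below. Only the two `view` values come from Yahoo's announcement; the endpoint URL, the `appid` and `format` parameter names, and the helper `boss_url` are assumptions made for illustration, so check the BOSS documentation before relying on them.

```python
from urllib.parse import quote, urlencode

# Assumed endpoint, not taken from the announcement quoted above.
BOSS_BASE = "http://boss.yahooapis.com/ysearch/web/v1/"

# The two structured-data views named in the announcement.
VIEWS = ("searchmonkey_rdf", "searchmonkey_feed")


def boss_url(query, appid, view="searchmonkey_rdf", fmt="xml"):
    """Build a search URL that asks BOSS to include structured data.

    `appid` and `format` are assumed parameter names.
    """
    if view not in VIEWS:
        raise ValueError("unsupported view: " + view)
    if fmt not in ("xml", "json"):
        raise ValueError("unsupported format: " + fmt)
    params = urlencode({"appid": appid, "format": fmt, "view": view})
    # The query is percent-encoded and placed in the URL path.
    return BOSS_BASE + quote(query, safe="") + "?" + params


if __name__ == "__main__":
    print(boss_url("semantic web", "YOUR_APP_ID"))
```

The sketch only builds the URL; fetching it and parsing the XML or JSON response is left out.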

Yahoo's SearchMonkey only acquires structured data using certain microformats or RDF vocabularies. The microformats supported are hAtom, hCalendar, hCard, hReview, XFN, Geo, rel-tag and adr. RDF vocabularies handled include Dublin Core, FOAF, SIOC, and "other supported vocabularies". See the appendix on vocabularies in Yahoo's SearchMonkey Guide for a full list and more information.
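
To make the idea of microformat markup concrete, here is a minimal sketch of pulling a few hCard properties out of a page with Python's standard `html.parser`. It is not how SearchMonkey works internally. The sample markup and the person in it are invented, only a handful of hCard class names are recognized, and the parser assumes well-nested HTML.

```python
from html.parser import HTMLParser


class HCardParser(HTMLParser):
    """Collect the text of elements whose class names an hCard property."""

    PROPS = {"fn", "org", "title", "locality"}
    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.cards = []    # one dict per element with class "vcard"
        self._stack = []   # hCard property (or None) per open element

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return  # void elements never get a matching end tag
        classes = (dict(attrs).get("class") or "").split()
        if "vcard" in classes:
            self.cards.append({})
        prop = next((c for c in classes if c in self.PROPS), None)
        self._stack.append(prop)

    def handle_endtag(self, tag):
        if tag not in self.VOID and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        prop = self._stack[-1] if self._stack else None
        if prop and self.cards and data.strip():
            # Keep the first value seen for each property.
            self.cards[-1].setdefault(prop, data.strip())


# Invented sample markup, for illustration only.
SAMPLE = (
    '<div class="vcard"><span class="fn">Jane Doe</span>, '
    '<span class="title">Researcher</span> at '
    '<span class="org">Example University</span></div>'
)

if __name__ == "__main__":
    parser = HCardParser()
    parser.feed(SAMPLE)
    print(parser.cards)
```

A crawler that indexes structured data would store these property/value pairs alongside the page text.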

A post on the Yahoo search blog talks about this and other changes to the BOSS service and includes a nice example of the use of structured data encoded using microformats from President Obama’s LinkedIn page.

[Image: microformatted data on President Obama's LinkedIn page]

Sunday, February 8, 2009

Tim Berners-Lee talks on linked data at TED 2009

Tim Berners-Lee gave a talk at the TED2009 conference on linked data -- one of the newest and most interesting ideas to emerge from efforts to realize the Semantic Web vision.

Here's a summary of Sir Tim Berners-Lee's talk from a post by Gigaom, Highlights from TED: Tim Berners-Lee, Pattie Maes, Jacek Utko. I'm looking forward to being able to see his talk online soon.

"Founder of the web Tim Berners-Lee spoke of the next grassroots communication movement he wants to start: linked data. Much in the way his development of the web stemmed out of the frustrations of brilliant people working in silos, he is frustrated that the data of the world is shut apart in offline databases.

Berners-Lee wants raw data to come online so that it can be related to each other and applied together for multidisciplinary purposes, like combining genomics data and protein data to try to cure Alzheimer’s. He urged “raw data now,” and an end to “hugging your data” — i.e. keeping it private — until you can make a beautiful web site for it.

Berners-Lee said his dream is already on its way to becoming a reality, but that it will require a format for tagging data and understanding relationships between different pieces of it in order for a search to turn up something meaningful. Some current efforts are dbpedia, a project aimed at extracting structured information from Wikipedia, and OpenStreetMap, an editable map of the world. He really wants President Obama, who has promised to conduct government transparently online, to post linked data online."

You can see the slides that TBL used on the W3C site.

Big data, linked or not

The Data Evolution blog has an interesting post that asks Is Big Data at a tipping point? It suggests that we may be approaching a tipping point at which large amounts of online data will be interlinked and connected, suddenly producing a whole much larger than the parts.

"For the past several decades, an increasing number of business processes– from sales, customer service, shipping - have come online, along with the data they throw off. As these individual databases are linked, via common formats or labels, a tipping point is reached: suddenly, every part of the company organism is connected to the data center. And every action — sales lead, mouse click, and shipping update — is stored. The result: organizations are overwhelmed by what feels like a tsunami of data. The same trend is occurring in the larger universe of data that these organizations inhabit. Big Data unleashed by the “Industrial Revolution of Data”, whether from public agencies, non-profit institutes, or forward-thinking private firms."

I expected that the post would soon segue into a discussion of the Semantic Web and maybe even the increasingly popular linked data movement, but it did not. Even so, it sets up plenty of nails for which we have an excellent hammer in hand. I really like this iceberg analogy, by the way.

"At present, much of the world’s Big Data is iceberg-like: frozen and mostly underwater. It’s frozen because format and meta-data standards make it hard to flow from one place to another: comparing the SEC’s financial data with that of Europe’s requires common formats and labels (ahem, XBRL) that don’t yet exist. Data is “underwater” when, whether reasons of competitiveness, privacy, or sheer incompetence it’s not shared: US medical records may contain a wealth of data, but much of it is on paper and offline (not so in Europe, enabling studies with huge cohorts)."

The post also points out some sources of online data and analysis tools, some familiar and some new to me (or maybe just forgotten).

"Yet there’s a slow thaw underway as evidenced by a number of initiatives: Aaron Swartz’s theinfo.org, Flip Kromer’s infochimps, Carl Malamud’s bulk.resource.org, as well as Numbrary, Swivel, Freebase, and Amazon’s public data sets. These are all ambitious projects, but the challenge of weaving these data sets together is still greater."

Wednesday, February 4, 2009

Google's approach to the semantic web

Google has never expressed a strong interest in the W3C Semantic Web approach. They are, however, interested in developing systems that can understand indexed content to a greater degree. They have to be, if they take the long view, and Google (still) has the resources to take the long view. The most common reason I have heard from Googlers for why they are not working with RDF and OWL is that they still don't see enough content out there expressed in these languages. I guess if you are Google, tens or hundreds of billions of triples is still not very much.

IDG news service has a story sketching how Google Researcher Targets Web's Structured Data. The work is directed not at data published in machine-understandable form (e.g., in RDF), but at other kinds of structured data accessible on the web.

"Internet search engines have focused largely on crawling text on Web pages, but Google is knee-deep in research about how to analyze and organize structured data, a company scientist said Friday. "There's a lot of structured data out on the Web and we're not doing a good job of presenting it to our users," said Alon Halevy during a talk at the New England Database Day conference at the Massachusetts Institute of Technology,

Halevy was referring in part to so-called "deep Web" sources, such as the databases that sit behind form-driven Web sites like Cars.com or Realtor.com. Google has been submitting queries to various forms for some time, retrieving the resulting Web pages and including them in its search index if the information looks useful.

But the company also wants to analyze the data found in structured tables on many Web sites, Halevy said, offering as an example a table on a Web page that lists the U.S. presidents. And there are reams of those tables -- Google's index turned up 14 billion of them, according to Halevy. He "realized very quickly that over 98 percent of these are not that interesting," but even after significant filtering there remain about 154 million tables worth indexing, he said."

ReadWriteWeb also has a story (Google: "We're Not Doing a Good Job with Structured Data") on what Google is or isn't doing with structured data, including an interesting admission by Google researcher Halevy.

"During a talk at the New England Database Day conference at the Massachusetts Institute of Technology, Google's Alon Halevy admitted that the search giant has "not been doing a good job" presenting the structured data found on the web to its users. By "structured data," Halevy was referring to the databases of the "deep web" - those internet resources that sit behind forms and site-specific search boxes, unable to be indexed through passive means."

For some technical details on the issues and current work, see the paper Google’s Deep-Web Crawl by researchers from Google (including Halevy), UCSD and Cornell, published in the Proceedings of VLDB 2008.

Free webinar on the Semantic Web from Dow Jones, Thur 12 Feb 2009

Dow Jones is hosting a free one-hour webinar about the Semantic Web on Thursday 12 February 2009 at 10:00am and again at 2:00pm EST. The webinar, The Semantic Web: Discover, Determine and Deploy, is the first in a three-part series on the Semantic Web.

"Dow Jones notes that "these days it's critical for organizations to consume, digest, and share news and information. The Semantic Web is no longer ahead of its time and is rapidly changing how organizations keep up with information overload." This webinar is Part I of a series and in it you will learn how Semantic Web Technologies enable you to re-use valuable information to save costs, facilitate easier collaboration and sharing of critical information across your business, and increase search relevancy and surface the most valuable information needed to remain competitive."

The presenters are Christine Connors and Daniela Barbosa, both members of the Dow Jones Enterprise Media Group.

The webinar is free but requires registration.

Monday, February 2, 2009

Problems with RDF validator: undecodable data

We're experiencing problems validating FOAF files with the W3C RDF validator. Two identical copies of a file, served by different servers, give different results. The first validates successfully, but the other produces an error:

"An attempt to load the RDF from URI 'http://cs.umbc.edu/~ctilmes1/foaf.rdf' failed. (Undecodable data when reading URI at byte 0 using encoding 'UTF-8'. Please check encoding and encoding declaration of your document.)"

Fetching the two files shows that their contents are identical (same MD5 checksum) and that both are served with reasonable HTTP headers:


% GET http://cs.umbc.edu/~ctilmes1/foaf.rdf | md5sum
74ca1b53dab1591a76517d714686936a  -
% GET http://userpages.umbc.edu/~ctilmes1/foaf.rdf | md5sum
74ca1b53dab1591a76517d714686936a  -

% HEAD http://cs.umbc.edu/~ctilmes1/foaf.rdf
200 OK
Connection: close
Date: Mon, 02 Feb 2009 15:45:49 GMT
Accept-Ranges: bytes
ETag: "603-dff-461f145761dea"
Server: Apache
Content-Length: 3583
Content-Type: application/rdf+xml
Last-Modified: Mon, 02 Feb 2009 15:33:07 GMT
Client-Date: Mon, 02 Feb 2009 15:45:49 GMT
Client-Peer: 130.85.36.80:80
Client-Response-Num: 1

% HEAD http://userpages.umbc.edu/~ctilmes1/foaf.rdf
200 OK
Connection: close
Date: Mon, 02 Feb 2009 15:45:58 GMT
Accept-Ranges: bytes
ETag: "2a27001e-dff-49871180"
Server: Apache/1.3.33 (Unix) mod_fastcgi/2.4.2 PHP/4.3.10 mod_perl/1.29 mod_ssl/2.8.22 OpenSSL/0.9.7d
Content-Length: 3583
Content-Type: application/rdf+xml
Last-Modified: Mon, 02 Feb 2009 15:30:08 GMT
Client-Date: Mon, 02 Feb 2009 15:45:58 GMT
Client-Peer: 130.85.24.44:80
Client-Response-Num: 1

We've sent email to the W3C www-rdf-validator list, but if anyone has advice, please let us know.
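
Since the error complains about byte 0, one thing worth ruling out is that the validator receives different bytes than our GET did, for example a compressed or differently encoded body. That is a guess, not a diagnosis. The sketch below is a small helper (the name `diagnose` and the signature list are ours) that reports what a response body starts with and where UTF-8 decoding first fails.

```python
import codecs
import gzip  # used only to build test data in the usage example

# Byte signatures that often indicate a body is not plain UTF-8 text.
SIGNATURES = [
    (codecs.BOM_UTF16_LE, "UTF-16 little-endian BOM"),
    (codecs.BOM_UTF16_BE, "UTF-16 big-endian BOM"),
    (b"\x1f\x8b", "gzip magic number (compressed body)"),
]


def diagnose(body):
    """Describe why `body` (bytes) may fail to decode as UTF-8."""
    for magic, label in SIGNATURES:
        if body.startswith(magic):
            return "starts with " + label
    try:
        body.decode("utf-8")
    except UnicodeDecodeError as err:
        return "invalid UTF-8 at byte %d" % err.start
    return "valid UTF-8"


if __name__ == "__main__":
    print(diagnose(b'<?xml version="1.0"?><rdf:RDF/>'))
    print(diagnose(gzip.compress(b"<rdf:RDF/>")))
    print(diagnose(b"\xff\xfe<\x00"))
```

To use it on the real files, fetch each URL with the same request headers the validator sends (which we do not know) and pass the raw response bytes to `diagnose`.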

Jim Hendler on Web 3.0 in Computer, v42n1

Jim Hendler has a short three-page article in the January issue of Computer on Web 3.0, aka, to some, anyway, the Semantic Web.

Jim Hendler, Web 3.0 Emerging, Computer, v42n1, pp 88-90, January 2009.

Here's how ACM summarized it in their daily TechNews service:

"Web 3.0 is generally defined as Semantic Web technologies that run or are embedded within large-scale Web applications, writes Jim Hendler, assistant dean for information technology at Rensselaer Polytechnic Institute. He points out that 2008 was a good year for Web 3.0, based on the healthy level of investment in Web 3.0 projects, the focus on Web 3.0 at various conferences and events, and the migration of new technologies from academia to startups. Hendler says the past year has seen a clarification of emerging Web 3.0 applications. "Key enablers are a maturing infrastructure for integrating Web data resources and the increased use of and support for the languages developed in the World Wide Web Consortium (W3C) Semantic Web Activity," he observes.

The application of Web 3.0 technologies, in combination with the Web frameworks that run the Web 2.0 applications, are becoming the benchmark of the Web 3.0 generation, Hendler says. The Resource Description Framework (RDF) serves as the foundation of Web 3.0 applications, which links data from multiple Web sites or databases. Following the data's rendering in RDF, the development of multisite mashups is affected by the use of uniform resource identifiers (URIs) for blending and mapping data from different resources. Relationships between data in different applications or in different parts of the same application can be deduced through the RDF Schema and the Web Ontology Language, facilitating the linkage of different datasets via direct assertions.

Hendler writes that a key dissimilarity between Web 3.0 technologies and artificial intelligence knowledge representation applications resides in the Web naming scheme supplied by URIs combined with the inferencing in Web 3.0 applications, which supports the generation of large graphs that can prop up large-scale Web applications."
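
The summary's point about shared URIs and RDF Schema inference can be illustrated with a toy example. This is a sketch in plain Python rather than a real RDF library: triples are tuples, every URI and name is invented, and only the RDFS subclass rule is applied.

```python
# Triples are (subject, predicate, object) tuples; all URIs are invented.
EX = "http://example.org/"
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

# Data published by two independent (hypothetical) sites.
site_a = {
    (EX + "alice", "foaf:name", "Alice"),
    (EX + "alice", RDF_TYPE, EX + "Professor"),
}
site_b = {
    (EX + "alice", EX + "wrote", EX + "article1"),
    (EX + "Professor", SUBCLASS, EX + "Person"),
}


def infer_types(triples):
    """Apply the RDFS subclass rule until no new triples appear."""
    graph = set(triples)
    while True:
        new = {
            (s, RDF_TYPE, parent)
            for (s, p, cls) in graph if p == RDF_TYPE
            for (child, q, parent) in graph
            if q == SUBCLASS and child == cls
        }
        if new <= graph:
            return graph
        graph |= new


# Because both sites use the same URI for Alice, a set union links them.
merged = infer_types(site_a | site_b)

if __name__ == "__main__":
    for triple in sorted(merged):
        print(triple)
```

The merged graph contains facts about `alice` from both sites plus the inferred statement that she is a `Person`, which neither site asserted directly.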

Wednesday, January 28, 2009

State of the Semantic Web

Danny Ayers, a well-known Semantic Web developer, has the first of a three-part article on the state of the semantic web in the latest issue of IEEE Internet Computing.

Danny Ayers, "Delivered Deliverables: The State of the Semantic Web, Part 1," IEEE Internet Computing, vol. 13, no. 1, pp. 86-89, Jan./Feb. 2009.

He writes in the Nodalities blog:

"Well finally I got around to starting this write-up, and the first instalment has appeared in the excellent IEEE Internet Computing. I foolishly thought I’d be able to cover the main ground in one column, now it seems like I’ll need at least three. In Delivered Deliverables I look mostly at the output of the W3C. The provisional plan is to cover infrastructure & backend tools in part two (with comments of the notion of linked data), and move on to real-world applications in part three. Suggestions are very much welcome."

It's a good summary of the standards that have been developed by the W3C to support the Semantic Web.

Tuesday, January 27, 2009

Semantics-Empowered Social Computing

Here's another interesting-looking article from the current issue of IEEE Internet Computing.

Amit Sheth and Meenakshi Nagarajan, Semantics-Empowered Social Computing, IEEE Internet Computing, v13n1, pp 76-80, 2009.

"User-generated textual content on social media has unique characteristics owing to the interpersonal and interactional nature of the communication medium. Web 3.0 applications that aim to automatically create accurate annotations from user-generated content to common reference models will have to invariably deal with the informal nature of this content. In this article, the authors discuss opportunities in addressing challenges posed by this content by supplementing traditional statistical and NLP techniques with domain knowledge."

Semantic Email Addressing

The current issue of IEEE Internet Computing has an article titled Semantic Email Addressing: The Semantic Web Killer App? by Michael Kassoff, Charles Petrie, Lee-Ming Zen, and Michael Genesereth.

"Email addresses, like telephone numbers, are opaque identifiers. They’re often hard to remember, and, worse still, they change from time to time. Semantic email addressing (SEA) lets users send email to a semantically specified group of recipients. It provides all of the functionality of static email mailing lists, but because users can maintain their own profiles, they don’t need to subscribe, unsubscribe, or change email addresses. Because of its targeted nature, SEA could help combat unintentional spam and preserve the privacy of email addresses and even individual identities."

You can get the full PDF here.
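
The core idea, addressing mail to a description of the recipients rather than to a list of addresses, can be sketched in a few lines. This is a simplified stand-in and not the authors' system: the profile fields, the addresses, and the function `resolve_recipients` are all invented, and matching is plain attribute lookup rather than a logical query.

```python
# User-maintained profiles; every name, field, and address is invented.
PROFILES = [
    {"email": "ann@example.org", "role": {"student"}, "course": {"CMSC491"}},
    {"email": "bob@example.org", "role": {"student"}, "course": {"CMSC691"}},
    {"email": "cat@example.org", "role": {"faculty"}, "course": {"CMSC491", "CMSC691"}},
]


def resolve_recipients(profiles, **criteria):
    """Return the addresses of every profile that matches all criteria."""
    return sorted(
        p["email"]
        for p in profiles
        if all(value in p.get(key, ()) for key, value in criteria.items())
    )


if __name__ == "__main__":
    # "Everyone involved in CMSC491" resolves at send time.
    print(resolve_recipients(PROFILES, course="CMSC491"))
    # "Students in CMSC691"
    print(resolve_recipients(PROFILES, role="student", course="CMSC691"))
```

Because recipients are computed when the message is sent, a user who updates their own profile joins or leaves the group without subscribing or unsubscribing.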
