Tuesday, February 17, 2009

Tim Berners-Lee's map of the Web world

Back in 2007, Tim Berners-Lee made this amusing map. It might be a useful reference as you make your way from the World Wide Web toward the Sea of Interoperability while trying to avoid some of the dark and dangerous places where giants and dragons are said to dwell.

Thursday, February 12, 2009

Yahoo BOSS exposes structured data in RDF

This could be a big step toward the "web of data" vision of the Semantic Web.

Yahoo announced (Accessing Structured Data using BOSS) that their BOSS (Build your Own Search Service) platform will now support structured data, including RDF.
"Yahoo! Search BOSS provides access to structured data acquired through SearchMonkey. Currently, we are only exposing data that has been semantically marked up and subsequently acquired by the Yahoo! Web Crawler. In the near future, we will also expose structured data shared with us in SearchMonkey data feeds. In both cases, we will respect site owner requests to opt-out of structured data sharing through BOSS."
Yahoo's BOSS to support RDF data
Here's how it works:
  • Sites use microformats or RDF (encoded using RDFa or eRDF) to add structured data to their pages
  • Yahoo's web crawler encounters embedded markup and indexes the structured data along with the unstructured text
  • A BOSS developer specifies "view=searchmonkey_rdf" or "view=searchmonkey_feed" in API requests
  • BOSS returns the structured data in its response as either XML or JSON
Yahoo's SearchMonkey only acquires structured data using certain microformats or RDF vocabularies. The microformats supported are hAtom, hCalendar, hCard, hReview, XFN, Geo, rel-tag and adr. RDF vocabularies handled include Dublin Core, FOAF, SIOC, and "other supported vocabularies". See the appendix on vocabularies in Yahoo's SearchMonkey Guide for a full list and more information.
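For BOSS developers, a request with one of those views might look like the following minimal Python sketch. The v1 endpoint URL, the JSON response layout, and the YOUR_APPID placeholder are assumptions drawn from Yahoo's BOSS documentation rather than tested values, so check the current docs before relying on them.

import json
import urllib.parse
import urllib.request

APPID = "YOUR_APPID"  # placeholder: register with BOSS to get a real application ID

def boss_structured_search(query, view="searchmonkey_rdf"):
    # Build a BOSS v1 web-search request asking for SearchMonkey RDF data;
    # use view="searchmonkey_feed" for feed-submitted structured data instead.
    params = urllib.parse.urlencode({"appid": APPID, "format": "json", "view": view})
    url = ("http://boss.yahooapis.com/ysearch/web/v1/"
           + urllib.parse.quote(query) + "?" + params)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

results = boss_structured_search("barack obama linkedin")
# Assumed layout: hits live under ysearchresponse/resultset_web, each
# possibly carrying a "searchmonkey_rdf" field with the extracted data.
for hit in results["ysearchresponse"].get("resultset_web", []):
    print(hit["url"], hit.get("searchmonkey_rdf"))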

A post on the Yahoo search blog talks about this and other changes to the BOSS service and includes a nice example of the use of structured data encoded using microformats from President Obama’s LinkedIn page.

Microformatted data on President Obama's LinkedIn page

Sunday, February 8, 2009

Tim Berners-Lee talks on linked data at TED 2009

Tim Berners-Lee gave a talk at the TED2009 conference on linked data -- one of the newest and most interesting ideas to emerge from efforts to realize the Semantic Web vision.

Here's a summary of Sir Tim Berners-Lee's talk from a Gigaom post, Highlights from TED: Tim Berners-Lee, Pattie Maes, Jacek Utko. I'm looking forward to seeing his talk online soon.
"Founder of the web Tim Berners-Lee spoke of the next grassroots communication movement he wants to start: linked data. Much in the way his development of the web stemmed out of the frustrations of brilliant people working in silos, he is frustrated that the data of the world is shut apart in offline databases.

Berners-Lee wants raw data to come online so that it can be related to each other and applied together for multidisciplinary purposes, like combining genomics data and protein data to try to cure Alzheimer’s. He urged “raw data now,” and an end to “hugging your data” — i.e. keeping it private — until you can make a beautiful web site for it.

Berners-Lee said his dream is already on its way to becoming a reality, but that it will require a format for tagging data and understanding relationships between different pieces of it in order for a search to turn up something meaningful. Some current efforts are dbpedia, a project aimed at extracting structured information from Wikipedia, and OpenStreetMap, an editable map of the world. He really wants President Obama, who has promised to conduct government transparently online, to post linked data online."
You can see the slides that TBL used on the W3C site.

Big data, linked or not

The Data Evolution blog has an interesting post that asks Is Big Data at a tipping point? It suggests that we may be approaching a tipping point at which large amounts of online data will be interlinked, suddenly producing a whole much greater than the sum of its parts.
"For the past several decades, an increasing number of business processes– from sales, customer service, shipping - have come online, along with the data they throw off. As these individual databases are linked, via common formats or labels, a tipping point is reached: suddenly, every part of the company organism is connected to the data center. And every action — sales lead, mouse click, and shipping update — is stored. The result: organizations are overwhelmed by what feels like a tsunami of data. The same trend is occurring in the larger universe of data that these organizations inhabit. Big Data unleashed by the “Industrial Revolution of Data”, whether from public agencies, non-profit institutes, or forward-thinking private firms."
I expected that the post would soon segue into a discussion of the Semantic Web and maybe even the increasingly popular linked data movement, but it did not. Even so, it sets up plenty of nails for which we have an excellent hammer in hand. I really like the iceberg analogy, by the way.
"At present, much of the world’s Big Data is iceberg-like: frozen and mostly underwater. It’s frozen because format and meta-data standards make it hard to flow from one place to another: comparing the SEC’s financial data with that of Europe’s requires common formats and labels (ahem, XBRL) that don’t yet exist. Data is “underwater” when, whether reasons of competitiveness, privacy, or sheer incompetence it’s not shared: US medical records may contain a wealth of data, but much of it is on paper and offline (not so in Europe, enabling studies with huge cohorts)."
The post also points out some sources of online data and analysis tools, some familiar and some new to me (or maybe just forgotten).
"Yet there’s a slow thaw underway as evidenced by a number of initiatives: Aaron Swartz’s theinfo.org, Flip Kromer’s infochimps, Carl Malamud’s bulk.resource.org, as well as Numbrary, Swivel, Freebase, and Amazon’s public data sets. These are all ambitious projects, but the challenge of weaving these data sets together is still greater."

Wednesday, February 4, 2009

Google's approach to the semantic web

Google has never expressed a strong interest in the W3C Semantic Web approach. They are, however, interested in developing systems that can better understand the content they index. They have to be, if they take the long view, and Google (still) has the resources to take the long view. The most common reason I have heard from Googlers for why they are not working with RDF and OWL is that they still don't see enough content out there expressed in these languages. I guess if you are Google, tens or hundreds of billions of triples is still not very much.

IDG News Service has a story on this work, Google Researcher Targets Web's Structured Data. The work is directed not at data published in machine-understandable form (e.g., RDF), but at other kinds of structured data accessible on the Web.
"Internet search engines have focused largely on crawling text on Web pages, but Google is knee-deep in research about how to analyze and organize structured data, a company scientist said Friday. "There's a lot of structured data out on the Web and we're not doing a good job of presenting it to our users," said Alon Halevy during a talk at the New England Database Day conference at the Massachusetts Institute of Technology,

Halevy was referring in part to so-called "deep Web" sources, such as the databases that sit behind form-driven Web sites like Cars.com or Realtor.com. Google has been submitting queries to various forms for some time, retrieving the resulting Web pages and including them in its search index if the information looks useful.

But the company also wants to analyze the data found in structured tables on many Web sites, Halevy said, offering as an example a table on a Web page that lists the U.S. presidents. And there are reams of those tables -- Google's index turned up 14 billion of them, according to Halevy. He "realized very quickly that over 98 percent of these are not that interesting," but even after significant filtering there remain about 154 million tables worth indexing, he said.
ReadWriteWeb also has a story (Google: "We're Not Doing a Good Job with Structured Data") on what Google is and isn't doing with structured data, including an interesting admission by Google researcher Halevy.
"During a talk at the New England Database Day conference at the Massachusetts Institute of Technology, Google's Alon Halevy admitted that the search giant has "not been doing a good job" presenting the structured data found on the web to its users. By "structured data," Halevy was referring to the databases of the "deep web" - those internet resources that sit behind forms and site-specific search boxes, unable to be indexed through passive means."
For some technical details on the issues and current work, see the paper Google's Deep Web Crawl by researchers from Google (including Halevy), UCSD and Cornell, published in the Proceedings of VLDB 2008.

Free webinar on the Semantic Web from Dow Jones, Thur 12 Feb 2009

Dow Jones is hosting a free one-hour webinar about the Semantic Web on Thursday 12 February 2009, at 10:00am and again at 2:00pm EST. The webinar, The Semantic Web: Discover, Determine and Deploy, is the first in a three-part series on the Semantic Web.
"Dow Jones notes that "these days it's critical for organizations to consume, digest, and share news and information. The Semantic Web is no longer ahead of its time and is rapidly changing how organizations keep up with information overload." This webinar is Part I of a series and in it you will learn how Semantic Web Technologies enable you to re-use valuable information to save costs, facilitate easier collaboration and sharing of critical information across your business, and increase search relevancy and surface the most valuable information needed to remain competitive."
The presenters are Christine Connors and Daniela Barbosa, both members of the Dow Jones Enterprise Media Group.

The webinar is free but requires registration.

Monday, February 2, 2009

Problems with RDF validator: undecodable data

We're experiencing problems validating FOAF files with the W3C RDF validator. Two identical files served by different servers give different results: the first validates successfully, but the other produces an error:
"An attempt to load the RDF from URI 'http://cs.umbc.edu/~ctilmes1/ foaf.rdf' failed. (Undecodable data when reading URI at byte 0 using encoding 'UTF-8'. Please check encoding and encoding declaration of your document.)"
Comparing the two files shows their contents to be identical (same MD5 checksum), and both are served with reasonable HTTP headers:

% GET http://cs.umbc.edu/~ctilmes1/foaf.rdf | md5sum
74ca1b53dab1591a76517d714686936a -
% GET http://userpages.umbc.edu/~ctilmes1/foaf.rdf | md5sum
74ca1b53dab1591a76517d714686936a -

% HEAD http://cs.umbc.edu/~ctilmes1/foaf.rdf
200 OK
Connection: close
Date: Mon, 02 Feb 2009 15:45:49 GMT
Accept-Ranges: bytes
ETag: "603-dff-461f145761dea"
Server: Apache
Content-Length: 3583
Content-Type: application/rdf+xml
Last-Modified: Mon, 02 Feb 2009 15:33:07 GMT
Client-Date: Mon, 02 Feb 2009 15:45:49 GMT
Client-Peer: 130.85.36.80:80
Client-Response-Num: 1

% HEAD http://userpages.umbc.edu/~ctilmes1/foaf.rdf
200 OK
Connection: close
Date: Mon, 02 Feb 2009 15:45:58 GMT
Accept-Ranges: bytes
ETag: "2a27001e-dff-49871180"
Server: Apache/1.3.33 (Unix) mod_fastcgi/2.4.2 PHP/4.3.10 mod_perl/1.29 mod_ssl/2.8.22 OpenSSL/0.9.7d
Content-Length: 3583
Content-Type: application/rdf+xml
Last-Modified: Mon, 02 Feb 2009 15:30:08 GMT
Client-Date: Mon, 02 Feb 2009 15:45:58 GMT
Client-Peer: 130.85.24.44:80
Client-Response-Num: 1
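
One more check we can run locally is to fetch both copies and try decoding the raw bytes as UTF-8 ourselves, reporting where (if anywhere) decoding fails. A minimal Python sketch of that check, using the two URLs above:

import urllib.request

URLS = [
    "http://cs.umbc.edu/~ctilmes1/foaf.rdf",
    "http://userpages.umbc.edu/~ctilmes1/foaf.rdf",
]

for url in URLS:
    raw = urllib.request.urlopen(url).read()
    try:
        raw.decode("utf-8")
        print(url, "->", len(raw), "bytes, decodes cleanly as UTF-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that failed to decode
        print(url, "-> UTF-8 error at byte", err.start,
              "offending bytes:", repr(raw[err.start:err.start + 8]))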

We've sent email to the W3C www-rdf-validator list, but if anyone has advice, please let us know.

Jim Hendler on Web 3.0 in Computer, v42n1

Jim Hendler has a short three-page article in the January issue of Computer on Web 3.0, aka (to some, anyway) the Semantic Web.
Jim Hendler, Web 3.0 Emerging, Computer, v42n1, pp 88-90, January 2009.
Here's how ACM summarized it in their daily TechNews service:
"Web 3.0 is generally defined as Semantic Web technologies that run or are embedded within large-scale Web applications, writes Jim Hendler, assistant dean for information technology at Rensselaer Polytechnic Institute. He points out that 2008 was a good year for Web 3.0, based on the healthy level of investment in Web 3.0 projects, the focus on Web 3.0 at various conferences and events, and the migration of new technologies from academia to startups. Hendler says the past year has seen a clarification of emerging Web 3.0 applications. "Key enablers are a maturing infrastructure for integrating Web data resources and the increased use of and support for the languages developed in the World Wide Web Consortium (W3C) Semantic Web Activity," he observes.

The application of Web 3.0 technologies, in combination with the Web frameworks that run the Web 2.0 applications, are becoming the benchmark of the Web 3.0 generation, Hendler says. The Resource Description Framework (RDF) serves as the foundation of Web 3.0 applications, which links data from multiple Web sites or databases. Following the data's rendering in RDF, the development of multisite mashups is affected by the use of uniform resource identifiers (URIs) for blending and mapping data from different resources. Relationships between data in different applications or in different parts of the same application can be deduced through the RDF Schema and the Web Ontology Language, facilitating the linkage of different datasets via direct assertions.

Hendler writes that a key dissimilarity between Web 3.0 technologies and artificial intelligence knowledge representation applications resides in the Web naming scheme supplied by URIs combined with the inferencing in Web 3.0 applications, which supports the generation of large graphs that can prop up large-scale Web applications."