The Found Corporation
Featured content from Semantic Focus, Semantic Web Blog and Community
Displaying 10 most recent entries.
Bueda API Turns Tags into RDF URIs 26 Feb 2010
A large percentage of the content that users deal with on a daily basis is created by other users. Every minute more than 90,000 videos and images are uploaded to YouTube, Flickr and other social media websites, yet this represents a relatively small share of revenue when compared with traditional media. We believe that one reason for this is that publishers cannot readily understand high-density content that lacks an adequate description. With mobile platforms providing users with easy methods for rich media upload, this problem will grow rapidly.
Tags are an attempt to mitigate this problem. They give users an easy way to label content with the labels that make sense to them. Their strength lies in their simplicity for the user and in the user's freedom to use anything as a tag, enabling an accurate description of content from the user's perspective. Yet the strength of tags is also a weakness when it comes to the publisher's ability to understand that content. A tag is, realistically speaking, any sequence of characters. It could be a well-formed word, a company name, a person's name, an ISBN, a concatenation of dates and words, etc. The combined problems of coverage and disambiguation make this hard to solve.
Bueda addresses this problem by presenting a new solution in the form of an API that can be used by developers to get clean information from noisy tags. It provides a low friction way of tapping into the latest in semantic analysis for tags in a scalable platform.
Bueda provides actionable information that enables targeted advertising, content recommendation, search engine optimization and semantic search, amongst other things. Even though the biggest impact might be in high-density content, such as rich media and pictures, the platform is open to any application and use case.
Bueda is a CMU spin-off and uses proprietary technology for Semantic Resource integration, which combines heterogeneous data sources to provide open-domain coverage in a distributed and scalable framework. Bueda is also an Alphalab alum and is currently funded by Innovation Works.
Bueda is currently in private beta. However, Semantic Focus readers have access to some exclusive API keys.
Semantic Data Storage in Oracle 15 Jan 2009
Oracle 10g Release 2 / Oracle 11g offers a robust, scalable, secure platform to store RDF and OWL data. It allows efficient storage, loading and querying of semantic data. Queries are enhanced by adding relationships (ontologies) to data and are evaluated on the basis of semantics. Data is stored in the form of RDF triples (Subject, Predicate, Object) and can scale up to millions of triples. The triples stored in the semantic data store are modeled as a graph structure. All the data is stored in a single central schema, which users access to load and query data.
The Subject and Object are modeled as nodes, while the predicates are denoted by links in the graph structure. Nodes are stored and efficiently reused when required. An RDF triple in the semantic store has a subject (start node), a predicate (relationship) and an object (end node), which together comprise a link. A new link is created when a new triple is inserted, and nodes are reused if matching nodes already exist.
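The node-reuse behaviour described above can be illustrated with a small in-memory sketch. This is not Oracle's implementation or schema; the class name `TripleGraph`, its fields and the `ex:` identifiers are all hypothetical and exist only to show the idea that each inserted triple adds one link while subject and object nodes are created once and then reused.

```python
from dataclasses import dataclass, field


@dataclass
class TripleGraph:
    """Toy graph: nodes are reused, every inserted triple adds one link."""
    node_ids: dict = field(default_factory=dict)  # node value -> node id
    links: list = field(default_factory=list)     # (start_id, predicate, end_id)

    def _node(self, value: str) -> int:
        # Reuse the existing node id if this value was seen before.
        if value not in self.node_ids:
            self.node_ids[value] = len(self.node_ids) + 1
        return self.node_ids[value]

    def insert(self, subject: str, predicate: str, obj: str) -> tuple:
        # The subject is the start node, the object is the end node.
        link = (self._node(subject), predicate, self._node(obj))
        self.links.append(link)
        return link


graph = TripleGraph()
graph.insert("ex:Alice", "ex:knows", "ex:Bob")
graph.insert("ex:Bob", "ex:knows", "ex:Carol")  # "ex:Bob" is reused, not duplicated
print(len(graph.node_ids), len(graph.links))    # prints: 3 2
```

Two triples that share a node therefore produce three nodes and two links, rather than four nodes.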
New object types are defined to manage semantic data, namely SDO_RDF_TRIPLE and SDO_RDF_TRIPLE_S. The former stores the references to the data and the latter holds the actual data content. The nodes (Subject, Object) are stored in the RDF_NODE$ table, which can be further broken down into START_NODE_ID and END_NODE_ID. The RDF_LINKS$ table stores a record for the link whenever a new triple is inserted. Blank nodes may also be inserted as part of any triple; these are stored in the RDF_BLANK_NODE$ table. An RDF model stores references to all the RDF data in the database and can be created by executing the sem_apis.create_sem_model procedure.
Get started with semantic data management on Windows XP and configure semantic web technology support in Oracle 11g Release 1.
This article gives an overview of semantic data storage. For more in-depth information on semantic data support in Oracle, here are some useful links:
References: RDF support in Oracle (.pdf)
Calling All RDF Dumps 18 Dec 2008
Today on the Linking Open Data mailing list, Kingsley Idehen of OpenLink Software announced that he is preparing to load the entire LOD cloud into Virtuoso 6.0 Cluster Edition. The datasets are being added to a table on the ESW wiki, making it convenient for anyone doing Semantic Web research to get a hold of the datasets. Once all the datasets are added we should have a better idea of how much linked data there really is out there. This may also raise the bar for other triple stores and force them to develop methods for storing several billion triples.
Here are his instructions for adding your dataset to the table:
- Go to: http://esw.w3.org/topic/DataSetRDFDumps
- Add your data set to the table (if it isn't already listed) or correct erroneous entries
- Add a URL entry to the "Archive URL" column
- Add a Publisher URI to the "Publisher / Maintainer" column (used for the construction of Attribution Triples)
If you don't have a URI for yourself, you can get one by registering.
Service Ontologies 14 Dec 2008
Ontologies classifying and describing services are called service ontologies. The currently used WSDL interface describes a service by specifying the operation name, inputs required for the service invocation, output of the service and its target address for invocation. Human intervention is required in this loop since the current architecture only addresses the syntactical aspects of Web services and lacks choreography mechanisms.
Service ontologies supplement the WSDL interface, since additional knowledge is required to enable automated discovery, invocation and composition of services. The idea is to annotate web services, enabling the automation of the web service life cycle.
The existing conceptual models for describing services are OWL-S, WSMO, WSDL-S, SWSF and SAWSDL. Web services can be modeled in different tools like OWL-S Editor, OWL-S IDE, Protege, IRS-III and METEOR-S.
For example, the OWL-S service ontology is classified into three categories: profile, model and grounding. The service component is actually an instance of the service and is linked to the profile, model and grounding by different properties. The profile is an advertisement of what the service does, i.e., what the service offers in terms of functionality. It considers inputs, outputs, preconditions and effects (IOPE).
The input specifies the actual input required for invoking the web service, and the output specifies the actual output the client gets or expects. Preconditions indicate the conditions that need to be satisfied for the successful execution of the web service, and finally the effect describes the state of the web service after its execution.
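As a rough illustration of the IOPE idea, the sketch below models a profile as plain Python data and checks whether a service can be invoked. It is not OWL-S syntax or any OWL-S tool's API; `ServiceProfile`, `can_invoke`, `invoke` and the `BookFlight` example are hypothetical names, and preconditions and effects are simplified to sets of fact labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceProfile:
    """Simplified IOPE description of a service (hypothetical model)."""
    name: str
    inputs: frozenset
    outputs: frozenset
    preconditions: frozenset
    effects: frozenset


def can_invoke(profile, supplied_inputs, state):
    """True when every required input is supplied and every precondition holds."""
    return (profile.inputs <= set(supplied_inputs)
            and profile.preconditions <= set(state))


def invoke(profile, supplied_inputs, state):
    """Return the state after execution: the old facts plus the effects."""
    if not can_invoke(profile, supplied_inputs, state):
        raise ValueError(f"cannot invoke {profile.name}")
    return set(state) | profile.effects


book_flight = ServiceProfile(
    name="BookFlight",
    inputs=frozenset({"origin", "destination", "date"}),
    outputs=frozenset({"ticket"}),
    preconditions=frozenset({"card_valid"}),
    effects=frozenset({"seat_reserved"}),
)

new_state = invoke(book_flight, {"origin", "destination", "date"}, {"card_valid"})
print(sorted(new_state))  # prints: ['card_valid', 'seat_reserved']
```

A real profile would reference ontology concepts rather than bare strings, which is what lets a matchmaker reason about related inputs and outputs instead of comparing labels literally.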
The service model describes how the service works in order to achieve its functionality. It describes atomic processes, composite processes and the message choreography involved in invoking the web service. Atomic processes are those that execute in a single, straightforward step given standard input, whereas composite processes are those that involve a combination of different services.
Service grounding describes how the service can be accessed. It specifies the network protocols and data exchange formats required to invoke the web service.
Like OWL-S, the other models also address the semantic nature of web service descriptions thereby making an effort to automate the web service life cycle.
Semantic Web Service Life Cycle and Service Modeling 14 Dec 2008
Semantic Web services follow a life cycle, from deployment through to invocation.
The life cycle of Semantic Web services comprises different stages: service modeling, service discovery, service definition and service delivery. The life cycle begins with the provider modeling the web service and the consumer modeling the service request. Web service descriptions are developed using models like OWL-S and WSMO. These descriptions are used in the discovery stage, where discovery algorithms and matchmaking techniques are applied to them. Once a set of service providers is identified for a service requester, service definition takes place to select the concrete service. Finally, the concrete service is delivered to the service requester in the delivery phase.
Web service modeling is a critical aspect of the web service life cycle. Annotating web services requires a great deal of human effort. Services can be modeled using two approaches: the code driven approach and the model driven approach.
Code Driven Approach
It is assumed that the web service is already implemented, and the corresponding WSDL is generated from it. The web service can be annotated semantically by adding OWL-S specifications. Tools like Java2WSDL, WSDL2OWL-S can be used to generate abstract OWL-S specifications. The service description would later be published to the registry for discovery and invocation. This approach is referred to as the Code Driven Approach since the starting point of this process uses a web service (code).
Model Driven Approach
This approach uses the high level service descriptions to generate partial code. Service descriptions are created using ontologies and the process model is used to generate stubs for implementing the web service. The code generated is used to create the WSDL, and later published to the registry.
Can Graphd Scale to Meet Semantic Web Demands? 9 Dec 2008
Freebase stores millions of entities and assertions about nearly every topic one can ponder (thanks are owed to their seed dataset – Wikipedia – and their amazing community). The amount of information that Freebase stores is incredible, and is a testament to what can be accomplished with the help of a dedicated community and a little (or a lot) of clever software engineering.
Graphd is the in-house tuple store powering Freebase's backend. Written in C, Graphd runs on Unix-based machines (presumably some Linux distro) and processes commands in a simple, template-based query language called MQL. The query language looks strikingly similar to JSON and Python dictionary syntax, so developers familiar with either should find working with their API a cinch.
On performance, Freebase's Scott Meyer stated that, as of April 9th, 2008, Graphd could demonstrate sustained throughput of about 200,000 simple queries per minute on a single AMD64 core (querying a graph of only 121 million tuples, however). As an example of a simple query, he gave "show me all people who are authors with names containing 'herman'." Also as of April 9th, 2008, their graph of 121 million primitives (tuples) consumed about 12 GB on disk (including all index storage).
We see that Graphd is able to handle a stunning sustained ~3,300 queries/sec on a single AMD64 core, which is nothing to scoff at. However, the question I am finally getting around to is this: can Graphd scale to meet the demands of the Semantic Web? Eventually, Freebase will be much larger. 121m tuples is nothing when compared to the amount of data currently available in RDF (already in the order of billions of assertions).
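The arithmetic behind these figures is easy to check. The sketch below uses only the numbers quoted above, reads "12gb" as decimal gigabytes, and assumes (purely for illustration) that disk usage grows linearly with tuple count; the 3 billion tuple target is an illustrative figure for "billions of assertions", not a measurement.

```python
QUERIES_PER_MINUTE = 200_000   # sustained simple queries, single AMD64 core
TUPLES = 121_000_000           # size of the graph at the time
DISK_BYTES = 12 * 10**9        # about 12 GB on disk, indexes included

queries_per_second = QUERIES_PER_MINUTE / 60
bytes_per_tuple = DISK_BYTES / TUPLES


def projected_disk_gb(tuple_count):
    """Naive linear extrapolation of disk usage (an assumption, not a benchmark)."""
    return tuple_count * bytes_per_tuple / 10**9


print(round(queries_per_second))                # prints: 3333
print(round(bytes_per_tuple))                   # prints: 99
print(round(projected_disk_gb(3_000_000_000)))  # prints: 298
```

So each tuple costs roughly 100 bytes including indexes, and a graph of a few billion tuples would need on the order of hundreds of gigabytes if nothing else changed.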
I have read in comments that Graphd runs completely in memory (or perhaps more likely, only the indices). This explains the amazing performance to a degree. On an AMD64 Phenom Quad Core with 2 GB of RAM I can run "simple" operations linearly through a flat file of 17m Freebase tuples in under 6 seconds (in memory). On a slice of 1m tuples the test was able to complete the iterations within ~0.003 seconds. The test was written in Python, so it isn't even as quick as Graphd (written in C) has the potential to be.
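The original test script is not shown, so the following is only a guess at its general shape: a linear, in-memory pass over tuples with a simple substring filter, timed with `time.perf_counter`. The tuple layout, the synthetic data and the helper names `make_tuples` and `scan` are assumptions; timings will vary by machine and are not comparable to the figures above.

```python
import time


def make_tuples(count):
    """Build synthetic (subject, predicate, object) tuples; not Freebase data."""
    return [(f"/guid/{i}", "/type/object/name", f"name {i}") for i in range(count)]


def scan(tuples, needle):
    """Linear pass: count tuples whose object value contains the needle."""
    return sum(1 for _, _, obj in tuples if needle in obj)


tuples = make_tuples(100_000)

start = time.perf_counter()
hits = scan(tuples, "999")
elapsed = time.perf_counter() - start

print(f"{hits} hits over {len(tuples)} tuples in {elapsed:.4f}s")
```

A scan like this touches every tuple on every query, which is exactly why it only stays fast while the whole dataset fits in memory.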
The test should illustrate the amazing performance you can achieve when processing entirely in memory, but when you can no longer store your entire set of indices in memory (say, for 3b+ tuples) you have to apply some of that clever software engineering to quickly locate data positions regardless of the number or distribution of indices.
Can Freebase scale Graphd to meet the demands of the Semantic Web, or will they need to completely redesign the architecture of their backend to reach a scale not originally designed for? I cannot say, but I wish them the best of luck. I think I speak for everyone when I say I would really like to see Graphd open sourced!
PS: Freebase, I promise I'll use the new logo in my posts going forward.
The Map of Data: Over 10 Billion Pieces of Reusable Information 19 Nov 2008
I just stumbled upon a useful resource from Sindice (the Semantic Web search engine) called the Map of Data. The Map of Data lists sites that export their information via Microformats and embedded RDF (as well as which format(s) the sites are using). Each site has been categorized and conveniently placed into lists. The categories include books, people, places, products and listings, social news, events, politics, and more. According to Sindice, over 10 billion pieces of reusable information can already be found across 100 million pages.
Algorithms vs. Data: The Seesaw Effect 30 Oct 2008
Over the years I've noticed that the importance of algorithms and data tends to shift back and forth, depending on which at the time is hardest to duplicate (often from a business perspective). This effect seems to be caused by the availability or demand of one side increasing or decreasing, shifting the balance of importance to the other. At one point the world of software was dominated by the proprietary. The organization with the best software (backend, algorithms, etc) was the dominant entity and data (from say, a Web 2.0 perspective) was generally not the focus. This may have partly been the responsibility of a mindset formed during an era with very little storage space and before mass user activity on the Web.
Things have changed and the word proprietary has become a sort-of developer faux pas. Open source has caused a paradigm shift away from the old proprietary software models and has allowed organizations to focus their attention on the other side of the equation: data. As a result of this shift we saw the start of the Web 2.0 era (perhaps with a few years of padding before the phrase started floating around). Now many organizations focus on the data they acquire and how they can leverage it to their advantage. As a result we see many walled gardens in an attempt to preserve this advantage.
However, we may be seeing another shift, this time back to software once again. The Semantic Web calls for making data open and ubiquitous. This is a strong paradigm shift away from the walled garden mindset (and most people understand this, especially the business set). After writing about the cross-pollination of DBpedia and Freebase it occurred to me that the project with the most advanced proprietary information extraction algorithms would in a sense be the "dominant" project, because it would be able to leverage its software in a space where data is becoming a commodity.
Freebase has a secret sauce and that is probably their biggest advantage over competing projects. In the Semantic Web/Linked Data Web/Web 3.0 (whatever we feel like calling it at the time), data may decrease in value as it spreads and becomes more commoditized; at least in the original sense of value it once had: as a tool that only the walled gardens could leverage.
We are seeing the walls come down, possibly to be replaced once again by proprietary algorithms.
Cross-Pollinating DBpedia and Freebase 29 Oct 2008
Now that Freebase is available as Linked Data a big question that comes to mind is whether these two major projects will move to assimilate one another. DBpedia and Freebase – two endeavors primarily focused on curating unstructured and semi-structured data about everything and releasing it back into the wild (with structure) – get the bulk of their information from Wikipedia, so the amount of topical overlap is assumed to be extremely high. DBpedia gains new information when it extracts data from the latest Wikipedia dump, whereas Freebase, in addition to Wikipedia extractions, gains new information through its userbase of editors.
It is this incredible amount of overlap (with regard to content and purpose) which creates a sort of paradox, where it can be speculated that DBpedia and Freebase would both gain and lose value through efforts to cross-pollinate. Assimilating each other's updates would cause both to become "more complete" (in the same sense that an incrementing number is closer to infinity after each increment), thus gaining value. However, both may lose value as well if "value" is the perception of being "the most complete database about everything." Freebase may see a drop in userbase growth and participation if it becomes a mirror of DBpedia (or vice-versa) and the popularity once garnered by one project may shift towards the other, or away entirely.
This may not be an actual paradox since we're talking about mixing two different perceptions of value (value from the developer's point of view and value from the point of view of the project itself), but we must still look at it from both vantage points. This may simply be another issue of business interest vs. developer interest. All issues regarding popularity and ubiquity aside, cross-pollination is a Good Thing for the purposes of the Semantic Web and Linked Data in general.
Freebase Officially Linked Data with Release of RDF Service 29 Oct 2008
At ISWC2008 Freebase released its new RDF service for generating RDF representations of Freebase topics, allowing Freebase to be used as Linked Data! To obtain the RDF data for a topic send a GET request to http://rdf.freebase.com/rdf/some.topic.id where "some.topic.id" is replaced by the desired topic identifier (slashes in the identifier must be replaced by dots). Topic data can be represented as N3, RDF/XML or Turtle depending on the preferences expressed in your client's HTTP Accept header. Try it out with the Freebase topic Semantic Web.
You can also cater to clients that prefer HTML output by using the /ns end-point (http://rdf.freebase.com/ns/some.topic.id). The service performs the content negotiation automatically, delivering human-friendly HTML representations to Web browsers and redirecting clients that expect RDF to the /rdf URL (via a 302 redirect).
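A request to the /rdf end-point could be prepared along these lines. This is a sketch using Python's standard `urllib.request`; the helper names are hypothetical, the topic id `/en/semantic_web` is assumed for illustration, and the `text/turtle` media type is an assumption since the post does not say which Accept values the service recognizes. The request is only built here, not sent.

```python
import urllib.request

RDF_BASE = "http://rdf.freebase.com/rdf/"


def topic_rdf_url(topic_id):
    """Map an id such as /en/semantic_web to its RDF URL (slashes become dots)."""
    return RDF_BASE + topic_id.strip("/").replace("/", ".")


def build_rdf_request(topic_id, accept="text/turtle"):
    """Prepare a GET request whose Accept header states the preferred format."""
    return urllib.request.Request(topic_rdf_url(topic_id), headers={"Accept": accept})


req = build_rdf_request("/en/semantic_web")
print(req.full_url)              # prints: http://rdf.freebase.com/rdf/en.semantic_web
print(req.get_header("Accept"))  # prints: text/turtle

# To actually fetch the data, pass the request to urllib.request.urlopen(req).
```

Changing the `accept` argument is all that is needed to ask for a different serialization, since the format is chosen through content negotiation rather than through the URL.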
One downside is the data doesn't appear to link to external resources, in a sense walling itself in. It should be trivial to link the topics that came from Wikipedia back to Wikipedia as well as DBpedia (which would be killer, by the way).