Tuesday, May 12, 2009
The Internet is an amazing accomplishment in several respects. It contains an enormous amount of information; it has revolutionized business, education, and everyday life; and it does all of this while still running on a largely outdated infrastructure.
The Internet, these days, is caught in an interesting cycle. The more widespread the adoption of the Internet is, the more information gets added to it. The more information it holds, the more useful it is and the more people use it. The trouble with the growing usage of the Internet and the growing amount of information it allows access to is that the structure of the Internet isn't keeping up. In other words, it's not much easier to gain access to the information the web holds now than it was when it first started. Sure, the web has undergone some helpful changes, such as the ongoing standardization of the way pages are coded and rendered. And there are some promising changes on the horizon. For example, the implementation of HTML 5 will make it easier to code and—by extension—parse information semantically. Similarly, Google is continually developing projects (FriendConnect, OpenSocial, Google Webmaster Tools, etc.) designed to standardize the way information is stored and used on the Internet. There is also a small but growing trend among sites to open up systematic access to their data by way of an API, and there's an ongoing movement to implement semantic technologies such as RDF and OWL.
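To see why semantic markup makes information easier to parse, consider a page that tags its main content with HTML 5's proposed `<article>` element. A program can then pull out the meaningful text and skip the sidebars and chrome. The snippet below is a minimal sketch using Python's standard-library parser; the sample page and the `ArticleExtractor` class are illustrative inventions, not part of any specification.

```python
from html.parser import HTMLParser

# A hypothetical snippet of semantically marked-up HTML 5.
PAGE = """
<article>
  <h1>Draft Specification Published</h1>
  <p>The working group released a new draft.</p>
</article>
<aside><p>Unrelated sidebar text.</p></aside>
"""

class ArticleExtractor(HTMLParser):
    """Collects text that appears inside <article> elements only."""
    def __init__(self):
        super().__init__()
        self.depth = 0    # current nesting level of <article> tags
        self.chunks = []  # text fragments found inside articles

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "article":
            self.depth -= 1

    def handle_data(self, data):
        # Keep non-blank text only when we are inside an <article>.
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())

parser = ArticleExtractor()
parser.feed(PAGE)
print(parser.chunks)
```

With purely presentational markup (nested `<div>`s and `<table>`s), there is no reliable signal like this to key on, which is exactly the gap semantic elements are meant to close.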
The tricky thing about each of these strategies is that they are fairly costly to actually put into production. For example, the ideas behind HTML 5 were generated in 2004, and it took four years to publish a working draft of the specification. Once a specification for HTML 5 has been agreed upon, it will still take countless hours of development to get browsers to properly render the syntax, countless hours of research by website designers and programmers to learn and adopt the new syntax, and countless years before all publicly available websites are using HTML 5. Even adopting an API technology like OpenSocial can require months or years of changes to a site's data infrastructure before it can be rolled out; and that's after Google invested its own time and money to develop and market the API in the first place. So it would seem that—in order for the semantic web to happen—there needs to be one or more technologies that allow companies to expose their data in a systematic way, at a low cost, and without a lot of implementation work; and there are a couple of companies working toward filling these requirements.
Mozenda is a data management platform that allows users to combine and use data from multiple sources. With Mozenda, users can set up agents that routinely extract data from nearly any website. The information, once collected, is stored on one of Mozenda's secure servers and can be exported in a number of file formats or systematically accessed through Mozenda's API. By allowing users to both gather data and access it through a call, Mozenda has essentially created the ability to create an API for nearly any website.
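The extract-then-export pattern such a service automates can be sketched in a few lines. To be clear, this is not Mozenda's actual API or agent format—just a stdlib-only illustration of the idea: scrape structured records out of a page, then expose them in multiple formats. The sample page, field names, and `RowScraper` class are all hypothetical.

```python
import csv
import io
import json
from html.parser import HTMLParser

# Hypothetical source page: a simple table of products.
PAGE = """
<table>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>4.50</td></tr>
</table>
"""

class RowScraper(HTMLParser):
    """Turns each <tr> of a table into a list of cell strings."""
    def __init__(self):
        super().__init__()
        self.rows = []        # completed rows
        self.row = None       # row currently being built
        self.in_cell = False  # inside a <td>?

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None
        elif tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.row.append(data.strip())

scraper = RowScraper()
scraper.feed(PAGE)
records = [{"name": n, "price": float(p)} for n, p in scraper.rows]

# "Export" the collected data in two formats, as a hosted service might.
as_json = json.dumps(records)
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(records)
print(as_json)
print(buf.getvalue())
```

The value a hosted platform adds on top of a sketch like this is scheduling the extraction, storing the results, and serving them behind a single API call—so the site being scraped effectively gains an API without doing any implementation work itself.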