WebCat Documentation

3.2 WebCat Wrappers

In our system, a wrapper has to be built for each Web catalog. In the existing system, this process is semi-automated. Building a WebCat wrapper is a layered process that consists of the following steps:

First, we identify the levels of the catalog and choose a typical HTML page for each level.

Second, we define the Abstract Syntax Tree (AST) for each level based on the typical page.

Third, we define WebOQL queries for each level based on the typical page.

Finally, we encapsulate the WebOQL queries in the wrapper.

We illustrate the process by showing how we built one wrapper. For this task, we chose the Eddie-Bauer (EB) clothing catalog and, within it, the first level of the catalog. Figure 3 presents this first level, with the information of interest circled: in this case, the links to level 2.

The next step is to establish the Abstract Syntax Tree (AST). In Figure 4 we present a simplified AST for this level. By simplified, we mean that we depict only the path of HTML tags that we have to follow in order to reach the information of interest.

In our representation of the AST, each arrow identifies an HTML tag; an arrow that starts from another arrow means that the corresponding tags are nested. In the representation, we kept the original HTML tags, except for the last level, which refers to another HTML page (the "level 2" tags in this case).
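To make the idea of nested tags concrete, the following Python sketch builds a simple tag tree from an HTML string using only the standard library. It is an illustration, not the tool used to produce Figure 4: the names Node, TreeBuilder, build_tree and outline are hypothetical, and the sample page in the usage note is invented.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they cannot contain children.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One tag in the tree, with its attributes, text and child tags."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a tree of Node objects that mirrors the nesting of the tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.stack = [self.root]  # currently open tags, innermost last

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open tag with this name; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return one indented line per tag, two spaces per nesting level."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines
```

For example, print("\n".join(outline(build_tree(page)))) prints one tag per line, indented by nesting depth, which gives a textual counterpart of the arrows in the figure.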

The information of interest, circled in Figure 3, consists of the links to the next level (level 2). The page layout suggests that all the links are grouped in a table, and the AST in Figure 4 confirms this. The difficulty is that this table is not easy to reach. Once we "navigate" to it, we have to iterate through the table’s rows in order to extract all the links pointing to the second level. The WebOQL query for this task is presented in Figure 5.

Figure 3

Figure 4

new_target := browse("http://www.eddiebauer.com/eb/ShopEB/frame_category.asp?LineID=312");

select [z.base, z.url, z.text]
from x in new_target via ^[tag = "table"],
y in ((((x'')!!)')!!!)', z in (((y')!)');

Figure 5: WebOQL query to extract the links to level 2

The query works as follows. First, we locate the first table: via ^[tag = "table"]. Second, we navigate "down" to the table of interest: ((((x'')!!)')!!!)'. Third, we iterate variable y through the rows of the table. Finally, variable z extracts from variable y (each row) the link to the second level: (((y')!)')&.
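For readers unfamiliar with WebOQL, the row-by-row extraction can be approximated in Python as sketched below. This is only an analogue under stated assumptions: it collects every link found inside any table row instead of following the positional path of the query, and the names RowLinkExtractor and extract_row_links, as well as the exact meaning given here to base, url and text, are hypothetical.

```python
from html.parser import HTMLParser


class RowLinkExtractor(HTMLParser):
    """Collects (base, url, text) for every <a href> inside a table row."""

    def __init__(self, base):
        super().__init__()
        self.base = base        # URL of the page being processed (assumed)
        self.row_depth = 0      # > 0 while inside at least one <tr>
        self.current = None     # [href, text parts] of the link being read
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row_depth += 1
        elif tag == "a" and self.row_depth > 0:
            href = dict(attrs).get("href")
            if href is not None:
                self.current = [href, []]

    def handle_data(self, data):
        if self.current is not None:
            self.current[1].append(data)

    def handle_endtag(self, tag):
        if tag == "tr" and self.row_depth > 0:
            self.row_depth -= 1
        elif tag == "a" and self.current is not None:
            url, parts = self.current
            self.links.append((self.base, url, "".join(parts).strip()))
            self.current = None


def extract_row_links(html, base):
    """Return the list of (base, url, text) triples found in table rows."""
    parser = RowLinkExtractor(base)
    parser.feed(html)
    parser.close()
    return parser.links
```

Links that appear outside a table row are ignored, which loosely mirrors the way the query restricts itself to the table of interest.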

Once the WebOQL queries for each level are created, we glue them together in the wrapper. The wrapper is a Java application that accesses the WebOQL query engine using the WebOQL API.
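The control flow of such a wrapper can be sketched as follows. The sketch is written in Python for brevity, whereas the actual wrapper is a Java application; it does not use the WebOQL API, and run_wrapper, query_level and the in-memory CATALOG are hypothetical stand-ins for the per-level queries.

```python
def run_wrapper(start_url, level_queries):
    """Apply one query per catalog level, feeding each result to the next."""
    frontier = [start_url]
    for query in level_queries:
        next_frontier = []
        for url in frontier:
            # Each query maps a page of one level to links of the next level.
            next_frontier.extend(query(url))
        frontier = next_frontier
    return frontier


# Stand-in for real queries: a tiny in-memory "catalog" (purely illustrative).
CATALOG = {
    "level1": ["men", "women"],
    "men": ["men/shirts"],
    "women": ["women/coats", "women/boots"],
}


def query_level(url):
    """Hypothetical per-level query: look the page up in the toy catalog."""
    return CATALOG.get(url, [])
```

With two levels, run_wrapper("level1", [query_level, query_level]) returns ["men/shirts", "women/coats", "women/boots"]. In a real wrapper each entry of level_queries would be a different query, one per catalog level.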

The WebOQL query above is a good example of generating wrappers in a high-level, declarative manner. In the end, the whole EB catalog wrapper needs only 7 WebOQL queries similar to the one in Figure 5, which is little work compared with writing wrappers by hand in a general-purpose programming language.

The most complicated part of the wrapper-building process is the extraction of the AST for each level. The catalog HTML pages are complicated, and not always regular or clean in terms of syntax. By regular, we refer to the indentation of the text and the presence of comments. By clean, we mean the correct nesting of tags. If just one tag is overlooked, the AST is no longer accurate, and if this tag is on the "extraction path", the outcome becomes unpredictable. Once we have the AST, the WebOQL queries are easy to write.
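A simple automatic check can help spot tags that are not cleanly nested before the AST is drawn. The Python sketch below is one possible approach and not part of the WebCat system; NestingChecker and check_nesting are hypothetical names. It is also stricter than HTML itself, which allows some closing tags (for example those of table cells) to be omitted.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag and therefore never need to be matched.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class NestingChecker(HTMLParser):
    """Reports closing tags that do not match the innermost open tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of open tag names, innermost last
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            return
        if tag in self.open_tags:
            # An inner tag was left open: report it, then recover by
            # closing everything up to and including the matching tag.
            inner = self.open_tags[-1]
            self.problems.append(
                f"</{tag}> found while <{inner}> is still open")
            while self.open_tags.pop() != tag:
                pass
        else:
            self.problems.append(f"</{tag}> has no matching opening tag")


def check_nesting(html):
    """Return a list of human-readable nesting problems (empty if clean)."""
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    for tag in checker.open_tags:
        checker.problems.append(f"<{tag}> is never closed")
    return checker.problems
```

For instance, check_nesting("<table><tr><td><b>x</td></tr></table>") reports that </td> was found while <b> is still open, which is the kind of overlooked tag that makes an AST inaccurate.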

Building a wrapper is moderately time-consuming. In our experience, a new wrapper requires 2-3 man-days of effort. The major burden in this process is establishing the AST.