Crawling Deep Web Entity Pages
Yeye He
Univ. of Wisconsin-Madison
Madison, WI 53706
heyeye@cs.wisc.edu
(Work done while the author was at Google; now at Microsoft Research.)
Dong Xin
Google Inc.
Mountain View, CA, 94043
dongxin@google.com
Venkatesh Ganti
Google Inc.
Mountain View, CA, 94043
vganti@google.com
Sriram Rajaraman
Google Inc.
Mountain View, CA, 94043
sriramr@google.com
Nirav Shah
Google Inc.
Mountain View, CA, 94043
nshah@google.com
ABSTRACT
Deep-web crawl is concerned with the problem of surfacing hid-
den content behind search interfaces on the Web. While many
deep-web sites maintain document-oriented textual content (e.g.,
Wikipedia, PubMed, Twitter, etc.), which has traditionally been the
focus of the deep-web literature, we observe that a significant por-
tion of deep-web sites, including almost all online shopping sites,
curate structured entities as opposed to text documents. Although
crawling such entity-oriented content is clearly useful for a variety
of purposes, existing crawling techniques optimized for document
oriented content are not best suited for entity-oriented sites. In this
work, we describe a prototype system we have built that specializes
in crawling entity-oriented deep-web sites. We propose techniques
tailored to tackle important subproblems including query genera-
tion, empty page filtering and URL deduplication in the specific
context of entity oriented deep-web sites. These techniques are ex-
perimentally evaluated and shown to be effective.
Categories and Subject Descriptors: H.2.8 Database Applications: Data Mining
Keywords: Deep-web crawl, web data, entities.
1. INTRODUCTION
Deep-web crawl refers to the problem of surfacing rich infor-
mation behind the web search interface of diverse sites across the
Web. It was estimated by various accounts that the deep-web has
as much as an order of magnitude more content than that of the
surface web [10, 14]. While crawling the deep-web can be im-
mensely useful for a variety of tasks including web indexing [15]
and data integration [14], crawling the deep-web content is known
to be hard. The difficulty in surfacing the deep-web has inspired a
long and fruitful line of research [3, 4, 5, 10, 14, 15, 17, 22, 23].
In this paper we focus on entity-oriented deep-web sites. These
sites curate structured entities and expose them through search in-
terfaces. Examples include almost all online shopping sites (e.g.,
ebay.com, amazon.com, etc.), where each entity is typically a prod-
uct that is associated with rich structured information like item
name, brand name, price, and so forth. Additional examples of
entity-oriented deep-web sites include movie sites, job listings, etc.
Note that this is to contrast with traditional document-oriented deep-
web sites that mostly maintain unstructured text documents (e.g.,
Wikipedia, PubMed, etc.).
Entity-oriented sites are very common and represent a signifi-
cant portion of the deep-web sites. The variety of tasks that entity-
oriented content enables makes the general problem of crawling
entities an important problem.
The practical use of our system is to crawl product entities from
a large number of online retailers for advertisement landing page
purposes. While the exact use of such entity content in advertisement is beyond the scope of this paper, the system requirement
is simple to state: We are provided as input a list of retailers’ web-
sites, and the objective is to crawl high-quality product entity pages
efficiently and effectively.
There are two key properties that set our problem apart from tra-
ditional deep-web crawling literature. First, we specifically focus
on the entity-oriented model, because of our interest in product en-
tities from online retailers, which are entity-oriented deep-web sites
in most cases. While existing general crawling techniques are still
applicable to some extent, the specific focus on entity-oriented sites
brings unique opportunities. Second, a large number of entity sites
(online retailers) are provided as input to our system, from which
entity pages are to be crawled. Note that with thousands of sites as
input, the realistic objective is to only obtain a representative con-
tent coverage of each site, instead of an exhaustive one. Ebay.com,
for example, has hundreds of thousands of listings returned for the
query “iphone”; the purpose of the system is not to obtain all iphone
listings, but only a representative few of these listings for ads land-
ing pages. This goal of obtaining representative coverage contrasts
with traditional deep-web crawl literature, which tends to deal with individual sites and focuses on obtaining exhaustive content coverage.
Our objective is more in line with the pioneering work [15], which
also operates at the Web scale but focuses on general web content.
We have developed a prototype system that is designed specifi-
cally to crawl representative entity content. The crawling process
is optimized by exploiting features unique to entity-oriented sites.
In this paper, we will focus on describing important components of
our system, including query generation, empty page filtering and
URL deduplication.
Our first contribution is to show how query logs and knowledge
bases (e.g., Freebase) can be leveraged to generate entity queries
for crawling. We demonstrate that classical techniques for infor-

[Figure 1: Overview of the entity-oriented crawl system. Components: URL template generation, query generation (driven by Freebase and query logs), URL generation, URL repository, URL scheduler, web document crawler, web document filter, and URL extraction/deduplication, applied to a list of deep-web sites to produce crawled web documents.]
mation retrieval and entity extraction can be used to robustly derive
relevant entities for each site, so that crawling bandwidth can be
utilized efficiently and effectively (Section 5).
The second contribution of this work is a new empty page filter-
ing algorithm that removes crawled pages that fail to retrieve any
entities. This seemingly simple problem is nontrivial due to the di-
verse nature of pages from different sites. We propose an intuitive
filtering approach, based on the observation that empty pages from
the same site tend to be highly similar (e.g., with the same page lay-
out and the same error message). In particular, we first submit to
each target site a small set of queries that are intentionally “bad”,
to retrieve a “reference set” of pages that are highly likely to be
empty. At crawl time, each newly crawled page is compared with
the reference set, and pages that are highly similar to the reference
set are predicted as empty and filtered out from further processing.
This weakly-supervised approach is shown to be robust across sites
on the Web (Section 6).
Additionally, we observe that the search result pages typically
expose additional deep-web content that deserves a second round of
crawling, which is an interesting topic that has been overlooked in
the literature. In order to obtain such content, we identify promis-
ing URLs on the result pages, from which further crawling can be
bootstrapped. Furthermore, we propose a URL deduplication al-
gorithm that prevents URLs with near-identical results from be-
ing crawled. Specifically, whereas existing techniques use con-
tent analysis for deduplication which works only after pages are
crawled, our approach identifies the semantic relevance of URL
query segments by analyzing URL patterns, so that URLs with sim-
ilar content that differ in non-essential ways (e.g., how retrieved en-
tities are rendered and sorted) can be deduplicated. This approach
is shown to be effective in preserving distinct content while reduc-
ing the bandwidth consumption (Section 7).
2. SYSTEM OVERVIEW
Deep-web sites | URL templates
ebay.com | www.ebay.com/sch/i.html?_nkw={query}&_sacat=All-Categories
chegg.com | www.chegg.com/search/?search_by={query}
beso.com | www.beso.com/classify?search_box=1&keyword={query}
... | ...
Table 1: Example URL templates
In this section we explain each component of our system in turn
at a very high level. The overall architecture of our system is illus-
trated in Figure 1.
URL template generation. At the top left corner the system
takes a list of domain names of deep-web sites as input, and an ex-
ample of which is illustrated in the first column of Table 1. The
URL template generation component then crawls the home-pages
of these sites, extracts and parses the web forms found on the home-
pages, and produces URL templates. Example URL templates are
illustrated in the second column of Table 1. Here the boldfaced
“{query}” represents a wild-card that can be substituted by any
keyword query (e.g., “iphone”); the resulting URL can be used to
crawl deep-web content as if the web forms are submitted.
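As a concrete illustration, the minimal sketch below (ours, not the system's actual implementation) fills a URL template with a keyword query; the ebay.com template and the example query are taken from Table 1 and Section 4.

```python
from urllib.parse import quote_plus

def fill_template(url_template: str, keyword_query: str) -> str:
    """Substitute the {query} wild-card with a URL-encoded keyword query."""
    return url_template.replace("{query}", quote_plus(keyword_query))

# Example using the ebay.com template from Table 1.
template = "www.ebay.com/sch/i.html?_nkw={query}&_sacat=All-Categories"
print(fill_template(template, "ipad 2"))
# -> www.ebay.com/sch/i.html?_nkw=ipad+2&_sacat=All-Categories
```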
Query generation and URL generation. The query generation
component at the lower left corner takes Freebase [6] and query logs as input, and outputs queries consistent with the semantics of each
deep-web site (for example, query “iphone” may be generated for
sites like amazon.com or ebay.com but not for tripadvisor.com).
Such queries can then be plugged into URL templates to sub-
stitute the “{query}” wild-card to produce final URLs, which will
be stored in a central URL repository. URLs can then be retrieved
from the URL repository and scheduled for crawling at runtime.
Empty page filter. It is inevitable that some URLs correspond-
ing to previously generated queries will retrieve empty or error
pages that contain no entity. Once pages are crawled, we move to
the next stage, where pages are inspected to filter out empty ones.
The process of filtering empty pages is critical (to avoid polluting
downstream operations), but also non-trivial, for different sites in-
dicate empty pages in disparate ways. The key insight here is that
empty pages from the same site tend to be highly similar. So we in-
tentionally retrieve a set of pages that are highly likely to be empty,
and filter out any crawled pages from the same site that are similar
to the reference set. Remaining pages with rich entity information
can then be used for a variety of purposes.
URL extraction/deduplication. Additionally, we observe that
a significant fraction of URLs on search result pages (henceforth
referred to as “second-level URLs”, to distinguish from the URLs
generated using URL template, which are “first-level URLs”) typi-
cally link to additional deep-web content. However, crawling all
second-level URLs indiscriminately is wasteful due to the large
number of second level URLs available. Accordingly, in this com-
ponent, we filter out second-level URLs that are less likely to lead
to deep-web content, and dynamically deduplicate remaining URLs
to obtain a much smaller set of “representative” URLs that can
be crawled efficiently. These URLs then iterate through the same
crawling process to obtain additional deep-web content.
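As a rough illustration of this idea, the sketch below collapses second-level URLs that differ only in presentation-related query segments to one representative URL. The fixed parameter list is a hypothetical stand-in: the actual system learns which URL query segments are semantically relevant by analyzing URL patterns rather than using a hard-coded list.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Hypothetical names of query segments that only affect presentation
# (sorting, rendering, pagination) and not the underlying result set.
PRESENTATION_PARAMS = {"sort", "order", "view", "page", "per_page"}

def canonicalize(url: str) -> str:
    """Strip presentation-only query segments so that URLs pointing to the
    same underlying result set map to the same canonical form."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k not in PRESENTATION_PARAMS]
    kept.sort()  # ordering of query segments is not semantically meaningful
    return urlunparse(parts._replace(query=urlencode(kept)))

def dedupe(urls):
    """Keep one representative URL per canonical form."""
    seen, representatives = set(), []
    for url in urls:
        key = canonicalize(url)
        if key not in seen:
            seen.add(key)
            representatives.append(url)
    return representatives
```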
3. RELATED WORK
The aforementioned problems studied in this work have been ex-
plored in the literature to various extents. In this section, we will
describe related work and discuss key differences between our ap-
proach in this work and existing techniques.
URL template generation. The problem of generating URL
templates has been studied in the literature in different contexts.
For example, authors in [4, 5] looked at the problem of identify-
ing searchable forms that are deep-web entry points, from which
templates can then be generated. The problem of parsing HTML
forms for URL templates has been addressed in [15]. In addition,
authors in [15, 20] studied the problem of assigning combinations
of values to multiple input fields in the search form so that content
can be retrieved from the deep-web effectively.
In our URL template generation component, search forms are
parsed using techniques similar to what was outlined in [15]. How-
ever, our analysis shows that generating URL templates by enumerating value combinations in multiple input fields can lead to an
inefficiently large number of templates and may not scale to the
number of websites that we are interested in crawling. As will be
discussed in Section 4, our main insight is to leverage the fact that

for entity-oriented sites, search forms predominantly employ one
text field for keyword queries, and additional input fields with good
“default value” behavior. Our URL template generation based on
this observation provides a tractable solution for a large number of
potentially complex search forms without significantly sacrificing
content coverage.
Query generation and URL generation. Prior art in query gen-
eration for deep web crawl focused on bootstrapping using text ex-
tracted from retrieved pages [15, 17, 22, 23]. That is, a set of seed
queries are first used to crawl pages. The retrieved pages are an-
alyzed for promising keywords, which are then used as queries to
crawl more pages recursively.
There are several key reasons why existing approaches are not
very well suited for our purpose. First of all, most previous work
[17, 22, 23] aims to optimize coverage of individual sites, that is,
to retrieve as much deep-web content as possible from one or a few
sites, where success is measured by percentage of content retrieved.
Authors in [3] go as far as suggesting to crawl using common stop
words “a, the” etc. to improve site coverage when these words are
indexed. We are in line with [15] in aiming to improve content cov-
erage for a large number of sites on the Web. Because of the sheer
number of deep-web sites crawled, we trade off complete coverage of individual sites for incomplete but “representative” coverage of a
large number of sites.
The second important difference is that since we are crawling
entity-oriented pages, the queries we come up with should be entity names instead of arbitrary phrase segments. As such, we lever-
age two important data sources, namely query logs and knowledge
bases. We will show that classical information retrieval and entity
extraction techniques can be used effectively for entity query gen-
eration. To our knowledge neither of these data sources has been
very well studied for deep-web crawl purposes.
Empty page filtering. Authors in [15] developed an interest-
ing notion of informativeness to filter search forms, which is com-
puted by clustering signatures that summarize content of crawled
pages. If crawled pages only have a few signature clusters, then
the search form is uninformative and will be pruned accordingly.
This approach addresses the problem of empty pages to an extent
by filtering uninformative forms. However, because this approach operates at the level of search forms / URL templates, it may still miss empty pages crawled using an informative URL template.
Since our system generates only one high-quality URL template
for each site, filtering at the granularity of URL templates is likely
to be ill-suited. Instead, our approach in this work filters at the page level: it automatically distinguishes empty pages from useful entity pages by utilizing intentionally generated bad
queries. To our knowledge this simple yet effective approach has
not been explored in the literature.
A novel page-level empty page filtering technique was described
in [20], which labels a result page as empty, if either certain prede-
fined error messages are detected from the “significant portion” of
result pages (e.g., the portion of the page formatted using frames,
or visually laid out at the center of the page), or a large fraction
of result pages are hashed to the same value. In comparison, our
approach obviates the need to recognize the significant portion of
result pages, and we use content signatures instead of hashing, which is more robust against minor page differences.
URL deduplication. The problem of URL deduplication has re-
ceived considerable attention in the context of web crawling and in-
dexing [2, 8, 13]. Current techniques consider two URLs as dupli-
cates if their contents are highly similar. These approaches, referred to as content-based URL deduplication, propose to first summarize page contents using content sketches [7] so that pages with similar content are grouped into clusters. URLs in the same cluster are then analyzed to learn URL transformation rules (for example, such a rule may state that www.cnn.com/story?id=num is equivalent to www.cnn.com/story_num).

[Figure 2: A typical search interface (ebay.com)]
In this paper, instead of looking at the traditional notion of page
similarity at the content level, we view page similarity at the se-
mantic level. That is, we view pages with entities from the same
result set (but perhaps containing different portions of the result,
or presenting with different sorting orders) as semantically simi-
lar, which can then be deduplicated. This significantly reduces the
number of crawls needed, and is in line with our goal of obtaining
representative content coverage given the sheer number of websites
crawled.
Using semantic similarity, our approach can analyze URL pat-
terns and deduplicate before pages are crawled. In comparison,
existing content-based deduplication not only requires pages to be
crawled first for content analysis; it is also unable to recognize the semantic similarity between URLs, and would therefore require billions more URLs to be crawled.
Authors in [15] pioneered the notion of presentation criteria, and
pointed out that crawling pages with content that differ only in pre-
sentation criteria are undesirable. Their approach, however, dedu-
plicates at the level of search forms and cannot be used to dedupli-
cate URLs directly.
4. URL TEMPLATE GENERATION
As input to our system, we are given a list of entity-oriented
deep-web sites that need to be crawled. Our first problem is to gen-
erate URL templates for each site that are equivalent to submitting
search forms, so that entities can be crawled directly using URL
templates.
As a concrete example, the search form from ebay.com is shown
in Figure 2, which represents a typical entity-oriented deep-web
search interface. Searching this form using query “ipad 2” without
changing the default value All Categories” of the drop-down box
is equivalent to using the URL template for ebay.com in Table 1,
with wild-card “{query}” replaced by “ipad+2”.
The exact technique that parses search forms is developed based
on techniques proposed in [15], which we will not discuss in detail
in the interest of space. However, our experience with URL tem-
plate generation leads to two interesting observations worth men-
tioning.
Our first observation is that for entity-oriented sites, the main
search form is almost always on home pages instead of somewhere
deep in the site. The search form is such an effective informa-
tion retrieval paradigm that websites are only too eager to expose it. A manual survey suggests that only 1 out of 100 randomly
sampled sites does not have the search form on the home page
(www.arke.nl). This obviates the need to use sophisticated tech-
niques to locate search forms deep in websites (e.g., [4, 5]).
The second observation is that in entity-oriented sites, search
forms predominantly use one main text input field to accept key-
word queries (a full 93% of sites surveyed have exactly one text
field to accept keyword queries). At the same time, other non-text
input fields exhibit good “default value” behavior (94% of sites out
of the 100 sampled are judged to be able to retrieve entities using
default values without sacrificing coverage).
Deep-web sites | Sample queries from query logs
ebay.com | cheap iPhone 4, lenovo x61, ...
bestbuy.com | hp touchpad review, price of sony vaio, ...
booking.com | where to stay in new york, hyatt seattle review, ...
hotels.com | hotels in london, san francisco hostels, ...
barnesandnobel.com | star trek books, stephen king insomnia, ...
chegg.com | harry potter book 1-7, dark knight returns, ...
Table 2: Example queries from query logs

Since enumerating value combinations in multiple input fields (e.g., [15, 20]) can lead to an inefficiently large number of templates and may not scale to the number of websites that we are interested in crawling, we leverage the aforementioned observations to simplify URL template generation by producing one template for each search form. Specifically, only the text field is allowed to vary (represented by a placeholder “{query}”) while other fields take on default values, as shown in Table 1. In our experience this provides a more tractable way to generate templates than the previous multi-value enumeration approach, and it works well in practice. We will not discuss details of template generation any further in the interest of space.
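The sketch below illustrates this one-template-per-form simplification under the assumption that the search form has already been parsed into a list of fields (the field representation and helper names here are hypothetical; the actual form parsing follows techniques similar to [15]).

```python
def build_template(action_url, fields):
    """Build a single URL template: the main text field becomes the {query}
    wild-card, all other fields keep their default values."""
    params = []
    for f in fields:
        if f["type"] == "text":          # the single main text field varies
            params.append(f"{f['name']}={{query}}")
        else:                            # other fields take on default values
            params.append(f"{f['name']}={f['default']}")
    return action_url + "?" + "&".join(params)

# Hypothetical parsed representation of the ebay.com search form.
fields = [
    {"name": "_nkw", "type": "text", "default": ""},
    {"name": "_sacat", "type": "select", "default": "All-Categories"},
]
print(build_template("www.ebay.com/sch/i.html", fields))
# -> www.ebay.com/sch/i.html?_nkw={query}&_sacat=All-Categories
```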
5. QUERY GENERATION
After obtaining URL templates for each site, the next step is to
fill relevant keyword queries into the “{query}” wild-card to produce final URLs. The challenge here is to come up with queries that match the semantics of the sites. A dictionary-based brute-force approach that sends every known entity to every site is clearly inefficient: crawling queries like "ipad" on tripadvisor.com does not make sense, and will most likely result in an empty/error page.
We utilize two data sources for query generation: query logs and
knowledge-bases. Our main observation here is that classical tech-
niques in information retrieval and entity extraction are already ef-
fective in generating entity queries.
5.1 Entity extraction from query logs
Query logs refer to keyword queries searched and URLs clicked
on search engines (e.g., Google). Conceptually, query logs are a good candidate source for query generation in deep-web crawls: a high number of clicks from a query to a certain site is an indication of the relevance between the query and the site, so submitting such queries through the site’s search interface for deep-web crawl makes intuitive sense.
We used Google’s query logs with the following normalized form
< keyword_query, url_clicked, num_times_clicked >. To
filter out undesirable queries (e.g., navigational queries), we only
consider queries that are clicked for at least 2 pages in the same site, at least 3 times.
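The sketch below shows one plausible reading of this filter over the normalized log triples; the exact counting used in the system is not spelled out here, so the interpretation of the thresholds is an assumption.

```python
from collections import defaultdict

def candidate_queries(log_triples, site):
    """Select candidate queries for one site from normalized log triples of
    the form (keyword_query, url_clicked, num_times_clicked)."""
    pages = defaultdict(set)   # query -> distinct pages clicked on this site
    clicks = defaultdict(int)  # query -> total clicks on this site
    for query, url, n in log_triples:
        if site in url:        # crude domain check, sufficient for a sketch
            pages[query].add(url)
            clicks[query] += n
    # Keep queries clicked for at least 2 pages in the same site, at least 3 times.
    return [q for q in pages if len(pages[q]) >= 2 and clicks[q] >= 3]
```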
Although query logs contain rich information, they are also too noisy
to be used directly for crawling. Specifically, queries in the query
logs tend to contain extraneous tokens in addition to the central
entity of interest. However, it is not uncommon for the search in-
terface on deep-web sites to expect only entity names as queries.
Figure 3 serves as an illustration of this problem. When feeding a
search engine query “HP touchpad reviews” into the search inter-
face on deep-web sites, (in this example, ebay.com), no results are
returned (Figure 3a), while searching using only the entity name
“HP touchpad” retrieves 6617 such products (Figure 3b).
This issue is not isolated. On the one hand, search en-
gine queries typically contain tokens in addition to entity mentions,
which either specify certain aspects of entities of interest (e.g., “HP
touchpad review”, “price of chrome book spec”), or are simply nat-
ural language fragments (e.g., “where to buy iPad 2”, “where to
stay in new york”). On the other hand, many search interfaces only expect clean entity queries. This is because a significant portion of entity sites employ the simple Keyword-And mechanism, where all tokens in the query have to be matched in a tuple before the tuple can be returned (thus the no-match problem in Figure 3a). Even if the other conceptual alternative, Keyword-Or, is used, the presence of extraneous tokens can promote spurious matches and lead to less desirable crawls.

[Figure 3: An example of a Keyword-And based search interface: (a) search with “hp touchpad reviews”; (b) search with “hp touchpad”.]
We reduce the aforementioned problem to entity extraction from
query logs. Or to view it the other way, we clean the search engine
queries by removing tokens that are not entity related (e.g., remov-
ing “reviews” from “HP touchpad reviews”, or “where to stay in”
from “where to stay in new york”, etc.).
In the absence of a comprehensive entity dictionary, it is hard
to tell if a token belongs to (ever-growing) entity names and their
name variations, abbreviations or even typos. At the same time, the
diverse nature of the query logs makes it all the more valuable, for
it captures a wide variety of entities and their name variations.
Inspired by an influential work on entity extraction from query
logs [18], we first identify common patterns in query logs that are
clearly not entity related (e.g., “reviews”, “specs”, “where to stay
in” etc.) by leveraging known entities. Query logs can then be
“cleaned” to extract entities by removing such patterns.
Specifically, we first obtained a dump of the Freebase data [6], a manually curated repository with about 22M entities. We then
find the maximum-length subsequence in each search engine query
that matches Freebase entities as an entity mention. The remaining
tokens are treated as entity-irrelevant prefix/suffix. We aggregate
distinct prefixes/suffixes across the query logs to obtain common patterns ordered by their frequency of occurrence. The most frequent
patterns are likely to be irrelevant to entities and need to be cleaned.
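The sketch below illustrates this pattern-aggregation idea in simplified form: it considers a single maximum-length entity match per query (whereas Example 1 below shows that ties can yield multiple matches), and the function names are ours.

```python
from collections import Counter

def longest_entity_span(tokens, freebase_entities):
    """Find the longest contiguous token subsequence that matches a Freebase
    entity name; return (start, end) or None."""
    for length in range(len(tokens), 0, -1):
        for start in range(len(tokens) - length + 1):
            if " ".join(tokens[start:start + length]) in freebase_entities:
                return start, start + length
    return None

def frequent_patterns(queries, freebase_entities, top_k=1000):
    """Aggregate prefixes/suffixes around entity mentions across the log and
    return the most frequent (likely entity-irrelevant) patterns."""
    counts = Counter()
    for q in queries:
        tokens = q.lower().split()
        span = longest_entity_span(tokens, freebase_entities)
        if span:
            start, end = span
            prefix = " ".join(tokens[:start])
            suffix = " ".join(tokens[end:])
            if prefix:
                counts[prefix] += 1
            if suffix:
                counts[suffix] += 1
    return [p for p, _ in counts.most_common(top_k)]
```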
EXAMPLE 1. Table 2 illustrates the sample queries with men-
tions of Freebase entity names underlined. Observe that entity recognition done this way is not perfect. For example, the query “where to stay in new york” for booking.com has two matches with Freebase entities: the match of “where to” to a musical release with that name, and the match of “new york” as a city name. Since both
matches are of length two, we obtain the false suffix “stay in new
york” (with an empty prefix) and the correct prefix “where to stay
in” (with an empty suffix), respectively. However, when all the prefixes/suffixes in the query logs are aggregated, the correct prefix “where to stay in” occurs much more frequently and should clearly stand out as an entity-irrelevant pattern.

Deep-web sites | Sample entities extracted from query logs
ebay.com | iPhone 4, lenovo, ...
bestbuy.com | hp touchpad, sony vaio, ...
booking.com | where to, new york, hyatt, seattle, review, ...
hotels.com | hotels, london, san francisco, ...
barnesandnobel.com | star trek, stephen king, ...
chegg.com | harry potter, dark knight, ...
Table 3: Example entities extracted for each deep-web site
Another potential problem is that Freebase may not contain all
possible entities. For example, in the query “hyatt seattle review” for booking.com, the first two tokens “hyatt seattle” refer to the Hyatt hotel in Seattle, which however is absent from Freebase. Using Freebase, the entities “hyatt” (a hotel company) and “seattle” (a location) will be recognized separately. However, with prefix/suffix aggregation, the suffix “review” is so frequent across the query logs that it will be recognized as an entity-irrelevant pattern. This
can be used to clean the query to produce entity “hyatt seattle”.
Our experiments using Google’s query log (to be discussed in
Section 8) will show that this simple approach of entity extraction
by pattern aggregation is effective in producing entity queries.
5.2 Entity expansion using knowledge-bases
While query logs provide a good set of initial seed entities, their
coverage for each site depends on the site’s popularity as well as
the item’s popularity (recall that the number of clicks is used to
predict the relevance between the query and the site). Even for
highly popular sites, there is a long tail of less popular items which
may not be captured by query logs.
On the other hand, we observe that there exist manually curated entity repositories (e.g., Freebase) that maintain entities in certain domains with very high coverage. For example, Freebase contains
comprehensive lists of city names, books, car models, movies, etc.
Such categories, if matched appropriately with relevant deep-web
sites, can be used to greatly improve crawl coverage. For exam-
ple, names of all locations/cities can be used to crawl travel sites
(e.g., tripadvisor.com, booking.com), housing sites (e.g., apartmen-
thomeliving.com, zillow.com); names of all known books can be
useful on book retailers (amazon.com, barnesandnoble.com), book
rental sites (chegg.com, bookrenter.com), so on and so forth. In
this section, we consider the problem of expanding the initial set of
entities using Freebase.
Recall that we can already extract Freebase entities from the
query logs for each site. Table 3, for example, contains lists of
entities extracted from the sample queries in Table 2. Thus, for
each site, we need to bootstrap from these seed entities to expand
to Freebase entity “types” that are relevant to each site’s semantics.
We borrow classical techniques from information retrieval: if
we view the multi-set of Freebase entity mentions for each site as a
document, and the list of entities in each Freebase type as a query,
then the classical term-frequency, inverse document frequency (TF-
IDF) ranking can be applied.
For each Freebase type, we use TF-IDF to produce a ranked list
of deep-web sites by their similarity scores. We then “threshold”
the sorted list using a relative score. That is, we include all sites
with scores above a fixed percentage, τ , of the highest similarity
score in the same Freebase type as matches. Empirical results in
Section 8 show that setting τ = 0.5 achieves good results and is
used in our system. This approach is significantly more effective
than other alternatives like Cosine or Jaccard Similarity [21], with
precision reaching 0.9 for τ = 0.5.
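A simplified sketch of this matching step is shown below, treating each site's multi-set of entity mentions as a document and a Freebase type's entity list as a query. The exact TF-IDF weighting used in the system is not specified here, so this particular scoring formula is an assumption.

```python
import math
from collections import Counter

def match_sites_to_type(type_entities, site_entity_mentions, tau=0.5):
    """Rank sites for one Freebase type by a TF-IDF style score, then keep
    sites scoring above tau times the best score.

    type_entities: list of entity names in the Freebase type (the "query").
    site_entity_mentions: {site: [entity mention, ...]} (a multi-set per site).
    """
    n_sites = len(site_entity_mentions)
    # Document frequency of each entity across sites.
    df = Counter()
    for mentions in site_entity_mentions.values():
        for e in set(mentions):
            df[e] += 1
    scores = {}
    for site, mentions in site_entity_mentions.items():
        tf = Counter(mentions)
        scores[site] = sum(
            tf[e] * math.log(n_sites / df[e]) for e in type_entities if e in tf
        )
    best = max(scores.values(), default=0.0)
    # Relative thresholding: keep sites within tau of the top score.
    return [s for s, sc in scores.items() if best > 0 and sc >= tau * best]
```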
6. EMPTY PAGE FILTERING
Once the final URLs are generated, pages can be crawled in a
fairly standard manner. The next important issue that arises is to
filter empty pages with no entity in them, in order to avoid pollut-
ing downstream pipelines. However, different sites can display dis-
parate error messages, from textual messages (e.g., “sorry, no items
is found”, “0 item matches your search”, etc.), to image-based er-
ror messages. While such messages are easily comprehensible for
humans, they are difficult to detect automatically across all different sites. The presence of dynamically generated ads content further
complicates the problem of detecting empty pages.
We develop a page-level filtering approach that filters out crawled
pages that fail to retrieve any entities. Our main observation is that
empty pages from the same site are typically extremely similar to
each other, while empty pages from different sites are normally
very different. Ideally we should obtain “sample” empty pages for
each deep-web site, with which newly crawled pages can be com-
pared. To do so, we generate a set of “background queries”, which are long strings of arbitrary characters that lack any semantic meaning
(e.g., “zzzzzzzzzzzzz”, or “xyzxyzxyzxyz”). Such queries, when
searched on deep-web sites, will almost certainly generate empty
pages. In practice, we generate N (10 in our experiments) such
background queries in order to be robust against the rare case where
a bad “background query” accidentally matches some records and
produces a non-empty page. We then crawl and store the corre-
sponding “background pages” as the reference set of empty pages.
At crawl time, each newly crawled page is compared with back-
ground pages to determine if the new page is actually empty.
Our content comparison mechanism uses a signature-based page summarization technique also used in [15].¹ The signature is essentially a set of tokens that are descriptive of the page content, but also robust against minor differences in page content (e.g., dynamically generated advertisements). We then calculate the Jaccard similarity between the signature of the newly crawled page and those of the “background pages”, as defined below.
DEFINITION 1. [21] Let $S_{p_1}$ and $S_{p_2}$ be the sets of tokens representing the signatures of the crawled pages $p_1$ and $p_2$. The Jaccard similarity between $S_{p_1}$ and $S_{p_2}$, denoted $Sim_{Jac}(S_{p_1}, S_{p_2})$, is defined as
$$Sim_{Jac}(S_{p_1}, S_{p_2}) = \frac{|S_{p_1} \cap S_{p_2}|}{|S_{p_1} \cup S_{p_2}|}$$
The similarity scores are averaged over the set of N “background
pages”, and if the average score is above a certain threshold θ, we
label the newly crawled page as empty. As we will show in ex-
periments, this approach is very effective in detecting empty pages
across different websites (with an overall precision of 0.89 and a
recall of 0.9).
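Putting the pieces of this section together, the following sketch shows the overall filtering logic. The signature function is a placeholder (the real signatures use a proprietary method, see footnote 1), and the threshold value here is illustrative only.

```python
import random
import string

def background_queries(n=10, length=16):
    """Generate N nonsense queries that are very unlikely to match any entity."""
    return ["".join(random.choices(string.ascii_lowercase, k=length))
            for _ in range(n)]

def signature(page_html):
    """Placeholder signature: the set of page tokens. Any content-summary
    technique (e.g., shingling [7]) could be substituted here."""
    return set(page_html.lower().split())

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 1.0

def is_empty_page(page_html, background_pages, theta=0.8):
    """Label a crawled page as empty if its average similarity to the
    reference set of background (known-empty) pages exceeds theta.
    The value of theta here is an illustrative assumption."""
    sig = signature(page_html)
    avg = sum(jaccard(sig, signature(b)) for b in background_pages) / len(background_pages)
    return avg >= theta
```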
7. SECOND-LEVEL CRAWL
7.1 The motivation for second level crawl
We observe that the first set of pages crawled using URL tem-
plates often contains URLs that link to additional deep-web content. In this work, we refer to the first set of pages obtained
through URL templates as “first-level pages” (because they are one
click away from the homepage), and those pages that are linked
from first-level pages as “second-level pages” (and the correspond-
ing URLs “second-level URLs”). There are at least a few common
cases in which crawling second-level pages can be useful.
¹ Our signatures are generated using a proprietary method also used in [15], the details of which are beyond the scope of this paper. In principle, well-known content summarization techniques like [7, 16] can be used in its place.

References
Brin, S. and Page, L. The anatomy of a large-scale hypertextual Web search engine.
Salton, G. and Buckley, C. Term-weighting approaches in automatic text retrieval.
Tan, P.-N., Steinbach, M., and Kumar, V. Introduction to Data Mining.
Bollacker, K., Evans, C., Paritosh, P., Sturge, T., and Taylor, J. Freebase: a collaboratively created graph database for structuring human knowledge.