Crawling Deep Web Entity Pages
Yeye He
Univ. of Wisconsin-Madison
Madison, WI 53706
heyeye@cs.wisc.edu
(Work done while the author was at Google; now at Microsoft Research.)
Dong Xin
Google Inc.
Mountain View, CA, 94043
dongxin@google.com
Venkatesh Ganti
Google Inc.
Mountain View, CA, 94043
vganti@google.com
Sriram Rajaraman
Google Inc.
Mountain View, CA, 94043
sriramr@google.com
Nirav Shah
Google Inc.
Mountain View, CA, 94043
nshah@google.com
ABSTRACT
Deep-web crawl is concerned with the problem of surfacing hid-
den content behind search interfaces on the Web. While many
deep-web sites maintain document-oriented textual content (e.g.,
Wikipedia, PubMed, Twitter, etc.), which has traditionally been the
focus of the deep-web literature, we observe that a significant por-
tion of deep-web sites, including almost all online shopping sites,
curate structured entities as opposed to text documents. Although
crawling such entity-oriented content is clearly useful for a variety
of purposes, existing crawling techniques optimized for document
oriented content are not best suited for entity-oriented sites. In this
work, we describe a prototype system we have built that specializes
in crawling entity-oriented deep-web sites. We propose techniques
tailored to tackle important subproblems including query genera-
tion, empty page filtering and URL deduplication in the specific
context of entity oriented deep-web sites. These techniques are ex-
perimentally evaluated and shown to be effective.
Categories and Subject Descriptors: H.2.8 Database Applications: Data Mining
Keywords: Deep-web crawl, web data, entities.
1. INTRODUCTION
Deep-web crawl refers to the problem of surfacing rich infor-
mation behind the web search interface of diverse sites across the
Web. It was estimated by various accounts that the deep-web has
as much as an order of magnitude more content than that of the
surface web [10, 14]. While crawling the deep-web can be im-
mensely useful for a variety of tasks including web indexing [15]
and data integration [14], crawling the deep-web content is known
to be hard. The difficulty in surfacing the deep-web has inspired a
long and fruitful line of research [3, 4, 5, 10, 14, 15, 17, 22, 23].
In this paper we focus on entity-oriented deep-web sites. These
sites curate structured entities and expose them through search in-
terfaces. Examples include almost all online shopping sites (e.g.,
ebay.com, amazon.com, etc.), where each entity is typically a prod-
uct that is associated with rich structured information like item
name, brand name, price, and so forth. Additional examples of
entity-oriented deep-web sites include movie sites, job listings, etc.
Note that this is to contrast with traditional document-oriented deep-
web sites that mostly maintain unstructured text documents (e.g.,
Wikipedia, PubMed, etc.).
Entity-oriented sites are very common and represent a signifi-
cant portion of the deep-web sites. The variety of tasks that entity-
oriented content enables makes the general problem of crawling
entities an important problem.
The practical use of our system is to crawl product entities from
a large number of online retailers for advertisement landing page
purposes. While the exact use of such entity content in advertisement is beyond the scope of this paper, the system requirement
is simple to state: We are provided as input a list of retailers’ web-
sites, and the objective is to crawl high-quality product entity pages
efficiently and effectively.
There are two key properties that set our problem apart from tra-
ditional deep-web crawling literature. First, we specifically focus
on the entity-oriented model, because of our interest in product en-
tities from online retailers, which are entity-oriented deep-web sites
in most cases. While existing general crawling techniques are still
applicable to some extent, the specific focus on entity-oriented sites
brings unique opportunities. Second, a large number of entity sites
(online retailers) are provided as input to our system, from which
entity pages are to be crawled. Note that with thousands of sites as
input, the realistic objective is to only obtain a representative con-
tent coverage of each site, instead of an exhaustive one. Ebay.com,
for example, has hundreds of thousands of listings returned for the
query “iphone”; the purpose of the system is not to obtain all iphone
listings, but only a representative few of these listings for ads land-
ing pages. This goal of obtaining representative coverage contrasts
with traditional deep-web crawl literature, which tends to deal with individual sites and focuses on obtaining exhaustive content coverage.
Our objective is more in line with the pioneering work [15], which
also operates at the Web scale but focuses on general web content.
We have developed a prototype system that is designed specifi-
cally to crawl representative entity content. The crawling process
is optimized by exploiting features unique to entity-oriented sites.
In this paper, we will focus on describing important components of
our system, including query generation, empty page filtering and
URL deduplication.
Our first contribution is to show how query logs and knowledge
bases (e.g., Freebase) can be leveraged to generate entity queries
for crawling. We demonstrate that classical techniques for infor-

[Figure 1: Overview of the entity-oriented crawl system. Components: URL template generation, query generation (driven by Freebase and query logs), URL generation, URL repository, URL scheduler, web document crawler, web document filter, and URL extraction/deduplication, applied to a list of deep-web sites to produce crawled web documents.]
mation retrieval and entity extraction can be used to robustly derive
relevant entities for each site, so that crawling bandwidth can be
utilized efficiently and effectively (Section 5).
The second contribution of this work is a new empty page filter-
ing algorithm that removes crawled pages that fail to retrieve any
entities. This seemingly simple problem is nontrivial due to the di-
verse nature of pages from different sites. We propose an intuitive
filtering approach, based on the observation that empty pages from
the same site tend to be highly similar (e.g., with the same page lay-
out and the same error message). In particular, we first submit to
each target site a small set of queries that are intentionally “bad”,
to retrieve a “reference set” of pages that are highly likely to be
empty. At crawl time, each newly crawled page is compared with
the reference set, and pages that are highly similar to the reference
set are predicted as empty and filtered out from further processing.
This weakly-supervised approach is shown to be robust across sites
on the Web (Section 6).
Additionally, we observe that the search result pages typically
expose additional deep-web content that deserves a second round of
crawling, which is an interesting topic that has been overlooked in
the literature. In order to obtain such content, we identify promis-
ing URLs on the result pages, from which further crawling can be
bootstrapped. Furthermore, we propose a URL deduplication al-
gorithm that prevents URLs with near-identical results from be-
ing crawled. Specifically, whereas existing techniques use con-
tent analysis for deduplication which works only after pages are
crawled, our approach identifies the semantic relevance of URL
query segments by analyzing URL patterns, so that URLs with sim-
ilar content that differ in non-essential ways (e.g., how retrieved en-
tities are rendered and sorted) can be deduplicated. This approach
is shown to be effective in preserving distinct content while reduc-
ing the bandwidth consumption (Section 7).
2. SYSTEM OVERVIEW
Deep-web sites | URL templates
ebay.com | www.ebay.com/sch/i.html?_nkw={query}&_sacat=All-Categories
chegg.com | www.chegg.com/search/?search_by={query}
beso.com | www.beso.com/classify?search_box=1&keyword={query}
... | ...
Table 1: Example URL templates
In this section we explain each component of our system in turn
at a very high level. The overall architecture of our system is illus-
trated in Figure 1.
URL template generation. At the top left corner the system
takes a list of domain names of deep-web sites as input, and an ex-
ample of which is illustrated in the first column of Table 1. The
URL template generation component then crawls the home-pages
of these sites, extracts and parses the web forms found on the home-
pages, and produces URL templates. Example URL templates are
illustrated in the second column of Table 1. Here the boldfaced
“{query}” represents a wild-card that can be substituted by any
keyword query (e.g., “iphone”); the resulting URL can be used to
crawl deep-web content as if the web forms are submitted.
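As a concrete illustration, the minimal sketch below (ours, not the system's actual implementation) fills a URL template with a keyword query; the ebay.com template and the example query are taken from Table 1 and Section 4.

```python
from urllib.parse import quote_plus

def fill_template(url_template: str, keyword_query: str) -> str:
    """Substitute the {query} wild-card with a URL-encoded keyword query."""
    return url_template.replace("{query}", quote_plus(keyword_query))

# Example using the ebay.com template from Table 1.
template = "www.ebay.com/sch/i.html?_nkw={query}&_sacat=All-Categories"
print(fill_template(template, "ipad 2"))
# -> www.ebay.com/sch/i.html?_nkw=ipad+2&_sacat=All-Categories
```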
Query generation and URL generation. The query generation
component at the lower left corner takes Freebase [6] and query logs as input, and outputs queries consistent with the semantics of each
deep-web site (for example, query “iphone” may be generated for
sites like amazon.com or ebay.com but not for tripadvisor.com).
Such queries can then be plugged into URL templates to sub-
stitute the “{query}” wild-card to produce final URLs, which will
be stored in a central URL repository. URLs can then be retrieved
from the URL repository and scheduled for crawling at runtime.
Empty page filter. It is inevitable that some URLs correspond-
ing to previously generated queries will retrieve empty or error
pages that contain no entity. Once pages are crawled, we move to
the next stage, where pages are inspected to filter out empty ones.
The process of filtering empty pages is critical (to avoid polluting
downstream operations), but also non-trivial, for different sites in-
dicate empty pages in disparate ways. The key insight here is that
empty pages from the same site tend to be highly similar. So we in-
tentionally retrieve a set of pages that are highly likely to be empty,
and filter out any crawled pages from the same site that are similar
to the reference set. Remaining pages with rich entity information
can then be used for a variety of purposes.
URL extraction/deduplication. Additionally, we observe that
a significant fraction of URLs on search result pages (henceforth
referred to as “second-level URLs”, to distinguish from the URLs
generated using URL template, which are “first-level URLs”) typi-
cally link to additional deep-web content. However, crawling all
second-level URLs indiscriminately is wasteful due to the large
number of second level URLs available. Accordingly, in this com-
ponent, we filter out second-level URLs that are less likely to lead
to deep-web content, and dynamically deduplicate remaining URLs
to obtain a much smaller set of “representative” URLs that can
be crawled efficiently. These URLs then iterate through the same
crawling process to obtain additional deep-web content.
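As a rough illustration of this idea, the sketch below collapses second-level URLs that differ only in presentation-related query segments to one representative URL. The fixed parameter list is a hypothetical stand-in: the actual system learns which URL query segments are semantically relevant by analyzing URL patterns rather than using a hard-coded list.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Hypothetical names of query segments that only affect presentation
# (sorting, rendering, pagination) and not the underlying result set.
PRESENTATION_PARAMS = {"sort", "order", "view", "page", "per_page"}

def canonicalize(url: str) -> str:
    """Strip presentation-only query segments so that URLs pointing to the
    same underlying result set map to the same canonical form."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k not in PRESENTATION_PARAMS]
    kept.sort()  # ordering of query segments is not semantically meaningful
    return urlunparse(parts._replace(query=urlencode(kept)))

def dedupe(urls):
    """Keep one representative URL per canonical form."""
    seen, representatives = set(), []
    for url in urls:
        key = canonicalize(url)
        if key not in seen:
            seen.add(key)
            representatives.append(url)
    return representatives
```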
3. RELATED WORK
The aforementioned problems studied in this work have been ex-
plored in the literature to various extents. In this section, we will
describe related work and discuss key differences between our ap-
proach in this work and existing techniques.
URL template generation. The problem of generating URL
templates has been studied in the literature in different contexts.
For example, authors in [4, 5] looked at the problem of identify-
ing searchable forms that are deep-web entry points, from which
templates can then be generated. The problem of parsing HTML
forms for URL templates has been addressed in [15]. In addition,
authors in [15, 20] studied the problem of assigning combinations
of values to multiple input fields in the search form so that content
can be retrieved from the deep-web effectively.
In our URL template generation component, search forms are
parsed using techniques similar to what was outlined in [15]. How-
ever, our analysis shows that generating URL templates by enumerating value combinations in multiple input fields can lead to an
inefficiently large number of templates and may not scale to the
number of websites that we are interested in crawling. As will be
discussed in Section 4, our main insight is to leverage the fact that

for entity-oriented sites, search forms predominantly employ one
text field for keyword queries, and additional input fields with good
“default value” behavior. Our URL template generation based on
this observation provides a tractable solution for a large number of
potentially complex search forms without significantly sacrificing
content coverage.
Query generation and URL generation. Prior art in query gen-
eration for deep web crawl focused on bootstrapping using text ex-
tracted from retrieved pages [15, 17, 22, 23]. That is, a set of seed
queries are first used to crawl pages. The retrieved pages are an-
alyzed for promising keywords, which are then used as queries to
crawl more pages recursively.
There are several key reasons why existing approaches are not
very well suited for our purpose. First of all, most previous work
[17, 22, 23] aims to optimize coverage of individual sites, that is,
to retrieve as much deep-web content as possible from one or a few
sites, where success is measured by percentage of content retrieved.
Authors in [3] go as far as suggesting to crawl using common stop
words “a, the” etc. to improve site coverage when these words are
indexed. We are in line with [15] in aiming to improve content cov-
erage for a large number of sites on the Web. Because of the sheer
number of deep-web sites crawled, we trade off complete coverage of individual sites for incomplete but “representative” coverage of a
large number of sites.
The second important difference is that since we are crawling
entity-oriented pages, the queries we come up with should be entity names instead of arbitrary phrase segments. As such, we lever-
age two important data sources, namely query logs and knowledge
bases. We will show that classical information retrieval and entity
extraction techniques can be used effectively for entity query gen-
eration. To our knowledge neither of these data sources has been
very well studied for deep-web crawl purposes.
Empty page filtering. Authors in [15] developed an interest-
ing notion of informativeness to filter search forms, which is com-
puted by clustering signatures that summarize content of crawled
pages. If crawled pages only have a few signature clusters, then
the search form is uninformative and will be pruned accordingly.
This approach addresses the problem of empty pages to an extent
by filtering uninformative forms. However, because this approach operates at the level of search forms / URL templates, it may still miss empty pages crawled using an informative URL template.
Since our system generates only one high-quality URL template
for each site, filtering at the granularity of URL templates is likely
to be ill-suited. Instead, our approach in this work filters at the page level: it automatically distinguishes empty pages from useful entity pages by utilizing intentionally generated bad
queries. To our knowledge this simple yet effective approach has
not been explored in the literature.
A novel page-level empty page filtering technique was described
in [20], which labels a result page as empty, if either certain prede-
fined error messages are detected from the “significant portion” of
result pages (e.g., the portion of the page formatted using frames,
or visually laid out at the center of the page), or a large fraction
of result pages are hashed to the same value. In comparison, our
approach obviates the need to recognize the significant portion of
result pages, and we use content signatures instead of hashing, which is more robust against minor page differences.
URL deduplication. The problem of URL deduplication has re-
ceived considerable attention in the context of web crawling and in-
dexing [2, 8, 13]. Current techniques consider two URLs as dupli-
cates if their contents are highly similar. These approaches, referred to as content-based URL deduplication, propose to first summarize page contents using content sketches [7] so that pages with similar content are grouped into clusters. URLs in the same cluster are then analyzed to learn URL transformation rules (for example, such a rule may state that www.cnn.com/story?id=num is equivalent to www.cnn.com/story_num).

[Figure 2: A typical search interface (ebay.com)]
In this paper, instead of looking at the traditional notion of page
similarity at the content level, we view page similarity at the se-
mantic level. That is, we view pages with entities from the same
result set (but perhaps containing different portions of the result,
or presenting with different sorting orders) as semantically simi-
lar, which can then be deduplicated. This significantly reduces the
number of crawls needed, and is in line with our goal of obtaining
representative content coverage given the sheer number of websites
crawled.
Using semantic similarity, our approach can analyze URL pat-
terns and deduplicate before pages are crawled. In comparison,
existing content-based deduplication not only requires pages to be
crawled first for content analysis; it is also unable to recognize the semantic similarity between URLs, and would therefore require billions more URLs to be crawled.
Authors in [15] pioneered the notion of presentation criteria, and
pointed out that crawling pages with content that differ only in pre-
sentation criteria are undesirable. Their approach, however, dedu-
plicates at the level of search forms and cannot be used to dedupli-
cate URLs directly.
4. URL TEMPLATE GENERATION
As input to our system, we are given a list of entity-oriented
deep-web sites that need to be crawled. Our first problem is to gen-
erate URL templates for each site that are equivalent to submitting
search forms, so that entities can be crawled directly using URL
templates.
As a concrete example, the search form from ebay.com is shown
in Figure 2, which represents a typical entity-oriented deep-web
search interface. Searching this form using query “ipad 2” without
changing the default value All Categories” of the drop-down box
is equivalent to using the URL template for ebay.com in Table 1,
with wild-card “{query}” replaced by “ipad+2”.
The exact technique that parses search forms is developed based
on techniques proposed in [15], which we will not discuss in detail
in the interest of space. However, our experience with URL tem-
plate generation leads to two interesting observations worth men-
tioning.
Our first observation is that for entity-oriented sites, the main
search form is almost always on home pages instead of somewhere
deep in the site. The search form is such an effective informa-
tion retrieval paradigm that websites are only too eager to expose it. A manual survey suggests that only 1 out of 100 randomly
sampled sites does not have the search form on the home page
(www.arke.nl). This obviates the need to use sophisticated tech-
niques to locate search forms deep in websites (e.g., [4, 5]).
The second observation is that in entity-oriented sites, search
forms predominantly use one main text input field to accept key-
word queries (a full 93% of sites surveyed have exactly one text
field to accept keyword queries). At the same time, other non-text
input fields exhibit good “default value” behavior (94% of sites out
of the 100 sampled are judged to be able to retrieve entities using
default values without sacrificing coverage).
Deep-web sites | Sample queries from query logs
ebay.com | cheap iPhone 4, lenovo x61, ...
bestbuy.com | hp touchpad review, price of sony vaio, ...
booking.com | where to stay in new york, hyatt seattle review, ...
hotels.com | hotels in london, san francisco hostels, ...
barnesandnobel.com | star trek books, stephen king insomnia, ...
chegg.com | harry potter book 1-7, dark knight returns, ...
Table 2: Example queries from query logs

Since enumerating value combinations in multiple input fields (e.g., [15, 20]) can lead to an inefficiently large number of templates and may not scale to the number of websites that we are interested in crawling, we leverage the aforementioned observations to simplify URL template generation by producing one template for each search form. Specifically, only the text field is allowed to vary (represented by a placeholder “{query}”) while other fields take on default values, as shown in Table 1. In our experience this provides a more tractable way to generate templates than the previous multi-value enumeration approach, and it works well in practice. We will not discuss details of template generation any further in the interest of space.
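The sketch below illustrates this one-template-per-form simplification under the assumption that the search form has already been parsed into a list of fields (the field representation and helper names here are hypothetical; the actual form parsing follows techniques similar to [15]).

```python
def build_template(action_url, fields):
    """Build a single URL template: the main text field becomes the {query}
    wild-card, all other fields keep their default values."""
    params = []
    for f in fields:
        if f["type"] == "text":          # the single main text field varies
            params.append(f"{f['name']}={{query}}")
        else:                            # other fields take on default values
            params.append(f"{f['name']}={f['default']}")
    return action_url + "?" + "&".join(params)

# Hypothetical parsed representation of the ebay.com search form.
fields = [
    {"name": "_nkw", "type": "text", "default": ""},
    {"name": "_sacat", "type": "select", "default": "All-Categories"},
]
print(build_template("www.ebay.com/sch/i.html", fields))
# -> www.ebay.com/sch/i.html?_nkw={query}&_sacat=All-Categories
```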
5. QUERY GENERATION
After obtaining URL templates for each site, the next step is to
fill relevant keyword queries into the “{query}” wild-card to produce final URLs. The challenge here is to come up with queries that match the semantics of the sites. A dictionary-based brute-force approach that sends every known entity to every site is clearly inefficient: crawling queries like "ipad" on tripadvisor.com does not make sense, and will most likely result in an empty/error page.
We utilize two data sources for query generation: query logs and
knowledge-bases. Our main observation here is that classical tech-
niques in information retrieval and entity extraction are already ef-
fective in generating entity queries.
5.1 Entity extraction from query logs
Query logs refer to keyword queries searched and URLs clicked
on search engines (e.g., Google). Conceptually, query logs are a good candidate source for query generation in deep-web crawls: a high number of clicks from a query to a certain site is an indication of the relevance between the query and the site, so submitting such queries through the site’s search interface for deep-web crawl makes intuitive sense.
We used Google’s query logs with the following normalized form
< keyword_query, url_clicked, num_times_clicked >. To
filter out undesirable queries (e.g., navigational queries), we only
consider queries that are clicked for at least 2 pages in the same site, at least 3 times.
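The sketch below shows one plausible reading of this filter over the normalized log triples; the exact counting used in the system is not spelled out here, so the interpretation of the thresholds is an assumption.

```python
from collections import defaultdict

def candidate_queries(log_triples, site):
    """Select candidate queries for one site from normalized log triples of
    the form (keyword_query, url_clicked, num_times_clicked)."""
    pages = defaultdict(set)   # query -> distinct pages clicked on this site
    clicks = defaultdict(int)  # query -> total clicks on this site
    for query, url, n in log_triples:
        if site in url:        # crude domain check, sufficient for a sketch
            pages[query].add(url)
            clicks[query] += n
    # Keep queries clicked for at least 2 pages in the same site, at least 3 times.
    return [q for q in pages if len(pages[q]) >= 2 and clicks[q] >= 3]
```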
Although query logs contain rich information, they are also too noisy
to be used directly for crawling. Specifically, queries in the query
logs tend to contain extraneous tokens in addition to the central
entity of interest. However, it is not uncommon for the search in-
terface on deep-web sites to expect only entity names as queries.
Figure 3 serves as an illustration of this problem. When feeding a
search engine query “HP touchpad reviews” into the search inter-
face on deep-web sites, (in this example, ebay.com), no results are
returned (Figure 3a), while searching using only the entity name
“HP touchpad” retrieves 6617 such products (Figure 3b).
This issue is not isolated. On the one hand, search en-
gine queries typically contain tokens in addition to entity mentions,
which either specify certain aspects of entities of interest (e.g., “HP
touchpad review”, “price of chrome book spec”), or are simply nat-
ural language fragments (e.g., “where to buy iPad 2”, “where to
stay in new york”). On the other hand, many search interfaces only expect clean entity queries. This is because a significant portion of entity sites employ the simple Keyword-And mechanism, where all tokens in the query have to be matched in a tuple before the tuple can be returned (thus the no-match problem in Figure 3a). Even if the other conceptual alternative, Keyword-Or, is used, the presence of extraneous tokens can promote spurious matches and lead to less desirable crawls.

[Figure 3: An example of a Keyword-And based search interface: (a) search with “hp touchpad reviews”; (b) search with “hp touchpad”.]
We reduce the aforementioned problem to entity extraction from
query logs. Or to view it the other way, we clean the search engine
queries by removing tokens that are not entity related (e.g., remov-
ing “reviews” from “HP touchpad reviews”, or “where to stay in”
from “where to stay in new york”, etc.).
In the absence of a comprehensive entity dictionary, it is hard
to tell if a token belongs to (ever-growing) entity names and their
name variations, abbreviations or even typos. At the same time, the
diverse nature of the query logs makes it all the more valuable, for
it captures a wide variety of entities and their name variations.
Inspired by an influential work on entity extraction from query
logs [18], we first identify common patterns in query logs that are
clearly not entity related (e.g., “reviews”, “specs”, “where to stay
in” etc.) by leveraging known entities. Query logs can then be
“cleaned” to extract entities by removing such patterns.
Specifically, we first obtained a dump of the Freebase data [6], a manually curated repository with about 22M entities. We then
find the maximum-length subsequence in each search engine query
that matches Freebase entities as an entity mention. The remaining
tokens are treated as entity-irrelevant prefix/suffix. We aggregate
distinct prefixes/suffixes across the query logs to obtain common patterns ordered by their frequency of occurrence. The most frequent
patterns are likely to be irrelevant to entities and need to be cleaned.
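The sketch below illustrates this pattern-aggregation idea in simplified form: it considers a single maximum-length entity match per query (whereas Example 1 below shows that ties can yield multiple matches), and the function names are ours.

```python
from collections import Counter

def longest_entity_span(tokens, freebase_entities):
    """Find the longest contiguous token subsequence that matches a Freebase
    entity name; return (start, end) or None."""
    for length in range(len(tokens), 0, -1):
        for start in range(len(tokens) - length + 1):
            if " ".join(tokens[start:start + length]) in freebase_entities:
                return start, start + length
    return None

def frequent_patterns(queries, freebase_entities, top_k=1000):
    """Aggregate prefixes/suffixes around entity mentions across the log and
    return the most frequent (likely entity-irrelevant) patterns."""
    counts = Counter()
    for q in queries:
        tokens = q.lower().split()
        span = longest_entity_span(tokens, freebase_entities)
        if span:
            start, end = span
            prefix = " ".join(tokens[:start])
            suffix = " ".join(tokens[end:])
            if prefix:
                counts[prefix] += 1
            if suffix:
                counts[suffix] += 1
    return [p for p, _ in counts.most_common(top_k)]
```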
EXAMPLE 1. Table 2 illustrates the sample queries with men-
tions of Freebase entity names underlined. Observe that entity recognition done this way is not perfect. For example, the query “where to stay in new york” for booking.com has two matches with Freebase entities: the match of “where to” to a musical release with that name, and the match of “new york” as a city name. Since both
matches are of length two, we obtain the false suffix “stay in new
york” (with an empty prefix) and the correct prefix “where to stay
in” (with an empty suffix), respectively. However, when all the prefixes/suffixes in the query logs are aggregated, the correct prefix “where to stay in” occurs much more frequently and should clearly stand out as an entity-irrelevant pattern.

Deep-web sites | Sample entities extracted from query logs
ebay.com | iPhone 4, lenovo, ...
bestbuy.com | hp touchpad, sony vaio, ...
booking.com | where to, new york, hyatt, seattle, review, ...
hotels.com | hotels, london, san francisco, ...
barnesandnobel.com | star trek, stephen king, ...
chegg.com | harry potter, dark knight, ...
Table 3: Example entities extracted for each deep-web site
Another potential problem is that Freebase may not contain all
possible entities. For example, in the query “hyatt seattle review” for booking.com, the first two tokens “hyatt seattle” refer to the Hyatt hotel in Seattle, which however is absent from Freebase. Using Freebase, the entities “hyatt” (a hotel company) and “seattle” (a location) will be recognized separately. However, with prefix/suffix aggregation, the suffix “review” is so frequent across the query logs that it will be recognized as an entity-irrelevant pattern. This
can be used to clean the query to produce entity “hyatt seattle”.
Our experiments using Google’s query log (to be discussed in
Section 8) will show that this simple approach of entity extraction
by pattern aggregation is effective in producing entity queries.
5.2 Entity expansion using knowledge-bases
While query logs provide a good set of initial seed entities, their
coverage for each site depends on the site’s popularity as well as
the item’s popularity (recall that the number of clicks is used to
predict the relevance between the query and the site). Even for
highly popular sites, there is a long tail of less popular items which
may not be captured by query logs.
On the other hand, we observe that there exist manually curated entity repositories (e.g., Freebase) that maintain entities in certain domains with very high coverage. For example, Freebase contains
comprehensive lists of city names, books, car models, movies, etc.
Such categories, if matched appropriately with relevant deep-web
sites, can be used to greatly improve crawl coverage. For exam-
ple, names of all locations/cities can be used to crawl travel sites
(e.g., tripadvisor.com, booking.com), housing sites (e.g., apartmen-
thomeliving.com, zillow.com); names of all known books can be
useful on book retailers (amazon.com, barnesandnoble.com), book
rental sites (chegg.com, bookrenter.com), so on and so forth. In
this section, we consider the problem of expanding the initial set of
entities using Freebase.
Recall that we can already extract Freebase entities from the
query logs for each site. Table 3, for example, contains lists of
entities extracted from the sample queries in Table 2. Thus, for
each site, we need to bootstrap from these seed entities to expand
to Freebase entity “types” that are relevant to each site’s semantics.
We borrow classical techniques from information retrieval: if
we view the multi-set of Freebase entity mentions for each site as a
document, and the list of entities in each Freebase type as a query,
then the classical term-frequency, inverse document frequency (TF-
IDF) ranking can be applied.
For each Freebase type, we use TF-IDF to produce a ranked list
of deep-web sites by their similarity scores. We then “threshold”
the sorted list using a relative score. That is, we include all sites
with scores above a fixed percentage, τ , of the highest similarity
score in the same Freebase type as matches. Empirical results in
Section 8 show that setting τ = 0.5 achieves good results and is
used in our system. This approach is significantly more effective
than other alternatives like Cosine or Jaccard Similarity [21], with
precision reaching 0.9 for τ = 0.5.
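A simplified sketch of this matching step is shown below, treating each site's multi-set of entity mentions as a document and a Freebase type's entity list as a query. The exact TF-IDF weighting used in the system is not specified here, so this particular scoring formula is an assumption.

```python
import math
from collections import Counter

def match_sites_to_type(type_entities, site_entity_mentions, tau=0.5):
    """Rank sites for one Freebase type by a TF-IDF style score, then keep
    sites scoring above tau times the best score.

    type_entities: list of entity names in the Freebase type (the "query").
    site_entity_mentions: {site: [entity mention, ...]} (a multi-set per site).
    """
    n_sites = len(site_entity_mentions)
    # Document frequency of each entity across sites.
    df = Counter()
    for mentions in site_entity_mentions.values():
        for e in set(mentions):
            df[e] += 1
    scores = {}
    for site, mentions in site_entity_mentions.items():
        tf = Counter(mentions)
        scores[site] = sum(
            tf[e] * math.log(n_sites / df[e]) for e in type_entities if e in tf
        )
    best = max(scores.values(), default=0.0)
    # Relative thresholding: keep sites within tau of the top score.
    return [s for s, sc in scores.items() if best > 0 and sc >= tau * best]
```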
6. EMPTY PAGE FILTERING
Once the final URLs are generated, pages can be crawled in a
fairly standard manner. The next important issue that arises is to
filter empty pages with no entity in them, in order to avoid pollut-
ing downstream pipelines. However, different sites can display dis-
parate error messages, from textual messages (e.g., “sorry, no items
is found”, “0 item matches your search”, etc.), to image-based er-
ror messages. While such messages are easily comprehensible for
humans, they are difficult to detect automatically across all different sites. The presence of dynamically generated ads content further
complicates the problem of detecting empty pages.
We develop a page-level filtering approach that filters out crawled
pages that fail to retrieve any entities. Our main observation is that
empty pages from the same site are typically extremely similar to
each other, while empty pages from different sites are normally
very different. Ideally we should obtain “sample” empty pages for
each deep-web site, with which newly crawled pages can be com-
pared. To do so, we generate a set of “background queries”, which are long strings of arbitrary characters that lack any semantic meaning
(e.g., “zzzzzzzzzzzzz”, or “xyzxyzxyzxyz”). Such queries, when
searched on deep-web sites, will almost certainly generate empty
pages. In practice, we generate N (10 in our experiments) such
background queries in order to be robust against the rare case where
a bad “background query” accidentally matches some records and
produces a non-empty page. We then crawl and store the corre-
sponding “background pages” as the reference set of empty pages.
At crawl time, each newly crawled page is compared with back-
ground pages to determine if the new page is actually empty.
Our content comparison mechanism uses a signature-based page summarization technique also used in [15].¹ The signature is essentially a set of tokens that are descriptive of the page content, but also robust against minor differences in page content (e.g., dynamically generated advertisements). We then calculate the Jaccard similarity between the signature of the newly crawled page and those of the “background pages”, as defined below.
DEFINITION 1. [21] Let $S_{p_1}$ and $S_{p_2}$ be the sets of tokens representing the signatures of the crawled pages $p_1$ and $p_2$. The Jaccard similarity between $S_{p_1}$ and $S_{p_2}$, denoted $Sim_{Jac}(S_{p_1}, S_{p_2})$, is defined as
$$Sim_{Jac}(S_{p_1}, S_{p_2}) = \frac{|S_{p_1} \cap S_{p_2}|}{|S_{p_1} \cup S_{p_2}|}$$
The similarity scores are averaged over the set of N “background
pages”, and if the average score is above a certain threshold θ, we
label the newly crawled page as empty. As we will show in ex-
periments, this approach is very effective in detecting empty pages
across different websites (with an overall precision of 0.89 and a
recall of 0.9).
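Putting the pieces of this section together, the following sketch shows the overall filtering logic. The signature function is a placeholder (the real signatures use a proprietary method, see footnote 1), and the threshold value here is illustrative only.

```python
import random
import string

def background_queries(n=10, length=16):
    """Generate N nonsense queries that are very unlikely to match any entity."""
    return ["".join(random.choices(string.ascii_lowercase, k=length))
            for _ in range(n)]

def signature(page_html):
    """Placeholder signature: the set of page tokens. Any content-summary
    technique (e.g., shingling [7]) could be substituted here."""
    return set(page_html.lower().split())

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 1.0

def is_empty_page(page_html, background_pages, theta=0.8):
    """Label a crawled page as empty if its average similarity to the
    reference set of background (known-empty) pages exceeds theta.
    The value of theta here is an illustrative assumption."""
    sig = signature(page_html)
    avg = sum(jaccard(sig, signature(b)) for b in background_pages) / len(background_pages)
    return avg >= theta
```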
7. SECOND-LEVEL CRAWL
7.1 The motivation for second level crawl
We observe that the first set of pages crawled using URL tem-
plates often contains URLs that link to additional deep-web content. In this work, we refer to the first set of pages obtained
through URL templates as “first-level pages” (because they are one
click away from the homepage), and those pages that are linked
from first-level pages as “second-level pages” (and the correspond-
ing URLs “second-level URLs”). There are at least a few common
cases in which crawling second-level pages can be useful.
¹ Our signatures are generated using a proprietary method also used in [15], the details of which are beyond the scope of this paper. In principle, well-known content summarization techniques like [7, 16] can be used in its place.

References
Brin, S. and Page, L. The anatomy of a large-scale hypertextual Web search engine.
Salton, G. and Buckley, C. Term-weighting approaches in automatic text retrieval.
Tan, P.-N., Steinbach, M., and Kumar, V. Introduction to Data Mining.
Bollacker, K., Evans, C., Paritosh, P., Sturge, T., and Taylor, J. Freebase: a collaboratively created graph database for structuring human knowledge.