
Showing papers on "Web page" published in 2006


Book
03 Jul 2006
TL;DR: Any business seriously interested in improving its rankings in the major search engines can benefit from the clear examples, sample code, and list of resources provided.
Abstract: Why doesn't your home page appear on the first page of search results, even when you query your own name? How do other web pages always appear at the top? What creates these powerful rankings? And how? The first book ever about the science of web page rankings, Google's PageRank and Beyond supplies the answers to these and other questions. The book serves two very different audiences: the curious science reader and the technical computational reader. The chapters build in mathematical sophistication, so that the first five are accessible to the general academic reader. While other chapters are much more mathematical in nature, each one contains something for both audiences. For example, the authors include entertaining asides such as how search engines make money and how the Great Firewall of China influences research. The book includes an extensive background chapter designed to help readers learn more about the mathematics of search engines, and it contains several MATLAB codes and links to sample web data sets. The philosophy throughout is to encourage readers to experiment with the ideas and algorithms in the text. Any business seriously interested in improving its rankings in the major search engines can benefit from the clear examples, sample code, and list of resources provided. The book also offers many illustrative examples and entertaining asides, MATLAB code, an accessible and informal style, and a complete, self-contained mathematics review section.
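The book's central algorithm lends itself to a compact illustration. The following Python sketch of the standard PageRank power iteration is not the book's MATLAB code, just a minimal example (dense adjacency matrix, uniform teleportation, assumed damping factor) of the kind of experiment the authors encourage:

```python
import numpy as np

def pagerank(adjacency, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a dense adjacency matrix.

    adjacency[i, j] = 1 if page i links to page j.
    Dangling pages (no out-links) are treated as linking to every page.
    """
    n = adjacency.shape[0]
    out_degree = adjacency.sum(axis=1)
    # Row-stochastic hyperlink matrix; dangling rows become uniform.
    H = np.where(out_degree[:, None] > 0,
                 adjacency / np.maximum(out_degree[:, None], 1),
                 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = damping * rank @ H + (1 - damping) / n
        if np.abs(new_rank - rank).sum() < tol:
            return new_rank
        rank = new_rank
    return rank

# Tiny 4-page web: 0 -> 1,2 ; 1 -> 2 ; 2 -> 0 ; 3 -> 2
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)
print(pagerank(A))
```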

1,548 citations


Journal ArticleDOI
TL;DR: If effectively deployed, wikis, blogs and podcasts could offer a way to enhance students', clinicians' and patients' learning experiences, and deepen levels of learners' engagement and collaboration within digital learning environments.
Abstract: We have witnessed a rapid increase in the use of Web-based 'collaborationware' in recent years. These Web 2.0 applications, particularly wikis, blogs and podcasts, have been increasingly adopted by many online health-related professional and educational services. Because of their ease of use and rapidity of deployment, they offer the opportunity for powerful information sharing and ease of collaboration. Wikis are Web sites that can be edited by anyone who has access to them. The word 'blog' is a contraction of 'Web Log' – an online Web journal that can offer a resource rich multimedia environment. Podcasts are repositories of audio and video materials that can be "pushed" to subscribers, even without user intervention. These audio and video files can be downloaded to portable media players that can be taken anywhere, providing the potential for "anytime, anywhere" learning experiences (mobile learning). Wikis, blogs and podcasts are all relatively easy to use, which partly accounts for their proliferation. The fact that there are many free and Open Source versions of these tools may also be responsible for their explosive growth. Thus it would be relatively easy to implement any or all within a Health Professions' Educational Environment. Paradoxically, some of their disadvantages also relate to their openness and ease of use. With virtually anybody able to alter, edit or otherwise contribute to the collaborative Web pages, it can be problematic to gauge the reliability and accuracy of such resources. While arguably, the very process of collaboration leads to a Darwinian type 'survival of the fittest' content within a Web page, the veracity of these resources can be assured through careful monitoring, moderation, and operation of the collaborationware in a closed and secure digital environment. Empirical research is still needed to build our pedagogic evidence base about the different aspects of these tools in the context of medical/health education. If effectively deployed, wikis, blogs and podcasts could offer a way to enhance students', clinicians' and patients' learning experiences, and deepen levels of learners' engagement and collaboration within digital learning environments. Therefore, research should be conducted to determine the best ways to integrate these tools into existing e-Learning programmes for students, health professionals and patients, taking into account the different, but also overlapping, needs of these three audience classes and the opportunities of virtual collaboration between them. Of particular importance is research into novel integrative applications, to serve as the "glue" to bind the different forms of Web-based collaborationware synergistically in order to provide a coherent wholesome learning experience.

1,219 citations


Journal ArticleDOI
TL;DR: It is found that visual appeal can be assessed within 50 ms, suggesting that web designers have about 50 ms to make a good first impression.
Abstract: Three studies were conducted to ascertain how quickly people form an opinion about web page visual appeal. In the first study, participants twice rated the visual appeal of web homepages presented for 500 ms each. The second study replicated the first, but participants also rated each web page on seven specific design dimensions. Visual appeal was found to be closely related to most of these. Study 3 again replicated the 500 ms condition and added a 50 ms condition using the same stimuli, to determine whether the first impression may be interpreted as a 'mere exposure effect' (Zajonc 1980). Throughout, visual appeal ratings were highly correlated from one phase to the next, as were ratings between the 50 ms and 500 ms conditions. Thus, visual appeal can be assessed within 50 ms, suggesting that web designers have about 50 ms to make a good first impression.

950 citations


Journal Article
TL;DR: These sections of the Web break away from the page metaphor and are predicated on microcontent, which means that reading and searching this world is significantly different from searching the entire Web world.
Abstract: A blog's chronological structure implies a different rhetorical purpose than a Web page, which has no inherent timeliness. That altered rhetoric helped shape a different audience, the blogging public, with its emergent social practices of blogrolling, extensive hyperlinking, and discussion threads attached not to pages but to content chunks within them. Reading and searching this world is significantly different from searching the entire Web world. Still, social software does not indicate a sharp break with the old but, rather, the gradual emergence of a new type of practice. These sections of the Web break away from the page metaphor. Rather than following the notion of the Web as book, they are predicated on microcontent. Blogs are about posts, not pages. Wikis are streams of conversation, revision, amendment, and truncation. Podcasts are shuttled between Web sites, RSS feeds, and diverse players. These content blocks can be saved, summarized, addressed, copied, quoted, and built into new projects. Browsers respond to this boom in …

881 citations


Journal ArticleDOI
TL;DR: This paper surveys the major Web data extraction approaches and compares them in three dimensions: the task domain, the automation degree, and the techniques used, arguing that these criteria provide qualitative measures for evaluating various IE approaches.
Abstract: The Internet presents a huge amount of useful information, usually formatted for human readers, which makes it difficult to extract relevant data from various sources. Therefore, the availability of robust, flexible information extraction (IE) systems that transform Web pages into program-friendly structures such as a relational database will become a great necessity. Although many approaches for data extraction from Web pages have been developed, there has been limited effort to compare such tools. Unfortunately, in only a few cases can the results generated by distinct tools be directly compared, since the addressed extraction tasks are different. This paper surveys the major Web data extraction approaches and compares them in three dimensions: the task domain, the automation degree, and the techniques used. The criteria of the first dimension explain why an IE system fails to handle some Web sites of particular structures. The criteria of the second dimension measure the degree of automation of IE systems. The criteria of the third dimension classify IE systems based on the techniques used. We believe these criteria provide qualitative measures for evaluating various IE approaches.
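To make the goal concrete, here is a toy sketch (not any specific approach from the survey) of the kind of transformation an IE system performs: a wrapper tied to an assumed page template maps HTML into relational tuples. The HTML structure and field names are invented for the example.

```python
import re

# Hypothetical template-generated listing page.
html = """
<div class="item"><span class="name">USB cable</span><span class="price">$4.99</span></div>
<div class="item"><span class="name">Keyboard</span><span class="price">$19.50</span></div>
"""

# A wrapper: a pattern tied to the page template that maps HTML into tuples.
row_pattern = re.compile(
    r'<div class="item"><span class="name">(.*?)</span>'
    r'<span class="price">\$([0-9.]+)</span></div>')

records = [(name, float(price)) for name, price in row_pattern.findall(html)]
print(records)  # [('USB cable', 4.99), ('Keyboard', 19.5)]
```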

855 citations


Journal ArticleDOI
01 Jan 2006
TL;DR: In this paper, the authors report results from research that examines characteristics and changes in Web searching from nine studies of five Web search engines based in the US and Europe. They find that users are viewing fewer result pages, that searchers on US-based Web search engines use more query operators than searchers on European-based engines, that there are statistically significant differences in the use of Boolean operators and in result pages viewed, and that one cannot necessarily apply results from studies of one particular Web search engine to another.
Abstract: The Web and especially major Web search engines are essential tools in the quest to locate online information for many people. This paper reports results from research that examines characteristics and changes in Web searching from nine studies of five Web search engines based in the US and Europe. We compare interactions occurring between users and Web search engines from the perspectives of session length, query length, query complexity, and content viewed among the Web search engines. The results of our research show that (1) users are viewing fewer result pages, (2) searchers on US-based Web search engines use more query operators than searchers on European-based search engines, (3) there are statistically significant differences in the use of Boolean operators and result pages viewed, and (4) one cannot necessarily apply results from studies of one particular Web search engine to another Web search engine. The widespread use of Web search engines, employment of simple queries, and decreased viewing of result pages may have resulted from algorithmic enhancements by Web search engine companies. We discuss the implications of the findings for the development of Web search engines and design of online content.
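The interaction measures compared across engines (session length, query length, operator usage) can be computed directly from a query transaction log. A minimal sketch, assuming a hypothetical tab-separated log of session id and query text:

```python
from collections import defaultdict

# Assumed log format: one "session_id<TAB>query" line per submitted query.
log_lines = [
    "s1\tweb search behavior",
    "s1\t\"query length\" AND operators",
    "s2\tcheap flights",
]

sessions = defaultdict(list)
for line in log_lines:
    session_id, query = line.split("\t", 1)
    sessions[session_id].append(query)

queries = [q for qs in sessions.values() for q in qs]
avg_session_length = sum(len(qs) for qs in sessions.values()) / len(sessions)
avg_query_length = sum(len(q.split()) for q in queries) / len(queries)
boolean_share = sum(any(op in q.split() for op in ("AND", "OR", "NOT"))
                    for q in queries) / len(queries)

print(avg_session_length, avg_query_length, boolean_share)
```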

810 citations


Proceedings ArticleDOI
23 May 2006
TL;DR: Some previously-undescribed techniques for automatically detecting spam pages are considered, and the effectiveness of these techniques in isolation and when aggregated using classification algorithms is examined.
Abstract: In this paper, we continue our investigations of "web spam": the injection of artificially-created pages into the web in order to influence the results from search engines and to drive traffic to certain pages for fun or profit. This paper considers some previously-undescribed techniques for automatically detecting spam pages, and examines the effectiveness of these techniques in isolation and when aggregated using classification algorithms. When combined, our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages (13.8%) in our judged collection of 17,168 pages, while misidentifying 526 spam and non-spam pages (3.1%).
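The general pattern is to compute per-page heuristic features and feed them to a classifier. The sketch below is purely illustrative; the features, word list, and threshold are assumptions, not the authors' heuristics:

```python
# Hypothetical content heuristics; the word list and cut-off are assumptions.
SPAM_PRONE_WORDS = {"free", "cheap", "casino", "loan", "bonus"}

def spam_features(title, body):
    """A few content-based heuristics of the kind a classifier would aggregate."""
    words = body.lower().split()
    return {
        "title_words": len(title.split()),
        "avg_word_len": sum(map(len, words)) / max(len(words), 1),
        "spam_word_fraction": sum(w in SPAM_PRONE_WORDS for w in words) / max(len(words), 1),
        "repetition": len(words) / max(len(set(words)), 1),
    }

def toy_classifier(features):
    # Stand-in for a trained classifier: a single hand-picked decision stump.
    return features["spam_word_fraction"] > 0.2

page_title = "Cheap loans and free casino bonuses"
page_body = "free free cheap loan casino casino free bonus now"
f = spam_features(page_title, page_body)
print(f, "-> spam" if toy_classifier(f) else "-> non-spam")
```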

684 citations


Journal ArticleDOI
TL;DR: Ask a dozen Internet experts what the term Web 2.0 means and you'll get a dozen different answers; a few journalists even maintain that the term doesn't mean anything at all and is just a marketing ploy used to hype social networking sites.
Abstract: Ask a dozen Internet experts what the term Web 2.0 means, and you'll get a dozen different answers. Some say that Web 2.0 is a set of philosophies and practices that provide Web users with a deep and rich experience. Others say it's a new collection of applications and technologies that make it easier for people to find information and connect with one another online. A few journalists maintain that the term doesn't mean anything at all; it's just a marketing ploy used to hype social networking sites.

670 citations


Proceedings ArticleDOI
11 Jan 2006
TL;DR: This paper presents the first formal definition of command injection attacks in the context of web applications, and gives a sound and complete algorithm for preventing them based on context-free grammars and compiler parsing techniques.
Abstract: Web applications typically interact with a back-end database to retrieve persistent data and then present the data to the user as dynamically generated output, such as HTML web pages. However, this interaction is commonly done through a low-level API by dynamically constructing query strings within a general-purpose programming language, such as Java. This low-level interaction is ad hoc because it does not take into account the structure of the output language. Accordingly, user inputs are treated as isolated lexical entities which, if not properly sanitized, can cause the web application to generate unintended output. This is called a command injection attack, which poses a serious threat to web application security. This paper presents the first formal definition of command injection attacks in the context of web applications, and gives a sound and complete algorithm for preventing them based on context-free grammars and compiler parsing techniques. Our key observation is that, for an attack to succeed, the input that gets propagated into the database query or the output document must change the intended syntactic structure of the query or document. Our definition and algorithm are general and apply to many forms of command injection attacks. We validate our approach with SqlCheckS, an implementation for the setting of SQL command injection attacks. We evaluated SqlCheckS on real-world web applications with systematically compiled real-world attack data as input. SqlCheckS produced no false positives or false negatives, incurred low runtime overhead, and applied straightforwardly to web applications written in different languages.
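The paper's central observation, that an attack must change the intended syntactic structure of the query, can be illustrated with a deliberately simplified check (a toy lexer, not the paper's grammar-based algorithm): the query built from the real input must have the same token structure as the query built from a harmless dummy value.

```python
import re

# Tiny SQL lexer: string literals, words, and single-character symbols.
TOKEN = re.compile(r"'(?:[^']|'')*'|\w+|[^\s\w]")

def token_kinds(sql):
    kinds = []
    for tok in TOKEN.findall(sql):
        if tok.startswith("'"):
            kinds.append("literal")
        elif re.fullmatch(r"\w+", tok):
            kinds.append("word")
        else:
            kinds.append("symbol")
    return kinds

def structure_preserved(template, user_input):
    """Accept the input only if it occupies a single string literal, i.e. the
    token structure matches the template filled with a harmless dummy value."""
    intended = token_kinds(template.format(v="'x'"))
    actual = token_kinds(template.format(v="'" + user_input + "'"))
    return intended == actual

query_template = "SELECT * FROM users WHERE name = {v}"
print(structure_preserved(query_template, "alice"))         # True: stays one literal
print(structure_preserved(query_template, "x' OR '1'='1"))  # False: adds OR and =
```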

590 citations


Journal ArticleDOI
TL;DR: MyLifeBits is a platform for individuals to record, store, and access a personal lifetime archive of their history.
Abstract: MyLifeBits is a system that began in 2001 to explore the use of SQL to store all personal information found in PCs. The system initially focused on capturing and storing scanned and encoded archival material (e.g. articles, books, music, photos, and video) as well as everything born digital (e.g. office documents, email, digital photos). It evolved to have a goal of storing everything that could be captured. The latter included web pages, phone calls, meetings, room conversations, keystrokes and mouse clicks for every active screen or document, and all of the one to two thousand photos that SenseCam captures every day. In 2006 the software platform is used for research including real-time data collection, advanced SenseCams, and particular applications, e.g. health and wellness. This article expands on the January 2006 CACM publication of the same name. MyLifeBits features, functions, and use experience are given in the main body, followed by an appendix of future research and product needs that the research has identified.

543 citations


Proceedings ArticleDOI
Monika Henzinger1
06 Aug 2006
TL;DR: Since Charikar's algorithm finds more near-duplicate pairs on different sites, it achieves better overall precision than Broder et al.'s algorithm; a combined algorithm is presented which achieves precision 0.79 with 79% of the recall of the other algorithms.
Abstract: Broder et al.'s [3] shingling algorithm and Charikar's [4] random projection based approach are considered "state-of-the-art" algorithms for finding near-duplicate web pages. Both algorithms were either developed at or used by popular web search engines. We compare the two algorithms on a very large scale, namely on a set of 1.6B distinct web pages. The results show that neither of the algorithms works well for finding near-duplicate pairs on the same site, while both achieve high precision for near-duplicate pairs on different sites. Since Charikar's algorithm finds more near-duplicate pairs on different sites, it achieves a better precision overall, namely 0.50 versus 0.38 for Broder et al.'s algorithm. We present a combined algorithm which achieves precision 0.79 with 79% of the recall of the other algorithms.
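As a rough illustration of the fingerprinting idea behind such algorithms, the sketch below computes bottom-k sketches of word shingles and estimates resemblance from their overlap; it is loosely in the spirit of Broder et al.'s shingling, with the shingle width, hash, and sketch size all assumed, and it is not either paper's algorithm.

```python
import hashlib
from itertools import islice

def shingles(text, w=4):
    """All contiguous w-word shingles of the page text."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(max(len(words) - w + 1, 1))}

def sketch(text, k=8, w=4):
    """Keep the k smallest shingle hashes as the page's fingerprint."""
    hashes = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16)
                    for s in shingles(text, w))
    return set(islice(hashes, k))

def resemblance(a, b, k=8):
    sa, sb = sketch(a, k), sketch(b, k)
    return len(sa & sb) / len(sa | sb)

page1 = "breaking news the quick brown fox jumps over the lazy dog today"
page2 = "breaking news the quick brown fox jumped over the lazy dog today"
print(round(resemblance(page1, page2), 2))
```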

Proceedings ArticleDOI
23 May 2006
TL;DR: This paper investigates how detailed tracking of user interaction can be monitored using standard web technologies to enable implicit interaction and to ease usability evaluation of web applications outside the lab.
Abstract: In this paper, we investigate how detailed tracking of user interaction can be monitored using standard web technologies. Our motivation is to enable implicit interaction and to ease usability evaluation of web applications outside the lab. To obtain meaningful statements on how users interact with a web application, the collected information needs to be more detailed and fine-grained than that provided by classical log files. We focus on tasks such as classifying the user with regard to computer usage proficiency or making a detailed assessment of how long it took users to fill in fields of a form. Additionally, it is important in the context of our work that usage tracking should not alter the user's experience and that it should work with existing server and browser setups. We present an implementation for detailed tracking of user actions on web pages. An HTTP proxy modifies HTML pages by adding JavaScript code before delivering them to the client. This JavaScript tracking code collects data about mouse movements, keyboard input and more. We demonstrate the usefulness of our approach in a case study.
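A minimal sketch of the injection step only, assuming the proxy already holds the HTML response; the tracking script, the event set, and the /track endpoint are made up for illustration and are not the authors' implementation:

```python
TRACKING_SNIPPET = """
<script>
/* Hypothetical tracking stub: report mouse moves and key presses. */
document.addEventListener('mousemove', function (e) {
  navigator.sendBeacon('/track', JSON.stringify({t: 'move', x: e.clientX, y: e.clientY}));
});
document.addEventListener('keydown', function (e) {
  navigator.sendBeacon('/track', JSON.stringify({t: 'key', k: e.key}));
});
</script>
"""

def inject_tracking(html: str) -> str:
    """Insert the tracking script just before </body>, as a rewriting proxy might."""
    marker = "</body>"
    if marker in html:
        return html.replace(marker, TRACKING_SNIPPET + marker, 1)
    return html + TRACKING_SNIPPET  # fall back for pages without a body tag

page = "<html><body><h1>Hello</h1></body></html>"
print(inject_tracking(page))
```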

Proceedings Article
01 Feb 2006
TL;DR: The design and implementation of the Strider HoneyMonkey Exploit Detection System is described, which consists of a pipeline of “monkey programs” running possibly vulnerable browsers on virtual machines with different patch levels and patrolling the Web to seek out and classify web sites that exploit browser vulnerabilities.
Abstract: Internet attacks that use malicious web sites to install malware programs by exploiting browser vulnerabilities are a serious emerging threat. In response, we have developed an automated web patrol system to automatically identify and monitor these malicious sites. We describe the design and implementation of the Strider HoneyMonkey Exploit Detection System, which consists of a pipeline of “monkey programs” running possibly vulnerable browsers on virtual machines with different patch levels and patrolling the Web to seek out and classify web sites that exploit browser vulnerabilities. Within the first month of utilizing this system, we identified 752 unique URLs hosted on 288 web sites that could successfully exploit unpatched Windows XP machines. The system automatically constructed topology graphs based on traffic redirection to capture the relationship between the exploit sites. This allowed us to identify several major players who are responsible for a large number of exploit pages. By monitoring these 752 exploit-URLs on a daily basis, we discovered a malicious web site that was performing zero-day exploits of the unpatched javaprxy.dll vulnerability and was operating behind 25 exploit-URLs. It was confirmed as the first “in-the-wild”, zero-day exploit of this vulnerability that was reported to the Microsoft Security Response Center. Additionally, by scanning the most popular one million URLs as classified by a search engine, we found over seven hundred exploit-URLs, many of which serve popular content related to celebrities, song lyrics, wallpapers, video game cheats, and wrestling.

01 Jan 2006
TL;DR: This work defines a set of general criteria for a good tagging system and proposes a collaborative tag suggestion algorithm using these criteria to spot high-quality tags and employs a goodness measure for tags derived from collective user authorities to combat spam.
Abstract: Content organization over the Internet went through several interesting phases of evolution: from structured directories to unstructured Web search engines and, more recently, to tagging as a way of aggregating information, a step towards the semantic web vision. Tagging allows ranking and data organization to directly utilize inputs from end users, enabling machine processing of Web content. Since tags are created by individual users in a free form, one important problem facing tagging is to identify the most appropriate tags, while eliminating noise and spam. For this purpose, we define a set of general criteria for a good tagging system. These criteria include high coverage of multiple facets to ensure good recall, least effort to reduce the cost involved in browsing, and high popularity to ensure tag quality. We propose a collaborative tag suggestion algorithm using these criteria to spot high-quality tags. The proposed algorithm employs a goodness measure for tags derived from collective user authorities to combat spam. The goodness measure is iteratively adjusted by a reward-penalty algorithm, which also incorporates other sources of tags, e.g., content-based auto-generated tags. Our experiments based on My Web 2.0 show that the algorithm is effective.
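As a loose illustration of the idea that tag goodness and user authority reinforce each other, the following toy loop (a HITS-like mutual reinforcement with made-up data, not the paper's reward-penalty rule) ranks tags by the authority of the users who applied them:

```python
# Which users applied which tags (toy data).
tag_users = {
    "python":    {"alice", "bob", "carol"},
    "tutorial":  {"alice", "carol"},
    "free-ipod": {"spammer"},
}

users = {u for us in tag_users.values() for u in us}
authority = {u: 1.0 for u in users}

# Illustrative mutual-reinforcement loop: good tags are those applied by
# authoritative users, and authoritative users are those whose tags tend
# to be good.
for _ in range(20):
    goodness = {t: sum(authority[u] for u in us) for t, us in tag_users.items()}
    norm = sum(goodness.values())
    goodness = {t: g / norm for t, g in goodness.items()}
    authority = {u: sum(g for t, g in goodness.items() if u in tag_users[t])
                 for u in users}
    norm = sum(authority.values())
    authority = {u: a / norm for u, a in authority.items()}

print(sorted(goodness.items(), key=lambda kv: -kv[1]))
```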

Patent
26 Jun 2006
TL;DR: In this patent, a virtual machine (VM) is used to sandbox and analyze potentially malicious content accessed at a network site, detecting threats such as drive-by download attacks and piggy-backed spyware.
Abstract: A system analyzes content accessed at a network site to determine whether it is malicious. The system employs a tool able to identify spyware that is piggy-backed on executable files (such as software downloads) and is able to detect “drive-by download” attacks that install software on the victim's computer when a page is rendered by a browser program. The tool uses a virtual machine (VM) to sandbox and analyze potentially malicious content. By installing and running executable files within a clean VM environment, commercial anti-spyware tools can be employed to determine whether a specific executable contains piggy-backed spyware. By visiting a Web page with an unmodified browser inside a clean VM environment, predefined “triggers,” such as the installation of a new library, or the creation of a new process, can be used to determine whether the page mounts a drive-by download attack.

Proceedings ArticleDOI
23 May 2006
TL;DR: A system that learns how to extract keywords from web pages for advertisement targeting, using a number of features, such as term frequency of each potential keyword, inverse document frequency, presence in meta-data, and how often the term occurs in search query logs.
Abstract: A large and growing number of web pages display contextual advertising based on keywords automatically extracted from the text of the page, and this is a substantial source of revenue supporting the web today. Despite the importance of this area, little formal, published research exists. We describe a system that learns how to extract keywords from web pages for advertisement targeting. The system uses a number of features, such as term frequency of each potential keyword, inverse document frequency, presence in meta-data, and how often the term occurs in search query logs. The system is trained with a set of example pages that have been hand-labeled with "relevant" keywords. Based on this training, it can then extract new keywords from previously unseen pages. Accuracy is substantially better than several baseline systems.
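The features named in the abstract are straightforward to picture in code. The sketch below computes them for each candidate term; the scoring function stands in for the trained model, and its weights and the sample inputs are invented for illustration:

```python
import math
import re
from collections import Counter

def candidate_features(page_text, meta_keywords, doc_freq, query_log_freq, n_docs):
    """Per-term features: term frequency, IDF, metadata presence, query-log count."""
    words = re.findall(r"[a-z]+", page_text.lower())
    tf = Counter(words)
    return {
        term: {
            "tf": count / len(words),
            "idf": math.log(n_docs / (1 + doc_freq.get(term, 0))),
            "in_meta": term in meta_keywords,
            "query_freq": query_log_freq.get(term, 0),
        }
        for term, count in tf.items()
    }

def score(f):
    # Stand-in for the trained model: a hand-tuned linear combination.
    return f["tf"] * f["idf"] + 0.5 * f["in_meta"] + 0.001 * f["query_freq"]

page = "Cheap flights to Rome. Book flights and hotels in Rome today."
feats = candidate_features(page, meta_keywords={"flights", "rome"},
                           doc_freq={"the": 9000, "flights": 120, "rome": 40},
                           query_log_freq={"flights": 5000, "rome": 800},
                           n_docs=10000)
top = sorted(feats, key=lambda t: score(feats[t]), reverse=True)[:3]
print(top)
```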

20 Jun 2006
TL;DR: It is suggested that recent thinking describing the changing Web as "Web 2.0" will have substantial implications for libraries, and that while these implications keep very close to the history and mission of libraries, they still necessitate a new paradigm for librarianship.
Abstract: This article posits a definition and theory for "Library 2.0". It suggests that recent thinking describing the changing Web as "Web 2.0" will have substantial implications for libraries, and recognizes that while these implications keep very close to the history and mission of libraries, they still necessitate a new paradigm for librarianship. The paper applies the theory and definition to the practice of librarianship, specifically addressing how Web 2.0 technologies such as synchronous messaging and streaming media, blogs, wikis, social networks, tagging, RSS feeds, and mashups might intimate changes in how libraries provide access to their collections and user support for that access.

Proceedings ArticleDOI
23 Apr 2006
TL;DR: Noxes is presented, which is, to the best of the authors' knowledge, the first client-side solution to mitigate cross-site scripting attacks, and it effectively protects against information leakage from the user's environment while requiring minimal user interaction and customization effort.
Abstract: Web applications are becoming the dominant way to provide access to on-line services. At the same time, web application vulnerabilities are being discovered and disclosed at an alarming rate. Web applications often make use of JavaScript code that is embedded into web pages to support dynamic client-side behavior. This script code is executed in the context of the user's web browser. To protect the user's environment from malicious JavaScript code, a sand-boxing mechanism is used that limits a program to access only resources associated with its origin site. Unfortunately, these security mechanisms fail if a user can be lured into downloading malicious JavaScript code from an intermediate, trusted site. In this case, the malicious script is granted full access to all resources (e.g., authentication tokens and cookies) that belong to the trusted site. Such attacks are called cross-site scripting (XSS) attacks. In general, XSS attacks are easy to execute, but difficult to detect and prevent. One reason is the high flexibility of HTML encoding schemes, offering the attacker many possibilities for circumventing server-side input filters that should prevent malicious scripts from being injected into trusted sites. Also, devising a client-side solution is not easy because of the difficulty of identifying JavaScript code as being malicious. This paper presents Noxes, which is, to the best of our knowledge, the first client-side solution to mitigate cross-site scripting attacks. Noxes acts as a web proxy and uses both manual and automatically generated rules to mitigate possible cross-site scripting attempts. Noxes effectively protects against information leakage from the user's environment while requiring minimal user interaction and customization effort.
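One way to picture a rule of the kind such a proxy might enforce: requests to the page's own origin or to statically linked hosts pass, while dynamically generated requests to unknown domains (a possible sign of leaked cookies or session tokens) are blocked. This is a simplified, hypothetical rule check with made-up URLs, not Noxes itself:

```python
from urllib.parse import urlparse

def allowed(request_url, page_origin, static_links):
    """Allow same-origin requests and requests to statically linked domains."""
    host = urlparse(request_url).hostname
    if host == urlparse(page_origin).hostname:
        return True
    return host in {urlparse(u).hostname for u in static_links}

page_origin = "https://trusted.example"
static_links = ["https://cdn.example/style.css"]

print(allowed("https://trusted.example/profile", page_origin, static_links))       # True
print(allowed("https://cdn.example/logo.png", page_origin, static_links))          # True
print(allowed("https://evil.example/steal?c=SESSION", page_origin, static_links))  # False
```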

Journal ArticleDOI
TL;DR: The results provide direct evidence in support of the premise that aesthetic impressions of web pages are formed quickly and suggest that visual aesthetics plays an important role in users' evaluations of the IT artifact and in their attitudes toward interactive systems.
Abstract: Two experiments were designed to replicate and extend Lindgaard et al.'s (2006) findings [Attention web designers: you have 50 ms to make a good first impression! Behaviour and Information Technology 25(2), 115-126] that users can form immediate aesthetic impressions of web pages, and that these impressions are highly stable. Using explicit (subjective evaluations) and implicit (response latency) measures, the experiments demonstrated that, averaged over users, immediate aesthetic impressions of web pages are remarkably consistent. In Experiment 1, 40 participants evaluated 50 web pages in two phases. The average attractiveness ratings of web pages after a very short exposure of 500 ms were highly correlated with average attractiveness ratings after an exposure of 10 s. Extreme attractiveness evaluations (both positive and negative) were faster than moderate evaluations, lending convergent evidence to the hypothesis of immediate impression. The findings also suggest considerable individual differences in evaluations and in the consistency of those evaluations. In Experiment 2, 24 of the 50 web pages from Experiment 1 were evaluated again for their attractiveness after a 500 ms exposure. Subsequently, users evaluated the design of the web pages on the dimensions of classical and expressive aesthetics. The results showed high correlation between attractiveness ratings from Experiments 1 and 2. In addition, it appears that low attractiveness is associated mainly with very low ratings of expressive aesthetics. Overall, the results provide direct evidence in support of the premise that aesthetic impressions of web pages are formed quickly. Indirectly, these results also suggest that visual aesthetics plays an important role in users' evaluations of the IT artifact and in their attitudes toward interactive systems.

Book ChapterDOI
07 Nov 2006
TL;DR: Links as mentioned in this paper is a programming language for web applications that generates code for all three tiers of a web application from a single source, compiling into JavaScript to run on the client and into SQL for running on the database.
Abstract: Links is a programming language for web applications that generates code for all three tiers of a web application from a single source, compiling into JavaScript to run on the client and into SQL to run on the database. Links supports rich clients running in what has been dubbed 'Ajax' style, and supports concurrent processes with statically-typed message passing. Links is scalable in the sense that session state is preserved in the client rather than the server, in contrast to other approaches such as Java Servlets or PLT Scheme. Client-side concurrency in JavaScript and transfer of computation between client and server are both supported by translation into continuation-passing style.

Journal ArticleDOI
TL;DR: An effective approach to phishing Web page detection is proposed, which uses Earth mover's distance (EMD) to measure Web page visual similarity and train an EMD threshold vector for classifying a Web page as a phishing or a normal one.
Abstract: An effective approach to phishing Web page detection is proposed, which uses Earth mover's distance (EMD) to measure Web page visual similarity. We first convert the involved Web pages into low-resolution images and then use color and coordinate features to represent the image signatures. We use EMD to calculate the signature distances of the images of the Web pages. We train an EMD threshold vector for classifying a Web page as a phishing or a normal one. Large-scale experiments with 10,281 suspected Web pages are carried out to show high classification precision, phishing recall, and applicable time performance for an online enterprise solution. We also compare our method with two others to demonstrate its advantage. We also built a real system which is already used online, and it has caught many real phishing cases.
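As a simplified illustration of the distance computation only (the paper's signatures combine color and coordinate features and use a trained threshold vector), the sketch below compares normalized gray-level histograms of two low-resolution page renderings with a one-dimensional earth mover's distance; the image sizes, bin count, and threshold are assumed:

```python
import numpy as np

def color_signature(pixels, bins=8):
    """Normalized gray-level histogram of a low-resolution page rendering."""
    hist, _ = np.histogram(pixels, bins=bins, range=(0, 256))
    return hist / hist.sum()

def emd_1d(p, q):
    """Earth mover's distance between two 1-D distributions (CDF-difference form)."""
    return np.abs(np.cumsum(p - q)).sum()

# Hypothetical 10x10 grayscale renderings: a protected page and a near-copy
# that differs only in a small banner region.
legit = np.full((10, 10), 200)
legit[0, :] = 30                # dark navigation bar
suspect = legit.copy()
suspect[0, :5] = 40             # lookalike with a slightly altered banner

THRESHOLD = 0.1  # assumed; the paper learns a threshold vector instead
d = emd_1d(color_signature(legit), color_signature(suspect))
print(d, "-> potential phishing lookalike" if d < THRESHOLD else "-> dissimilar")
```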

Patent
18 Sep 2006
TL;DR: A method and apparatus are described for generating new web pages using pre-existing web pages as a template, and for selecting information from a database or index based on user-allocated ratings; the selected information may be used to provide content for a web site or for broadcast on TV, radio, or other media.
Abstract: A method and apparatus for generating new web pages using pre-existing web pages as a template, and a method and apparatus for selection of information from a database or index based on user-allocated ratings, where the selected information may be used to provide content for a web site or for broadcast on TV, radio or other media. For the web page creation, preferably each web page is provided with a “remix” button, which a user can click on to create an editable copy of the web page. The user may subsequently edit individual components of the web page to customise it. A ranking system for web pages, modules, content and users may be implemented, and may be used to automatically select highly rated content. The automatic selection may also use meta data such as “tags” specifying the subject matter of the web site, and meta data relating to the particular user who is viewing the web site. The automatic selection process may use “tag clouds” comprising groups of related tags, in order to improve the search and matching process. The selected content may include advertising material or any other web site content, such as news feeds, blogs, pictures, etc. Where advertising material is selected for display, the selection criteria may take into account the user allocated ratings of each advert, as well as the price that advertisers are willing to pay in order to have the advert displayed.

Journal ArticleDOI
TL;DR: The authors' experience using e-mail to study a national sample of Internet users is presented, beginning with a discussion of how a sample of on-line users can be selected using a ‘people finder’ search engine.
Abstract: The Internet's potential for academic and applied research has recently begun to be acknowledged and assessed. To date, researchers have used Web page-based surveys to study large groups of on-line users and e-mail surveys to study smaller, more homogenous on-line user groups. A relatively untapped use for the Internet is to use e-mail to survey broader Internet populations on both a national and international basis. Our experience using e-mail to study a national sample of Internet users is presented, beginning with a discussion of how a sample of on-line users can be selected using a ‘people finder’ search engine. We include an evaluation of the demographic characteristics of the respondent pool compared to both a web page-based survey and a telephone survey of Internet users. Considerations for researchers who are evaluating this method for their own studies are provided.

Patent
10 Jan 2006
TL;DR: In this article, the authors present techniques and implementations for providing enhanced functionality for handling data in Internet browsers or other applications used for accessing data over a network, including providing thumbnail image displays of the current appearance of webpages referenced by URLs returned in a set of search results.
Abstract: Techniques and implementations for providing enhanced functionality for handling data in Internet browsers or other applications used for accessing data over a network, including providing thumbnail image displays of the current appearance of webpages referenced by URLs returned in a set of search results, providing thumbnail image displays of the webpages referenced by a list of favorite or bookmarked websites, providing thumbnail image displays of webpages which have been blocked from appearing on a user's screen, and providing thumbnail image displays of images which have been extracted from webpages and stored for potential future use.

Proceedings ArticleDOI
23 May 2006
TL;DR: SecuBat, a generic and modular web vulnerability scanner that, similar to a port scanner, automatically analyzes web sites with the aim of finding exploitable SQL injection and XSS vulnerabilities is developed.
Abstract: As the popularity of the web increases and web applications become tools of everyday use, the role of web security has been gaining importance as well. The last years have shown a significant increase in the number of web-based attacks. For example, there has been extensive press coverage of recent security incidents involving the loss of sensitive credit card information belonging to millions of customers. Many web application security vulnerabilities result from generic input validation problems. Examples of such vulnerabilities are SQL injection and Cross-Site Scripting (XSS). Although the majority of web vulnerabilities are easy to understand and to avoid, many web developers are, unfortunately, not security-aware. As a result, there exist many web sites on the Internet that are vulnerable. This paper demonstrates how easy it is for attackers to automatically discover and exploit application-level vulnerabilities in a large number of web applications. To this end, we developed SecuBat, a generic and modular web vulnerability scanner that, similar to a port scanner, automatically analyzes web sites with the aim of finding exploitable SQL injection and XSS vulnerabilities. Using SecuBat, we were able to find many potentially vulnerable web sites. To verify the accuracy of SecuBat, we picked one hundred interesting web sites from the potential victim list for further analysis and confirmed exploitable flaws in the identified web pages. Among our victims were well-known global companies and a finance ministry. Of course, we notified the administrators of vulnerable sites about potential security problems. More than fifty responded to request additional information or to report that the security hole was closed.

Journal ArticleDOI
TL;DR: It is concluded that hyperlinks are a highly promising but problematic new source of data that can be mined for previously hidden patterns of information, although much care must be taken in the collection of raw data and in the interpretation of the results.
Abstract: We have recently witnessed the growth of hyperlink studies in the field of Internet research. Although investigations have been conducted across many disciplines and topics, their approaches can be largely divided into hyperlink network analysis (HNA) and Webometrics. This article is an extensive review of the two analytical methods, and a reflection on their application. HNA casts hyperlinks between Web sites (or Web pages) as social and communicational ties, applying standard techniques from Social Networks Analysis to this new data source. Webometrics has tended to apply much simpler techniques combined with a more in-depth investigation into the validity of hypotheses about possible interpretations of the results. We conclude that hyperlinks are a highly promising but problematic new source of data that can be mined for previously hidden patterns of information, although much care must be taken in the collection of raw data and in the interpretation of the results. In particular, link creation is an unregulated phenomenon and so it would not be sensible to assume that the meaning of hyperlinks in any given context is evident, without a systematic study of the context of link creation, and of the relationship between link counts, among other measurements. Social Networks Analysis tools and techniques form an excellent resource for hyperlink analysis, but should only be used in conjunction with improved techniques for data collection, validation and interpretation.

Proceedings ArticleDOI
04 Sep 2006
TL;DR: A new prototype search tool called Mica is presented that augments standard Web search results to help programmers find the right API classes and methods given a description of the desired functionality, and help programmers finding examples when they already know which methods to use.
Abstract: Because software libraries are numerous and large, learning how to use them is a common and problematic task for experienced programmers and novices alike. Internet search engines such as Google have emerged as important resources to help programmers successfully use APIs. However, observations of programmers using web search have revealed problems and inefficiencies in their use. We present a new prototype search tool called Mica that augments standard web search results to help programmers find the right API classes and methods given a description of the desired functionality, and help programmers find examples when they already know which methods to use. Mica works by using the Google Web APIs to find relevant pages, and then analyzing the content of those pages to extract the most relevant programming terms and to classify the type of each result.

Proceedings ArticleDOI
18 Dec 2006
TL;DR: A state-of-the-art survey of work on social network analysis, ranging from pure mathematical analyses of graphs to analysis of social networks in the Semantic Web, is given to provide a road map for researchers working on different aspects of social network analysis.
Abstract: A social network is a set of people (or organizations or other social entities) connected by a set of social relationships, such as friendship, co-working or information exchange. Social network analysis focuses on the analysis of patterns of relationships among people, organizations, states and such social entities. Social network analysis provides both a visual and a mathematical analysis of human relationships. The Web can also be considered a social network: social networks are formed between Web pages by hyperlinking to other Web pages. In this paper, a state-of-the-art survey of work on social network analysis is given, ranging from pure mathematical analyses of graphs to analysis of social networks in the Semantic Web. The main goal is to provide a road map for researchers working on different aspects of Social Network Analysis.

Patent
18 Jul 2006
TL;DR: In this article, a system for device-independent point-to-multipoint communication is presented, where the system is configured to receive a message addressed to one or more destination users, the message type being, for example, Short Message Service (SMS), Instant Messaging (IM), E-mail, web form input, or Application Program Interface (API) function call.
Abstract: A system (and method) for device-independent point-to-multipoint communication is disclosed. The system is configured to receive a message addressed to one or more destination users, the message type being, for example, Short Message Service (SMS), Instant Messaging (IM), E-mail, web form input, or Application Program Interface (API) function call. The system also is configured to determine information about the destination users, the information comprising preferred devices and interfaces for receiving messages, and further comprising message receiving preferences. The system applies rules to the message based on destination user information to determine the message endpoints, the message endpoints being, for example, Short Message Service (SMS), Instant Messaging (IM), E-mail, web page output, or Application Program Interface (API) function call. The system translates the message based on the destination user information and message endpoints, and transmits the message to each endpoint. The system also enables a first user to perform a keyword search and, after the search, presents a result display on the first user's computing device in which the individual messages located by the search are displayed. The result display provides a graphical subscribe indicator that, when selected after the search is performed and the individual messages are displayed, subscribes the first user to a second user who provided the selected message, so that the first user becomes one of several followers of the second user; the followers of the second user, including the first user, are stored by one or more computer processors in a first storage. A user is also enabled to post a new message for distribution to one or more unspecified recipients, wherein the server identifies the followers of the posting user as recipients of the new message and sends the new message to the followers of the second user, including the first user.

Journal ArticleDOI
01 Dec 2006
TL;DR: This is the first publicly available Web spam collection that includes page contents and links, and that has been labelled by a large and diverse set of judges.
Abstract: We describe the WEBSPAM-UK2006 collection, a large set of Web pages that have been manually annotated with labels indicating whether or not the hosts include Web spam aspects. This is the first publicly available Web spam collection that includes page contents and links, and that has been labelled by a large and diverse set of judges.