Webology, Volume 5, Number 2, June, 2008

Home Table of Contents Titles & Subject Index Authors Index

Information retrieval and machine learning:
Supporting technologies for web mining research and practice

MPS Bhatia
Netaji Subhas Institute of Technology, University of Delhi, India. E-mail: mpsbatia (at) nsit.ac.in

Akshi Kumar Khalid
Netaji Subhas Institute of Technology, University of Delhi, India akshi.kumar (at) gmail.com

Received May 18, 2008; Accepted June 24, 2008


With the enormous increase in recent years in the volume of information available on-line, and the consequent need for better techniques to access this information, there has been a strong resurgence of interest in Web Mining research. This paper expounds how research in Machine Learning (ML) and Information Retrieval (IR) will help develop applications that can drive the next generation of Web search with the key to support relevant search results by effectively and efficiently digging out user- centric information. This study attempts to probe the role of these two web mining supporting technologies (ML & IR). It reviews the web mining systems from the perspective of the Information Retrieval and illustrates how machine learning is likely to make substantial gains in web mining research and practice by developing standards and improving effectiveness.


Web Information Retrieval, Machine Learning Paradigms, Web mining


The World Wide Web is a huge, widely distributed, global source for information services, hyper-link information, access and usage information and web site content and organization. With the transformation of the Web into a ubiquitous tool for "e-activities" such as e-commerce, e-learning, e-government, e-science, its use has pervaded to the realms of day-to-day work, information retrieval and business management. Therefore, it is imperative to provide users with tools for efficient and effective resource and knowledge discovery. By all measures, the Web is enormous and growing at a staggering rate, which has made it increasingly intricate and crucial for both people and programs to have quick and accurate access to Web information and services. Search engines have played a key role in the World Wide Web's infrastructure as its scale and impact have escalated. Although search engines are important tools for knowledge discovery on the Web, they are far from perfect. The poor quality of retrieved results, handling a huge quantity of information, addressing subjective and time-varying search needs, finding fresh information and dealing with poor quality queries are commonly cited glitches.

The World Wide Web Impact: Opportunities and Challenges

The stakeholders could encounter, among others, the following problems when interacting with the Web (Kobayashi & Takeda, 2000):

a) The "Abundance" problem:
With the phenomenal growth of the Web, there is an ever increasing volume of data and information published in numerous web pages. According to worldwidewebsize.com, the indexed Web contains at least 27.56 billion pages (Sunday, 24 August, 2008).

b) Web Search results usually have low precision and recall:
For finding relevant information, the search service is generally a keyword-based, query-triggered process which results in problems of Low Precision (difficulty to find relevant information) and Low Recall (inability to index all information available on the web).

c) Limited query interface based on keyword-oriented search:
It is hard to extract useful knowledge out of information available because the search service used to find out specific information on the Web is retrieval-oriented, whereas to extract potentially useful knowledge out of it, is a data-mining oriented, data-triggered process.

d) Lack of Personalization of Information and Limited Customization to individual users:
Most knowledge on the Web is presented as natural-language text with occasional pictures and graphics. This is convenient for human users to read and view but difficult for computers to understand. It also limits the state of art search engines, since they cannot infer contextual meaning. For example the occurrence of word 'bat' refers to a bird or to a cricket bat. These factors uphold the inevitable creation of intelligent server and client-side systems that can effectively mine for knowledge both across the Internet and in particular web localities.

e) Heterogeneity:

f) Dynamics:
The freedom for anyone to publish information on the Web at anytime and anywhere implies that information on the Web is constantly changing. It is a dynamic information environment whereas traditional systems are typically based on static document collection.

g) Duplication:
Several studies indicate that nearly 30% of the Web's content is duplicated, mainly due to mirroring.

Despite its success as a preferred or de-facto source of information, the Web implicates a key challenge: How to target search & provide improved systems that retrieve the most relevant information available to satisfy user's need with accurate balance of novelty and pertinence.

Web Mining Layered Framework

Web Mining tries to solve these issues that arise due to the Web phenomenon. It tries to overcome these problems by applying data mining techniques to content, (hyperlink) structure and usage of web resources (Kosala & Blockeel, 2000; Bin & Zhijing, 2003; Chakarbarti, 2003; Henzinger, 2004; Srivastava, Desikan & Kumar, 2002). In the following sections of the paper, we expound the research in machine learning and information retrieval and help develop and review web mining applications that can more effectively and efficiently utilize the Web of knowledge.

A Three-layered Web Mining Framework is illustrated in Figure 1, where, the Layer 1 is the Supporting Web Mining Technologies Layer that expounds the various supporting technologies that contribute to Web Mining and the Layer 2 & Layer 3 are Web Mining Concepts Layer & Web Mining Applications Layer respectively. In this paper we attempt to talk about the use of Information Retrieval & Machine Learning Supporting Technologies of Web Mining that drive the next generation of Web search with the key to support relevant search results by effectively and efficiently digging out user- centric information.

Figure 1. Web mining framework
Defining IR Metrics

From IR to Web IR

Web Information Retrieval (Web IR) can be defined as the application of theories and methodologies from IR to the World Wide Web. It is concerned with addressing the technological challenges facing Information Retrieval (IR) in the setting of WWW (Nunes, 2006). The characteristics of Web make the task of retrieving information from it quite different from the Pre- Web (traditional) information retrieval. The Web is seemingly unlimited source of information with users from cross-section of society seeking to find information to satisfy their information need. They require the Web to be accessible through effective and efficient information retrieval systems that deliver information need fulfillments through the retrieval of web content. The paper (Bhatia & Khalid, 2008a) talks about the Web IR paradigm, as a variant of classical IR, by illustrating its basics, the components, model categories, tools, tasks and the performance measures that quantify the quality of retrieval results.

Diversity and complexity of Web information, as well as its richness, call for approaches that reach beyond conventional IR (Kuhlen, 1991; Baeza-Yates, 1999). For many years, information retrieval research focused mainly on the problem of ad-hoc document retrieval, a topical search task that assumes all queries are meant to express a broad request for information on a topic identified by the query. This task is exemplified by the early TREC conferences, where the ad-hoc document retrieval track was prominent. In recent years, particularly since the popular advent of the World Wide Web and e-commerce, IR researchers have begun to expand their efforts to understand the nature of the information need that users express in their queries. The unprecedented growth of available data coupled with the vast number of available online activities has introduced a new wrinkle to the problem of search: it is now important to attempt to determine not only what the user is looking for, but also the task they are trying to accomplish and the method by which they would prefer to accomplish it. In addition, all users are not created equal; different users may use different terms to describe similar information needs; the concept of "what is relevant" to a user has only become more and more unclear as the Web has matured and more diverse data have become available. Because of this, it is of key interest to search services to discover sets of identifying features that an information retrieval system can use to associate a specific user query with a broader information need. Specifically, the operative challenges motivating researchers in Web IR include problems relating either to data quality or user satisfaction (Kobayashi & Takeda, 2000). The problems facing successful Web Information Retrieval are a combination of challenges that stem from traditional information retrieval and challenges characterized by the nature of the World Wide Web.

The ultimate challenge of Web IR research is to provide improved systems that retrieve the most relevant information available on the Web to better satisfy a user's information need. In an information retrieval scenario, the most common evaluation is retrieval effectiveness and the effect of indexing exhaustivity and term specificity on retrieval effectiveness that can be explained by two widely accepted measures: Precision and Recall.

Precision: The proportion of retrieved and relevant documents (A) to all the documents retrieved:

| Ra | / | A |

Recall: The fraction of the relevant documents (R) that is successfully retrieved:

| Ra | / | R |

A perfect Precision score of 1.0 means that every result retrieved by a search was relevant (but says nothing about whether all relevant documents were retrieved) and a perfect Recall score of 1.0 means that all relevant documents were retrieved by the search (but says nothing about how many irrelevant documents were also retrieved) (Singhal, 2001).

Figure 2. Defining IR metrics
Defining IR metrics

Existing search engines focus mainly on basic term-based techniques for general search, and do not attempt query understanding. Traditionally, a term in a given document is considered to be significant if it occurs multiple times within that document. This observation is commonly referred to as Term Frequency (tf) (Luhn, 1957). His study was based on the fact that the authors of documents typically emphasize a topic or concept by repeatedly using the same words. Since then, most information retrieval approaches (Kobayashi & Takeda, 2000; Baeza-Yates, 1999) have adopted tf (or variations of it) as a benchmark for indicating term significance or relevance within a given document. In particular, it is normally combined with inverse document frequency (idf) to form the tf-idf measure (Salton & Yang, 1973). Even with the emergence of Web Information Retrieval, tf still continues to be a standard measure of term significance within a document. There are several examples for content-based web information retrieval systems (Anh & Moffat, 2003; Craswell et al., 2003; Robertson & Walker, 1999; Yu et al., 2003) that assess term significance using tf. But as is the case for many potentially relevant documents, tf is not always the best or most useful indicator of term significance or relevancy. Quite often, there are relevant documents that contain only a single or a few occurrences of a particular term. Consequently, through tf these terms will rarely be considered significant, and thus never contribute impressively to the rank score of the potentially relevant document they appear within. This is especially the case when infrequently occurring terms appear in large documents containing hundreds or even thousands of terms.

In the query-centric approaches to retrieval (Plachouris et al., 2003; Plachouris & Ounis, 2002) queries can be classified to aid in the choice of retrieval strategy. Kang & Kim (2003) classify queries as either pertaining to topic relevance, homepage finding or service task and use this classification as a basis of dynamically combining multiple evidences in different ways to improve retrieval. Plachouris & Ounis (2002) use WordNet in a concept-based probabilistic approach to information retrieval where queries are biased according to their calculated scope. In their work, scope is an indication of generality or specificity of a query and is used as a factor of uncertainty in Dempster-Shafer's theory of evidence.

Context-based retrieval approaches aim to provide a more complete retrieval process by incorporating contextual information into the retrieval process. The use of context in information retrieval is not a new idea. Jing & Tzoukermann (1999) use context as a basis of measuring the semantic distances between words. Billhardt, Borrajo and Maojo (2002) propose a context-based vector space model for information retrieval. The WEBSOM (Honkela et al., 1997) system is an example of another way in which context has been used for information retrieval. It uses a two level Kohonen's self-organizing map approach to group words and documents of contextual similarity. Context in WEBSOM is limited to the terms that occur direct either sides of the term in question. IntelliZap (Finkelstein et al., 2001) is a context-based web search engine that requires the user to select a keyword in the context of some text. The approach makes effective use of the contextual information in the immediate vicinity of the keywords selected, so that retrieval precision can be improved.

Inquirus (Glover et al., 2001; Lawrence & Giles, 1998) is another web search engine that uses contextual information to improve search results. A user must specify some contextual information, considered as preferences, pertaining to the query. This context (preferences) provides a high-level description of the users information need and ultimately control the search strategy used by the system.

Kleinberg (1999) illustrates how hyperlink information in web pages can be used for web search when using a set of retrieved documents. An approach that also uses the characteristics of link information from a set of retrieved documents for topic distillation is presented by Amitay et al. (2002).

PageRank is hyperlink-based retrieval algorithm that calculates document scores by considering the entire hyperlink connected graph represented by all the links in the entire document collection (Brin & Page, 1998). The model described in (Zakos & Verma, 2006), uses traditional query expansion to determine context of query. Another closely related work on context-driven ranking on the Web (Jonathan, 2006), implicitly deduce context using three different algorithms. Finally (Pickens & Farlance, 2006), offers Term context model as a new tool for accessing term presence in a document. Bhatia & Khalid (2007; 2008b) offer Web retrieval system architecture with a novel re-ranking algorithm to effectively refine the ranking of web search results for locating the most useful documents.

An Overview of Machine Learning

Since 1940's, many knowledge-based systems have been built that acquire knowledge manually from human experts, which is very time-consuming and labor-intensive. To address this problem, Machine Learning algorithms have been developed to acquire knowledge automatically from examples or source data. It is defined as "any process by which a system improves its performance" (Simon, 1983).

Machine Learning Paradigms

Following are the five major Machine Learning paradigms (see Table 1).

Table 1. Various Machine Learning Paradigms
Probabilistic Models Symbolic Learning and Rule Induction Neural Networks Analytic Learning and Fuzzy Logic Evolution-based models
The use of probabilistic models was one of the earliest attempts to perform machine learning.

The most popular example: Bayesian method, Originating in pattern recognition research (Duda & Hart, 1973)

A Bayesian model stores the probability of each class, the probability of each feature, each feature given each class, based on the training data, to classify new instances according to these probabilities (Langley et al., 1992)

A variation of the Bayesian model, Naïve Bayesian model has been widely used in various applications in different domains (Fisher, 1987; Kononenko, 1993).
Symbolic learning can be classified as rote learning, learning by being told, learning by analogy, learning from examples, and learning from discovery (Cohen & Feigenbaum, 1982; Carbonell et al., 1983).

Learning from examples is implemented by applying an algorithm that attempts to induce a general concept description that best describes the different classes of the training examples.

ID3 decision-tree building algorithm (Quinlan, 1983), and its variations such as C4.5 (Quinlan, 1993)

A Neural Network is a graph of many active nodes (neurons) that are connected with each other by weighted links (synapses).

Knowledge is learned and remembered by a network of interconnected neurons, weighted synapses, and threshold logic units (Rumelhart et al., 1986a; 1986b; Lippmann, 1987).

Based on training examples, learning algorithms can be used to adjust the connection weights in the network so that it can predict or classify unknown examples correctly (Belew, 1989; Kwok, 1989; Chen & Ng, 1995).

Evolution-based algorithms rely on analogies to natural processes and Darwinian survival of the fittest.

There are three categories of evolution-based algorithms: genetic algorithms, evolution strategies, and evolutionary programming.

Among these, Genetic Algorithms are most popular and have been successfully applied to various optimization problems. They were developed based on the principle of genetics (Goldberg, 1989; Michalewicz, 1992).

The Analytic Learning represents knowledge as logical rules, and performs reasoning on such rules to search for proofs which can be compiled into more complex rules to solve similar problems.

Fuzzy systems and logic have been applied for imprecision and approximate reasoning by allowing the values of False or True to operate over the range of real numbers from 0 to 1 (Zadeh, 1965).

Machine Learning Algorithms

Machine learning algorithms are organized into a taxonomy, based on the desired outcome of the algorithm. In general, Machine learning algorithms can be classified as:

Machine Learning for Information Retrieval ¬

Learning techniques had been applied in Information Retrieval (IR) applications long before the recent advances of the Web. In this section, we will briefly survey some of the research in this area, covering the use of machine learning in:

Information Extraction

Information extraction (IE) refers to the techniques designed to identify useful information from text documents automatically. It has the goal of transforming a collection of documents, usually with the help of an IR system, into information that is more readily digested and analyzed.

Table 2. Differences between IR and IE
Information Retrieval (IR) Information Extraction (IE)
Aims to select relevant documents, i.e., it finds documents. Aims to extract relevant facts from the document, i.e., it extracts information.
Views text as a bag of unordered words. Interested in structure and representation of the document.

Thus IE works at a finer granularity level than IR does on documents and makes information retrieval more precise. Applications of IE include the summarization of documents in well defined subject areas and automatic generation of databases from text. IE includes the following subtasks:

Named entity recognition (NER)
It refers to recognizing relevant entities in text. It is the automatic identification from text documents of the names of entities of interest.

Relation extraction
Linking recognized entities having particular relevant relations.
The Machine learning-based entity extraction systems rely on algorithms based on Neural networks, Decision tree (Baluja et al., 1999), Hidden Markov Model (Miller et al., 1998), Entropy maximization (Borthwick et al., 1998), rather than human-created rules to extract knowledge or identify patterns from texts.

Relevance Feedback

Relevance Feedback helps users conduct searches iteratively and reformulate search queries based on evaluation of previously retrieved documents. Using relevance feedback, a model can learn the common characteristics of a set of relevant documents in order to estimate the probability of relevance for the remaining documents (Fuhr & Buckley, 1991). Various Machine Learning algorithms, such as genetic algorithms, ID3, and simulated annealing, have been used in relevance feedback applications (Kraft et al., 1995; 1997; Chen et al., 1998).

Information filtering

Information filtering techniques try to learn about users' interests based on their evaluations and actions, and then to use this information to analyze new documents. Many personalization and collaborative systems have been implemented as software agents to help users in information systems (Maes, 1994).

Text classification and Clustering

Text classification is the classification of textual documents into predefined categories (supervised learning). For example, Support Vector Machine (SVM), a statistical method that tries to find a hyper plane that best separates two classes. Text clustering groups documents into non-predefined categories which dynamically defined based on their similarities (unsupervised learning). Kohonen's Self-Organizing Map (SOM), a type of neural network that produces a 2-dimensional grid representation for n-dimensional features, has been widely applied in IR (Lin et al., 1991; Kohonen, 1995). Machine learning is the basis of most text classification and clustering applications.

Web Mining

Web mining refers to the use of data mining techniques to automatically retrieve, extract and evaluate (generalize/analyze) information for knowledge discovery from web documents and services. It is about making implicit or "hidden" knowledge into explicit. The digital revolution and the phenomenal growth of the Web have lead to the generation and storage of huge amounts of data, prompting the need for intelligent analysis methodologies to discover useful knowledge from it. Due to the heterogeneous, semi-structured, distributed, time-varying and multi-dimensional facets of web data, automated discovery of targeted knowledge is a challenging task. It calls for novel methods that draw from a wide range of patent areas of data mining, machine learning, information retrieval, natural language processing, Multimedia, and Statistics. In this article, we will provide a review of the field from the perspectives of machine learning and information retrieval and how they have been applied in web mining systems. Machine learning is the basis for most data mining and text mining techniques and information retrieval research has largely influenced the research directions of web mining applications.

From Data Mining to Web Mining

Two notable and active areas of current research are data mining and the World Wide Web. An expected alliance of the two areas, sometimes referred to as web mining, has been the focus of several recent research projects and papers. Nevertheless, web mining has many unique characteristics compared with data mining. First, the source of web mining is web documents. We consider the use of the Web as a middleware in mining database and the mining of logs and user profiles on the Web server still belong to the category of traditional data mining. Second, the Web is a directed-graph consists of document nodes and hyperlinks. Therefore, the pattern identified can be possibly about the content of documents or about the structure of the Web. Moreover, the Web documents are semi-structured or non-structured with little machine-readable semantic while the source of data mining is confined to the structural data in database. As a result, some traditional data mining methods are not applicable to web mining (Mahanta, 2008).

The distinct axes in the Web Mining Research

The diversity of information on the Web leads to the variety of web mining, defined in the following three categories (Kosala & Blockeel, 2000; Bin & Zhijing, 2003; Chakarbarti, 2003; Henzinger, 2004; Srivastava, Desikan & Kumar, 2002):

1. Web Usage Mining

2. Web Structure Mining

Web structure mining can be further divided into external structure (hyperlinks between web pages) mining, internal structure (of a web page) mining and URL mining.

3. Web Content Mining

The Road Ahead: Open Problems and Solutions

Web Mining Open problems

Nevertheless, web content is not always easy to use. Due to the unstructured and semi-structured nature of web pages and design idiosyncrasy of web sites, it is a challenging task to organize and manage content from the Web. Thus, web Mining continues to play an ever expanding and inevitable role. There are several major challenges for web mining research:

Machine Learning as a Solution

Machine Learning is likely to make substantial gains in web mining research and practice. We exploit few perspectives:

Web Content Mining. Web content mining is mainly based on research in information retrieval and text mining, such as information extraction, text classification and clustering, and information visualization. However, it also includes some new applications, such as knowledge discovery on the Web. Some important web content mining techniques and applications are reviewed in following subsections:

1. Text Mining for Web Documents
Text mining for Web documents can be considered a sub-field of web content mining. Information extraction techniques have been applied to web HTML documents. For example, Chang & Lui (2001) used a PAT tree to construct automatically a set of rules for information extraction. Text clustering algorithms also have been applied to web applications. For example, Chen et al. (2001; 2002) used a combination of noun phrasing and SOM to cluster the search results of search agents that collect web pages by meta-searching popular search engines.

2. Intelligent Web Spiders
Web Spiders, have been defined as "software programs that traverse the Web by following hypertext links and retrieving web documents by HTTP protocol" (Cheong, 1996). They can be used to build the databases of search engines (e.g., Pinkerton, 1994), perform personal search (e.g., Chau et al., 2001), archive web sites or even the whole Web (e.g., Kahle, 1997) and collect web statistics (e.g., Broder et al., 2000). Intelligent Web Spiders: some spiders that use more advanced algorithms during the search process have been developed. For example, the Itsy Bitsy Spider searches the Web using a best-first search and a genetic algorithm approach (Chen et al., 1998).

3. Multilingual Web Mining
In order to extract non-English knowledge from the Web, web mining systems have to deal with issues in language-specific text processing. The base algorithms behind most machine learning systems are language-independent. Most algorithms, e.g., text classification and clustering, need only to take a set of features (a vector of keywords) for the learning process. However, the algorithms usually depend on some phrase segmentation and extraction programs to generate a set of features or keywords to represent web documents. Other learning algorithms such as information extraction and entity extraction also have to be tailored for different languages.

4. Web Visualization
Web visualization tools have been used to help users maintain a "big picture" of the retrieval results from search engines, web sites, a subset of the Web, or even the entire Web. The most well known example of using the tree-metaphor for web browsing is the hyperbolic tree developed by Xerox PARC (Lamping & Rao, 1996). In these visualization systems, machine learning techniques are often used to determine how web pages should be placed in the 2-D or 3-D space. One example is the SOM algorithm described earlier (Chen et al., 1996).

5. The Semantic Web
Semantic web technology (Berners-Lee et al., 2001) tries to add metadata to describe data and information on the Web. Machine learning can play three roles in the Semantic Web:

Web Structure Mining. Web link structure has been widely used to infer important web pages information. Web structure mining has been largely influenced by research in:

Conclusion and Future Directions

Extracting user-centric information from the world's largest repository - the Web efficiently and effectively is becoming increasingly imperative. This article reviews how machine learning techniques can be applied to web mining. Major limitations of web mining research are lack of suitable test collections that can be reused by researchers and difficulty to collect web usage data across different web sites. Most web mining applications have been assessed in this paper. Although the activities are still in their early stages and should continue to develop as the Web evolves. Future research directions include:


Bibliographic information of this paper for citing:

Bhatia, MPS and Khalid, Akshi Kumar (2008).   "Information retrieval and machine learning: Supporting technologies for web mining research and practice."   Webology, 5(2), Article 55. Available at: http://www.webology.org/2008/v5n2/a55.html

Alert us when: New articles cite this article

Copyright © 2008, MPS Bhatia and Akshi Kumar Khalid.