Abstract—Personalized Web page recommendation is in high demand these
days. Web page recommendation systems are generally implemented in Web
servers. They recommend Web pages using implicitly obtained data, namely the
Web browsing patterns of users. The existing system collects the Web logs,
generates clusters of similar users, and recommends pages to the user by
analyzing the clusters online. However, the time complexity of this online
analysis is high. To reduce it and to improve the correctness of
recommendation, we propose applying a Firefly based algorithm for
recommending Web pages along with Naïve Bayes clustering. The approach
clusters the Web logs offline using the Naïve Bayes clustering technique. To
find the similarity between the active user's query and the other users in
the cluster, a Firefly algorithm based similarity measure is used. The
proposed approach uses probability based clustering, which eliminates odd
records while forming clusters. The Firefly algorithm searches the Web logs
in the active user's cluster and recommends the top pages. Because the
Firefly algorithm uses time efficiently, it can be used for online
processing. Once pages are obtained, they are ranked, and the top pages most
relevant to the query are recommended. The efficiency of the proposed system
can be evaluated using measures such as precision, recall score, Matthews
correlation and fallout rate. The proposed approach is expected to improve
time utilization in the online process and to recommend more accurate Web
pages.
Keywords: webpage; recommendation; Naïve Bayes; browsing; cluster
I. Introduction
A Web page recommendation system is a
sub-domain of recommendation systems that recommends a set of Web pages to
users based on their past browsing patterns. It does so by applying
mining techniques to data previously gathered from the users, which
in turn discover and extract information from Web documents and services. The
major concern is finding the most accurate recommendation algorithms.
A recommendation system typically produces its results in one of two
ways: collaborative filtering or content based filtering.
II. Types of Recommendation Systems
A. Collaborative Filtering
Most recommendation systems make wide use
of collaborative filtering for recommending items. This method relies on collecting
and processing information on users' behaviors or activities and then predicting
items based on their similarity to other users. Collaborative filtering
approaches build a model from a user's past behavior and the decisions
of other similar users. This model is then used
to predict items that the user may have an interest in. Since collaborative
filtering does not rely on machine analyzable content, it is capable
of accurately recommending complex items without “understanding” the item
itself.
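As a rough illustration of the collaborative idea described above, the following Python sketch scores unseen pages for an active user by weighting each neighbour's visit counts with a cosine similarity. The user names, page names, visit counts and the choice of cosine similarity are all assumptions made for this example; they are not taken from the systems surveyed here.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse dicts (page -> visit count)."""
    shared = set(u) & set(v)
    dot = sum(u[p] * v[p] for p in shared)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict_items(active, others, k=2):
    """Rank pages the active user has not visited, weighting each
    neighbour's counts by that neighbour's similarity to the active user."""
    scores = {}
    for counts in others.values():
        sim = cosine(active, counts)
        if sim <= 0:
            continue  # dissimilar users contribute nothing
        for page, count in counts.items():
            if page not in active:
                scores[page] = scores.get(page, 0.0) + sim * count
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical visit counts per user (illustrative data only).
visits = {
    "u1": {"news.example": 3, "sports.example": 1},
    "u2": {"news.example": 2, "weather.example": 4},
    "u3": {"cooking.example": 5},
}
active_user = {"news.example": 4}
print(predict_items(active_user, visits))
```

Note that no page content is inspected: the ranking depends only on overlapping behaviour, which is why the third user, who shares no page with the active user, has no influence on the result.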
B. Content Based Filtering
Content based filtering is another common approach
when designing recommendation systems. This technique is based
on a description of the item and a profile of the user's preferences. In a content based recommendation
system, keywords are taken to represent the user's interests. Content based filtering
approaches use a set of distinct properties of an item in
order to find and recommend items with similar properties.
These algorithms try to recommend items by examining
the items that the user liked in the past or likes at
present. In general, various candidate items are
compared with items previously rated by the user, and the best matching
items are recommended. Collaborative and content based approaches are often
combined as hybrid recommendation systems.
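The comparison of candidate items against a user's keyword profile can be sketched as follows. This is a minimal example that assumes keyword sets and Jaccard overlap as the matching criterion; the profile, the page names and the function names are hypothetical.

```python
def jaccard(a, b):
    """Overlap between two keyword sets, from 0.0 (disjoint) to 1.0 (equal)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_by_content(profile_keywords, candidates, k=2):
    """Rank candidate pages by keyword overlap with the user's profile."""
    ranked = sorted(
        candidates,
        key=lambda page: jaccard(profile_keywords, candidates[page]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical user profile and candidate pages (illustrative data only).
profile = {"python", "tutorial", "pandas"}
pages = {
    "page_a": {"python", "pandas", "dataframe"},
    "page_b": {"football", "scores"},
    "page_c": {"python", "tutorial", "pandas", "numpy"},
}
print(recommend_by_content(profile, pages))
```

Unlike the collaborative case, only the properties of the items and the single user's profile are used here; other users play no role.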
III. Literature survey
Recommendation systems play a vital role in
recommending personalized items to users based on their interests in Web services.
The Web contains rich and dynamic information. The amount of
information on the Web is growing rapidly, as are the number of Web sites
and the number of Web pages per site. Predicting the needs of a Web user as she visits Web
sites has gained importance. Many Web page recommendation systems have been developed in
the past. Since they compute recommendations online, their time utilization
should be efficient.
A system [4] that uses a support
vector machine (SVM) learning based model was
developed for computing the similarity between two items, and it performed better than
the latent factor approach for group recommendations. Since a matrix representation
was used, the data sparsity problem was solved. However,
the system did not scale stably when the size of the group
increased dynamically. A system for assigning an electronic
document to one or more predefined categories or classes based
on its textual context, using an agglomerative clustering algorithm, was
developed [6]. This type of clustering, with the sample correlation coefficient
as the similarity measure, allowed a high reduction factor in the indexing term space
together with a gain in classification accuracy.
In order to minimize noise and outlier
data, a modified DBSCALE algorithm using Naïve Bayes has been designed [7].
This algorithm is essentially a probability based function. The function is used to
estimate the outlier cluster data and to increase the correctness rate of
the algorithm for a given threshold value. Since Naïve Bayes is a probability based
function, it removes outlier cluster data and increases the correctness rate
according to the threshold value. It also computes the maximum a posteriori hypothesis for
outlier data.
The memory based collaborative system
uses matrix based computation and solves the data sparsity problem, but the
system does not scale stably when the size of the group increases dynamically.
A hybrid system could help overcome the scalability issue, but it in turn
leads to the cold start problem. To eliminate outliers as well as to overcome
the other two problems, Naïve Bayes clustering, a probability
based method, has been used in the past. The Firefly algorithm converges
quickly and searches all possible subsets with better time
utilization. Thus, to design an efficient recommendation system, the Naïve
Bayes method can be used for offline clustering. Since
the time complexity of the online step should be low, the Firefly algorithm, which is
more efficient in terms of time utilization, can be used
for calculating similarity online. Combining these two techniques might
increase the accuracy of the recommendation system as well as result in efficient
time utilization.
IV. Overview of the proposed work
Initially, the web log files are obtained
from America Online Inc. [1]. The log files consist of five
fields: an anonymous ID for each user, the query of each user along with
the query time, the list of URLs that the user clicked, and the
rank of each URL in the result list. These logs are collected
and grouped by anonymous ID. The URLs visited by all
the users are obtained, and their content is downloaded and
processed. The processing includes removal of stop
words from the content of each URL and keyword extraction. Similar users are clustered based
on the extracted keywords using the Naïve Bayes clustering technique, which provides more efficient
clusters than clustering by association rules.
The resulting clusters are passed to the online component.
In the online process, when an active user issues a query, the keywords are
extracted from the query. The similarity between the extracted
keywords and the other users in the active user's
cluster is calculated using the Firefly similarity
measure. The similarity values are sorted along with the web pages browsed by
similar users in the cluster. The top k web pages are recommended to the active user
as the result.
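The online step described above can be sketched in Python as follows. The text does not give the exact form of the Firefly similarity measure, so this sketch borrows the attractiveness term of the standard Firefly algorithm, beta0 * exp(-gamma * r^2), and assumes that the distance r is the Jaccard distance between keyword sets. The parameter values, the cluster layout and all names are hypothetical.

```python
from math import exp

def attractiveness(query_kw, user_kw, beta0=1.0, gamma=1.0):
    """Firefly-style attractiveness beta0 * exp(-gamma * r^2).
    Assumption: r is the Jaccard distance between the two keyword sets."""
    union = query_kw | user_kw
    r = 1.0 - len(query_kw & user_kw) / len(union) if union else 1.0
    return beta0 * exp(-gamma * r * r)

def recommend_top_k(query, cluster, k=3):
    """cluster maps user id -> (keyword set, list of pages browsed)."""
    query_kw = set(query.lower().split())
    scores = {}
    for keywords, pages in cluster.values():
        sim = attractiveness(query_kw, keywords)
        for page in pages:
            # Keep the best similarity seen for each page.
            scores[page] = max(scores.get(page, 0.0), sim)
    # Sort pages by similarity and return the top k.
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical cluster of the active user (illustrative data only).
cluster = {
    "u1": ({"cheap", "flights", "paris"}, ["flights.example/paris"]),
    "u2": ({"paris", "hotels"}, ["hotels.example/paris"]),
    "u3": ({"garden", "tools"}, ["tools.example"]),
}
print(recommend_top_k("cheap flights paris", cluster, k=2))
```

Because only the users in the active user's cluster are scanned, the cost of the online step grows with the cluster size rather than with the size of the whole log.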
V. Proposed Work
The proposed system follows a linear
process: it first collects and processes the web logs, then
clusters similar users with the Naïve Bayes clustering technique, and finally
generates recommendations based on a similarity measure from the Firefly algorithm.
A. Preprocessing of Web Logs
The web logs are collected from AOL
Inc. [1]. They consist of 20 million web queries from 650 thousand real users over 3
months. The data set includes anonymous ID, query, query time, item rank and click URL. The log file contains a large number of users along with the web pages visited by them. It is validated and separated by anonymous ID, so that each user's records are placed
in an individual file. The content of each URL is fetched
and downloaded. The fetched content undergoes stop word removal
and stemming, and the final keywords are then extracted. Features such as
Keywords, Timing, Frequency, Click URL and Revisit are derived, and the user
profile is constructed from these features, which are taken from the user log files.
Keywords: The keywords extracted from the content of the URL
Timing: The time that the user spent on a particular URL
Frequency: The number of times the user visited the URL
Clickstream: The number of clickstream links visited by the user
Revisit: Whether the user revisited the web page
The keywords are generated from the data fetched from the URL. The timing for each URL is estimated from the given date and time by calculating the
difference between consecutive URLs visited in a single day, subject to some time constraints. Frequency is the number of times the user clicked the URL. Clickstreams are the links that the user clicked for additional information. The timing of a revisit is used
to decide how strongly the user preferred the page. The information from the URL is thus collected and processed to obtain the features
of the user.
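The profile construction described in this subsection can be sketched as follows. The example assumes a tab-separated log with the column names AnonID, Query, QueryTime, ItemRank and ClickURL and a timestamp format of YYYY-MM-DD HH:MM:SS. The 30 minute gap limit stands in for the unspecified time constraints, and the sample rows are invented for illustration.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

def build_profiles(log_text, max_gap_seconds=1800):
    """Group clicks by anonymous ID and derive per-URL features."""
    reader = csv.DictReader(io.StringIO(log_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    clicks_by_user = defaultdict(list)
    for row in reader:
        if row["ClickURL"]:  # keep only rows where a result was clicked
            when = datetime.strptime(row["QueryTime"], "%Y-%m-%d %H:%M:%S")
            clicks_by_user[row["AnonID"]].append((when, row["ClickURL"]))

    profiles = {}
    for user, clicks in clicks_by_user.items():
        clicks.sort()  # chronological order
        frequency = defaultdict(int)
        timing = defaultdict(float)
        for _, url in clicks:
            frequency[url] += 1
        # Timing: gap to the next click on the same day, within the limit.
        for (t0, url), (t1, _) in zip(clicks, clicks[1:]):
            gap = (t1 - t0).total_seconds()
            if t0.date() == t1.date() and gap <= max_gap_seconds:
                timing[url] += gap
        profiles[user] = {
            url: {"frequency": n, "timing": timing[url], "revisit": n > 1}
            for url, n in frequency.items()
        }
    return profiles

# Invented sample rows in the assumed five-column layout.
ROWS = [
    ["AnonID", "Query", "QueryTime", "ItemRank", "ClickURL"],
    ["1", "cheap flights", "2006-03-01 10:00:00", "1", "http://a.example"],
    ["1", "cheap flights", "2006-03-01 10:05:00", "2", "http://b.example"],
    ["1", "paris hotels", "2006-03-02 09:00:00", "1", "http://a.example"],
    ["2", "garden tools", "2006-03-01 12:00:00", "", ""],
]
SAMPLE_LOG = "\n".join("\t".join(r) for r in ROWS)
print(build_profiles(SAMPLE_LOG))
```

Keyword extraction (fetching page content, stop word removal and stemming) is left out of this sketch, since it depends on the chosen stop word list and stemmer.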
B. Naïve Bayes
Clustering, also known as unsupervised
classification, is a descriptive task with many applications. Clustering is the
decomposition or partition of a data set into groups in such a way that the
objects in one group are similar to each other but as different as possible from
the objects in other groups. The three main approaches to clustering data are
partition based clustering, hierarchical clustering and probabilistic model
based clustering. Probabilistic model based clustering is a soft clustering
in which an object can belong to more than one cluster, following a probability distribution.
A clustering is useful if it produces some interesting insight into the problem
being analyzed. Naïve Bayes clustering is a probabilistic
clustering technique based on Bayes' theorem with a strong independence
assumption between features. The feature variables can be discrete or continuous.
This probabilistic clustering handles nominal and numeric variables in the data
set, and its novelty lies in the use of mixtures of truncated exponential (MTE)
densities to model the numeric variables. In Naïve Bayes clustering, the class is the only root variable, and all the attributes are conditionally independent given the class. The clustering problem reduces to taking a data set of instances and a previously specified number of clusters (k), and working out each cluster's distribution and the population distribution between the clusters. To obtain these parameters, the expectation maximization (EM) algorithm is used. Since Naïve Bayes clustering is a probability based technique, an item belongs to a cluster only if it has a relation to it. This helps in eliminating
outlier data in the process of clustering. It also provides proper clustering
with less computation.
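The EM procedure outlined above can be illustrated with a small Python sketch. It makes a simplifying assumption: every feature is binary (for example, whether a keyword occurs in a user's profile) and is modelled with a Bernoulli distribution, rather than with the MTE densities mentioned above for numeric variables. The function name, the seeding scheme and the toy data are hypothetical.

```python
import math

def nb_em_cluster(data, k, seed_rows, iters=50):
    """EM for a mixture of Bernoulli Naive Bayes models.
    data: list of 0/1 feature lists; seed_rows: one row index per cluster.
    Returns (priors, theta, resp), where resp[i][c] = P(cluster c | row i)."""
    n, d = len(data), len(data[0])
    priors = [1.0 / k] * k
    # theta[c][j] = P(feature j = 1 | cluster c), initialised from the seed
    # rows and kept away from 0 and 1 so that log() is always defined.
    theta = [[0.25 + 0.5 * data[s][j] for j in range(d)] for s in seed_rows]
    resp = []
    for _ in range(iters):
        # E-step: posterior of each cluster, assuming independent features.
        resp = []
        for x in data:
            logp = [
                math.log(priors[c]) + sum(
                    math.log(theta[c][j] if x[j] else 1.0 - theta[c][j])
                    for j in range(d))
                for c in range(k)
            ]
            top = max(logp)
            weights = [math.exp(lp - top) for lp in logp]
            total = sum(weights)
            resp.append([w / total for w in weights])
        # M-step: re-estimate the cluster priors (population distribution)
        # and the per-feature probabilities, with Laplace smoothing.
        for c in range(k):
            nc = sum(r[c] for r in resp)
            priors[c] = (nc + 1e-9) / (n + k * 1e-9)
            theta[c] = [
                (sum(r[c] * x[j] for r, x in zip(resp, data)) + 1.0) / (nc + 2.0)
                for j in range(d)
            ]
    return priors, theta, resp

# Toy keyword-presence vectors for six users (illustrative data only).
data = [
    [1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0],
    [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1],
]
priors, theta, resp = nb_em_cluster(data, k=2, seed_rows=[0, 3])
labels = [max(range(2), key=lambda c: r[c]) for r in resp]
print(labels)
```

The responsibilities in `resp` are soft memberships. A record whose highest membership stays below a chosen threshold could be treated as an outlier, which is one possible reading of the outlier elimination described above.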