Abstract—Personalized web page recommendation is much needed these days. Generally, web page recommendation systems are implemented in web servers. They use data obtained implicitly, as a collection of users' web browsing patterns, to recommend web pages. The existing system collects the web logs, generates clusters of similar users, and recommends pages to the user by actively analyzing the logs online. However, the time complexity of this online analysis is high.
To optimize this and to improve the correctness of recommendation systems, we propose a method that applies a Firefly-based algorithm for recommending web pages along with Naïve Bayes clustering. The method clusters web logs offline using the Naïve Bayes clustering technique. To find the similarity between the active user's query and the other users in the cluster, a Firefly-algorithm-based similarity measure is used. The proposed approach uses probability-based clustering, which eliminates odd records while forming clusters. The Firefly algorithm searches the web logs in the active user's cluster and recommends the top pages. Because the Firefly algorithm uses time efficiently, it can be used for online processing. Once pages are obtained, they are ranked, and the top pages most relevant to the query are recommended.
The efficiency of the proposed system can be evaluated using measures such as precision, recall score, Matthews correlation, and fallout rate. The proposed approach is expected to improve time utilization in the online process and to recommend more accurate web pages.

Keywords—firefly; webpage; recommendation; naïve bayes; browsing; cluster

I. Introduction

A web page recommendation system is a sub-domain of recommendation systems that recommends a set of web pages to users based on their past browsing patterns.
It does so by applying special mining techniques to data previously gathered from the users, which in turn discover and extract information from web documents and services. The major concern is finding the most accurate recommendation algorithms. A recommendation system typically produces its results in one of two ways: collaborative filtering or content-based filtering.

II. Types of Recommendation Systems

A. Collaborative Filtering

Most recommendation systems make wide use of collaborative filtering for recommending items. This method relies on collecting and processing information on users' behaviors or activities and then predicting items based on their similarity with other users. Collaborative filtering approaches build a model from a user's past behavior and the decisions of other similar users. This model is then used to predict items that the user may be interested in. Since collaborative filtering does not rely on machine-analyzable content, it is capable of accurately recommending complex items without “understanding” the item itself.

B. Content Based Filtering

Content-based filtering is another common approach when designing recommendation systems.
This technique is based on a description of the item and a profile of the user's preferences. In a content-based recommendation system, keywords are treated as the user's interests. Content-based filtering approaches use a set of distinct properties of an item in order to find and recommend items with the same properties.
These algorithms try to recommend items by examining the items that the user liked in the past or likes at present. In general, various candidate items are compared with items previously rated by the user, and the best-matching items are recommended. Collaborative and content-based approaches are often combined as hybrid recommendation systems.

III. Literature Survey

Recommendation systems play a vital role in recommending personalized items to users of web services based on their interests.
The web contains rich and dynamic information. The amount of information on the web is growing rapidly, as are the number of web sites and the number of web pages per site. Predicting the needs of a web user as she visits web sites has therefore gained importance. Many web page recommendation systems have been developed in the past; since they compute recommendations in an online process, their time utilization should be efficient.

A system [4] that uses a support vector machine (SVM) learning-based model was developed for computing the similarity between two items, and it performed better than the latent factor approach for group recommendations. Since a matrix representation was followed, the data sparsity problem was solved. However, the system was not able to scale stably when the size of the group increased dynamically.
A system that assigns an electronic document to one or more predefined categories or classes based on its textual context, using an agglomerative clustering algorithm, was developed [6]. This type of clustering, with the sample correlation coefficient as the similarity measure, allowed a high reduction factor in the indexing term space together with a gain in classification accuracy.

In order to minimize noise and outlier data, a modified DBSCALE algorithm using Naïve Bayes has been designed [7]. This algorithm is essentially a probability-based function, used to estimate the outlier cluster data and to increase the correctness rate of the algorithm for a given threshold value. Since Naïve Bayes is a probability-based function, it removes outlier cluster data and increases the correctness rate according to the threshold value. It also computes the maximum a posteriori hypothesis for outlier data.

The memory-based collaborative system uses matrix-based computation and solves the data sparsity problem, but its scalability is not stable when the size of the group increases dynamically. A hybrid system could help overcome the scalability issue, but it in turn leads to the cold start problem. To eliminate outliers as well as to overcome the other two problems, Naïve Bayes clustering, a probability-based method, has been used in the past.
The Firefly algorithm converges faster and searches all possible subsets with better time utilization. Thus, to design an efficient recommendation system, the Naïve Bayes method can be used for offline clustering. Since the online time complexity should be low, the Firefly algorithm, which is more efficient in terms of time utilization, can be used for calculating similarity online. Combining these two techniques might increase the accuracy of the recommendation system and result in efficient time utilization.
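The text does not define its Firefly-based similarity measure, so the following is only an illustrative sketch of how such an online step might look. It assumes the standard attractiveness function of the firefly algorithm, beta0 * exp(-gamma * r^2), where r is a distance between two fireflies; here r is taken to be the Jaccard distance between keyword sets. The function names, the (keywords, pages) data layout, and the choice of Jaccard distance are all assumptions made for illustration, not the authors' method.

```python
import math


def jaccard_distance(a, b):
    """Distance in [0, 1] between two keyword sets (0 means identical)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def firefly_attractiveness(r, beta0=1.0, gamma=1.0):
    """Standard firefly attractiveness: beta0 * exp(-gamma * r^2)."""
    return beta0 * math.exp(-gamma * r * r)


def top_k_pages(query_keywords, cluster_users, k=3):
    """Rank pages browsed by users of the active user's cluster.

    cluster_users is a hypothetical list of (keywords, pages) pairs,
    one pair per user in the cluster.
    """
    scores = {}
    for keywords, pages in cluster_users:
        r = jaccard_distance(query_keywords, keywords)
        sim = firefly_attractiveness(r)
        for page in pages:
            # A page keeps the best similarity of any user who browsed it.
            scores[page] = max(scores.get(page, 0.0), sim)
    # Sort by descending similarity, breaking ties by page name.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [page for page, _ in ranked[:k]]
```

With this choice, users whose keywords match the query exactly receive the maximum similarity beta0, and the similarity decays smoothly as the keyword sets diverge; gamma controls how quickly it decays.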
IV. Overview of the Proposed Work

Initially, the web log files are obtained from America Online Inc. [1]. The log files consist of five fields: an anonymous ID for each user, the query of each user, the query time, the URL that the user proceeded to, and its rank in the result. These logs are collected and grouped by anonymous ID. The URLs of all users are obtained, and their content is downloaded and processed. The processing includes removing stop words from each URL's data and extracting keywords. Similar users are clustered on the fetched keywords using the Naïve Bayes clustering technique, which provides more efficient clusters than clustering by association rules. The resulting clusters are passed to the online component.

In the online process, when an active user issues a query, the keywords of the query are extracted. The similarity between the extracted keywords and the other users in the active user's cluster is calculated using the Firefly similarity measure.
The similarity values are sorted along with the web pages browsed by similar users in the cluster. The top k web pages are recommended to the active user as the result.

V. Proposed Work

The proposed system follows a linear process: collecting and processing the web logs, clustering similar users with the Naïve Bayes clustering technique, and finally generating recommendations based on a similarity measure derived from the Firefly algorithm.

A. Preprocessing of Web Logs

The web logs are collected from AOL Inc. [1]. The data set consists of 20 million web queries from 650 thousand real users over 3 months. It includes the anonymous ID, query, query time, item rank, and click URL. The log file contains a large number of users along with the web pages visited by them. It is validated and split by anonymous ID, so that each user's records are placed in an individual file. The content of each URL is fetched and downloaded.
The fetched content undergoes stop word removal and stemming, after which the final keywords are extracted. Features such as keywords, timing, frequency, click URL, and revisit are fetched. The user profile is constructed from those features.
The user profile is constructed from the following features, which are taken from the user log files:

· Keywords: The terms extracted from the content of the URL
· Timing: The time that the user spent on a particular URL
· Frequency: The number of times the user visited the URL
· Clickstream: The number of clickstreams visited by the user
· Revisit: Whether the user revisited the web page

The keywords are generated from the data fetched from the URL. The timing for each URL is estimated from the given date and time by calculating the difference between consecutive URLs searched in a single day, subject to some time constraints. The frequency is the number of times the user clicked the URL. The clickstreams are the links clicked by the user for additional information. The timing of a revisit is calculated in order to decide how strongly the user preferred the page. The information from the URL is thus collected and processed to obtain the features of the user.
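The timing, frequency, and revisit features described above could be derived from one user's log records roughly as follows. This is a minimal sketch under stated assumptions: the function name build_profile, the (query_time, click_url) record layout, the timestamp format, and the 30-minute cap standing in for the unspecified "time constraints" are all hypothetical choices, not details given in the text.

```python
from datetime import datetime

# Assumed time constraint: a single visit is counted as at most 30 minutes.
MAX_DWELL_SECONDS = 30 * 60
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def build_profile(records):
    """Build per-URL features for one user.

    records: iterable of (query_time, click_url) string pairs for a
    single anonymous ID (hypothetical layout).
    Returns {url: {"timing": seconds, "frequency": int, "revisit": bool}}.
    """
    events = sorted(
        (datetime.strptime(t, TIME_FORMAT), url) for t, url in records
    )
    profile = {}
    for i, (ts, url) in enumerate(events):
        entry = profile.setdefault(
            url, {"timing": 0.0, "frequency": 0, "revisit": False}
        )
        entry["frequency"] += 1
        entry["revisit"] = entry["frequency"] > 1
        # Timing: gap to the next click on the same day, capped.
        if i + 1 < len(events):
            next_ts = events[i + 1][0]
            if next_ts.date() == ts.date():
                gap = (next_ts - ts).total_seconds()
                entry["timing"] += min(gap, MAX_DWELL_SECONDS)
    return profile
```

The last click of each day contributes no timing in this sketch, because there is no later click on the same day to measure against; a real system would need a policy for that case.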
B. Naïve Bayes Clustering

Clustering, also known as unsupervised classification, is a descriptive task with many applications. Clustering is the decomposition or partition of a data set into groups in such a way that the objects in one group are similar to each other but as different as possible from the objects in other groups. The three main approaches to clustering data are partition-based clustering, hierarchical clustering, and probabilistic model-based clustering. Probabilistic model-based clustering is a soft clustering, where an object can belong to more than one cluster following a probability distribution.
A clustering is useful if it produces some interesting insight into the problem being analyzed. Naïve Bayes clustering is a probabilistic clustering technique based on Bayes' theorem with a strong independence assumption between features. The feature variables can be discrete or continuous.
This probabilistic clustering handles both nominal and numeric variables in the data set, and its novelty lies in the use of mixtures of truncated exponentials (MTE) densities to model the numeric variables. In Naïve Bayes clustering, the class is the only root variable, and all the attributes are conditionally independent given the class. The clustering problem reduces to taking a data set of instances and a previously specified number of clusters (k), and working out each cluster's distribution and the population distribution between the clusters.
To obtain these parameters, the expectation maximization (EM) algorithm is used. Since Naïve Bayes clustering is a probability-based technique, an item belongs to a cluster if and only if it is related to that cluster. This helps in eliminating outlier data during clustering. It also provides proper clustering with fewer computations.
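As an illustration of how EM estimates each cluster's distribution and the population distribution between clusters, the sketch below fits a Naïve Bayes mixture to binary feature vectors, for example keyword present or absent. This is a deliberate simplification: it assumes Bernoulli features rather than the MTE densities mentioned above, and the function name, the seeding scheme, and the Laplace smoothing are illustrative assumptions.

```python
import math


def em_naive_bayes(data, k, iters=50, seed_rows=None):
    """Soft-cluster binary vectors with a Naive Bayes mixture via EM.

    data: list of equal-length lists of 0/1 features.
    Returns (prior, theta, resp): cluster weights, per-cluster feature
    probabilities, and per-item membership probabilities.
    """
    n, d = len(data), len(data[0])
    # Assumed initialization: evenly spaced rows act as cluster seeds.
    seeds = seed_rows if seed_rows is not None else [i * n // k for i in range(k)]
    prior = [1.0 / k] * k
    theta = [[0.75 if data[s][j] else 0.25 for j in range(d)] for s in seeds]
    resp = []
    for _ in range(iters):
        # E-step: posterior probability of each cluster for each item,
        # using the conditional independence of features given the class.
        resp = []
        for x in data:
            logp = []
            for c in range(k):
                lp = math.log(prior[c])
                for j in range(d):
                    p = theta[c][j]
                    lp += math.log(p if x[j] else 1.0 - p)
                logp.append(lp)
            m = max(logp)
            w = [math.exp(v - m) for v in logp]
            total = sum(w)
            resp.append([v / total for v in w])
        # M-step: re-estimate parameters with Laplace smoothing so that
        # no probability becomes exactly 0 or 1.
        for c in range(k):
            nc = sum(r[c] for r in resp)
            prior[c] = (nc + 1.0) / (n + k)
            for j in range(d):
                on = sum(r[c] for r, x in zip(resp, data) if x[j])
                theta[c][j] = (on + 1.0) / (nc + 2.0)
    return prior, theta, resp
```

The returned resp values are soft memberships, so an item can belong to several clusters with different probabilities. One possible way to realize the outlier elimination described above would be to discard items whose largest membership probability falls below a chosen threshold, although the text does not specify such a rule.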