Abstract. Heaviest k-subgraph problem is determinea block of k nodes of a weighted graph(of n nodes) such that the total edge weight within the subgraph induced by theblock is maximized. Finding heaviestk-subgraph is a NP hard problem in literature. We have proposed an approach forapproximating the solution of heaviest k-subgraph in which greedy approach isused to reduce the size of graph which is used as input for branch and bound implementationof heaviest k-subgraph problem.1 IntroductionGiven an undirected graph the density of a subgraph on vertex set S is definedas , where E(S)is the set of edges in the subgraph induced by S.The problem of finding a densestsubgraph of a given graph G can be solved optimally in polynomial time 1.

Whenwe require the subgraph to have a specified size, the problem of finding amaximum density subgraph becomes NP-hard.Densest subgraph problems havereceived significant attention for detecting important sub structures inmassive graphs like web and different social networks. Algorithms for findingdense subgraphs have proved to be an effective tool in data analysis with applicationsin community detection, finding patterns in gene annotation graphs , link spamdetection, as well as event detection in social media.

In all these applications the underlyinggraph is massive and thus fast scalable algorithms for detecting densesubgraphs are required to be effective.In this paper we used greedy technique to approximate the solutionof heaviest k-subgrah. The main intuition of greedy method is to always selecta choice that looks best among the available options at that moment. Thisheuristic might not always give an optimal solution, but can be helpful ingiving near optimal solution. In this paper greedy approach is used whileconstructing graph from twitter data. We reduce the size of the graph which actas input to branch and bound implementation 2 of heaviest k-subgraph.2 Related Work Heaviest k-subraph problem is aNP-hard and there does not exist polynomial time algorithm to find exactsolution.

Still there exists some of the researches on approximating heaviestk-subgraph problem.Asahiro et al. 3 describe a greedyalgorithm for heaviest k-subgraph problem. They assign weight to each of thevertices. This is equal to the total sum of the weight of the edges connectingto that vertex known as weighed degree. They repeatedly remove a vertex withthe minimum weighted-degree in the currently remaining graph, until exactly kvertices are left.

Feige et al. 4 present anapproximation algorithm for heaviest k-subgraph problem using semidefiniteprogramming.Papailiopoulos et al.

5 present analgorithm that combines spectral and combinatorial techniques. This algorithmexamines candidate subgraphs obtained from vectors lying in a low-dimensionalsubspace of the adjacency matrix of the graph. Depend upon the spectrum ofgraph this algorithm comes with novel performance.

Khuller et al.6 give a1/2-approximation algorithm for (DalkS) and show that DamkS is as hard as DkSwithin a constant factor.Charikar in 7 describes a simpleheuristic that has a 2-approximation guarantee.

Nicolas et al. 8 alpresent severalapproximation algorithms which run in moderately exponential or parameterizedtime, describing trade-offs between complexity and approximability.3 Proposed Technique In this section, we presented aproposed technique for approximating solution of heaviest k-subgraph. Floediagram of various steps are shown in Fig. 1. Fig.1. Flow diagram of Proposed Technique3.

1 Pre-processing ofdataInthis step, pre-processing of twitter data is done. Each tweet is firsttokenized and then Stanford POS Tagger is used to remove the stop words. Remainingwords in particular tweet are stored in same sequence as in original tweet. Let each tweet be represented by which contains stop words and other words . After preprocessingstopwords are removed and tweet is represented by . 3.2 Graph FormationLet be the set of tweets and be one of the tweets where set of is the sequence of words in tweet . A graph be the structure specifying relationshipsbetween the words of a collection, where corresponds to the set of words, calledvertices or nodes and is the set of relations among words.

Each word in represents a node in and an edge between any two nodes correspondsto the simultaneous occurrence of the two words in a tweet. The weight denotesthe number of such instances. Let be the boolean function which tells whethertheir exists edge between node and . Undirected graph isrequired in heaviest k- subgraph to extract keywords and directed graph isrequired for forming sequences.For undirected graph,let it be , is an unordered pair and edge between x and yis bidirectional, thus is same as 3.3 Heaviestk-subgraph Problemdefinition (Heaviest-k-Subgraphproblem). Given an undirected weighted graph G and an integer k> 1, we wish to find a subgraph containing k nodes such that thesum of the weights of its edges is maximum.

3.4 Approximationof heaviest k-subgraphMainmain idea of this method is to reduce the size of graph which act as input tobranch and bound method. We used greedy approach to select most effectivevertices of the graph.

We didn’t pick directly top k weighted degree verticesrather we used some technique to reduce size of graph. We first calculate theweighed degree of each of the vertices. Then we take top k/2 weighted degreevertices in set H and sort the remaining vertices by sum of the weights of theedges connecting to set H. From thatsorted vertices, again we take top k/2 vertices in set G .In next step weextract graph induced by these k vertices of set H and G and vertices connecting these k vertices.According to our approach large graph is reduced to small size on which branchand bound implementation of heaviest k-subraph can give approximate solution.

In next section we have presented pseudocode of the above approach. 3.5 Proposed Algorithm 1. Input: A graph and integer 2.

Output: Aproximate heaviest k-subgraph of G(V,E)3. Let is an array that contains sum of the weightsof the edges connecting each node4. for each node in do5.

6. = (S); 7. Set ;8.

//remove top from and put in set 9. for to 10. = ; 11. //Put remaining element from in Set R12. ;13.

//for each node in R calculate sum of weight of edgesconnecting to set H14. For each node in 15. = ;16. = ;17. /remove top from and put in set 18. for to 19. = ; 20.

Set ;21. // Extract graph induced by vertices of set k and vertices connecting to these k vertices.22. Graph = ;23. //apply branch and bound on graph 24. = 25.

Return ; 4 Experiments and ResultsWe have collected a set of tweets bymeans of Twitter Streaming API and Twitter4J (an unofficial java library forTwitter API). We use these samples for our proposed algorithm for approximatingheaviest k-subgraph problem. The algorithms were implemented in Java and wererun on a machine with 2.53 GHz clock.

4.1 Results In this section we compare result ofapproximation algorithm of heaviest k-subgraph problem with exact branch andbound algorithm of heaviest k-subgraph problem. We compare running time andsolution of the algorithm i.e.

weight of heaviest k-subgraph Figure 2 Figure 2 shows the running the timeof approximate algorithm with exact branch and bound. It shows that exactbranch and bound runs faster when value of k is smaller i.e. upto 15. When the value of k get increase more thancertain value (here more then 15), approximate algorithm runs faster. Runningtime of exact branch and bound algorithm increase exponential after certainvalue of k.

After value of k=20, the time of exact branch and bound explode.But on the other hand approximate algorithm has able to solve problem of largek without exploding. Figure 3 In fig 3 we compared solution ofheaviest k-subgraph problem i.

e. the weight of k-subgraph returned by both ofthe algorithm. Branch and bound algorithm 1 gives the exact solution ofheaviest k-subgraph problem whereas approximate algorithm approximate thisvalue. According to results if we compare ratio of approximate solution withexact solution, it shows that most of the time, ration is above 0.9. After analyzing results of figure 2and 3, we can say that approximate algorithm can run efficiently for largevalue of k while maintaining the ratio of approximate solution to exactsolution more than 0.

9 most of the times. 5 ConclusionWe have proposed an approximating algorithm for heaviestk-subgraph problem where main idea is to reduce the size of graph which act asinput for branch and bound algorithm. Results show that as the value of k isincreased approximate algorithm runs faster with ratio of approximate solutionto exact solution above 0.9 most of the times. Future work includes improvementof this ratio so that approximate algorithm give more accurate result.