A novel approach to identify influential functions

Abstract

Identifying influential nodes is an important issue in understanding the process of information diffusion in complexsoftwarenetworks.Researchers generally define functions as nodes,relationshipoffunction calls as edges to map software into complex network.This paperproposes a novel approach to identify the influentialnodes in complex software network based on complex network. Firstly,we map the function and relationship of function calling or dependencyto adirectedweightedcallnetwork (DWCN). Secondly, we define NodeScore(NS) to evaluate the importance of nodes in network.It has higher influence when the node has lager value of NS.The NSof each node is calculated bytwo parts in our opinion, Direct Dependency Nodes(DDN) andIndirect DependencyNodes(IDN).Thirdly,algorithmComputeNodeScoreis proposed to calculate NS of each node.Finally, we mine top-k nodes based on the NS. Experimental resultsofvarious versionsabout software Tar and Cflow showtheapproachis effective foridentifying the influential nodes.

Keywords:Complex SoftwareNetwork,FunctionCalls,InfluentialNodes.

1 Introduction

Theunderstandingofsoftware network’sstructure has attracted much attention recently,especially identifyingtheinfluential nodes.These influential nodes play an important part in ensuingreliability and stabilityof software.Once wehaveidentifiedthese nodes,we can pay more attention to them in software versionupdatingandsoftwaremaintenance [1-2], even software refactoring[3].Therefore,identifying influentialnodes in software networkhas become an important task in software engineeringandbecame more and more important.

Alotofwork hasbeen done in identifying influential nodes in complex network,but less in software network.Mapping softwarestructureto complex network is important from different perspectivesbydifferent methods. Thus, theresearch approachesof complex networkcan be applied to theresearchfield of software network. Wang et al. [4]proposedan approachto study the evolution of special software kernel components, which adopted the theory ofcomplex networks. Theyalsoproposed a generic method to find major structural changes thathappenedduring the evolution of software systems. Li et al.[5] proposed a modular attachment mechanism of software network evolution. Theirapproachtreatedobject-oriented software system as amodularnetwork, which wasmore realistic.A new definition of asymmetric probabilitieswasgiven to acquire links in directed networks when new nodes attach to the existing network.Valverde and Sole[6] presented a complex network approach to the study of software engineering. They found universal network patterns in a large collection of object-oriented (OO) software systems. All the systems analyzed here display the small-world effects.It was the first timetostudy the software as a complex software network. The classes were treated asnodesand therelationshipamong classeswere treated asdirectededges.Inspired by the surprising discovery of several recurring structures in various complex networks, a number of works treated software systems as complex networks andindicatedthat software systems exposethe small-world effects and follow scale-free degree distributions.In addition,Cai and Yin[7] treated software executionprocess as an evolving complex network for the first time. Theyexamined there exist invariantpatterns in the dynamic behavior of software systems. The concept of software mirror graphwasintroduced as a new model of complex network toreflectthe software behaviorinformation.

The most usedmeasurementsto measure thecentralityof nodes aredegree centrality (DC),betweenness centrality (BC),and closeness centrality (CC)[8].Following,manyimproved methods were proposed. An approach named Semi-local centrality measure[9]wasproposed. It consideredboth the nearest and the next nearest neighbors. The approach wasa local centrality measure as a tradeoff between low-relevant degreecentrality and other time-consuming measures.Kitsak et al.[10]realized that the position of nodes is important in global networkandproposed k-shell decomposition analysis to obtain ranking index of nodes. It considered the position of nodes in global networkandillustratedmore accurate resultsthan degree and betweenness. However, the k-shell decomposition fails toimplementthe ranking of spreadersinthe same k-shell index.Bae and Kim[11] proposed a novel measurecalledcorenesscentrality to estimate the spreading influence of a node in a network, whichusedthe k-shell indices of its neighbors.The approachwas based on the idea that a powerful spreader has more connections toothernodeswhichreside in the core of network.The approach also pointed out that the number of a node's neighbors has a large influence toits spreading ability. Also, Liu et al. [12] presented an improved method to generate the ranking list to evaluate the node spreading influence, whichtook into account the shortest distance between a target node and node setwhich hasthe highest k-core value.They turned to a new perspective to understand the relationship between not only the k-shell location, butalsothe nodes’ shortest distance to the network core. It is helpful forus tounderstand the importanceofnodes ina network.

Based on the researches above,we find that theinfluenceof a function nodedepend on not only the nodes whichit calls directlybutalso the nodes which it callsindirectly.Firstly we define someconceptsto describe ourapproach, such as DDN (Direct Dependency Nodes) andRN (Reachable Nodes).Then an algorithmComputeNodeScore is proposed to measure the influence ofeachnode in the network. Werankthe influence ofeachnodeto mine the top-knodes.These function nodes have played an important part in ensuring software reliability and stability. So theyshould be taken more attention in the process of software updatingandsoftwaremaintenance.

The rest of this paper is organized as follows.InSection 2some definitions are proposed to describe the problem.Then,algorithmComputeNodeScoreusedto evaluateeachnode is given in Section 3.Experimental resultsin Section 4 show performances ofthealgorithm.Finally, the conclusionsand future workof the paper are presented in Section 5.

2 Preliminaries

Definition 1 DWCN (Directed Weighted Call Network). In DWCN, nodes represent functionsand directededgesrepresent the relationship amongfunctions.We can use a triple to describe the DWCN.

DWCN = (V, E, AL)(1)

WhereV is a set of nodes in the network, like {v1,v2,v3,…,vn}.Eis a set of directed edges, {1,v2>,2,v3>,…,i,vj>,…}.ALis an adjacencyliststored the nodes, edges and the score ofeachnode in the network.

Figure 1is the illustrationof a simple DWCN.

Figure 1.Illustrationof a simple DWCN

In this illustration, V = , E = {1,v2,>, 1,v3,>, 1,v4,>, 2,v5,>, 2,v6,>, 5,v7>}.Eachnode and the relations with other nodes in the network have a record in the adjacencylist.In addition, thescoreofeachnode in the network is different. It is depended on the nodes which it calls directlyandothers which itcalls indirectly.For example in figure 1,node v1calls nodes v2, v3, v4by edges 1,v2>, 1,v3>, 1,v4>. In our approach nodes v2, v3and v4has influence to theimportanceof nodev1. Andothernodes like v5, v6 which v1 can reach by edges 2, v5> and 2, v6>also have influence to nodev1.

Definition 2 DDN (Direct Dependency Nodes).For a nodevi,DDNis a set offunctionswhich calledby nodevidirectly. The DDN of nodeviis gotten by only one call step.

As shown in figure 1, DDN (v1) = { v2,v3,v4}.

We have mentioned theinfluenceof nodeviis based on the nodes it can reach directly andindirectly.

Definition 3IDN(Indirect DependencyNodes).IDN is a set ofnodes thatnodevican reachindirectly.

In figure 1,IDN (v1) = { v5,v6,v7}.

Definition 4NS (NodeScore).NS is defined to measure the influence of the node vi.The NS is given as follows:

NS(v) = p*|DDN| +∑p*NS(u), u∈IDN(v) (2)

It has higher influence when vihas lager value of NS.

p=cij/diout(3)

whichp is the probability of nodevicallsnode vj∈DDN.The numerator cijis a constant to representwhether there is an edge between nodes.If there isan edge i, vj>,cijis 1,otherwisecijis 0.The denominator is the out-degree of nodevi.

3 Method of Identifying Influential Nodes

Normally, thesoftwarewhich we mappedis consistedof functions and function callsamongthem. Firstly we extracted these from the source code. Then we map them toaDWCN, and we get the adjacencylist at the same time. Finally, algorithmComputeNodeScore is used toidentifyinfluential nodesby calculatingtheNS of each node. As shown in Figure 2.

Figure 2. Framework of our approach

To measure theNS of nodevi,we build adjacencylistbased onnode set N and edge set E inDWCN. A novel algorithm namedComputeNodeScore is proposed.In thisapproach,we evaluate the importance of the nodeviiteratively. Finally,we calculate the NS ofnodevi.

3.1Build AdjacencyList

Theresultof algorithm BuildAdjacencyList is a part of DWCN. We study the software as a complex software network and get two sets which are node set N and edge set E. The process of Building Adjacency List is based on the two sets. We build a list for each node.

Algorithm1: BuildAdjacencyList

Inputs:set V=, set E={s1,ve2>,s2,ve3>,…,si,vej>,…}

Output: OutDegreeList(vi) //the out-degree list of node vi

Process:

(1)for(each viV){

(2)for(si,vej>∈E){

(3)if(vi= vsi)

(4)outDegreeList(vi).add(vej);

(5)}

(6)RerurnOutDegreeList(vi);

(7)}

As shown in algorithm BuildAdjacencyList,for each node in set N wetraversethe edges in set E in line 1 and line 2. We define the nodes of an edge start node vsiand end node vejrespectively. Line 3 to line 4 we add the end node vejof an edge to theOutDegreeListof node vi, when node viequals to the start node vsiof the edge. Finally, we calculate theOutDegreeListof viin line 6.

3.2 Mininginfluentialnodes

Based on theout-Degree adjacencylist of node vi, we cancalculateNS of each node. Algorithm ComputeNodeScore is used to mine influence nodes in the software complex network. We calculate the influence from IDN(vi) and DDN(vi). As shown in algorithm ComputeNodeScore.

Algorithm2:ComputeNodeScore

Inputs: node vi, OutDegreeList(vi)

Output: the NS of node vi

Process:

(1)Initialise NS(vi)=0;

(2)if( OutDegreeList [vi] != null){

(3) p= 1/OutDegreeList.size();

(4)for(each vjOutDegreeList [vi]){

(5)if(vjis not in Queue(vi)){

(6)Queue(vi).add(vj);

(7) NS(vi) +=p*NodeScoreNumber(vj);

(8)}

(9)}

(10) NS(vi) =p*OutDegreeList.size()+NS(vi);

(11)}

(12)else{

p = 0;

(13)}

As shown in algorithmComputeNodeScore,we evaluate the influence ofanode in the network by an iterative process.Firstly,we initial NS of nodevias a constant 0. We also build a queue for node viwhich stored the set DDN(vi). Line2toline 3is the process to compute the probability of nodevicallsnode vj∈DDN.If the out-degree ofnode viis not equal to 0, we set the possibility to 1/djout. Otherwise,the possibilityis 0.Line 4 to line 6 is the process ofenqueueing of vjwhen one of the set DDN(vi) is not in the queue. Then the NSof nodeviin RN is calculatediteratively in line 7.Finally, we calculate NS of viin line 10.

Next,we sort theNSof these nodes and mine the top-k influential nodesinthe software network.

  • 发表于:
  • 原文链接https://kuaibao.qq.com/s/20180801G1LA1P00?refer=cp_1026
  • 腾讯「云+社区」是腾讯内容开放平台帐号(企鹅号)传播渠道之一,根据《腾讯内容开放平台服务协议》转载发布内容。

扫码关注云+社区

领取腾讯云代金券

年度创作总结 领取年终奖励