Crunch and Manage Graph data: the survival kit Marie-Aude AUFAURE Business Intelligence Team Ecole Centrale Paris - PDF

Please download to get full document.

View again

of 49
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Information Report
Category:

Internet

Published:

Views: 5 | Pages: 49

Extension: PDF | Download: 0

Share
Related documents
Description
Crunch and Manage Graph data: the survival kit Marie-Aude AUFAURE Business Intelligence Team Ecole Centrale Paris Context Data everywhere Big Data phenomenon Data are mainly unstructured 80% of data manipulated
Transcript
Crunch and Manage Graph data: the survival kit Marie-Aude AUFAURE Business Intelligence Team Ecole Centrale Paris Context Data everywhere Big Data phenomenon Data are mainly unstructured 80% of data manipulated in an enterprise are unstructured Data are produced in real time and distributed Data come from heterogeneous sources in an unpredictable way Mobile phone, sensors, computers, TV, etc. Our objective: bridge the gap between structured data and unstructured content in a Business Intelligence perspective Context and Motivation 2 What happens in an enterprise context? Unstructured information can be modeled by graphs (RDF, GraphML, etc.) But, key information is still stored in relational databases Heterogeneous data Non explicit representation of relations between objects No link between all databases of an enterprise Objective: provide a unique representation for massive and heterogeneous data using graphs Problems: Extracting these graphs Processing and visualizing big graphs We developed a set of tools for managing complex heterogeneous graphs Context and Motivation 3 User level Aggregation Visualization Visual Query Language Nodes/Concepts selection Middelware for information search Knowledge layer SPIDER- Enterprise Graph ontology matching Extraction and Merging Extraction Data layer DB1 DB2 DBn Textual content Open Linking data A set of tools for managing large heterogeneous graphs DATA LAYER Graph extraction from relational databases A set of tools for managing large heterogeneous graphs 5 NoSQL databases Relational databases are widely used for any kind of application because of: Transaction management (ACID properties) Advanced functionalities Query language This richness is also a drawback Distributed databases are complex Transactions are difficult to manage (and consistency is not always necessary) Cost of join operation is too high The NoSQL ( Not Only SQL ) movement appeared in 2009 Data scalability Big Data Management Compromise on ACID properties, transaction management Overview of NoSQL Databases and Distributed Computing 6 Distributed Systems Issues for data replication: Performance (writing several copies of an item) Consistency: is the ability of a system to guarantee that the transaction of each user run in isolation from other transactions, and never fails difficult to manage with data replication consistency levels: Strong consistency (ACID properties): synchronous replication and possibly heavy locking mechanisms Eventual consistency: the system guarantees the convergence towards a consistent step Weak consistency: fully favors efficiency, never wait for write and read operations some requests may be processed on outdated data Overview of NoSQL Databases and Distributed Computing 7 The CAP theorem CAP: Consistency (all nodes see the same data at the same time) Availability (node failures do not prevent survivors from continuing to operate) Partition tolerance (system continues to operate despite arbitrary message loss) A distributed system cannot simultaneously satisfy consistency and availability while being tolerant to failures Overview of NoSQL Databases and Distributed Computing 8 NoSQL models Key/Value (e.g. Dynamo): One key, one value Hashtable Document databases (e.g. MongoDB) Key store, but the value is a (structured) document (XML or JSON) Querying data is possible (by other means than just a key) Column databases (e.g. Cassandra) Each key is associated with multiple attributes (number of columns dynamic) Inspired by Google BigTable Graphs databases (e.g. Neo4j) Key Alice Inspired by Euler Graph Theory Focused on modeling the structure of the data Value City_Paris Birthdate_01/01/02 Overview of NoSQL Databases and Distributed Computing 9 Modelling with Cassandra Key Value City Paris ZIP Key Value Home Key Value City Paris ZIP Work Key Value City Levallois ZIP Key Value Alice Key Value Home Key Value City Paris ZIP Work Key Value City Levallois ZIP Bob Key Value Home Key Value City Cannes ZIP Work Key Value City Antibes ZIP Overview 10 of NoSQL Databases and Distributed Computing 10 Variety of graphs From simple graphs (basic mathematic definition): No information about nodes (all nodes have the same semantics, no attributes) E V V Mainly focus on the relations between objects To labeled and attributed graphs Add semantic information to nodes And more complex structures like Hypergraphs and Hypernodes allowing nested structures (complex attributes and/or relations) Our need: a model able to represent heterogeneous graphs based on complex objects and a clear separation between the schema and the instance level A set of tools for managing large heterogeneous graphs 11 SPIDER-Graph Model Two levels: schema and instance (intent and extent) Complex-node CN CN is defined by CN:= (c n, A cn ) where: c n denotes the name of CN and A cn denotes a set of attributes A cn := { acn acn := n, t } where n is the attribute name, t is the type. t is a basic type (such as: Integer, String,...) or a reference to another CN. SPIDER-Graph Schema S S= N cn R where: N CN is a set of complex nodes and R is the set of relations between N cn defined by R:= { r,cn,cn, r R, CN, CN N } s d s d where r denotes the name of R, CN s denotes the node source name and CN d denotes the node destination name. CN A set of tools for managing large heterogeneous graphs 12 Data level: extraction of graphs from relational databases Definition of a set of transformation rules using metadata information (primary and foreign keys in the relational schema) for building a SPIDER graph (schema and instance levels) Each relation is transformed into a complex node Implicit relations are transformed into edges The resulting graph is populated Experimented with a real database Containing 30 tables about 1788 students SPIDER-Graph schema: 30 nodes and 48 relations SPIDER-Graph instance: nodes and relations A set of tools for managing large heterogeneous graphs 13 Patterns examples Relation Transformation R1 (a,b) R2(a #,c) a b R1 Type1 Type2 Is-A c a R2 R1 Type3 R1 (a, b) R2(a#, c#) R1 R2 R2 R3(c, d) a b Type1 Type2 d c Type3 Type4 A set of tools for managing large heterogeneous graphs 14 Transformation to a SPIDER-Graph Model: Schema Translation (1) Complexnode creation (2) Relations Identification Manager Employee Mid Role Product Employee String Is-A Enum Name LastName address DNO Integer String String String Department Department DNO Dname Location Integer String String NumP Pname type Price Integer String String Double Part-of NumP Pnum quantity Desc Product-Project Product Project Integer String Part-of Facility Works-on Pnum Pname StartDate Dnum Project Integer String String Department Fid name type Pnum Integer String String Project A set of tools for managing large heterogeneous graphs 15 Transformation to a SPIDER-Graph Model: Data Migration (3) Intance level (population) Mid Role Managing the project team NumP Pname type Price Manager_1 Product-1 Employee_3 123 Phone7 Mobile Phone 500eur Is-A Part-of Enum Name LastName address DNO NumP Pnum quantity Desc Employee_3 Yan Paris Product-Project_1 03 Smith Department-1 Product-1 Project hardware Part-of Works-on Pnum Pname StartDate Dnum Department_1 DNO Dname Location Project_1 101 Trans Department_1 01 Development Paris Works-on Enum Name LastName address DNO Employee_1 01 Alain Jones Paris Department-1 Part-of NumP Pname type Price Product Scr2 LCD screen 1200eur Part-of NumP Pnum quantity Product-Project_2 Desc Product-2 Project software A set of tools for managing large heterogeneous graphs 16 DATA LAYER Graphs merging A set of tools for managing large heterogeneous graphs 17 Motivation Being able to visualize information from different data sources into one graph Accelerate merging process through distributed computing Being able to treat very large graphs Our case study : extracting graphs from databases before merging them A set of tools for managing large heterogeneous graphs 18 Distributed computing: Map/Reduce Map Programming : The master node model for processing large distributed data divide the input in subproblems sets sent to the worker nodes. The worker Initially designed by Google Labs (and used internally) node processes a local subset of data and produces intermediate groups. Reduce : The master node collects the intermediate results corresponding to a sub-problem sand combines them to produce the output Now promoted by other web companies (Yahoo!, Amazon) Many open source implementation (Hadoop, MongoDB, etc.) Based on functional programming and operates in 2 steps: a function MAP and a function REDUCE Used for large-scale graph processing, text processing, machine learning, etc. [J Dean & S Ghemawat ] Overview of NoSQL Databases and Distributed Computing 19 Map/Reduce: a simple example split1 W word1,1 mot2,1 W word2,3 split2 W word1,1 word1,1 word1,1 split3 W word2,1 word1,1 word,1 W word1,5 Map Reduce Overview of NoSQL Databases and Distributed Computing 20 Algorithm DB1 DB2 Schema matching Graph1 A B Graph2 B' C A B=B' C Graph1 A2 B2 A1 B3 B1 Graph2 B'1 B'2 B' 3 C1 A2 B1=B'1 B3=B'2 A1 B2 Instance merging C1 B' 3 A set of tools for managing large heterogeneous graphs Schema matching algorithm Objective : return a mapping to prepare the merging 3 super steps : Graph extraction Matching nodes between two graphs Creating the merged node Input : N schema graphs Output : List of nodes to be modified (nodei,fromgraphj ; mergednode) A set of tools for managing large heterogeneous graphs 22 Schema matching algorithm Graph1 (N1,R1) Graph2 (N2,R2) Graph extraction Graph3 (N3,R3) (1&2,N1) (1&3,N1) (1&2,N2) (2&3,N2) (1&3,N3) (2&3,N3) (1&2,N1) (1&2,N2) (1&3,N1) (1&3,N3) (2&3,N2) (2&3,N3) Matching nodes (N1i,N2j) (N1i,N3j) (N2i,N3j) Creating merged node (N1i,MergedNode(N1i,N2j) (N2j,MergedNode(N1i,N2j) A set of tools for managing large heterogeneous graphs Matching two schema nodes For a given node N1j We select the nodes whose name is quite similar using string similarity sim (N1j,N2i) 0.4 In those nodes, we compute the similarity between attributes, and the match exists if simattributes treshold (0.9 for testing) Matching rules are produced: N1(name,type,attributes) N2(name,type,attributes) Merged node N: name(n1), type(n1), attr(n1) U (attr(n2)-attr(n1) attr(n2)) A set of tools for managing large heterogeneous graphs 24 Schema Matching Employee Scrum-Master Graph1 Leader-of Project Same-Department Works_on Employee IS-A Graph2 Project Manager Nodes Matching Process Matching Rules Extraction Merge Graph1.Employee and Graph2.Employee Merge Graph1.Project and Graph2.Project A set of tools for managing large heterogeneous graphs Instance merging algorithm This algorithm can be divided in two parts: Extracting candidates for matching Matching instance nodes between two graphs Creating the merged node Creating the merged graph For a given node N1j : We determine the attributes matched For those attributes, if the average of the values similarity is higher than a certain threshold β (around 0.8), it is a match A set of tools for managing large heterogeneous graphs 26 Instance Matching Scrum-Master Employee_ 1 Leader-of Project_ 1 Alain Trans-2 Employee_ 2 Scrum-Master-of Smith Graph Merging Apply the matching rules Precision over the similarity threshold β 1 0,95 0,9 0,85 0,8 0,75 0,7 0,65 0,6 0,55 0,5 Precision Precision A set of tools for managing large heterogeneous graphs Recall over the similarity threshold β 1 0,9 0,8 0,7 0,6 0,5 0,4 0,3 0,2 0,1 0 Recall Recall A set of tools for managing large heterogeneous graphs Issues Performance evaluation: Better with the increase of the number of graphs (due to the matching process) Precision and Recall: similarity threshold between 0.7 and 0.8 Limits of the string similarity measure Synonyms are not taken into account (e.g. city and town) Precision Jaro-Winkler precision is around 90% for normal text Test over a large cluster will have to be done The identification of matched nodes could be improved Hadoop is a batch process, not suited for real-time But, the algorithm is working on large complex graphs using Hadoop A set of tools for managing large heterogeneous graphs 31 USER LAYER Graph aggregation and visualization A set of tools for managing large heterogeneous graphs 32 Graph Aggregation: SNAP & k-snap Tian, Hankins and Patel (SIGMOD 2008) Summarization based on user-selected node attributes and relationships. Provide drill-down and roll-up abilities to navigate multi-resolution summaries. Produce meaningful summaries for real applications (and multiple points of view) Efficient and scalable for very large graphs A set of tools for managing large heterogeneous graphs Main principle: compatible groupings A-compatible Grouping : All nodes in the same group must have the same attributes. (A,R)-compatible Grouping: A-compatible grouping, all nodes in the same group must have the same neighbor groups. 2 steps: attributes selection and group division A set of tools for managing large heterogeneous graphs Example T=15 H=30% 6 8 T=30 H=70% T=15 H=30% 7 9 T=30 H=70% T=15 H=30% 3 4 T=30 H=70% T=15 H=30% 1 5 T=30 H=70% T=15 H=30% 2 A set of tools for managing large heterogeneous graphs 35 Example T=15 H=30% 6 8 T=30 H=70% T=15 H=30% 7 9 T=30 H=70% T=15 H=30% 3 4 T=30 H=70% T=15 H=30% 1 5 T=30 H=70% T=15 H=30% 2 A set of tools for managing large heterogeneous graphs 36 Example T=15 H=30% 6 8 T=30 H=70% T=15 H=30% 7 9 T=30 H=70% T=15 H=30% 3 4 T=30 H=70% T=15 H=30% 1 5 T=30 H=70% T=15 H=30% 2 A set of tools for managing large heterogeneous graphs 37 Example T=15 H=30% 6 8 T=30 H=70% T=15 H=30% 7 9 T=30 H=70% T=15 H=30% 3 4 T=30 H=70% T=15 H=30% 1 5 T=30 H=70% T=15 H=30% 2 A set of tools for managing large heterogeneous graphs 38 Limits of SNAP Aggregation is very rigid in terms of attributes : Cartesian product of all modalities. Ineffective with the presence of a large number of attributes or / and numerical attributes. increases the number of groups with small size K-SNAP relaxes the homogeneity requirement and allows users to control the size of the summaries; In many applications, the graphs can be heterogeneous, i.e., objects are not represented by the same list of attributes; K-SNAP cannot be applied on heterogeneous graphs We proposed: A measure of homogeneity and separability Extensions of the algorithm for heterogeneous graphs A set of tools for managing large heterogeneous graphs 3 Architecture of our proposal A set of tools for managing large heterogeneous graphs 40 Proposition of a new evaluation measure: homogeneity measure using a distance P = ( C, C2.., C For a given partition, this measure Δ locally evaluates the 1 n heterogeneity of each class Ci for a relation R and determines the less j dense to be spitted. ) We use the contingency table (or association table) defined for each pair (m, n) of nodes for a given relationship defined as follows: Object m a c Object n b d where a = N ( m) N ( n), b = N ( m) a, c = N ( n) a, R t R t R t R t N R t ( v) = { w V ( u, w) E } { v} i A set of tools for managing large heterogeneous graphs Example A2 A d j ( A1, A2) = (1+ 3)/( ) A A A1 A A2 A1 A d j ( A1, A3) = 3/4 C A A2 A d j ( A1, A3) = 1 C C B B B A3 A A set of tools for managing large heterogeneous graphs Selection Step Starting from a A-compatible grouping, our procedure is to look at each iteration the relationship R * and the group C to divide that maximize j i * j the evaluation measure up that the cardinal of the partition is equal to K. δ i The measure is defined as follows, using the Jaccard distance: ( P) = avec R j R δ j i j ( P) = = m C i R j R1 i P n C i d j δ j i ( m, n) with d j ( m, n) = (b + c)/(a + b + c) is the Jaccard distance for a relation R j A set of tools for managing large heterogeneous graphs Division step Division step is performed as follows: v d 1. Determine the node called «central node» of the class to split verifying: C i d = arg max v C i * Deg ( v) 2. Divide * into two sub-classes according to the following strategy : one contains the neighbour of the central node, and the other one all nodes not connected to the central node. R j * A set of tools for managing large heterogeneous graphs Measure of homogeneity For a given partition P = ( C1, C2.., Cn ),this measure denoted Δ evaluates the homogeneity of each class C i and determines the less homogeneous to be splitted. Principle: For each relation, we denote: R j Intra-group criterion Density of group C i / relation j Inter-group criterion Density and connectivity Homogeneity measure with Deg j (v) : is the degree of vertex v according to the relationship et α [1, 2] The most homogeneous group with few relations outside is not divided R j A set of tools for managing large heterogeneous graphs Application: network of books about American politics Elaborated by Mark Newman [Newman, 2003]. Contains105 nodes and 441 vertex. Initial partition A-compatible Final partition After 4 iterations (7 groups) A set of tools for managing large heterogeneous graphs Application: network of books about American politics Homogeneity measure K-SNAP (K=7) A set of tools for managing large heterogeneous graphs Conclusion: tools developed SPIDER-Graph: a model for representing heterogeneous graphs containing complex nodes A set of transformation patterns matching with a semantic layer Merging of graphs extracted from RDBMS using Hadoop/HDFS Queries Visual query language Keywords search (by selecting nodes or ontology concepts) Summarization Aggregation (K-SNAP) Customized summaries Meaningful summaries for real applications, Efficient and scalable for very large graphs Conclusion 48 Readings 1. Tian, Y., R. A. Hankins, et J. M. Patel (2008). Efficient aggregation for graph summarization. In SIGMOD Rania Soussi, Etienne Cuvelier, Marie-Aude Aufaure, Amine Louati and Yves Lechevallier (2012) DB2SNA: an All-in-one Tool for Extraction and Aggregation of underlying Social Networks from Relational Databases. In: The influence of technology on social network analysis and Mining, Tansel Ozyer et al. (eds.), Springer, Nov Rania Soussi and Marie-Aude Aufaure (2012) Enterprise Ontology Learning for Heterogeneous Graphs Extraction, 2nd International Conference on Model & Data Engineering (MEDI'2012) October Poitiers, France. 4. J Dean & S Ghemawat : MapReduce: Simplified Data Processing on Large Clusters, Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, Pierre Senellart: Web Data Management, Cambridge University Press, Roberto Zicari: Big Data: Challenges and Opportunities, Team : Etienne Cuvelier (postdoc, ECP), Rania Soussi (PhD, ECP), Amine Louati (intern, ECP&INRIA), Bruno Almeida Pimentel (intern INRIA), Alessandra Anyzewski (intern INRIA), kieu van cuong (intern ECP), Baptiste Thillay (intern, ECP). In collaboration with Yves Lechevallier (INRIA Axis) outline 49
Recommended
View more...
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks