[Translation] JanusGraph Illustrated Series: Indexes in Detail (JanusGraph Index)

洋仔聊编程

Published 2022-05-11 10:05:26

Included in the column: Java开发必知必会

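Hi everyone, I'm 洋仔. The JanusGraph illustrated series is `updated continuously`~

Index of all graph database articles:

For a collected list of all graph-related articles, see the graph database series table of contents (图数据库系列-文章总目录)
Link: https://liyangyang.blog.csdn.net/article/details/111031257

Source code analysis is available on GitHub (stars welcome~~): https://github.com/YYDreamer/janusgraph
High-resolution version of the diagrams below: https://www.processon.com/view/link/5f471b2e7d9c086b9903b629
Version: JanusGraph-0.5.2
When reposting, please keep the following notice: >Author: 洋仔聊编程 >WeChat official account: 匠心Java >Original article: [https://liyangyang.blog.csdn.net/](https://liyangyang.blog.csdn.net/)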

Overview

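JanusGraph index --> graph index && vertex-centric index

graph index --> composite index && mixed index (global, graph-wide indexes)

composite index: every indexed key must be used, with equality matching only; needs no index backend; supports uniqueness; sorting happens in memory and is expensive
mixed index: any subset of the indexed keys can trigger the index; supports range queries, full-text search, geo search, and so on; requires an index backend; does not support uniqueness; sorting is efficient when the sort key is indexed, otherwise it also happens in memory

vertex-centric index --> JanusGraph adds this index for every property by default; an index over several keys is usable when the query satisfies the leftmost-prefix rule; it speeds up querying the edges of a vertex (when the vertex has many edges)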

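I: Indexing for Better Performance

JanusGraph supports two kinds of indexes: graph indexes and vertex-centric indexes. Graph indexes are typically used to look up vertices or edges by their properties. Vertex-centric indexes make graph traversals efficient, especially when a vertex has many incident edges.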

II: Graph Index

A graph index is a global index structure over the entire graph that lets users efficiently retrieve vertices or edges by property. For example:

g.V().has('name','hercules')
g.E().has('reason', textContains('loves'))

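Both queries above look up vertices or edges by property. Without an index, they require a full scan of the graph, which is unacceptable on large graphs.

JanusGraph supports two kinds of graph index: composite indexes and mixed indexes. Composite indexes are very fast and efficient, but they only support equality lookups on one specific, predefined combination of property keys. Mixed indexes can answer queries on any combination of the indexed keys and support condition predicates beyond equality, depending on the index backend.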

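Both kinds of index are created through JanusGraph's management API:

JanusGraphManagement.buildIndex(String, Class) // only returns an IndexBuilder; the index itself is created by calling addKey() and then buildCompositeIndex() or buildMixedIndex() on that object (vertex-centric indexes are created separately with JanusGraphManagement.buildEdgeIndex())

The first argument is the index name, which must be unique, and the second is the element class to index (for example Vertex.class). An index built over property keys that were created in the same management transaction is available immediately. Otherwise a reindex procedure must be run to synchronize the index with the existing data, and the index is unavailable until that procedure completes. Defining indexes together with the schema at initialization time is recommended.

Note: without an index, a query falls back to a full graph scan, which performs very poorly. Setting the force-index configuration option (query.force-index) forbids full graph scans.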

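1: Composite Index

A composite index retrieves vertices or edges by one fixed combination of one or more keys. In other words, the set of query conditions is fixed by the index definition.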

// Never create an index while a transaction is active on the graph (it may cause a deadlock)
graph.tx().rollback()
mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
age = mgmt.getPropertyKey('age')
// Composite index for looking up vertices by name
mgmt.buildIndex('byNameComposite', Vertex.class).addKey(name).buildCompositeIndex()
// Composite index for looking up vertices by name and age
mgmt.buildIndex('byNameAndAgeComposite', Vertex.class).addKey(name).addKey(age).buildCompositeIndex()
mgmt.commit()
// Wait for the indexes to become available
ManagementSystem.awaitGraphIndexStatus(graph, 'byNameComposite').call()
ManagementSystem.awaitGraphIndexStatus(graph, 'byNameAndAgeComposite').call()
// Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("byNameComposite"), SchemaAction.REINDEX).get()
mgmt.updateIndex(mgmt.getGraphIndex("byNameAndAgeComposite"), SchemaAction.REINDEX).get()
mgmt.commit()

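Note that a composite index is only used when the query matches it exactly: every key in the index must appear in the query with an equality condition. With the indexes above, g.V().has('name', 'hercules') and g.V().has('age',30).has('name','hercules') can both use an index, but g.V().has('age',30) cannot, because no index is defined on age alone. g.V().has('name','hercules').has('age',inside(20,50)) cannot use byNameAndAgeComposite either, because composite indexes support only equality matches and not range conditions; at best the name condition is answered through byNameComposite and the age condition is filtered in memory.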
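The exact-match rule can be pictured with a small Python toy model. This is only a sketch for intuition, not JanusGraph's storage layout: the class name `CompositeIndex` and its methods are made up for this example, and the index is modeled as a hash map keyed by the full tuple of indexed values.

```python
from collections import defaultdict


class CompositeIndex:
    """Toy model: a hash map from the full tuple of indexed values to element ids."""

    def __init__(self, keys):
        self.keys = tuple(keys)
        self.entries = defaultdict(set)

    def add(self, element_id, properties):
        # Only elements that carry every indexed key get an index entry.
        if all(k in properties for k in self.keys):
            self.entries[tuple(properties[k] for k in self.keys)].add(element_id)

    def lookup(self, equalities):
        """Return matching ids, or None when the index cannot answer the query."""
        # Usable only if every indexed key has an equality condition.
        if not all(k in equalities for k in self.keys):
            return None
        return set(self.entries.get(tuple(equalities[k] for k in self.keys), set()))


by_name = CompositeIndex(["name"])
by_name_and_age = CompositeIndex(["name", "age"])
vertices = {1: {"name": "hercules", "age": 30}, 2: {"name": "jupiter", "age": 5000}}
for vid, props in vertices.items():
    by_name.add(vid, props)
    by_name_and_age.add(vid, props)

print(by_name.lookup({"name": "hercules"}))                     # answered by the index
print(by_name_and_age.lookup({"name": "hercules", "age": 30}))  # answered by the index
print(by_name_and_age.lookup({"age": 30}))                      # None: no value for name
```

Because the lookup key is the whole value tuple, a missing key or a range condition leaves nothing to hash, which is why such queries fall back to a scan in this model.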

Index Uniqueness

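A composite index can also enforce property uniqueness in the graph. If a composite graph index is defined as unique(), at most one vertex or edge may exist for any given combination of values of the indexed keys.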

graph.tx().rollback() //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
mgmt.buildIndex('byNameUnique', Vertex.class).addKey(name).unique().buildCompositeIndex()
mgmt.commit()
//Wait for the index to become available
ManagementSystem.awaitGraphIndexStatus(graph, 'byNameUnique').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("byNameUnique"), SchemaAction.REINDEX).get()
mgmt.commit()

Note: on an eventually consistent storage backend, the consistency of the index must be explicitly set to enable locking.
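The effect of unique() can be sketched with a single-process Python toy. The names `UniqueCompositeIndex` and `UniqueIndexViolation` are hypothetical, and the sketch deliberately ignores the locking that a real eventually consistent backend needs to make the check safe under concurrent writers.

```python
class UniqueIndexViolation(Exception):
    """Raised when a second element claims an already indexed value combination."""


class UniqueCompositeIndex:
    """Toy model of a unique composite index: one owner per value tuple."""

    def __init__(self, keys):
        self.keys = tuple(keys)
        self.owner = {}  # value tuple -> id of the single element allowed to hold it

    def add(self, element_id, properties):
        if not all(k in properties for k in self.keys):
            return  # the element is not covered by this index
        values = tuple(properties[k] for k in self.keys)
        holder = self.owner.setdefault(values, element_id)
        if holder != element_id:
            raise UniqueIndexViolation(
                f"{dict(zip(self.keys, values))} is already used by element {holder}"
            )


by_name_unique = UniqueCompositeIndex(["name"])
by_name_unique.add(1, {"name": "hercules"})
by_name_unique.add(2, {"name": "jupiter"})
try:
    by_name_unique.add(3, {"name": "hercules"})
except UniqueIndexViolation as err:
    print("rejected:", err)
```

Without a lock around the check-then-insert step, two writers could both pass the check on different replicas, which is the situation the note about locking addresses.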

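2: Mixed Index

A mixed index retrieves vertices or edges by any combination of the keys it contains. Mixed indexes are more flexible to use and support condition predicates beyond equality, such as range queries. On the other hand, they are slower than composite indexes.

Unlike composite indexes, mixed indexes require a configured index backend. JanusGraph can support multiple index backends in a single installation, and each one must be uniquely identified by a name in the JanusGraph configuration, called the indexing backend name.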

graph.tx().rollback() //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
age = mgmt.getPropertyKey('age')
mgmt.buildIndex('nameAndAge', Vertex.class).addKey(name).addKey(age).buildMixedIndex("search")
mgmt.commit()
//Wait for the index to become available
ManagementSystem.awaitGraphIndexStatus(graph, 'nameAndAge').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("nameAndAge"), SchemaAction.REINDEX).get()
mgmt.commit()

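The code above defines an index named nameAndAge over the name and age keys and assigns it to the index backend "search", which corresponds to the index.search.backend entry in the configuration file. If the backend were named solrsearch, the configuration would need an index.solrsearch.backend entry instead.

The following shows how to declare text search explicitly as the search behavior for a key (the TEXT mapping applies to string keys, so only name receives it here):

mgmt.buildIndex('nameAndAge', Vertex.class).addKey(name, Mapping.TEXT.asParameter()).addKey(age).buildMixedIndex("search")

For more detail, see Chapter 21, Index Parameters and Full-Text Search.

Queries on a mixed index may use range conditions and any combination of the indexed keys (any subset of the indexed keys can trigger the index), not only equality: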

g.V().has('name', textContains('hercules')).has('age', inside(20,50))
g.V().has('name', textContains('hercules'))
g.V().has('age', lt(50))

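Mixed indexes support full-text search, range search, geo search, and other predicates; see Chapter 20, Search Predicates and Data Types.

Note: unlike composite indexes, mixed indexes do not support uniqueness.

Adding Property Keys

Property keys can be added to an existing mixed index, after which they can be used in query conditions.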

graph.tx().rollback() //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
// Create a new property key
location = mgmt.makePropertyKey('location').dataType(Geoshape.class).make()
nameAndAge = mgmt.getGraphIndex('nameAndAge')
// Add the new key to the existing index
mgmt.addIndexKey(nameAndAge, location)
mgmt.commit()
//Wait for the index to become available
ManagementSystem.awaitGraphIndexStatus(graph, 'nameAndAge').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("nameAndAge"), SchemaAction.REINDEX).get()
mgmt.commit()

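If the added key was created in the same transaction, it can be used in queries immediately. If the property key was already in use, a reindex procedure must be run so that the index contains all existing data; the key cannot be queried through the index until that procedure completes.

Mapping Parameters

When a property key is added to a mixed index (regardless of how the index was created), a set of parameters can be given to control how the property values are stored in the index backend. See the mapping parameters overview section.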

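3: Ordering

The order in which a graph query returns its results can be set with order().by(), which takes two arguments:

the name of the property key to order by
the sort order, incr (ascending) or decr (descending); newer TinkerPop releases name these asc and desc

For example:

g.V().has('name', textContains('hercules')).order().by('age', decr).limit(10)

This returns the first 10 vertices whose name contains 'hercules', sorted by age in descending order.

When using order(), keep the following in mind:

Composite graph indexes do not natively support ordering of results. All matching elements are loaded into memory and sorted there, which is very expensive for large result sets.
Mixed graph indexes support ordering natively, but the property key used for ordering must have been added to the mixed index beforehand. If the ordering key is not part of the index, the whole result set is loaded into memory to be sorted.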
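The cost difference between the two cases can be illustrated with a short Python sketch. It is not JanusGraph code: the `counting` helper and the sample ages are made up, and a pre-sorted list stands in for an index backend that can return results already ordered.

```python
from itertools import islice


def counting(iterable, counter):
    """Yield items while recording how many were actually read."""
    for item in iterable:
        counter["read"] += 1
        yield item


ages = [35, 22, 47, 18, 41, 29]  # hypothetical ages of the matching vertices

# No usable order in the index: every match is read, then sorted in memory.
unsorted_reads = {"read": 0}
top_in_memory = sorted(counting(ages, unsorted_reads), reverse=True)[:2]

# Results already ordered by age descending: reading stops at the limit.
sorted_reads = {"read": 0}
index_order = sorted(ages, reverse=True)  # stands in for the backend's ordering
top_from_index = list(islice(counting(index_order, sorted_reads), 2))

print(top_in_memory, unsorted_reads["read"])
print(top_from_index, sorted_reads["read"])
```

Both variants return the same two values, but the in-memory variant has to read all six matches while the ordered variant reads only as many as limit() asks for.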

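4: Label Constraint

Sometimes an index should cover only the vertices or edges that carry a particular label rather than every element in the graph. For example, to index only vertices labeled god, use the indexOnly method, which restricts an index to the vertices or edges with a given label: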

graph.tx().rollback() //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
god = mgmt.getVertexLabel('god')
// Index only the vertices that carry the god label
mgmt.buildIndex('byNameAndLabel', Vertex.class).addKey(name).indexOnly(god).buildCompositeIndex()
mgmt.commit()
//Wait for the index to become available
ManagementSystem.awaitGraphIndexStatus(graph, 'byNameAndLabel').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("byNameAndLabel"), SchemaAction.REINDEX).get()
mgmt.commit()

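Label constraints work the same way for mixed indexes. When a composite index with a label constraint is defined as unique, the uniqueness constraint applies only to properties on vertices or edges with that label.

5: Composite Index versus Mixed Index

1. Use a composite index for exact-match lookups. Composite indexes need no external index system and usually perform better.

As an exception, if the number of distinct values to match is small (such as the 12 months of a year) or a single value is associated with many elements in the graph, use a mixed index instead.

2. Use a mixed index for range, full-text, and geo queries. A mixed index can also speed up order().by().

III: Vertex-centric Indexes

A vertex-centric index is a local index structure built individually for each vertex. In large graphs a vertex can have thousands of incident edges, and traversing through such vertices is slow because all incident edges must be retrieved and then filtered in memory. A vertex-centric index speeds up these traversals by using the local index structure to retrieve only the matching edges. An index over several keys can only be used according to the leftmost-prefix rule.

For example: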

h = g.V().has('name','hercules').next()
g.V(h).outE('battled').has('time', inside(10,20)).inV()

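Without a vertex-centric index, this query has to retrieve all battled edges of the vertex and filter them to find the matching ones, which is very slow when the vertex has a large number of edges.

Building a vertex-centric index speeds up the query: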

graph.tx().rollback() //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
// Look up the property key
time = mgmt.getPropertyKey('time')
// Look up the edge label
battled = mgmt.getEdgeLabel('battled')
// Create the vertex-centric index
mgmt.buildEdgeIndex(battled, 'battlesByTime', Direction.BOTH, Order.decr, time)
mgmt.commit()
//Wait for the index to become available (a relation index, so the edge label is required)
ManagementSystem.awaitRelationIndexStatus(graph, 'battlesByTime', 'battled').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getRelationIndex(battled, "battlesByTime"), SchemaAction.REINDEX).get()
mgmt.commit()

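The code above builds a vertex-centric index on battled edges in both directions, sorted by time in descending order. The first argument of buildEdgeIndex() is the label of the edges to index, the second is the index name, and the third is the edge direction. BOTH means the index serves traversals in either the IN or the OUT direction; restricting the index to a single direction halves the storage and maintenance cost. The last two arguments are the sort order of the index and the property keys to index by. Several property keys may be given, and the order defaults to ascending (Order.ASC).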
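Why a sorted per-vertex structure helps can be sketched in Python. This is a toy model under stated assumptions, not JanusGraph's on-disk format: `ToyEdgeIndex` is a hypothetical name, edge ids are plain integers, one edge label is assumed, and the list is kept in ascending order for simplicity although the index above is descending.

```python
from bisect import bisect_left, bisect_right, insort


class ToyEdgeIndex:
    """Toy model: per vertex, edges of one label kept sorted by one sort key."""

    def __init__(self):
        self.entries = {}  # vertex id -> sorted list of (sort value, edge id)

    def add(self, vertex_id, sort_value, edge_id):
        insort(self.entries.setdefault(vertex_id, []), (sort_value, edge_id))

    def inside(self, vertex_id, low, high):
        """Edge ids with low < sort value < high, like Gremlin's inside(low, high)."""
        edges = self.entries.get(vertex_id, [])
        # Binary search finds both ends of the range without touching other edges.
        start = bisect_right(edges, (low, float("inf")))
        end = bisect_left(edges, (high, float("-inf")))
        return [edge_id for _, edge_id in edges[start:end]]


index = ToyEdgeIndex()
for edge_id, time in enumerate([1, 10, 12, 15, 20, 25]):  # made-up battle times
    index.add("hercules", time, edge_id)

print(index.inside("hercules", 10, 20))  # ids of the edges with time 12 and 15
```

The range lookup costs two binary searches plus the size of the result, whereas the unindexed traversal has to inspect every battled edge of the vertex.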

graph.tx().rollback() //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
time = mgmt.getPropertyKey('time')
rating = mgmt.makePropertyKey('rating').dataType(Double.class).make()
battled = mgmt.getEdgeLabel('battled')
mgmt.buildEdgeIndex(battled, 'battlesByRatingAndTime', Direction.OUT, Order.decr, rating, time)
mgmt.commit()
//Wait for the index to become available
ManagementSystem.awaitRelationIndexStatus(graph, 'battlesByRatingAndTime', 'battled').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getRelationIndex(battled, 'battlesByRatingAndTime'), SchemaAction.REINDEX).get()
mgmt.commit()

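The code above builds the battlesByRatingAndTime index over rating and time. The order of the property keys in the index matters: a query can only use the index through keys taken in the order in which they were defined (the leftmost-prefix rule).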

h = g.V().has('name','hercules').next()
g.V(h).outE('battled').property('rating',5.0) //Add some rating properties
// Query 1
g.V(h).outE('battled').has('rating', gt(3.0)).inV()
// Query 2
g.V(h).outE('battled').has('rating',5.0).has('time', inside(10,50)).inV()
// Query 3
g.V(h).outE('battled').has('time', inside(10,50)).inV()

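Of the queries above, only queries 1 and 2 can use the index. Query 3 constrains only time and cannot match an index that is ordered first by rating and then by time. Several indexes can be built on the same label to support different traversals, and JanusGraph automatically picks the most efficient one. Vertex-centric indexes support only equality and range/interval conditions.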
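The leftmost-prefix rule applied to the three queries can be written down as a few lines of Python. This is a simplified model of the rule as described here, not JanusGraph's actual query optimizer; `usable_prefix` is a hypothetical helper.

```python
def usable_prefix(index_keys, equalities, ranges):
    """Return the leading index keys a query can use under the leftmost-prefix rule.

    Equality conditions consume keys from the left; a single range condition may
    follow them, and nothing after a range (or after a gap) can use the index.
    """
    used = []
    for key in index_keys:
        if key in equalities:
            used.append(key)
        elif key in ranges:
            used.append(key)
            break  # keys after a range condition are no longer in sorted order
        else:
            break  # gap: the next index key is not constrained at all
    return used


KEYS = ["rating", "time"]  # key order of battlesByRatingAndTime

print(usable_prefix(KEYS, set(), {"rating"}))     # query 1: range on rating
print(usable_prefix(KEYS, {"rating"}, {"time"}))  # query 2: rating = 5.0, range on time
print(usable_prefix(KEYS, set(), {"time"}))       # query 3: only time is constrained
```

An empty result means the index cannot help at all, which is the case for query 3 because its only condition is on the second key.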

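Note: the property keys used in a vertex-centric index must have an explicitly defined data type (that is, not Object.class) that supports a native sort order. For floating-point data, JanusGraph's Decimal or Precision data types must be used.

An index built over an edge label created in the same management transaction is available immediately. If the edge label is already in use, a reindex procedure must be run, and the index is unavailable until that procedure completes.

~~Note: JanusGraph automatically builds a vertex-centric index for every property key of every edge label, so queries stay efficient even with thousands of edges.~~

A vertex-centric index cannot speed up unconstrained traversals, which have to visit all incident edges of a vertex. Such traversals slow down as the number of edges grows; they can often be rewritten as constrained traversals to improve performance.

IV: Ordering Traversals

The queries below use local() and limit() to retrieve a sorted subset of the edges reached during a traversal. local() applies the enclosed steps separately to each incoming element, so an ordering inside local() sorts the edges of each vertex individually, not the edges of all vertices together.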

h = g.V().has('name','hercules').next()
g.V(h).local(outE('battled').order().by('time', decr).limit(10)).inV().values('name')
g.V(h).local(outE('battled').has('rating',5.0).order().by('time', decr).limit(10)).values('place')

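When the sort key and sort order match a vertex-centric index, the queries above are very efficient.

The first query asks for the names of the 10 monsters Hercules battled most recently. The second asks for the places of the 10 most recent battles rated 5 stars. Both queries limit the number of results returned.

Vertex-centric indexes also apply to this kind of query when the order key and sort order match those defined for the index. The battlesByTime index serves the first query and battlesByRatingAndTime serves the second. Note that battlesByRatingAndTime cannot serve the first query: it requires an equality condition on rating, the first key of that index, and only the second query has one.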
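The per-vertex semantics of local() described earlier in this section can be contrasted with a global ordering in plain Python. The data below is made up for illustration (including the second vertex, theseus), and `local_top` and `global_top` are hypothetical helpers rather than Gremlin steps.

```python
# Hypothetical battled edges per vertex: (opponent, time)
battles = {
    "hercules": [("nemean", 1), ("hydra", 9), ("cerberus", 12)],
    "theseus": [("minotaur", 7), ("procrustes", 5)],
}


def local_top(edges_by_vertex, n):
    """Like local(outE().order().by('time', decr).limit(n)): top n per vertex."""
    return {
        vertex: sorted(edges, key=lambda e: e[1], reverse=True)[:n]
        for vertex, edges in edges_by_vertex.items()
    }


def global_top(edges_by_vertex, n):
    """Like outE().order().by('time', decr).limit(n): top n across all vertices."""
    all_edges = [e for edges in edges_by_vertex.values() for e in edges]
    return sorted(all_edges, key=lambda e: e[1], reverse=True)[:n]


print(local_top(battles, 1))   # one most recent battle for each vertex
print(global_top(battles, 2))  # the two most recent battles overall
```

With the per-vertex variant every vertex contributes its own most recent battle, while the global variant can leave a vertex out entirely, as happens to theseus here.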

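Note: ordered vertex queries are a JanusGraph extension to Gremlin, which is why the syntax is verbose and a _() step is needed to turn the JanusGraph result back into a Gremlin pipeline.

This article is part of the Tencent Cloud self-media sharing program and was shared from the author's personal site/blog.

Originally published: 2022-04-30. In case of infringement, contact cloudcommunity@tencent.com for removal.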
