怎样将 MySQL 数据表导入到 Elasticsearch

本文节选自《Netkiller Database 手札》

MySQL 导入 Elasticsearch 的方法有很多,通常是使用ETL工具,但我觉得太麻烦。于是想到 logstash 。

23.8. Migrating MySQL Data into Elasticsearch using logstash

23.8.1. 安装 logstash

安装 JDBC 驱动 和 Logstash

curl -s https://raw.githubusercontent.com/oscm/shell/master/database/mysql/5.7/mysql-connector-java.sh	 | bash			
curl -s https://raw.githubusercontent.com/oscm/shell/master/log/kibana/logstash-5.x.sh | bash			

mysql 驱动文件位置在 /usr/share/java/mysql-connector-java.jar

23.8.2. 配置 logstash

创建配置文件 /etc/logstash/conf.d/jdbc-mysql.conf

			mysql> desc article;
+-------------+--------------+------+-----+---------+-------+
| Field       | Type         | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+-------+
| id          | int(11)      | NO   |     | 0       |       |
| title       | mediumtext   | NO   |     | NULL    |       |
| description | mediumtext   | YES  |     | NULL    |       |
| author      | varchar(100) | YES  |     | NULL    |       |
| source      | varchar(100) | YES  |     | NULL    |       |
| ctime       | datetime     | NO   |     | NULL    |       |
| content     | longtext     | YES  |     | NULL    |       |
+-------------+--------------+------+-----+---------+-------+
7 rows in set (0.00 sec)			
			input {
  jdbc {
    jdbc_driver_library => "/usr/share/java/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/cms"
    jdbc_user => "cms"
    jdbc_password => "password"
    schedule => "* * * * *"
    statement => "select * from article"
  }
}
output {
    elasticsearch {
    		hosts => "localhost:9200"
        index => "information"
        document_type => "article"
        document_id => "%{id}"
        
    }
}			

23.8.3. 启动 Logstash

			root@netkiller /var/log/logstash % systemctl restart logstash

root@netkiller /var/log/logstash % systemctl status logstash
● logstash.service - logstash
   Loaded: loaded (/etc/systemd/system/logstash.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-07-31 09:35:00 CST; 11s ago
 Main PID: 10434 (java)
   CGroup: /system.slice/logstash.service
           └─10434 /usr/bin/java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+DisableExplicitGC -Djava.awt.headless=true -Dfi...

Jul 31 09:35:00 VM_3_2_centos systemd[1]: Started logstash.
Jul 31 09:35:00 VM_3_2_centos systemd[1]: Starting logstash...
			
root@netkiller /var/log/logstash % cat logstash-plain.log 
[2017-07-31T09:35:28,169][INFO ][logstash.outputs.elasticsearch] Elasticsearch pool URLs updated {:changes=>{:removed=>[], :added=>[http://localhost:9200/]}}
[2017-07-31T09:35:28,172][INFO ][logstash.outputs.elasticsearch] Running health check to see if an Elasticsearch connection is working {:healthcheck_url=>http://localhost:9200/, :path=>"/"}
[2017-07-31T09:35:28,298][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>#<Java::JavaNet::URI:0x453a18e9>}
[2017-07-31T09:35:28,299][INFO ][logstash.outputs.elasticsearch] Using mapping template from {:path=>nil}
[2017-07-31T09:35:28,337][INFO ][logstash.outputs.elasticsearch] Attempting to install template {:manage_template=>{"template"=>"logstash-*", "version"=>50001, "settings"=>{"index.refresh_interval"=>"5s"}, "mappings"=>{"_default_"=>{"_all"=>{"enabled"=>true, "norms"=>false}, "dynamic_templates"=>[{"message_field"=>{"path_match"=>"message", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false}}}, {"string_fields"=>{"match"=>"*", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false, "fields"=>{"keyword"=>{"type"=>"keyword", "ignore_above"=>256}}}}}], "properties"=>{"@timestamp"=>{"type"=>"date", "include_in_all"=>false}, "@version"=>{"type"=>"keyword", "include_in_all"=>false}, "geoip"=>{"dynamic"=>true, "properties"=>{"ip"=>{"type"=>"ip"}, "location"=>{"type"=>"geo_point"}, "latitude"=>{"type"=>"half_float"}, "longitude"=>{"type"=>"half_float"}}}}}}}}
[2017-07-31T09:35:28,344][INFO ][logstash.outputs.elasticsearch] Installing elasticsearch template to _template/logstash
[2017-07-31T09:35:28,465][INFO ][logstash.outputs.elasticsearch] New Elasticsearch output {:class=>"LogStash::Outputs::ElasticSearch", :hosts=>[#<Java::JavaNet::URI:0x66df34ae>]}
[2017-07-31T09:35:28,483][INFO ][logstash.pipeline        ] Starting pipeline {"id"=>"main", "pipeline.workers"=>8, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>5, "pipeline.max_inflight"=>1000}
[2017-07-31T09:35:29,562][INFO ][logstash.pipeline        ] Pipeline main started
[2017-07-31T09:35:29,700][INFO ][logstash.agent           ] Successfully started Logstash API endpoint {:port=>9600}
[2017-07-31T09:36:01,019][INFO ][logstash.inputs.jdbc     ] (0.006000s) select * from article	

23.8.4. 验证

			% curl -XGET 'http://localhost:9200/_all/_search?pretty'			

23.8.5. 配置模板

23.8.5.1. 全量导入

适合数据没有改变的归档数据或者只能增加没有修改的数据

				input {
  jdbc {
    jdbc_driver_library => "/usr/share/java/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/cms"
    jdbc_user => "cms"
    jdbc_password => "password"
    schedule => "* * * * *"
    statement => "select * from article"
  }
}
output {
    elasticsearch {
    		hosts => "localhost:9200"
        index => "information"
        document_type => "article"
        document_id => "%{id}"
        
    }
}				

23.8.5.2. 多表导入

多张数据表导入到 Elasticsearch

				# multiple inputs on logstash jdbc

input {
  jdbc {
    jdbc_driver_library => "/usr/share/java/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/cms"
    jdbc_user => "cms"
    jdbc_password => "password"
    schedule => "* * * * *"
    statement => "select * from article"
    type => "article"
  }
  jdbc {
    jdbc_driver_library => "/usr/share/java/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/cms"
    jdbc_user => "cms"
    jdbc_password => "password"
    schedule => "* * * * *"
    statement => "select * from comment"
    type => "comment"
  } 
}
output {
    elasticsearch {
    		hosts => "localhost:9200"
        index => "information"
        document_type => "%{type}"
        document_id => "%{id}"
        
    }
}				

需要在每一个jdbc配置项中加入 type 配置,然后 elasticsearch 配置项中加入 document_type => "%{type}"

23.8.5.3. 通过 ID 主键字段增量复制数据

				input {
  jdbc {
    statement => "SELECT id, mycolumn1, mycolumn2 FROM my_table WHERE id > :sql_last_value"
    use_column_value => true
    tracking_column => "id"
    tracking_column_type => "numeric"
    # ... other configuration bits
  }
}				

tracking_column_type => "numeric" 可以声明 id 字段的数据类型, 如果不指定将会默认为日期

[2017-07-31T11:08:00,193][INFO ][logstash.inputs.jdbc     ] (0.020000s) select * from article where id > '2017-07-31 02:47:00'				

如果复制不对称可以加入 clean_run => true 配置项,清楚数据

23.8.5.4. 通过日期字段增量复制数据

				input {
  jdbc {
    statement => "SELECT * FROM my_table WHERE create_date > :sql_last_value"
    use_column_value => true
    tracking_column => "create_date"
    # ... other configuration bits
  }
}				

如果复制不对称可以加入 clean_run => true 配置项,清楚数据

23.8.5.5. 指定SQL文件

statement_filepath 指定 SQL 文件,有时SQL太复杂写入 statement 配置项维护部方便,可以将 SQL 写入一个文本文件,然后使用 statement_filepath 配置项引用该文件。

				input {
    jdbc {
        jdbc_driver_library => "/path/to/driver.jar"
        jdbc_driver_class => "org.postgresql.Driver"
        jdbc_url => "jdbc://postgresql"
        jdbc_user => "neo"
        jdbc_password => "password"
        statement_filepath => "query.sql"
    }
}				

23.8.5.6. 参数传递

将需要复制的条件参数写入 parameters 配置项

				input {
  jdbc {
    jdbc_driver_library => "mysql-connector-java-5.1.36-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"
    jdbc_user => "mysql"
    parameters => { "favorite_artist" => "Beethoven" }
    schedule => "* * * * *"
    statement => "SELECT * from songs where artist = :favorite_artist"
  }
}				

23.8.5.7. 控制返回JDBC数据量

	jdbc_fetch_size => 1000  #jdbc获取数据的数量大小
	jdbc_page_size => 1000 #jdbc一页的大小,
	jdbc_paging_enabled => true  #和jdbc_page_size组合,将statement的查询分解成多个查询,相当于: SELECT * FROM table LIMIT 1000 OFFSET 4000 				

23.8.6. example

			input {
  jdbc {
    jdbc_driver_library => "/usr/share/java/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/cms"
    jdbc_user => "cms"
    jdbc_password => "password"
    schedule => "* * * * *"	#定时cron的表达式,这里是每分钟执行一次
    statement => "select id, title, description, author, source, ctime, content from article where id > :sql_last_value"
    use_column_value => true
    tracking_column => "id"
    tracking_column_type => "numeric" 
    record_last_run => true
    last_run_metadata_path => "/var/tmp/article.last"
  }
}
output {
    elasticsearch {
    		hosts => "localhost:9200"
        index => "information"
        document_type => "article"
        document_id => "%{id}"
        action => "update"  # 操作执行的动作,可选值有["index", "delete", "create", "update"]
        doc_as_upsert => true  #支持update模式
    }
}

原文发布于微信公众号 - Netkiller(netkiller-ebook)

原文发表时间:2017-07-31

本文参与腾讯云自媒体分享计划,欢迎正在阅读的你也加入,一起分享。

发表于

我来说两句

0 条评论
登录 后参与评论

相关文章

来自专栏狂码一生

利用ajaxFileUpload.js实现多文件异步上传功能

  AjaxFileUpload.js是网络开发者写好的插件放出来供大家使用用,原理都是创建隐藏的表单和iframe然后用JS去提交,获得返回值。在这里我将网络...

43313
来自专栏CRPER折腾记

Angular 2 + 折腾记 :(2)初步认识angular2,不一样的开发模式

想来想去,概念这些东西不怎么想讲,更多的是想讲点实战性的内容。 所以有些东西跳过去了,小伙伴们请去看官方文档哈;跳跃性的前进,写的不好多包涵。。。

772
来自专栏orientlu

Google 单元测试框架

到 github 拉取代码或者下载某个版本的 zip 包到本地目录,参考 gtest 中的 README.md 如何编译库和编译自己的代码,下面简单介绍下编译方...

402
来自专栏芋道源码1024

数据库中间件 MyCAT源码分析:【单库单表】插入

本文主要基于 MyCAT 1.6.5 正式版 1. 概述 2. 接收请求,解析 SQL 3. 获得路由结果 4. 获得 MySQL 连接,执行 SQL 5. 响...

44612
来自专栏hotqin888的专栏

beego利用casbin进行权限管理——第五节 策略更新(续)

版权声明:本文为博主原创文章,未经博主允许不得转载。 https://blog.csdn.net/hotqin888/article/det...

561
来自专栏Python爬虫与数据挖掘

如何利用BeautifulSoup选择器抓取京东网商品信息

昨天小编利用Python正则表达式爬取了京东网商品信息,看过代码的小伙伴们基本上都坐不住了,辣么多的规则和辣么长的代码,悲伤辣么大,实在是受不鸟了。不过小伙伴们...

482
来自专栏Danny的专栏

在EasyUI的DataGrid中嵌入Combobox

版权声明:本文为博主原创文章,未经博主允许不得转载。 https://blog.csdn.net/huyuyang6688/article/...

623
来自专栏cloudskyme

使用jquery-easyui写的CRUD插件(2)

首先定义变量 var options = jQuery.extend({},jQuery.fn.crudUIGrid.defaults, options); ...

3325
来自专栏乐沙弥的世界

Linux/Unix shell sql 之间传递变量

       灵活结合Linux/Unix Shell 与SQL 之间的变量传输,极大程度的提高了DBA的工作效率,本文针对Linux/Unix shell s...

563
来自专栏编程之旅

从Yii2的源码来分析框架的QueryParamAuth的鉴权过程

Yii是基于PHP语言打造的一款框架,了解PHP的同学对这款框架肯定也不会陌生。而我在最近使用yii2写App接口的时,查看官方了的RESTful Web服务文...

632

扫描关注云+社区