Learning Regular Expressions - Extracting and Replacing XML Tags

用户1148526

Published 2023-10-14 09:49:09

Included in the column: Hadoop数据仓库

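I. Requirements

Use lorem.dita as the sample XML document: extract every XML tag from it with regular expressions and convert the result into a simple XSLT stylesheet. The lorem.dita file is available on GitHub at https://github.com/michaeljamesfitzgerald/Introducing-Regular-Expressions. To save space, only an excerpt of the file is used as test data.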

II. Implementation

1. Insert the test data

drop table if exists t1;
create table t1 (a text);
insert into t1 values
('<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">
<topic id="lorem">
 <title>Lorem Ipsum</title>
  <body>
   <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras non commodo mi. </p>
   <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit:</p>
    <ul>
     <li>Lorem ipsum dolor sit amet</li>
     <li>Lorem ipsum dolor sit amet</li>
     <li>Lorem ipsum dolor sit amet</li>
    </ul>
   <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. </p>
   <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. </p>
  </body>
</topic>'
);

2. Extract and replace the tags with a SQL query

with 
t1 as                          -- extract, deduplicate, and sort all tags
(
with recursive num as 
(select n, regexp_substr(a,'<[_a-zA-Z][^>]*>',1,t.n) b from t1,(select 1 n) t
  union all
 select n+1, regexp_substr(a,'<[_a-zA-Z][^>]*>',1, n + 1) from t1, num 
  where b is not null)
select replace(convert(group_concat(distinct b order by b) using utf8mb4),',',char(10)) a from num),
t2 as                          -- strip tag attributes
(select regexp_replace(a,' id=".*"','') a from t1),
t3 as                          -- wrap each tag in constant strings
(select regexp_replace(a,'^<(.*)>$','<xsl:template match="$1">
 <xsl:apply-templates/>
</xsl:template>
',1,0,'m') a from t2),
t4 as                          -- add the header and footer strings
(select regexp_replace(a,'^(.*)$','<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">\n\n$1\n</xsl:stylesheet>',1,0,'n') a from t3)
select * from t4;

The query returns:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="body">
 <xsl:apply-templates/>
</xsl:template>

<xsl:template match="li">
 <xsl:apply-templates/>
</xsl:template>

<xsl:template match="p">
 <xsl:apply-templates/>
</xsl:template>

<xsl:template match="title">
 <xsl:apply-templates/>
</xsl:template>

<xsl:template match="topic">
 <xsl:apply-templates/>
</xsl:template>

<xsl:template match="ul">
 <xsl:apply-templates/>
</xsl:template>

</xsl:stylesheet>

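III. Analysis

The implementation chains common table expressions (used here as inline views), runs a recursive query, and calls the regexp_substr and regexp_replace functions to extract and replace the tags.

1. Extract all XML tags from the text

(1) Write a regular expression that matches tags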

<[_a-zA-Z][^>]*>

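The first character is a left angle bracket (<).
In XML, an element name may start with an underscore (_) or an upper- or lowercase ASCII letter. XML also permits other start characters, such as a colon and many non-ASCII letters, which this pattern does not cover.
After the start character, the tag may contain zero or more characters of any kind other than a right angle bracket (>); this covers the rest of the name and any attributes.
The expression ends with a right angle bracket (>).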
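To make the rules above concrete, the following sketch probes the pattern against a few one-line inputs. The sample strings are hypothetical and chosen only for illustration; regexp_like is assumed to be available (MySQL 8.0 or later).

```sql
-- Probe the tag pattern against a few hypothetical one-line inputs.
select s,
       regexp_like(s, '<[_a-zA-Z][^>]*>') as matches_tag
from (
  select '<topic id="lorem">' as s          -- start tag with an attribute
  union all select '<p>'                    -- plain start tag
  union all select '</p>'                   -- end tag: "/" is not in [_a-zA-Z]
  union all select '<?xml version="1.0"?>'  -- declaration: "?" is not in [_a-zA-Z]
) samples;
```

The first two rows should report 1 and the last two 0. Because end tags, the XML declaration, and the DOCTYPE line never match, only start tags reach the later steps.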

(2) Extract all tags with a recursive query

with recursive num as 
(select n, regexp_substr(a,'<[_a-zA-Z][^>]*>',1,t.n) b from t1,(select 1 n) t
  union all
 select n+1, regexp_substr(a,'<[_a-zA-Z][^>]*>',1, n + 1) from t1, num 
  where b is not null)

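MySQL's regexp_substr function returns a match of the regular expression, but only one per call; its fourth argument, occurrence, selects which match to return. To collect every tag, the recursive query passes its counter as the occurrence argument. Recursion stops once regexp_substr returns null. This part of the query returns one row per matched tag, plus a final row holding the null that ends the recursion.

(3) Merge, deduplicate, and sort the tags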
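The same recursion can be tried in isolation. This is a minimal sketch that replaces table t1 with a hypothetical literal, '<a><b><c>', so it runs without any test data.

```sql
with recursive num as (
  -- anchor member: ask for the first occurrence
  select 1 as n,
         regexp_substr('<a><b><c>', '<[_a-zA-Z][^>]*>', 1, 1) as b
  union all
  -- recursive member: ask for occurrence n + 1 while the previous one existed
  select n + 1,
         regexp_substr('<a><b><c>', '<[_a-zA-Z][^>]*>', 1, n + 1)
  from num
  where b is not null
)
-- filter out the trailing row whose b is null (it only ends the recursion)
select n, b from num where b is not null;
```

The original query does not filter the trailing null row; that is harmless there, because group_concat skips null values.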

select replace(convert(group_concat(distinct b order by b) using utf8mb4),',',char(10)) a from num

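group_concat(distinct b order by b) deduplicates and sorts the rows returned by the recursive query, then merges them into a single comma-separated string.
convert(... using utf8mb4) converts the string returned by group_concat to the utf8mb4 character set.
replace swaps the comma separators in the merged string for newline characters (char(10)).

The result of CTE t1 is therefore the full set of start tags, deduplicated, sorted, and separated by newlines.

2. Strip tag attributes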
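One caveat with the comma-then-replace approach is that a comma inside a tag (for example, in an attribute value) would also be turned into a newline. A possible alternative, shown here as a sketch on hypothetical literal rows rather than as the author's method, is to let group_concat emit the newline directly through its separator clause.

```sql
-- Deduplicate, sort, and join with a newline in one step.
select group_concat(distinct b order by b separator '\n') as a
from (
  select '<p>' as b
  union all select '<li>'
  union all select '<p>'      -- duplicate, removed by distinct
  union all select '<body>'
) tags;
```

Either way, group_concat truncates its result at group_concat_max_len (1024 bytes by default), so a document with many distinct tags may need that session variable raised.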

select regexp_replace(a,' id=".*"','') a from t1

The result of CTE t2 is the list of tags with their attributes removed. In this example the only attribute is id.

3. Wrap each tag in constant strings
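The pattern ' id=".*"' is greedy, which is fine when id is the only attribute. The sketch below compares it with a narrower pattern on a hypothetical tag that also carries a class attribute (not present in the test data).

```sql
select
  -- the case from the article: a single id attribute
  regexp_replace('<topic id="lorem">', ' id=".*"', '')              as stripped,
  -- greedy .* runs to the last double quote and removes class as well
  regexp_replace('<topic id="lorem" class="x">', ' id=".*"', '')    as greedy,
  -- [^"]* stops at the closing quote of id, so class survives
  regexp_replace('<topic id="lorem" class="x">', ' id="[^"]*"', '') as narrow;
```

For the conversion in this article the greedy behavior happens to be convenient, since the goal is a bare element name; the narrower form is useful when only one specific attribute should be dropped.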

select regexp_replace(a,'^<(.*)>$','<xsl:template match="$1">
 <xsl:apply-templates/>
</xsl:template>
',1,0,'m') a from t2

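The result of CTE t3 wraps each tag in an XSLT template prefix and suffix. With multiline mode (match type 'm'), the regular expression ^<(.*)>$ matches each line separately and captures the tag name in a group, which $1 references in the replacement.

4. Add the header and footer strings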
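The effect of the 'm' match type can be seen on a small two-line string. This is an illustrative sketch: the input and the bracketed replacement '[$1]' are hypothetical stand-ins for the tag list and the XSLT template text.

```sql
select
  -- with 'm', ^ and $ match at every line boundary, so both lines are rewritten
  regexp_replace('<li>\n<p>', '^<(.*)>$', '[$1]', 1, 0, 'm') as with_m,
  -- without it, ^ and $ anchor only to the whole string, and nothing matches
  regexp_replace('<li>\n<p>', '^<(.*)>$', '[$1]', 1, 0)      as without_m;
```

The fifth argument, 0, asks regexp_replace to replace every occurrence rather than a specific one, which is what lets a single call rewrite all lines.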

select regexp_replace(a,'^(.*)$','<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">\n\n$1\n</xsl:stylesheet>',1,0,'n') a from t3

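The result of CTE t4 adds the opening and closing XSLT stylesheet tags around the output of t3. With dotall mode (match type 'n'), the regular expression ^(.*)$ matches the entire multi-line text and places it in a capture group, which $1 references.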
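A reduced version of this step, using a hypothetical two-line input and the placeholder strings BEGIN and END instead of the stylesheet tags, might look like this:

```sql
-- With 'n', the dot also matches line terminators, so (.*) captures both lines
-- and the whole text is wrapped exactly once.
select regexp_replace('<a>\n<b>', '^(.*)$', 'BEGIN\n$1\nEND', 1, 0, 'n') as wrapped;
```

The \n sequences are turned into real newlines by MySQL's string-literal parsing (assuming the default SQL mode, without NO_BACKSLASH_ESCAPES) before the regular expression engine sees them.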

Originally published: 2023-05-19