I think the <a href="http://www.postgresql.org/docs/current/static/fuzzystrmatch.html" rel="nofollow"><code>fuzzystrmatch</code></a> and/or <a href="http://www.postgresql.org/docs/current/static/pgtrgm.html" rel="nofollow"><code>pg_trgm</code></a> modules are what you're looking for.
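As a rough illustration of what the two modules offer, here is a minimal sketch. It assumes both extensions can be installed in your database, and the literal values are made up for the example.

```sql
-- Both modules ship with PostgreSQL's contrib packages.
create extension if not exists fuzzystrmatch;
create extension if not exists pg_trgm;

select
  -- fuzzystrmatch: edit distance between two strings
  -- (two swapped adjacent digits count as two substitutions)
  levenshtein('123456789', '123465789') as ssn_distance,
  -- fuzzystrmatch: phonetic code, so similar-sounding names can compare equal
  soundex('Smith') = soundex('Smyth')   as same_soundex,
  -- pg_trgm: trigram similarity, from 0 (nothing shared) to 1 (identical)
  similarity('Jonathan', 'Johnathan')   as name_similarity;
```

A small edit distance suits short codes such as an SSN, while trigram similarity tends to suit names and other free text.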

This is called <a href="https://en.wikipedia.org/wiki/Record_linkage" rel="nofollow">Probabilistic Record Linkage</a> (actually it has several names).

The first thing you want to do is standardize each column's values so they are directly comparable. For example, dates should be in ISO format and values should be trimmed of stray whitespace.
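One possible way to do that standardization is a cleaning view over each table. This is only a sketch: the view name and the column names (<code>first_name</code>, <code>ssn</code>, <code>dob</code>, and so on) are assumptions based on the fields listed in the question, and it assumes the raw columns are stored as text.

```sql
-- Hypothetical cleaning view; table and column names are assumptions.
create view needles_clean as
select
  id,
  nullif(lower(trim(first_name)), '')              as first_name,
  nullif(lower(trim(last_name)), '')               as last_name,
  nullif(regexp_replace(ssn, '\D', '', 'g'), '')   as ssn,   -- digits only
  nullif(regexp_replace(phone, '\D', '', 'g'), '') as phone, -- digits only
  nullif(upper(trim(state)), '')                   as state,
  nullif(left(trim(zip), 5), '')                   as zip,   -- drop any ZIP+4 suffix
  cast(nullif(trim(dob), '') as date)              as dob    -- real date type
from needles;
```

Turning empty strings into NULL keeps missing data from ever counting as agreement, because NULL never compares equal to anything.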

<h2>The Easy Way</h2>

Count the number of matching columns:

<pre><code>select
 n.id as needle_id,
 h.id as haystack_id,
 -- one point per agreeing column; some_comparison_function is a placeholder
 -- for whatever boolean comparison suits the column (exact, fuzzy, ...)
 case when n.col1 = h.col1 then 1 else 0 end
 + case when some_comparison_function(n.col2, h.col2) then 1 else 0 end
 + ...
 as relevance
from
 needles n
join
 haystack h -- haystack table could be the same as needles table
on -- only compare rows where at least one column matches
 n.col1 = h.col1
 or some_comparison_function(n.col2, h.col2)
 or ...
order by
 relevance desc;
</code></pre>
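Filled in for the fields from the question, the query above might look like the following sketch. The column names, the edit-distance limit of 2, and the similarity cutoff of 0.6 are all assumptions to tune against your own data, and it assumes the <code>fuzzystrmatch</code> and <code>pg_trgm</code> extensions are installed.

```sql
-- Hypothetical columns: ssn, last_name, dob, zip on both tables.
select
  n.id as needle_id,
  h.id as haystack_id,
    -- tolerate a couple of mistyped or swapped SSN digits
    case when levenshtein(n.ssn, h.ssn) <= 2 then 1 else 0 end
    -- tolerate spelling variations in the last name
  + case when similarity(n.last_name, h.last_name) > 0.6 then 1 else 0 end
  + case when n.dob = h.dob then 1 else 0 end
  + case when n.zip = h.zip then 1 else 0 end
  as relevance
from
  needles n
join
  haystack h
on -- only compare rows that share at least one exact value
  n.ssn = h.ssn
  or n.dob = h.dob
  or (n.last_name = h.last_name and n.zip = h.zip)
order by
  relevance desc;
```

Missing values need no special handling here: a comparison against NULL yields NULL, so the <code>case</code> falls through to 0.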

<h2>The Harder but More Correct Way</h2>

This has been mathematically <a href="http://courses.cs.washington.edu/courses/cse590q/04au/papers/Felligi69.pdf" rel="nofollow">proven</a> to be optimal. It computes the weight of the columns for you based on how rare values are. 

<ol>
<li>Pick the probability that two values that should be equal are recorded differently. For example, two records should have the same SSN, but one of them has a typo. One minus this value is your <code>m-prob</code> (call it 99%).</li>
<li>For each column, calculate the relative frequency of each value. This is your <code>u-prob</code>.</li>
<li>For each potential match (needle.dob vs haystack.dob), calculate the odds ratio if they agree: <code>m-prob / u-prob</code>, or the odds ratio if they disagree: <code>(1 - m-prob) / (1 - u-prob)</code>.</li>
<li>Multiply all the odds ratios to get the total odds.</li>
<li>Calculate the probability of a match: <code>total_odds / (1 + total_odds)</code>.</li>
<li>If the probability exceeds your threshold, treat the pair as a match; otherwise treat it as a non-match.</li>
</ol>
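The steps above can be sketched in SQL roughly as follows, using only two columns to keep it short. Several things here are assumptions rather than part of the method: the column names, an <code>m-prob</code> of 0.99 for every column, frequencies computed over both tables combined, the <code>u-prob</code> taken from the needle's value, a neutral ratio of 1 when either side is missing, and the 0.9 threshold.

```sql
with all_rows as (
  select ssn, dob from needles
  union all
  select ssn, dob from haystack
),
u_ssn as (  -- step 2: relative frequency of each SSN value
  select ssn, count(*)::numeric / sum(count(*)) over () as u
  from all_rows where ssn is not null group by ssn
),
u_dob as (  -- step 2: relative frequency of each DOB value
  select dob, count(*)::numeric / sum(count(*)) over () as u
  from all_rows where dob is not null group by dob
),
pairs as (
  select
    n.id as needle_id,
    h.id as haystack_id,
    -- steps 3 and 4: one ratio per column, multiplied together
    case
      when n.ssn is null or h.ssn is null then 1.0   -- missing: neutral
      when n.ssn = h.ssn then 0.99 / us.u            -- agree
      else (1 - 0.99) / (1 - us.u)                   -- disagree
    end
    *
    case
      when n.dob is null or h.dob is null then 1.0
      when n.dob = h.dob then 0.99 / ud.u
      else (1 - 0.99) / (1 - ud.u)
    end as total_odds
  from needles n
  join haystack h on n.ssn = h.ssn or n.dob = h.dob
  left join u_ssn us on us.ssn = n.ssn
  left join u_dob ud on ud.dob = n.dob
)
select *
from (
  -- step 5: convert the total odds to a probability
  select needle_id, haystack_id,
         total_odds / (1 + total_odds) as match_probability
  from pairs
) scored
where match_probability > 0.9  -- step 6: threshold
order by match_probability desc;
```

Because a rare value has a small <code>u-prob</code>, agreeing on it produces a large ratio, which is how the method weights columns by how rare their values are.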

I have 2 tables with the following fields:

<ul>
<li>First Name</li>
<li>Last Name </li>
<li>Middle Name</li>
<li>State </li>
<li>Zip </li>
<li>SSN </li>
<li>DOB </li>
<li>Phone</li>
</ul>

I am trying to find the records that match between the 2 tables, as well as records that most likely match but are not an exact match because of input error, missing data, variation in name spelling, etc.

Some of the data is missing. But for all the data that is there, both tables have the same format / data type for each data element. 

Ideally I would like some kind of weighting mechanism for the results.

Now if SSN is a direct match then we have a match. But I would also like to take into account if there was a user input error and 2 digits were mixed up or something like that.

What are my options in PG?

Straight matching does an okay job if I run multiple variations, for example:

<ul>
<li>Social Match</li>
<li>Last Name, DOB, Zip</li>
<li>Last Name, DOB, State</li>
<li>Last Name, First Name, DOB, ZIP</li>
</ul>

However, I would love to deploy a more complete solution and am looking for any tips on how to proceed.

PostgreSQL Fuzzy Matching
