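A table contains information from two input streams. A user may appear in both streams, but with slight differences. I am trying to find these duplicate users. I came up with this SQL statement, which finds most of them: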
SELECT s.PROF_MAIL, s.PROF_STATE, s.PROF_GUID, CONCAT(s.PROF_GIVEN_NAME,' ',s.PROF_SURNAME) AS FullName, t.*
FROM [EMPLOYEE] s
join (
SELECT PROF_GIVEN_NAME,PROF_SURNAME, count(*) as qty
FROM [EMPLOYEE]
GROUP BY PROF_GIVEN_NAME,PROF_SURNAME
HAVING count(*) > 1
) t on s.PROF_GIVEN_NAME = t.PROF_GIVEN_NAME AND s.PROF_SURNAME = t.PROF_SURNAME
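The problem is that a name may have an accent in one source, such as René, but not in the other. Capitalization is not necessarily the same either. The statement above does not catch these cases. So I tried to incorporate COLLATE Latin1_General_CI_AI somewhere, but I don't know where to use it, or how to solve this another way. Does anyone know how to do this? The database is MS SQL.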
Posted on 2018-05-31 20:01:02
First, you should use window functions:
select e.*
from (select e.*,
             count(*) over (partition by prof_given_name, prof_surname) as cnt
      from [EMPLOYEE] e
     ) e
where cnt > 1;
Now you can incorporate collate into the partition by clause:
select e.*
from (select e.*,
             count(*) over (partition by prof_given_name collate Latin1_General_CI_AI,
                                         prof_surname collate Latin1_General_CI_AI) as cnt
      from [EMPLOYEE] e
     ) e
where cnt > 1;
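To see the effect of the collation, here is a minimal, self-contained sketch. The table variable @Employee and its three sample rows are made up for illustration; only the column names come from the question. Under Latin1_General_CI_AI (case-insensitive, accent-insensitive), 'René Dupont' and 'rene DUPONT' should land in the same partition, so rows 1 and 2 are expected to come back while row 3 is not.

```sql
-- Hypothetical sample data; only the column names are taken from the question.
DECLARE @Employee TABLE (
    PROF_GUID       INT,
    PROF_GIVEN_NAME NVARCHAR(50),
    PROF_SURNAME    NVARCHAR(50)
);

INSERT INTO @Employee (PROF_GUID, PROF_GIVEN_NAME, PROF_SURNAME)
VALUES (1, N'René', N'Dupont'),
       (2, N'rene', N'DUPONT'),
       (3, N'Anna', N'Schmidt');

-- Count the rows per name, comparing names without regard to case or accents.
SELECT e.*
FROM (SELECT e.*,
             COUNT(*) OVER (PARTITION BY PROF_GIVEN_NAME COLLATE Latin1_General_CI_AI,
                                         PROF_SURNAME COLLATE Latin1_General_CI_AI) AS cnt
      FROM @Employee e
     ) e
WHERE cnt > 1;
```

Unlike the join against a grouped subquery, this reads the table once and returns every row of each duplicate group.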
Posted on 2018-05-31 20:04:02
You can use the ROW_NUMBER window function with the names in the partition, like this (the collation can be incorporated as well):
;WITH cteDups
AS (
    SELECT *,
           RN = ROW_NUMBER() OVER (
                    PARTITION BY PROF_GIVEN_NAME COLLATE Latin1_General_CI_AI,
                                 PROF_SURNAME COLLATE Latin1_General_CI_AI
                    ORDER BY PROF_SURNAME ASC)
    FROM dbo.Employee
)
SELECT * FROM cteDups WHERE cteDups.RN > 1
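If the EMPLOYEE table has a DATETIME column that records when each row was created, use that column in the ORDER BY instead, so that you can identify the most recent record.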
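As a rough sketch of that suggestion, assume a creation-time column named CREATED_AT. That name, the table variable @Employee, and the sample rows are all hypothetical. Ordering by CREATED_AT DESC gives the newest row in each name group RN = 1, so filtering on RN > 1 returns only the older duplicates (row 1 in this sample data).

```sql
-- Hypothetical sample data; CREATED_AT is an assumed column name.
DECLARE @Employee TABLE (
    PROF_GUID       INT,
    PROF_GIVEN_NAME NVARCHAR(50),
    PROF_SURNAME    NVARCHAR(50),
    CREATED_AT      DATETIME
);

INSERT INTO @Employee (PROF_GUID, PROF_GIVEN_NAME, PROF_SURNAME, CREATED_AT)
VALUES (1, N'René', N'Dupont',  '2018-01-15T09:00:00'),
       (2, N'rene', N'DUPONT',  '2018-05-20T14:30:00'),
       (3, N'Anna', N'Schmidt', '2018-03-02T11:00:00');

-- Number the rows within each case- and accent-insensitive name group,
-- newest first, so that RN = 1 is the most recent record.
WITH cteDups AS (
    SELECT *,
           RN = ROW_NUMBER() OVER (
                    PARTITION BY PROF_GIVEN_NAME COLLATE Latin1_General_CI_AI,
                                 PROF_SURNAME COLLATE Latin1_General_CI_AI
                    ORDER BY CREATED_AT DESC)
    FROM @Employee
)
SELECT * FROM cteDups WHERE RN > 1;
```

Note that RN > 1 leaves out the first row of every group. To list all members of a duplicate group, a COUNT(*) OVER (...) filter would be the closer fit.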
https://stackoverflow.com/questions/50623769