我使用的是BigQuery。我有两个简单的表,来自我们系统的数据质量“很差”。一个表示收入,另一个表示公交车行程的生产行。我需要匹配每个旅程的收入交易,但我只有一组字段,没有关键字,我真的不知道如何做这个匹配。
这是一个数据示例:
收入
Year, Agreement, Station_origin, Station_destination, Product
2020, 123123, London, Manchester, Qwerty旅程
Year, Agreement, Station_origin, Station_destination, Product
2020, 123123, Kings Cross, Piccadilly Gardens, Qwer
2020, 123123, Kings Cross, Victoria Station, Qwert
2020, 123123, London, Manchester, Qwerty每个工作站最多有9个替代名称,这些名称存储在“工作站”表中。
车站
Station Name, Station Name 2, Station Name 3,...
London, Kings Cross, Euston,...
Manchester, Piccadilly Gardens, Victoria Station,...我想测试匹配或连接表首先与原始字段。这将生成一些匹配,但有许多行程是不匹配的。对于不匹配的收入行,我想更改产品名称(将其缩短为两个字母,并可能从生产表中获得许多匹配项),然后通过首先更改station_origin,然后更改station_destination来更改站点名称。当使用较短的产品名称时,我可能会得到许多匹配,但我希望production表中的行包含最常见的产品。如下所示:
1. Do a direct match. That is, I can use the fields as they are in the tables.
2. Do a match where the revenue.product is changed by shortening it to two letters. substr(product,0,2)
3. Change the rev.station_origin to the first alternative, Station Name 2, and then try a join. The product or other station are not changed.
4. Change the rev.station_origin to the first alternative, Station Name 2, and then try a join. The product is changed as above with a substr(product,0,2) but rev.station_destination is not changed.
5. Change the rev.station_destination to the first alternative, Station Name 2, and then try a join. The product or other station are not changed.有人告诉我,也许我应该创建一个包含所有站点和产品组合的中间表,并让一个排名列来决定顺序。站点表中的站点名称是按重要性排序的,因此“站点名称”比“站点名称2”更重要,依此类推。
我开始使用每个等级的子查询进行查询,并执行UNION ALL,但是有如此多的组合,因此必须有另一种方法来实现这一点。
不知道这是否有任何意义,但我将感谢任何帮助或想法,以更好的方式做到这一点。干杯,克丽丝
发布于 2020-07-02 06:47:55
要实现具有近似匹配的复杂连接策略,在JavaScript中定义策略并从BigQuery SQL查询中调用函数可能更有意义。
例如,以下查询执行以下步骤:
请注意,选择最近的选项的逻辑被封装在JS fhoffa.x.fuzzy_extract_one()中。要了解更多信息,请参阅https://medium.com/@hoffa/new-in-bigquery-persistent-udfs-c9ea4100fd83。
WITH data AS (
SELECT name, gender, SUM(number) c
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY 1,2
), top_men AS (
SELECT * FROM data WHERE gender='M'
ORDER BY c DESC LIMIT 200
), top_women AS (
SELECT * FROM data WHERE gender='F'
ORDER BY c DESC LIMIT 200
)
SELECT name male_name,
COALESCE(
(SELECT name FROM top_women WHERE name=a.name)
, fhoffa.x.fuzzy_extract_one(name, ARRAY(SELECT name FROM top_women))
) female_version
FROM top_men a

https://stackoverflow.com/questions/62678455
复制相似问题