How to get the same percent_rank in SQL and pandas?

Stack Overflow user

Asked 2020-07-24 16:44:43

1 answer · 763 views · 0 followers · 8 votes

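I am learning pyspark, which uses HiveQL, and found it interesting that percent rank gives two different answers in pyspark-sql and pandas.

Source of the question, with the SQL code: https://www.windowfunctions.com/questions/ranking/3

How can I get the same result in pandas as in SQL?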

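Two questions

What is the Python code that gives the same result as SQL?
What is the SQL code that gives the same result as pandas?

pyspark-sql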

q = """
select name, weight,
       percent_rank() over (order by weight) as percent_rank_wt
from cats
order by weight
"""
spark.sql(q).show()

SQL gives this table. I would like the same table using pandas.

+-------+------+-------------------+
|   name|weight|    percent_rank_wt|
+-------+------+-------------------+
| Tigger|   3.8|                0.0|
|  Molly|   4.2|0.09090909090909091|
|  Ashes|   4.5|0.18181818181818182|
|Charlie|   4.8| 0.2727272727272727|
| Smudge|   4.9|0.36363636363636365|
|  Felix|   5.0|0.45454545454545453|
|   Puss|   5.1| 0.5454545454545454|
| Millie|   5.4| 0.6363636363636364|
|  Alfie|   5.5| 0.7272727272727273|
|  Misty|   5.7| 0.8181818181818182|
|  Oscar|   6.1| 0.9090909090909091|
| Smokey|   6.1| 0.9090909090909091|
+-------+------+-------------------+

pandas

methods = {'average', 'min', 'max', 'first', 'dense'}

df[['name','weight']].sort_values('weight').assign(
     pct_avg=df['weight'].rank(pct=True,method='average'),
     pct_min=df['weight'].rank(pct=True,method='min'),
     pct_max=df['weight'].rank(pct=True,method='max'),
     pct_first=df['weight'].rank(pct=True,method='first'),
     pct_dense=df['weight'].rank(pct=True,method='dense')
).sort_values('weight')
       name  weight   pct_avg   pct_min   pct_max  pct_first  pct_dense
4    Tigger     3.8  0.083333  0.083333  0.083333   0.083333   0.090909
0     Molly     4.2  0.166667  0.166667  0.166667   0.166667   0.181818
1     Ashes     4.5  0.250000  0.250000  0.250000   0.250000   0.272727
11  Charlie     4.8  0.333333  0.333333  0.333333   0.333333   0.363636
3    Smudge     4.9  0.416667  0.416667  0.416667   0.416667   0.454545
2     Felix     5.0  0.500000  0.500000  0.500000   0.500000   0.545455
9      Puss     5.1  0.583333  0.583333  0.583333   0.583333   0.636364
7    Millie     5.4  0.666667  0.666667  0.666667   0.666667   0.727273
5     Alfie     5.5  0.750000  0.750000  0.750000   0.750000   0.818182
8     Misty     5.7  0.833333  0.833333  0.833333   0.833333   0.909091
6     Oscar     6.1  0.958333  0.916667  1.000000   0.916667   1.000000
10   Smokey     6.1  0.958333  0.916667  1.000000   1.000000   1.000000

Setup

import numpy as np
import pandas as pd

import pyspark
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark import SparkConf, SparkContext, SQLContext
spark = pyspark.sql.SparkSession.builder.appName('app').getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)

df = pd.DataFrame({
    'name': [
        'Molly', 'Ashes', 'Felix', 'Smudge', 'Tigger', 'Alfie', 'Oscar',
        'Millie', 'Misty', 'Puss', 'Smokey', 'Charlie'
    ],
    'breed': [
        'Persian', 'Persian', 'Persian', 'British Shorthair',
        'British Shorthair', 'Siamese', 'Siamese', 'Maine Coon', 'Maine Coon',
        'Maine Coon', 'Maine Coon', 'British Shorthair'
    ],
    'weight': [4.2, 4.5, 5.0, 4.9, 3.8, 5.5, 6.1, 5.4, 5.7, 5.1, 6.1, 4.8],
    'color': [
        'Black', 'Black', 'Tortoiseshell', 'Black', 'Tortoiseshell', 'Brown',
        'Black', 'Tortoiseshell', 'Brown', 'Tortoiseshell', 'Brown', 'Black'
    ],
    'age': [1, 5, 2, 4, 2, 5, 1, 5, 2, 2, 4, 4]
})

schema = StructType([
    StructField('name', StringType(), True),
    StructField('breed', StringType(), True),
    StructField('weight', DoubleType(), True),
    StructField('color', StringType(), True),
    StructField('age', IntegerType(), True),
])

sdf = sqlContext.createDataFrame(df, schema)
sdf.createOrReplaceTempView("cats")

Tags: hiveql, python, sql, pandas, pyspark

1 Answer

Stack Overflow user

Answered 2020-11-25 19:58:16

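SQL's percent_rank is not exactly the same as pandas' rank. There are two main differences:

SQL's percent_rank excludes the current row from the calculation. So if the table has 11 rows, then for each row it computes the result using only the other 10 rows. pandas' rank includes all rows.
SQL's percent_rank gives the share of rows that are strictly smaller than the current row. pandas' rank has no method that does this.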
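To make the two differences concrete, here is a minimal sketch on a small made-up series (the values and variable names are illustrative assumptions, not taken from the question). It computes the SQL-style value straight from the definition, by counting strictly smaller rows and dividing by the number of other rows, and puts it next to what pandas' rank(pct=True, method='min') returns.

```python
import pandas as pd

# Illustrative data with one tie; not the cats table from the question.
weights = pd.Series([10, 20, 20, 30])
n = len(weights)

# SQL-style percent_rank from the definition:
# (rows strictly smaller than the current row) / (all rows except the current one)
strictly_smaller = weights.apply(lambda w: (weights < w).sum())
sql_style = strictly_smaller / (n - 1)

# pandas percentile rank: (min rank within ties) / (all rows, current one included)
pandas_style = weights.rank(pct=True, method='min')

print(pd.DataFrame({'weight': weights,
                    'sql_style': sql_style,
                    'pandas_style': pandas_style}))
```

The first row comes out as 0 in the SQL-style column but 0.25 in the pandas column, which is the same kind of offset visible between the two tables in the question.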

What is the Python code that gives the same result as SQL?

To get the equivalent of SQL's percent_rank in pandas, you can apply a small calculation to the rank result:

(df['weight'].rank(method='min')-1) / (len(df['weight'])-1)

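The -1 in the numerator gives the number of rows strictly smaller than the current row, and the -1 in the denominator computes the result over all rows except the current one.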
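As a check, the formula can be applied to the name and weight columns from the question's setup. The sketch below wraps it in a helper called percent_rank, which is a hypothetical name introduced here, and it assumes the column has no missing values. The guard for a single-row input returns 0.0, which is my understanding of how SQL engines treat that case, so treat it as an assumption.

```python
import pandas as pd

# name and weight columns copied from the question's setup
cats = pd.DataFrame({
    'name': ['Molly', 'Ashes', 'Felix', 'Smudge', 'Tigger', 'Alfie', 'Oscar',
             'Millie', 'Misty', 'Puss', 'Smokey', 'Charlie'],
    'weight': [4.2, 4.5, 5.0, 4.9, 3.8, 5.5, 6.1, 5.4, 5.7, 5.1, 6.1, 4.8],
})


def percent_rank(values: pd.Series) -> pd.Series:
    """Hypothetical helper: (min rank - 1) / (n - 1), assuming no missing values."""
    n = len(values)
    if n <= 1:
        # Avoid dividing by zero; assumed to match SQL's result for one row.
        return pd.Series(0.0, index=values.index)
    return (values.rank(method='min') - 1) / (n - 1)


result = (
    cats.assign(percent_rank_wt=percent_rank(cats['weight']))
        .sort_values('weight', kind='stable')
)
print(result)
```

With 12 rows the denominator is 11, so the values step through 0, 1/11, 2/11 and so on, and the two tied cats at 6.1 both get 10/11, as in the Spark output.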

What is the SQL code that gives the same result as pandas?

It depends on which method you use in pandas' rank, but you probably need SQL's cume_dist.
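For reference, cume_dist is defined as the number of rows with a value less than or equal to the current row's value, divided by the total number of rows. Under that definition, and assuming ascending order with no missing values, it should line up with pandas' rank(pct=True, method='max'). The sketch below checks this on a small made-up series by computing the definition directly in pandas rather than by running SQL.

```python
import pandas as pd

# Illustrative data with a tie at the top; not the full cats table.
weights = pd.Series([3.8, 4.2, 6.1, 6.1])
n = len(weights)

# cume_dist from its definition: rows with value <= current value, over all rows
cume_dist = weights.apply(lambda w: (weights <= w).sum()) / n

# pandas percentile rank using the highest rank within each group of ties
pct_max = weights.rank(pct=True, method='max')

print(pd.DataFrame({'weight': weights,
                    'cume_dist': cume_dist,
                    'pct_max': pct_max}))
```

Other rank methods such as 'average' or 'first' resolve ties differently, so they would not be expected to match cume_dist when ties are present.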

3 votes

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/63077950
