文章/答案/技术大牛

发布

社区首页 >问答首页 >按性能随机排序

问按性能随机排序
EN

Stack Overflow用户

提问于 2011-07-13 22:14:10

回答 4查看 3K关注 0票数 5

按随机顺序获取top n行的最好方法是什么？

我使用如下查询：

Select top(10) field1,field2 .. fieldn
from Table1
order by checksum(newid())

上面查询中的问题是，随着表大小的增加，它将变得越来越慢。它将始终执行完全聚集索引扫描，以按随机顺序查找top(10)行。

有没有其他更好的方法呢？

tsql

回答 4

Stack Overflow用户

发布于 2011-07-13 22:45:11

我对此进行了测试，并在更改查询时获得了更好的性能。

我在测试中使用的表的DDL。

CREATE TABLE [dbo].[TestTable]
(
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [Col1] [nvarchar](100) NOT NULL,
    [Col2] [nvarchar](38) NOT NULL,
    [Col3] [datetime] NULL,
    [Col4] [nvarchar](50) NULL,
    [Col5] [int] NULL,
 CONSTRAINT [PK_TestTable] PRIMARY KEY CLUSTERED 
 (
    [ID] ASC
 )
)

GO

CREATE NONCLUSTERED INDEX [IX_TestTable_Col5] ON [dbo].[TestTable] 
(
    [Col5] ASC
)

该表有722888行。

第一个查询：

select top 10
  T.ID,
  T.Col1,
  T.Col2,
  T.Col3,
  T.Col5,
  T.Col5
from TestTable as T
order by newid()

第一次查询的统计数据：

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 13 ms.

(10 row(s) affected)
Table 'TestTable'. Scan count 1, logical reads 12492, physical reads 14, read-ahead reads 6437, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 859 ms,  elapsed time = 1700 ms.

执行计划第一次查询：

第二个查询：

select 
  T.ID,
  T.Col1,
  T.Col2,
  T.Col3,
  T.Col5,
  T.Col5
from TestTable as T
  inner join (select top 10 ID
              from TestTable
              order by newid()) as C
    on T.ID = C.ID

第二个查询的统计数据：

SQL Server parse and compile time: 
   CPU time = 125 ms, elapsed time = 183 ms.

(10 row(s) affected)
Table 'TestTable'. Scan count 1, logical reads 1291, physical reads 10, read-ahead reads 399, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 516 ms,  elapsed time = 706 ms.

执行计划第二次查询：

摘要：

第二个查询是使用Col5上的索引按newid()对行进行排序，然后执行10次聚集索引查找，以获得输出的值。

之所以有性能提升，是因为Col5上的索引比聚集键更窄，这会导致更少的读取。

感谢pointing that out的Martin Smith。

票数 4

Stack Overflow用户

发布于 2011-07-14 03:00:21

减少所需扫描大小的一种方法是结合使用TABLESAMPLE和ORDER by newid，以便从表中选择的页面中选择随机数量的行，而不是扫描整个表。

其思想是计算每页的平均行数，然后使用tablesample为您想要输出的每一行随机选择一页数据。然后，您将仅对该数据子集运行ORDER BY newid()查询。与原始方法相比，这种方法的随机性略低，但比仅使用表样本要好得多，并且需要从表中读取的数据要少得多。

不幸的是，TABLESAMPLE子句不接受变量，因此必须使用动态sql才能使用基于输入表记录大小的动态行值。

declare @factor int
select @factor=8000/avg_record_size_in_bytes from sys.dm_db_index_physical_stats(db_id(), object_id('sample'), null, null, 'detailed') where index_level = 0
declare @numRows int = 10
declare @sampledRows int = @factor * @numRows
declare @stmt nvarchar(max) = N'select top (@numRows) * from sample tablesample (' + convert(varchar(32), @sampledRows) + ' rows) order by checksum(newid())'
exec sp_executesql @stmt, N'@numRows int', @numRows

票数 2

Stack Overflow用户

发布于 2018-09-20 21:27:19

这个问题已经问了7年了，而且还没有被接受的答案。但当我在选择随机行时搜索SQL性能时，它的排名很高。但目前的答案似乎都不能给出一个简单，快速的解决方案，在大型表的情况下，所以我想补充我的建议。

假设：

主键是数字数据类型(每行通常为int / +1 )
主键是聚集索引
表有很多行，应该只选择其中的几行

我认为这是相当普遍的，所以它在很多情况下都会有帮助。

给定一组典型的数据，我的建议是

查找最大和最小
选择一个随机数并检查该数是否为表中的有效id
根据需要重复

这些操作都应该非常快，因为它们都在聚集索引上。只有在最后，才会读取剩余的数据，方法是根据主键列表选择一组数据，这样我们就只拉入我们实际需要的数据。

示例(MS SQL)：

--
-- First, create a table with some dummy data to select from
-- 

DROP TABLE IF EXISTS MainTable
CREATE TABLE MainTable(
    Id int IDENTITY(1,1) NOT NULL,
    [Name] nvarchar(50) NULL,
    [Content] text NULL
)
GO

DECLARE @I INT = 0
WHILE @I < 40
BEGIN
    INSERT INTO MainTable VALUES('Foo', 'bar')
    SET @I=@I+1
END
UPDATE MainTable SET [Name] = [Name] + CAST(Id as nvarchar(50))

-- Create a gap in IDs at the end
DELETE FROM MainTable
    WHERE ID < 10

-- Create a gap in IDs in the middle
DELETE FROM MainTable
    WHERE ID >= 20 AND ID < 30

-- We now have our "source" data we want to select random rows from



--
-- Then we select random data from our table
-- 

-- Get the interval of values to pick random values from
DECLARE @MaxId int
SELECT @MaxId = MAX(Id) FROM MainTable

DECLARE @MinId int
SELECT @MinId = MIN(Id) FROM MainTable

DECLARE @RandomId int
DECLARE @NumberOfIdsTofind int = 10

-- Make temp table to insert ids from
DROP TABLE IF EXISTS #Ids
CREATE TABLE #Ids (Id int)

WHILE (@NumberOfIdsTofind > 0)
BEGIN
    SET @RandomId = ROUND(((@MaxId - @MinId -1) * RAND() + @MinId), 0)
    -- Verify that the random ID is a real id in the main table
    IF EXISTS (SELECT Id FROM MainTable WHERE Id = @RandomId)
    BEGIN
        -- Verify that the random ID has not already been inserted
        IF NOT EXISTS (SELECT Id FROM #Ids WHERE Id = @RandomId)
        BEGIN
            -- It's a valid, new ID, add it to the list.
            INSERT INTO #Ids VALUES (@RandomId)
            SET @NumberOfIdsTofind = @NumberOfIdsTofind - 1;
        END
    END
END

-- Select the random rows of data by joining the main table with our random Ids   
SELECT MainTable.* FROM MainTable
INNER JOIN #Ids ON #Ids.Id = MainTable.Id

票数 1

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/6680310

复制

相似问题

问按性能随机排序
EN

回答 4

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问按性能随机排序EN

回答 4

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问按性能随机排序
EN