Sometimes it's required to fetch random records from a database. The simplest approach is to use NEWID() with an ORDER BY clause. I have no problem with that, but when you have a large table to query, performance issues start popping up. Why? I'll try to answer this question with an alternative solution.
To start with I’ve Adventureworks database with around 260000 records in a table and we’ve a query that will take a random record from it.
SELECT TOP 1 [BusinessEntityID] FROM HumanResources.EmployeePayHistory ORDER BY NEWID()
The problem with this query is that when you look at the execution plan, you'll find it performs a sort operation before it picks the top record.
So 91% of the cost is due to the sort operation. This will always happen if you use NEWID() to pick random records: the engine first sorts all the records based on their NEWID() values, and sorting on GUID values is a worst case for the sort operator.
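If you want to see this on your own copy of AdventureWorks, here is a minimal sketch using the standard SET STATISTICS switches (the exact numbers will vary with your hardware):

-- Show I/O and CPU/elapsed-time figures for the statements below
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- The NEWID() approach: every row gets a GUID, and all rows are sorted on it
SELECT TOP 1 [BusinessEntityID]
FROM HumanResources.EmployeePayHistory
ORDER BY NEWID();

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;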
Now let's get rid of the NEWID() sort operation with BINARY_CHECKSUM
The best approach is to use BINARY_CHECKSUM instead of NEWID(), so we avoid sorting on a NEWID column generated in memory.

SELECT TOP 1 [BusinessEntityID] FROM HumanResources.EmployeePayHistory WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) as int)) % 100) < 10

The BINARY_CHECKSUM function generates a checksum value based on the values of the columns that you specify. If two rows are different, they will typically generate different checksum numbers. The BINARY_CHECKSUM function is generally used to verify whether any of the columns in a row of a table have changed. However, for our purposes, it generates a number that looks like a random number for each row. The WHERE clause keeps roughly 10 percent of the rows (those whose pseudo-random value modulo 100 falls below 10), and TOP 1 then returns one row from that sample, with no sort at all.

Another form of using BINARY_CHECKSUM can be:

SELECT TOP 1 [BusinessEntityID] FROM HumanResources.EmployeePayHistory WHERE (ABS(CAST((BINARY_CHECKSUM([BusinessEntityID], NEWID())) as int)) % 100) < 10

Both forms of BINARY_CHECKSUM have the same query execution plan with similar cost.

Now let's have a look at the figures analyzed by the Microsoft team to compare the performance of both queries on a 13-million-row table:

Query                    CPU    Data reads   Sort reads   Total reads   Duration
NEWID query              253    347,420      13,810,132   14,157,552    422,891
BINARY_CHECKSUM query    21     347,420      0            347,420       49,203
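If what you actually want is a random sample rather than a single row, the same filter works without TOP 1. A minimal sketch (the modulus and threshold are knobs; % 100 < 1 keeps roughly 1 percent of the rows):

-- Roughly a 1 percent sample, computed in a single scan with no sort
SELECT [BusinessEntityID]
FROM HumanResources.EmployeePayHistory
WHERE (ABS(CAST(BINARY_CHECKSUM([BusinessEntityID], NEWID()) AS int)) % 100) < 1;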
In our case here, I've also done some comparison based on the query plans of the old NEWID query and the new BINARY_CHECKSUM query.
This is just another tip for improving performance; if you have to pick random records from small tables in SQL, you can still use NEWID() with no issues.
Let me know your feedback. Please comment if you'd like to.
Hi Amit,
Well explained!!
I would like to know how I can select 5000 random rows from 5 lakh (500,000) records using BINARY_CHECKSUM?
The number 5000 should be a dynamic input, say n.
Thanks!
Somnath
Top N is of no use in my scenario, as the selection should be distributed across my whole universe (e.g., 5 lakh).
Hi Somnath,
Here TOP N doesn't mean it picks the top records and then shuffles them; it shuffles all the records and then picks N from them. So the selection will be uniform over all your records.
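A rough sketch for your case, assuming a hypothetical table dbo.YourTable of about 5 lakh rows with a key column Id (both names are placeholders): filter roughly 2 x n rows cheaply with BINARY_CHECKSUM, then shuffle only that small sample.

DECLARE @n int = 5000;

SELECT TOP (@n) *
FROM (
    -- BINARY_CHECKSUM filter keeps ~2 percent of 5,00,000 rows (~10,000), no full-table sort
    SELECT *
    FROM dbo.YourTable
    WHERE (ABS(CAST(BINARY_CHECKSUM(Id, NEWID()) AS int)) % 100) < 2
) AS sample
ORDER BY NEWID();  -- shuffle only the ~10,000 sampled rows, then take @n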
Neither ORDER BY NEWID() nor BINARY_CHECKSUM is efficient for those who have fewer than 3 records in a table when randomly selecting data in SQL Server.
Can anyone give an alternate solution?
Thanks,
palanivelu samudi
Thank you very much for this solution. I was stuck with newid and rand.