排序,分组和集合操作
排序操作
创新互联公司专注于鹤庆企业网站建设,响应式网站,电子商务商城网站建设。鹤庆网站建设公司,为鹤庆等地区提供建站服务。全流程按需定制制作,专业设计,全程项目跟踪,创新互联公司专业和态度为您提供的服务
排序是计算机承担的最基本操作之一,尤其是在数据库处理领域,oracle也不例外。可能需要oracle排序数据的操作包括以下几种;
(1)创建一个索引
(2)通过group by,unique或distinct关键字对数据进行分组或聚合
(3)因为使用order by子句使得数据按照排好的顺序返回
(4)使用排序合并方法联结表或结果集
(5)使用集合操作union,intersect或minus
(6)执行特定的子查询
查询可能需要可观的资源,具体说明如下:
CPU总是要消耗的。需要CPU的数量和需要排序的结果集大小成正比。
oracle分配一块内存用来排序。这块内存来自于程序全局区(PGA).可用的PGA内存总量通常取决于MEMORY_TARGET或PGA_AGGREGATE_TARGET的值
如果内存区域不足以完成排序,oracle会分配一个或多个临时段到临时表空间。这个叫作磁盘排序。磁盘排序会增加额外的开销,包括在临时段中分配空间和把数据写入临时表空间以及磁盘读出的IO消耗。
1.1 最优,一次或多次排序
可供执行排序操作的内存量是决定排序性能的最重要的因数。如果供排序使用的内存量足以使排序在内存中完成,则性能将是最佳的,实际上oracle将这种类型的排序操作称为最优排序。
如果没有足够的内存,当执行排序操作时,oracle必须读写临时段。在一次排序中,orale仅需要写入(然后读回)一个短。在多次排序中,oracle需要写并读回多个排序段。需要的排序次数越多。排序中所包含的IO就越多,排序性能也就越差。
排序所需要的IO随着排序次数的增加而急剧增长,排序次数最终成为影响sql性能的主要因素。
1.2 衡量排序活动
现在让我们来看看如何能够衡量排序活动
优化器计算它认为将需要的内存量,并且你可以通过检查dbms_xplan输出的tempSpc字段看到这些(即使在你执行sql之前)
SQL> explain plan for select * from customers order by cust_last_name,cust_first_name,cust_year_of_birth;
Explained.
SQL> select * from table(dbms_xplan.display());
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------
Plan hash value: 2792773903
----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 55500 | 9810K| | 2609 (1)| 00:00:02 |
| 1 | SORT ORDER BY | | 55500 | 9810K| 12M| 2609 (1)| 00:00:02 |
| 2 | TABLE ACCESS FULL| CUSTOMERS | 55500 | 9810K| | 405 (1)| 00:00:01 |
---------------------------------------------------------------------------------------
9 rows selected.
DBMS_XPLAN产生的值是一个估计值,但是它可以给你一个对于所需内存的大体感受。
V$SQL_WORKAREA视图包含于数据库中已发生的内存使用及排序行为的相关的详细统计信息。
WITH sql_workarea AS
(SELECT sql_id || '-' || child_number sql_id_child,
operation_type operation,
last_execution last_exec,
round(active_time / 1000000, 2) seconds,
optimal_executions || '/' || onepass_executions || '/' ||
multipasses_executions olm,
substr(sql_text, 1, 155) sql_text,
rank() over(ORDER BY active_time DESC) ranking
FROM v$sql_workarea
JOIN v$sql
USING (sql_id, child_number))
SELECT sql_id_child "SQL_ID -CHILD",
seconds,
operation,
last_exec,
olm "O/1/M",
sql_text
FROM sql_workarea
WHERE ranking <= 3
ORDER BY ranking;
1.3 跟踪排序活动
SQL> oradebug event 10032 trace name context forever,level 12;
---- Sort Parameters ------------------------------
sort_area_size 12691456
sort_area_retained_size 12691456
sort_multiblock_read_count 1
max intermediate merge width 773
*** 2014-12-15 13:34:36.765
sorcls: sorp 0x7fd8df3ea1c8
---- Sort Statistics ------------------------------
Input records 55500
Output records 55500
Total number of comparisons performed 656484
Comparisons performed by in-memory sort 656484
Total amount of memory used 12691456
Uses version 2 sort
---- End of Sort Statistics -----------------------
1.4 使用索引规避排序
如果再order by子句中的部分或全部列上存在索引,oracle有可能使用索引来按照要求的顺序获取记录,因此也避免了排序操作。
假如索引是出现于order by子句里的列相同的列上,oracle可以直接从索引中按照索引排序的顺序读取记录。然而,按键的顺序读取记录需要一块接一块地全扫描索引叶子块。虽然快速全扫描比常规的全索引扫描高效得多,但是快速全扫描无法按索引顺序返回记录。因此也不能用来避免排序操作。
SQL> select * from customers order by cust_last_name,cust_first_name,cust_year_of_birth;
55500 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 2792773903
----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 55500 | 9810K| | 2609 (1)| 00:00:02 |
| 1 | SORT ORDER BY | | 55500 | 9810K| 12M| 2609 (1)| 00:00:02 |
| 2 | TABLE ACCESS FULL| CUSTOMERS | 55500 | 9810K| | 405 (1)| 00:00:01 |
----------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
1456 consistent gets
1454 physical reads
0 redo size
6366832 bytes sent via SQL*Net to client
41213 bytes received via SQL*Net from client
3701 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
55500 rows processed
然而,如果我们在order by的列上创建一个索引,并使用first_rows提示(或index提示),oracle将使用索引替代排序:
SQL> create index cust_member_i on customers(cust_last_name,cust_first_name,cust_year_of_birth);
Index created.
SQL> select /*+ index(customers cust_member_i) */* from customers order by cust_last_name,cust_first_name,cust_year_of_birth;
55500 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 3023887059
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 55500 | 9810K| 20550 (1)| 00:00:15 |
| 1 | TABLE ACCESS BY INDEX ROWID| CUSTOMERS | 55500 | 9810K| 20550 (1)| 00:00:15 |
| 2 | INDEX FULL SCAN | CUST_MEMBER_I | 55500 | |225 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
26557 consistent gets
1701 physical reads
0 redo size
11063057 bytes sent via SQL*Net to client
41213 bytes received via SQL*Net from client
3701 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
55500 rows processed
虽然使用索引就不再需要排序了,但是同时读取索引和表块,已经按块顺次读取这种并不高效的扫描方式所带来的开销,比仅使用全表扫描读取表块的方式要欠佳很多。通常,这意味着了为了避免排序而使用索引,实际上会到导致更差的性能。然而,使用索引的再检索第一行记录时速度更快。因为一旦需要的记录被检索到,它就立即返回。相比之下排序的方法要求再任一记录返回之前,全部记录都必须被检索并完成排序。因此,在优化器目标为first_rows_n时,优化器倾向于使用索引,而在目标是all_rows时,则会使用全表扫描。
2 分组与聚合
聚合(aggregate)操作基于多行记录返回数据,平均,求和,最大和最小值等。分组(grouping)操作允许我们对共享相同的group by值的记录集返回它的集合。
2.1 聚合操作
聚合操作(sum和average)必须处理输入数据的每一行记录。因此,它们通常和全表扫描联系在一起:
SQL> select sum(quantity_sold) from sales;
Execution Plan
----------------------------------------------------------
Plan hash value: 3519235612
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 3 | 525 (2)| 00:00:01 | | |
| 1 | SORT AGGREGATE | | 1 | 3 | | | | |
| 2 | PARTITION RANGE ALL| | 918K|2691K| 525 (2)| 00:00:01 | 1 | 28 |
| 3 | TABLE ACCESS FULL | SALES | 918K|2691K| 525 (2)| 00:00:01 | 1 | 28 |
-------------------------------------------------------------------------------------------
如果被聚合上的列存在索引,对于这个索引的快速全扫描通常会更高效:
2.2 最大值和最小值
与大多数其他聚合操作不同,如果在相关列上存在索引,max和min操作并不需要读取每一行记录。如果存在B树索引,我们可以通过检查第一个或最后一个索引项来确定最大值或最小值,这仅需要3-5个逻辑读的开销
SQL> select max(amount_sold) from sales;(没有索引)
Execution Plan
----------------------------------------------------------
Plan hash value: 3519235612
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 5 | 525 (2)| 00:00:01 | | |
| 1 | SORT AGGREGATE | | 1 | 5 | | | | |
| 2 | PARTITION RANGE ALL| | 918K|4486K| 525 (2)| 00:00:01 | 1 | 28 |
| 3 | TABLE ACCESS FULL | SALES | 918K|4486K| 525 (2)| 00:00:01 | 1 | 28 |
-------------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
1635 consistent gets
1619 physical reads
0 redo size
536 bytes sent via SQL*Net to client
524 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
有索引的情况:
SQL> select max(amount_sold) from sales;
Execution Plan
----------------------------------------------------------
Plan hash value: 781264156
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 5 | 3 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 5 | | |
| 2 | INDEX FULL SCAN (MIN/MAX)| AMOUNT_SOLD_IDX | 1 | 5 | 3 (0)| 00:00:01 |
------------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
3 consistent gets
2 physical reads
0 redo size
536 bytes sent via SQL*Net to client
524 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
然而,如果我们要同时查找最大值和最小值,oracle需要花费很多的逻辑读,此时全扫描索引代价会更高,oracle选择全表扫描。
SQL> select max(amount_sold),min(amount_sold) from sales;
Execution Plan
----------------------------------------------------------
Plan hash value: 3519235612
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 5 | 525 (2)| 00:00:01 | | |
| 1 | SORT AGGREGATE | | 1 | 5 | | | | |
| 2 | PARTITION RANGE ALL| | 918K|4486K| 525 (2)| 00:00:01 | 1 | 28 |
| 3 | TABLE ACCESS FULL | SALES | 918K|4486K| 525 (2)| 00:00:01 | 1 | 28 |
-------------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
1635 consistent gets
1619 physical reads
0 redo size
618 bytes sent via SQL*Net to client
524 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
实际上分别提交MAX和MIN查询然后将结果合并到一起是一种更好的办法:
SQL> select max_sold,min_sold from(select max(amount_sold)max_sold from sales)maxt,(select min(amount_sold) min_sold from sales)mint;
Execution Plan
----------------------------------------------------------
Plan hash value: 3650580342
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 26 | 6 (0)| 00:00:01 |
| 1 | NESTED LOOPS | | 1 | 26 | 6 (0)| 00:00:01 |
| 2 | VIEW | | 1 | 13 | 3 (0)| 00:00:01 |
| 3 | SORT AGGREGATE | | 1 | 5 | | |
| 4 | INDEX FULL SCAN (MIN/MAX)| AMOUNT_SOLD_IDX | 1 | 5 | 3 (0)| 00:00:01 |
| 5 | VIEW | | 1 | 13 | 3 (0)| 00:00:01 |
| 6 | SORT AGGREGATE | | 1 | 5 | | |
| 7 | INDEX FULL SCAN (MIN/MAX)| AMOUNT_SOLD_IDX | 1 | 5 | 3 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
6 consistent gets
0 physical reads
0 redo size
602 bytes sent via SQL*Net to client
524 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
2.3 "前N"查询
在sales表中按美元计销量最好的前10种商品,以下查询肯定是不对的:
SQL> select * from sales where rownum<=10 order by amount_sold desc;
PROD_ID CUST_ID TIME_IDCHANNEL_ID PROMO_ID QUANTITY_SOLD AMOUNT_SOLD
---------- ---------- --------- ---------- ---------- ------------- -----------
13 987 10-JAN-98 3 999 11232.16
13 1660 10-JAN-98 3 999 11232.16
13 1762 10-JAN-98 3 999 11232.16
13 1843 10-JAN-98 3 999 11232.16
13 4663 10-JAN-98 3 999 11232.16
13 2273 10-JAN-98 3 999 11232.16
13 2380 10-JAN-98 3 999 11232.16
13 2683 10-JAN-98 3 999 11232.16
13 2865 10-JAN-98 3 999 11232.16
13 1948 10-JAN-98 3 999 11232.16
10 rows selected.
这是因为对where的处理会先于order by,因此这个查询将获取它最先发现的的10条记录,然后对它们进行排序。这样的结果不是真正的前10.
下面的查询更合理:
SQL> select /* top10 subquery */ * from(select cust_id,prod_id,time_id,amount_sold from sales order by amount_sold desc) where rownum<=10;
CUST_ID PROD_ID TIME_IDAMOUNT_SOLD
---------- ---------- --------- -----------
3948 18 26-APR-99 1782.72
4150 18 26-JUN-99 1782.72
40 18 26-JUN-99 1782.72
33724 18 21-JUN-99 1782.72
32863 18 21-JUN-99 1782.72
31364 18 21-JUN-99 1782.72
10864 18 21-JUN-99 1782.72
10620 18 21-JUN-99 1782.72
6490 18 21-JUN-99 1782.72
4788 18 21-JUN-99 1782.72
Execution Plan
----------------------------------------------------------
Plan hash value: 443584055
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 10 | 480 | 4 (0)| 00:00:01 | | |
|* 1 | COUNT STOPKEY | | | | | | | |
| 2 | VIEW | | 10 | 480 | 4 (0)| 00:00:01 | | |
| 3 | TABLE ACCESS BY GLOBAL INDEX ROWID| SALES | 918K| 19M| 4 (0)| 00:00:01 | ROWID | ROWID |
| 4 | INDEX FULL SCAN DESCENDING | AMOUNT_SOLD_IDX | 10 | | 3 (0)| 00:00:01 | | |
-------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<=10)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
6 consistent gets
0 physical reads
0 redo size
908 bytes sent via SQL*Net to client
524 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
10 rows processed
STOPKEY表示不用将排序的结果全部返回。
使用分析函数:
SQL> set lines 200
SELECT /* top10 dense_rank */
*
FROM (SELECT cust_id,
prod_id,
time_id,
amount_sold,
dense_rank() over(ORDER BY amount_sold DESC) ranking
FROM sales)
9 WHERE ranking <= 10;
1355 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 2275521554
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 918K| 53M| | 6623 (1)| 00:00:05 | | |
|* 1 | VIEW | | 918K| 53M| | 6623 (1)| 00:00:05 | | |
|* 2 | WINDOW SORT PUSHED RANK| | 918K| 19M| 31M| 6623 (1)| 00:00:05 | | |
| 3 | PARTITION RANGE ALL | | 918K| 19M| | 525 (2)| 00:00:01 | 1 | 28 |
| 4 | TABLE ACCESS FULL | SALES | 918K| 19M| | 525 (2)| 00:00:01 | 1 | 28 |
-------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("RANKING"<=10)
2 - filter(DENSE_RANK() OVER ( ORDER BY INTERNAL_FUNCTION("AMOUNT_SOLD") DESC )<=10)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
1635 consistent gets
1619 physical reads
0 redo size
31018 bytes sent via SQL*Net to client
1513 bytes received via SQL*Net from client
92 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
1355 rows processed
尽管我对于子查询得到前10的方法有所保留,但是它更高效。dense_rank方法比排序的子查询方法消耗更多的内存和CPU。
2.4 分组操作
分组(group by)操作允许sql语句对一批列中的每个唯一值返回一行记录,并且对每个一个这样的组计算聚合。例如,下面的语句对每一个国家代码返回平均信用限额:
SQL> select country_id,avg(cust_credit_limit) from customers group by country_id;
19 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 1577413243
--------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 19 | 171 | 407 (1)| 00:00:01 |
| 1 | HASH GROUP BY | | 19 | 171 | 407 (1)| 00:00:01 |
| 2 | TABLE ACCESS FULL| CUSTOMERS | 55500 | 487K| 405 (1)| 00:00:01 |
--------------------------------------------------------------------------------
文章标题:排序,分组和集合操作
地址分享:http://scyanting.com/article/jipsdg.html