在database中可能由于某种原因如用户输入,导入数据失败等 导致了重复记录. 如果你没有用主键,约束,或来其它机制实现数据完整性,那最后总是重复记录在你的中.现在让我们来看在sql server 2008中如何删除这些记录, 首先,可以模拟造一些简单重复记录:
代码如下 复制代码
create table dbo.employee
([id] int primary key ,
[name] varchar(50),
[age] int,
[sex] bit default 1)
insert into employee ([id] , [name] , [age] , [sex] ) values(1,'james',25,default)
insert into employee ([id] , [name] , [age] , [sex] ) values(2,'james',25,default)
insert into employee ([id] , [name] , [age] , [sex] ) values(3,'james',25,default)
insert into employee ([id] , [name] , [age] , [sex] ) values(4,'lisa',24,0)
insert into employee ([id] , [name] , [age] , [sex] ) values(5,'lisa',24,0)
insert into employee ([id] , [name] , [age] , [sex] ) values(6,'lisa',24,0)
insert into employee ([id] , [name] , [age] , [sex] ) values(7,'mirsa',23,0)
insert into employee ([id] , [name] , [age] , [sex] ) values(8,'mirsa',23,0)
insert into employee ([id] , [name] , [age] , [sex] ) values(9,'mirsa',23,0)
insert into employee ([id] , [name] , [age] , [sex] ) values(10,'john',26,default)
insert into employee ([id] , [name] , [age] , [sex] ) values(11,'abraham',28,default)
insert into employee ([id] , [name] , [age] , [sex] ) values(12,'lincoln',30,default)
select * from dbo.employee
首先我们使用最常见的方法:
代码如下 复制代码
delete from employee where name in (
name
from employee group by name having count(name)>1);
接着使用rownumber():
代码如下 复制代码
delete t from(
select row_number() over(partition by [name] order by (select 0)) as rownumber,* from employee) t
where t.rownumber > 1;
with du as
(
select row_number() over(partition by [name] order by (select 0)) as rn
from employee
)
delete from dups
where rn>1;
with dups as
(
select [id],[name],[age],[sex]
, row_number() over(partition by [name] order by (select 0)) as rn
,rank() over(partition by [name] order by (select 0)) as rnk
from employee
)
delete from dups
where rnrnk;
下面是这四个t-sql查询的执行计划:
你可以看到没有用cte的方法开销最大, 主要是在table spool, 这里开销了44%, table spool 是一个物理运算符。
table spool 运算符扫描输入,并将各行的一个副本放入隐藏的假脱机表中,此表存储在 tempdb 数据库中并且仅在查询的生存期内存在。如果重绕该运算符(例如通过 nested loops 运算符重绕),但不需要任何重新绑定,则将使用假脱机数据,而不用重新扫描输入。
注意上面的方法只是在重复记录比较少的情况下, 如果重复记录多. delete将会非常慢, 最好的方法是复制目标数据到另一个新表,删除原来的表,重命名新表为原来的表. 或用临时表, 这样还可以减少数据库事务日志. 看下面的t-sql:
代码如下 复制代码
with dups as
(
select [id],[name],[age],[sex]
, row_number() over(partition by [id] order by (select 0)) as rn
from employee
)
select [id],[name],[age],[sex]
into dbo.employeedupstmp
from dups
where rn=1
drop table dbo.employee;
exec sp_rename 'dbo.employeedupstmp','employee'
