Importing Data from a Database into HDFS with Sqoop

Basic usage
Consider the following shell script:
# Oracle connection string: contains the Oracle host address, port, and SID
connecturl=jdbc:oracle:thin:@20.135.60.21:1521:dwrac2
# User name
oraclename=kkaa
# Password
oraclepassword=kkaa123
# Name of the Oracle table to import
oralcetablename=tt
# Columns of the Oracle table to import
columns=area_id,team_name
# HDFS path where the imported Oracle data is stored
hdfspath=apps/as/hive/$oralcetablename
# Run the import: copy the data from Oracle into HDFS
sqoop import --append --connect "$connecturl" --username "$oraclename" --password "$oraclepassword" --target-dir "$hdfspath" --num-mappers 1 --table "$oralcetablename" --columns "$columns" --fields-terminated-by '\001'
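Once this script has run, the import is complete.
Next, you can create an external table yourself and point its location at the HDFS path where the Oracle data was stored.
Note: the data this import writes to HDFS is plain text, so when you create the Hive external table you should not specify rcfile as the file format; the default textfile is what you want. Fields are separated by '\001'. If the same table is imported more than once, the new data is appended to the HDFS directory.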
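The external-table step described above can be sketched as a small shell function that prints the Hive DDL. This is only an illustration: the table name tt_ext, the STRING column types, and the absolute HDFS location are assumptions, since the article gives only the column names and a relative path.

```shell
# Hypothetical names: the article does not give the Hive table name,
# the column types, or the absolute HDFS location.
hive_table=tt_ext
hdfs_location=/user/hive_user/apps/as/hive/tt

# build_ddl TABLE LOCATION: print DDL for a text-format external table
# whose fields are separated by '\001', matching the sqoop import above.
build_ddl() {
  cat <<EOF
CREATE EXTERNAL TABLE $1 (
  area_id STRING,
  team_name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE
LOCATION '$2';
EOF
}

build_ddl "$hive_table" "$hdfs_location"
```

The printed statement could then be reviewed and passed to Hive, for example with hive -e, after the column types and location have been adjusted to the real table.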
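Parallel import
Suppose we have the following sqoop command, which imports data from Oracle into HDFS:
sqoop import --append --connect "$connecturl" --username "$oraclename" --password "$oraclepassword" --target-dir "$hdfspath" -m 1 --table "$oralcetablename" --columns "$columns" --fields-terminated-by '\001' --where "data_desc='2011-02-26'"
Note the "-m" parameter in this command. It sets the number of parallel map tasks; its value here is 1, which means parallel import is not enabled.
We can now raise the value of "-m" to use parallel import, as in the following command:
sqoop import --append --connect "$connecturl" --username "$oraclename" --password "$oraclepassword" --target-dir "$hdfspath" -m 4 --table "$oralcetablename" --columns "$columns" --fields-terminated-by '\001' --where "data_desc='2011-02-26'"
In general, sqoop then starts 4 map tasks that import data at the same time.
However, if the Oracle table being imported has no primary key, the following error is reported: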
ERROR tool.ImportTool: Error during import: No primary key could be found for table creater_user.popt_cas_redirect_his. Please specify one with --split-by or perform a sequential import with '-m 1'.
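To make good use of sqoop's parallel import in this situation, we need to understand how the mechanism actually works.
If the primary key of the Oracle table is id and the degree of parallelism is 4, sqoop first runs a query like this:
select min(id) as min, max(id) as max from table [where <the where clause, if one was specified>];
This query returns the minimum and maximum values of the split column (id); suppose they are 1 and 1000 respectively.
Sqoop then splits the query according to the requested degree of parallelism. In this example, the parallel import is split into roughly the following 4 SQL statements, which run at the same time: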
select * from table where id >= 0 and id < 250
select * from table where id >= 250 and id < 500
select * from table where id >= 500 and id < 750
select * from table where id >= 750 and id <= 1000
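Note that this kind of range splitting works most naturally when the split column is an integer.
The example above shows how, when the table to import has no primary key, we should go about manually choosing a suitable split column and a suitable degree of parallelism.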
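The range splitting described above can be sketched in plain shell arithmetic. This is a simplified illustration, not sqoop's actual splitter code: the function name split_ranges is hypothetical, and it simply divides the interval between the minimum and maximum into equal integer ranges, with the last range including the maximum.

```shell
# split_ranges MIN MAX N COLUMN
# Print N where-conditions that together cover MIN..MAX of COLUMN.
# Simplified sketch; sqoop's real splitter may pick different boundaries.
split_ranges() {
  min=$1; max=$2; n=$3; col=$4
  i=0
  while [ "$i" -lt "$n" ]; do
    lo=$(( min + (max - min) * i / n ))
    hi=$(( min + (max - min) * (i + 1) / n ))
    if [ "$i" -eq $(( n - 1 )) ]; then
      # The last range includes the maximum value itself.
      echo "$col >= $lo AND $col <= $hi"
    else
      echo "$col >= $lo AND $col < $hi"
    fi
    i=$(( i + 1 ))
  done
}

split_ranges 1 1000 4 id
# id >= 1 AND id < 250
# id >= 250 AND id < 500
# id >= 500 AND id < 750
# id >= 750 AND id <= 1000
```

If most rows fall into one of these ranges, one map task does most of the work, which is why the distribution of the split column matters as much as its minimum and maximum.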
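Here is a practical example.
We want to import creater_user.popt_cas_redirect_his from Oracle.
This table has no primary key, so we need to choose a suitable split column manually.
First, look at which columns the table has.
I then guessed that ds_name might be a usable split column and ran the following SQL to check: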
select min(ds_name), max(ds_name) from creater_user.popt_cas_redirect_his where data_desc='2011-02-26'
The result is not good: min and max are equal, so this column is not suitable as a split column.
Now test another column, clientip:
select min(clientip), max(clientip) from creater_user.popt_cas_redirect_his where data_desc='2011-02-26'
This result is quite good, so we use clientip as the split column.
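The checks above can be generated for several candidate columns at once. The sketch below only prints the SQL; it does not connect to Oracle. The function name probe_sql is hypothetical, and count(distinct ...) is an extra check added here as an assumption, on the idea that a column with very few distinct values splits poorly even when its min and max differ.

```shell
# probe_sql COLUMN TABLE WHERE
# Print a query that shows the value range of a candidate split column.
probe_sql() {
  printf "select min(%s), max(%s), count(distinct %s) from %s where %s;\n" \
    "$1" "$1" "$1" "$2" "$3"
}

# Candidate columns taken from the example in the text.
for col in ds_name clientip; do
  probe_sql "$col" creater_user.popt_cas_redirect_his "data_desc='2011-02-26'"
done
```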
We therefore run the parallel import with the following command:
sqoop import --append --connect "$connecturl" --username "$oraclename" --password "$oraclepassword" --target-dir "$hdfspath" -m 12 --split-by clientip --table "$oralcetablename" --columns "$columns" --fields-terminated-by '\001' --where "data_desc='2011-02-26'"
This run took 20mins, 35sec and imported 33,222,896 rows.
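As a rough sanity check, the figures above can be turned into a throughput number with shell integer arithmetic. The per-mapper value assumes the 12 map tasks shared the work evenly, which the article does not confirm.

```shell
rows=33222896                    # rows imported, from the run above
seconds=$(( 20 * 60 + 35 ))      # 20mins, 35sec
mappers=12                       # value passed to -m

rows_per_sec=$(( rows / seconds ))
rows_per_sec_per_mapper=$(( rows_per_sec / mappers ))

echo "elapsed: ${seconds}s"                            # elapsed: 1235s
echo "throughput: ${rows_per_sec} rows/s"              # throughput: 26901 rows/s
echo "per mapper: ${rows_per_sec_per_mapper} rows/s"   # per mapper: 2241 rows/s
```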
In addition, if this kind of splitting does not meet our needs well, we can run several sqoop commands at the same time and put the split rule into the --where argument, for example:
sqoop import --append --connect "$connecturl" --username "$oraclename" --password "$oraclepassword" --target-dir "$hdfspath" -m 1 --table "$oralcetablename" --columns "$columns" --fields-terminated-by '\001' --where "data_desc='2011-02-26' and logtime<'10:00:00'"
sqoop import --append --connect "$connecturl" --username "$oraclename" --password "$oraclepassword" --target-dir "$hdfspath" -m 1 --table "$oralcetablename" --columns "$columns" --fields-terminated-by '\001' --where "data_desc='2011-02-26' and logtime>='10:00:00'"
This also achieves a parallel import.
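One way to launch such manually split imports together is to start each one as a background job and wait for all of them. The sketch below is a dry run by default: the wrapper run_sqoop and the DRY_RUN variable are hypothetical helpers, and the logtime conditions assume that logtime is stored as a string comparable with '10:00:00'.

```shell
# Values copied from the script earlier in the article.
connecturl=jdbc:oracle:thin:@20.135.60.21:1521:dwrac2
oraclename=kkaa
oraclepassword=kkaa123
oralcetablename=tt
columns=area_id,team_name
hdfspath=apps/as/hive/$oralcetablename

# Hypothetical wrapper: prints the command unless DRY_RUN=0 is set,
# in which case it calls the real sqoop binary.
run_sqoop() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf 'sqoop%s\n' "$(printf ' %s' "$@")"
  else
    sqoop "$@"
  fi
}

# import_slice CONDITION: one sequential import restricted to a slice of rows.
import_slice() {
  run_sqoop import --append --connect "$connecturl" \
    --username "$oraclename" --password "$oraclepassword" \
    --target-dir "$hdfspath" -m 1 --table "$oralcetablename" \
    --columns "$columns" --fields-terminated-by '\001' \
    --where "data_desc='2011-02-26' and $1"
}

import_slice "logtime<'10:00:00'" &
import_slice "logtime>='10:00:00'" &
wait   # block until both background imports have finished
```

Because the two jobs run concurrently, the order of their output is not fixed. The conditions must not overlap and must together cover every row, otherwise rows are duplicated or lost in the target directory.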
