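Scraping web page information with regular expressions

Published 2024/2/18 6:21:55, 36 views
Note: the regular expressions in this article work only on .NET, because they rely on .NET-specific balancing groups. The workflow is: send an HTTP request, receive the whole HTML page in the response, and then extract the required data from that HTML.
Part 1: Send the HttpWebRequest request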
C# code
using System.IO;
using System.Net;
using System.Text;

// Target URL
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("url");
// Set the browser type (User-Agent). This has to happen before GetResponse() sends the request.
request.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.0.04506; .NET CLR 3.5.21022; .NET CLR 1.0.3705; .NET CLR 1.1.4322)";
string htmlStr;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding("utf-8")))
{
    // The returned HTML page data
    htmlStr = reader.ReadToEnd();
}
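HttpWebRequest is marked obsolete in .NET 6 and later, so on newer runtimes the same fetch step is usually written with HttpClient. The following is only a minimal sketch under that assumption: PageFetcher, FetchHtmlAsync, and the User-Agent string are hypothetical names and values chosen for illustration, and the body is decoded as UTF-8 to mirror the code above.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class PageFetcher
{
    // Hypothetical helper: downloads a page and returns its HTML as a string.
    // The caller owns the HttpClient so that one instance can be reused.
    public static async Task<string> FetchHtmlAsync(HttpClient client, string url)
    {
        using (var request = new HttpRequestMessage(HttpMethod.Get, url))
        {
            // Illustrative User-Agent value; replace it with whatever the target site expects.
            request.Headers.TryAddWithoutValidation(
                "User-Agent", "Mozilla/5.0 (compatible; ExampleScraper/1.0)");

            using (HttpResponseMessage response = await client.SendAsync(request))
            {
                // Throws HttpRequestException on a non-success status code.
                response.EnsureSuccessStatusCode();

                // Decode as UTF-8, as the HttpWebRequest version does.
                byte[] body = await response.Content.ReadAsByteArrayAsync();
                return Encoding.UTF8.GetString(body);
            }
        }
    }
}
```

A caller would create one HttpClient, keep it for the lifetime of the application, and await PageFetcher.FetchHtmlAsync(client, "url") to obtain the string that is passed to the parsing methods in Part 2.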
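Part 2: Extract the useful data from the returned HTML. This approach works whenever you need to locate an HTML element by its id, class, or a similar attribute. The method below serves as an example.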
C# code
using System.Text;
using System.Text.RegularExpressions;

/// <summary>
/// Gets the colors
/// </summary>
/// <param name="htmlStr">HTML of the whole page</param>
/// <returns>Comma-separated color names, or an empty string if nothing is found</returns>
public string GetColor(string htmlStr)
{
    // Match the element whose class is detailsc_sku, including nested tags of the same name.
    // To match by id instead, start the pattern with:
    // @"<(?<htmltag>[\w]+)[^>]*\s[iI][dD]=(?<quote>"
    string pattern =
        @"<(?<htmltag>[\w]+)[^>]*\s[cC][lL][aA][sS][sS]=(?<quote>" +
        "[\"']?)detailsc_sku(?(quote)" +
        @"\k<quote>)" +
        "[\"']?[^>]*>" +
        @"((?<nested><\k<htmltag>[^>]*>)|</\k<htmltag>>(?<-nested>)|.*?)*</\k<htmltag>>";

    // HTML of the first detailsc_sku element (held in sizeHtml)
    string sizeHtml = Regex.Match(htmlStr, pattern, RegexOptions.Singleline).ToString();
    if (string.IsNullOrEmpty(sizeHtml))
        return "";

    // Remove that element so the same pattern now finds the next detailsc_sku element
    string newHtml = htmlStr.Replace(sizeHtml, "");
    string colorHtml = Regex.Match(newHtml, pattern, RegexOptions.Singleline).ToString();
    if (string.IsNullOrEmpty(colorHtml))
        return "";

    // Find all <a> tags in colorHtml
    Regex regex2 = new Regex(@"<a.*?>[\s\S]*?<\/a>");
    MatchCollection mc2 = regex2.Matches(colorHtml);
    StringBuilder sbs = new StringBuilder();
    // Loop over the links and collect the colors
    foreach (Match mm in mc2)
    {
        sbs.Append(RemoveHtml(mm.Value)).Append(",");
    }
    return sbs.ToString();
}
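The remark that the pattern can be switched from class to id suggests a small generalization. The sketch below passes the attribute name and value in as parameters and uses RegexOptions.IgnoreCase in place of the [cC][lL]... character classes. ElementFinder, BuildPattern, and FindElement are hypothetical names, and the quote handling is simplified to a single backreference, so treat this as an illustration of the balancing-group technique rather than a drop-in replacement.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ElementFinder
{
    // Builds a balancing-group pattern for an element that has attribute=value.
    public static string BuildPattern(string attribute, string value)
    {
        return
            // Opening tag: capture the tag name, then find attribute=value with optional quotes.
            @"<(?<htmltag>\w+)[^>]*\s" + Regex.Escape(attribute) +
            @"=(?<quote>[""']?)" + Regex.Escape(value) + @"\k<quote>[^>]*>" +
            // Body: push on each nested opening tag of the same name, pop on each
            // matching closing tag, and otherwise consume text lazily.
            @"((?<nested><\k<htmltag>[^>]*>)|</\k<htmltag>>(?<-nested>)|.*?)*" +
            // Closing tag that belongs to the element we started with.
            @"</\k<htmltag>>";
    }

    // Returns the first matching element, or an empty string if there is none.
    public static string FindElement(string html, string attribute, string value)
    {
        Match m = Regex.Match(html, BuildPattern(attribute, value),
                              RegexOptions.Singleline | RegexOptions.IgnoreCase);
        return m.Success ? m.Value : string.Empty;
    }
}
```

For example, ElementFinder.FindElement(html, "id", "main") is meant to return the whole element with id main, nested tags of the same name included, while ElementFinder.FindElement(html, "class", "detailsc_sku") corresponds to the first lookup in GetColor. As in the original pattern, the value has to appear at the start of the attribute and be followed directly by the closing quote, so a multi-class value such as class="a detailsc_sku" is not matched.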
C# code
using System.Text.RegularExpressions;

/// <summary>
/// Strips the HTML tags from a string and returns the text inside them
/// </summary>
/// <param name="src">HTML fragment</param>
/// <returns>The text content with tags removed and whitespace collapsed</returns>
public string RemoveHtml(string src)
{
    Regex htmlReg = new Regex(@"<[^>]+>", RegexOptions.Compiled | RegexOptions.IgnoreCase);
    Regex htmlSpaceReg = new Regex("\\&nbsp\\;", RegexOptions.Compiled | RegexOptions.IgnoreCase);
    // Runs of two or more whitespace characters
    Regex spaceReg = new Regex("\\s{2,}", RegexOptions.Compiled | RegexOptions.IgnoreCase);
    // Singleline lets "." cross line breaks, so multi-line style and script blocks are removed too
    Regex styleReg = new Regex(@"<style(.*?)</style>", RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);
    Regex scriptReg = new Regex(@"<script(.*?)</script>", RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);
    // Remove style and script blocks first, so their contents do not remain as text
    src = styleReg.Replace(src, string.Empty);
    src = scriptReg.Replace(src, string.Empty);
    src = htmlReg.Replace(src, string.Empty);
    src = htmlSpaceReg.Replace(src, " ");
    src = spaceReg.Replace(src, " ");
    return src.Trim();
}
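To see the last two steps together, here is a small self-contained sketch that finds the links in an HTML fragment and reduces each one to its text. LinkTextDemo and ExtractLinkTexts are hypothetical names; the tag stripping is a shortened version of RemoveHtml, and the result is returned as a list so that the caller decides how to join it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkTextDemo
{
    // Returns the visible text of every <a> element in the fragment.
    public static List<string> ExtractLinkTexts(string html)
    {
        var texts = new List<string>();
        // \b keeps the pattern from also matching tags such as <abbr>.
        foreach (Match m in Regex.Matches(html, @"<a\b[^>]*>[\s\S]*?</a>", RegexOptions.IgnoreCase))
        {
            // Drop every tag, turn &nbsp; into a space, then collapse runs of whitespace.
            string text = Regex.Replace(m.Value, @"<[^>]+>", string.Empty);
            text = text.Replace("&nbsp;", " ");
            text = Regex.Replace(text, @"\s{2,}", " ").Trim();
            texts.Add(text);
        }
        return texts;
    }
}
```

Calling string.Join(",", LinkTextDemo.ExtractLinkTexts(colorHtml)) gives a comma-separated list similar to the return value of GetColor, but without the trailing comma.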