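This article shows, by example, how to use regular expressions in C# to extract the link (href) and innerHTML of <a> tags. It is shared here for reference; the details are as follows: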
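using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// Read the page's HTML (test.txt is saved as GB2312).
// On .NET Core and .NET 5+, GB2312 is only available after calling
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
string text = File.ReadAllText(Path.Combine(Environment.CurrentDirectory, "test.txt"), Encoding.GetEncoding("gb2312"));

// "url" captures the href value and "text" captures the innerHTML.
// Attribute values may be double- or single-quoted; [-\w]+ also allows
// hyphenated attribute names such as data-id.
string pattern = "<a(\\s+(href=(\"(?<url>[^\"]*)\"|'(?<url>[^']*)')|[-\\w]+=(\"[^\"]*\"|'[^']*')))+>(?<text>.*?)</a>";
var matches = Regex.Matches(text, pattern);

// Write the extracted values to a file.
using (FileStream w = new FileStream(Path.Combine(Environment.CurrentDirectory, "wirter.txt"), FileMode.Create))
{
    for (int i = 0; i < matches.Count; i++)
    {
        byte[] bs = Encoding.UTF8.GetBytes(string.Format("Link URL: {0}, innerHTML: {1}", matches[i].Groups["url"].Value, matches[i].Groups["text"].Value) + "\r\n");
        w.Write(bs, 0, bs.Length);
        Console.WriteLine();
    }
}
Console.ReadKey();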
[Figure: diagram of the regular expression]
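Since the diagram is not reproduced here, the following is a minimal sketch of how an <a> pattern of this kind can be written with inline comments by using RegexOptions.IgnorePatternWhitespace. The class and method names (AnchorExtractor, Extract) and the sample HTML are assumptions made for illustration. The pattern is a variant rather than a literal copy: it accepts single-quoted values and hyphenated attribute names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original article.
public static class AnchorExtractor
{
    // With RegexOptions.IgnorePatternWhitespace, whitespace in the pattern
    // is ignored and everything after '#' on a line is a comment.
    private const string Pattern = @"
        <a                                                # start of the tag
        (?:\s+                                            # whitespace before each attribute
           (?: href=(?:""(?<url>[^""]*)""|'(?<url>[^']*)')  # href value, captured as url
             | [-\w]+=(?:""[^""]*""|'[^']*')              # any other quoted attribute
           )
        )+
        \s*>                                              # end of the opening tag
        (?<text>.*?)                                      # innerHTML, matched lazily
        </a>                                              # closing tag";

    // Returns one (Url, Text) pair per <a> tag that the pattern matches.
    public static List<(string Url, string Text)> Extract(string html)
    {
        var results = new List<(string Url, string Text)>();
        foreach (Match m in Regex.Matches(html, Pattern, RegexOptions.IgnorePatternWhitespace))
        {
            results.Add((m.Groups["url"].Value, m.Groups["text"].Value));
        }
        return results;
    }

    public static void Main()
    {
        // Sample input invented for this sketch.
        string html = "<p><a href=\"https://example.com/a\" class=\"nav\">First <b>link</b></a> "
                    + "<a target='_blank' href='/b'>Second</a></p>";

        foreach (var (url, text) in Extract(html))
        {
            Console.WriteLine($"url: {url}, innerHTML: {text}");
        }
    }
}
```

Because `.` does not match newlines by default, links whose content spans several lines are skipped unless RegexOptions.Singleline is added as well. As with any regex-based approach, this is only suitable for reasonably regular markup.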
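A friend needed to extract the src and data-url attributes of img tags. The approach is much the same as above, so that code is included here as well: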
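using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// Read the page's HTML (test.txt is saved as GB2312).
string text = File.ReadAllText(Path.Combine(Environment.CurrentDirectory, "test.txt"), Encoding.GetEncoding("gb2312"));

// "src" and "dataurl" capture the two attributes of interest; any other
// double-quoted attribute is skipped. Only tags that end in "/>" match.
string pattern = "<img(\\s*(src=\"(?<src>[^\"]*?)\"|data-url=\"(?<dataurl>[^\"]*?)\"|[-\\w]+=\"[^\"]*?\"))*\\s*/>";
var matches = Regex.Matches(text, pattern);

// Write the extracted values to a file.
using (FileStream w = new FileStream(Path.Combine(Environment.CurrentDirectory, "wirter.txt"), FileMode.Create))
{
    for (int i = 0; i < matches.Count; i++)
    {
        byte[] bs = Encoding.UTF8.GetBytes(string.Format("Image src: {0}, image data-url: {1}", matches[i].Groups["src"].Value, matches[i].Groups["dataurl"].Value) + "\r\n");
        w.Write(bs, 0, bs.Length);
        Console.WriteLine();
    }
}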
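To try the img pattern without any files, here is a small sketch that runs it against an in-memory string. The class and method names (ImageExtractor, Extract) and the sample markup are assumptions made for illustration; the pattern itself is the one used above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original article.
public static class ImageExtractor
{
    // Same img pattern as in the article, written as a verbatim string.
    private const string Pattern =
        @"<img(\s*(src=""(?<src>[^""]*?)""|data-url=""(?<dataurl>[^""]*?)""|[-\w]+=""[^""]*?""))*\s*/>";

    // Returns one (Src, DataUrl) pair per matched img tag.
    // A group that did not take part in the match yields an empty string.
    public static List<(string Src, string DataUrl)> Extract(string html)
    {
        var results = new List<(string Src, string DataUrl)>();
        foreach (Match m in Regex.Matches(html, Pattern))
        {
            results.Add((m.Groups["src"].Value, m.Groups["dataurl"].Value));
        }
        return results;
    }

    public static void Main()
    {
        // Sample input invented for this sketch.
        string html = "<img src=\"/img/a.png\" data-url=\"/full/a.png\" alt=\"A\" />"
                    + "<img class=\"lazy\" data-url=\"/full/b.png\"/>";

        foreach (var (src, dataUrl) in Extract(html))
        {
            Console.WriteLine($"src: {src}, data-url: {dataUrl}");
        }
    }
}
```

Note that the pattern requires the self-closing form, so a tag written as `<img src="x">` without the trailing `/` is not matched, and single-quoted attribute values are not handled.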
That covers how to extract <a> tag links and innerHTML with regular expressions in C#.
