GuanGuan Collector (关关采集器) Rule Tutorial

Source: 百度文库 (Baidu Wenku)  Editor: 九乡新闻网  Date: 2024/05/04 22:46:44

The GuanGuan collector (关关采集器) scrapes mainly by means of regular expressions. The most common expressions are listed below.

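\d* matches digits (zero or more)
\s* matches whitespace, including spaces and line breaks (zero or more)
.+? matches any characters, as few as possible (cannot be empty)
.* matches any characters (can be empty)
() marks the part we want to capture
((.|\n)*) captures the chapter content, line breaks included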
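As a rough illustration of the expressions above, the following Python sketch runs them against a small invented HTML fragment. The sample markup and variable names are assumptions made for the demo; the collector itself applies such patterns through its rule file, not through Python.

```python
import re

# Invented sample markup, not taken from any real site.
html = '<a href="/Book/123/Index.aspx">Title A</a>\n<div>line 1\nline 2</div>'

# \d* matches a run of digits (possibly empty); the parentheses capture it.
book_id = re.search(r'/Book/(\d*)/Index\.aspx', html).group(1)

# .+? is non-greedy: it stops at the first point where the rest can match.
title = re.search(r'<a href=".+?">(.+?)</a>', html).group(1)

# "." does not match a newline by default, so (.|\n)* is used to span lines.
body = re.search(r'<div>((.|\n)*?)</div>', html).group(1)

print(book_id)     # 123
print(title)       # Title A
print(repr(body))  # 'line 1\nline 2'
```

The non-greedy `?` matters in practice: with a greedy `.*` the capture would run to the last `</a>` or `</div>` on the page rather than the nearest one.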
=====Jieqi (杰奇) equivalents=====
!!!! is equivalent to ([^><]*)
~~~~ is equivalent to ([^><'"]*)
^^^^ is equivalent to ([^><\d]*)
$$$$ is equivalent to ([\d]*)
**** is equivalent to (.*)
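The mapping above can be sketched as a small converter that turns a Jieqi-style rule into a regular expression. The function name `jieqi_to_regex` and the sample rule are hypothetical, and treating all non-token text as literal is an assumption of this sketch rather than documented Jieqi behavior.

```python
import re

# Shorthand tokens and the regex fragments the table above equates them with.
JIEQI_TOKENS = {
    "!!!!": r"([^><]*)",
    "~~~~": r"([^><'\"]*)",
    "^^^^": r"([^><\d]*)",
    "$$$$": r"([\d]*)",
    "****": r"(.*)",
}

def jieqi_to_regex(rule: str) -> str:
    """Expand the shorthand tokens and escape everything else as literal text."""
    token_re = "|".join(re.escape(t) for t in JIEQI_TOKENS)
    # The capturing group makes re.split keep the tokens in the result list.
    parts = re.split(f"({token_re})", rule)
    return "".join(
        JIEQI_TOKENS[p] if p in JIEQI_TOKENS else re.escape(p) for p in parts
    )

# Hypothetical rule and page fragment, for illustration only.
pattern = jieqi_to_regex('<a href="/book/$$$$.html">!!!!</a>')
match = re.search(pattern, '<a href="/book/42.html">Some Title</a>')
print(match.groups())  # ('42', 'Some Title')
```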
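=====Other basics=====
. matches any single character. For example, the regular expression r.t matches the strings rat, rut, and r t, but not root.
$ matches the end of a line. For example, the regular expression weasel$ matches the end of the string "He's a weasel" but not the string "They are a bunch of weasels."
^ matches the beginning of a line. For example, the regular expression ^When in matches the beginning of the string "When in the course of human events" but not "What and When in the".
* matches zero or more occurrences of the element immediately before it. For example, the regular expression .* matches any number of any characters.
\ is the escape character; it makes the metacharacters listed here match as ordinary characters. For example, the regular expression \$ matches a dollar sign rather than the end of a line, and likewise \. matches a literal period rather than acting as a wildcard for any character.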
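The examples above can be checked with a few lines of Python. This is only a sketch using Python's `re` module; the variable names and the price string are made up for the demonstration.

```python
import re

# "." matches exactly one character, so r.t cannot match "root".
dot_matches = [w for w in ["rat", "rut", "r t", "root"] if re.fullmatch(r"r.t", w)]

# "$" anchors the match at the end of the text.
end_ok = bool(re.search(r"weasel$", "He's a weasel"))
end_fail = bool(re.search(r"weasel$", "They are a bunch of weasels."))

# "^" anchors the match at the start of the text.
start_ok = bool(re.search(r"^When in", "When in the course of human events"))
start_fail = bool(re.search(r"^When in", "What and When in the"))

# A backslash turns "$" and "." into ordinary characters.
prices = re.findall(r"\$\d+\.\d+", "Price: $12.50, not 1250")

print(dot_matches)           # ['rat', 'rut', 'r t']
print(end_ok, end_fail)      # True False
print(start_ok, start_fail)  # True False
print(prices)                # ['$12.50']
```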
Universal image rule:
<[^<]*((?<=<(?:img|IMG)[^>]*(?:(?:src|SRC)(?:\s*=\s*(?:["']?))))(?:[^\s"'>]*)\.(?:jpg|gif|jpeg|bmp|png|GIF|JPG))[^>]*>
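The image rule relies on a lookbehind of variable length, which some engines (for example .NET's) accept but Python's `re` module rejects. The sketch below is therefore a simplified stand-in, not the original rule: it captures the `src` value of `<img>` tags whose URL ends in a common image extension. The pattern name `IMG_RE` and the sample HTML are assumptions for the demo.

```python
import re

# Simplified stand-in for the universal image rule (no lookbehind).
# Group 1 is the image URL; re.IGNORECASE covers IMG/SRC/JPG variants.
IMG_RE = re.compile(
    r"""<img\b[^>]*?\bsrc\s*=\s*["']?([^\s"'>]+\.(?:jpe?g|gif|bmp|png))[^>]*>""",
    re.IGNORECASE,
)

# Invented sample markup.
html = (
    '<p>text</p><img src="/files/a.jpg" alt="x">'
    "<IMG SRC='http://example.com/b.PNG'>"
    '<img src="/files/readme.txt">'
)

images = IMG_RE.findall(html)
print(images)  # ['/files/a.jpg', 'http://example.com/b.PNG']
```

The third tag is skipped because its URL does not end in one of the listed extensions.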
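Appendix: collection rules for the 藏海阁 literature site (藏海阁文学网), which is text-only. Each entry below gives the rule name followed by its pattern; the Match and None lines appear to be the match mode and regex options, and a further pattern line, where present, appears to be a filter applied to the captured text.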



    Match
    None


    RuleID
    1
    Match
    None


    GetSiteName
    藏海阁
    Match
    None


    GetSiteCharset
    utf-8
    Match
    None


    GetSiteUrl
    http://www.canghaige.com/
    Match
    None


    NovelSearchUrl
    http://www.canghaige.com/Book/Search.aspx
    Match
    None


    NovelSearchData
    SearchKey={SearchKey}&SearchClass=1
    Match
    None


    NovelSearch_GetNovelKey
    <div id="CListTitle"><a href="/Book/(\d*)/Index.aspx" target="_blank"><b>{SearchKey}</b></a>
    Match
    None


    NovelListUrl
    http://www.canghaige.com/type/1/
    Match
    None


    NovelList_GetNovelKey
    <a href="http://www.canghaige.com/books/(\d*)/" id=".+?" title=".+?">(.+?)</a>
    Match
    None


    NovelUrl
    http://www.canghaige.com/books/{NovelKey}/
    Match
    None


    NovelErr
    未找到该编号的书籍信息
    Match
    None


    NovelName
    <h1>(.+?)</h1>
    Match
    None


    NovelAuthor
    作者：(.+?)</span>
    Match
    None


    LagerSort
    书籍类别:(.+?)</span>
    Match
    None


    SmallSort
    书籍类别:(.+?)</span>
    Match
    None


    NovelIntro
    <div>内容简介：((.|\n)*?)</div>\s*</li>
    Match
    None
    <span(.|\n)+?</span>|<p>|<a.+?</a>|</div>

    NovelKeyword
    <h1>(.+?)</h1>
    Match
    None


    NovelDegree
    连载状态：(.+?)</span>
    Match
    None


    NovelCover
    <a class="pic"><img src="(.+?)"
    Match
    None


    NovelDefaultCoverUrl

    Match
    None


    NovelInfo_GetNovelPubKey
    连载状态：(.+?)</span>
    Match
    None


    PubCookies

    Match
    None


    PubIndexUrl
    http://www.canghaige.com/books/{NovelKey}/
    Match
    None


    PubIndexErr
    这里必须填写
    Match
    None


    PubVolumeContent

    Match
    None


    PubVolumeSplit
    <h3>
    Spilt
    None


    PubVolumeName
    Title">(.+?)</div>
    Match
    None
     

    PubChapterName
    <li><a href="http://www.canghaige.com/book/\d*/\d*/">([^<]+?)</a>
    Match
    None


    PubChapter_GetChapterKey
    <li><a href="(http://www.canghaige.com/book/\d*/\d*/)">[^<]+?</a>
    Match
    None


    PubContentUrl
    {ChapterKey}
    Match
    None


    PubContentErr
    这里必须填写
    Match
    None


    PubContent_GetTextKey

    Match
    None


    PubTextUrl

    Match
    None


    PubContentText
    <div id="zjneirong" s="font-size:14px;width:100%;">((.|\n)+?)<hr
    Match
    None
    <div.+?>|<div>|</div>|<DIV.+?>|</DIV>|<script(.|\n)+?</script>|<style(.|\n)+?</style>|<a(.|\n)+?</a>

    PubContentReplace

    Match
    None


    PubContentImages
    <[^<]*((?<=<(?:img|IMG)[^>]*(?:(?:src|SRC)(?:\s*=\s*(?:["']?))))(?:[^\s"'>]*)\.(?:jpg|gif|jpeg|bmp|png|GIF|JPG))[^>]*>
    Match
    None
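To see how a few of these rules behave, the sketch below applies them with Python's `re` module to an invented page. The patterns are copied from the listing above; the sample HTML, the helper `first`, and the reading of the extra NovelIntro pattern as a filter whose matches are deleted are all assumptions of this sketch.

```python
import re

# Patterns copied from the rule listing; the page below is invented.
RULES = {
    "NovelName": r"<h1>(.+?)</h1>",
    "NovelAuthor": r"作者：(.+?)</span>",
    "PubChapterName": r'<li><a href="http://www.canghaige.com/book/\d*/\d*/">([^<]+?)</a>',
    "PubChapter_GetChapterKey": r'<li><a href="(http://www.canghaige.com/book/\d*/\d*/)">[^<]+?</a>',
}
INTRO_RULE = r"<div>内容简介：((.|\n)*?)</div>\s*</li>"
# Assumed to be a filter: whatever it matches is removed from the capture.
INTRO_FILTER = r"<span(.|\n)+?</span>|<p>|<a.+?</a>|</div>"

page = """<h1>Sample Novel</h1>
<span>作者：Someone</span>
<li><div>内容简介：<p>First line.
<a href="#">ad</a>Second line.</div>
</li>
<li><a href="http://www.canghaige.com/book/12/345/">Chapter 1</a></li>
<li><a href="http://www.canghaige.com/book/12/346/">Chapter 2</a></li>"""

def first(rule_name: str) -> str:
    """Return the first capture group of the named rule (hypothetical helper)."""
    return re.search(RULES[rule_name], page).group(1)

name = first("NovelName")
author = first("NovelAuthor")
chapter_names = re.findall(RULES["PubChapterName"], page)
chapter_urls = re.findall(RULES["PubChapter_GetChapterKey"], page)
intro = re.sub(INTRO_FILTER, "", re.search(INTRO_RULE, page).group(1))

print(name, author)   # Sample Novel Someone
print(chapter_names)  # ['Chapter 1', 'Chapter 2']
print(chapter_urls)
print(repr(intro))    # 'First line.\nSecond line.'
```

Pairing `PubChapterName` with `PubChapter_GetChapterKey` in this way yields chapter titles and chapter URLs in the same order, which is what `PubContentUrl` ({ChapterKey}) then consumes.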
