Rank::CMS - 基于特征识别超过 600 种 CMS。可识别所有流行的论坛、博客、CMS、留言板、维基及多种其他类型的引擎
爬虫工具概览

Rank::CMS – 根据特征识别超过 600 种 CMS。可识别所有流行的论坛、博客、CMS、留言板、维基以及许多其他类型的引擎。A-Parser 的功能允许保存 Rank::CMS 爬虫工具的抓取设置以便后续使用(预设),设置抓取计划等等。
由于内置了强大的 Template Toolkit 模板引擎,可以将结果保存为您需要的任何形式和结构,它允许对结果应用额外的逻辑,并以各种格式输出数据,包括 JSON、SQL 和 CSV。
采集的数据
- CMS 名称
- 类别名称
支持的 CMS 列表
"1C-Bitrix", "2z Project", "3dCart", "Accessible Portal", "actionhero.js", "Adobe CQ5", "Ametys", "Amiro.CMS", "AMPcms", "Anchor CMS", "AsciiDoc", "Backdrop", "Banshee", "BIGACE", "Bolt", "BrowserCMS", "Business Catalyst", "Cargo", "Chameleon", "Ckan", "CMS Made Simple", "CMSimple", "Concrete5", "Contao", "Contenido", "Contens", "ContentBox", "Cotonti", "CPG Dragonfly", "CppCMS", "Craft CMS", "Danneo CMS", "DataLife Engine", "DedeCMS", "Django CMS", "DNN", "Dotclear", "Drupal", "DTG", "Dynamicweb", "e107", "Eleanor CMS", "EPiServer", "eSyndiCat", "ExpressionEngine", "eZ Publish", "FlexCMP", "GetSimple CMS", "Google Sites", "Graffiti CMS", "Grav", "Green Valley CMS", "GX WebManager", "Hippo", "Hotaru CMS", "IBM WebSphere Portal", "ImpressCMS", "ImpressPages", "Indexhibit", "Indico", "InProces", "InstantCMS", "io4 CMS", "Jalios", "Jekyll", "Joomla", "Kentico CMS", "Koala Framework", "Koken", "Kolibri CMS", "Komodo CMS", "Koobi", "Kooboo CMS", "Kotisivukone", "LEPTON", "Liferay", "LightMon Engine", "Lithium", "LiveStreet CMS", "Locomotive", "M.R. Inc Wild CMS", "Mambo", "MaxSite CMS", "Methode", "Microsoft SharePoint", "MODx", "Moguta.CMS", "Mono.net", "Movable Type", "Mozard Suite", "Mura CMS", "Mynetcap", "Nepso", "October CMS", "Odoo", "OpenCms", "openEngine", "OpenNemas", "OpenText Web Solutions", "Ophal", "Orchard CMS", "Pagekit", "PANSITE", "papaya CMS", "PencilBlue", "Percussion", "PHP-Fusion", "phpCMS", "phpSQLiteCMS", "phpwind", "Pligg", "Plone", "Posterous", "Quick.CMS", "RBS Change", "RCMS", "RiteCMS", "Roadiz CMS", "S.Builder", "Sarka-SPIP", "SDL Tridion", "Serendipity", "Silva", "SilverStripe", "SIMsite", "Sitecore", "SiteEdit", "Sivuviidakko", "SmartSite", "sNews", "Solodev", "SPIP", "Squarespace", "Squiz Matrix", "Subrion", "swift.engine", "Textpattern CMS", "Thelia", "TiddlyWiki", "Tiki Wiki CMS Groupware", "Twilight CMS", "TYPO3 CMS", "TYPO3 Neos", "uCore", "Umbraco", "Unbounce", "Ushahidi", "viennaCMS", "Vignette", "VIVVO", "webEdition", "WebGUI", "WebPublisher", "Webs", "WebsiteBaker", "WebsPlanet", "Weebly", "Wix", "Wolf CMS", "WordPress", "XOOPS"
功能
- 基于特征识别 161 种 CMS
- 基于庞大且高质量的 Wappalyzer 特征库(共计 800 多种技术),识别所有流行的论坛、博客、CMS、留言板、维基和许多其他类型的引擎
- 可以选择特定类别或具体引擎进行识别
- 可以指定自定义 User-Agent
- 可以修改和补充特征库
- 可以使用自定义特征文件(custom-apps.json 文件的结构应与常规 apps.json 相同,并位于 files/Rank-CMS 路径下,如果操作正确,在 Check list 选项列表末尾将出现新的类别和应用供选择)
应用场景
- 按引擎过滤
- 按引擎对大型数据库进行排序
查询
查询时需要指定域名列表,例如:
http://a-parser.com/
http://techcrunch.com/
http://vkusnologia.ru/
http://blogautomobile.fr/
http://avto-blogger.ru/
http://www.cyberforum.ru/
结果输出示例
得益于内置的 Template Toolkit 模板引擎,A-Parser 支持灵活的结果格式化,这使其能够以任意形式以及结构化形式(如 CSV 或 JSON)输出结果。
默认输出
结果格式:
$query - $cms\n
结果示例:
http://blogautomobile.fr/- WordPress
http://a-parser.com/ - XenForo
http://vkusnologia.ru/ - WordPress
http://avto-blogger.ru/ - WordPress
http://techcrunch.com/ - WordPress
http://www.cyberforum.ru/ - 1C-Bitrix
以 SQL 格式保存
结果格式:
[% "INSERT INTO cms VALUES('" _ query _ "', '" _ cms _ "', '" _ cat _ "')\n" %]
结果示例:
INSERT INTO cms VALUES('http://yandex.ru', 'unknown', 'unknown')
INSERT INTO cms VALUES('http://vk.com', 'unknown', 'unknown')
INSERT INTO cms VALUES('http://facebook.com', 'unknown', 'unknown')
INSERT INTO cms VALUES('http://a-parser.com', 'WordPress', 'CMS')
INSERT INTO cms VALUES('http://youtube.com', 'unknown', 'unknown')
INSERT INTO cms VALUES('http://google.com', 'unknown', 'unknown')
将结果转储为 JSON
通用结果格式:
[% IF notFirst;
",\n";
ELSE;
notFirst = 1;
END;
obj = {};
obj.query = query;
obj.cms = p1.cms;
obj.cat = p1.cat;
obj.json %]
起始文本:
[
结束文本:
]
结果示例:
[
{"cat":"unknown","cms":"unknown","query":"http://google.com"},
{"cat":"unknown","cms":"unknown","query":"http://yandex.ru"},
{"cat":"unknown","cms":"unknown","query":"http://facebook.com"},
{"cat":"CMS","cms":"WordPress","query":"http://a-parser.com"},
{"cat":"unknown","cms":"unknown","query":"http://vk.com"},
{"cat":"unknown","cms":"unknown","query":"http://youtube.com"}
]
要在任务编辑器中使用“Prepend text”和“Append text”选项,需要激活“More options”。
可能的设置
| 参数 | 默认值 | 描述 |
|---|---|---|
| User agent | _自动插入当前版本 Chrome 的 user-agent_ | 允许伪装成特定的浏览器或搜索引擎 |
| Log long running regex | ☐ | 确定是否记录慢速正则表达式 |
| Check list | cms, message-boards, wikis | 选择要检查的引擎 |
| Emulate browser headers | ☑ | 能够模拟浏览器请求头 |
| RegExp engine | RE2 | 选择正则表达式引擎 |
| Use Net::HTTP | ☐ | 可以使用 Net::HTTP 爬虫工具进行请求 |
| Net::HTTP preset | default | 可以指定带有设置的预设 |
