SE::Yandex::Speller - checking page text for errors via Yandex.Speller
Scraper overview

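SE::Yandex::Speller uses the Yandex.Speller service to find spelling errors in Russian, Ukrainian, or English text on the specified pages. Its language model contains hundreds of millions of words and phrases. A-Parser lets you save the SE::Yandex::Speller scraper settings for future use (presets), set up a scraping schedule, and more.
Thanks to the built-in Template Toolkit template engine, results can be saved in whatever form and structure you need. This makes it possible to apply additional logic to the results and to output data in various formats, including JSON, SQL, and CSV.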
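Collected data
- Text fragments in which errors were found
Capabilities
- Counts the text fragments that contain errors
- Outputs the probable cause of each error in the text
Use cases
- Counting the text fragments that contain errors
- Checking the text of website pages for spelling errors
- Proofreading the spelling of website pages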
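Queries
The scraper accepts keywords (lines of text) or page links as input. The query type is detected automatically.
- Example queries as lines of text:
Yandex Speller scraper checks text
query with an error
- Example queries as addresses of website pages to check: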
https://a-parser.com/
https://en.wikipedia.org/wiki/Parsing
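The scraper detects the query type on its own, and the exact rule it applies is not described here. As an illustration only, the following Python sketch shows how a query list could be pre-sorted into page links and plain text lines before it is submitted. The function name `classify_query` and the rule itself (an `http` or `https` scheme plus a host) are assumptions made for this sketch, not A-Parser's actual logic.

```python
from urllib.parse import urlparse


def classify_query(query: str) -> str:
    """Return "url" for http(s) links that have a host, otherwise "text".

    Assumed rule for illustration; A-Parser's own detection may differ.
    """
    parsed = urlparse(query.strip())
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    return "text"


queries = [
    "query with an error",
    "https://a-parser.com/",
    "https://en.wikipedia.org/wiki/Parsing",
]

for q in queries:
    # Print the detected kind next to each query
    print(f"{classify_query(q)}\t{q}")
```

A check like this is only useful for validating your own input file; the scraper does not require it.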
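Output examples
Thanks to the built-in Template Toolkit template engine, A-Parser supports flexible result formatting, so results can be output in arbitrary form as well as in structured formats such as CSV or JSON.
Default output
Result format: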
$query: $total\n$errors.format('$word ($suggest) - $type\n')
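Example result:
query with an error: 1
obshibkoy (oshibkoy,obshivkoy) - Word is not in the dictionary.
Yandex Speller scraper checks text: 0
https://a-parser.com/: 10
podskazkazok (podskazok) - Word is not in the dictionary.
danykh (dannykh,danykh) - Word is not in the dictionary.
MOZ (DMOZ) - Word is not in the dictionary.
NodeJS (Node JS) - Word is not in the dictionary.
Razrabatyvay (Razrabatyvayu) - Word is not in the dictionary.
...
https://en.wikipedia.org/wiki/Parsing: 183
• العربية (• العربية) - The text contains too many errors.
• বাংলা (• বাংলা) - The text contains too many errors.
...
material (material) - Word is not in the dictionary.
parsed (passed) - Word is not in the dictionary.
they (that) - Word is not in the dictionary.
...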
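If the default output is saved to a file, it can be read back into structured records. The sketch below assumes the default result format shown above, that is, a `query: total` line followed by `word (suggest) - type` lines. The function name `parse_default_output` and the record layout are hypothetical, and the sample text is a shortened version of the example result.

```python
import re

# "word (suggest1,suggest2) - type"
ERROR_RE = re.compile(r"^(?P<word>.*) \((?P<suggest>.*)\) - (?P<type>.*)$")
# "query: total"
HEADER_RE = re.compile(r"^(?P<query>.*): (?P<total>\d+)$")


def parse_default_output(text):
    """Group error lines under the query line that precedes them."""
    results = []
    for line in text.splitlines():
        error = ERROR_RE.match(line)
        header = HEADER_RE.match(line)
        if error and results:
            suggest = error["suggest"]
            results[-1]["errors"].append({
                "word": error["word"],
                "suggest": suggest.split(",") if suggest else [],
                "type": error["type"],
            })
        elif header:
            results.append({
                "query": header["query"],
                "total": int(header["total"]),
                "errors": [],
            })
        # Any other line (for example "...") is ignored
    return results


sample = """query with an error: 1
obshibkoy (oshibkoy,obshivkoy) - Word is not in the dictionary.
https://a-parser.com/: 10
MOZ (DMOZ) - Word is not in the dictionary.
NodeJS (Node JS) - Word is not in the dictionary.
"""

for result in parse_default_output(sample):
    print(result["query"], result["total"], len(result["errors"]))
```

Parsing free-form text this way is fragile; when the results are meant for further processing, a structured format such as JSON is usually the safer choice.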
Saving in SQL format
Result format:
[% FOREACH errors;
"INSERT INTO errors VALUES('" _ word _ "', '" _ suggest _ "', '" _ type _ "')\n";
END %]
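Example result:
INSERT INTO errors VALUES('SaaS', 'Seas', 'Word is not in the dictionary.')
INSERT INTO errors VALUES('businesses and freelancers', '', 'Word is not in the dictionary.')
INSERT INTO errors VALUES('affiliate marketers', 'affiliate marketers', 'Word is not in the dictionary.')
INSERT INTO errors VALUES('Youtube', 'YouTube', 'Incorrect use of uppercase and lowercase letters.')
INSERT INTO errors VALUES('email', 'mail', 'Word is not in the dictionary.')
INSERT INTO errors VALUES('WordStat', '', 'Word is not in the dictionary.')
INSERT INTO errors VALUES('link building', '', 'Word is not in the dictionary.')
INSERT INTO errors VALUES('outreach', '', 'Word is not in the dictionary.')
INSERT INTO errors VALUES('Alexa', '', 'Word is not in the dictionary.')
INSERT INTO errors VALUES('SEMRush', '', 'Word is not in the dictionary.')
INSERT INTO errors VALUES('Ahrefs', 'Href', 'Word is not in the dictionary.')
INSERT INTO errors VALUES('MajesticSEO', '', 'Word is not in the dictionary.')
INSERT INTO errors VALUES('SerpStat', '', 'Word is not in the dictionary.')
INSERT INTO errors VALUES('businesses and freelancers', '', 'Word is not in the dictionary.')
INSERT INTO errors VALUES('SaaS', 'Saab,Seas,SAS', 'Word is not in the dictionary.')
INSERT INTO errors VALUES('SaaS', 'Seas,SAS', 'Word is not in the dictionary.')
INSERT INTO errors VALUES('NodeJS', 'Nodes', 'Word is not in the dictionary.')
INSERT INTO errors VALUES('NodeJS', 'Nodes', 'Word is not in the dictionary.')
INSERT INTO errors VALUES('async', 'sync', 'Word is not in the dictionary.')
INSERT INTO errors VALUES('lead generation', 'lead generation', 'Word is not in the dictionary.')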
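The template above builds each statement by string concatenation, so a word that contains an apostrophe would produce invalid SQL. One possible alternative is to load the rows with parameter binding instead. The sketch below uses Python's built-in `sqlite3` module; the three-column `errors` table layout and the sample row with an apostrophe are assumptions for illustration.

```python
import sqlite3

# Rows in the (word, suggest, type) shape produced by the scraper
rows = [
    ("Youtube", "YouTube", "Incorrect use of uppercase and lowercase letters."),
    ("NodeJS", "Nodes", "Word is not in the dictionary."),
    ("don't", "", "Word is not in the dictionary."),  # hypothetical row with an apostrophe
]

conn = sqlite3.connect(":memory:")
# Assumed table layout matching the three values in the INSERT statements
conn.execute("CREATE TABLE errors (word TEXT, suggest TEXT, type TEXT)")

# Placeholders let the driver handle quoting, so apostrophes are safe
conn.executemany("INSERT INTO errors VALUES (?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM errors").fetchone()[0]
stored = conn.execute(
    "SELECT word FROM errors WHERE word = ?", ("don't",)
).fetchone()[0]
print(count, stored)
```

The same idea applies to other database drivers, although the placeholder syntax varies between them.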
Dumping results to JSON
General result format:
[% IF notFirst;
",\n";
ELSE;
notFirst = 1;
END;
obj = {};
obj.errors = p1.errors;
obj.json %]
Initial text:
[
Final text:
]
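Example result:
[{"errors": [{"word":"SaaS","suggest":"Seas","type":"Word is not in the dictionary."},{"word":"businesses and freelancers","suggest":"","type":"Word is not in the dictionary."},{"word":"affiliate marketers","suggest":"affiliate marketers","type":"Word is not in the dictionary."},{"word":"Youtube","suggest":"YouTube","type":"Incorrect use of uppercase and lowercase letters."},{"word":"email","suggest":"mail","type":"Word is not in the dictionary."},{"word":"WordStat","suggest":"","type":"Word is not in the dictionary."},{"word":"link building","suggest":"","type":"Word is not in the dictionary."},{"word":"outreach","suggest":"","type":"Word is not in the dictionary."},{"word":"Alexa","suggest":"","type":"Word is not in the dictionary."},{"word":"SEMRush","suggest":"","type":"Word is not in the dictionary."},{"word":"Ahrefs","suggest":"Href","type":"Word is not in the dictionary."},{"word":"MajesticSEO","suggest":"","type":"Word is not in the dictionary."},{"word":"SerpStat","suggest":"","type":"Word is not in the dictionary."},{"word":"businesses and freelancers","suggest":"","type":"Word is not in the dictionary."},{"word":"SaaS","suggest":"Saab,Seas,SAS","type":"Word is not in the dictionary."},{"word":"SaaS","suggest":"Seas,SAS","type":"Word is not in the dictionary."},{"word":"NodeJS","suggest":"Nodes","type":"Word is not in the dictionary."},{"word":"A-Parser","suggest":"","type":"Word is not in the dictionary."},{"word":"NodeJS","suggest":"Nodes","type":"Word is not in the dictionary."},{"word":"async","suggest":"sync","type":"Word is not in the dictionary."},{"word":"lead generation","suggest":"lead generation","type":"Word is not in the dictionary."},{"word":"scraping","suggest":"evaporation","type":"Word is not in the dictionary."},{"word":"Instagram","suggest":"","type":"Word is not in the dictionary."},{"word":"e-commerce platform","suggest":"","type":"Word is not in the dictionary."},{"word":"e-commerce platform","suggest":"","type":"Word is not in the dictionary."},{"word":"e-commerce platform","suggest":"","type":"Word is not in the dictionary."},{"word":"Instagram","suggest":"","type":"Word is not in the dictionary."},{"word":"Bing","suggest":"","type":"Word is not in the dictionary."},{"word":"news site","suggest":"","type":"Word is not in the dictionary."},{"word":"Redis","suggest":"","type":"Word is not in the dictionary."},{"word":"data scraping","suggest":"","type":"Word is not in the dictionary."},{"word":"captcha","suggest":"","type":"Word is not in the dictionary."},{"word":"XEvil","suggest":"Evil,Devil","type":"Word is not in the dictionary."},{"word":"CapMonster","suggest":"Cap Monster","type":"Word is not in the dictionary."},{"word":"Captcha","suggest":"","type":"Word is not in the dictionary."},{"word":"RuCaptcha","suggest":"","type":"Word is not in the dictionary."},{"word":"data scraping","suggest":"argument","type":"Word is not in the dictionary."},{"word":"data scraping","suggest":"","type":"Word is not in the dictionary."},{"word":"data scraping","suggest":"request","type":"Word is not in the dictionary."},{"word":"briefing","suggest":"","type":"Word is not in the dictionary."},{"word":"ticket","suggest":"","type":"Word is not in the dictionary."},{"word":"A-Parser","suggest":"","type":"Word is not in the dictionary."},{"word":"A-Parser","suggest":"","type":"Word is not in the dictionary."},{"word":"tool","suggest":"node,ace,tool","type":"Word is not in the dictionary."}]}]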
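Once the results are dumped as a JSON array, they can be post-processed with any JSON-aware tool. As a minimal sketch, the Python snippet below counts errors by type and lists the words for which no suggestion was returned. It assumes the structure shown above (a list of objects, each with an `errors` array of `word`, `suggest`, and `type` fields); the function name `summarize` and the shortened sample data are illustrative.

```python
import json
from collections import Counter

# Shortened sample in the same shape as the dump above
dump = """[{"errors": [
  {"word": "SaaS", "suggest": "Seas", "type": "Word is not in the dictionary."},
  {"word": "Youtube", "suggest": "YouTube", "type": "Incorrect use of uppercase and lowercase letters."},
  {"word": "WordStat", "suggest": "", "type": "Word is not in the dictionary."}
]}]"""


def summarize(dump_text):
    """Count errors per type and collect words that have no suggestion."""
    by_type = Counter()
    without_suggestion = []
    for result in json.loads(dump_text):
        for err in result["errors"]:
            by_type[err["type"]] += 1
            if not err["suggest"]:
                without_suggestion.append(err["word"])
    return by_type, without_suggestion


by_type, without_suggestion = summarize(dump)
print(by_type.most_common())
print(without_suggestion)
```

Words without a suggestion are often brand names or jargon, so a list like this can help separate likely false positives from real typos.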
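Possible settings
| Parameter | Default value | Description |
|---|---|---|
| Languages | English, Russian, Ukrainian | Languages to check |
| Options | Skip words written in capital letters, for example "VPK"; skip words containing digits, for example "avp17kh4534"; skip URLs, email addresses, and file names; ignore Roman numerals ("I, II, III, ...") | Check options |
| HTML::TextExtractor preset | default | Preset for HTML::TextExtractor. Lets you specify the text extraction settings |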
