The dumpgrepper utility searches XML dumps for regexp patterns. With a simple regexp, an enwiki dump can be grepped in about 20 minutes.
The grepper operates on the actual wikitext (with XML encoding removed), so regexps do not need to account for XML entities. Patterns are JavaScript RegExps.
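To illustrate why matching on decoded wikitext is convenient, here is a minimal sketch in Python. It is not dumpgrepper's code: the sample string and the variable names are made up for illustration, and Python's `re` module stands in for JavaScript RegExps (the syntax coincides for a pattern this simple).

```python
import re
from xml.sax.saxutils import unescape

# Hypothetical fragment as it would be stored inside the XML dump:
# "<", ">" and '"' in the wikitext appear as entities.
raw = "Some text&lt;ref name=&quot;a&quot;&gt;cite&lt;/ref&gt;"

# unescape() handles &lt;, &gt; and &amp; by default; &quot; is added explicitly.
wikitext = unescape(raw, {"&quot;": '"'})

# The pattern is written against plain wikitext, with no entities in it.
pattern = re.compile(r'<ref name="a">')

found_in_wikitext = pattern.search(wikitext) is not None
found_in_raw_xml = pattern.search(raw) is not None
print(found_in_wikitext, found_in_raw_xml)
```

Against the raw XML, the same search would need a pattern spelled with `&lt;`, `&quot;` and `&gt;`, which is the complication the grepper avoids.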
Installation
npm install -g dumpgrepper
Usage
bzcat /path/to/enwiki-latest-pages-articles.xml.bz2 | dumpgrepper '\| *link *='
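The pattern in the command above, `\| *link *=`, matches a literal pipe followed by `link` and an equals sign with optional spaces in between, which is the shape of a template parameter named `link`. The following is a rough Python sketch of the same kind of search, not dumpgrepper's implementation: the function name `grep_dump`, the sample dump, and the choice to match line by line and report `(title, line)` pairs are all assumptions made for illustration.

```python
import io
import re
import xml.etree.ElementTree as ET


def grep_dump(stream, pattern):
    """Yield (page title, line) for each wikitext line matching pattern."""
    regex = re.compile(pattern)
    title = None
    # iterparse streams the XML and decodes entities, so the regex
    # sees plain wikitext.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            for line in (elem.text or "").splitlines():
                if regex.search(line):
                    yield title, line
        elif tag == "page":
            elem.clear()  # release the finished page's subtree


# Tiny hypothetical dump with two pages, used only for this example.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision>
      <text>{{Infobox
| link = http://example.org
| name = Example
}}</text>
    </revision>
  </page>
  <page>
    <title>Other</title>
    <revision><text>No template &amp; no match here.</text></revision>
  </page>
</mediawiki>"""

matches = list(grep_dump(io.BytesIO(SAMPLE), r"\| *link *="))
for page_title, line in matches:
    print(page_title, line, sep=": ")
```

For a real dump, the stream could be opened with `bz2.open(path, "rb")` instead of `io.BytesIO`.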
See also
- The newer 'insource' regexp search runs directly on the wikitext of WMF wikis, with no dump needed: Example query, Bug.
- User:cscott made a hacked variant that lets you chain conditions, so you can say "pages with this but not that (optionally, on the same line)". See https://github.com/cscott/dumpgrepper. This was just a one-off for a particular wikitext migration; if it is more generally useful it could be cleaned up and merged.
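The "this but not that" idea from the last item can be sketched in a few lines of Python. This is only an illustration of the logic, not the interface of cscott's variant: the function name `page_matches`, its parameters, and the sample page are hypothetical.

```python
import re


def page_matches(wikitext, include, exclude, same_line=False):
    """Return True if the page matches `include` but not `exclude`.

    With same_line=True, a single line must match `include` without
    also matching `exclude`; otherwise both are tested page-wide.
    """
    inc = re.compile(include)
    exc = re.compile(exclude)
    if same_line:
        return any(
            inc.search(line) and not exc.search(line)
            for line in wikitext.splitlines()
        )
    return bool(inc.search(wikitext)) and not exc.search(wikitext)


# Hypothetical page used only for this example.
page = "{{Infobox\n| link = x\n}}\n[[File:A.png|thumb]]"

whole_page = page_matches(page, r"\| *link *=", r"\|thumb")
per_line = page_matches(page, r"\| *link *=", r"\|thumb", same_line=True)
print(whole_page, per_line)
```

Here the page-wide test rejects the page because `|thumb` occurs somewhere in it, while the same-line test accepts it because the `| link =` line itself does not contain `|thumb`.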
This article is issued from Mediawiki. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.