Knowledge store

Free knowledge data model based on schema.org

Last updated: 2022-12-16 by APaskulin (WMF)
Status: v1 published May 2021 to inform the creation of the schema for Wikimedia Enterprise. For the current Wikimedia Enterprise schema, see the data dictionary on enterprise.wikimedia.com.

The purpose of this document is to define a predictable structure for distributing Wikimedia content. To do this, we’ve chosen to use standard types and properties from schema.org. This model is not meant to replace existing data structures within MediaWiki; instead, these structures can act as part of a distribution layer that consumes, structures, and serves knowledge beyond Wikimedia.

Note on schema.org: schema.org is designed to provide structured metadata for web content. We’ve taken this idea a step further by using schema.org’s shared vocabulary to structure the content itself. This allows us to use the same patterns as schema.org even though we’re not using traditional Microdata, RDFa, or JSON-LD formats.
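To relate the property names used in the examples on this page (such as in_language) to the corresponding schema.org names (such as inLanguage), a consumer could rename keys mechanically. The following is a minimal sketch in Python, assuming that every snake_case name maps to the camelCase schema.org name; the function names are hypothetical and not part of the model, and properties marked "not on schema.org" would pass through without a schema.org counterpart.

```python
def to_schema_org_name(name: str) -> str:
    """Convert a snake_case property name to camelCase (assumed mapping)."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def rename_keys(value):
    """Recursively rename dictionary keys in a decoded JSON value."""
    if isinstance(value, dict):
        return {to_schema_org_name(k): rename_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [rename_keys(item) for item in value]
    return value


# Fragment of the Page example from this document.
fragment = {
    "in_language": {"identifier": "en"},
    "is_part_of": [{"identifier": "en.wikipedia.org"}],
}
renamed = rename_keys(fragment)
```

The result keeps the same nesting and values; only the keys change.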

Using this model

We encourage Wikimedia projects to make use of this model, either as a whole or as a base to build on. Services currently using this model include Phoenix (structured content proof of value) and Wikimedia Enterprise.

Adding a property

As defined here, the model is restricted to properties that are meaningful outside the context of MediaWiki. To suggest a new property, leave a comment on the talk page. New properties should conform with the applicable schema.org type whenever possible.

Feedback and questions

To share feedback and questions, leave a comment on the talk page. Note that there are often several unknowns associated with each type; these unknowns are tracked in the notes and questions subsections.

Patterns

Canonical data modeling

Allows content to be understood by people, programs, and machines outside the boundaries of the system

Capabilities

Serve and distribute

Distribute predictably-structured knowledge to products and platforms

Language

a human language
Based on schema.org Language

Example
{
  "name": "English",
  "identifier": "en",
  "direction": "ltr"
}
Property | Type | Description
name | Text | Language name in that language
identifier | Text | Language code as used by Wikimedia (ISO 639 with exceptions[1])
direction (not on schema.org) | Text | right-to-left (rtl) or left-to-right (ltr)
variant (not on schema.org) | Text | Language variant[2] (if applicable)
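As an illustration of how a consumer might hold a Language object in memory, here is a minimal sketch in Python. The class and function names are hypothetical, and treating variant as optional is an assumption based on the "(if applicable)" wording in the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Language:
    """Hypothetical container mirroring the Language properties."""
    name: str
    identifier: str
    direction: str
    variant: Optional[str] = None  # assumed optional ("if applicable")

    def __post_init__(self) -> None:
        # The model lists only two directions: rtl and ltr.
        if self.direction not in ("ltr", "rtl"):
            raise ValueError(f"unexpected direction: {self.direction!r}")


def language_from_dict(data: dict) -> Language:
    """Build a Language from a decoded JSON object."""
    return Language(
        name=data["name"],
        identifier=data["identifier"],
        direction=data["direction"],
        variant=data.get("variant"),
    )


english = language_from_dict(
    {"name": "English", "identifier": "en", "direction": "ltr"}
)
```

Rejecting unknown directions at construction time is a design choice of this sketch, not a requirement stated by the model.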

Notes and questions

Project

a wiki in a single language
Based on schema.org CreativeWork (not on schema.org Project)

Example
{
  "name": "Wikipedia",
  "identifier": "en.wikipedia.org",
  "in_language": {
    "identifier": "en"
  },
  "url": "https://en.wikipedia.org",
  "size": {
    "value": 70934,
    "unit_text": "MB"
  }
}
Property | Type | Description
name | Text | Unabbreviated project name in the language specified by in_language (Example: Wikipedia, Wikisłownik, etc.)
identifier | Text | Project domain (Example: en.wikipedia.org)
in_language | Language | Human language the project is written in
url | Text | URL for the project entry point (not directly to the main page)
size | QuantitativeValue | Project size when downloaded as a whole (compressed)
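In the Project example, in_language carries only an identifier rather than a full Language object. One way a consumer might expand such a stub is sketched below in Python. The LANGUAGES lookup table and the expand_language function are hypothetical, and treating the nested object as a reference to be resolved is an assumption, not something the model prescribes.

```python
import copy

# Hypothetical lookup of full Language objects keyed by identifier.
LANGUAGES = {
    "en": {"name": "English", "identifier": "en", "direction": "ltr"},
}


def expand_language(project: dict, languages: dict) -> dict:
    """Return a copy of project with its in_language stub expanded."""
    expanded = copy.deepcopy(project)
    code = project["in_language"]["identifier"]
    # Keep the original stub when the language code is unknown.
    expanded["in_language"] = dict(languages.get(code, project["in_language"]))
    return expanded


# The Project example from this document.
project = {
    "name": "Wikipedia",
    "identifier": "en.wikipedia.org",
    "in_language": {"identifier": "en"},
    "url": "https://en.wikipedia.org",
    "size": {"value": 70934, "unit_text": "MB"},
}
expanded = expand_language(project, LANGUAGES)
```

The input object is left unchanged, so the compact form remains available for distribution.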

Notes and questions

  • How should we handle inLanguage for multi-lingual projects? (Commons, Wikispecies, Wikidata, etc.)

Page

a wiki page
Based on schema.org Article

Example
{
  "name": "Pinnation",
  "identifier": 339742,
  "url": "https://en.wikipedia.org/wiki/Pinnation",
  "in_language": {
    "identifier": "en"
  },
  "is_part_of": [
    {
      "identifier": "en.wikipedia.org"
    }
  ],
  "version": 975098740,
  "date_modified": "2020-08-26T18:48:58Z",
  "license": [
    {
      "identifier": "CC-BY-SA-3.0",
      "name": "Creative Commons Attribution Share Alike 3.0 Unported",
      "url": "https://creativecommons.org/licenses/by-sa/3.0/"
    }
  ],
  "main_entity": {
    "identifier": "Q3756157"
  },
  "keywords": "Plant morphology, Leaves",
  "has_part": [
    {
      "identifier": "/node/ff569ed4759dbfc"
    }
  ]
}
Property | Type | Description
name | Text | Page title in reading-friendly format (spaces instead of underscores)
identifier | Integer | Page ID (MediaWiki page ID)
url | Text | Complete URL for the page
in_language | Language | Human language the page is written in
is_part_of | array of Project | Wiki the page belongs to
version | Integer | Revision ID (MediaWiki revision ID)
date_modified | Text | Timestamp of latest revision in ISO 8601 format (DateTime)
license | array of License | Content license
main_entity | Entity | Primary subject of the page (Wikidata ID)
keywords | Text | Comma separated list of categories the page belongs to
has_part | array of Section | Page sections
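Two Page properties are plain text that a consumer will probably want to parse: keywords (a comma separated list) and date_modified (an ISO 8601 timestamp). The following Python sketch shows one possible approach. The function names are hypothetical, and it assumes that category names contain no commas and that timestamps always use the UTC "Z" form shown in the example.

```python
from datetime import datetime, timezone
from typing import List


def parse_keywords(keywords: str) -> List[str]:
    """Split the comma separated keywords string into category names."""
    # Assumption: category names do not contain commas themselves.
    return [part.strip() for part in keywords.split(",") if part.strip()]


def parse_date_modified(value: str) -> datetime:
    """Parse a UTC timestamp such as 2020-08-26T18:48:58Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


# Values taken from the Page example in this document.
page = {
    "keywords": "Plant morphology, Leaves",
    "date_modified": "2020-08-26T18:48:58Z",
}
categories = parse_keywords(page["keywords"])
modified = parse_date_modified(page["date_modified"])
```

If category names can contain commas, the comma separated form is ambiguous and this split would need a different strategy.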

Notes and questions

  • Consider using display title for name instead of reading-friendly title
  • How should we handle media files associated with a page? Schema.org has audio, video, thumbnailUrl, and primaryImageOfPage (MediaObject). Note that primaryImageOfPage is defined on the WebPage type rather than on Article.
    • How to handle licenses for images embedded in a page? (Check with legal)
  • Should we include other URLs (mobile, edit, talk, etc.)? Schema.org has discussionUrl but no others.
  • We’ve intentionally not included content at the page level in favor of providing content at the section level.
  • Is it a problem that isPartOf would be inconsistent between objects?
  • Properties to consider:
    • about - Rosette or other set of page subjects (Wikidata items)
    • interactionStatistic seems like the most logical place for pageviews, number of edits, etc. What types of stats should we include? (array of InteractionCounter)
    • mentions - array of Thing, links included within the page
    • abstract: Is there a way we could get the first two sentences of the article?
    • citation (References used on the page)
    • schemaVersion (https://schema.org/docs/releases.html#v12.0) seems like a good idea, but I’m struggling to see the value. These releases seem to come out every few months.
    • page quality score (aggregateRating?)
    • copyrightHolder -  “The text of Wikipedia is copyrighted (automatically, under the Berne Convention) by Wikipedia editors and contributors and is formally licensed to the public under one or several liberal licenses.”[1] (Covered by license?)
    • dateCreated (page’s initial publication date)
    • creativeWorkStatus
    • creditText (attribution text)

Section

content grouped under a heading or as an introduction before the first heading on a page
Based on schema.org CreativeWork

Example
{
  "name": "Orbit and turning",
  "identifier": "/node/ff569ed4759dbfc",
  "version": 975098740,
  "is_part_of": [
    {
      "identifier": 339742
    }
  ],
  "text": "...html...",
  "encoding_format": "text/html",
  "license": [
    {
      "identifier": "CC-BY-SA-3.0",
      "name": "Creative Commons Attribution Share Alike 3.0 Unported",
      "url": "https://creativecommons.org/licenses/by-sa/3.0/"
    }
  ]
}
Property | Type | Description
name | Text | Section heading
identifier | Text | Knowledge store ID
version | Integer | MediaWiki revision ID
is_part_of | array of Page | Page the section belongs to
text | Text | Section content in HTML
encoding_format | MIME type | "text/html"
license | array of License | Content license
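A Page lists its sections in has_part, and each Section points back to its page through is_part_of. The Python sketch below shows one way a consumer might join the two. The function name and the second section ("/node/0000") are hypothetical, and preserving has_part order as reading order is an assumption.

```python
from typing import Dict, List


def sections_for_page(page: Dict, sections: List[Dict]) -> List[Dict]:
    """Return the sections referenced by page.has_part, in that order."""
    by_id = {section["identifier"]: section for section in sections}
    ordered = []
    for ref in page.get("has_part", []):
        section = by_id.get(ref["identifier"])
        if section is None:
            continue  # referenced section is not in the supplied list
        parent_ids = [p["identifier"] for p in section.get("is_part_of", [])]
        # Keep only sections that also point back to this page.
        if page["identifier"] in parent_ids:
            ordered.append(section)
    return ordered


page = {
    "identifier": 339742,
    "has_part": [{"identifier": "/node/ff569ed4759dbfc"}],
}
sections = [
    {
        "name": "Orbit and turning",
        "identifier": "/node/ff569ed4759dbfc",
        "is_part_of": [{"identifier": 339742}],
    },
    {
        # Hypothetical section belonging to a different page.
        "name": "Unrelated",
        "identifier": "/node/0000",
        "is_part_of": [{"identifier": 1}],
    },
]
matched = sections_for_page(page, sections)
```

Checking both directions of the link guards against a section list that mixes content from several pages.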

Notes and questions

  • Properties to consider:
    • dateModified
    • about - Rosette or other set of page subjects (Wikidata items)

License

content license
Based on schema.org CreativeWork

Example
{
  "identifier": "CC-BY-SA-3.0",
  "name": "Creative Commons Attribution Share Alike 3.0 Unported",
  "url": "https://creativecommons.org/licenses/by-sa/3.0/"
}
Property | Type | Description
name | Text | License name
identifier | Text | License ID from spdx.org
url | Text | URL for the license text

Notes and questions

Entity

a subject of a page
Based on schema.org Thing

Example
{
  "identifier": "Q3756157"
}
Property | Type | Description
identifier | Text | Wikidata ID
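Because an Entity carries only a Wikidata ID, a consumer may want to check the ID and link to the corresponding Wikidata page. The following is a minimal Python sketch with hypothetical function names. It assumes that only Wikidata items (IDs starting with Q) are used as page subjects; the model itself does not state this restriction.

```python
import re

# Wikidata item IDs: the letter Q followed by a positive integer.
ITEM_ID = re.compile(r"Q[1-9][0-9]*")


def is_item_id(identifier: str) -> bool:
    """Check whether identifier looks like a Wikidata item ID."""
    return ITEM_ID.fullmatch(identifier) is not None


def item_url(identifier: str) -> str:
    """Build the wikidata.org page URL for an item ID."""
    if not is_item_id(identifier):
        raise ValueError(f"not a Wikidata item ID: {identifier!r}")
    return "https://www.wikidata.org/wiki/" + identifier


# The Entity example from this document.
entity = {"identifier": "Q3756157"}
url = item_url(entity["identifier"])
```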

Notes and questions

  • Connection with Wikidata

This article is issued from MediaWiki. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.