9

I am looking for a comprehensive list of words/concepts that are represented in most, if not all, known languages - presumably the category would include human body parts (hand, foot, mouth, eye), things present in the environment (sun, sky, cloud, dirt, water), and body motions (walk, jump, fall), amongst other categories. Has such a list already been compiled, or does anyone have suggestions for resources I could use to create it programmatically?

My intent is to explicitly encode them into the model architecture of a Large Language Model before training the network weights in the hope of producing a more optimized and/or interpretable LLM.
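
To make that concrete, here is a toy sketch of the kind of thing I have in mind: reserve a block of embedding rows for a fixed concept inventory and freeze them before training. Everything below (the PyTorch class, the concept list, the random initialisation) is purely illustrative, not an existing library or a settled design.

    # Illustrative only: an embedding table whose first n_concepts rows are
    # reserved for a fixed concept inventory and excluded from training.
    import torch
    import torch.nn as nn

    UNIVERSAL_CONCEPTS = ["water", "sun", "eye", "hand", "walk", "fall"]  # toy subset

    class ConceptAwareEmbedding(nn.Module):
        def __init__(self, vocab_size, dim, n_concepts):
            super().__init__()
            concept_init = torch.randn(n_concepts, dim)  # stand-in for a principled initialisation
            self.concept_emb = nn.Embedding.from_pretrained(concept_init, freeze=True)
            self.free_emb = nn.Embedding(vocab_size - n_concepts, dim)
            self.n_concepts = n_concepts

        def forward(self, token_ids):
            # Route reserved (frozen) ids and ordinary (trainable) ids separately.
            is_concept = token_ids < self.n_concepts
            out = torch.empty(*token_ids.shape, self.free_emb.embedding_dim)
            out[is_concept] = self.concept_emb(token_ids[is_concept])
            out[~is_concept] = self.free_emb(token_ids[~is_concept] - self.n_concepts)
            return out

    emb = ConceptAwareEmbedding(vocab_size=50_000, dim=64, n_concepts=len(UNIVERSAL_CONCEPTS))
    print(emb(torch.tensor([[0, 5, 10_000]])).shape)  # torch.Size([1, 3, 64])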

norlesh
  • Check out Swadesh lists. – Graham H. Sep 10 '23 at 01:16
  • Thank you! I recognize the term now that you have reminded me - I tried asking ChatGPT, Bard and Bing, and none of them could make the connection from my description, all saying that no such list existed. – norlesh Sep 10 '23 at 01:22
  • Just so you know, explicitly encoding semantic relations into models went out of fashion years ago, when people discovered that just having large layers (and more of them) always won over explicitly encoding human knowledge. (Google "the bitter lesson") – jick Sep 10 '23 at 02:27
  • Thanks jick, but I'm thinking it's time for a resurgence now that any further scaling is predicated on millions of dollars of compute. I have a hard time believing that the semantic relations encoded implicitly across billions of parameters couldn't be optimized with some 'intelligent design', now that we have the current crop of LLMs to use as lab rats. – norlesh Sep 10 '23 at 02:44
  • @jick It depends heavily on what you mean by "win" here. If the goal is to emulate a human using language, and doing tasks like translation, then you can make that argument. But if the goal is to understand how language works, then a deep learning model is almost completely useless. – leftaroundabout Sep 12 '23 at 14:56
  • @leftaroundabout Well, I can't disagree with that, but if your goal is to understand how language works, why would you first distill your language understanding (which is the product of your goal) and feed it to a language model (which is bad at helping us understand it)? Sounds like putting a cart in front of a horse. – jick Sep 12 '23 at 19:40
  • @jick there is a nascent field sprouting from machine learning called Mechanistic Interpretability (early days yet) that is starting to produce a body of research on topics related to extracting, editing, and auditing the implicit information distilled inside large language models. – norlesh Sep 18 '23 at 00:46

2 Answers

19

The Natural Semantic Metalanguage is a project that aims to identify the universal building blocks of human language, or "semantic primes". After four decades of empirical research, they have identified about 65 semantic primes - concepts which are present in every language, and which cannot themselves be broken down into more basic meanings. (Though in some languages they may exist as affixes or phrases instead of single words.)
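
If you want something machine-readable to start from, here is a partial, illustrative subset of the primes as a simple data structure. The grouping labels loosely follow the NSM categories; consult the project's published tables for the full and current inventory.

    # Partial, illustrative subset of NSM semantic primes (not the full list of ~65).
    SEMANTIC_PRIMES = {
        "substantives": ["I", "YOU", "SOMEONE", "PEOPLE", "SOMETHING", "BODY"],
        "mental predicates": ["THINK", "KNOW", "WANT", "FEEL", "SEE", "HEAR"],
        "actions and events": ["DO", "HAPPEN", "MOVE"],
        "evaluators and descriptors": ["GOOD", "BAD", "BIG", "SMALL"],
        "logical concepts": ["NOT", "MAYBE", "CAN", "BECAUSE", "IF"],
    }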

curiousdannii
  • It took four decades to come up with only 65, and they still got it wrong. English "do" and "make" in most contexts use the same verb in Spanish. Yet "do" is one of the so-called primes. – WGroleau Sep 12 '23 at 19:17
  • @WGroleau Yes, there can be overlaps in the surface forms in a language. This happens in the English version of NSM too, where there are 3 primes that use BE. Polysemy is common so this isn't unexpected. – curiousdannii Sep 12 '23 at 20:53
  • The point is that "hacer" does not "have the same translation in every language". – WGroleau Sep 12 '23 at 22:55
  • @WGroleau Yes, that's acknowledged by the NSM project. The concepts are what are claimed to be present in each language, not that their words can't have other meanings too. – curiousdannii Sep 13 '23 at 03:55
  • Well, I suppose I'd have to study what they're doing further. The Wikipedia page is what stated that the "primes" "have the same translation in every language" – WGroleau Sep 13 '23 at 05:13
17

No, there may not be any universal meanings. Here is an example. In most (maybe all) Bantu languages, there is no word for "hand" and no word for "arm", because there is a single word meaning "hand and arm". The English word "dirt" refers both to the stuff that you sweep up in your house and to the stuff you plant crops in; in Bantu languages, these are different words. Two words in different languages would have the same meaning only if they referred to the same things, which is generally not the case. The word corresponding to English "jump" in Logoori covers jumping up (not down) as well as crossing (a stream, a road). There are many kinds of "fall" in Logoori (catastrophic falling out of a tree vs. stumbling, etc.).

The most likely candidates will be natural objects that are trivially distinguishable from other similar objects. Because bodies are continuous structures, body part divisions are somewhat arbitrary (as in the case of "arm" and "hand", also "leg" and "foot").

Swadesh lists are the best approximation that you will find, and they are usually clearly wrong on a number of cases for any language. You will have to give up on the notion of "all languages", but as a first step you could try to get the equivalents of a Swadesh list or the Leipzig list in some language, and perhaps try to add weights to indicate how close the other language's word is to the English word (for example "water" as a noun, and only in the H2O sense, is pretty stable as to referent). Some of the words are ludicrous as candidates for universals, for example "if", which basically doesn't exist in Bantu.
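
If you do go the Swadesh route, NLTK ships aligned Swadesh lists for a couple of dozen languages, which is enough to prototype the weighting idea. This is only a sketch, and the stability weights below are placeholders that you would have to assign yourself (or estimate from other data).

    # Sketch: aligned Swadesh entries from NLTK, with hand-assigned "stability" weights.
    # The weight values are placeholders, not measured data.
    import nltk
    nltk.download("swadesh", quiet=True)
    from nltk.corpus import swadesh

    # Aligned (English, Spanish) pairs for the 207-item Swadesh list.
    pairs = swadesh.entries(["en", "es"])

    # 1.0 = referent judged very stable across languages; lower = known to split or merge.
    stability = {"water": 1.0, "sun": 0.9, "earth": 0.4, "if": 0.1}

    weighted = [(en, es, stability.get(en, 0.5)) for en, es in pairs]
    print(weighted[:5])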

user6726