
Wolfram Summer School


Margarita Zeitlin

Summer School

Class of 2008


Margarita is a current linguistics and German student at New York University, a longtime Beatles fanatic, and the owner of one 6-pound copy of A New Kind of Science, which inspired her to attend the NKS Summer School 2008. Fascinated by the realization that simple rules can create incredibly complex behavior, she set out to join the world of Wolfram in an attempt to apply NKS to the field of linguistics. She enjoys spending her time in the Documentation Center, sorting through her errors, and, once in a while, writing a successful piece of code.

Project: Languages, Letters, and “Loids”

Linguists estimate that 5,000 to 6,000 languages are spoken in the world today, most of which belong to one of ten general language families. The Indo-European family includes English, German, Spanish, French, and Italian, the five languages this project focuses on. The goal is to visually represent the frequency of each letter at every position of 2- to 10-letter words in these languages; breaking corpora of texts into characters makes it possible to isolate the characters that appear most often in a language. Although the letters of a language do not enumerate all of its sounds, working with orthography permits a visual evaluation and comparison of several languages. The project was then extended to a random word generator for the five languages, driven by the letter probabilities derived from the corpora. The final goal was to see whether any of the randomly generated English words could pass for trademarks or perhaps some new slang.
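As a rough illustration of the idea, the sketch below counts letters by position and then draws random words from those counts. It is written in Python rather than the language used for the project, and several details are assumptions: the toy corpus, the function names `positional_counts` and `random_word`, the choice to keep separate tables for each word length, and the choice to draw every position independently.

```python
import random
from collections import Counter, defaultdict


def positional_counts(words, min_len=2, max_len=10):
    """Count letters keyed by (word length, position) for words of min_len..max_len letters."""
    counts = defaultdict(Counter)
    for word in words:
        word = word.lower()
        if word.isalpha() and min_len <= len(word) <= max_len:
            for pos, letter in enumerate(word):
                counts[(len(word), pos)][letter] += 1
    return counts


def random_word(counts, length, rng):
    """Draw each position independently, weighted by the letter counts seen at that position."""
    letters = []
    for pos in range(length):
        table = counts.get((length, pos))
        if not table:
            raise ValueError(f"no data for words of length {length}")
        choices, weights = zip(*table.items())
        letters.append(rng.choices(choices, weights=weights, k=1)[0])
    return "".join(letters)


# Toy stand-in corpus (hypothetical; the project used corpora of texts in five languages).
corpus = "the cat sat on the mat and the dog ate the hat".split()
counts = positional_counts(corpus)

rng = random.Random(0)  # fixed seed so repeated runs give the same sample
sample = [random_word(counts, 3, rng) for _ in range(5)]
print(sample)
```

Because each position is drawn on its own, the output can combine letters that never occur together in the corpus, which is one reason most generated words look like gibberish.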

This project began as an attempt to look at five languages and discover something about them solely by studying their words. That goal didn’t pan out; the results instead reinforced what was already known about these languages. It is nevertheless satisfying to reach in three weeks the same conclusions that linguists have reached after decades of research. Moreover, this method of research is unusual for a linguist, so it is remarkable to have reinforced the knowledge of the field from such an unorthodox perspective.

The random word generation was definitely the most amusing aspect of this project. The filter is not the finest; looking at the blocks of generated text, it is evident that most of it does not resemble any language. Some of the words are real words in their language, some can pass for real words, and others are just gibberish. Still, the generator works relatively well: roughly 10% of 1,000 generated words were actual English words, versus roughly 0.4% when the words were generated without the frequency data. The German text came out the worst, while the Italian text came out the best, owing to Italian’s unusually high frequency of vowels in all positions of a word. However, the project doesn’t end here. There is work to be done on the filter, and it will be possible to create a better generator by limiting where certain letters can appear (e.g., in English, “x” rarely appears in the first position), creating groups of letters that often occur together in a language (e.g., German’s “sch”), and extending the letters to include all phonemes, or sounds, of a language.
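One possible way to approach the first two improvements is to let each letter depend on the letter before it. The Python sketch below uses a first-order Markov chain over letters; this is a swapped-in technique for illustration, not the method used in the project, and the word list, the `START`/`END` markers, and the function names are all hypothetical.

```python
import random
from collections import Counter, defaultdict

START, END = "^", "$"  # hypothetical markers for the beginning and end of a word


def letter_transitions(words):
    """Count how often each letter follows another, including word start and word end."""
    transitions = defaultdict(Counter)
    for word in words:
        padded = START + word.lower() + END
        for a, b in zip(padded, padded[1:]):
            transitions[a][b] += 1
    return transitions


def generate(transitions, rng, max_len=10):
    """Walk the chain from START until END is drawn or max_len letters are produced."""
    word, current = "", START
    while len(word) < max_len:
        letters, weights = zip(*transitions[current].items())
        current = rng.choices(letters, weights=weights, k=1)[0]
        if current == END:
            break
        word += current
    return word


# Toy German word list (hypothetical), chosen because "sch" recurs in it.
words = ["schule", "schnell", "schon", "tisch", "mensch"]
transitions = letter_transitions(words)

rng = random.Random(1)  # fixed seed so repeated runs give the same sample
sample = [generate(transitions, rng) for _ in range(5)]
print(sample)
```

With this setup a word can only begin with a letter that begins some word in the list, and every adjacent pair of letters has been seen before, so frequent groups tend to survive. Longer groups such as “sch” would be captured more reliably by conditioning on two previous letters instead of one.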

Project-Related Demonstrations

Random Word Generator

View the demonstration on the Wolfram Demonstrations Project

Favorite Radius 3/2 Rule

Rule 1442

This cellular automaton is definitely one of my favorites. It reminds me of a Hokusai painting, a beautiful mountain covered in snow. There is some similarity to ECA 30 in its central complexity, but it is a much more eccentric CA, having jagged edges and various patterns of tracks throughout it.

Out of the thousands of possibilities, CA 1442 certainly has some very interesting characteristics. I would be curious to find out what kind of relation it has to ECA 30, as well as whether it is found in nature. Besides, 1442 was probably a good year.