Alumni

Justin Shin

Science and Technology

Class of 2016

Bio

I study mathematics and philosophy at Bard College. I like investigating models of rational action, and so I enjoy rational choice theory, particularly game theory. When I am grappling with a problem, I always try to draw a picture of the problem to help me think. I enjoy games of all kinds, although I have a soft spot for card games and negotiation games. My other interests include graph theory, symbolic logic, philosophy of mind and philosophy of technology. The questions that I am thinking about now are “Can I always divide a valuable resource among a group of people so that each person thinks that their allocation is at least as valuable as everyone else’s allocation?” and “How should we classify and evaluate social credit systems, and do they come with any new and interesting ethical considerations?”

Project: Computational Linguistics in Mathematica: Keywords, Jargon and Fluff

The United States Code is a compilation of the general and permanent federal laws of the United States. The corpus is interesting as a well-structured, socially relevant technical text. We hope to investigate the text by importing the document into Mathematica and writing general functions that can be used by future projects in computational linguistics using the Wolfram Language.

Our first exploration into this corpus is to import the XML file and see if we can generate some word clouds from the different sections. This is mostly just to get used to moving around the XML representation in the Wolfram Language and get some simple visuals for the contents of the Code. I provide two below. The first word cloud is from the section on conservation, and the second is from the section on the general structure of the government.
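The counting step that sits underneath a word cloud can be sketched outside Mathematica as well. The following is a minimal illustration in Python rather than the Wolfram Language, and it rests on several assumptions: the tiny XML document, its `section` tag and `name` attribute, and the stopword list are all invented for the example, and the real United States Code XML uses a different and much richer schema.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature document; NOT the real United States Code schema.
SAMPLE_XML = """
<code>
  <section name="conservation">
    <p>The Secretary shall protect wildlife and conserve wildlife habitat.</p>
  </section>
  <section name="government">
    <p>The President shall appoint officers. The Congress shall fund the officers.</p>
  </section>
</code>
"""

# Small illustrative stopword list, not a standard one.
STOPWORDS = {"the", "and", "shall", "of", "a", "to"}


def section_word_counts(xml_text):
    """Map each section name to a Counter of its non-stopword words."""
    root = ET.fromstring(xml_text)
    counts = {}
    for section in root.iter("section"):
        # itertext() walks all nested text, so child tags such as <p> are included.
        text = " ".join(section.itertext()).lower()
        words = re.findall(r"[a-z]+", text)
        counts[section.get("name")] = Counter(
            w for w in words if w not in STOPWORDS
        )
    return counts


counts = section_word_counts(SAMPLE_XML)
for name, counter in counts.items():
    # The most frequent words are the ones a word cloud would draw largest.
    print(name, counter.most_common(3))
```

A word cloud is then just a rendering of these per-section counts, with font size proportional to frequency.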

Next, I need some way to pull keywords and key phrases out of the document. There is not yet a function in Mathematica that takes a document and finds keywords, so we should build one. The typical method of finding keywords compares word frequencies in the text under examination against word frequencies in a large body of sample text. This method works well when large bodies of typical text in the same genre are available, but it does not work well at all for technical documents like the United States Code. The immediate difficulty is that legal documents tend to use both legal jargon and the jargon of whatever field the legal text is attempting to legislate, and are usually very focused on a particular class of laws. As a result, taking large samples of legal text is not likely to provide a reasonable basis for the typical frequency analysis. So if we are to produce a keyword function that we can apply to the United States Code, we must find a good way to identify keywords without large sample texts.
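One possible way to do without an external sample text is to let the rest of the same document serve as the comparison. The sketch below, written in Python rather than the Wolfram Language, scores each word in a section by how over-represented it is relative to the other sections. This is only an assumed approach for illustration, not the method the project settled on; the function names, the smoothing scheme and the three toy sections are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def keywords(sections, target, top_n=3):
    """Rank words in sections[target] by how over-represented they are
    relative to the other sections of the same document."""
    target_counts = Counter(tokenize(sections[target]))
    rest_counts = Counter()
    for name, text in sections.items():
        if name != target:
            rest_counts.update(tokenize(text))

    target_total = sum(target_counts.values())
    rest_total = sum(rest_counts.values())
    vocab = len(set(target_counts) | set(rest_counts))

    scores = {}
    for word, count in target_counts.items():
        # Add-one smoothing keeps the ratio finite for words absent elsewhere.
        p_target = (count + 1) / (target_total + vocab)
        p_rest = (rest_counts[word] + 1) / (rest_total + vocab)
        scores[word] = p_target / p_rest

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


# Invented toy sections, not text from the United States Code.
sections = {
    "conservation": "the secretary shall protect wildlife and the secretary "
                    "shall conserve wildlife habitat",
    "government": "the president shall appoint officers and the congress "
                  "shall fund the officers",
    "commerce": "the secretary shall regulate trade and the congress "
                "shall tax trade",
}

top = keywords(sections, "conservation")
print(top)
```

In this toy input, "wildlife" ranks first for the conservation section, while words such as "the" and "shall", which appear in every section, score below 1 and drop out without any stopword list. Whether the same holds on the real corpus is an open question.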

A natural extension of keywords is key phrases. The method of finding key phrases is similar to the method of finding keywords, but with the added complication of identifying noun phrases, verb phrases and clauses. Luckily, Mathematica’s TextStructure function can identify these language objects for us. Once we have a keyword function and a key phrase function that work reasonably well without sample texts, we can identify things like “President” and “Secretary of the Interior” as key terms.

After writing these functions to identify key terms, we have two paths we can explore. The first is the open problem of co-reference identification, which asks how we can tell if two noun phrases have the same real-world referent. For instance, it is very clear to us that in the sentence “Alice and Bob asked Carl, ‘Is Walmart open?’, and Carl replied ‘Yes, they’re open.’” the word “they” refers to the same thing as “Walmart,” but it can be difficult for a program to make that connection. An inaccurate co-reference function might identify the referent of “they” as “Alice and Bob.” If we can write a reasonable co-reference function, we can take our corpus and ask it to collect all sentences that share a specific referent, like “The Secretary of the Interior,” effectively grabbing all of the responsibilities of the Secretary of the Interior out of the United States Code. Other methods, such as simply searching for all instances of “The Secretary of the Interior,” would miss out on all the sentences that just use pronouns or descriptors, or return larger-than-needed swathes of text, which must then be searched by hand. With a co-reference function and a key term function, we can write a concordance generator that works with any corpus, from poetry collections to religious texts.
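The weakness of plain string search can be seen in a few lines. The sketch below is in Python rather than the Wolfram Language; the three-sentence passage is invented for the example and is not quoted from the United States Code, and the sentence splitter is deliberately crude.

```python
import re


def sentences_mentioning(text, phrase):
    """Return the sentences that contain the phrase verbatim (case-insensitive)."""
    # Crude splitter: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if phrase.lower() in s.lower()]


# Invented passage, not text from the United States Code.
text = (
    "The Secretary of the Interior shall manage public lands. "
    "He shall also report annually to Congress. "
    "The President shall appoint officers."
)

hits = sentences_mentioning(text, "Secretary of the Interior")
print(hits)
```

Only the first sentence is returned. The second sentence describes a duty of the same office but names it only with a pronoun, so exact matching misses it; this is the gap a co-reference function would have to close.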

Our other path is to gather the historical archives of the United States Code and see if we can run an interesting analysis on how this corpus has changed over time. A general function that might be useful is a PredictWordCloud function, which when passed a series of word clouds with timestamps spits out a prediction of what a word cloud will look like in the future. In this corpus, we may hope to use this PredictWordCloud function to predict which legal topics are on the rise and which are on the fall. Outside of our corpus, we can also use this to see what words are falling out of usage and what new words are catching on in popularity. If we combine this PredictWordCloud function with a function that examines and classifies the grammatical structures of a document, we can go a step further and build a function that attempts to write in the style of the corpus in the future.
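PredictWordCloud is a proposed function, not an existing one, and the text does not say what model it would use. As one hedged possibility, the sketch below fits a least-squares line to each word's relative frequency over time and extrapolates it. It is written in Python rather than the Wolfram Language; the function names, the choice of a linear model and the frequency figures are all assumptions made up for illustration.

```python
def linear_forecast(years, values, target_year):
    """Extrapolate a least-squares line through (year, value) points.

    Requires at least two distinct years; otherwise the slope is undefined.
    """
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    slope = sxy / sxx
    return mean_y + slope * (target_year - mean_x)


def predict_word_weights(history, target_year):
    """history maps year -> {word: relative frequency}.

    Returns a predicted {word: weight} for target_year, which could then be
    rendered as a word cloud.
    """
    years = sorted(history)
    words = set().union(*(history[y] for y in years))
    forecast = {}
    for word in words:
        # A word missing from a snapshot is treated as frequency 0.
        series = [history[y].get(word, 0.0) for y in years]
        # Frequencies cannot be negative, so clamp the extrapolation at 0.
        forecast[word] = max(0.0, linear_forecast(years, series, target_year))
    return forecast


# Invented illustrative frequencies, not measurements from any corpus.
history = {
    2000: {"telegraph": 0.3, "internet": 0.1},
    2005: {"telegraph": 0.2, "internet": 0.2},
    2010: {"telegraph": 0.1, "internet": 0.3},
}

forecast = predict_word_weights(history, 2015)
print(forecast)
```

With these made-up inputs the rising word keeps rising and the falling word is predicted to vanish. A straight line is the simplest model that captures "on the rise" and "on the fall"; real usage trends may well need something less naive.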

Overall, I hope to expand Mathematica’s support of computational linguistics and generate some useful visualizations and alternate forms of the United States Code.

References

[1] Office of the Law Revision Counsel, “United States Code.” (Sep 15, 2016) uscode.house.gov/browse.xhtml.

[2] Wolfram Research, Inc., “Wolfram Language & System Documentation Center.” (Sep 15, 2016) reference.wolfram.com/language.

Favorite 3-Color 2D Totalistic Cellular Automaton

Rule 312916548
