Nae Eoun is a graduate student in bioengineering at the Massachusetts Institute of Technology. Drawn to the intersection of biological science and engineering, she left medical school and moved from Korea to the United States to pursue that interest. She received her bachelor’s and master’s degrees in biomedical engineering at Columbia and is now continuing her studies at MIT. Combining her interest in biology with her aptitude for computer science, she focuses her research on computational biology. At MIT, she received an award for designing a lumboperitoneal shunt anchor and computationally analyzing its performance. Outside the classroom, she has worked at the Project Lab in Protein Biochemistry at Columbia, Sanofi Genzyme, Hanwha Chemical Bio R&D Center and Seoul National University Hospital, where she strengthened her hands-on research skills and her command of research methods.
Project: mBot for Real-Time Data Processing
Goal of the project:
The claims in a patent or patent application define the scope of the invention. To enforce the full scope of a patent, inventors and companies must make sure that this scope is not already covered by prior publications or competitors’ patents (referred to as “prior art”) available in the public domain. For this reason, most companies with a patent portfolio spend substantial resources searching for potentially troublesome prior art. This project is a first step toward optimizing that process: it offers an analytic tool that categorizes patents and analyzes the relationships between them by clustering them based on claim term frequency.
Summary of work:
Starting from the USPTO’s full-text files of US patent grants, the patent information (patent number, invention title, number of claims and claim text) was extracted using the headings of the XML files. WordCounts was then run on the cleaned invention title and claim text to obtain the three most frequently used words. A GloVe embedding, loaded through NetModel, turned those words into high-dimensional vector representations, so that semantically similar words lie close together in the vector space. Non-hierarchical clustering was performed with ClusterClassify using various methods, including Agglomerate, DBSCAN and GaussianMixture, and using KMeans with a specified number of clusters. The project also showed the prominent trends in patent grants over time by matching the USPTO’s classification system to NBER’s classification categories and counting the patents in each category per year. In addition, for the analysis of individual patent claims, a scope checker was created that finds the most relevant and detailed information for a claim in the patent’s description by constructing a distance matrix between the claim sentence and the description sentences.
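The word-count and scope-check steps described above can be illustrated with a minimal Python sketch. It is not the project's implementation: the project used Wolfram Language functions and GloVe embeddings, whereas this stand-in measures distance with a plain bag-of-words cosine distance. The stop-word list, the function names and the sample claim and description are all made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the project's actual cleaning rules are not given.
STOP_WORDS = {"a", "an", "the", "of", "and", "or", "to", "in", "is", "are",
              "for", "with", "by", "on", "that", "said", "wherein", "each"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def top_words(text, n=3):
    """Return the n most frequent cleaned words in the text."""
    return [word for word, _ in Counter(tokenize(text)).most_common(n)]


def cosine_distance(tokens_a, tokens_b):
    """Bag-of-words cosine distance (assumed stand-in for an embedding distance)."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def split_sentences(text):
    """Naive sentence splitter on periods and semicolons."""
    return [s.strip() for s in re.split(r"(?<=[.;])\s+", text) if s.strip()]


def distance_matrix(claim_sentences, description_sentences):
    """Rows are claim sentences, columns are description sentences."""
    return [[cosine_distance(tokenize(c), tokenize(d))
             for d in description_sentences]
            for c in claim_sentences]


# Made-up sample data, for illustration only.
claim = ("A battery pack comprising a cooling plate and a battery cell "
         "mounted on the cooling plate.")
description = ("The housing is made of aluminum. "
               "Each battery cell is mounted on a cooling plate that carries coolant. "
               "The pack is sold in several colors.")

claim_sentences = split_sentences(claim)
description_sentences = split_sentences(description)
matrix = distance_matrix(claim_sentences, description_sentences)

# Index of the description sentence closest to the (single) claim sentence.
closest = min(range(len(description_sentences)), key=lambda j: matrix[0][j])

print(top_words(claim))
print(description_sentences[closest])
```

In this toy case the closest description sentence is the one that mentions the battery cell and cooling plate. Swapping the bag-of-words distance for a distance between embedding vectors would bring the sketch closer to the approach the summary describes.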
Results and future work:
The non-hierarchical clustering, which used a 100-dimensional GloVe vector for each of the three words, produced only a very small number of trivial clusters, whereas k-means clustering with a specified number of clusters grouped the patents reasonably well by meaning. Classification using the USPTO’s classification codes then showed that patent grants have been increasing continuously and that the number of patents in the computer and software fields, in particular, has grown greatly. Lastly, the scope checker successfully produced the relevant details for a given patent claim. The patent claim analytics need further improvement, for example by taking more of the frequently used words, by using phrases instead of single words or by adding neural network layers that can extract more distinguishing features. The clustering can also be improved by methods that optimize the number of clusters, minimizing the dissimilarity within each cluster while maximizing it between clusters, or by taking the classification system into account as well.
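One common way to score a clustering by exactly this trade-off (low dissimilarity within clusters, high dissimilarity between them) is the mean silhouette coefficient. The Python sketch below is only an illustration of that idea under stated assumptions, not something the project implemented: the 2-D points stand in for word-embedding vectors, and the candidate labelings are hand-written rather than produced by a clustering algorithm.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient: higher means tighter, better-separated clusters."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        denom = max(a, b)
        total += 0.0 if denom == 0 else (b - a) / denom
    return total / len(points)


# Toy vectors (illustrative stand-ins for patent embeddings).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

# Hypothetical candidate labelings for two different cluster counts.
candidates = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 1, 2, 2, 2],
}

scores = {k: silhouette_score(points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

In practice the candidate labelings would come from running k-means (or another method) for each cluster count, and the count with the highest score would be kept.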