Giulio recently graduated with a master’s degree in physics from the University of Rome, “La Sapienza.” His studies focused mainly on statistical mechanics and its applications in fields such as neural networks, disordered systems and biological systems.
His last project revolved around the statistical analysis of the central carbon metabolism of the bacterium E. coli.
His interests range from the natural sciences to Karate-Do, Italian cantautori (singer-songwriters), science fiction and politics.
Project: Parsing data from vector plot images
My project consisted of a set of functions that extract the dataset of a plot from the plot's vector representation.
The algorithm proceeds gradually: it identifies all the graphic primitives present in the image and then separates out the different elements, namely the axes or the frame, the plot ticks, the tick labels, and the points belonging to the different datasets.
The last task is to match each label to the tick it refers to and to use the label values to rescale the data from image coordinates to plot coordinates.
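As an illustration of the filtering step, here is a minimal sketch in Python of one possible heuristic for straight line segments only. It is not the project's actual code: the `Segment` type, the `classify_segments` function and the two length thresholds are assumptions made for this example. Long axis-aligned segments are treated as axes or frame edges, very short axis-aligned ones as ticks, and everything else is left for later stages.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Segment:
    """A straight line primitive in image coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def length(self) -> float:
        return math.hypot(self.x1 - self.x0, self.y1 - self.y0)

    @property
    def is_axis_aligned(self) -> bool:
        # Horizontal or vertical, up to a small numerical tolerance.
        return abs(self.x1 - self.x0) < 1e-6 or abs(self.y1 - self.y0) < 1e-6


def classify_segments(segments, axis_fraction=0.5, tick_fraction=0.05):
    """Sort segments into axes, ticks and other primitives.

    The thresholds are illustrative assumptions: a segment counts as an
    axis if it is at least `axis_fraction` of the longest axis-aligned
    segment, and as a tick if it is at most `tick_fraction` of it.
    """
    aligned = [s for s in segments if s.is_axis_aligned]
    longest = max((s.length for s in aligned), default=0.0)
    groups = {"axes": [], "ticks": [], "other": []}
    for s in segments:
        if not s.is_axis_aligned or longest == 0.0:
            groups["other"].append(s)
        elif s.length >= axis_fraction * longest:
            groups["axes"].append(s)
        elif s.length <= tick_fraction * longest:
            groups["ticks"].append(s)
        else:
            groups["other"].append(s)
    return groups


segments = [
    Segment(0, 0, 100, 0),    # x axis
    Segment(0, 0, 0, 80),     # y axis
    Segment(25, 0, 25, 3),    # tick mark on the x axis
    Segment(10, 10, 40, 35),  # part of a data line
]
groups = classify_segments(segments)
```

Real-world plots would need further rules, for example for frames drawn as a single rectangle or for ticks of different lengths, which is where the manual tuning of the heuristics comes in.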
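The rescaling step can be sketched as follows, assuming linear axes and assuming that each tick has already been paired with the numeric value of its label. The function names `fit_axis` and `rescale_points` are hypothetical and chosen for this example. Each axis gets a least-squares line mapping image coordinates to data values, and the two fits are then applied to the extracted points.

```python
def fit_axis(pixel_positions, label_values):
    """Least-squares fit of value = scale * pixel + offset for one axis."""
    n = len(pixel_positions)
    if n < 2 or n != len(label_values):
        raise ValueError("need at least two ticks, each with a label value")
    mean_p = sum(pixel_positions) / n
    mean_v = sum(label_values) / n
    var_p = sum((p - mean_p) ** 2 for p in pixel_positions)
    if var_p == 0:
        raise ValueError("tick positions must not all coincide")
    cov = sum((p - mean_p) * (v - mean_v)
              for p, v in zip(pixel_positions, label_values))
    scale = cov / var_p
    offset = mean_v - scale * mean_p
    return scale, offset


def rescale_points(points, x_fit, y_fit):
    """Map (x, y) image coordinates to data coordinates."""
    (sx, ox), (sy, oy) = x_fit, y_fit
    return [(sx * x + ox, sy * y + oy) for x, y in points]


# x ticks at 50, 150, 250 image units carry the labels 0, 1, 2.
x_fit = fit_axis([50, 150, 250], [0, 1, 2])
# y ticks: image y grows downward here, so the scale comes out negative.
y_fit = fit_axis([300, 200, 100], [0, 10, 20])
data = rescale_points([(100, 250)], x_fit, y_fit)
```

Using every labelled tick in the fit, rather than only two, averages out small errors in the detected tick positions. A logarithmic axis would need the fit to be done on the logarithm of the label values instead.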
Here I present a plot taken from an arxiv.org preprint, with multiple datasets, inset numbers, axis labels and different colours. The second image is the plot as reconstructed by the algorithm without any external input.
Summary of results and conclusions
I have been able to successfully identify the axes, ticks, labels and data points in my self-generated plots. In this situation the error is negligible and depends mainly on the non-zero area of the geometric objects used to mark the points of each dataset.
The results with real-world plots are not always perfect. However, almost all of the points can usually be recognized with little or no manual tuning of the heuristics.
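One way to reduce a marker of non-zero area to a single data point is to take its centroid. The sketch below does this for a marker drawn as a simple polygon, using the shoelace formula. It is an illustrative assumption, not necessarily the approach used in the project, and `polygon_centroid` is a hypothetical name.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as (x, y) vertices."""
    n = len(vertices)
    if n < 3:
        raise ValueError("a polygon needs at least three vertices")
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the polygon
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("degenerate polygon with zero area")
    # Centroid = sum / (6 * area), and area = twice_area / 2.
    return cx / (3 * twice_area), cy / (3 * twice_area)


# A square marker of side 2 centred on the point (10, 10).
marker = [(9, 9), (11, 9), (11, 11), (9, 11)]
centre = polygon_centroid(marker)
```

For symmetric markers such as squares or diamonds the centroid coincides with the intended point, so the remaining error comes from the precision of the vertex coordinates in the file.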
Further tests are required to correct the remaining sources of error. The next step in the project would be to assemble a test set of real-world plots in order to benchmark the performance and fine-tune the parameters.
The long-term goal is to generalize the input to an arbitrary plot image (e.g. bitmaps, scanned files …).
Favorite Four-Color, Four-State Turing Machine