Molly Des Jardin, University of Pennsylvania
Molly has been the Japanese and Korean studies librarian at the University of Pennsylvania since July 2013. She manages the humanities collection related to Japan and Korea, as well as all materials in the Japanese and Korean languages. Molly also conducts reference interviews and instruction sessions on library use and on resources related to Japan and Korea.
Brian Vivier, University of Pennsylvania
Brian leads area studies collections at the Penn Libraries, coordinating collection development in non-Western languages and overseeing the Libraries’ separate area studies departments. He also serves as the Chinese Studies Librarian, managing Penn’s collections related to Chinese studies, both in Chinese and in European languages, and is an adjunct associate professor of Chinese studies. His research focuses on the economic and social history of medieval China. Before coming to Penn, Brian was the Coordinator of Public and Information Services for the Asia Library at the University of Michigan and Special Projects Manager for Yale’s East Asia Library. Brian holds a BA from Indiana University of Pennsylvania, a PhD in Chinese history from Yale University, and a library degree from Southern Connecticut State University.
YCW: You are both currently working on a new corpus project for Chinese and Japanese languages. It’s a massive project that is so far in its early stages. You’re aiming to create a database of newspapers and magazines from the early 20th century that are digitally accessible and to fundamentally improve optical character recognition (OCR) for Chinese character texts. Where did the impetus for this project come from? And how do you see it fitting in with similar projects?
Molly Des Jardin and Brian Vivier (MDJ and BV): We both work in a university environment, and we know that there is general discontent among scholars who lament that they are not able to use OCR for their historical documents. While there are other databases of East Asian texts made up of periodical scans, for example of magazines and newspapers, almost none of them have text that can be used and manipulated. OCR allows text to be searched and copied, unlike when only a “page image” (photograph or scan of the page) is available. For historians, media or literary scholars, genealogy and family history researchers, and other interested individuals, this opens up a whole range of new possibilities, whether one is conducting research on a particular topic or still in the exploration stage.
We are also looking to close a gap in coverage. Donald Sturgeon at Harvard University has used materials at the Harvard-Yenching Library to build up a corpus of classical Chinese from scans of premodern documents. Projects like the LIVAC Synchronous Corpus cover more contemporary material. There is still relatively little data from the earlier part of the 20th century.
YCW: What types of new research or projects do you see coming out of this corpus?
MDJ and BV: Our goal is to create a massive, machine-readable corpus of periodical literature of the Japanese empire and the early 20th century. During this period, there was an extraordinary range and number of new ideas translated and exchanged from other parts of the world to East Asia. The Imperial Japanese information sphere is huge, and though humans cannot navigate it in its entirety, machines can. There is no corpus of this scope that looks at periodical literature in all Sinograph (Chinese character)-using portions of the empire—China, Taiwan, Korea, Manchuria, and Japan—between 1895 and 1945. This ambition goes well beyond what anyone else has attempted.
One of the exciting things about this time period and medium is that the content of these texts is definitely influenced by, but not completely driven by, the political situation. To this day, many studies of not only the Japanese empire, but also other empires, deal with only the interaction between the metropole and a single colony. One person simply isn’t able to study how all of the colonies communicated with the metropole, and also how all the colonies communicated among themselves. A machine can draw new or otherwise undiscovered connections using textual material and help us find where global geopolitical trends intersect.
YCW: What makes optical character recognition (OCR) technology for Chinese characters, or “Sinographs,” still inadequate?
MDJ and BV: So far, people have not been very effective in programming computers to recognize the range of difference in the ways one renders a character. Everyone who has learned to write Chinese characters has encountered the question: “How far can you go before a character does not resemble itself anymore?” When we work with a messy copy of a newspaper or a film copy that isn’t that good, it might be difficult for even a human to read. These documents are old! Old newspapers use different forms of Chinese characters than we use now, especially in the case of Korean and Japanese texts that use more traditional forms. There is also the problem of layout with newspapers and magazines, which mix articles, illustrations, and advertisements. Before a computer can process text, it has to identify what makes up the text in the first place. Trying to teach a computer these variations has been a challenge. The work has, however, been getting better for modern Chinese, and even Adobe Acrobat can do a good job. It is just much harder for anything published before 1950.
Part of the reason for the slow progress is related to Thomas Mullaney’s findings on why Remington and other companies were not willing to work on Chinese typewriters in the 19th century—where is the money? Creating OCR that can handle magazines and newspapers from the Japanese imperial period is an expensive challenge—who is going to use it? It’s going to take an academic research team to deal with the problem because a company like Google probably doesn’t consider it a priority.
The computing world was created with the English language in mind, but recently there has been notable innovation coming from those who have been forced to create new solutions. For example, predictive typing is something that came from China and Japan, not the West. Chinese tech companies could potentially make huge strides to develop this technology, but we have not yet seen a heavy focus on historical documents, or indeed the wider Sinograph-sphere. For them, historical documents are a much lower priority; they are interested in contemporary materials more than anything else.
YCW: What kinds of experts need to be on the team to get everything done?
MDJ and BV: We of course need programmers, especially experts in computer vision, the field that yields OCR software that enables text mining and analysis. We’re interested in working with linguists, as we would like to do cross-language analysis, which is still in its infancy for East Asian languages. We plan to employ historians and literature scholars, as well. So far, we have received encouragement to push forward from granting entities such as universities and foundations.
At these early stages, we are focused on recruiting experts in data structure and machine learning. We are first creating a smaller cross-section of what we hope the larger project will be—several issues of a periodical, perhaps. We will then recruit machine-learning experts with interest in OCR and page segmentation, and set them loose on this sample corpus to see if we can get close to solving the problem. This could be in the form of a challenge or competition for those experts or those who would like to hone their skills. Basically, we are presenting them with a problem and a set of data, and asking them to solve it!
YCW: One last question about additional challenges—you’ve spoken about potential copyright issues when working with these periodical materials. Are there other concerns with doing this type of work in today’s media and political environment?
MDJ and BV: One of our plans is to identify partners in China, Japan, and Korea. They would hold the rights to some publications and could, perhaps, help us negotiate with others. There are lots of scholars who want to use this information, and there are institutions that want more people to access their information. The Shanghai Library, for instance, has a large digitized collection and we would love to partner with them to make the text more accessible. We are looking for mutually beneficial ways to work with people and groups that own rights.
There are no legal issues with us generating this data for ourselves, but we need to explore the implications of redistribution. In the meantime, we’re identifying funding sources, and the small “proof of concept” cross-section we mentioned earlier will be crucial for this process. Then, if we are able to make the data more generally available, we would have many different eyes from many different disciplines. We don’t know where it could go, and that’s a good thing!
— Interview by Sarah Yu