The sharing of knowledge acquired by deep learning models has been enabled primarily by sharing architectures, including foundation models and transfer learning techniques. However, sharing learned information without transplanting network connection weights has been more limited, and often relies on human-interpretable communication and assessment of learned representations. The goal of this REU is to push participating students to use deep networks to produce rich data representations and transformations that incorporate acquired knowledge about the concept or term represented by the embedding. Students will internalize the ability to create, share, and apply vector representations that they may not be able to readily interpret themselves. This approach complements direct sharing of neural network weight parameters, and it is more accessible across application domains, flexibly combining knowledge acquired from a variety of networks and data sets. Some sample research projects are listed below:

P1: Location-based universal embeddings (Dr. Ting Xiao)

Customarily, the inference of company characteristics and future outcomes from limited data has relied on comparisons and predictive models for specific industries. However, identifying the specific industry and related companies for large, multinational corporations, or even for small, agile startups, can be challenging. Moreover, relying on industry-specific models limits valuable inferences across industries driven by characteristics such as geographic location or regulatory environment. With a location embedding representing each geographic location, information on related business attributes can be readily leveraged to improve prediction, regardless of industry. Using Federal Information Processing Standards (FIPS) codes for states and counties in the United States, students can create an embedding of FIPS codes using a combination of state-level and county-level codes over a 2-year span.
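To illustrate the kind of location embedding involved, the sketch below builds a dense vector for each of four FIPS county codes from a small county-by-attribute count matrix, then finds each county's nearest peer by cosine similarity. All codes, attribute names, and counts here are invented placeholders for the actual business data the project would use.

```python
import numpy as np

# Toy co-occurrence matrix: rows are hypothetical FIPS county codes, columns
# are business attributes observed in each county (all counts are invented).
fips_codes = ["48121", "48439", "06037", "06059"]   # two TX, two CA counties
attributes = ["tech", "agriculture", "finance", "energy"]
counts = np.array([
    [30.0,  5.0, 10.0, 20.0],   # 48121
    [25.0,  8.0, 12.0, 18.0],   # 48439
    [60.0,  2.0, 40.0,  5.0],   # 06037
    [55.0,  3.0, 35.0,  6.0],   # 06059
])

# Normalize rows, then factor with a truncated SVD: each county's row of
# scaled singular vectors becomes its 2-d embedding.
row_norm = counts / counts.sum(axis=1, keepdims=True)
U, S, Vt = np.linalg.svd(row_norm, full_matrices=False)
embeddings = U[:, :2] * S[:2]          # one 2-d vector per FIPS code

def nearest(code):
    """Return the FIPS code whose embedding is closest (cosine) to `code`."""
    i = fips_codes.index(code)
    v = embeddings[i]
    sims = embeddings @ v / (np.linalg.norm(embeddings, axis=1)
                             * np.linalg.norm(v))
    sims[i] = -np.inf                  # exclude the county itself
    return fips_codes[int(np.argmax(sims))]
```

In a downstream prediction model, these county vectors would simply be concatenated onto each firm's feature vector, letting the model exploit geographic similarity regardless of industry.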

P2: Text classification and sentiment analysis of legal documents (Dr. Junhua Ding)

Legal artificial intelligence (LegalAI) has developed rapidly with the application of natural language processing (NLP) and machine learning in the legal domain. As a core task of LegalAI, legal argument mining aims to automatically extract units of argument or reasoning from legal documents, including court judgments and evidence, with the goal of providing structured data for computational models of argument and for reasoning engines. It includes three steps: (1) court judgment and evidence classification; (2) legal argument unit extraction; and (3) legal argument structure detection. The focus of this research is on the sentiment analysis of legal evidence and the classification of court judgments. In this project, students will apply embedding techniques to text classification and sentiment analysis. They will collect legal cases from websites and learn how to label documents using semi-supervised learning tools based on the Expectation-Maximization (EM) algorithm, EM with BERT, transfer learning, and generative adversarial networks (GANs). Students will also design and experiment with machine learning models for legal sentiment analysis using the labeled data.
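As a minimal illustration of the EM-based semi-supervised labeling step, the sketch below runs EM with a multinomial naive Bayes classifier over a toy bag-of-words corpus: a few labeled documents seed the model, and unlabeled documents receive soft labels that are folded back into re-estimation. The vocabulary, documents, and class meanings are invented; the actual project would use BERT-based models and real case documents.

```python
import numpy as np

# Tiny bag-of-words corpus. Invented vocabulary indices:
# 0 "guilty", 1 "liable", 2 "dismissed", 3 "acquitted", 4 "evidence"
X_lab = np.array([[2, 0, 0, 0, 1],     # class 0 judgments
                  [0, 2, 0, 0, 1],
                  [0, 0, 2, 1, 1],     # class 1 judgments
                  [0, 0, 1, 2, 1]], dtype=float)
y_lab = np.array([0, 0, 1, 1])
X_unl = np.array([[1, 1, 0, 0, 2],     # unlabeled court documents
                  [0, 0, 1, 1, 2],
                  [3, 0, 0, 0, 1],
                  [0, 0, 0, 2, 1]], dtype=float)

def fit_nb(X, R):
    """M-step: class priors and word probabilities from responsibilities R."""
    prior = R.sum(axis=0) / R.sum()
    word = (R.T @ X) + 1.0                       # Laplace smoothing
    word /= word.sum(axis=1, keepdims=True)
    return prior, word

def posterior(X, prior, word):
    """E-step: soft class assignments under the multinomial NB model."""
    log_p = np.log(prior) + X @ np.log(word).T
    log_p -= log_p.max(axis=1, keepdims=True)    # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

# Initialize from labeled data only, then run EM over labeled + unlabeled.
R_lab = np.eye(2)[y_lab]
prior, word = fit_nb(X_lab, R_lab)
for _ in range(10):
    R_unl = posterior(X_unl, prior, word)        # E-step on unlabeled docs
    prior, word = fit_nb(np.vstack([X_lab, X_unl]),
                         np.vstack([R_lab, R_unl]))   # M-step on everything

labels = posterior(X_unl, prior, word).argmax(axis=1)
```

Keeping the labeled responsibilities fixed at their true one-hot values anchors the class identities while the unlabeled documents sharpen the word distributions.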

P3: Personalized gesture recognition for communication via transfer learning (Dr. Mark V. Albert)

This project is one piece of a broader research effort in Dr. Albert’s lab to develop a fast and flexible wearable gesture recognition system and establish a new paradigm for non-verbal communication technology. Complex gesture recognition systems exist for shared, standardized movements such as sign language; however, many individuals with cerebral palsy and other motor impairments lack both the ability to speak and the ability to perform fine motor movements consistently. Contemporary strategies to create tailored gesture recognition for these individuals include few-shot training of subject-specific gestures; Dr. Albert’s lab is exploring the use of embedding strategies to both speed processing and enhance generalization in similar transfer learning contexts. Critically, embeddings combined with a hierarchical clustering approach applied to a large dataset of acquired gesture data will enable near real-time estimates of recognition efficacy, essential for exploring gesture vocabularies in practice. This particular research effort will rely on previously acquired and publicly available gesture recognition data sets so that students can engage readily and share strategies freely. Though not part of this specific effort, the results will later be adapted and applied to clinical data sets for proper testing and validation through Dr. Albert’s funded clinical collaborations. By leveraging deep automated feature learning through self-supervised autoencoder embeddings on a large corpus of acquired movements, fast and flexible gesture-based communication can be achieved at a speed amenable to social contexts.
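The self-supervised embedding step described above can be sketched with a small linear autoencoder trained by gradient descent on synthetic "gesture windows"; each window's hidden activation becomes its reusable embedding. The window length, embedding size, templates, and training settings here are invented stand-ins for real wearable sensor data and the deeper networks the project would use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic gesture windows: 200 accelerometer windows of 16 samples each,
# generated from 3 underlying movement templates plus noise (stand-ins for
# real wearable data in this sketch).
templates = rng.normal(size=(3, 16))
which = rng.integers(0, 3, size=200)
X = templates[which] + 0.1 * rng.normal(size=(200, 16))

# Linear autoencoder: 16 -> 3 -> 16, trained with plain gradient descent.
W_enc = rng.normal(size=(16, 3)) * 0.1
W_dec = rng.normal(size=(3, 16)) * 0.1

def loss_and_grads(X, W_enc, W_dec):
    H = X @ W_enc                      # 3-d embedding of each window
    X_hat = H @ W_dec                  # reconstruction
    err = X_hat - X
    loss = (err ** 2).mean()
    d_out = 2 * err / err.size         # d(loss)/d(X_hat)
    g_dec = H.T @ d_out
    g_enc = X.T @ (d_out @ W_dec.T)
    return loss, g_enc, g_dec

first_loss = None
for _ in range(800):
    loss, g_enc, g_dec = loss_and_grads(X, W_enc, W_dec)
    if first_loss is None:
        first_loss = loss
    W_enc -= 0.5 * g_enc
    W_dec -= 0.5 * g_dec

embedding = X @ W_enc                  # per-window representation to share
```

The resulting `embedding` matrix is what downstream steps (such as hierarchical clustering of candidate gestures) would consume, without ever needing the raw sensor traces.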

P4: Development and application of Stock2Vec (Dr. Zinat Alam)

Building predictive models for firms has traditionally relied on inference from historical data on firms in the same industry sector. However, firms can be similar across a variety of dimensions that should be incorporated and leveraged in relevant prediction problems. This is particularly true for large, complex organizations, which may not be well defined by a single industry and may have no clear peers. To enable prediction using company information across a variety of dimensions, we propose to create an embedding of company stocks, Stock2Vec. Stock2Vec can be easily added to any prediction model for public companies with stock price information. Students will learn how to create a rich vector representation from stock price fluctuations and characterize what its dimensions represent. Next, they will conduct comprehensive experiments to evaluate this embedding in applied machine learning problems in various business contexts.
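One simple way to derive stock vectors from price fluctuations, sketched below, is to factor the return correlation matrix so that stocks that co-move end up close in embedding space. The tickers, factor loadings, and returns are all synthetic inventions; a real Stock2Vec would be trained on actual price histories and likely with a richer model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic daily returns for six hypothetical tickers driven by two latent
# factors (a stand-in for real price data in this sketch).
tickers = ["AAA", "BBB", "CCC", "DDD", "EEE", "FFF"]
loadings = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                     [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])
factors = rng.normal(size=(250, 2))            # ~1 trading year
returns = factors @ loadings.T + 0.05 * rng.normal(size=(250, 6))

# Stock2Vec sketch: factor the return correlation matrix; each stock's row
# of scaled singular vectors becomes its embedding.
corr = np.corrcoef(returns, rowvar=False)
U, S, _ = np.linalg.svd(corr)
stock_vec = U[:, :2] * np.sqrt(S[:2])

def peers(ticker, k=2):
    """k nearest stocks in embedding space (Euclidean distance)."""
    i = tickers.index(ticker)
    d = np.linalg.norm(stock_vec - stock_vec[i], axis=1)
    order = [j for j in np.argsort(d) if j != i]
    return [tickers[j] for j in order[:k]]
```

Because the vectors are just fixed-length numeric features, `stock_vec` rows can be appended to any firm-level prediction model's inputs, exactly as the paragraph above envisions.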

P5: Natural language processing applied to law and justice (Dr. Haihua Chen)

Over the last few decades, legal artificial intelligence (LegalAI) has developed rapidly with the application of natural language processing (NLP) and machine learning in the legal domain. LegalAI tasks mainly include legal argument mining, judgment prediction, court view generation, legal entity recognition, legal question answering, and legal summarization. Word embeddings and language models, among the most powerful tools in NLP, have also been applied to LegalAI. However, the application of vector embeddings in LegalAI is still at an early stage, and the vector embeddings in the existing literature have not shown the same effectiveness as in other domains (such as medicine and social media). One reason is that legal language is more complicated: multiple languages, long texts, discourse structure, specialized terminology, and more. Therefore, different vector embeddings need to be generated for different languages and different tasks. Meanwhile, legal knowledge can also be injected into the vector embeddings to improve their semantic representation. In this project, we will train students to produce tailored or joint vector embeddings for various legal NLP tasks, thereby accelerating LegalAI research and applications.
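One established way to inject structured knowledge into pre-trained vectors is retrofitting: iteratively pulling each vector toward its neighbors in a knowledge graph while keeping it close to its original position. The sketch below applies that idea to invented vectors and ontology edges; it is only one possible instantiation of the knowledge-injection step mentioned above, not the project's prescribed method.

```python
import numpy as np

# Toy pre-trained vectors for legal terms (values invented for illustration).
vecs = {
    "tort":       np.array([1.0, 0.0]),
    "negligence": np.array([0.0, 1.0]),
    "contract":   np.array([-1.0, 0.0]),
}
# Hypothetical legal-ontology edges: terms a knowledge base links as related.
edges = {"tort": ["negligence"], "negligence": ["tort"], "contract": []}

def retrofit(vecs, edges, alpha=1.0, iters=10):
    """Pull each vector toward its ontology neighbors while staying close
    to its original (distributional) position."""
    new = {w: v.copy() for w, v in vecs.items()}
    for _ in range(iters):
        for w, nbrs in edges.items():
            if not nbrs:
                continue                    # unconnected terms are unchanged
            nbr_sum = sum(new[n] for n in nbrs)
            new[w] = (alpha * vecs[w] + nbr_sum) / (alpha + len(nbrs))
    return new

fitted = retrofit(vecs, edges)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

After retrofitting, ontology-linked terms like "tort" and "negligence" are measurably closer, while terms with no knowledge-base connections keep their original vectors.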

P6: Estimating phonemes in historical languages using autoencoders (Dr. Frederik Hartmann)

Traditional historical linguistics lacks empirical means to assess its assumptions regarding the phonetic systems of past languages and language stages, since most current methods rely on comparative tools to gain insights into the phonetic features of sounds in proto- or ancestor languages. Dr. Hartmann’s previous work has applied deep neural networks to predict the phonetic features of historical sounds. This is particularly challenging because reconstructed historical phonetic features are highly interdependent in subtle combinations, requiring reconstructions to incorporate known or established phonetics while jointly estimating various unknown, interdependent features. The method utilizes the principles of coarticulation, local predictability, and statistical phonological constraints to predict the phonetic features of a sound from the features of its immediate phonetic environment. Previous work validated this method using New High German phonetic data and demonstrated its application to diachronic linguistics in a case study of the phonetic system of Proto-Indo-European. Many more applications become possible by standardizing the approach using state-of-the-art deep autoencoders for the reconstruction of missing phonetics.
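The core idea of predicting a sound's features from its immediate phonetic environment can be sketched with a least-squares stand-in for the deep model: here a masked voicing feature is predicted from the features of the neighboring sounds under an invented assimilation rule. The corpus, feature inventory, and rule are all synthetic; the actual work uses richer feature sets and autoencoder architectures.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy corpus of phoneme triples described only by a binary [voiced] feature.
# Invented assimilation rule: the middle sound agrees in voicing with the
# following sound, regardless of the preceding one.
prev_v = rng.integers(0, 2, size=300)
next_v = rng.integers(0, 2, size=300)
mid_v = next_v.copy()                      # voicing assimilation

# Predict the (possibly unattested) middle feature from its environment,
# in the spirit of reconstructing historical phonetics from context.
X = np.column_stack([prev_v, next_v, np.ones(300)])
w, *_ = np.linalg.lstsq(X, mid_v.astype(float), rcond=None)

def predict(prev_feat, next_feat):
    """Estimated probability-like score that the middle sound is voiced."""
    return float(np.array([prev_feat, next_feat, 1.0]) @ w)
```

The fitted weights recover the dependency structure of the toy rule (all weight on the following sound), mirroring how the full model exploits coarticulation and local predictability.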

P7: Deep autoencoders and transfer learning for estimation of quantum bits from crystallographic defects (Dr. Yuanxi Wang)

Realistic crystalline solids are never perfect. Missing or out-of-place atoms, collectively known as crystallographic point defects, are ubiquitous in the otherwise ordered atomic lattices that make up solids. Depending on the application, point defects can be undesirable in materials engineering when they degrade material properties (e.g., transport or optical response), and desirable when they endow their hosts with new functionalities (e.g., doping and hardening). In both cases, predictive modeling efforts have sought to accurately predict defect properties so that their utility in a solid can be determined ahead of experiments. The recent establishment of computational databases cataloging >500 calculated defect properties for oxides provides new opportunities for employing deep learning methods to accelerate the theoretical modeling of defects. Yet these databases are still insufficient in size to take full advantage of deep learning methods. Here we propose to combine small datasets of defect properties in oxides with large datasets of pristine solid formation energies to improve the prediction accuracy of defect formation energies. Specifically, we will adopt transfer learning to leverage models pre-trained on large datasets (~10,000 crystal structures) and re-train accurate models on the smaller defect dataset to predict defect formation energies. For comparison, other common machine learning algorithms will also be applied without transfer learning to establish a baseline. The final trained model will allow for efficient prediction of the stability of defects in oxides.
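The transfer-learning recipe can be sketched as follows, with PCA on a large synthetic descriptor set standing in for the pre-trained encoder and a linear head fine-tuned on a small synthetic "defect" set. The descriptor dimensions, latent factors, and energies are all invented; the real project would pre-train deep networks on actual crystal-structure databases.

```python
import numpy as np

rng = np.random.default_rng(3)

# "Large" dataset: 10,000 synthetic crystal descriptor vectors whose 20
# dimensions are driven by 3 latent structural factors (invented stand-ins
# for real features such as composition and lattice parameters).
mix = rng.normal(size=(3, 20))
latents_big = rng.normal(size=(10_000, 3))
X_big = latents_big @ mix + 0.01 * rng.normal(size=(10_000, 20))

# Pre-training step: learn a 3-d representation from the large dataset
# (PCA here as a linear stand-in for a pre-trained network's encoder).
mean_big = X_big.mean(axis=0)
_, _, Vt = np.linalg.svd(X_big - mean_big, full_matrices=False)
encoder = Vt[:3].T                           # frozen 20 -> 3 map

# "Small" defect dataset: 40 examples whose formation energy depends on the
# same latent factors (coefficients invented).
w_true = np.array([1.5, -2.0, 0.7])
latents_small = rng.normal(size=(40, 3))
X_small = latents_small @ mix + 0.01 * rng.normal(size=(40, 20))
y_small = latents_small @ w_true

# Fine-tuning step: fit only a linear head on top of the frozen encoder.
H = (X_small - mean_big) @ encoder
design = np.column_stack([H, np.ones(40)])
head, *_ = np.linalg.lstsq(design, y_small, rcond=None)

pred = design @ head
rmse = float(np.sqrt(np.mean((pred - y_small) ** 2)))
```

Because the frozen encoder already captures the shared latent structure, forty labeled defect examples suffice to fit an accurate head, which is the essential leverage transfer learning provides over training from scratch on the small set.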

P8: Data driven/deep learning analysis for environmental studies using earth observation data (Dr. Lu Liang)

Remote sensing is a modern, cost-effective technique for retrieving large-scale, long-term landscape change information for a better understanding of human-environment interactions. This project will utilize very high-resolution (1 m) aerial imagery and deep learning methods for the automatic classification of agricultural irrigation techniques in the US. Students will have diverse opportunities to apply their knowledge and skills to real-world large datasets to solve emerging critical issues related to the water-food-energy nexus. Example projects include: testing the scalability of deep neural algorithms on multiple data sources collected at different geographic locations, times, and resolutions; and developing a new algorithm that can detect irrigation types on landscapes with complicated patterns (e.g., rugged mountains or images with shadows).
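A first step in any such pipeline is cutting imagery into patches and classifying each one; the sketch below does this on invented rasters, with a simple texture-variance rule standing in for the trained deep classifier. The two "irrigation" textures, patch size, and threshold are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

def extract_patches(image, size):
    """Cut an image into non-overlapping size x size patches."""
    h, w = image.shape
    patches = []
    for r in range(0, h - size + 1, size):
        for c in range(0, w - size + 1, size):
            patches.append(image[r:r + size, c:c + size])
    return np.stack(patches)

# Synthetic 1 m imagery: "sprinkler" fields are smooth, "flood" fields show
# high-contrast row furrows (invented textures for illustration).
sprinkler = 0.5 + 0.02 * rng.normal(size=(64, 64))
furrows = 0.5 + 0.2 * np.tile([[1.0], [-1.0]], (32, 64))
flood = furrows + 0.02 * rng.normal(size=(64, 64))

def classify(patch):
    """Texture-variance rule standing in for a trained deep classifier."""
    return "flood" if patch.std() > 0.1 else "sprinkler"

labels_sprinkler = [classify(p) for p in extract_patches(sprinkler, 16)]
labels_flood = [classify(p) for p in extract_patches(flood, 16)]
```

In the actual project the hand-written rule would be replaced by a convolutional network, but the patch-extraction scaffolding transfers directly to real aerial tiles.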

P9: Machine learning and embedding applications in glass materials research (Dr. Jincheng Du)

The conventional way of designing new glasses has been empirical, relying on expensive, inefficient trial-and-error for discovery. These constraints limit the speed at which new compositions and their properties can be explored in the glass industry. However, machine learning can be leveraged to predict material properties prior to manufacture. In this project, students will explore how embedding techniques can be utilized to improve prediction of the kinetic properties of glasses from their chemical composition and from properties derived from molecular dynamics simulations. Identifying kinetic properties such as the glass transition temperature (Tg) and density (ρ) is a key step in glass composition design. Both properties have a significant impact on the quality of the eventual product and will define the functionality of the glass. The procedure used in this study can be extended to predict other macroscopic properties as a function of glass composition. Such an approach can help accelerate the design of novel glasses with optimized properties.
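A natural baseline for composition-to-property prediction is an additive linear model, sketched below. The mole fractions are invented, and the Tg values are generated from an assumed per-oxide rule purely to demonstrate the regression mechanics; they are not measured data, and the project's embedding-based models would replace this baseline.

```python
import numpy as np

# Invented compositions (mole fractions of SiO2, Na2O, CaO).
comp = np.array([[0.75, 0.15, 0.10],
                 [0.70, 0.20, 0.10],
                 [0.70, 0.15, 0.15],
                 [0.65, 0.25, 0.10],
                 [0.65, 0.20, 0.15],
                 [0.60, 0.25, 0.15]])

# Synthetic Tg values (K) generated from an assumed additive per-oxide rule;
# the coefficients are illustrative, not physical constants.
true_coef = np.array([1000.0, 400.0, 550.0])
tg = comp @ true_coef

# Additive-model baseline: Tg as a linear blend of per-oxide contributions.
coef, *_ = np.linalg.lstsq(comp, tg, rcond=None)

def predict_tg(fractions):
    """Predicted glass transition temperature for a new composition."""
    return float(np.asarray(fractions) @ coef)
```

Because the synthetic data are exactly additive, the fit recovers the generating coefficients; on real data, the gap between this baseline and embedding-based models is precisely what the students would measure.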

P10: Developing ML-based methods to identify horizontally transferred genomic islands in Staphylococcus aureus strains (Dr. Rajeev Azad)

Staphylococcus aureus is a versatile pathogen that can cause infections in both humans and animals. S. aureus has been reported to become resistant to commonly used antibiotics through its adaptation to the modern hospital environment by the horizontal acquisition of drug resistance genes. This project will use ML methods and embedding techniques to decipher horizontally acquired large genomic regions, namely genomic islands (GIs), in methicillin-resistant S. aureus strains, and will classify the subsets of GIs carrying virulence and resistance genes as pathogenicity and resistance islands, respectively. ML-based gene clustering methods will be exploited to identify horizontally transferred genes. The gene clustering approach will also be used to identify novel (as yet unreported) islands that harbor virulence and/or resistance genes. Further, the complementary strengths of supervised and unsupervised machine learning techniques will be explored to robustly identify genomic structures acquired through horizontal gene transfer. The outcomes of this project may provide valuable information on the evolution of drug resistance in S. aureus.
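The unsupervised clustering idea can be sketched on simulated sequences: horizontally acquired genes often carry an atypical compositional signature, so clustering genes by composition separates candidate island genes from the native background. Below, GC content and a one-dimensional 2-means clustering stand in for the richer k-mer signatures and clustering methods the project would actually use; all sequences are simulated.

```python
import numpy as np

rng = np.random.default_rng(5)

def random_gene(gc, length=600):
    """Random sequence with a target GC content."""
    probs = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]   # A, C, G, T
    return "".join(rng.choice(list("ACGT"), size=length, p=probs))

# Toy genome: 30 native genes (GC ~ 0.33, roughly as in S. aureus) and 6
# genes from a hypothetical horizontally acquired island with atypical GC.
genes = [random_gene(0.33) for _ in range(30)] + \
        [random_gene(0.55) for _ in range(6)]

def gc_content(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)

# One-dimensional 2-means clustering on the compositional signature.
x = np.array([gc_content(g) for g in genes])
centers = np.array([x.min(), x.max()])
for _ in range(20):
    assign = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
    centers = np.array([x[assign == k].mean() for k in (0, 1)])

island = [i for i, a in enumerate(assign) if a == 1]   # atypical-GC cluster
```

Real analyses would replace the single GC statistic with tetranucleotide or codon-usage vectors, but the clustering logic, flagging the compositionally atypical group as candidate GI genes, is the same.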