The research subject of this project is Data Analytics and Information Retrieval. This REU program will offer a ten-week research program for ten undergraduate students during summer semesters. Faculty-student interaction, as well as interaction among students, will take different forms such as meetings, seminars, tutorials, workshops, and field trips. The REU program will allow a diverse pool of undergraduate students to experience cutting-edge research that will help them become self-reliant in STEM research. Students will gain valuable research skills that will prepare them for their future fields of study, and their exposure to research will help them compete for positions in high-technology fields in an innovative job market. The research experience will also motivate them to continue on to graduate studies. The REU project will also provide students an opportunity to collaborate with their faculty mentors and student peers across the nation after the summer program. The sample research projects listed below cover open research topics in data analytics and information retrieval. The actual research projects will be decided by students and their faculty mentors.

Data Analytics of a Large Volume of Biomedical Images. This project includes two closely related research components: auto-labeling and image classification. The images are cytopathology images collected by PI Ding, and the image collection will be available for our research study. In addition, more cytopathology and other biomedical images are available online. (1) Auto-labeling of cytopathology images. The resolution of a cytopathology image is normally too high (e.g., 2048x1024 pixels) to be handled directly by a deep learning based classification model. In addition, each image usually contains many cells. To build a deep learning classifier, it is therefore necessary to segment the individual cells in each image, so that the classifier can work on the segmented cell images directly. Because deep learning based classification tolerates noisy data well, traditional segmentation algorithms produce acceptable samples even if some samples include overlapping cells or only part of a cell. However, the disease type is labeled for a cytopathology image as a whole, not for the individual cells in it: some of the cells in a labeled image are diseased, but others may be normal. Therefore, it is necessary to re-label each segmented cell. Manually labeling these individual cell images is infeasible because it is labor-intensive and pathologists are rarely available. In this project, students will study approaches for auto-labeling these images based on semi-supervised learning. (2) Automated classification of cytopathology images. The ultimate goal of analyzing cytopathology images is to correctly classify each image by its disease type. The automated classification of medical images such as X-ray images using deep learning has been reported to reach accuracy comparable to that of medical specialists.
However, building a high-quality deep learning classifier for medical images is a grand challenge due to the lack of labeled samples of sufficient quality and quantity. In this research component, we will study how to build a deep learning classifier for cytopathology images from a limited number of labeled samples, or from samples of which a high percentage are incorrectly labeled.
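One way the semi-supervised auto-labeling in component (1) might be prototyped is self-training with pseudo-labels. The sketch below is illustrative only and is not the project's method: it assumes each segmented cell has already been reduced to a small feature vector, it uses a nearest-centroid rule as a stand-in for a deep model, and the names `self_train` and `margin`, as well as the toy labels, are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def self_train(labeled, unlabeled, margin=0.5, max_rounds=10):
    """Pseudo-label unlabeled vectors by iterative nearest-centroid self-training.

    labeled:   list of (feature_vector, label) pairs
    unlabeled: list of feature vectors
    margin:    hypothetical confidence rule; a sample is accepted only if its
               nearest centroid is closer than the runner-up by at least this much
    Returns a dict mapping the index of each accepted unlabeled sample to its label.
    """
    labeled = list(labeled)  # work on a copy; accepted samples are appended
    pseudo = {}
    for _ in range(max_rounds):
        # Recompute one centroid per class from all currently labeled samples.
        by_class = {}
        for x, y in labeled:
            by_class.setdefault(y, []).append(x)
        centroids = {y: centroid(xs) for y, xs in by_class.items()}

        added = False
        for i, x in enumerate(unlabeled):
            if i in pseudo:
                continue
            dists = sorted((math.dist(x, c), y) for y, c in centroids.items())
            if len(dists) > 1 and dists[1][0] - dists[0][0] >= margin:
                pseudo[i] = dists[0][1]
                labeled.append((x, dists[0][1]))
                added = True
        if not added:  # stop once no new sample is confident enough
            break
    return pseudo


# Toy usage: two labeled cells and three unlabeled ones.
seed = [([0.0, 0.0], "normal"), ([4.0, 4.0], "disease")]
cells = [[0.5, 0.2], [3.8, 4.1], [2.0, 2.0]]
pseudo_labels = self_train(seed, cells)
```

In this toy run the ambiguous sample midway between the two classes stays unlabeled, which is the intended behavior: low-confidence cells are left for later rounds or for manual review rather than being forced into a class.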

Content-Based Image Retrieval of Large-Scale Image Datasets. Content-based image retrieval is the process of searching for similar images based on analysis of image content, which associates pixel-level information with image semantics. Image representation and similarity measurement are the two critical tasks of image retrieval. Deep convolutional neural networks (CNNs), which learn image features directly, provide a new way to build better image retrieval systems. However, it is infeasible to compare the similarity of two images directly based on their CNN-learned features, since these features are high-dimensional and subject to the ``curse of dimensionality''. Another challenge of image retrieval is the high computational cost of searching a large-scale image dataset. Hashing-based methods for Approximate Nearest Neighbor (ANN) search have been proposed for speedup. The basic idea is to project the high-dimensional features to a lower-dimensional space and produce compact hash codes. Image retrieval can then be implemented by hash code matching or Hamming distance measurement. In this research, we will use deep CNNs to produce binary codes from learned features. The image retrieval consists of a coarse-level search to find candidate images and a fine-level search to return the final result from the candidate images.
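The coarse-to-fine search described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's implementation: it assumes a hash layer whose activations lie in [0, 1] and are binarized at 0.5, and the function names, the `database` tuple layout, and the toy vectors are all hypothetical.

```python
import math


def to_code(activations, threshold=0.5):
    """Binarize hash-layer activations and pack the bits into one integer."""
    code = 0
    for a in activations:
        code = (code << 1) | (1 if a >= threshold else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two integer hash codes."""
    return bin(a ^ b).count("1")


def retrieve(query_act, query_feat, database, n_candidates=2, top_k=2):
    """Coarse search by Hamming distance, then fine search by Euclidean distance.

    database: list of (image_id, hash_activations, feature_vector) tuples.
    A real system would precompute and index the codes instead of
    recomputing them for every query as this sketch does.
    """
    q = to_code(query_act)
    # Coarse level: keep the candidates whose binary codes are closest.
    coarse = sorted(database, key=lambda item: hamming(q, to_code(item[1])))
    coarse = coarse[:n_candidates]
    # Fine level: re-rank only the candidates using the full feature vectors.
    fine = sorted(coarse, key=lambda item: math.dist(query_feat, item[2]))
    return [item[0] for item in fine[:top_k]]


# Toy usage with 4-bit codes and 2-dimensional features.
database = [
    ("a", [0.9, 0.1, 0.8, 0.2], [1.0, 0.0]),
    ("b", [0.8, 0.2, 0.7, 0.9], [0.9, 0.2]),
    ("c", [0.1, 0.9, 0.2, 0.8], [0.0, 1.0]),
]
results = retrieve([0.95, 0.05, 0.9, 0.1], [0.95, 0.05], database)
```

The design point is that the expensive high-dimensional comparison runs only on the short candidate list, while the pass over the whole collection uses cheap bitwise operations.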

Information Retrieval of Large-Scale Text Collections. Information retrieval (IR) of texts aims to find documents or webpages in large-scale text collections that meet users’ information needs. An IR system typically includes the following functional components: document processing/indexing, query processing, matching using information retrieval models, and re-ranking of results. Machine learning approaches, including neural models, have been widely used in different components of an IR system to improve its performance. In this project, we will study the application of various machine learning and knowledge discovery approaches to build two types of IR systems: a high-performance precision medicine IR system, and a cross-language information retrieval (CLIR) system for low-resource language text collections. (1) Re-ranking with machine learning approaches for Precision Medicine Information Retrieval. Precision medicine is an approach in healthcare that deals with evidence-based treatment and prevention of diseases customized to the individual patient. Precision Medicine Information Retrieval (PMIR) specifically deals with the challenges of developing IR systems that retrieve relevant cancer treatment literature using a patient’s genomic and demographic information. At TREC PMIR, participants were challenged to design a system to perform two IR tasks for a common set of patient cases, referred to as topics: one to retrieve articles relevant to cancer prognosis, treatment, or prevention from a collection of scientific abstracts, and the other to retrieve eligible clinical trials from a collection of clinical trials. These topics mainly contain information about the type of cancer, along with genomic and demographic information. The UNT Intelligent Information Access Lab has participated in TREC for three years with good progress.
We will continue this research with information fusion of different approaches for re-ranking retrieval results to improve IR performance. We will also explore the modeling of human precision medicine knowledge as a re-ranking process. (2) Cross-language Information Retrieval for low-resource languages. The goal of Cross-Language Information Retrieval (CLIR) is to find relevant information written in a language different from that of the user’s queries. CLIR is important because there are more than 6,000 languages worldwide, and many web pages and documents are written in languages other than English. CLIR is expected to remove language barriers to information and knowledge sharing in users’ work and daily lives. Effective CLIR services are especially desired by digital libraries and multinational corporations. This project aims to investigate effective and efficient computational methods to locate content in “documents” (speech or text) in low-resource languages, using English queries. This capability is one of several expected to ultimately support effective triage and analysis of large volumes of data in a variety of less-studied languages. We would like to explore the use of machine learning and data analytics to understand new languages, to perform machine translation of the queries, and to re-rank retrieval results.
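For the fusion-based re-ranking mentioned in component (1), one simple and widely used baseline is reciprocal rank fusion (RRF), which merges several ranked lists using only rank positions. The text above does not say which fusion method will be used, so RRF is chosen here purely as an illustration; the document IDs are made up, and `k=60` is the constant commonly used with RRF rather than a value from this project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); documents are returned by descending total score,
    with ties broken by document ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy usage: three hypothetical retrieval runs for the same topic.
runs = [
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d3", "d1"],
]
fused = reciprocal_rank_fusion(runs)
```

Because RRF ignores the raw scores of the individual runs, it can combine systems whose scores are not comparable, for example a lexical ranker and a neural re-ranker.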

Social Media Information Retrieval for Disaster Research. Social media data is an increasingly important source of textual information, especially for monitoring and analyzing rapidly developing situations such as natural disasters and other crisis events. Most social media platforms offer some search functionality, facilitating the retrieval of large numbers of posts potentially related to disaster events. Such retrieved datasets often contain as much irrelevant material as relevant material, posing a challenge for first responders, aid workers, decision makers, members of the public, and researchers seeking to learn from what people post in crisis situations. In this project, we will study the retrieval, processing, and analysis of crisis-related Twitter data. (1) Retrieval and processing of crisis-related Twitter data. The goal of this component is to produce archivable, shareable, linguistically preprocessed versions of Twitter datasets related to specific crisis events. Using tools for both archival retrieval and streaming retrieval, we will harvest large collections of microblog posts (at least 500K tweets per event) for both past and current events. The current events selected need not be crisis events; the same methods and challenges are relevant for other events as well, e.g., many national- or international-scale political, cultural, or societal events. We will then employ a range of filtering techniques on the harvested data, aiming to develop standards for identifying informative tweets; some standard filters are language, length, and status as a retweet. We will additionally employ methods for removing posts from bots and other types of spam. Finally, the data will be subjected to linguistic preprocessing, including tokenization, part-of-speech tagging, and dependency parsing, using Twitter-specific toolkits for natural language processing. (2) Content-based filtering and analysis of crisis Twitter data.
The goal of this component is to develop tools for automatic filtering, classification, and analysis of crisis-related Twitter data. Even after informative tweets have been identified, the datasets collected in (1) will contain both event-relevant and event-irrelevant tweets, as well as tweets addressing different aspects of the target event or crisis. Further filtering and automated analysis will make the data more usable both for human analysis and as input for machine learning or other computational systems.
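The standard filters named in component (1), namely language, length, and retweet status, might look like the following in a first pass over harvested data. This is a rough sketch under assumptions: tweets are represented as plain dictionaries, and the field names `text`, `lang`, and `is_retweet`, along with the `min_tokens` threshold, are hypothetical and do not reflect any particular Twitter API schema.

```python
def passes_basic_filters(tweet, lang="en", min_tokens=4):
    """Return True if a tweet survives the language, retweet, and length filters."""
    text = tweet.get("text", "")
    if tweet.get("lang") != lang:
        return False
    # Treat both an explicit flag and the conventional "RT @" prefix as retweets.
    if tweet.get("is_retweet") or text.startswith("RT @"):
        return False
    # Whitespace tokenization is a crude length proxy; a Twitter-specific
    # tokenizer would be used in the later preprocessing stage.
    return len(text.split()) >= min_tokens


def filter_tweets(tweets, **kwargs):
    """Keep only the tweets that pass all basic filters."""
    return [t for t in tweets if passes_basic_filters(t, **kwargs)]


# Toy usage with four made-up posts.
tweets = [
    {"text": "Bridge on Main St is flooded, avoid the area", "lang": "en", "is_retweet": False},
    {"text": "RT @user: Bridge on Main St is flooded", "lang": "en", "is_retweet": True},
    {"text": "Stay safe", "lang": "en", "is_retweet": False},
    {"text": "Inundaciones en el centro de la ciudad", "lang": "es", "is_retweet": False},
]
kept = filter_tweets(tweets)
```

Filters of this kind only narrow the data to candidate informative tweets; separating event-relevant from event-irrelevant posts is the job of the content-based methods in component (2).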

We will experiment with a range of approaches to filtering, including: (1) neural methods (CNNs) for classifying relevant and irrelevant tweets \cite{nguyen2017robust}; (2) training and deployment of general crisis-related word embeddings (i.e., not tuned to any specific crisis or event) for clustering of tweets; (3) neural methods for sentiment analysis of Twitter data; and (4) exploitation of emergency-management-related lexicons for tweet classification. With each of these methods, we will additionally experiment with incorporating linguistic features from the previous preprocessing steps, in order to better understand the role of linguistic information in crisis tweet classification. The cross-classification and analysis methodologies developed in this project will support the use of the datasets for further research.
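Of the four approaches, the lexicon-based one in (4) is the simplest to illustrate. The sketch below flags a tweet as potentially relevant when it contains at least one lexicon term. It is an assumption-laden toy: `CRISIS_LEXICON` is a tiny made-up word list rather than a real emergency-management lexicon, and the regular-expression tokenizer stands in for the Twitter-specific toolkits mentioned earlier.

```python
import re

# Hypothetical mini-lexicon for illustration only.
CRISIS_LEXICON = {
    "flood", "flooding", "evacuate", "evacuation",
    "shelter", "rescue", "damage",
}


def tokenize(text):
    """Lowercase the text and split it into word-like tokens, keeping # and @."""
    return re.findall(r"[a-z0-9#@']+", text.lower())


def lexicon_hits(text, lexicon=CRISIS_LEXICON):
    """Return the sorted lexicon terms found in the text (hashtags count as words)."""
    tokens = {tok.lstrip("#") for tok in tokenize(text)}
    return sorted(tokens & lexicon)


def is_relevant(text, lexicon=CRISIS_LEXICON, min_hits=1):
    """Classify a tweet as relevant if it contains at least min_hits lexicon terms."""
    return len(lexicon_hits(text, lexicon)) >= min_hits


# Toy usage.
hits = lexicon_hits("Roads closed due to #flooding, shelter open at the high school")
```

The hit list could also serve as a feature alongside the linguistic features from preprocessing, rather than as a standalone classifier.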