Title: Simi Bot (Text Similarity Analyzer) - Best Paper

Authors: Chen, Wukun (Eric)
Keywords: TF-IDF, R, Text Similarity, text clustering
Issue Date: 17-Jun-2021
Abstract: The purpose of this project is to develop an application that performs TF-IDF text similarity scoring analysis for the NYU School of Professional Studies and its Management and Systems program (MASY). The application is written in the R programming language and hosted on the shinyapps.io server. This text data mining application features topic clustering, keyword extraction, machine learning, cloud computing, and a Shiny-based user experience. It allows users to customize unsupervised machine learning hyperparameters and upload files from their local machine. After uploading a .txt file (comparison source) and a .csv file (comparison target), users choose the number of clusters for the text cluster analysis (from 2 to 20), the number of most frequent words to display for each text cluster (from 2 to 20), and the level of word combinations (from 1 to 3). When a user clicks the “Analyze Data” button after all the hyperparameters are set, the application generates two data tables showing the similarity scores, the cluster groups, the size of each cluster group, and the keywords in each cluster group. The underlying algorithm proceeds as follows: cleanse the source and target text by dropping all non-alphabetical characters, collapsing repeated spaces, and lemmatizing all words; apply the TF-IDF transformation and compute a similarity score between the source and each target; cluster the targets with a hierarchical method; calculate the mean similarity score of each cluster to identify the cluster with the highest mean; and output a data table with the cluster groups, the size of each cluster group, and the keywords in each cluster group. With this new tool, students studying data analysis and machine learning have an easy-to-use R tool for TF-IDF text similarity scoring analysis that works on both Windows-based PCs and Apple Macs.
Possible use cases include matching resumes to occupations, matching a syllabus to occupations, matching resumes to program syllabi to discover gaps, and recommending courses. Templates, samples, and comprehensive tutorials are provided in the application.
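
The scoring steps named in the abstract (cleansing, TF-IDF weighting, similarity against the source, and picking the cluster with the highest mean score) can be illustrated with a minimal sketch. This is not the project's code: the application itself is written in R, while this sketch uses plain Python for illustration. All function names are hypothetical, the similarity measure is assumed to be cosine similarity, the IDF formula is an assumed smoothed variant, lemmatization is omitted, and the cluster labels are assumed to come from a separate hierarchical clustering step.

```python
import math
import re
from collections import Counter, defaultdict


def clean(text):
    """Lowercase, replace non-alphabetic characters with spaces, collapse repeated spaces."""
    text = re.sub(r"[^a-z ]", " ", text.lower())
    return re.sub(r" +", " ", text).strip()


def ngrams(text, n_max=1):
    """Return all word n-grams of length 1..n_max (the 'level of word combinations')."""
    words = clean(text).split()
    return [" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]


def tfidf_vectors(docs, n_max=1):
    """Build one sparse TF-IDF vector (dict of term -> weight) per document."""
    token_lists = [ngrams(doc, n_max) for doc in docs]
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1
        # Assumed smoothed IDF: log((1 + N) / (1 + df)) + 1
        vectors.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def score_targets(source, targets, n_max=1):
    """Score every target text against the source text."""
    vectors = tfidf_vectors([source] + list(targets), n_max)
    return [cosine(vectors[0], vec) for vec in vectors[1:]]


def best_cluster(scores, labels):
    """Return the label of the cluster with the highest mean score, plus all means."""
    groups = defaultdict(list)
    for score, label in zip(scores, labels):
        groups[label].append(score)
    means = {label: sum(vals) / len(vals) for label, vals in groups.items()}
    return max(means, key=means.get), means


if __name__ == "__main__":
    source = "Data analyst with R and machine learning experience"
    targets = ["Machine learning engineer using R", "Pastry chef baking bread"]
    print(score_targets(source, targets))
```

A target that shares no terms with the source scores 0, and a target identical to the source scores 1 (up to floating-point rounding), so the scores can be read as a relative ranking of targets against the source.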
Description: Best in Showcase Paper
URI: http://hdl.handle.net/2451/62803
Rights: Author Retains All Rights
Appears in Collections: MASY Student Research Showcase 2021

Files in This Item:
File: Wukun Chen - Final Project Report - Simi Bot (Similarity Analyzer).pdf
Size: 2.92 MB
Format: Adobe PDF


Items in FDA are protected by copyright, with all rights reserved, unless otherwise indicated.