Title: | Industry Code Analyzer – NAICS Code Discovery For Startups in R |
Authors: | Xu, Yin (Fein) |
Keywords: | NAICS Code System, Text Analysis, TF-IDF, Shiny |
Issue Date: | 17-Jun-2021 |
Abstract: | NAICS codes are the classification codes of the North American Industry Classification System. Federal statistical agencies use them as a North American standard for collecting and analyzing statistical data related to the U.S. economy. However, NAICS is a self-assigned system: business owners or users must select the code that best describes their primary business activities. Finding that code manually with the keyword search provided by the NAICS Code system is time-consuming and inefficient, because users bounce back and forth between the search pages and the homepage. This project therefore aims to help startups and other users of the NAICS Code system find the industry codes that correspond to their business. The consultant had previously built a preliminary NAICS industry code search tool in Python. This project expands on that work with a tool, developed in R, that searches the NAICS industry code database more intelligently. Given a business description uploaded as a text file, the industry code analyzer searches the NAICS industry code database to identify the corresponding industry classification. The tool uses TF-IDF text similarity scoring to produce a ranked list of industry codes, and presents the top 5 codes and descriptions to the user for selection and download. The project also carried out additional experiments to define the tool's capabilities and produced a user interface in Shiny to make the tool easier to use. The Shiny app serves as an efficient search tool for finding the correct industry code and benefits both professional and academic uses. The tool reaches an accuracy rate of 79% when its output is compared with the results of the preliminary Python work and with codes found manually by business owners.
For the next phase, dimensionality reduction through decomposition of the TF-IDF representation is suggested as a way to improve accuracy and reduce processing time in text analysis. |
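The ranking step described in the abstract (TF-IDF similarity scoring, a ranked list, top 5 results) can be illustrated with a minimal sketch. It is written in Python rather than R, uses only the standard library, and is not the project's actual implementation. The catalogue entries, the placeholder codes `X1` to `X6`, the function names, and the smoothed IDF formula are all assumptions made for illustration; none of them come from the report or from the real NAICS database.

```python
import math
import re
from collections import Counter

# Hypothetical mini-catalogue: placeholder codes and invented descriptions,
# NOT real NAICS entries.
CATALOGUE = {
    "X1": "custom software development and computer programming services",
    "X2": "full service restaurants serving meals to seated customers",
    "X3": "grocery stores selling food and household goods",
    "X4": "accounting tax preparation and bookkeeping services",
    "X5": "hair salons and beauty services",
    "X6": "freight trucking and long distance transportation",
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def fit_idf(docs):
    """Compute a smoothed inverse document frequency for every corpus term."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    return {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}


def tfidf_vector(text, idf):
    """Build a sparse TF-IDF vector (dict); terms unseen in the corpus are dropped."""
    toks = [t for t in tokenize(text) if t in idf]
    counts = Counter(toks)
    return {t: (c / len(toks)) * idf[t] for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_codes(description, catalogue, top_k=5):
    """Return the top_k (code, score) pairs, most similar description first."""
    codes = list(catalogue)
    idf = fit_idf([catalogue[c] for c in codes])
    query = tfidf_vector(description, idf)
    scored = [(c, cosine(query, tfidf_vector(catalogue[c], idf))) for c in codes]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    text = "We build custom mobile software and offer programming services"
    for code, score in rank_codes(text, CATALOGUE):
        print(code, round(score, 3))
```

In this sketch the business description is scored against every catalogue description and the five highest cosine similarities are returned, which mirrors the "ranked list, top 5" behaviour the abstract describes. A real tool would load the full NAICS database and would likely add stop-word removal and stemming.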
URI: | http://hdl.handle.net/2451/62804 |
Rights: | All Rights Reserved By Author |
Appears in Collections: | MASY Student Research Showcase 2021 |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
Yin(Fien)Xu - Final Project Report - Industry Code Analyzer – NAICS Code Discovery For Startups in R.pdf | Final Project Report | 6.77 MB | Adobe PDF | View/Open |
Items in FDA are protected by copyright, with all rights reserved, unless otherwise indicated.