J. An

Tech Enthusiast, Problem Solver, Adaptive Learner

To extract entropy from oceans of data, to discern signal from noise-polluted information, to pursue perfection through repetitive practice.

I describe myself as a tech enthusiast, a problem solver, and an adaptive learner. Fascinated by the digital world, I appreciate all the convenience and sparks brought by technology. I received my master’s degree from Columbia University with a concentration in Data-Driven Analysis & Computation, and my solid background in software programming & data analysis supports both my everyday life and my journey of building innovative products that help our society, functionally and ethically.

Knowledgeable in a wide variety of Machine Learning / Big Data / Cloud Computing / Web Development frameworks & libraries and comfortable with most mainstream programming languages, I am ready to serve. And I believe it is the best of times for you and me to work together for the greatest expectations!


Work Experience

Graduate Course Assistant

Department of Computer Science, Columbia University
Sep. 2020 - Dec. 2020

  • Teaching and Learning Facilitation: Served as a course assistant for Columbia’s best-known Databases class (taught by Prof. Donald Ferguson) with 200+ enrollments. Responsibilities included homework & exam grading, maintaining grading scripts written in Python, and holding weekly office hours / recitations.
  • Highest Evaluated: Among 11 assistants, received the highest overall quality rating (mean: 4.43/5, median: 5/5) and the highest response rate (18.53%) in the student evaluations.

Graduate Research Assistant

Digital Video Multimedia Lab, Columbia University
Feb. 2020 - Jul. 2020

  • Dataset Retrieval and Processing: Designed a monologue video search and retrieval application using the YouTube API and youtube-dl in Python. Applied a face recognition API to crop video frames around speakers’ faces and removed unrelated intervals with FFmpeg, yielding a nearly 100 GB clean dataset (see the download-and-crop sketch below). [Code]
  • Face-Speech Bridging: Extended the existing algorithm to reconstruct speech from silent videos of talking faces and to reconstruct talking faces solely from speech sequences by training two mutual autoencoders for video & audio.
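
A minimal sketch of the download-and-crop step, assuming the youtube_dl and face_recognition Python packages plus an ffmpeg binary on PATH; the function names and the 50 px padding are illustrative, and the actual pipeline lives in the linked repo.

```python
# Illustrative download-and-crop helpers (not the production pipeline).
import subprocess
import youtube_dl
import face_recognition

def download_video(url, out_path="video.mp4"):
    """Download a single monologue video with youtube-dl."""
    opts = {"format": "mp4", "outtmpl": out_path}
    with youtube_dl.YoutubeDL(opts) as ydl:
        ydl.download([url])
    return out_path

def crop_around_face(frame):
    """Crop a frame to the first detected face, padded by roughly 50 px."""
    locations = face_recognition.face_locations(frame)
    if not locations:
        return None  # no face in this frame -> treat as an unrelated interval
    top, right, bottom, left = locations[0]
    return frame[max(top - 50, 0):bottom + 50, max(left - 50, 0):right + 50]

def cut_interval(src, start, end, dst):
    """Keep only the [start, end] interval of a clip using FFmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ss", str(start), "-to", str(end), "-c", "copy", dst],
        check=True,
    )
```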

Projects

Dining Concierge Chatbot Web Application [Code]

Sep. 2020 – Oct. 2020

  • Restaurant Recommendations: Based on the conversation with the user, the front-end NLP-supported chatbot identifies the user’s preferred cuisine type and dining region in Manhattan. The back-end services search the database, retrieve the top 3 highest-rated restaurants that match the preference, and send the restaurants’ names, addresses, ratings, and price info back to the user’s phone via SMS as recommendations.
  • Microservice-Driven Workflow: The front-end website is hosted on S3 and triggers the Lambda function through API Gateway. The NLP-supported chatbot is built with Lex and hooked to a Lambda function for validation and response formatting. The user’s preference is stored in an SQS queue and pulled by a worker that uses Elasticsearch as a filter for restaurant selection. The selected RestaurantIDs are then used as keys to fetch more info from a DynamoDB table populated via the Yelp API. Finally, the recommendations are sent out using SNS (a minimal worker sketch follows below).
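
A minimal sketch of the back-end worker, assuming boto3 and the Python Elasticsearch client; the queue URL, index, table, and field names are placeholders, and the real handler is in the linked repo.

```python
# Illustrative SQS -> Elasticsearch -> DynamoDB -> SNS worker (placeholder names).
import json
import boto3
from elasticsearch import Elasticsearch

sqs = boto3.client("sqs")
sns = boto3.client("sns")
table = boto3.resource("dynamodb").Table("yelp-restaurants")      # placeholder table
es = Elasticsearch("https://search-domain.example.com")           # placeholder endpoint
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/dining-requests"  # placeholder

def handler(event, context):
    # 1. Pull a dining preference (cuisine, region, phone) from the SQS queue.
    for msg in sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1).get("Messages", []):
        pref = json.loads(msg["Body"])

        # 2. Use Elasticsearch as the filter: top 3 candidates for the requested cuisine
        #    (assumes a "rating" field is indexed alongside cuisine).
        hits = es.search(index="restaurants", body={
            "size": 3,
            "query": {"match": {"cuisine": pref["cuisine"]}},
            "sort": [{"rating": {"order": "desc"}}],
        })["hits"]["hits"]

        # 3. Fetch full details (name, address, rating, price) from DynamoDB.
        details = [table.get_item(Key={"RestaurantID": h["_source"]["RestaurantID"]})["Item"]
                   for h in hits]

        # 4. Text the recommendations to the user via SNS.
        body = "\n".join(f"{d['name']} - {d['address']} ({d['rating']}*, {d['price']})"
                         for d in details)
        sns.publish(PhoneNumber=pref["phone"], Message="Top picks:\n" + body)

        # 5. Remove the processed message from the queue.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```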

Smart Door Authentication System [Code]

Oct. 2020 – Nov. 2020

  • Visitor Authentication: Integrated live streaming and face recognition using AWS Kinesis and Rekognition services. Simulates an outdoor security-camera workflow that authenticates/recognizes a visitor (via face information) and, after the owner’s permission, grants her access to a virtual door (or any sensitive resource) via SMS (see the recognition sketch below).
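
A minimal sketch of the face-matching step, assuming boto3 and an existing Rekognition face collection; the collection name, threshold, and one-time-code format are placeholders, and the full Kinesis Video Streams flow plus the owner-approval step live in the linked repo.

```python
# Illustrative visitor-recognition step (placeholder collection and OTP logic).
import secrets
import boto3

rekognition = boto3.client("rekognition")
sns = boto3.client("sns")
KNOWN_VISITORS = "smart-door-faces"   # placeholder Rekognition face collection

def authenticate_visitor(frame_jpeg_bytes, phone_number):
    """Match a camera frame against known faces and, on success, text a one-time code."""
    resp = rekognition.search_faces_by_image(
        CollectionId=KNOWN_VISITORS,
        Image={"Bytes": frame_jpeg_bytes},
        FaceMatchThreshold=90,
        MaxFaces=1,
    )
    if not resp["FaceMatches"]:
        return None  # unknown face -> fall back to the owner's approval flow

    otp = secrets.token_hex(3)  # 6-character hex one-time code
    sns.publish(PhoneNumber=phone_number,
                Message=f"Your door access code is {otp} (valid for 5 minutes).")
    return otp
```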

Photo Album Web Application [Code]

Nov. 2020 – Dec. 2020

  • Photo Uploading & Searching: Integrated AWS Lex, Elasticsearch, and Rekognition services to support uploading photos to the album; the back-end service automatically analyzes the objects in each photo and creates an index (labels) for searching. Users can then search photos in natural language via both text and voice (an indexing sketch follows below).
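
A minimal sketch of the indexing Lambda, assuming boto3 and the Python Elasticsearch client; the endpoint, index name, and label limits are placeholders, and the Lex-driven search path is in the linked repo.

```python
# Illustrative S3-triggered indexing Lambda (placeholder endpoint and index).
import boto3
from elasticsearch import Elasticsearch

rekognition = boto3.client("rekognition")
es = Elasticsearch("https://photos-search.example.com")  # placeholder endpoint

def handler(event, context):
    # Triggered by an S3 PUT: label each new photo and index it for search.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        labels = rekognition.detect_labels(
            Image={"S3Object": {"Bucket": bucket, "Name": key}},
            MaxLabels=10,
            MinConfidence=80,
        )["Labels"]

        es.index(index="photos", body={
            "objectKey": key,
            "bucket": bucket,
            "labels": [l["Name"].lower() for l in labels],
        })
```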

Real-time Credit Card Fraud Detection Pipeline [Code]

Dec. 2020 – Jan. 2021

  • Pipeline Workflow: A Kafka topic continuously produces simulated transaction records that are consumed by a Spark Streaming job. The streaming job predicts whether each transaction is fraudulent based on pre-trained Spark ML models and saves the classified records into the database. The fraud and non-fraud transactions are displayed in real time on a dashboard web page built with the Spring Boot framework. Meanwhile, implemented two REST APIs with the Flask framework to retrieve a customer’s information and create a transaction statement for each customer (a streaming sketch follows below).
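
A minimal sketch of the scoring job, written here with Structured Streaming for brevity; the topic name, schema, model path, and the Parquet sink standing in for the database are all placeholders, and the full pipeline with the Spring Boot dashboard is in the linked repo.

```python
# Illustrative Kafka -> Spark ML -> sink job (placeholder topic, schema, paths).
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, DoubleType
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("fraud-detection").getOrCreate()

schema = (StructType()
          .add("cc_num", StringType())
          .add("amount", DoubleType())
          .add("merchant", StringType()))

# 1. Consume simulated transaction records from the Kafka topic.
transactions = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "transactions")           # placeholder topic
                .load()
                .select(from_json(col("value").cast("string"), schema).alias("t"))
                .select("t.*"))

# 2. Score each record with the pre-trained Spark ML pipeline (fraud / non-fraud).
model = PipelineModel.load("models/fraud_model")               # placeholder path
scored = model.transform(transactions)

# 3. Persist classified records; the dashboard reads them for real-time display.
query = (scored.writeStream
         .outputMode("append")
         .format("parquet")                                    # stand-in for the DB sink
         .option("path", "output/scored")
         .option("checkpointLocation", "output/checkpoints")
         .start())
query.awaitTermination()
```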

Natural Language Processing Models Implementation [Code]

Jun. 2020 – Aug. 2020

  • Trigram Language Model: Designed n-gram extraction and counting functions for corpora. Implemented linear interpolation to compute smoothed trigram probabilities and the log probability of an entire sequence (see the sketch after this list). Applied the trigram model to a TOEFL written-test skill-level classification task, achieving 83% accuracy.
  • Probabilistic Context-Free Grammar Parser: Implemented the CKY algorithm for PCFG parsing, retrieving a parse tree for the input sentence from a backpointer parse table given the PCFG probabilities in the grammar.
  • Neural Network Dependency Parser: Constructed a feed-forward neural network with Keras to predict the transitions of an arc-standard dependency parser using a greedy parsing algorithm, achieving 71% accuracy.
  • Lexical Substitution: Found lexical substitutes for individual target words in context using the WordNet database. Implementations included an overlap-computing algorithm (Simplified Lesk), vector similarity based on Word2Vec embeddings, BERT’s masked language model, and its combination with WordNet-derived candidates.
  • Image Captioning: Created matrices of image representations using the Inception v3 network and read the image captions into a lookup table. Trained an LSTM language generator on the caption data, then added the image input to build an LSTM caption generator. Finally, applied beam search to produce the n highest-scoring caption sequences.
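
A minimal sketch of the interpolated trigram probability from the first bullet, assuming equal interpolation weights and precomputed n-gram count dictionaries; unseen-word handling is omitted for brevity.

```python
# Illustrative linear-interpolation smoothing for a trigram language model.
import math

LAMBDA = 1.0 / 3.0  # equal weights for the unigram, bigram, and trigram terms

def smoothed_trigram_prob(w1, w2, w3, unigrams, bigrams, trigrams, total_words):
    """P(w3 | w1, w2) via linear interpolation of raw MLE estimates."""
    p_uni = unigrams.get((w3,), 0) / total_words
    p_bi = bigrams.get((w2, w3), 0) / unigrams.get((w2,), 1)      # fall back to 1 to avoid /0
    p_tri = trigrams.get((w1, w2, w3), 0) / bigrams.get((w1, w2), 1)
    return LAMBDA * (p_uni + p_bi + p_tri)

def sequence_logprob(tokens, **counts):
    """Log probability of a full sentence under the interpolated trigram model.
    Assumes every word is in the vocabulary (unseen-word handling omitted)."""
    padded = ["START", "START"] + tokens + ["STOP"]
    return sum(
        math.log2(smoothed_trigram_prob(padded[i], padded[i + 1], padded[i + 2], **counts))
        for i in range(len(padded) - 2)
    )
```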

Machine Learning Models Implementation [Code]

Feb. 2020 – May 2020

  • Supervised Learning: Implemented Ridge Regression and Gaussian Process Regression for predicting vehicles’ mileage per gallon, achieving RMSE values of 2.10 and 1.89 respectively, and used degrees of freedom to analyze the impact of vehicle features (a ridge sketch follows below). Implemented a Naive Bayes classifier and a k-NN classifier for spam/ham email classification, achieving 87.0% and 87.3% accuracy respectively, and used Poisson parameters to analyze the impact of email words.
  • Unsupervised Learning: Implemented Matrix Factorization to find similar movies and recommend unrated movies to users based on a MovieLens rating dataset, and Nonnegative Matrix Factorization for topic detection on a New York Times article dataset. Implemented Markov Chains to rank college football teams based on CFB2019 scores.
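
A minimal NumPy sketch of the closed-form ridge fit and its degrees-of-freedom diagnostic from the supervised bullet; the regularization value and feature layout are illustrative, and the full notebooks are in the linked repo.

```python
# Illustrative closed-form ridge regression with a degrees-of-freedom diagnostic.
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge solution: w = (lam * I + X^T X)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

def ridge_degrees_of_freedom(X, lam=1.0):
    """df(lam) = trace(X (lam * I + X^T X)^-1 X^T), useful for feature-impact analysis."""
    d = X.shape[1]
    hat = X @ np.linalg.inv(lam * np.eye(d) + X.T @ X) @ X.T
    return np.trace(hat)

def ridge_predict(X, w):
    """Predict mileage (or any target) for new feature rows."""
    return X @ w
```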