I'm Sandeep, a Senior Data Scientist based in Canada with over 5 years of experience building AI-powered solutions.
I'm proficient in Python, Natural Language Processing, Computer Vision, Time Series Forecasting, Predictive Modeling, and Statistical Analysis. I also have a keen interest in designing interactive dashboards and turning data into clear, insightful visualizations.
I'm passionate about solving problems and building efficient, well-organized systems that are fast, reliable, and accessible to everyone.
In my free time, I enjoy mentoring, reading, and cooking.
Skills
Programming languages / Databases: Python, Java, SQL, Oracle, NoSQL
Libraries, frameworks, and techniques: TensorFlow, Keras, PyTorch, spaCy, NLTK, scikit-learn, Seaborn, Pandas, OpenCV, LLMs, BERT, LayoutLM, LSTM, Transfer Learning, Vision Transformers, GANs, Regression, Decision Trees, Random Forests, SVM, XGBoost, AdaBoost, LightGBM, Clustering, PCA, t-SNE, ARIMA
BI and Analytics tools: Qlik Sense, Google Analytics, Tableau
Big Data technologies: Hadoop, Spark
Cloud platforms: AWS (EC2, Lambda, S3, SageMaker), Google Cloud, Azure
Other tools and technologies: ETL, Kubernetes, MLflow, GitHub, Django, Flask, Jenkins, Jira, Microsoft Office (Excel, PowerPoint), Linux
Experience
Oct 2022 - Jan 2023 / 3m
- Built a credit risk scorecard on comprehensive credit bureau data using Random Forest and Logistic Regression, targeting customers who had made their first installment. The model reached 87% accuracy and cut financial risk costs by 15% by making creditworthiness assessments more precise, reducing defaults and late payments. Each customer received a risk score between 300 and 900, giving a quantifiable measure of their credit risk profile. Throughout the project, I engaged stakeholders proactively to keep the work technically robust and aligned with broader business objectives, mitigating high-risk credit exposures.
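The bullet above doesn't include the scorecard's calibration, so here is a minimal sketch of the standard log-odds scaling used to turn a model's default probability into a 300-900 score (the base score, base odds, and points-to-double-odds constants below are illustrative, not the production values):

```python
import math

def probability_to_score(p_default, base_score=600, base_odds=50, pdo=20,
                         lo=300, hi=900):
    """Map a model's default probability onto a bounded scorecard scale.

    Standard log-odds scaling: `base_score` points at `base_odds` (good:bad)
    odds, with `pdo` points doubling the odds. Calibration constants here
    are assumptions for illustration.
    """
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)
    odds = (1 - p_default) / max(p_default, 1e-9)  # good:bad odds
    score = offset + factor * math.log(odds)
    # Clamp to the scorecard's published range.
    return int(min(hi, max(lo, round(score))))
```

With these constants, a customer whose odds of repaying are exactly 50:1 lands on the 600-point anchor, and riskier customers score lower.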
- Strengthened the organization's risk management by implementing an alternate scoring model based on XGBoost, a gradient boosting framework known for its predictive power and speed. Tailored to customers who had completed their first payment, the model drew on features such as email patterns, employment status, and transactional behavior to identify potentially fraudulent customers with 92% precision. Each flagged customer was assigned a risk score between 100 and 900 capturing their individual risk profile, significantly improving the firm's ability to manage risk exposure and limit losses from fraudulent activity.
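A hedged sketch of the kind of feature engineering this model consumed, deriving fraud signals from an email address and mapping a fraud probability onto the 100-900 range (the feature names, regex, domain list, and linear score mapping are all assumptions for illustration, not the production logic):

```python
import re

def email_risk_features(email):
    """Illustrative fraud-signal features from an email address."""
    local, _, domain = email.partition("@")
    digits = sum(ch.isdigit() for ch in local)
    return {
        "local_length": len(local),
        "digit_ratio": digits / max(len(local), 1),
        # Does the address look like firstname.lastname plus a short number?
        "has_name_pattern": bool(
            re.fullmatch(r"[a-z]+[._][a-z]+\d{0,4}", local.lower())
        ),
        "free_domain": domain.lower() in {"gmail.com", "yahoo.com",
                                          "hotmail.com"},
    }

def fraud_score(p_fraud, lo=100, hi=900):
    """Map a model's fraud probability onto the 100-900 risk scale."""
    return int(round(lo + (hi - lo) * p_fraud))
```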
July 2021 - Sep 2022 / 1y 2m
- Automated 96% of the lending team's manual tasks using LayoutLM for document understanding and Azure OCR and pytesseract for text extraction, sharply improving the speed and accuracy of loan document processing. To serve the models at scale, I built a Flask API as the bridge between the models and end users, integrated it into the existing infrastructure with Jenkins for automated build, test, and deployment, and containerized the application with Treadmill, a scalable, highly available orchestration platform. The result was a robust, reliable system that gave end users faster document processing and raised the lending team's overall productivity.
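A minimal sketch of what the Flask serving layer could look like, assuming a single upload endpoint; the `extract_fields` stand-in replaces the real LayoutLM + OCR pipeline, and its name and returned fields are hypothetical:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def extract_fields(document_bytes):
    """Stand-in for the LayoutLM + OCR pipeline (hypothetical).

    In production this would run the models and return structured fields
    extracted from the uploaded loan document.
    """
    return {"applicant_name": None, "loan_amount": None}

@app.route("/extract", methods=["POST"])
def extract():
    # Reject requests that don't carry a document upload.
    if "document" not in request.files:
        return jsonify({"error": "no document uploaded"}), 400
    fields = extract_fields(request.files["document"].read())
    return jsonify({"status": "ok", "fields": fields})
```

An API boundary like this lets the models be updated and redeployed (via Jenkins) without changing anything on the caller's side.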
- Partnered with a cross-functional team to deploy and configure Label Studio on our server infrastructure, giving model training a steady supply of high-quality, structured data. We added a feedback loop so the model could be refined iteratively, improving its predictive accuracy over time, and cut human effort on document labeling by 80%. To keep model performance transparent and surface issues quickly, we also integrated continuous performance monitoring for the deployed model in production, improving resource allocation, model accuracy, and data-driven decision-making across the organization.
- Developed a machine learning pipeline for automated invoice processing, using Tesseract OCR for text extraction, YOLOv5 for object detection, and BERT QA for text interpretation, reducing the business team's workload by 90%. The pipeline identified and extracted key financial fields such as dates, tax figures, and total amounts from invoices, revealing expenditure and lending patterns that informed data-driven business strategies and decisions.
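The post-OCR parsing step of a pipeline like this can be sketched with plain regexes that pull dates, tax, and totals out of raw invoice text (the upstream YOLOv5 detection and BERT QA stages are not reproduced here, and the patterns are illustrative):

```python
import re

def parse_invoice_text(ocr_text):
    """Pull key financial fields out of raw OCR'd invoice text.

    Illustrative of the normalization step only; the real pipeline used
    model-driven extraction upstream of patterns like these.
    """
    date = re.search(r"\b(\d{2}[/-]\d{2}[/-]\d{4})\b", ocr_text)
    total = re.search(r"(?i)total\s*:?\s*\$?([\d,]+\.\d{2})", ocr_text)
    tax = re.search(r"(?i)tax\s*:?\s*\$?([\d,]+\.\d{2})", ocr_text)

    def to_float(match):
        return float(match.group(1).replace(",", "")) if match else None

    return {
        "invoice_date": date.group(1) if date else None,
        "tax": to_float(tax),
        "total": to_float(total),
    }
```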
Jul 2019 - Jul 2021 / 2y
- Achieved 95% accuracy on the Life toolkit by training a robust machine learning model to validate and extract vital data from documents such as death certificates, PAN cards, and Aadhaar cards. The pipeline used OpenCV for image correction and quality enhancement, YOLO (You Only Look Once) to detect regions of interest, and Tesseract OCR to extract text from those regions, with post-processing to clean and structure the data. As project lead, I supervised model development, vendor collaboration, and the data annotation process to ensure a high-accuracy model.
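One concrete piece of the post-processing step: checking that OCR output from a detected region matches the document's publicly documented ID format (PAN: five letters, four digits, one letter; Aadhaar: twelve digits). The cleanup rules here are illustrative:

```python
import re

# Public ID formats; only the validation idea is taken from the pipeline.
ID_PATTERNS = {
    "pan": r"[A-Z]{5}\d{4}[A-Z]",
    "aadhaar": r"\d{12}",
}

def validate_extracted_id(raw_text, doc_type):
    """Normalize OCR output and validate it against the expected ID format.

    Returns the cleaned ID string, or None if it doesn't match.
    """
    cleaned = re.sub(r"[\s-]", "", raw_text.upper())
    return cleaned if re.fullmatch(ID_PATTERNS[doc_type], cleaned) else None
```

A check like this catches common OCR failures (dropped characters, stray punctuation) before bad values reach downstream systems.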
- Led the development and deployment of a high-precision lie detector model combining OpenFace facial features with LSTM (Long Short-Term Memory) networks, achieving 91% accuracy. The model helped the Medical Underwriter team assess the veracity of statements in customers' Video Medical Examination Reports (MER). Data collection was deliberate: seven individuals were recorded, with truthful and deceptive segments annotated during recording. OpenFace, a facial behavior analysis tool, extracted distinctive facial features, and LSTM networks were trained to identify temporal patterns in those features indicative of deception. This design produced a diverse, representative sample that strengthened the model's generalizability in real-world scenarios, streamlining the team's operations while improving risk assessment accuracy.
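Before per-frame features reach a sequence model like an LSTM, they are typically chunked into fixed-length, overlapping windows. A minimal sketch of that step (the window and stride values are assumptions, not the values used in the project):

```python
def frame_windows(features, window=30, stride=15):
    """Chunk per-frame feature vectors into fixed-length windows.

    `features` is a list of per-frame feature vectors (e.g. OpenFace
    outputs); each returned window is a contiguous run of `window` frames,
    advancing by `stride` frames, suitable as one LSTM input sequence.
    """
    return [features[i:i + window]
            for i in range(0, len(features) - window + 1, stride)]
```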
- Engineered an automation bot that cut the team's manual workload by 90%. The bot executed complex SQL queries autonomously, supporting large-scale data extraction, manipulation, and retrieval, and its handling of database schemas kept task execution fast and accurate.
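The bot's query-execution core can be sketched in a few lines: a parameterized runner that returns rows as plain dicts (shown against SQLite for illustration; the production database and queries differed):

```python
import sqlite3

def run_query(db_path, sql, params=()):
    """Run a parameterized SQL statement and return rows as dicts.

    Parameterized placeholders (?) keep the bot safe from injection when
    query inputs come from upstream task definitions.
    """
    with sqlite3.connect(db_path) as conn:
        conn.row_factory = sqlite3.Row  # rows addressable by column name
        rows = conn.execute(sql, params).fetchall()
    return [dict(r) for r in rows]
```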
- Designed and implemented a Qlik Sense dashboard tailored to loan processing tasks, providing an interactive, user-friendly view of complex loan management data. Its dynamic visualizations delivered real-time insight into key lending indicators, supporting strategic decision-making and operational efficiency.