Khaja Mujahiddin Mohammed

Hi, I'm Khaja Mujahiddin Mohammed

Senior Data Engineer | Azure & AWS Lakehouse Architect | PySpark • Python • NLP

Data Engineering · Medallion Architecture · NLP / Whisper · PySpark · Cloud Native
6+ years scaling analytics pipelines (Healthcare, Telecom, Retail) · Architecting Medallion Lakehouses on Azure & AWS · Deploying NLP & Predictive Models
6+ yrs Experience
30% Faster Data Prep
NLP: Whisper & VAD
MS in Data Science

About

I am a senior data engineer and analyst with more than 6 years of experience, currently working as a Data Engineer at Optum. I design scalable analytics platforms and machine-learning pipelines across healthcare, telecom, and retail. I have delivered end-to-end cloud data solutions on Azure and AWS, including a Medallion lakehouse architecture and automated feature-engineering pipelines that cut data-prep latency by 30%. I build production-grade NLP and predictive models and interactive BI dashboards that translate complex data into actionable insights. I aim to apply this expertise to accelerate data-driven decision-making and improve outcomes for a forward-thinking organization.

Core Strengths

  • Cloud Data Architecture (AWS, Azure Data Lake, Databricks)
  • Medallion Lakehouse & ETL/ELT Pipelines (PySpark, dbt, Airflow)
  • Machine Learning & NLP (Whisper, Transformer Models, Deep Learning)
  • Interactive BI & Dashboards (Power BI, Tableau)

Domains

  • Healthcare Analytics (Clinical AI, HIPAA/PCI Compliance)
  • Telecom Signal Intelligence & Speech Processing
  • Enterprise Financial Reporting & Anomaly Detection
  • FinOps & Cloud Cost Optimization

Live Snippet

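# Example PySpark transformation pipeline (Bronze -> Silver)
from pyspark.sql import DataFrame
from pyspark.sql.functions import col

def process_bronze_to_silver(df: DataFrame) -> str:
    """
    Cleans raw data and writes it to the Silver zone
    in Delta Lake format.
    """
    # Keep only active records, then keep one row per transaction_id.
    clean_df = (
        df.filter(col("status") == "active")
          .dropDuplicates(["transaction_id"])
    )

    # Append the cleaned rows to the Silver Delta table.
    # Note: with mode("append"), re-running on the same input adds the rows again.
    (
        clean_df.write.format("delta")
        .mode("append")
        .save("/mnt/datalake/silver/transactions")
    )

    return "Data processing complete"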
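
The cleaning rule in the snippet above (keep rows whose status is "active", then keep one row per transaction_id) can be checked without a Spark cluster. The following is a minimal plain-Python sketch of that rule, not part of the pipeline itself: the `clean_records` helper and the sample rows are hypothetical, and rows are assumed to be dictionaries. One difference worth noting: this sketch keeps the first occurrence of each transaction_id, whereas Spark's dropDuplicates does not guarantee which duplicate survives.

```python
# Plain-Python mirror of the Bronze-to-Silver cleaning rule (illustrative only).
from typing import Iterable


def clean_records(rows: Iterable[dict]) -> list[dict]:
    """Keep active rows, one per transaction_id (first occurrence wins)."""
    seen = set()
    cleaned = []
    for row in rows:
        # Same predicate as the filter on the "status" column.
        if row.get("status") != "active":
            continue
        # Same key as dropDuplicates(["transaction_id"]).
        key = row.get("transaction_id")
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Hypothetical sample rows: one duplicate and one inactive record.
sample = [
    {"transaction_id": "t1", "status": "active", "amount": 10.0},
    {"transaction_id": "t1", "status": "active", "amount": 10.0},
    {"transaction_id": "t2", "status": "inactive", "amount": 5.0},
    {"transaction_id": "t3", "status": "active", "amount": 7.5},
]
result = clean_records(sample)
print([r["transaction_id"] for r in result])  # prints ['t1', 't3']
```

A small reference like this can serve as an expected-output oracle in unit tests for the Spark job, since the two should agree on which transaction_ids reach the Silver zone.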

Projects

NLP-Driven Speech Analytics Platform

NLP · PySpark · Databricks · Whisper · Azure

Enterprise Lakehouse Analytics Platform

Azure · Medallion · Airflow · Power BI

Multimodal Fake News Detection

Deep Learning · NLP · Machine Learning

Pedestrian & Cyclist Segmentation

Computer Vision · U-Net · Deep Learning

Real-Time Customer Segmentation

AWS · PySpark · Power BI · MLOps

Experience

  1. Data Engineer

    Optum • Hartford, CT • Jan 2025 – Present

    • Architected scalable ETL and data ingestion pipelines on AWS and Azure with Databricks, Lambda, and Step Functions.
    • Built automated feature-engineering pipelines using PySpark and SQL, reducing data preparation latency by 30%.
    • Designed high-performance relational and NoSQL data structures in ADLS Gen2 for downstream machine-learning workflows.
    • Streamlined orchestration workflows using Airflow and dbt, and implemented data observability frameworks.
    • Applied FinOps principles to optimize cloud compute usage, reducing operational costs.
  2. Associate Data Engineer

    Staples • Chennai, India • Oct 2021 – Jul 2023

    • Implemented a Medallion Lakehouse architecture in Azure Data Lake (Bronze/Silver/Gold) to unify multi-brand ERP datasets.
    • Designed scalable ETL workflows using PySpark and SQL to transform high-volume financial transaction datasets.
    • Developed automated reporting and visualization layers in Power BI and Spotfire.
    • Optimized SQL query performance across Snowflake and Azure Synapse environments.
  3. Data Analyst

    Verizon • Hyderabad, India • Aug 2019 – Sep 2021

    • Developed high-throughput Spark and Python data pipelines in Azure Databricks processing millions of daily telecom records.
    • Deployed NLP transcription workflows using Whisper and Voice Activity Detection (VAD) models.
    • Executed serverless analytics with AWS Athena to analyze acoustic metadata.
    • Designed Power BI dashboards visualizing regional network performance and signal degradation metrics.

Skills

  • Languages & Libraries: Python (pandas, NumPy, scikit-learn, PySpark), SQL (PostgreSQL, MySQL, Snowflake), R
  • Machine Learning & AI: NLP, Deep Learning, Feature Engineering, Statistical Modeling, A/B Testing, Transformer Models, Voice Activity Detection, Prompt Engineering
  • Data Engineering: Apache Spark, Delta Lake, Kafka, Hadoop Ecosystem, ETL/ELT Pipelines, Medallion Architecture
  • Cloud Platforms: Azure (Data Factory, Synapse, ADLS Gen2), AWS (S3, SageMaker, Athena, Lambda, Glue, EMR), Databricks
  • MLOps & DataOps: Airflow, dbt, MLflow, Docker, Kubernetes, Terraform, CI/CD, Git, Great Expectations, Data Observability
  • Visualization & BI: Power BI, Tableau, Matplotlib, Seaborn, Plotly, Excel
  • Architecture & Governance: Lakehouse Architecture, Star/Snowflake Schemas, Data Lineage, Data Governance, HIPAA/PCI Compliance, FinOps

Education

Master of Science, Data Science

University of New Haven · West Haven, CT · Aug 2023 – May 2025

  • Coursework: Advanced Machine Learning, Deep Learning, NLP, Cloud-Based MLOps, Big Data Analytics.
  • Key Projects: Multimodal Fake News Detection (85% accuracy), Pedestrian & Cyclist Segmentation (U-Net).

Bachelor of Technology, Mechanical Engineering

Sreenidhi Institute of Science & Technology · Hyderabad, India · Jun 2017 – May 2021

  • Foundation in quantitative analysis, mathematics, and engineering principles.
  • Transitioned into data and analytics through hands-on programming and statistical modeling projects.

Certifications

  • AWS Cloud Practitioner — EduBridge
  • Google Data Analytics — Coursera
  • HackerRank SQL — 5 Star
  • MySQL Developer — Udemy
  • PGDCA (Post Graduate Diploma in Computer Applications)
  • Data Analytics & Visualization Virtual Experience — Forage

Contact

The form opens your mail client with a prefilled email.