Summary
I’m a systems thinker, builder, and mentor who stabilizes chaos and helps infrastructure teams scale with clarity and care. With 20+ years in distributed systems, production engineering, and backend architecture, I’ve turned around critical services, restored reliability to failing systems, and shaped engineering cultures through mentorship and practical observability.
I’m especially strong in ambiguous, high-stakes environments where technical root causes are tangled in organizational ones. My work blends code, communication, and coordination to unblock teams and make complexity tractable.
What I bring:
- Deep expertise in batch compute, observability, and incident response
- Leadership through influence—bridging infra, data, and customer-facing teams
- Refactoring legacy systems for performance, scalability, and maintainability
- Coaching engineers and shaping org culture around reliability and ownership
- Turning intuition into metrics, and outages into institutional learning
Experience
Production Engineer
Meta — Nov 2021 - Sep 2024
Returned to Meta in a staff-level IC capacity with a unique cross-org mandate: identify and unblock critical reliability and efficiency issues across the data warehouse compute stack. Contributed to scheduler redesigns and cross-team coordination, and investigated hard-to-track workload failures, especially during the company-wide “year of efficiency” initiative.
Key Contributions:
- Served under the head of Data Infra Compute, with a broad investigative role spanning Presto, Spark, Dataswarm, and Chronos, partnering with Warehouse Foundation, DataOps, and Fblearner teams.
- Influenced planning and scope for a major scheduler redesign effort, advocating for practical instrumentation, better prioritization signals, and lightweight prototypes.
- Investigated Spark workloads skirting prioritization, surfaced non-compliant patterns, and helped bring them into alignment.
- Acted as a technical bridge between DataOps and Compute/Infra teams to reconnect roadmap and prioritization efforts for shared reliability goals.
- Diagnosed a warehouse-wide pipeline degradation caused by SKU drift in Presto scheduling; the fix shaved hours off critical data landing time each day.
- Mentored other production engineers, helping revive agility and confidence across teams through targeted coaching and systems thinking.
Senior Software Engineer
Mode — Jul 2020 - Sep 2020 (3 months)
Returned to Mode briefly to support the backend team during a period of performance degradation and architectural transition. Contributed targeted improvements, mentored engineers, and laid groundwork for long-term schema scalability.
Key Impacts:
- Identified architectural performance issues across Java and Go services by analyzing CPU traces and heap dumps from crashing production instances.
- Reduced memory footprint of schema processing services by 75%, improving uptime and eliminating redundant schema queries to customer databases.
- Mentored backend engineers on profiling and performance diagnosis techniques, using live case studies from production.
- Advocated for and initiated architectural discussions around streaming-based schema indexing to support large-scale datasets.
- Kicked off planning for a third-party database driver quality certification program to ensure reliability and compatibility across customer integrations.
Technical Cofounder
stealth startup — Feb 2020 - Jul 2020 (6 months)
Launched an early-stage venture exploring data-driven machine learning products in commercial real estate. Led technical strategy, data acquisition, and feature engineering efforts for an MVP viability model, while building operational structure and alignment in a volatile founding environment.
Key Contributions:
- Built a custom data scraper to aggregate and normalize commercial property listings from multiple sources into structured training data (including data cleaning, enrichment, and storage).
- Led technical design and planning for an ML pipeline to support a regression model for tenant viability prediction.
- Established an MVP-focused planning cadence and facilitated engineering prioritization and alignment among cofounders.
- Advised and supported a newly trained data scientist (first-time collaborator) on feature development and tradeoffs.
Outcome:
Wound down operations due to early-stage cofounder misalignment and external market instability during COVID-19 onset.
Senior Software Engineer
Mode — May 2019 - Feb 2020 (10 months)
Hired into a backend leadership role during a pivotal stage in the company’s growth. Took initiative across boundaries—refactoring core systems, improving cross-team reliability, and helping customer success reduce escalations through backend visibility and self-service tooling. Shaped technical direction and roadmap priorities through influence and clarity of purpose.
Key Impacts:
- Took ownership of database driver performance and reliability, reducing customer query issues and improving team-wide service stability.
- Led a major refactor of the query service, modularizing instrumentation, error handling, and configuration to improve reliability and surface top errors faster.
- Worked directly with major customers to troubleshoot database connectivity issues—identified root causes in client-side infrastructure and contributed fixes to their Presto gateway proxy.
- Collaborated with prospective clients during the sales cycle to understand their needs and build key features that helped close new contracts.
- Initiated structured collaboration between backend engineering and customer success—established weekly triage cycles, uncovered key bugs, built shared tools, and reduced escalations by 90%.
- Mentored three engineers (junior and senior), assigning ownership and helping grow confidence and autonomy.
- Launched recurring incident reviews and response processes, building reliability culture across backend teams.
- Formed and advocated for a cross-functional senior IC working group (with executive buy-in) to address long-standing performance, architecture, and ownership gaps.
Production Engineer
Facebook — Apr 2017 - Dec 2018 (1 year 9 months)
Returned to Facebook and joined the Data Infrastructure team, focused on stabilizing batch compute systems (Hive on Corona, Presto, Spark) during a period of high operational friction. Took on an engineering leadership role guiding technical direction, reliability strategy, and cross-team coordination across critical infrastructure components.
Key Impacts:
- Led the cultural and technical shift toward user-centered reliability by reconciling internal SLIs with real-world developer pain—long queue times, failed retries, and undiagnosed job delays.
- Defined and tracked new metrics for final failure states, labeling root causes through extensive query and log parsing; influenced prioritization and roadmap adjustments across multiple teams.
- Investigated performance variability and fragility in Dataswarm workflows, revealing critical inefficiencies in job retries, query configurations, and dependency resolution logic.
- Co-designed a dependency analysis tool to map and estimate downstream workloads impacted by errors in critical, foundational fact tables, enabling fast, cost-aware recovery orchestration.
- Advised and supported adoption of best practices in configuration, scheduling pools, and pipeline standardization across orgs.
Engineering Leadership:
- Worked with directors to shape reliability-focused technical direction across compute teams, including Spark, Hive, and Presto, aligning priorities and roadmap commitments.
- Mentored engineers and promoted systems-level thinking around reliability, observability, and operational feedback loops.
- Became a go-to source for insight and triage around batch performance and failure modes.
Chief Architect
NTT Security Holdings — Nov 2016 - Apr 2017 (6 months)
Joined during a post-acquisition unification effort to help define architecture and engineering practices for a new global security services division. Despite personal upheaval during this time, contributed to core strategy discussions, team formation, and early infrastructure scaffolding that shaped the organization’s future direction.
Key Impacts:
- Led initial architecture work to consolidate and align services from multiple acquired entities into a cohesive security platform strategy.
- Oversaw early design and development of GTIP (Global Threat Intelligence Platform), including technical direction and system goals for ingest, analysis, and enrichment workflows.
- Bootstrapped a new SRE team and introduced early infrastructure-as-code, CI/CD guardrails, and release hygiene practices to reduce operational risk.
- Mentored engineers on improving reliability and development velocity through automated testing, linting, and deployment standards.
Senior Technical Consultant
Valiant Solutions, LLC — Jul 2016 - Nov 2016 (5 months)
Brought in to lead the migration of an application firewall and intrusion detection system for a U.S. federal agency. Delivered a clean transition from legacy tooling to AWS-native network access controls through custom-built automation and reverse engineering, all under strict government operational constraints.
Key Impacts:
- Reverse-engineered Trend Micro’s configuration and audit data structures to design a migration tool translating legacy firewall rules into AWS-native equivalents (e.g. security groups, NACLs).
- Automated the full configuration export, transformation, and rehydration process to eliminate manual toil and reduce migration risk.
- Addressed deep technical debt, including an unoptimized database schema (e.g. never-vacuumed audit logs), improving performance of internal systems.
- Navigated restricted time windows, RDP-only access, and policy overhead to complete all deliverables on schedule.
- Identified forward-looking architecture improvements, though these were deferred due to institutional limits.
Database Performance Engineer / Production Engineer
Facebook — Jul 2014 - Feb 2016 (1 year 8 months)
Worked on Facebook’s data compute infrastructure powering all batch workloads via Hive on Corona (a fork of MapReduce), Presto, and HDFS. Helped steer the system through a major inflection point as storage efficiencies shifted the primary bottleneck to compute capacity. Contributed to both deep technical troubleshooting and cross-org operational reform to improve system reliability, resource efficiency, and performance.
Key Impacts:
- Standardized job prioritization and slot allocation across internal orgs by interviewing engineers, analyzing workloads, and defining a unified set of scheduling pools, replacing a fragmented namespace-specific model.
- Developed “Hiveswatter”, a mechanism to short-circuit known failing recurring Hive jobs early in execution, reclaiming valuable compute slots.
- Diagnosed a critical scheduler bottleneck in Corona’s jobtracker caused by serialized task assignment with slow network calls—pinpointed the codepath and proposed fixes adopted by engineering leadership.
- Resolved cluster-wide OOM crashes during cross-region replication by designing a workaround using an LD_PRELOAD-based connect syscall interceptor to dynamically adjust TCP socket buffer sizes based on destination subnet.
- Collaborated with data infrastructure and storage teams during system-wide hardening to handle rising demand across Messenger, Ads, and other internal platforms.
- Improved system resilience by triaging query scheduler behavior, eliminating runaway compute workloads, and pushing for consistent failure triage practices.
Production Engineer
Facebook — Feb 2013 - Jul 2014 (1 year 6 months)
Joined during a period of systemic instability and team dysfunction on the HBase infrastructure team, which supported Messenger, internal monitoring systems, and search indexing. Promoted mid-year into a formal management role to help stabilize both the systems and the team. Focused on enabling strong but under-supported engineers to succeed, restoring operational health, and rebuilding trust across orgs under intense pressure.
Key Impacts:
- Diagnosed and mitigated write-path latency issues across HBase and HDFS that were causing repeated crashes and outages in Messenger.
- Facilitated an effort to stabilize schema design and replication strategies, including live migrations of active clusters under load and a redesign of backup/restore systems.
- Advocated for halting feature development and prioritizing cross-functional reliability work—helping realign efforts across PE, SWE, and partner teams.
- Brought focus and follow-through to a fractured team, enabling engineers to reclaim ownership of critical remediations and performance work.
- Collaborated in cross-org crisis response planning and execution during hardware failures and data migration events.
- Authored and shared weekly internal reports to keep stakeholders and leadership aligned on system health, risks, and progress.
People & Team Leadership:
- Promoted to IC5 with a “Redefines Expectations” rating in recognition of impact.
- Managed hiring and mentoring for multiple team members who went on to excel.
- Played a key role in unlocking peers: diagnosing blockers, modeling initiative, and seeding confidence through relationship-building and emotional intelligence.
- Regularly took on “translator” roles between engineers and stakeholders to surface what mattered and get it unblocked.
Recognition:
- Honored with a public award for crisis leadership—recognized internally as a calm and catalytic force during high-severity operational events.
Multiple Roles – SOC Analyst, Systems Engineer II, SOC Manager, Network Appliance Engineer
Solutionary — Aug 2006 - Jan 2013 (6 years 6 months)
Joined as an entry-level night-shift SOC analyst and became a central force behind Solutionary’s evolution—from operations to engineering to leadership.
Progressed through several IC and management roles, repeatedly leaning into the organization’s most urgent needs.
Known as the connective tissue across security, systems, and customer teams, I left behind lasting infrastructure, culture, and care.
Key Impacts:
- Reduced daily ticket handling from 2 hours to 5 minutes by reverse-engineering internal systems and building an automated Nmap validation and reporting tool.
- Architected a distributed VPN infrastructure for client-deployed appliances, with automated provisioning and self-configuration via OpenVPN and PXE.
- Led and restructured the SOC, supporting 40+ analysts across 3 shifts, and introduced data-backed performance metrics, an analyst training program, and a peer-led Linux study group.
- Managed the Information Security Engineering team, translating customer needs into operational flows and bridging between engineering and sales/account management.
- Defined a new “Network Appliance Engineer” IC role to formalize high-impact technical work at the intersection of infra and customer-facing ops.
Infrastructure + Automation Work:
- Brought Puppet and Git into service for early config management, standardizing system states across heterogeneous fleets.
- Led the virtualization of infrastructure onto Xen, improving hardware utilization and flexibility.
- Redesigned network topology, including VLAN and subnet segmentation for internal services.
- Created a PXE boot provisioning platform to accelerate infrastructure deployment and management.
- Built internal dashboards, leaderboards, and reports using SQL queries rendered in Confluence, turning a wiki into a homegrown BI platform for the SOC.
- Developed a recursive descent parser to convert 200+ bespoke syslog-ng configurations into equivalent rsyslog configs, enabling a fleet-wide migration to a standardized, git-backed configuration hierarchy with global, customer-, and device-level overlays. Collaborated directly with rsyslog creator Rainer Gerhards to resolve edge-case bugs and support scale-out.
Glue & Incident Leadership:
- Served as the company’s go-to bridge between dev, ops, and security, and was frequently called in during critical incidents to stabilize systems and restore customer trust.
- Recognized company-wide as a multiplier: someone who shipped solutions, unblocked teams, and helped others rise.
Airborne Cryptologic Linguist
United States Air Force — Jul 2003 - Jul 2006 (3 years 1 month)
Licenses & Certifications
Advanced Learning Algorithms — Coursera GZUDC3YG6MEL
Supervised Machine Learning: Regression and Classification — Coursera HQJFZNRLD6R4
Unsupervised Learning, Recommenders, Reinforcement Learning — Coursera 934ETDFPF5ZM