Don’t Know Everything, Quiet Quit, Be Mediocre: It’ll Save Your Sanity In The Long Run

Introduction

The sysadmin’s terminal flickers with yet another ticket: “NTP server shows correct time but wall clock is 10 minutes off.” You verify NTP sync, check firewall rules, confirm stratum sources - everything works. Yet the analog clock on the wall remains stubbornly wrong. The punchline? It’s not your clock. You didn’t install it. You didn’t configure it. But now it’s your problem.

This scenario from Reddit’s r/sysadmin perfectly captures the DevOps trap we’ve all faced: the compulsion to own every technical problem in sight, regardless of actual responsibility. This post isn’t about NTP troubleshooting - it’s about the psychological infrastructure we build (or fail to build) around our technical work.

In an era where Kubernetes clusters span continents and SaaS sprawl creates invisible dependencies, the old sysadmin mantra “know everything, control everything” has become a recipe for burnout. We’ll examine:

  1. The cultural shift from “hero sysadmin” to sustainable DevOps practice
  2. Technical boundary-setting using modern infrastructure patterns
  3. Documentation strategies that protect your sanity
  4. When and how to say “not my problem” professionally

You’ll learn concrete techniques to:

  • Define operational responsibility boundaries in complex environments
  • Create self-service documentation that deflects trivial requests
  • Implement monitoring that automatically answers “is this my problem?”
  • Preserve mental bandwidth for high-value engineering work

Understanding the Problem Space

The Death of the Omniscient Sysadmin

In legacy IT environments, system administrators were expected to be:

  • Hardware technicians
  • Network engineers
  • Application specialists
  • Security auditors
  • Desktop support
  • Procurement managers

This “full-stack human” model collapsed under cloud-native complexity. Consider these statistics from recent DevOps reports:

| Responsibility Area | % of Teams Reporting Ownership Fatigue |
| --- | --- |
| Cloud Infrastructure | 73% |
| CI/CD Pipelines | 68% |
| Developer Tooling | 61% |
| Legacy Systems | 89% |
| Third-Party SaaS | 57% |

The Reddit clock scenario exemplifies responsibility creep - when auxiliary systems become your problem through organizational osmosis.

Modern Responsibility Boundaries

Effective DevOps teams use technical guardrails to define operational boundaries:

  1. Infrastructure Ownership Matrix

    | System Component | Primary Owner | Escalation Path | Monitoring Responsibility |
    | --- | --- | --- | --- |
    | NTP Servers | Network Team | Tier 2 Support | Central Monitoring |
    | Physical Clocks | Facilities | Vendor Support | None (Manual Checks) |
    | VM Time Sync | DevOps | Cloud Team | Prometheus/Grafana |
    | Application Timezone Logic | Dev Team | DevOps | App Performance Monitoring |

  2. Automated Responsibility Tagging

    Modern infrastructure-as-code tools allow ownership tagging:

    ```terraform
    # Terraform resource with ownership metadata
    resource "aws_instance" "ntp_server" {
      ami           = "ami-0c55b159cbfafe1f0"
      instance_type = "t3.micro"
      tags = {
        Owner          = "Network Team"
        SupportContact = "network-support@example.com"
        Runbook        = "https://wiki.example.com/NTP-Troubleshooting"
      }
    }
    ```
  3. Service Boundary Monitoring

    Prometheus alert rules that differentiate responsibility:

    ```yaml
    - alert: NTPStratumHigh
      expr: node_ntp_stratum > 5
      annotations:
        description: 'NTP stratum too high (). Contact network team.'
        playbook: 'https://wiki.example.com/NTP-Stratum-Alert'

    - alert: ContainerTimeDrift
      expr: abs(time() - container_last_seen) > 60
      annotations:
        description: 'Container time drift detected. Check Docker host sync.'
        playbook: 'https://wiki.example.com/Container-Time-Sync'
    ```

The Mediocrity Principle

“Mediocrity” in this context means strategically limiting deep expertise to your actual responsibility domain. This isn’t about doing poor work - it’s about resisting the urge to:

  1. Reverse-engineer every black box system
  2. Maintain tribal knowledge of deprecated systems
  3. Accept responsibility for systems without authority

As the classic Google SRE book notes: “100% reliability is both impossible and economically nonviable.” Apply this to personal knowledge - 100% system mastery across all dependencies is impossible.

Prerequisites for Sanity Preservation

Technical Requirements

  1. Clear CMDB (Configuration Management Database)
    • ServiceNow, NetBox, or open-source alternatives like iTop
    • Must contain (a validation sketch follows this list):
      • System ownership
      • Support contacts
      • Documentation links
  2. Unified Monitoring
    • Prometheus + Grafana stack
    • Proper alert routing (e.g., Alertmanager -> Slack/Teams channels by team)
  3. Documentation Portal
    • MediaWiki, Confluence, or MkDocs
    • Mandatory fields for new systems:

      ```markdown
      System Ownership
      - Primary:
      - Secondary:
      - Vendor:

      Troubleshooting
      - First-line checks:
      - Escalation path:
      ```
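
To make the CMDB requirement concrete, here is a minimal sketch of a gate that refuses to onboard a system whose record lacks the mandatory ownership fields. The endpoint, field names, and response shape are assumptions modeled on the hypothetical https://cmdb.example.com API used throughout this post, not any specific product’s schema.

```python
# Sketch: verify a CMDB record carries the mandatory ownership fields.
# The endpoint and field names are hypothetical, not a real product's API.
import requests

REQUIRED_FIELDS = ("owner", "support_contact", "documentation_url")

def missing_ownership_fields(system_name: str) -> list[str]:
    """Return the mandatory fields absent from a system's CMDB record."""
    response = requests.get(
        f"https://cmdb.example.com/api/systems/{system_name}", timeout=10
    )
    response.raise_for_status()
    record = response.json()
    return [field for field in REQUIRED_FIELDS if not record.get(field)]

if __name__ == "__main__":
    missing = missing_ownership_fields("ntp-server")
    if missing:
        print(f"CMDB record incomplete - missing: {', '.join(missing)}")
    else:
        print("CMDB record has all mandatory ownership fields")
```

Run this as part of system onboarding (or a nightly job) and you get the “no owner, no support” rule enforced by code rather than by memory.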

Organizational Requirements

  1. Formal RACI Matrix
    • Responsible, Accountable, Consulted, Informed
    • Published for all critical systems
  2. Change Advisory Board (CAB) Process
    • Prevent shadow IT deployments
    • Mandatory ownership assignment for new systems
  3. Escalation Protocol
    • Defined SLAs for cross-team issues
    • Automatic ticket routing based on infrastructure tags (sketched below)
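
As a concrete illustration of that last point, here is a minimal sketch of routing a new ticket to the owning team based on the `Owner` tag of the affected EC2 instance. The tag lookup uses boto3; the queue mapping and `create_ticket()` helper are placeholders for whatever ITSM tool you run, not a real API.

```python
# Sketch: auto-route a ticket based on the Owner tag of the affected instance.
# TEAM_QUEUES and create_ticket() are placeholders for your ITSM tooling.
import boto3

TEAM_QUEUES = {
    "Network Team": "network-support",
    "DevOps": "devops-oncall",
    "Facilities": "facilities-requests",
}

def owner_tag(instance_id: str) -> str:
    """Read the Owner tag from an EC2 instance, defaulting to 'unassigned'."""
    ec2 = boto3.client("ec2")
    reservation = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"][0]
    tags = reservation["Instances"][0].get("Tags", [])
    return next((t["Value"] for t in tags if t["Key"] == "Owner"), "unassigned")

def create_ticket(queue: str, summary: str, owner: str) -> None:
    # Placeholder: call your ticketing system's API here (ServiceNow, Jira, ...).
    print(f"[{queue}] {summary} (owner: {owner})")

def route_ticket(instance_id: str, summary: str) -> None:
    owner = owner_tag(instance_id)
    create_ticket(TEAM_QUEUES.get(owner, "triage"), summary, owner)
```

In practice this would hang off your alerting webhook or service catalog rather than be run by hand; the point is that the routing decision comes from the tag, not from whoever happens to read the ticket first.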

Implementing Boundary Controls

Infrastructure as Code Ownership

Add ownership metadata to all IaC resources:

```terraform
# AWS Terraform module with support metadata
module "ntp_cluster" {
  source  = "terraform-aws-modules/ec2-instance/aws"

  instance_count = 3
  name           = "ntp-server"

  tags = {
    Service     = "NTP"
    Owner       = "network-team@example.com"
    Runbook     = "https://wiki.example.com/NTP-Maintenance"
    SupportTier = "2"
  }
}
```

Automated Documentation Generation

Use tools like Terraform-Docs to create self-maintaining ownership manifests:

```bash
# Generate documentation from IaC
terraform-docs markdown table --output-file OWNERSHIP.md .
```

Resulting OWNERSHIP.md:

| Resource | Type | Owner | Support Tier |
| --- | --- | --- | --- |
| ntp_cluster | aws_instance | network-team@example.com | 2 |

Alert Routing with Ownership Metadata

Configure Alertmanager to route based on tags:

```yaml
route:
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: 'slack-general'
  routes:
  - match:
      severity: critical
      Owner: network-team
    receiver: 'slack-network'
```

The Art of Professional Pushback

When receiving requests outside your domain:

  1. Verify ownership
    ```bash
    # Query CMDB for system owner
    curl "https://cmdb.example.com/api/systems/$(hostname)/owner"
    ```

  2. Provide an actionable handoff

    Bad response: “Not my problem.”

    Good response: “Our monitoring shows NTP sync working at the OS level. The physical clock appears to be out of sync. Per our CMDB, physical devices are managed by Facilities (ext. 555). I’ve cc’d their lead on this ticket.” (A sketch for assembling this kind of handoff text from the CMDB follows this list.)

  3. Document the interaction

    Update the ticket with:
    • CMDB ownership record
    • Monitoring screenshots
    • Escalation path followed
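
Here is a small sketch that turns the CMDB lookup into a ready-to-paste handoff comment, so the ticket captures the ownership record and the escalation path in one step. It reuses the hypothetical CMDB endpoint and field names assumed earlier in this post.

```python
# Sketch: build a handoff comment from the (hypothetical) CMDB ownership record,
# so the ticket documents who owns the system and where it was escalated.
import requests

def handoff_comment(system_name: str, finding: str) -> str:
    record = requests.get(
        f"https://cmdb.example.com/api/systems/{system_name}", timeout=10
    ).json()
    return (
        f"{finding}\n"
        f"Per our CMDB, {system_name} is owned by {record['owner']} "
        f"(contact: {record['support_contact']}).\n"
        f"Runbook: {record['documentation_url']}\n"
        f"Escalating to the owning team and cc'ing their lead on this ticket."
    )

print(handoff_comment(
    "wall-clock-bldg-3",
    "OS-level NTP sync is healthy; only the physical wall clock is off.",
))
```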

Configuration Examples

NTP Boundary Monitoring

Prometheus rules that differentiate sync issues:

```yaml
groups:
- name: time-sync
  rules:
  - alert: OSLevelTimeDrift
    expr: abs(node_timex_offset_seconds) > 0.5
    labels:
      severity: critical
      Owner: devops-team
    annotations:
      description: 'Host time drift detected - check NTP configuration'

  - alert: PhysicalClockDrift
    expr: last_over_time(physical_clock_offset_seconds[5m]) > 60
    labels:
      severity: warning
      Owner: facilities-team
    annotations:
      description: 'Building clock out of sync - contact facilities'
```

Automated Responsibility Checks

Bash script to verify system ownership before troubleshooting:

```bash
#!/bin/bash
SYSTEM=$1

# Query CMDB
OWNER=$(curl -s "https://cmdb.example.com/api/systems/$SYSTEM/owner")

if [[ "$OWNER" != "devops-team" ]]; then
  echo "System $SYSTEM is owned by $OWNER"
  echo "Escalating ticket to $OWNER..."
  echo "See documentation: https://wiki.example.com/System-Ownership"
  exit 1
fi

# Proceed with troubleshooting
ntpq -p
chronyc sources
```

Operational Workflows

Daily Boundary Maintenance

  1. CMDB Hygiene Check
    ```bash
    # Report systems without clear ownership
    curl "https://cmdb.example.com/api/systems?owner=null" | jq .
    ```
    
  2. Alert Ownership Audit
    ```sql
    -- Query alerts handled by wrong team
    SELECT * FROM alert_history
    WHERE resolved_by NOT IN (
      SELECT team FROM system_owners
      WHERE system = alert_history.system_name
    );
    ```
    
  3. Documentation Link Validation
    ```python
    # Check for broken runbook links
    import requests

    for system in cmdb.systems:
        # Fetch each system's runbook URL from its CMDB record (assumed attribute)
        response = requests.get(system.runbook, timeout=10)
        if response.status_code != 200:
            log_error(f"Broken link for {system.name}")
    ```
    

Troubleshooting Boundary Issues

Common Problems and Solutions

| Symptom | Likely Cause | Resolution |
| --- | --- | --- |
| “Why is this my problem?” responses | Missing CMDB data | `UPDATE cmdb.systems SET owner='network-team' WHERE name='ntp01';` |
| Alerts routing to wrong team | Misconfigured Alertmanager | Add `Owner: network-team` under the route’s `match:` block |
| Recurring shadow IT issues | No CAB process | Implement Terraform Sentinel policies requiring owner tags (see the sketch below) |
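
If you do not run Sentinel, a plain CI step over the JSON plan output (`terraform show -json tfplan`) can enforce the same owner-tag rule. A minimal sketch, assuming tags appear under `change.after.tags` as they do for most taggable AWS resources:

```python
# Sketch: fail CI when planned resources are missing an Owner tag.
# Reads the JSON produced by `terraform show -json tfplan`.
import json
import sys

def resources_missing_owner(plan_path: str) -> list[str]:
    with open(plan_path) as f:
        plan = json.load(f)
    missing = []
    for change in plan.get("resource_changes", []):
        if change.get("mode") != "managed":
            continue  # skip data sources, which carry no tags
        after = change.get("change", {}).get("after")
        if after is None:
            continue  # resource is being destroyed
        tags = after.get("tags") or {}
        if "Owner" not in tags:
            missing.append(change["address"])
    return missing

if __name__ == "__main__":
    offenders = resources_missing_owner(sys.argv[1])
    if offenders:
        print("Resources missing an Owner tag:", *offenders, sep="\n  ")
        sys.exit(1)
```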

Debugging Ownership Conflicts

  1. Check historical ownership:
    ```bash
    # Git blame for IaC ownership tags
    git blame terraform/main.tf | grep -i owner
    ```
    
  2. Audit access patterns:
    ```sql
    -- Find who actually maintains the system
    SELECT DISTINCT user FROM audit_logs
    WHERE system='ntp-server'
      AND action IN ('restart', 'config_change');
    ```
    
  3. Verify monitoring coverage:
    ```bash
    # Check if system has associated alerts
    curl "https://prometheus.example.com/api/v1/alerts" | jq '.data.alerts[] | select(.annotations.system == "ntp-server")'
    ```
    

Conclusion

The wall clock that shouldn’t be your problem is more than a meme - it’s a warning sign of unhealthy responsibility spread. By implementing technical ownership boundaries through:

  1. CMDB-enforced system attribution
  2. Metadata-driven alert routing
  3. Self-service documentation
  4. Professional escalation protocols

We move from the unsustainable “know everything” model to a sustainable DevOps practice where:

  • Teams focus on their actual domains
  • Cross-system issues have clear paths
  • Engineers preserve cognitive bandwidth

Your next steps:

  1. Audit three systems you “sort of” support with no formal ownership
  2. Implement at least one metadata tag in your IaC
  3. Create a single runbook page with explicit escalation instructions

Remember: Mediocrity isn’t about quality - it’s about scope. The best DevOps engineers aren’t those who fix everything, but those who know exactly what not to fix.

This post is licensed under CC BY 4.0 by the author.