Course Outline

Foundations of Agentic Systems in Production

  • Agentic architectures: loops, tools, memory, and orchestration layers
  • Lifecycle of agents: development, deployment, and continuous operation
  • Challenges of production-scale agent management

Infrastructure and Deployment Models

  • Deploying agents in containerized and cloud environments
  • Scaling patterns: horizontal vs vertical scaling, concurrency, and throttling
  • Multi-agent orchestration and workload balancing

Monitoring and Observability

  • Key metrics: latency, success rate, memory usage, and agent call depth
  • Tracing agent activity and call graphs
  • Instrumenting observability using Prometheus, OpenTelemetry, and Grafana

Logging, Auditing, and Compliance

  • Centralized logging and structured event collection
  • Compliance and auditability in agentic workflows
  • Designing audit trails and replay mechanisms for debugging

Performance Tuning and Resource Optimization

  • Reducing inference overhead and optimizing agent orchestration cycles
  • Model caching and lightweight embeddings for faster retrieval
  • Load testing and stress scenarios for AI pipelines

Cost Control and Governance

  • Understanding agent cost drivers: API calls, memory, compute, and external integrations
  • Tracking agent-level costs and implementing chargeback models
  • Automation policies to prevent agent sprawl and idle resource consumption
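
As an illustration of agent-level cost tracking, here is a minimal sketch of a per-agent ledger that attributes token spend to each agent and flags agents over budget. The prices, the `CostLedger` class, and its method names are hypothetical and chosen for this example only. Real cost drivers (compute, memory, external integrations) would need their own meters.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical prices in USD per 1,000 tokens; real rates depend on the provider.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}


@dataclass
class CostLedger:
    """Accumulates spend per agent and compares it with a per-agent budget."""

    budgets: dict  # agent_id -> budget in USD
    spend: dict = field(default_factory=lambda: defaultdict(float))

    def record(self, agent_id, input_tokens, output_tokens):
        """Add the cost of one model call to the agent's running total."""
        cost = (
            input_tokens / 1000 * PRICE_PER_1K["input"]
            + output_tokens / 1000 * PRICE_PER_1K["output"]
        )
        self.spend[agent_id] += cost
        return cost

    def over_budget(self):
        """Return agent ids whose spend exceeds their budget.

        Agents without a configured budget are never flagged.
        """
        return sorted(
            agent_id
            for agent_id, total in self.spend.items()
            if total > self.budgets.get(agent_id, float("inf"))
        )


if __name__ == "__main__":
    ledger = CostLedger(budgets={"researcher": 0.01, "summarizer": 1.00})
    ledger.record("researcher", input_tokens=1000, output_tokens=1000)
    ledger.record("summarizer", input_tokens=1000, output_tokens=0)
    print(ledger.over_budget())
```

The per-agent totals in `spend` are the kind of figure a chargeback report could aggregate by team, and the `over_budget` list is one possible trigger for an automation policy that pauses or scales down an agent.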

CI/CD and Rollout Strategies for Agents

  • Integrating agent pipelines into CI/CD systems
  • Testing, versioning, and rollback strategies for iterative agent updates
  • Progressive rollouts and safe deployment mechanisms

Failure Recovery and Reliability Engineering

  • Designing for fault tolerance and graceful degradation
  • Retry, timeout, and circuit breaker patterns for agent reliability
  • Incident response and post-mortem frameworks for AI operations
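
To make the retry and circuit breaker patterns concrete, here is a minimal sketch, assuming a synchronous agent or tool call wrapped in a plain Python callable. `CircuitBreaker`, `CircuitOpenError`, and `call_with_retry` are hypothetical names for this example and do not come from any particular library. Timeouts are assumed to be enforced inside the wrapped callable, for example through the client's own timeout setting.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable for testing
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call rejected")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self.failures >= self.max_failures or self.opened_at is not None:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def call_with_retry(breaker, fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry fn through the breaker with exponential backoff between attempts."""
    last_exc = None
    for attempt in range(attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise  # do not retry while the breaker is open
        except Exception as exc:
            last_exc = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_exc
```

Retries absorb transient failures, while the breaker stops a persistently failing dependency from being called repeatedly. A caller that catches `CircuitOpenError` can fall back to a cached or simplified response, which is one form of graceful degradation.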

Capstone Project

  • Build and deploy an agentic AI system with full monitoring and cost tracking
  • Simulate load, measure performance, and optimize resource usage
  • Present final architecture and monitoring dashboard to peers

Summary and Next Steps

Requirements

  • Strong understanding of MLOps and production machine learning systems
  • Experience with containerized deployments (Docker/Kubernetes)
  • Familiarity with cloud cost optimization and observability tools

Audience

  • MLOps engineers
  • Site Reliability Engineers (SREs)
  • Engineering managers overseeing AI infrastructure

21 hours
