
AI Architect / AI Platform Engineer

Comprehensive Interview Preparation Guide

AWS GenAI Platform Security Leadership

AI Hub — End-to-End Architecture

A production AI Hub on AWS follows a layered architecture with clear separation of concerns. Each layer can be scaled, monitored, and upgraded independently.

User Applications Layer — Web, Mobile, CLI, 3rd-party integrations
        │
1. AI Gateway / Model Access Layer
   API Gateway / ALB · Rate Limiting & Auth · Model Abstraction
        │
2. Agent Orchestration Runtime
   LangGraph / LangChain · State & Memory · Human-in-the-Loop
        │
3. RAG & Knowledge Infrastructure
   Vector DB (OpenSearch) · Embedding Pipeline · Re-ranking & Hybrid
        │
4. Foundation Models & Inference
   Bedrock (Claude, Llama, Titan) · SageMaker Endpoints

Cross-cutting: Governance / Security / Observability
Data Foundation: S3 • Glue • Lake Formation • DynamoDB • Aurora

Request Flow (Happy Path)

  1. User request arrives at API Gateway with auth token
  2. Gateway applies rate limits, validates schema, logs request
  3. Request routed to Agent Orchestration based on tenant/use case
  4. Agent determines if RAG context is needed
  5. If RAG: hybrid search (vector + keyword) in OpenSearch
  6. Retrieved context sent to Foundation Model via Bedrock
  7. Model output validated by Guardrails (PII, safety, toxicity)
  8. Response logged, cached if applicable, returned to user
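Step 2 (per-tenant rate limiting) can be sketched as a token bucket. This is an illustrative, in-process version; in the stack described here the same policy would normally live in API Gateway usage plans rather than application code.

```python
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    """Per-tenant token bucket: refills at `rate` tokens/sec, bursts up to `capacity`."""
    rate: float
    capacity: float
    tokens: float = 0.0
    last: float = field(default_factory=time.monotonic)

    def __post_init__(self):
        self.tokens = self.capacity  # start full

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per tenant; reject with HTTP 429 when allow() returns False
buckets = {"tenant-a": TokenBucket(rate=5, capacity=10)}
```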

Key Design Principles

🔌 Loose Coupling

Each layer can be scaled independently. Swap Bedrock for SageMaker endpoints without touching the gateway or agent layer.

🔍 Observability

Latency, cost, and token usage tracked at every layer. CloudWatch + custom dashboards for per-tenant metrics.

🔒 Tenant Isolation

Multi-tenancy enforced at both IAM (ABAC) and application layers. Data never leaks between tenants.

⚡ Resilience

Circuit breakers for model calls, fallback to smaller/cached models, graceful degradation when services are down.
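The circuit-breaker-with-fallback idea above can be sketched in a few lines. A minimal in-process version for illustration; production systems typically use a battle-tested library, and the primary/fallback callables here are placeholders.

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive errors; retry primary after `reset_after` s."""
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, primary, fallback, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback(*args, **kwargs)   # open: fail fast, degrade gracefully
            self.opened_at = None                  # half-open: try primary again
        try:
            result = primary(*args, **kwargs)
            self.failures = 0                      # success closes the breaker
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback(*args, **kwargs)
```

Usage would look like `breaker.call(invoke_primary_model, invoke_cached_or_smaller_model, prompt)`, so model outages degrade to a cheaper answer instead of an error.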

🛠 Platform Layers — Deep Dive

🔷 AI Gateway / Model Access Layer

The gateway is the single entry point for all AI requests. It handles authentication, rate limiting, request validation, and model routing.

AWS Implementation

| Component | AWS Service | Purpose |
|---|---|---|
| Routing | API Gateway / ALB | Route requests by tenant, model, use-case |
| Auth | Cognito + API Keys | JWT validation, API key management |
| Rate Limiting | API Gateway throttling | Per-tenant, per-model limits |
| Model Abstraction | Lambda + Bedrock SDK | Unified interface across model providers |
| Caching | ElastiCache (Redis) | Cache frequent queries, reduce cost |
Interview Tip: Emphasize that the gateway abstracts model providers — you can switch from Claude to Llama without application changes. This is a key "build vs buy" decision.
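A sketch of that abstraction: normalize one internal request shape into each provider's Bedrock request body, so the caller never changes when the model does. Field names follow Bedrock's documented per-provider formats at the time of writing; verify against current docs before relying on them.

```python
import json

def build_bedrock_body(provider: str, prompt: str, max_tokens: int = 512) -> str:
    """Map a unified (prompt, max_tokens) request onto provider-specific bodies."""
    if provider == "anthropic":          # Claude models
        body = {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "messages": [{"role": "user", "content": prompt}],
        }
    elif provider == "meta":             # Llama models
        body = {"prompt": prompt, "max_gen_len": max_tokens}
    elif provider == "amazon":           # Titan text models
        body = {"inputText": prompt,
                "textGenerationConfig": {"maxTokenCount": max_tokens}}
    else:
        raise ValueError(f"unknown provider: {provider}")
    return json.dumps(body)

# Caller stays identical across providers:
# bedrock.invoke_model(modelId=model_id, body=build_bedrock_body(provider, prompt))
```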
🔷 Agent Orchestration Runtime

The orchestration layer manages multi-step agent workflows with state management, tool selection, and human-in-the-loop capabilities.

Key Decisions

| Approach | Best For | Trade-off |
|---|---|---|
| LangGraph on ECS | Complex stateful agents | More control, more ops overhead |
| Bedrock Agents | Simple tool-use agents | Managed, but limited customization |
| Step Functions + Lambda | Deterministic workflows | Great visibility, but not dynamic |
Build vs Buy: Bedrock Agents are good for POCs. For production with complex retry logic, state persistence, and multi-agent coordination, build with LangGraph on ECS/EKS.
🔷 RAG & Knowledge Infrastructure

The knowledge layer handles document ingestion, chunking, embedding, vector storage, and retrieval with re-ranking.

Production Pipeline

Document Upload (S3)
  │
  ├─ Glue ETL: Parse PDF/DOCX/HTML
  ├─ Chunking: Semantic + Hierarchical (256-1024 tokens)
  ├─ Embedding: Titan Embeddings / Cohere
  │
  └─ OpenSearch: Index vectors + metadata
       ├─ Vector index (HNSW)
       └─ Keyword index (BM25)
Interview Tip: Always mention hybrid search (vector + keyword) with reciprocal rank fusion. This handles both semantic and exact-term matching, covering 90% of retrieval problems.
🔷 Governance, Security & Observability

A cross-cutting concern that spans all layers. Implements guardrails, audit logging, encryption, and monitoring.

Three-Layer Safety

1. Input Filtering
   ├─ Prompt injection detection (regex + ML)
   ├─ PII redaction (SSN, credit card, email)
   └─ Jailbreak detection
2. Model Execution (Sandboxed)
   └─ Rate limiting, token limits, timeouts
3. Output Filtering
   ├─ Harmful content detection
   ├─ Hallucination scoring
   └─ Sensitive data redaction + citation validation
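The PII-redaction step in the input filter can be sketched with regexes. These patterns are simplified illustrations; a real deployment would lean on Bedrock Guardrails or Comprehend PII detection rather than hand-rolled rules.

```python
import re

# Simplified illustrative patterns — real PII detection needs far more coverage
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact_pii(text: str) -> str:
    """Replace each detected PII span with a [LABEL] placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```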

🧠 LLM Internals

Transformer Architecture

Core Components
| Component | Function | Details |
|---|---|---|
| Tokenization | Text → tokens (BPE) | ~1 token ≈ 4 chars. Custom tokenizers for domain models. |
| Embedding Layer | Token IDs → dense vectors | 4096-12288 dims for modern LLMs. Semantic clustering. |
| Self-Attention | Compute token relationships | Attention(Q,K,V) = softmax(QKᵀ/√d_k)·V — O(n²) complexity |
| Multi-Head Attention | Parallel attention heads | 12-100 heads learning different relationships (syntax, semantics) |
| FFN | Feature transformation | d_model → 4×d_model → d_model with non-linearity |
| Layer Norm + Residual | Training stability | Enables training 100+ layer models |

Context Windows & Scaling

| Model | Context | Year |
|---|---|---|
| GPT-2 | ~1K tokens | 2019 |
| GPT-3 | 4K tokens | 2020 |
| Claude 3 | 200K tokens | 2024 |
| Claude 3.5 Sonnet | 200K tokens | 2024 |
Quadratic Problem: Attention is O(n²) — doubling context requires 4× resources. Effective context ≠ max tokens (models degrade at >80% capacity). The "lost-in-the-middle" effect means information in the center of long contexts gets less attention.
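The quadratic cost is easy to quantify: the attention score matrix has n² entries per head, so doubling the context quadruples it. A quick back-of-envelope, assuming fp16 scores (2 bytes per entry):

```python
def attn_matrix_bytes(n_tokens: int, bytes_per_entry: int = 2) -> int:
    """Memory for one head's n x n attention score matrix."""
    return n_tokens * n_tokens * bytes_per_entry

small = attn_matrix_bytes(4_096)   # 4K context  -> 32 MiB per head
large = attn_matrix_bytes(8_192)   # 8K context  -> 128 MiB per head

print(small // 2**20, "MiB")
print(large // 2**20, "MiB")
print(large // small)              # 4x growth for 2x context
```

(In practice kernels like FlashAttention avoid materializing the full matrix, but the compute still scales quadratically.)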

Fine-Tuning vs Prompting Decision Tree

Start with Prompt Engineering (cheap, fast)
  │
  └─ Quality insufficient after 10 iterations?
       ├─ Few-Shot Prompting (add examples) → 40-50% improvement
       └─ Still insufficient?
            ├─ LoRA / Adapter Tuning → 80-90% of full tuning, 10x faster
            └─ Full Fine-Tuning (distillation, hallucination correction)
✓ Fine-Tune When
  • Domain specialization (finance, biotech, proprietary code)
  • Latency reduction via distillation to smaller model
  • Hallucination reduction with correct domain data
  • Cost reduction (fewer tokens needed)
✗ Don't Fine-Tune When
  • General reasoning (models already excel)
  • Frequently changing tasks (require retraining)
  • Limited data (<500 examples)
  • Need for explainability

📚 RAG Patterns — Naive to Production

Evolution: Naive RAG → Advanced RAG

Naive RAG (Baseline)
User Query → Embed Query → Vector Search → Top-K → Augment Prompt → Generate

Problems: No query understanding, shallow retrieval, no ranking beyond similarity, wasted context window.

Advanced RAG (Production)
User Query
  ├─ Query Rewriting (rephrase for better retrieval)
  ├─ Query Expansion (synonyms, related queries)
  │
  ├─ Hybrid Search:
  │    ├─ Vector search (semantic similarity)
  │    ├─ BM25/keyword search (exact terms)
  │    └─ Reciprocal Rank Fusion
  │
  ├─ Candidate Retrieval (100-500 candidates)
  ├─ Re-ranking (cross-encoder, LLM-based relevance)
  ├─ Context Compression (filter, summarize, reorder)
  └─ Generation with Attribution

Query Rewriting Techniques

1. Query Decomposition

Break complex queries into sub-queries. Example: "Tax implications of 2008 crisis on small businesses" becomes: ["2008 financial crisis causes", "tax policy changes 2008", "small business impact"]. Retrieve for each, combine results.

2. Query Expansion

"Machine learning models" expands to: ["ML models", "neural networks", "deep learning", "AI algorithms"]. Improves recall when document terminology varies.

3. HyDE (Hypothetical Document Embeddings)

Generate a hypothetical relevant document for the query, then use its embedding for retrieval instead of the query embedding. Bridges the vocabulary gap between queries and documents.
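A minimal HyDE sketch, with the LLM, embedder, and vector search passed in as callables (all three are hypothetical stand-ins here, not a specific library's API):

```python
def hyde_retrieve(query, generate, embed, search, top_k=10):
    """HyDE: retrieve using the embedding of a hypothetical answer, not the query."""
    # 1. Ask the LLM to write a plausible answer passage for the query
    hypothetical_doc = generate(
        f"Write a short passage that would answer: {query}"
    )
    # 2. Embed the hypothetical document instead of the raw query
    doc_embedding = embed(hypothetical_doc)
    # 3. Run ordinary vector search with that embedding
    return search(doc_embedding, top_k)
```

Because the hypothetical passage is written in "document language" rather than "question language", its embedding usually lands closer to the real documents than the query's would.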

Chunking Strategies

| Strategy | Size | Best For | Trade-off |
|---|---|---|---|
| Fixed-Size | 512 tokens, 20% overlap | Baseline, general use | Breaks mid-sentence |
| Semantic | Variable | Domain-heavy (law, finance) | More expensive (embedding every boundary) |
| Hierarchical | Multi-level | Production systems | Complex but most effective |
| Document-Aware | Preserves structure | Structured docs with headers | Requires document parsing |
Production Best Practice: Start with hierarchical chunking + hybrid search + re-ranking. This covers 90% of RAG problems at moderate complexity. Re-ranking alone improves relevance 40-60% with <10% latency overhead.
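The fixed-size baseline from the table can be sketched as a sliding window with overlap. Token counts are approximated by whitespace words here for illustration; real pipelines count model tokens with the tokenizer in use.

```python
def chunk_fixed(words, size=512, overlap_frac=0.2):
    """Fixed-size chunking with fractional overlap (e.g. 512 'tokens', 20% overlap)."""
    step = max(1, int(size * (1 - overlap_frac)))  # advance by size minus overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):              # last window reached the end
            break
    return chunks

doc = "the quick brown fox jumps over the lazy dog".split()
print(chunk_fixed(doc, size=4, overlap_frac=0.25))
```

With size 4 and 25% overlap, each chunk repeats the last word of the previous one, which is exactly the mid-sentence-break mitigation the overlap is for.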

Hybrid Search Implementation

# Parallel queries
semantic_results = vector_search(query_embedding, top_k=100)  # Recall: 80%
keyword_results  = bm25_search(query, top_k=100)               # Recall: 70%

# Fusion methods:
# 1. Reciprocal Rank Fusion (RRF): score = 1/(k + rank), sum scores
# 2. Weighted average: score = 0.7 * vec_score + 0.3 * keyword_score

ranked_results = fuse_and_rank(semantic_results, keyword_results)
final_results  = rerank(ranked_results, query, top_k=10)
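The sketch above leaves `fuse_and_rank` undefined; here is a concrete Reciprocal Rank Fusion, using k=60 as in the original RRF paper:

```python
from collections import defaultdict

def rrf_fuse(result_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank_d)."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["d3", "d1", "d2"]   # ranked by vector similarity
keyword  = ["d1", "d4", "d3"]   # ranked by BM25
fused = rrf_fuse([semantic, keyword])
```

Note that d1 wins despite never being ranked first in either list: appearing near the top of both lists beats topping only one, which is the behavior that makes RRF robust to either retriever's blind spots.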

🤖 Agent Frameworks

LangChain vs LangGraph vs Semantic Kernel

| Dimension | LangChain | LangGraph | Semantic Kernel |
|---|---|---|---|
| Model | Chains (linear) | DAGs with state | Plugins + Planning |
| Strengths | Wide ecosystem, fast to prototype | Loops, branching, persistence, debuggable | Enterprise tooling, .NET integration |
| Weaknesses | No native loops/state | Steeper learning curve | Smaller Python community |
| Best For | Chatbots, Q&A, simple pipelines | Multi-step agents, approval workflows | Enterprise .NET / Azure |
| State Mgmt | Manual | Built-in persistence | Kernel context |

Decision Framework

Use LangChain When
  • Linear workflow (chat, Q&A, summarization)
  • Quick prototyping needed
  • Heavy integration requirements
Use LangGraph When
  • Agent with loops & retries
  • Human-in-the-loop approval
  • Multi-agent coordination
  • Need state persistence & recovery

Production Agent Architecture

Request → API Gateway
  │
LangGraph Agent Orchestrator
  ├─ Agent node (decide next action)
  ├─ Tool nodes (execute actions)
  │    ├─ Database query tool
  │    ├─ Search tool
  │    ├─ Calculation tool
  │    └─ External API tool
  ├─ Human review node (optional loop)
  └─ Response formatting node
  │
Response Cache (Redis) → Response

🌐 Multi-Agent Systems

Orchestration Patterns

1. Orchestrator-Worker
Orchestrator (supervisor)
  ├─ Task → Agent A
  ├─ Task → Agent B
  └─ Synthesize results

Pros: Simple, clear control. Cons: Single point of failure.

2. Peer-to-Peer
Agent A ↔ Agent B
   ↓         ↓
Agent C ↔ Agent D

Pros: Resilient, scalable. Cons: Harder to debug.

3. Hierarchical
Strategic Agent
  ├─ Tactical Agent 1
  │    └─ Worker Agents
  └─ Tactical Agent 2

Pros: Separation of concerns. Cons: Complex orchestration.

4. Market-Based
Task Published → Agents Bid
        ↓
  Winner Executes

Pros: Efficient allocation. Cons: Needs reputation system.

Communication Protocols (2025-2026)

| Protocol | By | Model | Use Case |
|---|---|---|---|
| MCP | Anthropic | Server-client | LLM access to tools, data sources |
| A2A | Community | Peer-to-peer | Multi-agent negotiation, delegation |
| ACP | Community | Formal spec | Heterogeneous multi-agent systems |
| ANP | Community | Decentralized | Swarm intelligence, distributed decisions |

🚀 LLMOps / MLOps Lifecycle

CI/CD Pipeline for AI

Code Changes (Git)
  │
  ├─ [1] Lint & Type Check (Black, MyPy, prompt template validation)
  ├─ [2] Unit Tests (model init, tool compat, prompt structure)
  ├─ [3] Integration Tests (E2E workflows, latency checks)
  ├─ [4] Evaluation Suite (RAGAS, correctness, hallucination rate, cost)
  ├─ [5] Staging Deployment (shadow traffic, baseline comparison)
  ├─ [6] Canary Deployment (5% traffic, auto-rollback)
  └─ [7] Production (blue-green, gradual traffic shift)

Model Evaluation Metrics

| Category | Metrics | Tools |
|---|---|---|
| Retrieval (RAG) | Precision@K, Recall, NDCG, MRR | RAGAS, TruLens |
| Generation | BLEU, ROUGE, METEOR | HuggingFace Evaluate |
| Hallucination | Factuality score, contradiction detection | Vectara, LangSmith |
| Latency | P50, P95, P99 | CloudWatch, Datadog |
| Cost | Tokens/query, cost/request | Built-in token counting |

Drift Detection

Input Drift

Query distribution changes (e.g., billing bot gets technical questions). Detect via statistical tests on embedding distributions.
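One common statistical test for this kind of input drift is the Population Stability Index over binned values (for example, queries per embedding cluster). A stdlib-only sketch; the 0.1/0.2 thresholds are a widely used rule of thumb, not an AWS default.

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions.
    Rule of thumb: < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 significant drift."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)   # clamp to avoid log(0)
        a_pct = max(a / a_total, eps)
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total

baseline = [500, 300, 150, 50]   # e.g. queries per embedding cluster, last month
live     = [480, 310, 160, 50]   # similar mix: PSI stays near 0
shifted  = [100, 150, 250, 500]  # billing bot now gets technical questions
```

An hourly job computing PSI on live traffic and alarming above 0.2 is a cheap first line of defense before heavier embedding-distribution tests.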

Output Drift

Model behavior changes (more verbose, different tone). Detect via token length histograms, lexical diversity metrics.

Data Quality Drift

RAG documents updated but indexes stale. Detect via embedding distribution shift. Action: reindex, retrain.

Business Metric Drift

User satisfaction or conversion drops. Detect via user ratings, feedback analysis. Action: prompt refinement.

Retraining Triggers

Automatic
  • Hallucination rate > 5%
  • Latency P95 > 2x baseline
  • Retrieval NDCG drops 20%
  • New data + 2 weeks passed
Manual
  • Major domain update (new regulations)
  • Systemic bias from user feedback
  • Better model released by provider
  • Compliance requirement changes

Amazon Bedrock

Architecture Overview

Client Applications (Lambda, SageMaker, Custom)
  │
Bedrock API — /messages · /invoke-agent · /retrieve
  │
Bedrock Guardrails
  Input filters (PII, prompt injection) | Output filters (toxicity, hallucination) | Data redaction
  │
Foundation Models — Claude | Llama | Titan | Mistral
  │
Knowledge Bases (RAG) — S3 ingestion | Chunking | OpenSearch
Agents — Action Groups | Lambda Tools | Session State | Max 15 iterations

Bedrock Knowledge Bases: Build vs Buy

| | Bedrock Knowledge Bases | Custom RAG Pipeline |
|---|---|---|
| Setup Time | Hours | Weeks |
| Chunking | Fixed only | Semantic, hierarchical, custom |
| Hybrid Search | Limited | Full control (RRF, weighted) |
| Re-ranking | Not customizable | Cross-encoder, LLM-based |
| Best For | POCs, simple Q&A | Production, accuracy >95% |
Interview Tip: Frame build vs buy as a spectrum. Start with Bedrock KB for POC, then migrate to custom RAG when accuracy requirements increase. This shows pragmatic architecture thinking.

🔬 Amazon SageMaker — Deep Dive

SageMaker is the core ML platform on AWS — it spans the entire ML lifecycle from data labeling through training, hosting, and monitoring. For an AI Hub, SageMaker handles custom model training, fine-tuning, and serving models that Bedrock doesn't offer.

SAGEMAKER END-TO-END ML LIFECYCLE

1. PREPARE: Data Wrangler · Processing Jobs · Feature Store · Ground Truth (labeling) · Clarify (bias detection)
2. BUILD & TRAIN: Studio Notebooks · Training Jobs (GPU/Spot) · Hyperparameter Tuning · Distributed Training · JumpStart (pretrained)
3. DEPLOY: Real-Time Endpoints · Async Inference · Batch Transform · Serverless Inference · Multi-Model Endpoints
4. MONITOR: Model Monitor · Data Capture · Drift Detection · Explainability · CloudWatch Alarms

Cross-cutting:
  SageMaker Pipelines — CI/CD orchestration (DAG-based MLOps)
  Model Registry — versioning, approval, lineage
  Feature Store — Online (DynamoDB) + Offline (S3)
  Experiments & Trials — track runs, compare metrics

Infrastructure: S3 (artifacts) • ECR (containers) • IAM (roles) • VPC (networking) • KMS (encryption) • CloudWatch (logs)
Integrations: Bedrock, Lambda, Step Functions, EventBridge, Glue

SageMaker vs Bedrock — When to Use Which

| Scenario | Use Bedrock | Use SageMaker |
|---|---|---|
| Foundation model inference | ✓ (managed, multi-provider) | Only for models not on Bedrock |
| Custom model training | — | ✓ (Training Jobs, distributed) |
| Fine-tuning LLMs | ✓ (limited models, Bedrock fine-tuning) | ✓ (full control, any model, LoRA/QLoRA) |
| Classical ML | — | ✓ (XGBoost, sklearn, etc.) |
| Embedding models | ✓ (Titan Embeddings, Cohere) | ✓ (custom embedding models) |
| RAG | ✓ (Knowledge Bases) | Pair with OpenSearch for custom RAG |
| AI agents | ✓ (Bedrock Agents) | ✗ (use LangGraph on ECS instead) |
| GPU/hardware control | ✗ (abstracted) | ✓ (choose instance type, GPU count) |
| Cost optimization | Pay per token | Pay per instance-hour (spot = 70% off) |
Interview Tip: A mature AI Hub uses both: Bedrock for quick access to foundation models, SageMaker for custom models, fine-tuning, and workloads where you need hardware control or cost optimization at scale.

Inference Endpoints — Deep Dive

| Option | Latency | Throughput | Cost Model | Use Case |
|---|---|---|---|---|
| Real-Time Endpoint | <1s | High | Pay per instance-hour | Interactive APIs, chatbots |
| Async Inference | Minutes | Very high | Pay per request + instance | Large payload, batch, RAG pipelines |
| Batch Transform | Hours | Unlimited | Cheapest per inference | Offline scoring, nightly reprocessing |
| Serverless Inference | 1-2s (cold start) | Medium | Pay per invocation + duration | Spiky traffic, dev/test environments |
| Multi-Model Endpoint | <1s (cached) | High | Pay per instance (shared) | Serve 100s of models on 1 endpoint |
| Multi-Container Endpoint | <1s | High | Pay per instance | Inference pipeline (preprocess + model + postprocess) |
Real-Time Endpoint Architecture
API Request
  │
Application Load Balancer
  │
SageMaker Endpoint
  ├─ Production Variant A (80% traffic) ─ ml.g5.xlarge
  ├─ Production Variant B (20% traffic) ─ ml.g5.2xlarge [canary]
  └─ Shadow Variant (0% live, copies traffic) ─ [A/B testing]
       │
Model Container (ECR image with inference code)
  ├─ model_fn()   ─ Load model weights
  ├─ input_fn()   ─ Deserialize request
  ├─ predict_fn() ─ Run inference
  └─ output_fn()  ─ Serialize response
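The four container hooks follow the SageMaker inference toolkit convention (a script, typically `inference.py`, that the serving container imports). A minimal skeleton; the JSON "model" and linear score are toy stand-ins for real weights and a real framework load.

```python
import json
import os

def model_fn(model_dir):
    """Called once at container start: load weights from /opt/ml/model."""
    with open(os.path.join(model_dir, "model.json")) as f:
        return json.load(f)          # stand-in for torch.load / joblib.load

def input_fn(request_body, content_type):
    """Deserialize the HTTP request payload."""
    assert content_type == "application/json"
    return json.loads(request_body)

def predict_fn(data, model):
    """Run inference. Here: a toy linear score w·x + b."""
    return sum(w * x for w, x in zip(model["weights"], data["features"])) + model["bias"]

def output_fn(prediction, accept):
    """Serialize the response body."""
    return json.dumps({"score": prediction})
```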

Production Variant Configuration

import sagemaker
from sagemaker.model import Model

model = Model(
    image_uri="123456789.dkr.ecr.us-east-1.amazonaws.com/my-model:latest",
    model_data="s3://bucket/model.tar.gz",
    role="arn:aws:iam::123456789:role/SageMakerRole"
)

# Deploy with two variants for A/B testing
predictor = model.deploy(
    initial_instance_count=2,
    instance_type="ml.g5.xlarge",
    endpoint_name="my-model-endpoint",
    data_capture_config=DataCaptureConfig(
        enable_capture=True,
        sampling_percentage=100,         # Capture all for monitoring
        destination_s3_uri="s3://bucket/capture/"
    )
)
Async Inference — For Long-Running AI Workloads
Client → POST request with S3 input location
  │
SageMaker Async Endpoint
  ├─ Queues request internally
  ├─ Returns InferenceId immediately
  ├─ Processes when capacity available
  └─ Writes output to S3
  │
SNS Notification → Success/Failure callback
  │
Auto-scales to 0 when idle (cost savings!)
# Async endpoint config — scales to 0 when idle
async_config = AsyncInferenceConfig(
    output_path="s3://bucket/async-output/",
    max_concurrent_invocations_per_instance=4,
    notification_config=AsyncInferenceNotificationConfig(
        success_topic="arn:aws:sns:us-east-1:123:success-topic",
        error_topic="arn:aws:sns:us-east-1:123:error-topic"
    )
)

model.deploy(
    async_inference_config=async_config,
    instance_type="ml.g5.2xlarge",
    initial_instance_count=1
)
Key benefit: Async endpoints can scale to 0 instances — no traffic means no cost. Perfect for intermittent AI workloads like document processing, batch RAG, or nightly model retraining inference.
Multi-Model Endpoints (MME) — Serve Hundreds of Models
Multi-Model Endpoint (single endpoint, single instance fleet)
  │
  ├─ Model A (loaded) ─ Tenant 1 fine-tuned model
  ├─ Model B (loaded) ─ Tenant 2 fine-tuned model
  ├─ Model C (on S3)  ─ Loaded on-demand when called
  ├─ Model D (on S3)  ─ Loaded on-demand when called
  └─ ... up to 1000s of models

Dynamic loading: frequently-used models stay in memory; cold models are loaded from S3 on first request (~seconds).

MME is ideal for multi-tenant AI platforms where each tenant has a fine-tuned model. Instead of one endpoint per tenant ($$$), you serve all tenants from a shared fleet and SageMaker handles model loading/unloading based on traffic patterns.

# Invoke a specific model on a multi-model endpoint
response = sm_runtime.invoke_endpoint(
    EndpointName="multi-model-endpoint",
    TargetModel="tenant-a/model.tar.gz",  # S3 key for this tenant's model
    Body=json.dumps(payload),
    ContentType="application/json"
)
Inference Recommender — Right-Size Your Endpoints

SageMaker Inference Recommender benchmarks your model across different instance types and configurations to find the optimal cost/performance trade-off.

# Run inference recommender to find optimal instance
response = sm_client.create_inference_recommendations_job(
    JobName="my-model-benchmark",
    JobType="Default",  # or "Advanced" for custom traffic patterns
    RoleArn=role_arn,
    InputConfig={
        "ModelPackageVersionArn": model_package_arn,
        "JobDurationInSeconds": 3600
    }
)
# Returns: ranked list of instance types with latency, throughput, cost
# Example: ml.g5.xlarge — P95: 120ms, 40 req/s, $1.41/hr
#          ml.g5.2xlarge — P95: 85ms, 65 req/s, $2.36/hr

Auto-Scaling Endpoints

Scaling Policies for AI Workloads
import boto3
client = boto3.client("application-autoscaling")

# Register the endpoint variant as a scalable target
client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/my-endpoint/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,    # Always at least 2 for availability
    MaxCapacity=20,   # Burst up to 20 instances
)

# Target tracking — scale based on invocations per instance
client.put_scaling_policy(
    PolicyName="InvocationsPerInstance",
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/my-endpoint/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,   # Target 70 invocations per instance
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,   # Wait 5min before scaling down
        "ScaleOutCooldown": 60,   # Scale up quickly (1min)
    }
)

Scaling Metrics Comparison

| Metric | Best For | Trade-off |
|---|---|---|
| InvocationsPerInstance | General workloads | Simple, but doesn't account for request complexity |
| CPUUtilization | CPU-bound models | Doesn't correlate with GPU utilization |
| GPUUtilization | GPU-heavy inference | Custom metric via CloudWatch, more accurate for LLMs |
| ModelLatency | Latency-sensitive APIs | Scale when latency degrades above target |
| Custom (queue depth) | Async inference | Scale based on pending requests in SQS |

SageMaker Training Jobs

Training Architecture
Training Job
  │
  ├─ Input: S3 (training data) + ECR (container image)
  ├─ Compute: ml.p4d.24xlarge (8x A100 GPUs)
  │    ├─ On-Demand: full price, guaranteed capacity
  │    └─ Managed Spot: 70% off, can be interrupted
  │         └─ Checkpointing to S3 every N steps
  ├─ Distributed: data parallel / model parallel
  │    ├─ Data Parallel: same model, split data across GPUs
  │    └─ Model Parallel: split model layers across GPUs (for LLMs)
  └─ Output: model.tar.gz → S3 → Model Registry
Training Job Code Example
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.1-gpu",
    role=role,
    instance_count=4,                   # 4 instances for distributed training
    instance_type="ml.p4d.24xlarge",     # 8x A100 GPUs each = 32 GPUs total
    use_spot_instances=True,             # 70% cost savings
    max_wait=86400,                      # Max 24hr including spot interruptions
    max_run=72000,                       # Max 20hr actual training
    checkpoint_s3_uri="s3://bucket/checkpoints/",
    output_path="s3://bucket/output/",
    hyperparameters={
        "epochs": "10",
        "learning_rate": "5e-5",
        "batch_size": "32",
        "model_name": "bert-base-uncased"
    },
    distribution={
        "torch_distributed": {"enabled": True}  # PyTorch DDP
    },
    metric_definitions=[
        {"Name": "train:loss", "Regex": "train_loss: ([0-9\\.]+)"},
        {"Name": "eval:accuracy", "Regex": "eval_acc: ([0-9\\.]+)"}
    ]
)

estimator.fit({
    "training": "s3://bucket/train/",
    "validation": "s3://bucket/val/"
})
Managed Spot Training — 70% Cost Savings
How It Works
  • SageMaker uses EC2 Spot Instances (unused capacity, 70% cheaper)
  • If spot is interrupted, training pauses and resumes from checkpoint
  • Set max_wait as budget for total time including interruptions
  • Set max_run as budget for actual training time
Checkpointing Strategy
  • Save checkpoints to S3 every N steps (not just epochs)
  • For LLM fine-tuning: checkpoint every 500-1000 steps
  • Resume from latest checkpoint after spot interruption
  • Cost savings: 60-90% vs on-demand for training jobs

Hyperparameter Tuning (HPO)

from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="eval:accuracy",
    objective_type="Maximize",
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-5, 5e-4, scaling_type="Logarithmic"),
        "batch_size": IntegerParameter(16, 128),
        "warmup_steps": IntegerParameter(100, 1000),
    },
    max_jobs=20,          # Run 20 combinations
    max_parallel_jobs=4,   # 4 training jobs in parallel
    strategy="Bayesian",    # Bayesian optimization (smarter than grid/random)
)

tuner.fit({"training": "s3://bucket/train/"})
Tuning Strategies: Bayesian (best for expensive jobs, learns from previous runs) > Random (good baseline, parallelizable) > Grid (exhaustive, expensive). For LLM fine-tuning, Bayesian with 10-20 jobs is usually sufficient.

SageMaker Pipelines — MLOps Orchestration

SageMaker Pipeline (DAG-based, repeatable, versioned)
  │
  ├─ ProcessingStep
  │    └─ Data cleaning, feature engineering, train/test split
  ├─ TrainingStep
  │    └─ Model training (GPU, distributed, spot)
  ├─ TuningStep (optional)
  │    └─ Hyperparameter optimization across N runs
  ├─ EvaluationStep
  │    └─ Compute metrics (accuracy, F1, RAGAS, custom)
  ├─ ConditionStep
  │    ├─ accuracy > 0.85? → RegisterModel → Deploy
  │    └─ else → FailStep (notify team)
  ├─ RegisterModelStep
  │    └─ Model Registry (version, approval status, lineage)
  └─ CreateEndpointStep (or Lambda for custom deployment)
       └─ Deploy to endpoint with data capture enabled
Pipeline Code — Full Example
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.step_collections import RegisterModel

# Step 1: Data Processing
processing_step = ProcessingStep(
    name="PreprocessData",
    processor=sklearn_processor,
    code="preprocess.py",
    inputs=[ProcessingInput(source=input_data, destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(output_name="train"), ProcessingOutput(output_name="test")]
)

# Step 2: Training
training_step = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={"train": TrainingInput(s3_data=processing_step.properties.ProcessingOutputConfig
        .Outputs["train"].S3Output.S3Uri)}
)

# Step 3: Evaluation
eval_step = ProcessingStep(
    name="EvaluateModel",
    processor=sklearn_processor,
    code="evaluate.py",
    inputs=[ProcessingInput(source=training_step.properties.ModelArtifacts.S3ModelArtifacts,
        destination="/opt/ml/processing/model")],
    property_files=[PropertyFile(name="evaluation", output_name="metrics",
        path="evaluation.json")]
)

# Step 4: Conditional deployment
cond = ConditionGreaterThanOrEqualTo(
    left=JsonGet(step_name="EvaluateModel", property_file="evaluation",
        json_path="accuracy"),
    right=0.85
)

register_step = RegisterModel(
    name="RegisterModel",
    estimator=estimator,
    model_data=training_step.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["application/json"],
    approval_status="PendingManualApproval"  # Requires human sign-off
)

condition_step = ConditionStep(
    name="CheckAccuracy",
    conditions=[cond],
    if_steps=[register_step],     # Deploy if accurate enough
    else_steps=[FailStep(name="ModelNotGoodEnough")]
)

# Assemble pipeline
pipeline = Pipeline(
    name="AI-Hub-Training-Pipeline",
    steps=[processing_step, training_step, eval_step, condition_step],
    sagemaker_session=session
)

pipeline.upsert(role_arn=role)
pipeline.start()

Model Registry & Approval Workflow

Model Registry Architecture
Model Package Group: "fraud-detection-models"
  │
  ├─ Version 1 (v1.0) ─ Status: Approved ─ In Production
  │    ├─ Metrics: accuracy=0.92, F1=0.89
  │    ├─ Artifacts: s3://models/fraud-v1/model.tar.gz
  │    └─ Lineage: training job ARN, dataset version, pipeline run
  ├─ Version 2 (v2.0) ─ Status: PendingManualApproval
  │    ├─ Metrics: accuracy=0.94, F1=0.91
  │    └─ Awaiting: compliance review + risk committee
  └─ Version 3 (v3.0) ─ Status: Rejected
       └─ Reason: bias detected in demographic parity test
# Approve a model version (typically via CI/CD or manual review)
sm_client.update_model_package(
    ModelPackageArn="arn:aws:sagemaker:us-east-1:123:model-package/fraud-v2",
    ModelApprovalStatus="Approved",
    ApprovalDescription="Passed compliance review, bias testing, and risk committee"
)

# EventBridge rule triggers deployment on approval
# Rule: source=sagemaker, detail-type=ModelPackageStateChange, status=Approved
# Target: Lambda function that deploys to endpoint

Feature Store

Dual-Store Architecture
Feature Group: customer-features
  ├─ customer_id (record identifier)
  ├─ avg_transaction_amount
  ├─ transaction_count_30d
  ├─ credit_score
  └─ risk_segment
        │
  ├─ Online Store (DynamoDB)
  │    ├─ Millisecond lookups
  │    ├─ Latest feature values
  │    └─ Real-time inference
  └─ Offline Store (S3 / Parquet)
       ├─ Historical data
       ├─ Training datasets
       └─ Point-in-time joins
Feature Store Code — Create, Ingest, Retrieve
from sagemaker.feature_store.feature_group import FeatureGroup

# Define feature group
feature_group = FeatureGroup(name="customer-features", sagemaker_session=session)
feature_group.load_feature_definitions(data_frame=df)

# Create with both online + offline stores
feature_group.create(
    s3_uri="s3://bucket/feature-store/",
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True    # DynamoDB for real-time
)

# Ingest features
feature_group.ingest(data_frame=df, max_workers=4, wait=True)

# Real-time lookup (online store)
record = featurestore_runtime.get_record(
    FeatureGroupName="customer-features",
    RecordIdentifierValueAsString="CUST-12345"
)

# Historical query for training (offline store — Athena)
query = feature_group.athena_query()
query.run(
    query_string="""
        SELECT * FROM "customer-features"
        WHERE event_time BETWEEN '2025-01-01' AND '2025-12-31'
    """,
    output_location="s3://bucket/athena-results/"
)

SageMaker Model Monitor

Four Types of Monitoring
| Monitor Type | What It Detects | How It Works |
|---|---|---|
| Data Quality | Input data drift, schema violations, missing values | Compares live input distributions against baseline statistics |
| Model Quality | Accuracy degradation, prediction drift | Compares predictions against ground truth labels (when available) |
| Bias Drift | Fairness violations across protected groups | Runs SageMaker Clarify bias metrics on live traffic |
| Feature Attribution Drift | Feature importance changes | Compares SHAP values over time to baseline |
Model Monitor Setup Code
from sagemaker.model_monitor import DefaultModelMonitor, DataCaptureConfig

# Step 1: Enable Data Capture on endpoint
data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri="s3://bucket/data-capture/",
    capture_options=["Input", "Output"]  # Capture both request + response
)

# Step 2: Create baseline from training data
monitor = DefaultModelMonitor(role=role, instance_type="ml.m5.xlarge")
monitor.suggest_baseline(
    baseline_dataset="s3://bucket/baseline-data/train.csv",
    dataset_format=DatasetFormat.csv(header=True)
)

# Step 3: Schedule monitoring (hourly)
monitor.create_monitoring_schedule(
    monitor_schedule_name="fraud-model-monitor",
    endpoint_input=EndpointInput(
        endpoint_name="fraud-detection-endpoint",
        destination="/opt/ml/processing/input"
    ),
    output_s3_uri="s3://bucket/monitoring-output/",
    schedule_cron_expression="cron(0 * ? * * *)"  # Every hour
)

# Violations trigger CloudWatch alarms → SNS → PagerDuty

SageMaker Clarify — Bias Detection & Explainability

Pre-Training Bias
  • Class Imbalance (CI): Are protected groups underrepresented?
  • Difference in Proportions (DPL): Does label distribution differ by group?
  • KL Divergence: How different are feature distributions across groups?
  • Run before training to catch data issues early
Post-Training Bias
  • Demographic Parity (DPPL): Equal positive prediction rates?
  • Equalized Odds (DI): Equal true/false positive rates?
  • Accuracy Difference (AD): Does accuracy vary by group?
  • Run after training and continuously in production
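These pre-training metrics are simple enough to compute by hand. A stdlib-only sketch of Class Imbalance and DPL on a toy loan dataset (the formulas follow Clarify's definitions; the data and group names are illustrative):

```python
def class_imbalance(n_advantaged: int, n_disadvantaged: int) -> float:
    """CI = (n_a - n_d) / (n_a + n_d); range [-1, 1], 0 means balanced."""
    return (n_advantaged - n_disadvantaged) / (n_advantaged + n_disadvantaged)

def dpl(pos_advantaged: int, n_advantaged: int,
        pos_disadvantaged: int, n_disadvantaged: int) -> float:
    """DPL = q_a - q_d: difference in positive-label proportions between groups."""
    return pos_advantaged / n_advantaged - pos_disadvantaged / n_disadvantaged

# Toy loan data: 800 group-A rows (400 approved), 200 group-B rows (60 approved)
print(round(class_imbalance(800, 200), 2))  # 0.6 -> group B underrepresented
print(round(dpl(400, 800, 60, 200), 2))     # 0.2 -> approvals skew toward group A
```

Both metrics near 0 suggest a balanced dataset; large positive values here flag exactly the data issues Clarify is meant to catch before training.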
Clarify Explainability — SHAP Values

SageMaker Clarify uses SHAP (SHapley Additive exPlanations) to explain individual predictions. This is critical for BFSI compliance (GDPR right to explanation).

from sagemaker import clarify

shap_config = clarify.SHAPConfig(
    baseline=[[0, 0, 0, 0]],  # Reference point for SHAP
    num_samples=500,           # Number of perturbations
    agg_method="mean_abs"      # Aggregation for feature importance
)

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role, instance_count=1, instance_type="ml.m5.xlarge"
)

# data_config (clarify.DataConfig) and model_config (clarify.ModelConfig)
# are assumed to be defined above, pointing at the dataset and endpoint
clarify_processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config
)
# Output: per-feature importance scores for each prediction
# "This loan was denied primarily due to: credit_score (45%),
#  debt_to_income (30%), employment_length (15%)"
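To make SHAP concrete: for tiny models, Shapley values can be computed exactly by enumerating feature coalitions. A stdlib sketch using a toy additive scoring model (this is the underlying game-theoretic definition, not Clarify's sampling-based implementation; all names are ours):

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values by coalition enumeration (exponential cost --
    fine for toy feature counts). Absent features take their baseline value."""
    n = len(x)

    def eval_coalition(S):
        return f([x[i] if i in S else baseline[i] for i in range(n)])

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            for S in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                total += weight * (eval_coalition(set(S) | {i}) - eval_coalition(set(S)))
        phi.append(total)
    return phi

# Toy additive credit model: score = 0.5*credit + 0.3*dti + 0.2*employment
score = lambda v: 0.5 * v[0] + 0.3 * v[1] + 0.2 * v[2]
phi = shapley_values(score, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
# For an additive model the attributions recover each feature's contribution
```

For an additive model the attributions equal the per-feature weights times the deviation from baseline, which is why the "credit_score (45%)" style explanation above is possible.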

SageMaker JumpStart — Pretrained Models

JumpStart provides 600+ pretrained models that can be deployed or fine-tuned with one click. Key models for an AI Hub:

Model | Type | Use in AI Hub
Llama 3.1 (70B/8B) | LLM | Self-hosted alternative to Bedrock (full control, no token fees)
Falcon (180B/40B) | LLM | Open-source LLM for low-cost inference
BGE / GTE | Embedding | Custom embedding models for domain-specific RAG
Stable Diffusion | Image Gen | Image generation for creative use cases
Whisper | Speech-to-Text | Audio transcription for call center AI
Cross-Encoder | Re-ranking | Re-rank RAG results for higher relevance
Cost Strategy: For high-volume inference, self-hosting open-source LLMs via JumpStart on GPU instances can be 5-10x cheaper than Bedrock per-token pricing. Trade-off: you manage the infrastructure.
JumpStart Deploy Code
from sagemaker.jumpstart.estimator import JumpStartEstimator
from sagemaker.jumpstart.model import JumpStartModel

# Deploy Llama 3.1 8B Instruct for inference
model = JumpStartModel(
    model_id="meta-textgeneration-llama-3-1-8b-instruct",
    instance_type="ml.g5.2xlarge",
    role=role
)

predictor = model.deploy(
    initial_instance_count=1,
    endpoint_name="llama-3-1-endpoint"
)

# Fine-tuning uses JumpStartEstimator (JumpStartModel is inference-only)
estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-3-1-8b-instruct",
    role=role
)
estimator.fit({"training": "s3://bucket/fine-tune-data/"})

SageMaker for LLM Fine-Tuning

Fine-Tuning Options Comparison
Method | Data Needed | Training Time | Cost | Quality
Full Fine-Tuning | 1-5K examples | Hours-Days | $$$ | Best
LoRA | 500-2K examples | Minutes-Hours | $$ | 90% of full
QLoRA | 500-2K examples | Minutes-Hours | $ | 85% of full
Bedrock Fine-Tuning | 1K+ examples | Hours | $$ | Good (limited models)
Decision: Use Bedrock fine-tuning for supported models (quick, managed). Use SageMaker for open-source models (Llama, Falcon), LoRA/QLoRA, or when you need full control over training hyperparameters and data pipeline.
LoRA Fine-Tuning on SageMaker
# Hugging Face Estimator with PEFT/LoRA
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point="train_lora.py",
    instance_type="ml.g5.12xlarge",  # 4x A10G GPUs
    instance_count=1,
    transformers_version="4.36",
    pytorch_version="2.1",
    py_version="py310",
    role=role,
    hyperparameters={
        "model_id": "meta-llama/Llama-3.1-8B-Instruct",
        "lora_r": "16",          # LoRA rank (lower = faster, less capacity)
        "lora_alpha": "32",     # LoRA scaling factor
        "lora_dropout": "0.05",
        "epochs": "3",
        "learning_rate": "2e-4",
        "per_device_train_batch_size": "4",
        "gradient_accumulation_steps": "4",  # Effective batch = 16
    }
)

huggingface_estimator.fit({"training": "s3://bucket/lora-data/"})

Cost Optimization Strategies

Training Cost Savings
  • Spot instances: 60-90% savings with checkpointing
  • Right-sizing: Use Inference Recommender to find optimal instance
  • Mixed precision (FP16/BF16): 2x throughput, half memory
  • Savings Plans: 1-3yr commitments for 30-60% off
  • LoRA over full fine-tuning: 10x cheaper training
Inference Cost Savings
  • Multi-model endpoints: Share instances across models
  • Async endpoints: Scale to 0 when idle
  • Serverless inference: Pay only when invoked
  • Model compilation (Neo): 25% faster inference
  • Model quantization (INT8): Smaller model, less GPU needed
  • Inf2 instances: AWS Inferentia chips, 40% cheaper than GPUs
Instance Selection Cheat Sheet:
  • ml.g5.* — best price/performance for most LLM inference
  • ml.p4d.* — A100 GPUs for large model training
  • ml.inf2.* — AWS Inferentia for cost-optimized inference (40% cheaper)
  • ml.trn1.* — AWS Trainium for cost-optimized training (50% cheaper than p4d)
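The right-sizing math behind these strategies is worth sanity-checking by hand. A back-of-envelope sizing sketch (the peak factor, utilization target, and hourly rate are illustrative assumptions, not quoted AWS prices):

```python
import math

def monthly_inference_cost(requests_per_day: int, latency_s: float,
                           hourly_rate: float, peak_factor: float = 3.0,
                           target_utilization: float = 0.7) -> dict:
    """Back-of-envelope endpoint sizing: one request occupies one instance
    for latency_s seconds; capacity is derated to the utilization target."""
    avg_rps = requests_per_day / 86_400
    peak_rps = avg_rps * peak_factor
    per_instance_rps = target_utilization / latency_s
    instances = max(1, math.ceil(peak_rps / per_instance_rps))
    monthly_usd = instances * hourly_rate * 24 * 30
    return {"avg_rps": round(avg_rps, 1), "instances": instances,
            "monthly_usd": round(monthly_usd)}

# 1M req/day at ~100 ms/request on an instance costing ~$1.41/hr (illustrative)
print(monthly_inference_cost(1_000_000, 0.1, 1.41))
# -> roughly 5 instances and ~$5K/month under these assumptions
```

Plugging in real latency percentiles and current on-demand or Savings Plan rates turns this into a quick first-pass budget before running Inference Recommender.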

SageMaker Interview Questions

Q: "When would you host a model on SageMaker instead of using Bedrock?"
Answer: When you need (1) custom models not available on Bedrock (e.g., domain fine-tuned Llama with LoRA), (2) hardware control (specific GPU type, instance size), (3) cost optimization at high volume (self-hosting is 5-10x cheaper than per-token), (4) models with custom pre/post-processing, or (5) compliance requirements mandating dedicated infrastructure. In practice, most AI Hubs use both: Bedrock for quick multi-model access, SageMaker for custom workloads.
Q: "How do you handle model deployment without downtime?"
Answer: Use production variants for blue/green and canary deployments. Deploy new model as variant B with 5-10% traffic. Monitor latency, accuracy, and error rate via Data Capture + Model Monitor. If metrics are healthy after 1-2 hours, gradually shift traffic (20% → 50% → 100%). If metrics degrade, automatic rollback to variant A. For zero-downtime: use UpdateEndpoint with retain_all_variant_properties to swap models without endpoint recreation.
Q: "How would you design a cost-effective inference architecture for 1M requests/day?"
Answer: At 1M req/day (~12 req/sec avg, higher peak): (1) Use auto-scaling with target tracking on InvocationsPerInstance. (2) Use ml.g5.xlarge for GPU inference (best price/performance). (3) Enable model compilation with Neo for 25% speedup. (4) Consider Inf2 instances for 40% cost reduction. (5) Cache frequent queries in ElastiCache. (6) Use async endpoints for non-interactive workloads. (7) Savings Plans for baseline capacity, auto-scaling for peaks. Estimated cost: $3K-8K/month depending on model size.

Lambda + Step Functions

AI Orchestration Patterns

Step Functions State Machine
│
├─ Step 1: Parse Query → Lambda (input validation)
├─ Step 2: Retrieve Context → SageMaker (embedding)
├─ Step 3: Rank Results → Lambda (cross-encoder re-ranking)
├─ Step 4: Generate Response → Bedrock Agent
├─ Step 5: Validate Output → Bedrock Guardrails
├─ Step 6: Cache → DynamoDB write
└─ Step 7: Return (success/failure path)

Why Step Functions over Lambda-only?

Lambda-Only Problems
  • Orchestration mixed with business logic
  • Hard to debug nested if/else
  • Timeout risks (15min max)
  • State loss on failure
Step Functions Benefits
  • Visual workflow with explicit state
  • Built-in retry/error handling
  • Long-running workflows (up to 1 year)
  • Human approval nodes
  • Cross-account execution

Cost Breakdown

Service | Cost
Bedrock call | $0.001-0.005 per request
SageMaker endpoint | $0.10-0.50 per hour (on-demand)
Step Functions | $0.000025 per state transition (negligible)
Lambda | $0.20 per 1M invocations

📊 Vector Database Selection

Factor | OpenSearch | Aurora pgvector | DynamoDB
Max Vectors | Billions | Millions | Millions
Latency P99 | 200-500ms | 50-200ms | 10-50ms
Hybrid Search | Built-in (BM25 + vec) | Separate FTS needed | Not supported
Cost (1M vectors) | $2-5/day | $0.5-1/day | $1-3/day
Best For | Large-scale, analytics | Relational + vectors | DynamoDB-native apps
Operational | Cluster management | Managed | Fully managed

Recommendation by Scale

  • <100K vectors: Aurora pgvector (simplicity, lowest cost)
  • 100K-10M vectors: Aurora pgvector or OpenSearch (trade-off: cost vs features)
  • >10M vectors: OpenSearch (scale) + DynamoDB for metadata
  • Hybrid search required: OpenSearch (only option with built-in BM25 + vector)
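When OpenSearch is chosen for hybrid search, one common approach is a bool query that sums BM25 and k-NN scores (newer OpenSearch versions also offer a dedicated `hybrid` query with score normalization via a search pipeline). A sketch that only builds the request body; the `text` and `embedding` field names are assumptions:

```python
def hybrid_query(query_text: str, query_vector: list, k: int = 10,
                 text_boost: float = 1.0, vector_boost: float = 1.0) -> dict:
    """OpenSearch request body combining BM25 (match) and k-NN scores.
    Scores from the two should-clauses are summed, so the boosts need
    tuning per corpus; field names are illustrative."""
    return {
        "size": k,
        "query": {
            "bool": {
                "should": [
                    {"match": {"text": {"query": query_text,
                                        "boost": text_boost}}},
                    {"knn": {"embedding": {"vector": query_vector, "k": k,
                                           "boost": vector_boost}}},
                ]
            }
        },
    }

# client.search(index="docs", body=hybrid_query("margin requirements", vec))
```

BM25 and vector scores live on different scales, which is why the boosts (or a normalization pipeline) matter in practice.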

🗃 Data Foundation: S3, Glue, Lake Formation

Data Lake Architecture for AI

Data Sources (APIs, databases, IoT, logs)
│
├─ AWS Glue (ETL/ELT)
│
├─ S3 Data Lake (Bronze / Silver / Gold)
│   ├─ /bronze (raw ingested data)
│   ├─ /silver (cleaned, normalized)
│   └─ /gold (analytics-ready, aggregated)
│
├─ Lake Formation (security, governance, ABAC tagging)
│
└─ Consumption
    ├─ Athena (SQL queries)
    ├─ SageMaker (ML training)
    ├─ Bedrock (RAG ingestion)
    └─ Analytics (BI tools)

AI-Specific: Chunking & Embedding Pipeline

# Glue ETL job for RAG document processing (illustrative sketch).
# Glue has no native PDF reader: assume text was pre-extracted (e.g., via
# Textract) and landed as JSON records with "id" and "content" fields.
documents = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://data/documents/"]},
    format="json"
)

def chunk_document(record):
    # semantic_chunk() and embed() are user-defined helpers; note that
    # one-to-many chunking really needs a Spark DataFrame explode --
    # DynamicFrame.map() is one-record-in, one-record-out.
    text = record["content"]
    chunks = semantic_chunk(text, chunk_size=512)
    return [
        {"doc_id": record["id"], "chunk_id": i,
         "text": chunk, "embedding": embed(chunk)}
        for i, chunk in enumerate(chunks)
    ]

embeddings_frame = documents.map(f=chunk_document)
glue_context.write_dynamic_frame.from_options(
    frame=embeddings_frame,
    connection_type="s3",
    connection_options={"path": "s3://embeddings-bucket/"},
    format="parquet"
)

Lake Formation Governance for AI

Data Catalog:
├─ Dataset: customer_data
│   ├─ PII columns: [email, ssn, phone] → REDACT tag
│   ├─ Sensitivity: HIGH → Limited access
│   └─ Lake Formation policy: Only data scientists
│
└─ Dataset: training_data
    ├─ Status: APPROVED_FOR_TRAINING
    ├─ Model lineage: tracked
    └─ Audit: enabled

🔗 API Gateway Patterns for AI

Three API Patterns

Synchronous

POST /v1/chat/completions

SLO: P50 <200ms, P99 <1s. For interactive chat.

Asynchronous

POST /v1/jobs → 202 Accepted

Webhook callback or polling. For batch/long-running.

Streaming (SSE)

POST /v1/chat/stream

Token-by-token via Server-Sent Events. Best UX for chat.

Streaming Implementation

# FastAPI example (Lambda function URL or ECS); ChatRequest is a Pydantic model
import json

import boto3
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
bedrock_client = boto3.client("bedrock-runtime")

@app.post("/chat/stream")
async def stream_chat(request: ChatRequest):
    async def generate():
        response = bedrock_client.invoke_model_with_response_stream(
            modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 1024,
                "messages": request.messages,
            })
        )
        for event in response["body"]:
            if "chunk" in event:
                chunk = json.loads(event["chunk"]["bytes"])
                yield f"data: {json.dumps(chunk)}\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")

🏢 Multi-Tenant, Multi-Account Architecture

Three Isolation Models

Model | Pattern | Isolation | Cost | Best For
Silo | Account-per-tenant | Maximum | Highest | Regulated BFSI, healthcare
Bridge | Schema-per-tenant | Good | Moderate | Most enterprise AI platforms
Pool | Row-level isolation | Basic | Lowest | SaaS with low sensitivity

Recommended: Hub-Spoke Architecture

Hub (Shared Services):
├─ Bedrock access layer (guardrails, models)
├─ SageMaker endpoints (shared fine-tuned models)
├─ S3 data lake (with ABAC tagging)
└─ OpenSearch (multi-tenant awareness)

Spoke (Per-Tenant):
├─ Lambda for tenant-specific logic
├─ DynamoDB for conversation history
├─ RDS for application data
├─ VPC with security group isolation
└─ IAM roles (assume role from spoke to hub)

Networking:
├─ Transit Gateway connects hubs and spokes
├─ Service control policies (spend limits per tenant)
└─ VPC endpoints for private API access

Isolation Enforcement

IAM Policy (ABAC)
{
    "Effect": "Allow",
    "Action": "bedrock:InvokeAgent",
    "Resource": "arn:aws:bedrock:*:*:agent/agent-123",
    "Condition": {
        "StringEquals": {
            "aws:PrincipalTag/tenant": "tenant-a"
        }
    }
}
Application-Layer (Lambda)
import json

# agent_runtime = boto3.client("bedrock-agent-runtime")

def handle_request(event):
    tenant_id = event['requestContext']['authorizer']['claims']['tenant']
    body = json.loads(event.get('body') or '{}')  # proxy bodies arrive as strings
    # Validate the caller's claim matches the request payload
    if body.get('tenant_id') != tenant_id:
        raise UnauthorizedException("Tenant mismatch")
    # Inject tenant context into all downstream calls
    return agent_runtime.invoke_agent(
        agentId='agent-123',
        agentAliasId='alias-1',
        sessionId=body['session_id'],
        inputText=body['query'],
        sessionState={'sessionAttributes': {'tenant_id': tenant_id}}
    )
Database-Level (RLS)
-- Row-level security: enable RLS on the table, then scope every query by tenant
ALTER TABLE conversations ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON conversations
    USING (tenant_id = current_setting('app.current_tenant'));

🖥 Distributed Systems Patterns

Consistency Models for AI Platforms

Component | Model | Why
Model Registry | Strong (ACID) | Must be correct for deployment
Conversation History | Causal | Message ordering matters
Embedding Index | Eventual | Stale embeddings are acceptable briefly
User Preferences | Eventual | Can be cached, synced in the background
Fine-tuning Dataset | Strong | Must be exact for reproducibility

Event-Driven Architecture

Event Stream (Kinesis / EventBridge)
├─ Event 1: UserQuerySubmitted
├─ Event 2: DocumentsRetrieved
├─ Event 3: ModelInvoked
├─ Event 4: GuardrailsApplied
└─ Event 5: ResponseCached

Consumers (fan-out): Monitoring • Cache • Audit Log

Benefits: Complete audit trail, event replay, multiple consumers
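Each event in the stream can be published as an EventBridge entry. A minimal sketch of the envelope (the bus and source names are placeholders; the actual send via boto3's `events` client is left commented out):

```python
import json
from datetime import datetime, timezone

def make_event(detail_type: str, detail: dict,
               source: str = "ai-hub.orchestrator",
               bus_name: str = "ai-platform-bus") -> dict:
    """Build one EventBridge PutEvents entry; bus/source names are illustrative."""
    return {
        "EventBusName": bus_name,
        "Source": source,
        "DetailType": detail_type,
        "Detail": json.dumps({
            **detail,
            "emitted_at": datetime.now(timezone.utc).isoformat(),
        }),
    }

entry = make_event("ModelInvoked", {"model_id": "claude-3-5-sonnet",
                                    "tenant_id": "tenant-a", "tokens": 812})
# boto3.client("events").put_events(Entries=[entry])
```

Because `Detail` is a JSON string, any consumer (monitoring, cache, audit log) can subscribe with an EventBridge rule matching `Source` and `DetailType`.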

Saga Pattern for AI Workflows

Choreography (EventBridge)

Services emit events, others listen. Decoupled, scalable. Best for loosely-coupled microservices.

Orchestration (Step Functions)

Central coordinator manages workflow with explicit error handling and compensating transactions. Best for AI workflows.

Kubernetes / EKS for ML

EKS Capabilities (2025-2026)

  • Ultra-Scale: Up to 100,000 nodes per cluster (10x increase), supporting 1.6M Trainium chips or 800K GPUs
  • Journal-backed etcd: Replaces Raft consensus for better performance at scale
  • Capabilities Management Suite: Argo CD (GitOps), AWS Controllers for K8s (ACK), KRO
  • Karpenter: Intelligent auto-provisioning with spot instances (70% discount)

ML Workload Patterns

Pattern 1: Distributed Training
# K8s manifest for multi-GPU training workers (a Job or Kubeflow
# PyTorchJob is more typical for training; Deployment shown for brevity)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: distributed-trainer
spec:
  replicas: 4
  selector:
    matchLabels:
      app: trainer
  template:
    metadata:
      labels:
        app: trainer
    spec:
      containers:
      - name: trainer
        image: my-registry/trainer:latest  # illustrative image
        resources:
          limits:
            nvidia.com/gpu: 8
        env:
        - name: MASTER_ADDR
          value: master-service.default
Pattern 2: Inference Autoscaling (HPA)

Scale inference pods based on CPU/memory/custom metrics. Set minReplicas: 2 for availability, maxReplicas: 50 for burst, target 70% CPU utilization.
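The scaling behavior described above follows the HPA's target-tracking formula, desired = ceil(current × currentMetric / targetMetric), clamped to the replica bounds. A minimal sketch of that calculation:

```python
import math

def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float, min_replicas: int = 2,
                         max_replicas: int = 50) -> int:
    """Kubernetes HPA core formula:
    desired = ceil(current * currentMetric / targetMetric), clamped to bounds."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# 10 pods at 91% CPU against a 70% target -> scale out to 13
print(hpa_desired_replicas(10, 91, 70))
```

The real controller adds tolerances and stabilization windows on top of this, but the core proportional step is exactly this ratio.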

Pattern 3: GPU Node Affinity + Spot

Use node affinity to schedule GPU workloads on nvidia-a100 nodes. Combine with Karpenter spot instances for 70% cost reduction. Handle spot interruptions with graceful migration.

🔌 API Design for AI Services

Three Patterns Compared

Pattern | Endpoint | Response | SLO | Use Case
Sync | POST /v1/chat/completions | 200 + JSON body | P50 <200ms | Interactive chat
Async | POST /v1/jobs | 202 + job_id | Minutes | Long-running RAG, batch
Streaming | POST /v1/chat/stream | SSE (text/event-stream) | TTFB <500ms | Token-by-token chat UX

Versioning Strategy

Recommended: URL path versioning (/v1/models, /v2/models) — clear, cacheable, and widely adopted. Use feature flags for gradual migration between versions.

📦 SDK Development

Python SDK Structure

my-ai-sdk/
├─ my_ai_sdk/
│   ├─ client.py (main entry point)
│   ├─ models.py (Pydantic request/response)
│   ├─ exceptions.py (custom exceptions)
│   ├─ resources/
│   │   ├─ chat.py (/chat endpoints)
│   │   ├─ models.py (/models endpoints)
│   │   └─ agents.py (/agents endpoints)
│   └─ utils/
│       ├─ auth.py (auth handling)
│       ├─ retry.py (retry logic)
│       ├─ streaming.py (SSE handling)
│       └─ observability.py (OpenTelemetry)
├─ tests/
└─ setup.py

Key SDK Patterns

Authentication

API key from constructor or AI_API_KEY env var. Bearer token in session headers. User-Agent with SDK version.

Error Handling

Hierarchy: AIErrorAuthenticationError, RateLimitError, ValidationError. Include retry_after for rate limits.
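The hierarchy described here might be sketched as follows (class names match the text; carrying `retry_after` on the rate-limit error is one reasonable convention, not a fixed API):

```python
class AIError(Exception):
    """Base exception for all SDK errors."""

class AuthenticationError(AIError):
    """Invalid or missing API key."""

class ValidationError(AIError):
    """Request failed schema validation."""

class RateLimitError(AIError):
    """429 from the API; carries the server's suggested back-off."""
    def __init__(self, message, retry_after=None):
        super().__init__(message)
        self.retry_after = retry_after  # seconds, from the Retry-After header

# Callers catch the base class and branch on specifics:
try:
    raise RateLimitError("429 Too Many Requests", retry_after=2.5)
except AIError as exc:
    wait_s = getattr(exc, "retry_after", None)
```

A single base class lets integrators write one catch-all handler while still enabling targeted retry logic for rate limits.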

Observability

OpenTelemetry tracing + metrics. Track request duration, token counts, model used. Spans for each API call.

Backward Compatibility

Deprecation warnings for renamed params. Ignore unknown params with warning. Semantic versioning.

🔒 IAM / RBAC / ABAC

Role Hierarchy for AI Platform

Platform Admin
│  Create/destroy environments, manage IAM, view cost
│
Team Lead / ML Architect
│  Deploy models, configure endpoints, view team metrics
│
Data Scientist
│  Read training data, train models, log metrics (NO prod deploy)
│
ML Engineer
│  Invoke models in dev/staging, deploy with approval, monitor
│
Auditor
   Read all logs, cannot modify, compliance reporting

ABAC vs RBAC

RBAC (Role-Based)

Fixed roles (scientist, engineer). Simple but rigid. Requires new roles for new scenarios. Good for small teams.

ABAC (Attribute-Based)

Evaluate multiple attributes in real-time (environment, team, IP, time). Dynamic, scales better. Ideal for multi-tenant AI platforms.

Cross-Account Access Pattern

User Account
│
└─ AssumeRole (with external_id)
   │
   ▼
AI Platform Account
├─ Role: PlatformRole
│   └─ Trust policy: Allow User Account
├─ Bedrock
└─ SageMaker

🛡 AI Guardrails Implementation

Three-Layer Safety Architecture

Layer 1: Input
  • Prompt injection detection (regex + ML)
  • PII redaction (SSN, credit card, email)
  • Jailbreak detection
  • Custom keyword filters

Layer 2: Execution
  • Rate limiting (prevent abuse)
  • Token limits (prevent exhaustion)
  • Timeout enforcement
  • Sandboxed execution

Layer 3: Output
  • Harmful content detection (ML classifiers)
  • Hallucination scoring (factuality check)
  • Sensitive data redaction
  • Citation validation

Bedrock Guardrails Performance

  • AWS-reported benchmarks: blocks up to 88% of harmful content
  • Contextual grounding checks identify correct (grounded) responses with ~99% accuracy, per AWS
  • Configurable thresholds per category (violence, hate speech, sexual content)
  • PII patterns: credit card (BLOCK), SSN (REDACT), email (REDACT)

Custom Guardrail (Lambda)

import re
from typing import Dict, Tuple

class CustomGuardrails:
    # Illustrative PII patterns; production sets are broader (Luhn check, etc.)
    PII_PATTERNS = {
        "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
        "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    }

    def filter_input(self, user_input: str) -> Tuple[str, bool]:
        """Filter input for PII. Returns (filtered_text, is_safe)."""
        filtered = user_input
        for pii_type, pattern in self.PII_PATTERNS.items():
            if not pattern.search(filtered):
                continue
            if pii_type == "credit_card":
                return "", False  # BLOCK the request entirely
            filtered = pattern.sub(f"[REDACTED_{pii_type}]", filtered)
        return filtered, True

    def filter_output(self, output: str, context: Dict) -> Tuple[str, bool]:
        """Validate output: grounding check + toxicity scoring
        (_is_grounded_in_context / _score_toxicity implemented elsewhere)."""
        if not self._is_grounded_in_context(output, context):
            return output, False  # Block ungrounded responses
        if self._score_toxicity(output) > 0.9:
            return "[Content blocked - unsafe]", False
        return output, True

🔐 Data Security for LLM Workloads

Encryption Strategy

Layer | Method | Details
In Transit | TLS 1.2+, mTLS | All network communication encrypted. mTLS for service-to-service.
At Rest | KMS encryption | S3 buckets, RDS, EBS, DynamoDB all KMS-encrypted.
Model Weights | Encrypted artifacts | Decrypted only in secure enclave. Access logs for who deployed what.
Fine-tuning Data | Time-bound access | Separate encrypted bucket. Deleted after training completes.

Data Sensitivity Classification

PUBLIC       → Freely accessible
INTERNAL     → Within company only
CONFIDENTIAL → Limited team access
RESTRICTED   → Executive approval needed
PII          → Individual privacy (GDPR/CCPA)
PHI          → Health information (HIPAA)
PCI          → Payment information (PCI-DSS)

🏦 BFSI Compliance

Key Regulations

Regulation | Region | Key Requirements
GDPR | EU | Right to explanation, data portability, privacy by design
CCPA | California | Data subject rights, opt-out mechanisms
PCI-DSS | Global | Payment card data security standards
SOX | US | Financial reporting controls and audit trails
GLBA | US | Financial privacy protections
EU AI Act | EU | Risk-based AI governance framework

Model Governance Workflow

1. Development
   ├─ Build model, test for bias/fairness, document methodology
   │
2. Compliance Review
   ├─ Explainability check, regulatory alignment
   ├─ Risk assessment, 3-lines-of-defense review
   │
3. Approval Gate
   ├─ Business owner sign-off
   ├─ Compliance approval
   ├─ Risk committee review (if high risk)
   ├─ Legal sign-off
   │
4. Production Monitoring
   ├─ Bias drift detection
   ├─ Performance monitoring
   ├─ Usage auditing
   └─ Annual recertification

Compliance Architecture

Control | Implementation
Model Risk Management | Risk classification (low/med/high), model inventory & versioning, approval workflows
Data Governance | Data lineage tracking, quality monitoring, retention policies
Explainability | Decision explanations (why approved/denied?), audit logs, monitoring dashboards
Fairness Testing | Demographic parity analysis (<5% variance), bias detection across protected groups
Adversarial Testing | Prompt injection tests, jailbreak attempts, robustness validation
Audit Logging | Immutable S3 logs (Glacier, 7-year retention), KMS encryption, every model decision logged
Interview Tip: In BFSI, always mention the "3 lines of defense" model: (1) Business units own risk, (2) Compliance/Risk teams provide oversight, (3) Internal audit provides independent assurance. Show you understand regulated industry governance.
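The immutable-audit-log control pairs well with hash chaining, so tampering is detectable even before records reach Glacier. A stdlib sketch (writing to S3 with Object Lock is out of scope here; the record fields are illustrative):

```python
import hashlib
import json

def append_audit_record(log: list, decision: dict) -> dict:
    """Append a model-decision record whose hash chains to the previous one,
    making after-the-fact edits detectable (S3 Object Lock / Glacier handles
    the actual immutable storage)."""
    prev_hash = log[-1]["record_hash"] if log else "0" * 64
    body = {"decision": decision, "prev_hash": prev_hash}
    record_hash = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    record = {**body, "record_hash": record_hash}
    log.append(record)
    return record

def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited or reordered record breaks the chain."""
    prev = "0" * 64
    for rec in log:
        body = {"decision": rec["decision"], "prev_hash": rec["prev_hash"]}
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if rec["prev_hash"] != prev or rec["record_hash"] != expected:
            return False
        prev = rec["record_hash"]
    return True

log = []
append_audit_record(log, {"model": "credit-risk-v3", "outcome": "DENY"})
append_audit_record(log, {"model": "credit-risk-v3", "outcome": "APPROVE"})
assert verify_chain(log)
```

An auditor can then re-verify the full 7-year chain independently, which is a concrete answer to "how do you prove the log wasn't altered?"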

💡 Behavioral & Technical Leadership

Ownership Mindset — Platform Builder, Not Project Executor

This role expects you to think like a product owner, not just an engineer fulfilling tickets. Demonstrate this by talking about:

  • Vision: How your platform decisions enable business outcomes (cost reduction, time-to-market, compliance)
  • Trade-offs: Conscious build-vs-buy decisions with clear reasoning (not just "we used AWS because it was easy")
  • Iteration: How you evolved the platform based on real usage data, not just upfront design
  • Team enablement: How you made it easy for other teams to onboard (self-service, documentation, SDKs)

Translating AI Systems to Business Value

Technical to Business
  • "RAG reduces hallucination" → "Customers get accurate answers, reducing support tickets by 30%"
  • "Multi-tenant isolation" → "Each business unit's data is secure, enabling regulatory compliance"
  • "Canary deployment" → "We catch issues before they impact all customers"
Metrics That Matter
  • Time to onboard a new AI use case
  • Cost per AI inference request
  • Model accuracy / hallucination rate
  • Platform uptime and P99 latency
  • Number of teams self-servicing on platform

Building & Scaling Engineering Teams

Be ready to discuss how you've structured and grown teams. Key points to cover:

  • Team Topology: Platform team (infra + shared services) + Feature teams (use-case specific) + ML/AI team (model development)
  • Hiring: What you look for in AI platform engineers (distributed systems + ML interest, not just ML PhDs)
  • Culture: Blameless post-mortems, documentation as a first-class citizen, inner-source contributions
  • Scaling: From 3-person team to 15+ — how you organized, delegated, and maintained quality

Stakeholder Communication

Audience | What They Care About | How to Communicate
C-Suite | ROI, competitive advantage, risk | Business metrics, cost savings, risk mitigation
Product Managers | Features, timelines, capabilities | Roadmap, what's possible vs what's not, trade-offs
Engineering Leads | Architecture, reliability, tech debt | Architecture diagrams, SLOs, migration plans
Compliance/Legal | Data privacy, audit trails, regulations | Compliance matrices, governance workflows, audit reports

💬 Common Interview Questions

Q1: "Design a production-grade RAG system for a financial services company"
Answer Framework: Start with requirements (accuracy >95%, compliance, multi-tenant). Architecture: Hierarchical chunking → hybrid search (OpenSearch BM25 + vector) → cross-encoder re-ranking → Bedrock with guardrails (PII redaction, hallucination detection) → immutable audit logging (S3 Glacier). Explain why Bedrock Knowledge Bases are insufficient for BFSI (no custom re-ranking, limited chunking). Mention RAGAS evaluation with >0.8 on faithfulness, relevance, and precision metrics.
Q2: "How would you handle multi-tenancy in an AI platform?"
Answer Framework: Hub-spoke architecture. Hub: shared Bedrock, SageMaker, OpenSearch. Spoke: per-tenant Lambda, DynamoDB, VPC. Three-layer isolation: IAM (ABAC with tenant tags), Application (tenant_id validation in every Lambda), Database (RLS policies). For BFSI, lean toward silo model for highest-sensitivity tenants, bridge model for others. Transit Gateway for networking, SCPs for cost control.
Q3: "Our LLM chatbot is hallucinating too much. What would you do?"
Answer Framework: Systematic debugging: (1) Measure current hallucination rate with RAGAS faithfulness metric. (2) Check RAG quality — is retrieval returning relevant docs? Measure Precision@K. (3) If retrieval is poor: add re-ranking, switch to hybrid search, improve chunking. (4) If retrieval is good but generation hallucinates: add output guardrails (grounding check), consider fine-tuning on domain data, reduce temperature. (5) Add monitoring: track hallucination rate per query type, set up alerts.
Q4: "Compare LangChain, LangGraph, and Bedrock Agents for production"
Answer Framework: LangChain: good for simple linear chains (chat, Q&A), wide ecosystem but no native state management. LangGraph: best for complex agents with loops, retries, human-in-the-loop — explicit state graph enables auditing and recovery. Bedrock Agents: managed, fast to set up, but limited to 15 iterations, less customizable. For production in regulated industries, LangGraph on ECS with custom state persistence in DynamoDB — gives full control and auditability.
Q5: "Design the monitoring and alerting for an AI platform"
Answer Framework: Four drift types to monitor: input drift (embedding distribution), output drift (token length, lexical diversity), data quality drift (embedding shifts), business metric drift (user satisfaction). Stack: CloudWatch for infrastructure, Evidently AI for drift detection, custom dashboards for token economics and cost per request. Automatic alerts: hallucination rate >5%, P95 latency >2x baseline, NDCG drop >20%. Include cost monitoring: tokens/query, cache hit rate, model cost breakdown.
Q6: "Walk me through building an AI platform in a regulated BFSI environment"
Answer Framework: Start with governance: 3-lines-of-defense model, model risk classification. Architecture: multi-account isolation (silo for sensitive, bridge for others). Every model decision logged immutably (S3 Glacier, 7-year retention). Fairness testing before deployment (demographic parity <5% variance). Explainability reports for credit decisions (GDPR right to explanation). Adversarial testing (prompt injection, jailbreak). Annual recertification of all models. Show the compliance workflow: Dev → Compliance Review → Approval Gate → Production Monitoring.
Q7: "How do you decide between building custom vs using managed AWS services?"
Answer Framework: Decision tree: (1) Is this a core differentiator? If yes, build. (2) Does the managed service meet accuracy/performance requirements? (3) Is customization needed beyond what the managed service offers? (4) What's the ops cost of self-hosting? Example: Bedrock Knowledge Bases for POC, custom RAG pipeline for production. SageMaker for training, custom endpoints on EKS for specialized inference. Always prototype with managed, migrate to custom when requirements demand it.

📜 Infrastructure as Code (Terraform / CloudFormation)

Terraform vs CloudFormation

Dimension | Terraform | CloudFormation
Language | HCL (HashiCorp Configuration Language) | JSON / YAML
Multi-Cloud | Yes (AWS, GCP, Azure, etc.) | AWS only
State Management | Remote state (S3 + DynamoDB lock) | Managed by AWS
Drift Detection | terraform plan | Stack drift detection
Modularity | Modules (reusable, versioned) | Nested stacks / CDK constructs
Best For | Multi-cloud, complex infra | AWS-only, tight integration

AI Platform IaC Patterns

Modular Structure
modules/
├─ vpc/        (networking)
├─ eks/        (K8s cluster)
├─ bedrock/    (model access)
├─ opensearch/ (vector store)
├─ sagemaker/  (endpoints)
└─ monitoring/ (CloudWatch)
Environment Promotion
environments/
├─ dev/     (terraform.tfvars)
├─ staging/ (terraform.tfvars)
└─ prod/    (terraform.tfvars)

Same modules, different configs. CI/CD: plan → review → apply

Prepared for Suman • AI Architect Interview Prep • March 2026

Covers: AI Hub Architecture • GenAI (LLMs, RAG, Agents) • AWS (Bedrock, SageMaker, EKS) • Platform Engineering • Governance & BFSI Compliance • Leadership