Building an AutoML Pipeline for Vector Data in Azure SQL

Description

This session demonstrates how to build an AutoML-driven pipeline inside Azure SQL to process and optimize vector data. Learn how to automatically profile datasets, select the best embedding model and parameters, and improve retrieval accuracy using adaptive evaluation and feedback loops.

Slides


Building an AutoML Pipeline for Vector Data in Azure SQL
Ömer Çolakoğlu,
Microsoft Data Platform MVP
Founder, FEYOTECH
omer@feyotech.com
https://www.linkedin.com/in/omercolakoglu/
What We Will Cover Today
The Problem
Why is choosing the right embedding model so hard?
Architecture
How AutoEmbedding works end-to-end
SQL Server 2025
VECTOR type & sp_invoke_external_rest_endpoint
Data Profiling
10 domain types: review, ecommerce, book, knowledge_base...
Model Evaluation
Accuracy x Latency x Cost x RAG Quality weighted composite score
New Features
Smart Chunking, RAG Pipeline & PDF Import
RAG & Results
RAG evaluation results & model recommendations
We are happy!
SQL Server 2025 has been released.
The native vector type is built in.
We have an AI-ready enterprise database platform.
We can embed data with any LLM embedding model through sp_invoke_external_rest_endpoint.
We can use DiskANN and search incredibly fast.
We can search data semantically, in any language, without translation.
A vector stores the meaning of data as a series of float values. These values are not meaningful to humans, but they are very meaningful to computers.
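To make the "float values that are meaningful to computers" idea concrete, here is a minimal sketch of cosine similarity, one common way to compare two embeddings. The three vectors are made-up 3-dimensional toy values, not output from any real embedding model, and the function name is only illustrative.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_a * norm_b)


# Toy "embeddings" (assumed values for illustration only).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Related concepts should score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))
```

Real embeddings have hundreds or thousands of dimensions, but the comparison works the same way.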
Questions in our brains???
There are many data types and many LLM embedding models. So which model is the best for our scenario?
One of the most important features of the vector data type is its dimension. Larger dimensions are more expensive. But does a larger dimension mean more accurate results?
A vector is a universal format for data, like body language. So we can search data as vectors in any language and get the same result. Does it really work?
We also have large text data, and we have to chunk it. We also have data that is only slightly larger than our maximum chunk size. Chunking or truncating?
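One possible way to frame the "chunking or truncating" question is a size-based rule: keep short texts whole, truncate texts that are only slightly over the limit, and split everything else into overlapping chunks. The sketch below is an assumption for illustration, not the pipeline's actual logic: it counts whitespace-separated words instead of model tokens, and the `tolerance` parameter and both function names are hypothetical.

```python
def chunk_words(words, max_size, overlap):
    """Split a word list into fixed-size chunks that share `overlap` words."""
    if not 0 <= overlap < max_size:
        raise ValueError("overlap must be >= 0 and smaller than max_size")
    step = max_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + max_size])
        if start + max_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def chunk_or_truncate(text, max_size=200, overlap=20, tolerance=0.1):
    """Return a list of text pieces, each at most `max_size` words long."""
    words = text.split()
    if len(words) <= max_size:
        return [" ".join(words)]              # fits as-is
    if len(words) <= max_size * (1 + tolerance):
        return [" ".join(words[:max_size])]   # slightly too long: truncate
    return [" ".join(c) for c in chunk_words(words, max_size, overlap)]


# 210 words is within 10% of the 200-word limit, so the text is truncated.
print(len(chunk_or_truncate(" ".join(["w"] * 210))))
```

Whether truncating is acceptable depends on how much meaning sits at the end of the text, which is exactly what an experiment can measure.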
SQL Server 2025 has a limit of 1998 dimensions for the vector type because of the 8 KB page size.
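The arithmetic behind the 1998 limit can be sketched as follows. The constants are assumptions chosen to illustrate the calculation: 4 bytes per single-precision value, an 8-byte header per vector, and 8000 usable bytes for one value on an 8 KB page. Check the official documentation before relying on the exact layout.

```python
BYTES_PER_FLOAT32 = 4   # single-precision float
HEADER_BYTES = 8        # assumed per-vector header size
USABLE_BYTES = 8000     # assumed usable bytes for one value on an 8 KB page


def vector_bytes(dimensions):
    """Approximate storage for one vector with the given dimension count."""
    return HEADER_BYTES + dimensions * BYTES_PER_FLOAT32


def max_dimensions():
    """Largest dimension count that still fits under the assumed limit."""
    return (USABLE_BYTES - HEADER_BYTES) // BYTES_PER_FLOAT32


def fits(dimensions):
    return vector_bytes(dimensions) <= USABLE_BYTES


print(max_dimensions())
for dims in (768, 1536, 3072):
    print(dims, vector_bytes(dims), fits(dims))
```

Under these assumptions a 3072-dimension embedding does not fit, so such a model would need a reduced output dimension or a different storage approach.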
The DiskANN vector index is really fast, but it is read-only. So, what should we do about changed data?
Why Is Embedding Model Selection So Hard?
Real-World Challenges
• 30+ available models with different dimensions (768-3072)
• Performance varies dramatically per domain
• Cost: $0.00002 to $0.001 per 1K tokens
• Latency: 50ms to 2000ms per request
• Azure vs OpenAI vs Voyage vs Jina vs Ollama...
• SQL Server integration requires custom code each time
How It's Done Today
• Manual benchmarking takes days
• Repeat for every new model release
• Excel spreadsheets for comparison
• Which metric matters most? How to weight?
• SQL integration is always custom per project
Customer reviews != Product catalog
So, we need an intelligent pipeline.
How Does AutoEmbedding Work?
End-to-End Pipeline

  1. Datasource Management
    • SQL Server connection / PDF import
    • Datasource CRUD (add, edit, test)
  2. Data Profile
    • Language detection and distribution
    • Type classification (10 domains, score-based)
    • Text statistics (length, duplicate rate)
    • Category extraction
    • Column profiling (Text, Filter, Date, ID)
    • Noise detection and cleaning (HTML, noise)
    • Recommended models and strategy
    • Text structure optimization (field detection, LLM suggestion)
    • VectorStr formula definition
  3. Question Management
    • Multilingual question pool (TR, EN, DE, FR)
    • Automatic question generation via LLM
    • Expected keywords / ground truth
  4. Experiment
    • Model selection (multi-select)
    • Strategy selection (recommended by data type)
    • Weight settings (accuracy, latency, cost, RAG quality)
    • Sampling count
    • Chunk settings (size, overlap, semantic/fixed)
    • RAG settings (enable/disable, top-K, max queries)
    • Live progress tracking + stop control
  5. Analysis
    • Model metrics (overall score, accuracy, latency, cost, memory)
    • Consistency test results (similar/different pair analysis)
    • Multilingual test (language pair similarity)
    • Retrieval test (precision, top-K ranking)
    • RAG quality details (context relevance, faithfulness, completeness)
    • Latency performance indicator
    • Full model comparison table
    • Strategy simulation (change weights → see score impact)
    • Live Query Test (ad-hoc search, LLM evaluation)
  6. Results and Evaluation
    • Winning model and score summary
    • AI recommendation (LLM explanation)
    • Key decision factors
    • Model ranking table
    • Vectorizing all data with the selected model
  7. Tools
    • A/B Experiment comparison
    • Experiment history
    • Activity log (audit trail)
    • SQL Server status panel
    • Multilingual query test (cross-language analysis)
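The weight settings and strategy simulation in the pipeline above can be illustrated with a small weighted composite score. This is a sketch under stated assumptions, not the tool's actual formula: it uses min-max normalization, inverts latency and cost so that lower is better, and the two models and their metric values are invented for the example.

```python
# Direction of each metric: True means higher is better.
METRICS = {
    "accuracy": True,
    "rag_quality": True,
    "latency_ms": False,
    "cost_per_1k": False,
}


def normalize(values, higher_is_better):
    """Min-max scale to [0, 1]; flip the scale when lower is better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def composite_scores(models, weights):
    """Return {model_name: weighted score in [0, 1]}."""
    total = sum(weights.values())
    names = list(models)
    scores = {name: 0.0 for name in names}
    for metric, higher_is_better in METRICS.items():
        column = normalize([models[n][metric] for n in names], higher_is_better)
        for name, value in zip(names, column):
            scores[name] += weights[metric] / total * value
    return scores


# Hypothetical measurements for two hypothetical models.
models = {
    "model_a": {"accuracy": 0.90, "rag_quality": 0.85, "latency_ms": 800, "cost_per_1k": 0.001},
    "model_b": {"accuracy": 0.80, "rag_quality": 0.75, "latency_ms": 100, "cost_per_1k": 0.00002},
}

accuracy_first = {"accuracy": 0.6, "rag_quality": 0.2, "latency_ms": 0.1, "cost_per_1k": 0.1}
budget_first = {"accuracy": 0.1, "rag_quality": 0.1, "latency_ms": 0.4, "cost_per_1k": 0.4}

for label, weights in (("accuracy_first", accuracy_first), ("budget_first", budget_first)):
    scores = composite_scores(models, weights)
    print(label, max(scores, key=scores.get))
```

Changing only the weights flips the winner in this toy data, which is the effect the strategy simulation step is meant to expose.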
Data Source Management
Data Profiling-Automatic Domain Detection
[Figure: priority of accuracy, RAG quality, cost, and latency for each domain type]
Reviews
• Amazon Camera Reviews (Used in this Project)
• Amazon Book Reviews
• Yelp Restaurant Reviews
E Commerce
• Walmart Products (Used in this Project)
• Shopify Store Inventory
• eBay Listings
Book
• Big Data in Practice (Used in this Project)
• SQL Server 2025 Unveiled (Used in this Project)
Knowledge Base
• StackOverflow Q&A (Used in this Project)
• Wikipedia Articles
Code
• GitHub Repository Docs
• API Documentation
• Jupyter Notebook Cells
Medical
• PubMed Abstracts
• Clinical Trial Reports
• Drug Interaction Database
Scientific
• Nature Publications
• Research Datasets (UCI)
• Physics Textbook
Multilingual
• Multilingual Wikipedia
• Subtitle Databases
Data Profiling-Automatic Domain Detection | Create Data Profile
Type Classification, Category Distribution
Recommended Models, Query Pool
Data Profile & Search Strategy
Recommended Search Strategy
Column to Embed Analysis
Here we go. Experiment time.
The pipeline will find the best options for us.
Experiment
1. Model Selection
Raw Data
2. Sampling, Strategy, Rerank Parameters, Estimated Budget
3. Live Results
4. Detail Analysis
Experiment | Detail Analysis
1. Consistency Test - Similar Pairs
2. Multilingual Test - EN/TR Translation Pairs
3. Retrieval Test - Query and Top-K Results
Query Detail (Top-K Results)
4. All Models - Comparison
5. Strategy Simulation
6. Live Query Test
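The retrieval test in the detail analysis reports precision over the top-K results. A minimal sketch of that metric is shown below, assuming that ground truth is available as a set of relevant document IDs; the IDs and the function name are hypothetical, and the actual tool may score against expected keywords instead.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the first k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


# Ranked search results (best match first) and assumed ground truth.
retrieved = [3, 7, 1, 9, 4]
relevant = {1, 3, 8}

# Two of the first three results (3 and 1) are relevant.
print(precision_at_k(retrieved, relevant, k=3))
```

Averaging this value over the whole question pool gives one accuracy number per model that can be compared across experiments.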
Experiment
Summary & Result
Vectorize All Data
1. Source Selection
2. Model Selection
3. Embedding Progress
4. Embedded Data On SQL Server
5. Vector Search On SQL Server
What about other datasets?
[Figure: strategy priorities (accuracy, RAG quality, cost, latency) for each tested dataset]
Reviews: Amazon Camera Reviews
E Commerce: Walmart Products
Book: SQL Server 2025 Unveiled Book PDF
Knowledge Base: StackOverflow Q&A
Honestly, I have more than you might expect.
Ladies and gentlemen! Here is...
On Prem Notebook LM Based on SQL Server 2025 and AutoML Embedding Pipeline
Chat With Your Embedded PDF
Chat With 40 Million Rows of StackOverflow Data
Add Source
Summary, Mindmap, Info Cards
Thank you.
It's an honor to be here and talk to you.
Sound off.
The mic is all yours.
Influence the product roadmap.
Join the Fabric User Panel
Share your feedback directly with our Fabric product group and researchers.
https://aka.ms/JoinFabricUserPanel
Join the SQL User Panel
Influence our SQL roadmap and ensure it meets your real-life needs.
https://aka.ms/JoinSQLUserPanel