Building a GEN AI Healthcare Application with Databricks and Azure AI Foundry: Part 1 — Preparing Data

Amine Charot

--

The healthcare sector is rapidly evolving, with Artificial Intelligence (AI) becoming a cornerstone for innovation. Generative AI (GEN AI) applications, with their ability to synthesize data into actionable insights, are particularly promising for tackling complex healthcare challenges. This article, the first in a series, explores the process of preparing data and implementing technical solutions for a GEN AI application using Databricks and Azure AI Foundry, focusing on a healthcare use case that leverages PDF data.

Business Perspective

Generative AI is transforming the healthcare industry by enabling advanced data-driven applications, such as predictive diagnostics, personalized treatment planning, and operational optimizations. To realize these benefits, healthcare organizations must first address the challenge of managing and preparing their data — a task complicated by the sensitive, voluminous, and unstructured nature of healthcare data.

Using Databricks and Azure AI Foundry offers several business advantages:

  1. Scalability: Both platforms support vast amounts of data, enabling organizations to process everything from patient records to operational metrics efficiently.
  2. Compliance: Azure and Databricks provide built-in tools to comply with regulatory standards such as HIPAA and GDPR, ensuring patient data is handled securely.
  3. Cost Efficiency: The unified analytics offered by Databricks and the AI capabilities of Azure reduce the need for disparate tools, lowering overall costs.
  4. Enhanced Decision-Making: With GEN AI applications, healthcare providers can leverage insights derived from data to make timely, informed decisions that improve patient outcomes.

In this use case, the focus is on building an AI-powered solution to analyze and retrieve information from healthcare PDFs, a common data source in the industry.

Solution Architecture

The solution is built on the integration of Databricks and Azure AI Foundry, leveraging their unique capabilities to address the end-to-end data pipeline and AI modeling requirements. Below is an overview of the architecture:

  1. Data Ingestion Layer: Uses Databricks Auto Loader to monitor and load data incrementally from Azure Data Lake Storage.
  2. Storage Layer: Raw and processed data is stored in Delta Lake, providing a reliable foundation for downstream analytics.
  3. Data Processing Layer: Employs tools like LlamaIndex for chunking and parsing unstructured text data.
  4. AI Model Serving Layer: Embedding models, such as text-embedding-3-large, are deployed in Azure AI Foundry and integrated via Databricks Model Serving.
  5. Vector Search and Retrieval Layer: Databricks Vector Search indexes and queries vectorized embeddings for similarity-based document retrieval.

This architecture ensures scalability, reliability, and performance while maintaining compliance with healthcare data regulations.

Technical Perspective

The technical implementation involves several stages, each leveraging cutting-edge capabilities of Databricks and Azure AI Foundry.

Note: the code is available in my repository: charotAmine/healthcareAI-Databricks

Here’s a detailed breakdown:

Step 1: Loading Data with Databricks Auto Loader

Databricks Auto Loader simplifies the ingestion of large volumes of data by incrementally and efficiently loading files from cloud storage. In this case, data resides in Azure Data Lake Storage, which is mounted as a volume in Databricks.

  • Auto Loader Overview: Auto Loader supports schema inference and evolution, making it an ideal choice for processing semi-structured or unstructured data such as PDFs.
  • Integration with Azure Data Lake: Files stored in Azure are automatically monitored, and new files are ingested into Databricks in near real-time.
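
The ingestion code that follows assumes a catalog, a schema, and the mounted path containing the PDF files have already been defined. A minimal sketch of that setup, with placeholder names to adapt to your own workspace:

catalog_name = "healthcare_catalog"           # placeholder Unity Catalog catalog
schema_name = "gen_ai"                        # placeholder schema for raw and processed tables
data_directory_path = "/mnt/healthcare/pdfs"  # placeholder mount path holding the PDF files

# Create the schema if it does not exist yet
spark.sql(f"CREATE SCHEMA IF NOT EXISTS `{catalog_name}`.`{schema_name}`")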

Code:

spark.sql(f"USE CATALOG `{catalog_name}`")

df = (spark.readStream
    .format('cloudFiles')
    .option('cloudFiles.format', 'BINARYFILE')
    .load('dbfs:' + data_directory_path))

Step 2: Saving Raw Data as Delta Table

Delta Lake ensures data reliability and consistency by providing ACID transactions, schema enforcement, and versioning. The raw data ingested from Auto Loader is stored in Delta format, enabling efficient querying and downstream transformations.

Benefits of Delta Tables:

  • High performance for large-scale data operations.
  • Support for time travel and data lineage.
  • Integration with other Databricks features like MLflow and AutoML.

Code:

(df.writeStream
    .trigger(availableNow=True)
    .option("checkpointLocation", f'dbfs:{data_directory_path}/checkpoints/raw_data')
    .table(f'{schema_name}.pdf_raw')
    .awaitTermination())
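
As a quick illustration of the time travel mentioned above, once at least one write has committed you can inspect the table history and read an earlier version back (a sketch; the version number is illustrative):

# Show the commit history of the raw Delta table
spark.sql(f"DESCRIBE HISTORY {schema_name}.pdf_raw").show(truncate=False)

# Read the table as of an earlier version (time travel)
raw_v0 = spark.read.option("versionAsOf", 0).table(f"{schema_name}.pdf_raw")
print(raw_v0.count())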

Step 3: Data Processing and Chunk Parsing

Once the raw data is stored, it is processed and parsed into chunks suitable for AI embedding. This is achieved using LlamaIndex, a framework for creating and managing chunked document indices.

Why LlamaIndex over PyPDF? While PyPDF is effective for basic text extraction from PDFs, it lacks advanced features for processing and organizing unstructured text. LlamaIndex offers significant advantages:

  1. Context Preservation: LlamaIndex retains semantic coherence within chunks, making the extracted data more meaningful for embedding and retrieval tasks.
  2. Flexibility in Chunking: It provides advanced options for splitting documents into logical sections, paragraphs, or sentence-level chunks.
  3. Integration Capabilities: LlamaIndex seamlessly integrates with embedding pipelines, optimizing performance for RAG systems.

Why Chunking Matters: Chunking is a critical step in implementing Retrieval-Augmented Generation (RAG) systems. By breaking down large documents into smaller, semantically coherent chunks, RAG systems can:

  1. Improve the relevance of search and retrieval tasks.
  2. Optimize the embedding process by focusing on smaller, meaningful pieces of text.
  3. Reduce computational overhead during embedding and vector indexing.
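
To make the chunking behaviour concrete, here is a minimal, self-contained sketch of sentence-level chunking with LlamaIndex; the SentenceSplitter settings (chunk_size, chunk_overlap) are illustrative and should be tuned to your documents:

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

# Split text into ~512-token chunks with a 50-token overlap to preserve context across boundaries
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)

sample = Document(text="Medicare provides universal access to public hospitals. Private supplementary insurance covers extras such as dental and optical care.")
nodes = splitter.get_nodes_from_documents([sample])

for node in nodes:
    print(node.get_text())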

Implementation Steps:

  1. Extract text from PDF files and preprocess it.
  2. Split the text into manageable chunks using LlamaIndex.

Code:

import io

import pyspark.sql.functions as F
from PyPDF2 import PdfReader
from PyPDF2.errors import PdfReadError

# Core LlamaIndex components for wrapping text and splitting it into chunks
from llama_index.core import Document
from llama_index.core.node_parser import SimpleNodeParser

parser = SimpleNodeParser()  # SimpleNodeParser for chunking

# Define UDF to extract text from PDF bytes with error handling
@F.udf("string")
def extract_pdf_text(content):
    try:
        pdf = io.BytesIO(content)  # Create a file-like object from the raw bytes
        reader = PdfReader(pdf)    # Read the PDF
        text = ""
        for page in reader.pages:
            text += page.extract_text() or ""  # Extract text from each page (pages without text yield None)
        return text
    except PdfReadError as e:
        return f"Error extracting PDF text: {e}"
    except Exception:
        return "Error: Unsupported or corrupted PDF"

# Define UDF for chunking using LlamaIndex
@F.udf("array<string>")
def llama_index_chunk_udf(content):
    try:
        document = Document(text=content)                    # Wrap the content in a Document object
        nodes = parser.get_nodes_from_documents([document])  # Parse the content into nodes
        return [node.get_text() for node in nodes]
    except Exception as e:
        return [f"Error chunking content: {e}"]

Step 4: Embedding with Azure AI Foundry

Embedding transforms text data into vector representations that can be processed by AI models. For this pipeline, we use the text-embedding-3-large foundation model, a high-performing embedding model known for its efficiency in representing textual data. The model is deployed in Azure AI Foundry to serve as the core embedding mechanism.

Model Serving Overview: Model Serving in Databricks facilitates real-time and batch inference of machine learning models. By deploying the foundation model in Azure AI Foundry, the model leverages Azure’s scalability and security features. A gateway is established in Databricks to seamlessly interact with the deployed model, enabling efficient processing.

Steps to Deploy and Serve the Model:

  1. Deploy the Foundation Model: The text-embedding-3-large model is hosted on Azure AI Foundry to ensure robust performance and secure deployment.
  2. Create a Gateway: Databricks is configured to connect to Azure AI Foundry via a gateway that enables the embedding model to be accessed for inference tasks.
  3. Process Embeddings: Text chunks are sent through the model to generate embeddings for downstream tasks like vector search.
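
One common way to create the gateway mentioned in step 2 is to register the Azure-hosted model as an external model serving endpoint, so that Databricks can call it by name. A hedged sketch, assuming the endpoint is named embedding_aifoundry and the Azure OpenAI resource details (placeholders below) are stored in a Databricks secret scope:

import mlflow.deployments

client = mlflow.deployments.get_deploy_client("databricks")

# Register the Azure-hosted embedding model as an external model endpoint.
# The resource URL, deployment name, API version, and secret scope are placeholders.
client.create_endpoint(
    name="embedding_aifoundry",
    config={
        "served_entities": [{
            "name": "text-embedding-3-large",
            "external_model": {
                "name": "text-embedding-3-large",
                "provider": "openai",
                "task": "llm/v1/embeddings",
                "openai_config": {
                    "openai_api_type": "azure",
                    "openai_api_base": "https://<your-resource>.openai.azure.com/",
                    "openai_deployment_name": "text-embedding-3-large",
                    "openai_api_version": "2024-02-01",
                    "openai_api_key": "{{secrets/<scope>/<azure-openai-key>}}",
                },
            },
        }]
    },
)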

Code:

# Define UDF for Azure OpenAI embeddings
@F.udf("array<float>")
def azure_openai_embed_udf(content):
    try:
        if not content or not content.strip():
            raise ValueError("Empty or invalid content for embedding")

        import mlflow.deployments
        deploy_client = mlflow.deployments.get_deploy_client("databricks")
        response = deploy_client.predict(endpoint="embedding_aifoundry", inputs={"input": content})

        return response.data[0]['embedding']
    except Exception:
        return None  # A null embedding marks rows that could not be embedded

# Streaming pipeline
(
    spark.readStream.table(f'{schema_name}.pdf_raw')
    .withColumn("decoded_text", extract_pdf_text(F.col("content")))                  # Extract text from PDF bytes
    .withColumn("chunks", F.explode(llama_index_chunk_udf(F.col("decoded_text"))))   # Apply LlamaIndex chunking
    .withColumn("embedding", azure_openai_embed_udf(F.col("chunks")))                # Apply embedding
    .withColumn("id", F.expr("uuid()"))                                              # Unique id, used later as the vector index primary key
    .selectExpr("id", "path as url", "chunks as content", "embedding")               # Select final columns
    .writeStream
    .trigger(availableNow=True)
    .option("checkpointLocation", f'dbfs:{data_directory_path}/checkpoints/pdf_cleans')
    .table(f'{schema_name}.pdf_clean_embedding')
    .awaitTermination()
)
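
Once the stream completes, a quick sanity check confirms that chunks and embeddings were produced (table and column names as defined above):

clean_df = spark.table(f"{schema_name}.pdf_clean_embedding")
print(clean_df.count())
clean_df.select("url", "content").show(5, truncate=80)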

Step 5: Vector Search Implementation

Databricks’ Vector Search capabilities enable efficient retrieval of similar documents based on their embeddings. This is achieved by creating an index within Databricks and storing the vectorized data for fast querying.

Steps:

  1. Store embeddings in a vector index.
  2. Use Databricks’ built-in vector search engine to perform similarity searches.

Code:

# Import the Databricks VectorSearch client for vector-based search operations
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient(disable_notice=True)
vector_name = "healthcare_medium"

# Create the Vector Search endpoint if it does not already exist
existing_endpoints = vsc.list_endpoints().get('endpoints', [])
if vector_name not in [e['name'] for e in existing_endpoints]:
    vsc.create_endpoint(name=vector_name, endpoint_type="STANDARD")

clean_table = f"{catalog_name}.{schema_name}.pdf_clean_embedding"
index_name = f"{catalog_name}.{schema_name}.healthcare_index"

vsc.create_delta_sync_index(
    endpoint_name=vector_name,
    index_name=index_name,
    source_table_name=clean_table,
    pipeline_type="TRIGGERED",       # Sync needs to be manually triggered
    primary_key="id",
    embedding_dimension=3072,        # Match your model's embedding size
    embedding_vector_column="embedding"
)

More about Vector Search can be found in the Databricks Vector Search documentation.

Note: the source Delta table must have change data feed enabled (delta.enableChangeDataFeed = true) before creating a Delta Sync index.
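
Enabling it is a one-line table property change on the source table, for example:

spark.sql(f"""
    ALTER TABLE {catalog_name}.{schema_name}.pdf_clean_embedding
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")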

Step 6: Testing the Solution

To validate the pipeline, I query the vector search index with a sample question.

import mlflow.deployments

question = "How does Australia's healthcare system balance universal access through Medicare with the role of private supplementary insurance, and what benefits do Australians gain from each?"
deploy_client = mlflow.deployments.get_deploy_client("databricks")
response = deploy_client.predict(
    endpoint="embedding_aifoundry", inputs={"input": question})
embeddings = [e['embedding'] for e in response.data]

results = vsc.get_index(vector_name, index_name).similarity_search(
    query_vector=embeddings[0],
    columns=["url", "content"],
    num_results=1)

docs = results.get('result', {}).get('data_array', [])
print(docs)

Challenges and Best Practices

  • Data Sensitivity: Ensure de-identification and anonymization of patient information.
  • Real-Time Processing: Use Auto Loader for efficient ingestion of live data streams.
  • Scalability: Leverage Databricks’ distributed computing for handling large datasets.

Conclusion

In this article, we explored the data preparation and processing workflow for a healthcare GEN AI application using Databricks and Azure AI Foundry. By combining cutting-edge tools like Auto Loader, Delta Lake, LlamaIndex, Databricks Model Serving, and vector search, we’ve built a robust pipeline that sets the foundation for AI-powered healthcare solutions.

Stay tuned for the next part of this series, where we'll dive deeper into preparing the model and then deploying it.
