{"id":111888,"date":"2025-01-22T08:50:58","date_gmt":"2025-01-22T16:50:58","guid":{"rendered":"https:\/\/www.backblaze.com\/blog\/?p=111888"},"modified":"2025-01-22T11:01:48","modified_gmt":"2025-01-22T19:01:48","slug":"building-an-ai-agent-with-backblaze-b2-langchain-and-drive-stats","status":"publish","type":"post","link":"https:\/\/www.backblaze.com\/blog\/building-an-ai-agent-with-backblaze-b2-langchain-and-drive-stats\/","title":{"rendered":"Building an AI Agent with Backblaze B2, LangChain, and Drive Stats"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1440\" height=\"820\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2025\/01\/bb-bh-Twelve-Labs.png\" alt=\"A decorative image showing multiple computer windows folding into the cloud. \" class=\"wp-image-111901\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2025\/01\/bb-bh-Twelve-Labs.png 1440w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2025\/01\/bb-bh-Twelve-Labs-300x171.png 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2025\/01\/bb-bh-Twelve-Labs-1024x583.png 1024w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2025\/01\/bb-bh-Twelve-Labs-768x437.png 768w\" sizes=\"auto, (max-width: 1440px) 100vw, 1440px\" \/><\/figure>\n\n\n\n<div style=\"height:15px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Last August, <a href=\"https:\/\/www.backblaze.com\/blog\/how-to-build-your-own-llm-with-backblaze-b2-jupyter-notebook\/\" target=\"_blank\" rel=\"noreferrer noopener\">I explained how you can use a Jupyter Notebook to explore AI development<\/a>; specifically, building a chatbot that answers questions based on custom context downloaded from a private bucket in Backblaze B2 Cloud Storage.\u00a0<\/p>\n\n\n\n<p>In this post, I\u2019ll look at another AI technology, agents, and show you how I built an AI agent 
that answers questions about hard drive reliability based on over 11 years of raw data from our <a href=\"https:\/\/www.backblaze.com\/drivestats\" target=\"_blank\" rel=\"noreferrer noopener\">Drive Stats<\/a> franchise.\u00a0<\/p>\n\n\n\n<p>The Drive Stats dataset is ideal for this kind of work. It\u2019s a real-world dataset, but it only weighs in at around 500 million records, consuming about 20GB of storage in Parquet format (\u201conly\u201d being a relative term), so you can use it with big data and AI tools on a laptop in a reasonable amount of time rather than spinning up an expensive virtual machine (VM) and\/or spending days waiting for an operation to complete. As an example, converting the entire Drive Stats dataset from CSV to Parquet using a Python app on my MacBook Pro takes a couple of hours. On the same hardware, converting a terabyte-scale dataset would take about four days.<\/p>\n\n\n\n<div class=\"abstract\" style=\"line-height: 1.8; margin: 24px 12px; padding: 24px 12px 10px 12px;\">\n<h4>Speaking of Drive Stats<\/h4>\n<p>The Drive Stats 2024 report comes out February 11, and we\u2019re hosting a LinkedIn Live event where Andy Klein, resident Drive Stats guru, will share highlights. 
Register today to save your spot.<\/p>\n<!--HubSpot Call-to-Action Code --><span class=\"hs-cta-wrapper\" id=\"hs-cta-wrapper-38ec6c04-aa64-4a3a-9f21-931c1853d797\"><span class=\"hs-cta-node hs-cta-38ec6c04-aa64-4a3a-9f21-931c1853d797\" id=\"hs-cta-38ec6c04-aa64-4a3a-9f21-931c1853d797\"><!--[if lte IE 8]><div id=\"hs-cta-ie-element\"><\/div><![endif]--><a href=\"https:\/\/cta-redirect.hubspot.com\/cta\/redirect\/2832298\/38ec6c04-aa64-4a3a-9f21-931c1853d797\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" class=\"hs-cta-img\" id=\"hs-cta-img-38ec6c04-aa64-4a3a-9f21-931c1853d797\" style=\"border-width:0px;\" src=\"https:\/\/no-cache.hubspot.com\/cta\/default\/2832298\/38ec6c04-aa64-4a3a-9f21-931c1853d797.png\"  alt=\"Register Here \u2794\u00a0\"\/><\/a><\/span><script charset=\"utf-8\" src=\"https:\/\/js.hscta.net\/cta\/current.js\"><\/script><script type=\"text\/javascript\"> hbspt.cta.load(2832298, '38ec6c04-aa64-4a3a-9f21-931c1853d797', {\"useNewLoader\":\"true\",\"region\":\"na1\"}); <\/script><\/span><!-- end HubSpot Call-to-Action Code -->\n<\/div>\n\n\n\n\n<p>You can use these same techniques with any large dataset, from healthcare to ecommerce to financial services. In this example, we\u2019re working with a single table, but you could adapt the sample code to a data lake comprising any number of tables.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What is an AI agent?<\/h2>\n\n\n\n<div class=\"wp-block-group has-purple-background-color has-background\"><div class=\"wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained\">\n<p>In the spirit of the times, I posed this question to ChatGPT. Its answer:<\/p>\n\n\n\n<p>An <strong>AI agent<\/strong> is a software system designed to autonomously perform tasks or make decisions based on its environment and goals. 
It leverages artificial intelligence techniques\u2014such as machine learning, reasoning, and natural language processing\u2014to process information, make decisions, and take actions to achieve specific objectives.<\/p>\n\n\n\n<p>Key components of an AI agent include:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Perception<\/strong>: The ability to sense and understand its environment. This could be through sensors, input data, or other means of gathering information.<\/li>\n\n\n\n<li><strong>Reasoning\/decision-making<\/strong>: The core processing mechanism that helps the agent interpret its environment, make decisions, and plan actions. It could use various algorithms, such as decision trees, reinforcement learning, or neural networks.<\/li>\n\n\n\n<li><strong>Action<\/strong>: Once the agent has analyzed the environment and made a decision, it takes action to achieve its goal, whether it&#8217;s performing an operation, giving a recommendation, or interacting with another system.<\/li>\n\n\n\n<li><strong>Learning<\/strong>: Some AI agents can adapt over time, improving their decision-making and actions based on experience (via reinforcement learning, supervised learning, etc.).<\/li>\n<\/ol>\n\n\n\n<p>AI agents can range from simple systems, like chatbots or virtual assistants, to more complex systems like autonomous vehicles, robots, or financial trading algorithms.<\/p>\n\n\n\n<p>In general, the term &#8220;agent&#8221; emphasizes the idea of <strong>autonomy<\/strong>\u2014the agent operates independently, often with the ability to learn, adapt, and make decisions based on changing conditions without direct human intervention.<\/p>\n<\/div><\/div>\n\n\n\n<p>In this example, the agent\u2019s environment is a database containing the Drive Stats data (more on that below), and I want it to perform the following tasks:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Based on a natural language question, such as \u201cWhich drive has the lowest annual 
failure rate?\u201d, generate a SQL query that retrieves data that will help answer the question.<\/li>\n\n\n\n<li>Execute that query against the Drive Stats dataset.<\/li>\n\n\n\n<li>Based on the query results, either create a new query that better answers the question, or generate a natural language answer.<\/li>\n<\/ul>\n\n\n\n<p>As in my previous post, I\u2019m using the open source <a href=\"https:\/\/python.langchain.com\/docs\/introduction\/\" target=\"_blank\" rel=\"noreferrer noopener\">LangChain framework<\/a>. <a href=\"https:\/\/python.langchain.com\/docs\/tutorials\/sql_qa\/\" target=\"_blank\" rel=\"noreferrer noopener\">This tutorial on building a question\/answering system over SQL data<\/a> was my starting point. I\u2019ll explain key points of the integration in this blog post; the full source code is available as a Jupyter notebook in the <a href=\"https:\/\/github.com\/backblaze-b2-samples\/ai-agent-demo\" target=\"_blank\" rel=\"noreferrer noopener\">ai-agent-demo<\/a> repository.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Querying the Drive Stats dataset<\/h2>\n\n\n\n<p>Now that I\u2019ve established that my agent will be writing a SQL query, the next question is, \u201cWhat will it be querying?\u201d I\u2019ve <a href=\"https:\/\/www.backblaze.com\/blog\/querying-a-decade-of-drive-stats-data\/\" target=\"_blank\" rel=\"noreferrer noopener\">written about querying the Drive Stats dataset before<\/a>; in that blog post I explained how I wrote a Python script to convert the Drive Stats data from the CSV format in which we publish it to <a href=\"https:\/\/parquet.apache.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">Apache Parquet<\/a>, a column-oriented file format particularly well-suited for storing tabular data for use in analytical queries, and upload it to a Backblaze B2 Bucket using the Apache Hive table format. 
There\u2019s a broad ecosystem of tools and platforms that can manipulate Parquet data in object storage (for example, Apache Spark and <a href=\"https:\/\/www.backblaze.com\/blog\/data-driven-decisions-wwith-snowflake-and-backblaze-b2\/\" target=\"_blank\" rel=\"noreferrer noopener\">Snowflake<\/a>), and I chose <a href=\"https:\/\/trino.io\/\" target=\"_blank\" rel=\"noreferrer noopener\">Trino<\/a>, the open source distributed SQL engine that forms the basis for Amazon Athena, to execute queries against the data.<\/p>\n\n\n\n<p>I could have used the same technologies for this exercise, but I decided to add <a href=\"https:\/\/iceberg.apache.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">Apache Iceberg<\/a> to the mix. While Parquet is a file format that specifies how tabular data is stored in files, Iceberg is a table format that governs how those files can be combined and interpreted as a database table. Iceberg provides a number of advantages over Hive as a table format, including better performance and much more flexible data partitioning.<\/p>\n\n\n\n<div class=\"abstract\" style=\"line-height: 1.8; margin: 24px 12px; padding: 24px 12px 10px 12px;\">\n<h4>What is partitioning?<\/h4>\n<p><a href=\"https:\/\/iceberg.apache.org\/docs\/latest\/partitioning\/\" rel=\"noopener nofollow\" target=\"_blank\">Partitioning<\/a> splits a dataset on one or more column values, easing data management and improving performance when a query includes a partition column.<\/p>\n\n<p>Partitioning by year and month makes sense for the Drive Stats dataset\u2014the resulting Parquet files are in the hundreds of megabytes, the sweet spot for Parquet data. 
To apply this partitioning to the Drive Stats data using the Hive table format, I had to create otherwise redundant <code>month<\/code> and <code>year<\/code> columns from the existing <code>date<\/code> column, complicating the schema.<\/p>\n\n<p>Iceberg, by contrast, supports <a href=\"https:\/\/iceberg.apache.org\/docs\/latest\/partitioning\/#icebergs-hidden-partitioning\" rel=\"noopener nofollow\" target=\"_blank\">hidden partitioning<\/a>, allowing you to apply a transformation to a column value to produce a partition value without adding any new columns. With the Drive Stats data, that meant I could simply define the partitioning as <code>month(date)<\/code>; the resulting partition value is the number of months since 1\/1\/1970, rather than an integer between 1 and 12.<\/p>\n<\/div>\n\n\n\n<p>LangChain&#8217;s <code>SQLDatabase<\/code> class provides access to databases via the <a href=\"https:\/\/www.sqlalchemy.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">SQLAlchemy<\/a> open-source Python library. The demo code obtains a <code>SQLDatabase<\/code> instance by providing a <a href=\"https:\/\/docs.oracle.com\/javase\/8\/docs\/api\/java\/net\/URI.html\" target=\"_blank\" rel=\"noreferrer noopener\">URI<\/a> containing the <code>trino<\/code> scheme, a username, and the location of the database node:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>db = SQLDatabase.from_uri('trino:\/\/admin@localhost:8080\/iceberg\/drivestats')<\/code><\/pre>\n\n\n\n<p>Note: In this and other code excerpts in this blog post, I\u2019ve omitted extraneous \u201cboilerplate\u201d code. 
As mentioned above, the full source code is available in the <a href=\"https:\/\/github.com\/backblaze-b2-samples\/ai-agent-demo\" target=\"_blank\" rel=\"noreferrer noopener\">ai-agent-demo<\/a> repository.<\/p>\n\n\n\n<p>As you can infer from the <code>localhost<\/code> domain name, I\u2019m running Trino on my laptop. I\u2019m actually running it in Docker, using the Iceberg\/Hive Docker Compose script from the <a href=\"https:\/\/github.com\/backblaze-b2-samples\/trino-getting-started-b2\" target=\"_blank\" rel=\"noreferrer noopener\">trino-getting-started-b2<\/a> repository. I\u2019ll dive into that example in a future blog post.<\/p>\n\n\n\n<p>A simple query confirms that we have a successful database connection:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>db.run(\"SELECT COUNT(*) FROM drivestats\")\n\n'&#91;(537220724,)]'<\/code><\/pre>\n\n\n\n<p>As the result conveys, there are over 537 million rows in the Drive Stats dataset.<\/p>\n\n\n\n<p>Each row contains the metrics collected from a single drive in the Backblaze fleet on a specific day. The schema has evolved over time, but, currently, the following columns are included:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>date: <\/strong>The date of collection.<\/li>\n\n\n\n<li><strong>serial_number: <\/strong>The unique serial number of the drive.<\/li>\n\n\n\n<li><strong>model: <\/strong>The manufacturer\u2019s model number of the drive.<\/li>\n\n\n\n<li><strong>capacity_bytes: <\/strong>The drive\u2019s capacity in bytes.<\/li>\n\n\n\n<li><strong>failure: <\/strong>1 if this was the last day that the drive was operational before failing, 0 if all is well.<\/li>\n\n\n\n<li><strong>pod_slot_num<\/strong>: The physical location of a drive within a storage server, as an integer from 0 to 59. 
The specific slot differs based on the storage server type and capacity: Backblaze (45 or 60 drives), Dell (26 drives), or Supermicro (60 drives).<\/li>\n\n\n\n<li><strong>pod_id<\/strong>: There are 20 storage servers in each Backblaze Vault. The <code>pod_id<\/code> is a numeric field with values from 0 to 19 assigned to each of the 20 storage servers.<\/li>\n\n\n\n<li><strong>vault_id<\/strong>: All data drives are members of a Backblaze Vault. Each Vault consists of either 900 or 1,200 hard drives divided evenly across 20 storage servers. The <code>vault_id<\/code> is a numeric value starting at 1,000.<\/li>\n\n\n\n<li><strong>cluster_id<\/strong>: The name of a given collection of storage servers logically grouped together to optimize system performance, formatted as a numeric field with up to two digits. Note: At this time the <code>cluster_id<\/code> is not always correct; we are working on fixing that.<\/li>\n\n\n\n<li><strong>datacenter<\/strong>: The Backblaze data center where the drive is installed, currently one of <code>ams5<\/code> (Amsterdam, Netherlands), <code>iad1<\/code> (Reston, Virginia), <code>phx1<\/code> (Phoenix, Arizona), <code>sac0<\/code> (Sacramento, California), <code>sac2<\/code> (Stockton, California), or, now live, <code>yyz1<\/code>, <a href=\"https:\/\/www.backblaze.com\/blog\/backblaze-adds-canadian-region-expanding-location-choices-and-data-sovereignty-options\/\">our new Toronto, Ontario, data center<\/a>.<\/li>\n\n\n\n<li><strong>is_legacy_format<\/strong>: Currently 0, but may change in the future as more fields are added.<\/li>\n\n\n\n<li>A collection of <a href=\"https:\/\/www.backblaze.com\/blog\/what-smart-stats-indicate-hard-drive-failures\/\">SMART<\/a> attributes. 
The number of attributes collected has risen over time; currently we store 93 SMART attributes in each record, each one in both raw and normalized form, with field names of the form <strong>smart_<\/strong><strong><em>n<\/em><\/strong><strong>_normalized<\/strong> and <strong>smart_<\/strong><strong><em>n<\/em><\/strong><strong>_raw<\/strong>, where <em>n<\/em> is between 1 and 255.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Using OpenAI to generate a SQL query<\/h2>\n\n\n\n<p>For this project, I decided to use the <a href=\"https:\/\/platform.openai.com\/docs\/overview\" target=\"_blank\" rel=\"noreferrer noopener\">OpenAI API<\/a>, rather than running a large language model (LLM) directly on my laptop. LangChain has a <a href=\"https:\/\/python.langchain.com\/docs\/integrations\/chat\/openai\/\" target=\"_blank\" rel=\"noreferrer noopener\">chat model integration for OpenAI<\/a>, as well as <a href=\"https:\/\/python.langchain.com\/docs\/integrations\/chat\/#featured-providers\" target=\"_blank\" rel=\"noreferrer noopener\">many other providers<\/a>, so you could use, for example, a local <a href=\"https:\/\/www.llama.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Llama<\/a> model (via <a href=\"https:\/\/python.langchain.com\/docs\/integrations\/chat\/ollama\/\" target=\"_blank\" rel=\"noreferrer noopener\">ChatOllama<\/a>) or one of the <a href=\"https:\/\/www.anthropic.com\/claude\" target=\"_blank\" rel=\"noreferrer noopener\">Claude<\/a> models (via <a href=\"https:\/\/python.langchain.com\/docs\/integrations\/chat\/anthropic\/\" target=\"_blank\" rel=\"noreferrer noopener\">ChatAnthropic<\/a>) if you prefer.<\/p>\n\n\n\n<p>To use the OpenAI API, you must <a href=\"https:\/\/platform.openai.com\/signup\" target=\"_blank\" rel=\"noreferrer noopener\">sign up for an OpenAI account<\/a> and <a href=\"https:\/\/platform.openai.com\/api-keys\" target=\"_blank\" rel=\"noreferrer noopener\">create an OpenAI API key<\/a>. 
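<\/p>

<p>The key lives in a <code>.env<\/code> file alongside the notebook; a one-line sketch (the value shown is a placeholder, not a real key):<\/p>

```shell
# .env -- read by python-dotenv; never commit this file to source control
OPENAI_API_KEY=sk-example-placeholder
```

<p>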
This code loads the API key from a <code>.env<\/code> file and creates a chat model instance using OpenAI\u2019s <a href=\"https:\/\/openai.com\/index\/gpt-4o-mini-advancing-cost-efficient-intelligence\/\" target=\"_blank\" rel=\"noreferrer noopener\">GPT-4o mini<\/a> model:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># OPENAI_API_KEY must be defined in the .env file\nload_dotenv()\nllm = ChatOpenAI(model=\"gpt-4o-mini\")<\/code><\/pre>\n\n\n\n<p>Now we need a system prompt template. We\u2019ll combine this with the database schema and a natural language question to form the prompt that we send to OpenAI. As in the LangChain tutorial, I\u2019m using a prompt from the LangChain <a href=\"https:\/\/smith.langchain.com\/hub\" target=\"_blank\" rel=\"noreferrer noopener\">Prompt Hub<\/a>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>query_prompt_template = hub.pull(\"langchain-ai\/sql-query-system-prompt\")\nquery_prompt_template.messages&#91;0].pretty_print()<\/code><\/pre>\n\n\n\n<p>This is the prompt template text, with the placeholders shown in <code>{braces}<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>================================ System Message ================================\n\nGiven an input question, create a syntactically correct {dialect} query to run to help find the answer. Unless the user specifies in his question a specific number of examples they wish to obtain, always limit your query to at most {top_k} results. You can order the results by a relevant column to return the most interesting examples in the database.\n\nNever query for all the columns from a specific table, only ask for a few relevant columns given the question.\n\nPay attention to use only the column names that you can see in the schema description. Be careful to not query for columns that do not exist. 
Also, pay attention to which column is in which table.\n\nOnly use the following tables:\n{table_info}\n\nQuestion: {input}<\/code><\/pre>\n\n\n\n<p>Notice how the template requires you to specify the correct SQL dialect, constrains the number of results returned, and encourages the model to not hallucinate column names that do not exist in the schema.<\/p>\n\n\n\n<p>A helper function populates the prompt template, sends it to the model, and returns the generated SQL query:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>def write_query(state: State):\n    prompt = query_prompt_template.invoke(\n        {\n            \"dialect\": db.dialect,\n            \"top_k\": 10,\n            \"table_info\": db.get_table_info(),\n            \"input\": state&#91;\"question\"],\n        }\n    )\n    structured_llm = llm.with_structured_output(QueryOutput)\n    result = structured_llm.invoke(prompt)\n    return {\"query\": result&#91;\"query\"].rstrip(';')}<\/code><\/pre>\n\n\n\n<p>We can test the helper function by calling it directly with a Python dictionary containing a simple question:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>question = {\"question\": \"How many drives are there?\"}\nquery = write_query(question)<\/code><\/pre>\n\n\n\n<p>The resulting <code>query<\/code> dictionary does indeed contain a valid SQL query, but it won\u2019t give us the answer we are looking for.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>{'query': 'SELECT COUNT(*) AS drive_count FROM drivestats'}<\/code><\/pre>\n\n\n\n<p>That query will tell us how many rows there are in the dataset, rather than how many drives. We supplied the database schema to the model, but we haven\u2019t given it any information on the semantics of the columns in the <code>drivestats<\/code> table. We can provide a bit more detail to obtain the correct query:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>question = {\"question\": \"Each drive has its own serial number. 
How many drives are there?\"}\nquery = write_query(question)<\/code><\/pre>\n\n\n\n<p>This time, the generated SQL query is correct:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>{'query': 'SELECT COUNT(DISTINCT serial_number) AS total_drives FROM drivestats'}<\/code><\/pre>\n\n\n\n<p>As you can see, it&#8217;s important to check the output of AI models\u2014they can and do generate unexpected results.<\/p>\n\n\n\n<p>A second helper function executes the query against the database:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>def execute_query(state: State):\n    execute_query_tool = QuerySQLDatabaseTool(db=db)\n    return {\"result\": execute_query_tool.invoke(state&#91;\"query\"])}<\/code><\/pre>\n\n\n\n<p>We can test it using the (correct) generated query:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>result = execute_query(query)\n\n{'result': '&#91;(430464,)]'}<\/code><\/pre>\n\n\n\n<p>We need one more helper function, to pass the result set to the model and have it generate a natural language response. This time, we define our own prompt:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>def generate_answer(state: State):\n    prompt = (\n        \"Given the following user question, corresponding SQL query, \"\n        \"and SQL result, answer the user question.\\n\\n\"\n        f'Question: {state&#91;\"question\"]}\\n'\n        f'SQL Query: {state&#91;\"query\"]}\\n'\n        f'SQL Result: {state&#91;\"result\"]}'\n    )\n    response = llm.invoke(prompt)\n    return {\"answer\": response.content}<\/code><\/pre>\n\n\n\n<p>Again, we can test it in isolation. Notice that we have to provide the question and query, as well as the result so that the model has the context it needs:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>answer = generate_answer(question | query | result)\nanswer&#91;'answer']\n\n'There are 430,464 drives.'<\/code><\/pre>\n\n\n\n<p>Success! 
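<\/p>

<p>Incidentally, the <code>|<\/code> in <code>question | query | result<\/code> is the Python 3.9+ dict union operator, which merges the three partial state dictionaries into the single dictionary the helper expects. A minimal, self-contained illustration using the values from above:<\/p>

```python
# Python 3.9+ dict union: merges the operands into a new dict, leaving the
# originals unchanged; on key collision, the rightmost operand wins.
question = {"question": "How many drives are there?"}
query = {"query": "SELECT COUNT(DISTINCT serial_number) FROM drivestats"}
result = {"result": "[(430464,)]"}

state = question | query | result
print(sorted(state))  # -> ['query', 'question', 'result']
```

<p>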
At the present time, there are indeed 430,464 drives in the Drive Stats dataset.<\/p>\n\n\n\n<p>LangChain\u2019s <a href=\"https:\/\/www.langchain.com\/langgraph\" target=\"_blank\" rel=\"noreferrer noopener\">LangGraph<\/a> orchestration framework allows us to compile our three helper functions into a single <a href=\"https:\/\/langchain-ai.github.io\/langgraphjs\/concepts\/low_level\/#graphs\" target=\"_blank\" rel=\"noreferrer noopener\">graph<\/a> object:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>graph_builder = StateGraph(State).add_sequence(\n    &#91;write_query, execute_query, generate_answer]\n)\ngraph_builder.add_edge(START, \"write_query\")\ngraph = graph_builder.compile()<\/code><\/pre>\n\n\n\n<p>We can visualize the flow in the notebook:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>display(Image(graph.get_graph().draw_mermaid_png()))<\/code><\/pre>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"324\" height=\"666\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2025\/01\/Backblaze-agent_1_query-flow.png\" alt=\"A diagram showing a query workflow. The workflow is defined as start, write_query, execute_query, generate_answer. \" class=\"wp-image-111890\" style=\"width:200px\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2025\/01\/Backblaze-agent_1_query-flow.png 324w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2025\/01\/Backblaze-agent_1_query-flow-146x300.png 146w\" sizes=\"auto, (max-width: 324px) 100vw, 324px\" \/><\/figure>\n<\/div>\n\n\n<div style=\"height:10px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>We\u2019ve combined the <code>write_query<\/code> and <code>execute_query<\/code> steps into a graph object that can run agent-generated queries. 
I\u2019ll quote the security note from the LangChain tutorial on the inherent risks in doing so:<\/p>\n\n\n\n<p class=\"has-purple-background-color has-background\">Building Q&amp;A systems of SQL databases requires executing model-generated SQL queries. There are inherent risks in doing this. Make sure that your database connection permissions are always scoped as narrowly as possible for your chain\/agent&#8217;s needs. This will mitigate though not eliminate the risks of building a model-driven system. For more on general security best practices, <a href=\"https:\/\/python.langchain.com\/docs\/security\/\" target=\"_blank\" rel=\"noreferrer noopener\">see here<\/a>.<\/p>\n\n\n\n<p>In this example, we are querying a public dataset, and I followed best practice by configuring Trino\u2019s Iceberg connector with a read-only application key scoped to the bucket containing the Drive Stats Iceberg tables.<\/p>\n\n\n\n<p>Now let\u2019s stream a new question through the flow. This mode of operation displays the output of each step as it is executed, essential for understanding the flow\u2019s behavior, particularly when it is behaving unexpectedly. The model returns structured text in <a href=\"https:\/\/en.wikipedia.org\/wiki\/Markdown\" target=\"_blank\" rel=\"noreferrer noopener\">Markdown<\/a> format. With a couple of lines of code to extract the message from the step variable, we can use the <code>display_markdown<\/code> function to render each step\u2019s output:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>for step in graph.stream(\n    {\"question\": \"Each drive has its own serial number. How many drives did each data center have on 9\/1\/2024\"}, stream_mode=\"updates\"\n):\n    # unwrap the step value to get the markdown message\n    state = one(step.values())\n    message = one(state.values())\n    display_markdown(message, raw=True)<\/code><\/pre>\n\n\n\n<p>This is the model\u2019s output, and it gives us three different messages. 
I\u2019ve separated them with a horizontal line for clarity:<\/p>\n\n\n\n<div class=\"wp-block-group\" style=\"padding-right:var(--wp--preset--spacing--60);padding-left:var(--wp--preset--spacing--60)\"><div class=\"wp-block-group__inner-container is-layout-constrained wp-container-core-group-is-layout-81e88365 wp-block-group-is-layout-constrained\">\n<p>SELECT datacenter, COUNT(DISTINCT serial_number) AS drive_count FROM drivestats WHERE date = &#8216;2024-09-01&#8217; GROUP BY datacenter ORDER BY drive_count DESC LIMIT 10<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>Error: (trino.exceptions.TrinoUserError) TrinoUserError(type=USER_ERROR, name=TYPE_MISMATCH, message=&#8221;line 3:12: Cannot apply operator: date = varchar(10)&#8221;, query_id=20250113_221649_00214_bsut5) [SQL: SELECT datacenter, COUNT(DISTINCT serial_number) AS drive_count FROM drivestats WHERE date = &#8216;2024-09-01&#8217; GROUP BY datacenter ORDER BY drive_count DESC LIMIT 10] (Background on this error at: <a href=\"https:\/\/sqlalche.me\/e\/20\/f405\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/sqlalche.me\/e\/20\/f405<\/a>)<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>The SQL query encountered an error because it attempted to compare a date column with a string in the format &#8216;YYYY-MM-DD&#8217;. Specifically, the error message indicates that the <code>date<\/code> column is of a different type (likely not a string), which is causing the type mismatch.<\/p>\n\n\n\n<p>To answer the user question about how many drives each data center had on 9\/1\/2024, the SQL query needs to be corrected. 
Here\u2019s the revised query:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>SELECT datacenter, COUNT(DISTINCT serial_number) AS drive_count \nFROM drivestats \nWHERE date = DATE '2024-09-01' \nGROUP BY datacenter \nORDER BY drive_count DESC \nLIMIT 10<\/code><\/pre>\n\n\n\n<p>This corrected query uses the <code>DATE<\/code> keyword to ensure that the date string is properly interpreted as a date type, which should resolve the type mismatch error. If executed successfully, this query will provide the count of distinct drives in each data center for the specified date. However, without executing the corrected query against the database, we cannot provide the exact counts.<\/p>\n<\/div><\/div>\n\n\n\n<p>As you can see in the output from the first step, the model generated an invalid query, comparing a date to a string, despite the database schema being included in the prompt. The output of the second step contains the resulting error message from the database, while the third step contains the model\u2019s diagnosis of the error.<\/p>\n\n\n\n<p>This exchange highlights a limitation of a flow that is simply a linear series of steps, such as <code>write_query<\/code>, <code>execute_query<\/code>, and <code>generate_answer<\/code>. We cannot rely on the model to generate a valid SQL query, although it is able to point the way towards resolving its error.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Creating a ReAct AI agent with LangGraph<\/h2>\n\n\n\n<p>The LangGraph framework gives you the capability to create AI agents based on arbitrarily complex logic. In this article, I\u2019ve used its prebuilt <a href=\"https:\/\/react-lm.github.io\/\" target=\"_blank\" rel=\"noreferrer noopener\">ReAct<\/a> (Reason+Act) agent, since it neatly demonstrates the agent concept, rewriting the SQL query repeatedly in response to database errors.<\/p>\n\n\n\n<p>There are three steps to creating the agent. 
The first is to create an instance of LangChain&#8217;s <a href=\"https:\/\/python.langchain.com\/api_reference\/community\/agent_toolkits\/langchain_community.agent_toolkits.sql.toolkit.SQLDatabaseToolkit.html\" target=\"_blank\" rel=\"noreferrer noopener\">SQLDatabaseToolkit<\/a>, passing it the database and model, and obtain its list of tools:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>toolkit = SQLDatabaseToolkit(db=db, llm=llm)\ntools = toolkit.get_tools()<\/code><\/pre>\n\n\n\n<p>The <code>tools<\/code> list contains tools that execute queries, retrieve the names, schemas and content of database tables, and check SQL query syntax.<\/p>\n\n\n\n<p>The next step is to retrieve a suitable prompt template from the Prompt Hub and populate the template placeholders:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>prompt_template = hub.pull(\"langchain-ai\/sql-agent-system-prompt\")\nsystem_message = prompt_template.format(dialect=db.dialect, top_k=10)<\/code><\/pre>\n\n\n\n<p>Here is the prompt template\u2019s text:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>================================ System Message ================================\n\nYou are an agent designed to interact with a SQL database.\nGiven an input question, create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer.\nUnless the user specifies a specific number of examples they wish to obtain, always limit your query to at most {top_k} results.\nYou can order the results by a relevant column to return the most interesting examples in the database.\nNever query for all the columns from a specific table, only ask for the relevant columns given the question.\nYou have access to tools for interacting with the database.\nOnly use the below tools. Only use the information returned by the below tools to construct your final answer.\nYou MUST double check your query before executing it. 
If you get an error while executing a query, rewrite the query and try again.\n\nDO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to the database.\n\nTo start you should ALWAYS look at the tables in the database to see what you can query.\nDo NOT skip this step.\nThen you should query the schema of the most relevant tables.<\/code><\/pre>\n\n\n\n<p>Now we can create an instance of the prebuilt agent:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>agent_executor = create_react_agent(llm, tools, \nstate_modifier=system_message)<\/code><\/pre>\n\n\n\n<p>Note how the agent must select the next step, and how the flow can cycle between the agent and tools steps:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>display(Image(agent_executor.get_graph().draw_mermaid_png()))<\/code><\/pre>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"428\" height=\"498\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2025\/01\/Backblaze-agent_2_agent-and-tools-flow.png\" alt=\"A diagram showing the workflow between tools and agent. The workflow is as follows: start, agent, then a split option to access tools (a recursive step), or to end. The diagram shows that after agent, you can optionally select tools or end, indicating that you can end without choosing tools. 
\" class=\"wp-image-111893\" style=\"width:300px\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2025\/01\/Backblaze-agent_2_agent-and-tools-flow.png 428w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2025\/01\/Backblaze-agent_2_agent-and-tools-flow-258x300.png 258w\" sizes=\"auto, (max-width: 428px) 100vw, 428px\" \/><\/figure>\n<\/div>\n\n\n<div style=\"height:10px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Again, we can stream the agent\u2019s execution to show us each step of its operation.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>for step in agent_executor.stream(\n    {\"messages\": &#91;{\"role\": \"user\", \"content\": \"Each drive has its own serial number. How many drives did each data center have on 9\/1\/2024?\"}]},\n    stream_mode=\"values\",\n):\n    step&#91;\"messages\"]&#91;-1].pretty_print()<\/code><\/pre>\n\n\n\n<p>The output from this flow is over 300 lines long; I posted it in its entirety <a href=\"https:\/\/gist.github.com\/metadaddy\/95eef09c35df4e2e3a53f70e81453ad3\" target=\"_blank\" rel=\"noreferrer noopener\">as a Gist<\/a>, but I\u2019ll summarize the steps here:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Question: Each drive has its own serial number. 
How many drives did each data center have on 9\/1\/2024?<\/li>\n\n\n\n<li>The model calls the \u201clist tables\u201d tool.<\/li>\n\n\n\n<li>The list tables tool responds with a single table name, <code>drivestats<\/code>.<\/li>\n\n\n\n<li>The model calls the \u201cget schema\u201d tool, passing it the table name.<\/li>\n\n\n\n<li>The get schema tool responds with the schema and three sample rows from the <code>drivestats<\/code> table.<\/li>\n\n\n\n<li>The model submits a query to the \u201cquery checker\u201d tool:<br \/><code>SELECT datacenter, COUNT(serial_number) AS drive_count FROM drivestats WHERE date = '2024-09-01' GROUP BY datacenter ORDER BY drive_count DESC LIMIT 10;<\/code><\/li>\n\n\n\n<li>The query checker responds with the checked query, which is the same as its input. Note that the query checker only checks the SQL query\u2019s syntax. The query contains the same data type mismatch as the query we generated earlier, as well as another error, as we\u2019re about to discover.<\/li>\n\n\n\n<li>The model submits the query to the \u201cquery executor\u201d tool.<\/li>\n\n\n\n<li>The query executor responds with a syntax error\u2014Trino does not allow a trailing semi-colon on the query.<\/li>\n\n\n\n<li>The model submits a modified query to the query checker tool:<br \/><code>SELECT datacenter, COUNT(serial_number) AS drive_count FROM drivestats WHERE date = '2024-09-01' GROUP BY datacenter ORDER BY drive_count DESC LIMIT 10<\/code><\/li>\n\n\n\n<li>The query checker responds with the checked query, which is the same as its input.<\/li>\n\n\n\n<li>The model submits the query to the \u201cquery executor\u201d tool.<\/li>\n\n\n\n<li>The query executor responds with a type mismatch error since the query tries to compare a string value with a date column.<\/li>\n\n\n\n<li>The model submits a query with the necessary DATE type identifier to the query checker tool:<br \/><code>SELECT datacenter, COUNT(serial_number) AS drive_count FROM drivestats WHERE date = 
DATE '2024-09-01' GROUP BY datacenter ORDER BY drive_count DESC LIMIT 10<\/code><\/li>\n\n\n\n<li>The query checker responds with the checked query, which is the same as its input.<\/li>\n\n\n\n<li>The model submits the query to the \u201cquery executor\u201d tool.<\/li>\n\n\n\n<li>The query executor responds with a result set:<br \/><code>[<br \/>\u00a0 ('phx1', 89477),<br \/>\u00a0 ('sac0', 78444),<br \/>\u00a0 ('sac2', 60775),<br \/>\u00a0 ('', 24080),<br \/>\u00a0 ('iad1', 22800),<br \/>\u00a0 ('ams5', 16139)<br \/>]<\/code><\/li>\n\n\n\n<li>The model returns a message containing the answer:<br \/><\/li>\n<\/ul>\n\n\n\n<p style=\"padding-right:var(--wp--preset--spacing--60);padding-left:var(--wp--preset--spacing--60)\">On September 1, 2024, the following datacenters had the specified number of drives:<br \/><br \/>1. <strong>phx1<\/strong>: 89,477 drives<br \/>2. <strong>sac0<\/strong>: 78,444 drives<br \/>3. <strong>sac2<\/strong>: 60,775 drives<br \/>4. <strong>(unknown datacenter)<\/strong>: 24,080 drives<br \/>5.<strong> iad1<\/strong>: 22,800 drives<br \/>6. <strong>ams5<\/strong>: 16,139 drives<br \/><br \/>These results show the datacenters with their respective drive counts.<\/p>\n\n\n\n<p>Now let\u2019s see if the model can calculate the annualized failure rate of a drive model. We\u2019ll use the Seagate ST4000DM000, just because that is the drive model with the most days of operation in the dataset.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>for step in agent_executor.stream(\n        {\"messages\": &#91;{\"role\": \"user\", \"content\": \"Each drive has its own serial number. 
What is the annualized failure rate of the ST4000DM000 drive model?\"}]},\n        stream_mode=\"values\",\n):\n    step&#91;\"messages\"]&#91;-1].pretty_print()<\/code><\/pre>\n\n\n\n<p><a href=\"https:\/\/gist.github.com\/metadaddy\/8349f733d9244b8f7c880158bf02adfb\" target=\"_blank\" rel=\"noreferrer noopener\">The agent\u2019s response<\/a> mixes Markdown and LaTeX notation. I used <a href=\"https:\/\/www.quicklatex.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">QuickLaTeX<\/a> to render the LaTeX to images:<\/p>\n\n\n\n<div class=\"wp-block-group\" style=\"padding-right:var(--wp--preset--spacing--60);padding-left:var(--wp--preset--spacing--60)\"><div class=\"wp-block-group__inner-container is-layout-constrained wp-container-core-group-is-layout-81e88365 wp-block-group-is-layout-constrained\">\n<p>The annualized failure rate (AFR) for the ST4000DM000 drive model can be calculated using the following information:<\/p>\n\n\n\n<p>&#8211; Total failures: 5,791<\/p>\n\n\n\n<p>&#8211; Total drives: 37,040<\/p>\n\n\n\n<p>&#8211; Time period: from May 10, 2013, to September 30, 2024, which is approximately 11.35 years.<\/p>\n\n\n\n<p>The formula for calculating the annualized failure rate is:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"776\" height=\"86\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2025\/01\/Backblaze-agent_3_AFR-calc.png\" alt=\"The calculation for the annualized failure rate. It's total failures divided by total drives, multiplied by one over the total years, multiplied by 100. 
\" class=\"wp-image-111896\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2025\/01\/Backblaze-agent_3_AFR-calc.png 776w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2025\/01\/Backblaze-agent_3_AFR-calc-300x33.png 300w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2025\/01\/Backblaze-agent_3_AFR-calc-768x85.png 768w\" sizes=\"auto, (max-width: 776px) 100vw, 776px\" \/><\/figure>\n<\/div>\n\n\n<p>Plugging in the numbers:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"710\" height=\"86\" src=\"https:\/\/www.backblaze.com\/blog\/wp-content\/uploads\/2025\/01\/Backblaze-agent_4_AFR-calc-2.png\" alt=\"Real number for the annualize failure rate calculations. In this instance, the text reads 5791 divided by 37040, multiplied by one over 11.35, multiplied by 100, which equals approximately 13.77 percent. \" class=\"wp-image-111897\" srcset=\"https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2025\/01\/Backblaze-agent_4_AFR-calc-2.png 710w, https:\/\/backblazeprod.wpenginepowered.com\/wp-content\/uploads\/2025\/01\/Backblaze-agent_4_AFR-calc-2-300x36.png 300w\" sizes=\"auto, (max-width: 710px) 100vw, 710px\" \/><\/figure>\n<\/div>\n\n\n<p>Therefore, the annualized failure rate (AFR) of the ST4000DM000 drive model is approximately <strong>13.77%<\/strong>.<\/p>\n<\/div><\/div>\n\n\n\n<p>It\u2019s impressive that the agent shows its working so comprehensively, but, unfortunately, it arrives at the wrong answer. Those drives were not all running for the entire span of the Drive Stats dataset. The correct calculation involves determining the number of days with data for those drives and dividing it by 365 to get the correct number of years\u2019 operation.<\/p>\n\n\n\n<p>It\u2019s clear that the model is not able to answer questions on drive reliability given the data available to it so far. 
The solution lies in <a href=\"https:\/\/en.wikipedia.org\/wiki\/Prompt_engineering\" target=\"_blank\" rel=\"noreferrer noopener\">prompt engineering<\/a>\u2014providing more context on the semantics of the data in the system prompt.<\/p>\n\n\n\n<p>We can extend the default AI agent system prompt template to include specific instructions on working with the Drive Stats dataset:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>prompt_template.messages&#91;0].prompt.template += \"\"\"\nEach row of the drivestats table records one day of a drive\u2019s operation, and contains the serial number of a drive, its model name, capacity in bytes, whether it failed on that day, SMART attributes and identifiers for the slot, pod, vault, cluster and data center in which it is located.\n\nUse this calculation for the annualized failure rate (AFR) for a drive model over a given time period:\n\n1. **drive_days** is the number of rows for that model during the time period.\n2. **failures** is the number of rows for that model during the time period where **failure** is equal to 1.\n3. **annual failure rate** is 100 * (**failures** \/ (**drive_days** \/ 365)).\n\nUse double precision arithmetic in the calculation to avoid truncation errors. To convert an integer **i** to a double, use CAST(**i** AS DOUBLE)\n\nNote that the date column is a DATE type, not a string. 
Use the DATE type identifier when comparing the date column to a string.\n\nDo not add a semi-colon suffix to SQL queries.\"\"\"<\/code><\/pre>\n\n\n\n<p>Now, when we ask the same question on the annual failure rate of the ST4000DM000 drive model, the AI agent generates a correct SQL query and a more concise, and correct, final response (you can inspect the <a href=\"https:\/\/gist.github.com\/metadaddy\/2838c8521b6dfd6de57fb11c38417d31\" target=\"_blank\" rel=\"noreferrer noopener\">full output here<\/a>).<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>SELECT 100 * (CAST(COUNT(CASE WHEN failure = 1 THEN 1 END) AS DOUBLE) \/ (COUNT(*) \/ 365)) AS annual_failure_rate\nFROM drivestats\nWHERE model = 'ST4000DM000'<\/code><\/pre>\n\n\n\n<p style=\"padding-right:var(--wp--preset--spacing--60);padding-left:var(--wp--preset--spacing--60)\">The annual failure rate (AFR) for the ST4000DM000 drive model is approximately <strong>2.63%<\/strong>.<\/p>\n\n\n\n<p>Let\u2019s ask the AI agent for a statistic that we can corroborate from the <a href=\"https:\/\/www.backblaze.com\/blog\/backblaze-drive-stats-for-q3-2024\/\" target=\"_blank\" rel=\"noreferrer noopener\">Backblaze Drive Stats for Q3 2024<\/a> blog post.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>response = agent_executor.invoke(\n    {\"messages\": &#91;{\"role\": \"user\", \"content\": \"What was the annual failure rate of the ST8000NM000A drive model in Q3 2024?\"}]}\n)\nresponse&#91;'messages']&#91;-3].pretty_print()\ndisplay_markdown(response&#91;'messages']&#91;-1].content, raw=True)<\/code><\/pre>\n\n\n\n<p>The query makes sense, and the response agrees with the table in the blog post:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>SELECT 100 * (CAST(SUM(failure) AS DOUBLE) \/ (COUNT(*) \/ 365)) AS annual_failure_rate\nFROM drivestats\nWHERE model = 'ST8000NM000A' AND date >= DATE '2024-07-01' AND date &lt; DATE '2024-10-01'<\/code><\/pre>\n\n\n\n<p 
style=\"padding-right:var(--wp--preset--spacing--60);padding-left:var(--wp--preset--spacing--60)\">The annual failure rate (AFR) of the ST8000NM000A drive model in Q3 2024 is approximately 1.61%.<\/p>\n\n\n\n<p>Interestingly, this time the SQL query used <code>SUM(failure)<\/code> to count the number of failures, rather than the equivalent, but rather long-winded <code>COUNT(CASE WHEN failure = 1 THEN 1 END)<\/code> it used in the previous query. Also, looking at <a href=\"https:\/\/gist.github.com\/metadaddy\/da4f4a5e98810f41d06ce510d05ed20a\" target=\"_blank\" rel=\"noreferrer noopener\">the full response<\/a>, we can see that, as directed by the custom prompt, the agent generated the correct syntax for comparing dates, so it didn\u2019t need to correct and retry any queries.<\/p>\n\n\n\n<p>Finally, let\u2019s ask a more convoluted question, including the constraints given in the blog post:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>response = agent_executor.invoke(\n    {\"messages\": &#91;{\"role\": \"user\", \"content\": \"Considering only drive models which had at least 100 drives in service at the end of the quarter and which accumulated 10,000 or more drive days during the quarter, which drive had the most failures in Q3 2024, and what was its failure rate?\"}]}\n)\nresponse&#91;'messages']&#91;-3].pretty_print()\ndisplay_markdown(response&#91;'messages']&#91;-1].content, raw=True)<\/code><\/pre>\n\n\n\n<p>Again, the AI agent is able to generate a valid SQL query, this time including a subquery, and its response matches the data from the blog post exactly:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>WITH drive_stats AS (\n    SELECT model,\n           COUNT(DISTINCT serial_number) AS drive_count,\n           COUNT(*) AS drive_days,\n           COUNT(CASE WHEN failure = 1 THEN 1 END) AS failures\n    FROM drivestats\n    WHERE date >= DATE '2024-07-01' AND date &lt; DATE '2024-10-01'\n    GROUP BY model\n    HAVING COUNT(DISTINCT serial_number) >= 100 
AND COUNT(*) >= 10000\n)\nSELECT model,\n       failures,\n       100 * (CAST(failures AS DOUBLE) \/ (CAST(drive_days AS DOUBLE) \/ 365)) AS failure_rate\nFROM drive_stats\nORDER BY failures DESC\nLIMIT 10<\/code><\/pre>\n\n\n\n<p style=\"padding-right:var(--wp--preset--spacing--60);padding-left:var(--wp--preset--spacing--60)\">The drive model with the most failures in Q3 2024 is the <strong>TOSHIBA MG08ACA16TA<\/strong>, which had <strong>181 failures<\/strong>. Its failure rate during this period was approximately <strong>1.84%<\/strong>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Closing thoughts<\/h2>\n\n\n\n<p>Building this AI agent left me astonished by its ability to generate quite complex SQL queries from natural language instructions, and frustrated by its limitations, particularly the way it would confidently produce an incorrect response rather than saying \u201cI\u2019m sorry, but I don\u2019t know how to do that.\u201d Your AI agent development process should include generous testing time, as well as ongoing monitoring to ensure that it is coming up with the right answers.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Pat Patterson uses Drive Stats data to demonstrate how to build and test your own AI agent. 
<\/p>\n","protected":false},"author":174}