Querying Bucket Access Logs

    This article explains how you can use the open-source Trino SQL query engine to query your Bucket Access Logs.

    Prerequisites

    Before querying your Bucket Access Logs, you must perform the following steps:

    1. Enable Bucket Access Logs in your source bucket, specifying date-based partitioning. Make a note of the example log file name, which should look similar to: logs/0123456789ab/us-west-004/my-images/2025/06/12/2025-06-12-00-00-00-[UniqueString].
    2. Create an application key with access to the Bucket Access Logs destination bucket. This is the bucket where the access logs are being written.
      Make a note of the application key immediately after creating it, as you will not be able to retrieve it later.
    3. Note your destination bucket’s endpoint, for example, s3.us-west-004.backblazeb2.com
    4. Note the region segment of your destination bucket’s endpoint. In the example above, this is us-west-004
    There is a bug in Trino that causes partition projection to fail. This means that queries for records within a particular period need to scan all of the log records, rather than just those in partitions satisfying the query conditions. While the query results are correct, the queries may take substantially longer to run than they should. We will update this article when the Trino team fixes this issue.

    Run Trino in Docker

    The easiest way to get started with Trino is via the official Docker image. The trino-getting-started-b2 GitHub repository contains Docker Compose scripts that run Trino, plus the Apache Hive metastore and MariaDB database that you will need to query Bucket Access Logs.

    First, clone the trino-getting-started-b2 GitHub repository and change to the hive/trino-b2 subdirectory within the new directory:

    git clone https://github.com/backblaze-b2-samples/trino-getting-started-b2.git

    cd trino-getting-started-b2/hive/trino-b2

    Now you must set the S3 endpoint, region, and credentials in the catalog properties file at etc/catalog/b2.properties:

    connector.name=hive
    hive.metastore.uri=thrift://hive-metastore:9083
    hive.non-managed-table-writes-enabled=true
    hive.storage-format=PARQUET
    hive.recursive-directories=true
    fs.native-s3.enabled=true
    s3.endpoint=https://{your destination bucket endpoint}
    s3.region={the region in your destination bucket endpoint}
    s3.aws-access-key={your application key ID}
    s3.aws-secret-key={your application key}
    s3.exclusive-create=false

    Note that the s3.endpoint property must be a URL with an https:// prefix. 

    For example:

    s3.endpoint=https://s3.us-west-004.backblazeb2.com

    You must also set the fs.s3a.endpoint, fs.s3a.access.key, and fs.s3a.secret.key properties in the conf/core-site.xml and conf/metastore-site.xml files. The relevant portions of both files should look like this:

       <property>
            <name>fs.s3a.endpoint</name>
            <value>{your destination bucket endpoint}</value>
        </property>
        <property>
            <name>fs.s3a.access.key</name>
            <value>{your application key ID}</value>
        </property>
        <property>
            <name>fs.s3a.secret.key</name>
            <value>{your application key}</value>
        </property>

    Note that, in these configuration files, the fs.s3a.endpoint property is simply a domain name, and must not contain the https:// prefix. For example:

       <property>
            <name>fs.s3a.endpoint</name>
            <value>s3.us-west-004.backblazeb2.com</value>
        </property>

    Now you can run the Docker Compose command to start Trino:

    docker compose up -d

    Wait a few seconds for Trino to start. You can check that it has started by running the following command:

    docker logs trino-b2-trino-coordinator-1

    If you do not see ======== SERVER STARTED ======== near the end of the log, wait a few more seconds and rerun the same docker logs command.

    Now you can start the Trino command-line interface (CLI):

    docker exec -it trino-b2-trino-coordinator-1 trino
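
    Once the CLI prompt appears, you can check that the b2 catalog defined above has been loaded (a quick sanity check; the output will also include Trino’s built-in catalogs, such as system):

    SHOW CATALOGS;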

    Create a Schema and Table for Your Bucket Access Logs

    In the Trino CLI, create a schema. In the following example, we’ve used bucket_access_logs as the schema name and destination-bucket as the name of the bucket containing the log files. You can use a different schema name, but you must substitute your destination bucket name in the location property:

    CREATE SCHEMA b2.bucket_access_logs WITH (location = 's3a://destination-bucket/');
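
    If you want to confirm that the schema was created, you can list the schemas in the b2 catalog:

    SHOW SCHEMAS FROM b2;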

    Recall that the example log file name you noted earlier has the form:

    prefix/account-id/region/source-bucket/2025/06/12/2025-06-12-00-00-00-[UniqueString]

    To create the table, you must construct an s3a URL that locates your log files. This has the following form:

    s3a://destination-bucket/prefix/account-id/source-bucket-region/source-bucket/

    For example, if your source bucket name is my-images, your destination bucket name is my-images-logs, and you used the prefix logs/, the location would look like this:

    s3a://my-images-logs/logs/0123456789ab/us-west-004/my-images/

    Now you can create a table that references your Bucket Access Logs. In the following example, change the table name source_bucket_logs to a name of your choice, and substitute your log location URL in the external_location property.

    CREATE TABLE b2.bucket_access_logs.source_bucket_logs (
      bucketowner VARCHAR, bucket_name VARCHAR, 
      requestdatetime VARCHAR, remoteip VARCHAR, 
      requester VARCHAR, requestid VARCHAR, 
      OPERATION VARCHAR, KEY VARCHAR, request_uri VARCHAR, 
      httpstatus VARCHAR, errorcode VARCHAR, 
      bytessent BIGINT, objectsize BIGINT, 
      totaltime BIGINT, turnaroundtime BIGINT, 
      referrer VARCHAR, useragent VARCHAR, 
      versionid VARCHAR, hostid VARCHAR, 
      sigv VARCHAR, ciphersuite VARCHAR, 
      authtype VARCHAR, endpoint VARCHAR, 
      tlsversion VARCHAR, accesspointarn VARCHAR, 
      aclrequired VARCHAR
    ) WITH (
      format = 'REGEX', regex = '([^ ]*) ([^ ]*) \[(.*?)\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) ([^ ]*)(?: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*))?.*$', 
      external_location = 's3a://destination-bucket/prefix/account-id/source-bucket-region/source-bucket/'
    );
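
    To check that the table was created with the expected columns and types, you can describe it (substitute your own table name):

    DESCRIBE b2.bucket_access_logs.source_bucket_logs;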
    

    Query the Bucket Access Logs

    Now you can run queries against the table, replacing bucket_access_logs.source_bucket_logs with the names you gave to your schema and table.

    Show some log records:

    SELECT * FROM b2.bucket_access_logs.source_bucket_logs LIMIT 10;

    Show the number of log records:

    SELECT COUNT(*) FROM b2.bucket_access_logs.source_bucket_logs;

    Show who deleted data and when (timestamp, IP address, and application key):

    SELECT requestdatetime, remoteip, requester, key FROM b2.bucket_access_logs.source_bucket_logs WHERE operation like '%DELETE%';

    Show all operations that were performed using an application key:

    SELECT * FROM b2.bucket_access_logs.source_bucket_logs WHERE requester='00415f935cf4dcb000000009e';
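
    For a summary rather than individual records, you can group the log records by application key and operation, for example:

    SELECT 
      requester, 
      operation, 
      COUNT(*) AS request_count 
    FROM 
      b2.bucket_access_logs.source_bucket_logs 
    GROUP BY 
      requester, 
      operation 
    ORDER BY 
      request_count DESC;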

    Show all operations that were performed on an object in a specific period:

    Since the requestdatetime column has type VARCHAR, you must use the parse_datetime function to convert it to a TIMESTAMP.
    SELECT 
      * 
    FROM 
      b2.bucket_access_logs.source_bucket_logs 
    WHERE 
      Key = 'path/to/my/image.png' 
      AND parse_datetime(
        requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z'
      ) BETWEEN TIMESTAMP '2025-04-22 19:00:00 UTC' 
      AND TIMESTAMP '2025-04-22 20:00:00 UTC';
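
    If a time-based query unexpectedly returns no rows, it can help to test the format string against a single timestamp first. A standalone call along these lines (with a made-up timestamp) shows the conversion:

    SELECT parse_datetime('22/Apr/2025:19:00:00 +0000', 'dd/MMM/yyyy:HH:mm:ss Z');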
    

    Show how much data was transferred to a specific IP address in a specific period:

    SELECT 
      coalesce(
        SUM(bytessent), 
        0
      ) AS bytessenttotal 
    FROM 
      b2.bucket_access_logs.source_bucket_logs 
    WHERE 
      remoteip = '12.75.216.51' 
      AND parse_datetime(
        requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z'
      ) BETWEEN DATE '2025-04-22' 
      AND DATE '2025-04-23';
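
    To see which IP addresses received the most data during the same period, you can group the same query by remoteip, for example:

    SELECT 
      remoteip, 
      SUM(bytessent) AS bytessenttotal 
    FROM 
      b2.bucket_access_logs.source_bucket_logs 
    WHERE 
      parse_datetime(
        requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z'
      ) BETWEEN DATE '2025-04-22' 
      AND DATE '2025-04-23' 
    GROUP BY 
      remoteip 
    ORDER BY 
      bytessenttotal DESC 
    LIMIT 10;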
    

    Find request IDs for HTTP 4xx errors in a specific period:

    SELECT 
      requestdatetime, 
      key, 
      httpstatus, 
      errorcode, 
      requestid, 
      hostid 
    FROM 
      b2.bucket_access_logs.source_bucket_logs 
    WHERE 
      httpstatus like '4%' 
      AND parse_datetime(
        requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z'
      ) BETWEEN DATE '2025-04-22' 
      AND DATE '2025-04-23';
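
    To see which error codes occur most often in the same period, you can aggregate by errorcode, for example:

    SELECT 
      errorcode, 
      COUNT(*) AS error_count 
    FROM 
      b2.bucket_access_logs.source_bucket_logs 
    WHERE 
      httpstatus like '4%' 
      AND parse_datetime(
        requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z'
      ) BETWEEN DATE '2025-04-22' 
      AND DATE '2025-04-23' 
    GROUP BY 
      errorcode 
    ORDER BY 
      error_count DESC;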
    

    Show all requesters that are sending PUT object requests in a certain period:

    SELECT 
      bucket_name, 
      requester, 
      remoteip, 
      key, 
      httpstatus, 
      errorcode, 
      requestdatetime 
    FROM 
      b2.bucket_access_logs.source_bucket_logs 
    WHERE 
      operation = 'REST.PUT.OBJECT' 
      AND parse_datetime(
        requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z'
      ) BETWEEN DATE '2025-04-22' 
      AND DATE '2025-04-23';
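
    A grouped variation counts the PUT requests per application key, for example:

    SELECT 
      requester, 
      COUNT(*) AS put_requests 
    FROM 
      b2.bucket_access_logs.source_bucket_logs 
    WHERE 
      operation = 'REST.PUT.OBJECT' 
      AND parse_datetime(
        requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z'
      ) BETWEEN DATE '2025-04-22' 
      AND DATE '2025-04-23' 
    GROUP BY 
      requester 
    ORDER BY 
      put_requests DESC;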
    

    Show all requesters that are sending GET object requests in a certain period:

    SELECT 
      bucket_name, 
      requester, 
      remoteip, 
      key, 
      httpstatus, 
      errorcode, 
      requestdatetime 
    FROM 
      b2.bucket_access_logs.source_bucket_logs 
    WHERE 
      operation = 'REST.GET.OBJECT' 
      AND parse_datetime(
        requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z'
      ) BETWEEN TIMESTAMP '2025-04-22 19:00:00 UTC' 
      AND TIMESTAMP '2025-04-22 20:00:00 UTC';
    

    Show all anonymous requesters that are making requests to a bucket during a certain period:

    SELECT 
      bucket_name, 
      requester, 
      remoteip, 
      key, 
      httpstatus, 
      errorcode, 
      requestdatetime 
    FROM 
      b2.bucket_access_logs.source_bucket_logs 
    WHERE 
      requester = '-' 
      AND parse_datetime(
        requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z'
      ) BETWEEN TIMESTAMP '2025-04-22 19:00:00 UTC' 
      AND TIMESTAMP '2025-04-22 20:00:00 UTC';
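
    To see how anonymous traffic is distributed over time, you can bucket the requests by hour, for example:

    SELECT 
      date_trunc(
        'hour', 
        parse_datetime(
          requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z'
        )
      ) AS request_hour, 
      COUNT(*) AS request_count 
    FROM 
      b2.bucket_access_logs.source_bucket_logs 
    WHERE 
      requester = '-' 
    GROUP BY 
      1 
    ORDER BY 
      1;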
    

    Copy Bucket Access Logs to a New Table

    Log files are not optimized for analytical queries, particularly when partitioning is not available. You may find that queries against the Bucket Access Logs take some time to complete if you have a lot of log records. Query performance can be improved by copying the log data to a new table in a format better suited to analytics, such as Apache Iceberg.

    Edit the following catalog file and save it as hive/trino-b2/etc/catalog/iceberg.properties:

    connector.name=iceberg
    hive.metastore.uri=thrift://hive-metastore:9083
    iceberg.register-table-procedure.enabled=true
    fs.native-s3.enabled=true
    s3.endpoint=https://{your destination bucket endpoint}
    s3.region={the region in your destination bucket endpoint}
    s3.aws-access-key={your application key ID}
    s3.aws-secret-key={your application key}
    s3.exclusive-create=false

    You must restart the Docker containers for the new catalog to become available:

    docker compose restart

    Use the same docker logs command as above to verify that Trino is available:

    docker logs trino-b2-trino-coordinator-1

    Create a new schema. Note that in this simple example, we are using the same Hive Metastore for both catalogs, so you must use a different schema name for the Iceberg catalog.

    CREATE SCHEMA iceberg.bucket_access_logs_iceberg WITH (location = 's3a://destination-bucket/');

    Now you can create a new table populated with the Bucket Access Logs records. Note that this table uses the Parquet file format and is partitioned by day, so time-based queries will be executed efficiently. You may change the location prefix from iceberg, but to avoid confusion, you should ensure that it is different from your Bucket Access Logs prefix.

    It may take considerable time to create the new table, depending on the amount of data in your Bucket Access Logs.
    CREATE TABLE iceberg.bucket_access_logs_iceberg.source_bucket_logs WITH (
      format = 'PARQUET', partitioning = ARRAY[ 'day(requestdatetime)' ], 
      location = 's3a://destination-bucket/iceberg/'
    ) AS 
    SELECT 
      bucketowner, 
      bucket_name, 
      parse_datetime(
        requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z'
      ) AS requestdatetime, 
      remoteip, 
      requester, 
      requestid, 
      operation, 
      key, 
      request_uri, 
      httpstatus, 
      errorcode, 
      bytessent, 
      objectsize, 
      totaltime, 
      turnaroundtime, 
      referrer, 
      useragent, 
      versionid, 
      hostid, 
      sigv, 
      ciphersuite, 
      authtype, 
      endpoint, 
      tlsversion, 
      accesspointarn, 
      aclrequired 
    FROM 
      b2.bucket_access_logs.source_bucket_logs;
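
    Once the CREATE TABLE ... AS statement completes, you can confirm that all of the records were copied by comparing the row counts of the two tables; the counts should match, assuming no new log files arrived in the meantime:

    SELECT COUNT(*) FROM b2.bucket_access_logs.source_bucket_logs;

    SELECT COUNT(*) FROM iceberg.bucket_access_logs_iceberg.source_bucket_logs;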
    

    Now you can run queries against the Iceberg table. Since the Iceberg table’s requestdatetime column is now a TIMESTAMP rather than a VARCHAR, you do not need to use the parse_datetime function in your queries.

    For example, to show all operations that were performed on an object in a specific period:

    SELECT 
      * 
    FROM 
      iceberg.bucket_access_logs_iceberg.source_bucket_logs 
    WHERE 
      Key = 'path/to/my/image.png' 
      AND requestdatetime BETWEEN TIMESTAMP '2025-04-22 19:00:00 UTC' 
      AND TIMESTAMP '2025-04-22 20:00:00 UTC';
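
    If you want to check that the day partitioning is limiting the amount of data that is read, you can prefix the same query with EXPLAIN and inspect the resulting plan:

    EXPLAIN 
    SELECT 
      * 
    FROM 
      iceberg.bucket_access_logs_iceberg.source_bucket_logs 
    WHERE 
      key = 'path/to/my/image.png' 
      AND requestdatetime BETWEEN TIMESTAMP '2025-04-22 19:00:00 UTC' 
      AND TIMESTAMP '2025-04-22 20:00:00 UTC';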
    

    Bucket Access Logs records will continue to be written to the original table, so you will need to update the Iceberg table with new records periodically. You can do this with an INSERT statement:

    INSERT INTO iceberg.bucket_access_logs_iceberg.source_bucket_logs 
    SELECT 
      bucketowner, 
      bucket_name, 
      parse_datetime(
        requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z'
      ) AS requestdatetime, 
      remoteip, 
      requester, 
      requestid, 
      operation, 
      key, 
      request_uri, 
      httpstatus, 
      errorcode, 
      bytessent, 
      objectsize, 
      totaltime, 
      turnaroundtime, 
      referrer, 
      useragent, 
      versionid, 
      hostid, 
      sigv, 
      ciphersuite, 
      authtype, 
      endpoint, 
      tlsversion, 
      accesspointarn, 
      aclrequired 
    FROM 
      b2.bucket_access_logs.source_bucket_logs 
    WHERE 
      parse_datetime(
        requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z'
      ) > (
        SELECT 
          MAX(requestdatetime) 
        FROM 
          iceberg.bucket_access_logs_iceberg.source_bucket_logs
      );
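
    After the INSERT completes, you can check that the Iceberg table is up to date by comparing the most recent timestamp in each table; the two values should match, assuming no new log files have arrived since the INSERT:

    SELECT MAX(parse_datetime(requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z')) FROM b2.bucket_access_logs.source_bucket_logs;

    SELECT MAX(requestdatetime) FROM iceberg.bucket_access_logs_iceberg.source_bucket_logs;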
    
    Are you looking for instructions on querying Backblaze B2 Bucket Access Logs from an environment other than Trino? Please let us know, and we will consider writing more articles.
