Querying Bucket Access Logs
This article explains how you can use the open-source Trino SQL query engine to query your Bucket Access Logs.
Prerequisites
Before querying your Bucket Access Logs, you must perform the following steps:
- Enable Bucket Access Logs in your source bucket, specifying date-based partitioning. Make a note of the example log file name, which should look similar to:
logs/0123456789ab/us-west-004/my-images/2025/06/12/2025-06-12-00-00-00-[UniqueString]
- Create an application key with access to the Bucket Access Logs destination bucket. This is the bucket where the access logs are being written. Please note the application key immediately after creating it, as it will not be available later.
- Note your destination bucket’s endpoint, for example,
s3.us-west-004.backblazeb2.com
- Note the region segment of your destination bucket’s endpoint; in the above example, this is
us-west-004
Run Trino in Docker
The easiest way to get started with Trino is via the official Docker image. The trino-getting-started-b2 GitHub repository contains Docker Compose scripts that run Trino, plus the Apache Hive metastore and MariaDB database that you will need to query Bucket Access Logs.
First, clone the trino-getting-started-b2 GitHub repository and change to the hive/trino-b2
subdirectory within the new directory:
git clone https://github.com/backblaze-b2-samples/trino-getting-started-b2.git
cd trino-getting-started-b2/hive/trino-b2
Now you must set the S3 endpoint, region, and credentials in the catalog properties file at etc/catalog/b2.properties:
connector.name=hive
hive.metastore.uri=thrift://hive-metastore:9083
hive.non-managed-table-writes-enabled=true
hive.storage-format=PARQUET
hive.recursive-directories=true
fs.native-s3.enabled=true
s3.endpoint=https://{your destination bucket endpoint}
s3.region={the region in your destination bucket endpoint}
s3.aws-access-key={your application key ID}
s3.aws-secret-key={your application key}
s3.exclusive-create=false
Note that the s3.endpoint property must be a URL with an https:// prefix.
For example:
s3.endpoint=https://s3.us-west-004.backblazeb2.com
You must also set the fs.s3a endpoint and key properties in the conf/core-site.xml and conf/metastore-site.xml files. The relevant portions of both files should look like this:
<property>
<name>fs.s3a.endpoint</name>
<value>{your destination bucket endpoint}</value>
</property>
<property>
<name>fs.s3a.access.key</name>
<value>{your application key ID}</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>{your application key}</value>
</property>
Note that, in these configuration files, the fs.s3a.endpoint property is simply a domain name and must not contain the https:// prefix. For example:
<property>
<name>fs.s3a.endpoint</name>
<value>s3.us-west-004.backblazeb2.com</value>
</property>
Now you can run the Docker Compose command to start Trino:
docker compose up -d
Wait a few seconds for Trino to start. You can check that it has started by running the following command:
docker logs trino-b2-trino-coordinator-1
If you do not see ======== SERVER STARTED ======== near the end of the log, wait a few more seconds and rerun the same docker logs command.
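If you prefer a single check, you can filter the log output for the startup message using standard shell tools, for example:
docker logs trino-b2-trino-coordinator-1 2>&1 | grep "SERVER STARTED"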
Now you can start the Trino command-line interface (CLI):
docker exec -it trino-b2-trino-coordinator-1 trino
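Once the trino> prompt appears, you can confirm that the b2 catalog defined in etc/catalog/b2.properties is visible by listing the catalogs:
SHOW CATALOGS;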
Create a Schema and Table for Your Bucket Access Logs
In the Trino CLI, create a schema. In the following example, we’ve used bucket_access_logs as the schema name and destination-bucket as the name of the bucket containing the log files. You can use a different schema name, but you must substitute your own destination bucket name in the location property:
CREATE SCHEMA b2.bucket_access_logs WITH (location = 's3a://destination-bucket/');
Recall that the example log file name you noted earlier has the form:
prefix/account-id/region/source-bucket/2025/06/12/2025-06-12-00-00-00-[UniqueString]
To create the table, you must construct an s3a URL that locates your log files. This has the following form:
s3a://destination-bucket/prefix/account-id/source-bucket-region/source-bucket/
For example, if your source bucket name is my-images, your destination bucket name is my-images-logs, and you used the prefix logs/, the location would look like this:
s3a://my-images-logs/logs/0123456789ab/us-west-004/my-images/
Now you can create a table that references your Bucket Access Logs. In the following example, change the table name source_bucket_logs to suit your source bucket, and substitute your log location URL in the external_location property.
CREATE TABLE b2.bucket_access_logs.source_bucket_logs (
bucketowner VARCHAR, bucket_name VARCHAR,
requestdatetime VARCHAR, remoteip VARCHAR,
requester VARCHAR, requestid VARCHAR,
operation VARCHAR, key VARCHAR, request_uri VARCHAR,
httpstatus VARCHAR, errorcode VARCHAR,
bytessent BIGINT, objectsize BIGINT,
totaltime BIGINT, turnaroundtime BIGINT,
referrer VARCHAR, useragent VARCHAR,
versionid VARCHAR, hostid VARCHAR,
sigv VARCHAR, ciphersuite VARCHAR,
authtype VARCHAR, endpoint VARCHAR,
tlsversion VARCHAR, accesspointarn VARCHAR,
aclrequired VARCHAR
) WITH (
format = 'REGEX', regex = '([^ ]*) ([^ ]*) \[(.*?)\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) ([^ ]*)(?: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*))?.*$',
external_location = 's3a://destination-bucket/prefix/account-id/source-bucket-region/source-bucket/'
);
Query the Bucket Access Logs
Now you can run queries against the table, replacing bucket_access_logs.source_bucket_logs with the names you gave to your schema and table.
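If you want to double-check the column definitions before running the queries below, you can describe the table first:
DESCRIBE b2.bucket_access_logs.source_bucket_logs;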
Show some log records:
SELECT * FROM b2.bucket_access_logs.source_bucket_logs LIMIT 10;
Show the number of log records:
SELECT COUNT(*) FROM b2.bucket_access_logs.source_bucket_logs;
Show who deleted data and when (timestamp, IP address, and application key):
SELECT requestdatetime, remoteip, requester, key FROM b2.bucket_access_logs.source_bucket_logs WHERE operation like '%DELETE%';
Show all operations that were performed using an application key:
SELECT * FROM b2.bucket_access_logs.source_bucket_logs WHERE requester='00415f935cf4dcb000000009e';
Show all operations that were performed on an object in a specific period. Because the requestdatetime column is a VARCHAR, you must use the parse_datetime function to convert it to a TIMESTAMP:
SELECT
*
FROM
b2.bucket_access_logs.source_bucket_logs
WHERE
key = 'path/to/my/image.png'
AND parse_datetime(
requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z'
) BETWEEN TIMESTAMP '2025-04-22 19:00:00 UTC'
AND TIMESTAMP '2025-04-22 20:00:00 UTC';
Show how much data was transferred to a specific IP address in a specific period:
SELECT
coalesce(
SUM(bytessent),
0
) AS bytessenttotal
FROM
b2.bucket_access_logs.source_bucket_logs
WHERE
remoteip = '12.75.216.51'
AND parse_datetime(
requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z'
) BETWEEN DATE '2025-04-22'
AND DATE '2025-04-23';
Find request IDs for HTTP 4xx errors in a specific period:
SELECT
requestdatetime,
key,
httpstatus,
errorcode,
requestid,
hostid
FROM
b2.bucket_access_logs.source_bucket_logs
WHERE
httpstatus like '4%'
AND parse_datetime(
requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z'
) BETWEEN DATE '2025-04-22'
AND DATE '2025-04-23';
Show all requesters that are sending PUT object requests in a certain period:
SELECT
bucket_name,
requester,
remoteip,
key,
httpstatus,
errorcode,
requestdatetime
FROM
b2.bucket_access_logs.source_bucket_logs
WHERE
operation = 'REST.PUT.OBJECT'
AND parse_datetime(
requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z'
) BETWEEN DATE '2025-04-22'
AND DATE '2025-04-23';
Show all requesters that are sending GET object requests in a certain period:
SELECT
bucket_name,
requester,
remoteip,
key,
httpstatus,
errorcode,
requestdatetime
FROM
b2.bucket_access_logs.source_bucket_logs
WHERE
operation = 'REST.GET.OBJECT'
AND parse_datetime(
requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z'
) BETWEEN TIMESTAMP '2025-04-22 19:00:00 UTC'
AND TIMESTAMP '2025-04-22 20:00:00 UTC';
Show all anonymous requesters that are making requests to a bucket during a certain period:
SELECT
bucket_name,
requester,
remoteip,
key,
httpstatus,
errorcode,
requestdatetime
FROM
b2.bucket_access_logs.source_bucket_logs
WHERE
requester = '-'
AND parse_datetime(
requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z'
) BETWEEN TIMESTAMP '2025-04-22 19:00:00 UTC'
AND TIMESTAMP '2025-04-22 20:00:00 UTC';
Copy Bucket Access Logs to a New Table
Log files are not optimized for analytical queries, particularly when partitioning is not available. You may find that queries against the Bucket Access Logs take some time to complete if you have a lot of log records. Query performance can be improved by copying the log data to a new table in a more efficient format, such as Apache Iceberg.
Create or edit the Iceberg catalog file at hive/trino-b2/etc/catalog/iceberg.properties so that it contains the following:
connector.name=iceberg
hive.metastore.uri=thrift://hive-metastore:9083
iceberg.register-table-procedure.enabled=true
fs.native-s3.enabled=true
s3.endpoint=https://{your destination bucket endpoint}
s3.region={the region in your destination bucket endpoint}
s3.aws-access-key={your application key ID}
s3.aws-secret-key={your application key}
s3.exclusive-create=false
You must restart the Docker containers for the new catalog to be available:
docker compose restart
Use the same docker logs command as above to verify that Trino is available:
docker logs trino-b2-trino-coordinator-1
Create a new schema. Note that in this simple example, we are using the same Hive Metastore for both catalogs, so you must use a different schema name for the Iceberg catalog.
CREATE SCHEMA iceberg.bucket_access_logs_iceberg WITH (location = 's3a://destination-bucket/');
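If you want to confirm that the new catalog and schema are available, you can list the schemas in the Iceberg catalog:
SHOW SCHEMAS FROM iceberg;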
Now you can create a new table populated with the Bucket Access Logs records. Note that this table uses the Parquet file format and is partitioned by day, so time-based queries will execute efficiently. You may change the location prefix from iceberg/, but to avoid confusion, ensure that it is different from your Bucket Access Logs prefix.
CREATE TABLE iceberg.bucket_access_logs_iceberg.source_bucket_logs WITH (
format = 'PARQUET', partitioning = ARRAY[ 'day(requestdatetime)' ],
location = 's3a://destination-bucket/iceberg/'
) AS
SELECT
bucketowner,
bucket_name,
parse_datetime(
requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z'
) AS requestdatetime,
remoteip,
requester,
requestid,
operation,
key,
request_uri,
httpstatus,
errorcode,
bytessent,
objectsize,
totaltime,
turnaroundtime,
referrer,
useragent,
versionid,
hostid,
sigv,
ciphersuite,
authtype,
endpoint,
tlsversion,
accesspointarn,
aclrequired
FROM
b2.bucket_access_logs.source_bucket_logs;
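As a quick sanity check, you can compare the row counts of the two tables; they should match immediately after the copy completes:
SELECT COUNT(*) FROM b2.bucket_access_logs.source_bucket_logs;
SELECT COUNT(*) FROM iceberg.bucket_access_logs_iceberg.source_bucket_logs;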
Now you can run queries against the Iceberg table. Since the Iceberg table’s requestdatetime column is now a TIMESTAMP rather than a VARCHAR, you do not need to use the parse_datetime function in your queries.
For example, to show all operations that were performed on an object in a specific period:
SELECT
*
FROM
iceberg.bucket_access_logs_iceberg.source_bucket_logs
WHERE
key = 'path/to/my/image.png'
AND requestdatetime BETWEEN TIMESTAMP '2025-04-22 19:00:00 UTC'
AND TIMESTAMP '2025-04-22 20:00:00 UTC';
Bucket Access Logs records will continue to be written to the original table, so you will need to update the Iceberg table with new records periodically. You can do this with an INSERT statement:
INSERT INTO iceberg.bucket_access_logs_iceberg.source_bucket_logs
SELECT
bucketowner,
bucket_name,
parse_datetime(
requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z'
) AS requestdatetime,
remoteip,
requester,
requestid,
operation,
key,
request_uri,
httpstatus,
errorcode,
bytessent,
objectsize,
totaltime,
turnaroundtime,
referrer,
useragent,
versionid,
hostid,
sigv,
ciphersuite,
authtype,
endpoint,
tlsversion,
accesspointarn,
aclrequired
FROM
b2.bucket_access_logs.source_bucket_logs
WHERE
parse_datetime(
requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z'
) > (
SELECT
MAX(requestdatetime)
FROM
iceberg.bucket_access_logs_iceberg.source_bucket_logs
);
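If you want to run this update on a schedule, one possible approach (not part of the sample repository) is to save the INSERT statement to a file, for example insert_new_logs.sql (a hypothetical name), and pass it to the Trino CLI in the coordinator container non-interactively:
docker exec -i trino-b2-trino-coordinator-1 trino < insert_new_logs.sql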