question

Brian Gish asked

Timestamp parsing in AWS Glue

Hello!

I'm currently exporting all my PlayStream events to S3. I'm now playing around with AWS Glue and AWS Athena so I can write SQL against my PlayStream events. For the most part it's working perfectly. The only issue I'm seeing right now is that when I run my AWS Glue crawler, it thinks timestamp columns are string columns. Has anyone had luck writing a custom classifier to parse PlayFab datetime values as timestamp columns?

Thanks in advance,

Brian

PlayStream

brendan answered

Sorry this got a bit lost - the thinking was that we would get time to research Glue, but that didn't happen. Simply put, Glue isn't really something we've worked with, so we don't have an example we can use to test this configuration. Here's our write-up on getting events from the S3 bucket into Redshift, based upon what we have worked with to date:

Getting PlayStream Events into Redshift

One common request from PlayFab developers is how to run queries against their title's data. This outlines a simple way to set up your PlayStream events for queries using Redshift, enabling custom reporting and analytics for your title.

This process assumes that you have already set up the PlayStream archiver to write to an S3 bucket in your AWS account. If you do not have an AWS account, you can take advantage of the AWS free tier and the Redshift trial period. See https://aws.amazon.com/redshift/free-trial/ and https://aws.amazon.com/free/ for more details. If you have not set up your PlayStream archiver, see the PlayStream Archiver tab in Game Manager for what's required.

There are three steps to this process:

1. Set up an IAM role which allows read access to your PlayStream archive bucket.

2. Launch and configure a Redshift cluster.

3. Execute a COPY command to pull the data in.

Once you've completed these steps, your game's data will be available for whatever queries you want.

Set up an IAM role to read from S3

Go to your AWS console dashboard, click on Identity and Access Management, and create a new role. See http://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-service.html#roles-creatingrole-service-console for more details. The name doesn't matter as long as it's clear; for example, 'playstream-redshift-analyst' would be an appropriate role name. Select Amazon Redshift for the role type. Next, you'll need a policy which grants read access to S3. If you do not already have a policy for this, you can create one using:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:Get*",
        "s3:List*"
      ],
      "Resource": "*"
    }
  ]
}

Launch a Redshift Cluster

We next need to create our Redshift cluster. Go to the Redshift console and create a cluster; http://docs.aws.amazon.com/redshift/latest/mgmt/managing-clusters-console.html#create-cluster will guide you through the process. The only configuration that matters at this point is the instance type and the number of nodes.

To start, we recommend using the cheapest instance type with a single node to verify that everything works. This will handle a few million events gracefully, so it works fine for testing. For more serious work, consult the Redshift docs to figure out what best suits your needs.

Once you've configured your cluster, you'll need to associate the IAM role with it. See http://docs.aws.amazon.com/redshift/latest/mgmt/copy-unload-iam-role.html for more detail. This allows the cluster to assume that role automatically when it needs to.

Copy the data

Now that your Redshift cluster is up, you'll need to connect to it. See http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-using-workbench.html for some suggested SQL clients and details about how to connect. Note that Redshift is only PostgreSQL-like, so it may be missing some tables your client expects from PostgreSQL.

Once you're able to run SQL against your cluster, you're ready to load the data. It's time to use the COPY command; see http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html for a full explanation. For our purposes, it's enough to know that it reads from the source, parses the events, and writes them to the specified table. You'll need to tell Redshift how to parse the given JSON objects into rows in a table. That's where the Redshift manifest files come in. We have provided the manifests in a public S3 bucket and a public GitHub repo.

GitHub Repo: https://github.com/PlayFab/Playstream-Tools

S3 Bucket: pf-playstream-redshift-manifests

In order for a COPY command to work, you must have a table whose columns are in the same order as the entries in your manifest. In general, copying files should look like:

CREATE TABLE my_event (
  column_1 INTEGER NOT NULL,
  column_2 VARCHAR(100) NULL,
  …
);
COPY my_event FROM 's3://MyGame-Playstream-Archive/MyPrefix/MyEventType/Time'
CREDENTIALS 'aws_iam_role=<my_role>'
FORMAT AS JSON 's3://pf-playstream-redshift-manifests/namespace/event_name'
GZIP
TIMEFORMAT AS 'auto';

We have a working example for the com.playfab.player_logged_in event table on GitHub. Simply copy-paste the SQL and fill in the time range you want and the name of your Redshift role. With the login table alone, you can easily calculate important KPIs like DAU and retention by platform. Note that it may take some time to copy all of your data into the table, so be ready to wait at least a few minutes. Once you commit the transaction, your table will be good to go.
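
For illustration, here is a minimal sketch of a DAU query against such a login table. The table and column names (player_logged_in, event_timestamp, entity_id) are placeholders rather than the names from the GitHub manifest, so substitute whatever your CREATE TABLE statement actually uses:

-- Daily active users: one row per calendar day, counting distinct players.
SELECT
  TRUNC(event_timestamp) AS event_day,    -- truncate the timestamp to a date
  COUNT(DISTINCT entity_id) AS dau        -- each player counted once per day
FROM player_logged_in
GROUP BY TRUNC(event_timestamp)
ORDER BY event_day;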

Best Practices

- The options GZIP and TIMEFORMAT are required to read events from the PlayStream archive. Do not forget them.

- By default, if there are any bad events the whole COPY command will fail, and you should roll back the transaction. You can allow Redshift to skip some number of bad records using the MAXERROR option as part of the COPY command (a validation sketch follows this list).

- NOLOAD is a handy option for the COPY command while testing. It will parse the files and report errors, but not load the table. Consequently, it runs much faster.

- There is a bug in Redshift where it cannot parse datetimes whose fractional-seconds part ends in .999999x, where x is between 5 and 9. This will cause roughly 5 out of every 10 million events to fail. If your queries require perfect accuracy, copy the timestamps in as strings, remove the last digit with the SUBSTRING function, and use SELECT INTO to make a new table with proper timestamp columns (a workaround sketch also follows this list).

- We have split the archive into many small files to improve parallelization. In particular, having more nodes will significantly improve loading speeds.
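
As a quick illustration of MAXERROR and NOLOAD together, the following re-runs the COPY from the example above as a dry run: NOLOAD parses the files without writing to the table, and MAXERROR 100 lets the run continue past up to 100 bad records so you can review the errors in one pass. The bucket path, role, and manifest URL are the same placeholders as in the earlier example:

-- Dry-run COPY: parse and validate only, tolerating up to 100 bad records.
COPY my_event FROM 's3://MyGame-Playstream-Archive/MyPrefix/MyEventType/Time'
CREDENTIALS 'aws_iam_role=<my_role>'
FORMAT AS JSON 's3://pf-playstream-redshift-manifests/namespace/event_name'
GZIP
TIMEFORMAT AS 'auto'
MAXERROR 100
NOLOAD;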
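
And here is a minimal sketch of the timestamp workaround, assuming the event timestamp was copied into a staging table as a VARCHAR column and that the strings are ISO 8601 values ending in a fractional-seconds digit followed by 'Z' (the table and column names below are placeholders):

-- Staging table holds the raw timestamp string, e.g. '2017-08-01T12:34:56.9999995Z'.
-- Drop the final two characters (the last fractional digit plus the trailing 'Z'),
-- then cast the result to a proper TIMESTAMP while writing a new table.
SELECT
  event_id,    -- placeholder for your other columns
  CAST(SUBSTRING(raw_timestamp, 1, LEN(raw_timestamp) - 2) AS TIMESTAMP) AS event_timestamp
INTO my_event_clean
FROM my_event_staging;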


brendan answered

Have you tried adding a custom classifier for the timestamp?

2 comments

Brian Gish commented

Hi Brendan, sorry for the delay. I have tried a custom classifier, but I'm still unable to get it to work. It could be that I'm just doing it wrong. I've attached a screenshot of how I set it up. edit-classifier.png


Grok Pattern

%{PLAYFAB_TIMESTAMP:playfab-timestamp:timestamp}

Custom patterns

MILLISECONDS (\d){3,7}

PLAYFAB_TIMESTAMP %{YEAR}-%{MONTHNUM}-%{MONTHDAY}T%{HOUR}:%{MINUTE}:%{SECOND}.%{MILLISECONDS}Z

My pattern is based on several of the built-in patterns.

http://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html#classifier-builtin-patterns?icmpid=docs_glue_console

edit-classifier.png (23.9 KiB)
Michael Tea commented

@Brendan bump

mohamad answered

In AWS Glue, under Tables, click on your table, then click on Edit Schema at the top right. In the Timestamp row, click on "string" and select Timestamp.
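
If you would rather leave the column as a string in the Glue catalog, you can also parse it at query time in Athena, since the archived values are ISO 8601 strings. A hedged sketch with hypothetical table and column names (adjust them to your catalog):

-- Parse the ISO 8601 string column into a real timestamp at query time,
-- then count events per day.
SELECT
  date_trunc('day', from_iso8601_timestamp("timestamp")) AS event_day,
  count(*) AS events
FROM my_playstream_events
GROUP BY 1
ORDER BY 1;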
