Add sink timestamp section into file names #173
There is another purpose to this requirement, and an important one. If the path doesn't follow a particular convention, it is not possible to partition the data and limit queries via Athena. Ideally the S3 loader would create files in a way that makes it easy to define partitioned Athena tables and load only certain partitions. (the old format
Edit for clarity: The convention to follow is key=value in the name, e.g. run=2020-04-23. I'm unsure at present whether this must be a directory name, or whether using this convention in the filename is sufficient.
We could also use this parameter to partition data by date.
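To make the key=value idea concrete, here is a minimal sketch in Python of building a date-partitioned prefix. It is not the loader's code: the function name `partition_prefix`, the `run` key default, and the `enriched` base path are assumptions made up for illustration. As far as I know, Athena reads Hive-style partitions from key=value path segments (prefixes) rather than from the object's file name, so the sketch puts the pair in a directory-like segment.

```python
from datetime import datetime, timezone


def partition_prefix(base: str, sink_time: datetime, key: str = "run") -> str:
    """Build a Hive-style prefix such as 'enriched/run=2020-04-23/'.

    `base`, `key` and the day-level granularity are illustrative
    assumptions, not settings taken from the S3 loader.
    """
    # Normalise to UTC so that the partition does not depend on the host's zone.
    day = sink_time.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return f"{base.rstrip('/')}/{key}={day}/"


example = partition_prefix(
    "enriched", datetime(2020, 4, 23, 10, 30, tzinfo=timezone.utc)
)
print(example)  # enriched/run=2020-04-23/
```

With a layout like this, a table partitioned on `run` could restrict a query to a single day's prefix instead of scanning the whole bucket.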
Currently, file names are generated here: https://github.com/snowplow/snowplow-s3-loader/blob/master/src/main/scala/com.snowplowanalytics.s3/loader/KinesisS3Emitter.scala#L150 with only year/month/day and the Kinesis sequence number, which are mostly useless if we want to replay some set of data from S3.
/cc @istreeter
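A rough sketch of why a sink timestamp in the name would help with replays, written in Python for brevity. The naming layout (`<timestamp>-<first seq>-<last seq>`), the timestamp format, and all function names here are hypothetical; none of them come from the loader.

```python
from datetime import datetime, timezone

STAMP = "%Y%m%dT%H%M%SZ"  # hypothetical timestamp layout, always UTC


def file_name(sink_time: datetime, first_seq: str, last_seq: str) -> str:
    """Prefix the (assumed) sequence-number pair with the sink timestamp."""
    stamp = sink_time.astimezone(timezone.utc).strftime(STAMP)
    return f"{stamp}-{first_seq}-{last_seq}"


def sink_time_of(name: str) -> datetime:
    """Recover the sink timestamp from a name produced by file_name."""
    stamp = name.split("-", 1)[0]
    return datetime.strptime(stamp, STAMP).replace(tzinfo=timezone.utc)


def select_for_replay(names, start: datetime, end: datetime):
    """Keep only the files sunk in the half-open window [start, end)."""
    return [n for n in names if start <= sink_time_of(n) < end]


names = [
    file_name(datetime(2020, 4, 23, 9, 0, tzinfo=timezone.utc), "100", "199"),
    file_name(datetime(2020, 4, 23, 10, 30, tzinfo=timezone.utc), "200", "299"),
    file_name(datetime(2020, 4, 23, 12, 0, tzinfo=timezone.utc), "300", "399"),
]
window = select_for_replay(
    names,
    datetime(2020, 4, 23, 10, 0, tzinfo=timezone.utc),
    datetime(2020, 4, 23, 12, 0, tzinfo=timezone.utc),
)
print(window)  # ['20200423T103000Z-200-299']
```

With only a date and sequence numbers in the name, the same selection would need an external mapping from sequence numbers to times.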