# random
r
We have a customer requirement where a 10MB CSV will be uploaded once per week to a service we'll create. Then users, about 200 of them, will retrieve portions of that data, multiple times each day, via a web app that we'll also create and that will need to retrieve up to about 10k records - enough to require pagination. Users will also need to be able to further filter and search within that set of data. My assumption at the moment is that these are flat searches that are AND'd together. I'm pondering the best approach with AWS. It seems like too simple a requirement to consider an RDS database, but the search requirements mean that DynamoDB doesn't really work that well either. I'm thinking then that reading the contents of the data from S3 and performing the searching/filtering in memory might be the way to go. I know there's also the ability to query JSON and CSV directly from S3, but I've no experience with that and the docs I've looked at make it look not quite as straightforward as using SQL. If in-memory is the way to go, any super-duper technologies for querying JSON/CSV that you have experience of? I'm wondering how you clever lot might approach this... what are your thoughts?
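For reference, if "query JSON and CSV directly from S3" means S3 Select, here is a minimal sketch of what that looks like with the AWS SDK for JavaScript v3 (the bucket, key, and column names are invented for illustration, not taken from this thread):

```ts
import { S3Client, SelectObjectContentCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Run a limited SQL expression directly against a single CSV object in S3.
export async function selectRows(): Promise<string> {
  const res = await s3.send(
    new SelectObjectContentCommand({
      Bucket: "DATA_BUCKET", // assumed bucket name
      Key: "weekly-upload.csv", // assumed object key
      ExpressionType: "SQL",
      // Example filter; "region" is an invented column name.
      Expression: `SELECT * FROM s3object s WHERE s."region" = 'EMEA'`,
      InputSerialization: { CSV: { FileHeaderInfo: "USE" } },
      OutputSerialization: { JSON: { RecordDelimiter: "\n" } },
    })
  );

  // The response is an event stream; collect the Records payloads as they arrive.
  let out = "";
  if (res.Payload) {
    for await (const event of res.Payload) {
      if (event.Records?.Payload) {
        out += new TextDecoder().decode(event.Records.Payload);
      }
    }
  }
  return out; // newline-delimited JSON rows
}
```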
t
Generally if you need to sort and filter data by a bunch of arbitrary properties dynamo is not a good fit
I actually asked Alex Debrie about this
Hey Dax, I’d say that’s directionally correct but not a firm line. Depends on other factors as well. How big is the operative data set to which you’ll apply all the filters? If it’s something like a user’s orders, of which there will almost always be less than 100, you could fetch all the data and then filter yourself client-side. Also, how big will your total data set be? It might be easier to write in SQL initially but could be tough to scale over time.
But yep, that’s generally correct!
Yep that makes sense. Admin interfaces can be tricky with dynamo
That said, if they're just filtering on 10K records you don't really need something that efficient, right? Iterating over them server-side on every query should be fast enough
r
Yeah, I guess it's whether I bother with DDB at all. If I'm just going to read the data into memory and then filter it there, I might as well convert the CSV to JSON and write that to S3. Then I run the queries on the JSON in memory
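A minimal sketch of that in-memory approach, assuming the weekly CSV has already been converted to a JSON array in S3 (the bucket/key names and record fields below are invented for illustration):

```ts
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";

// Hypothetical record shape; the real fields come from the weekly CSV.
interface Row {
  customerId: string;
  region: string;
  status: string;
  [key: string]: string;
}

const s3 = new S3Client({});

// A 10MB file is fine to hold in memory; cache it between invocations so we
// only hit S3 on a cold start or after a new weekly upload.
let cache: Row[] | undefined;

async function loadRows(): Promise<Row[]> {
  if (cache) return cache;
  const res = await s3.send(
    new GetObjectCommand({ Bucket: "DATA_BUCKET", Key: "weekly-data.json" })
  );
  cache = JSON.parse(await res.Body!.transformToString()) as Row[];
  return cache;
}

// Flat filters AND'd together, plus simple page/pageSize pagination.
export async function query(
  filters: Partial<Row>,
  page = 1,
  pageSize = 50
): Promise<{ total: number; items: Row[] }> {
  const rows = await loadRows();
  const matched = rows.filter((row) =>
    Object.entries(filters).every(([k, v]) => v === undefined || row[k] === v)
  );
  return {
    total: matched.length,
    items: matched.slice((page - 1) * pageSize, page * pageSize),
  };
}
```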
a
I’m not sure if this is what you were alluding to by “querying directly in S3”, but Athena provides an interface for doing that which lets you use a SQL-like syntax. It’s pretty easy to set up and work with, and there are database drivers for most common client languages. You don’t need to convert to JSON if you don’t want to; it might even be easier if the data stays in CSV format, since Athena can query the CSV files directly.
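A rough sketch of the Athena route, assuming a table (called `uploads` here, an invented name) has already been defined over the CSV's S3 prefix via Glue or a CREATE EXTERNAL TABLE statement; the database and results bucket names are also invented:

```ts
import {
  AthenaClient,
  StartQueryExecutionCommand,
  GetQueryExecutionCommand,
  GetQueryResultsCommand,
} from "@aws-sdk/client-athena";

const athena = new AthenaClient({});

// Athena queries are asynchronous: start the query, poll until it finishes,
// then fetch the result set. For 10k rows the query itself is quick, but
// there is still a second or two of per-query latency.
export async function runQuery(sql: string) {
  const start = await athena.send(
    new StartQueryExecutionCommand({
      QueryString: sql,
      QueryExecutionContext: { Database: "weekly_data" }, // assumed database
      ResultConfiguration: {
        OutputLocation: "s3://athena-results-bucket/", // assumed results bucket
      },
    })
  );
  const id = start.QueryExecutionId!;

  for (;;) {
    const { QueryExecution } = await athena.send(
      new GetQueryExecutionCommand({ QueryExecutionId: id })
    );
    const state = QueryExecution?.Status?.State;
    if (state === "SUCCEEDED") break;
    if (state === "FAILED" || state === "CANCELLED") {
      throw new Error(QueryExecution?.Status?.StateChangeReason);
    }
    await new Promise((r) => setTimeout(r, 500));
  }

  const results = await athena.send(
    new GetQueryResultsCommand({ QueryExecutionId: id })
  );
  return results.ResultSet?.Rows ?? [];
}

// Usage: plain SQL over the CSV-backed table (the first row returned is the header).
// await runQuery("SELECT * FROM uploads WHERE region = 'EMEA' LIMIT 100");
```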
t
Athena is good but it's probably going to be too slow / overkill for 10K records
@Ross Coundon yeah I'd start with the S3 approach. If you eventually do need more advanced querying, another (maybe strange) way to do it is to write it to a SQLite database and store that in S3
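A rough sketch of that SQLite idea, assuming the weekly upload is converted to a .sqlite file at ingest time and each Lambda container downloads it from S3 before querying (bucket, key, table, and column names are invented; uses the better-sqlite3 package):

```ts
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import Database from "better-sqlite3";
import { writeFile } from "node:fs/promises";

const s3 = new S3Client({});
const LOCAL_PATH = "/tmp/data.sqlite"; // Lambda's writable scratch space

let db: Database.Database | undefined;

// Download the prebuilt SQLite file once per container, then reuse the handle.
async function getDb(): Promise<Database.Database> {
  if (db) return db;
  const res = await s3.send(
    new GetObjectCommand({ Bucket: "DATA_BUCKET", Key: "weekly-data.sqlite" })
  );
  await writeFile(LOCAL_PATH, await res.Body!.transformToByteArray());
  db = new Database(LOCAL_PATH, { readonly: true });
  return db;
}

// AND'd filters become a plain WHERE clause; SQLite handles sorting and
// LIMIT/OFFSET pagination.
export async function search(region: string, status: string, page = 1) {
  const pageSize = 50;
  return (await getDb())
    .prepare(
      `SELECT * FROM records
       WHERE region = ? AND status = ?
       ORDER BY customer_id
       LIMIT ? OFFSET ?`
    )
    .all(region, status, pageSize, (page - 1) * pageSize);
}
```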
r
Cool
A colleague also came up with the sqlite approach
Thanks guys
s
I stream DynamoDB data to Rockset using DynamoDB Streams. Rockset then gives a SQL view of the data, and you can create a query lambda exposed as a REST API, through which you can apply filters and also paginate. Replication lag from Dynamo to Rockset is ~5-10 sec in my experience. If you use single-table design you need to split the data into multiple tables inside your lambda; otherwise Rockset also has direct connectors. For moderate traffic I'm fine with their intro plan, which is ~$35/month, but the next tier is expensive at ~$600/month.

Another option is to stream to Elasticsearch, but that's not SQL-compatible and the cost is similar at scale.

Yet another option is to stream to S3 (Dynamo also has a Kinesis connector). I don't know how to update data there, though; if you're just appending then it's fine. You can then connect it with Glue and Athena. Earlier I used to connect Athena to Metabase and make queries from there, and it's SQL-compatible. But you still can't do arbitrary searches: as far as I remember, when setting up Athena you need to partition the data intelligently according to the access pattern, or the Athena queries will time out.

Currently I just use Rockset for production and stage and connect it via Retool so that the operations team can easily access the data. This approach is easy to build and maintain, but not cheap (although the development effort is much lower).