# hamilton-help
s
This message was deleted.
e
Hey! Yep, I think it could do that well. What, specifically, are you trying to do by versioning? One easy way is through git hashes: the version of the code is simply the git hash of the commit it came from.
Another is through inspecting the code and seeing if anything changes.
But yeah, would need to know more about exactly what you want to do there (how do you want to store/access the versions, what you want to do when a version changes, whether you want to use prior versions and current versions at the same time, etc…)
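For the git-hash route, here is a minimal sketch of reading the current commit (or the last commit that touched one file) from Python. The helper names `current_git_hash` and `last_commit_for_file` are made up for illustration; they just shell out to `git rev-parse HEAD` and `git log -n 1`, and return `None` when git or a repository isn't available.

```python
import subprocess
from typing import List, Optional


def _git(args: List[str], repo_dir: str = ".") -> Optional[str]:
    """Run a git command and return its stripped stdout, or None on failure."""
    try:
        result = subprocess.run(
            ["git", *args],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, the directory does not exist,
        # or this is not a repository with at least one commit.
        return None
    return result.stdout.strip() or None


def current_git_hash(repo_dir: str = ".") -> Optional[str]:
    """Hash of the commit currently checked out."""
    return _git(["rev-parse", "HEAD"], repo_dir)


def last_commit_for_file(path: str, repo_dir: str = ".") -> Optional[str]:
    """Hash of the most recent commit that touched `path`."""
    return _git(["log", "-n", "1", "--format=%H", "--", path], repo_dir)
```

The per-file variant only changes when that file changes, so it triggers fewer recomputations than the repository-wide hash.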
a
Hi, we are processing a large batch of files containing unstructured data. From each file, we extract features with our 'feature extraction' module, and we would like to move that module to Hamilton. Each file corresponds to a row in the DB, which stores the features. We would like to store the 'feature extraction' version as a feature. If it does not overcomplicate the process, we may also want to be able to compute the features by specifying the version of the module. If the version changes, we will then specify whether the features have to be recomputed, depending on the type of changes in the module. My question is more: is there anything built into Hamilton to manage versions? Or any example?
e
So yes, this should be possible if you handle the versioning correctly. Don’t have an example currently but there are a few approaches that I think would be quite natural! Will write sample code in an hour or two when I’m at my desk. Two more questions though:
1. How do you want to handle versions? (Explicitly number versions and keep old ones around, use the code version, etc.) And do you want to keep old versions around in the same code, or are you OK with going to a prior commit?
2. When you recompute a feature, you want to recompute all downstream ones too, right?
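On the second question, "downstream" means everything that depends, directly or transitively, on the feature that changed. A rough sketch of that computation over a plain dependency mapping is below. The feature names and the `dependencies` dict are hypothetical, and this is not a Hamilton API, just a breadth-first walk.

```python
from collections import deque
from typing import Dict, List, Set


def downstream_of(changed: str, dependencies: Dict[str, List[str]]) -> Set[str]:
    """Return every feature that depends, directly or transitively, on `changed`.

    `dependencies` maps each feature to the features it reads from.
    """
    # Invert the mapping: feature -> features that consume it.
    consumers: Dict[str, Set[str]] = {}
    for feature, inputs in dependencies.items():
        for upstream in inputs:
            consumers.setdefault(upstream, set()).add(feature)

    affected: Set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for consumer in consumers.get(current, ()):
            if consumer not in affected:
                affected.add(consumer)
                queue.append(consumer)
    return affected


# Hypothetical feature graph, for illustration only.
dependencies = {
    "raw_text": [],
    "tokens": ["raw_text"],
    "token_count": ["tokens"],
    "language": ["raw_text"],
    "summary": ["tokens", "language"],
}
to_recompute = downstream_of("tokens", dependencies)
```

Here a change to `tokens` marks `token_count` and `summary` for recomputation, while `language` is left alone.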
OK, so there are two pieces of this:
1. How to get the version
2. How to compute a specific version

(1) has a few strategies:

a. Version based off of the git hash. This should be pretty easy: the version is just `(feature_name, git_hash)`, but you’ll have to recompute a ton if you change anything. You could probably optimize by using a per-file hash if you wanted to (git allows you to do that, as do other VCSs).

b. Version based off of a hash of the code. This uses something on the main branch (a convenience function) that we’re planning on releasing imminently. Here’s some code. In this case I’ve created a driver with one “feature” just to demo:
```
>> dr.list_available_variables()

[Variable(name='foo', type=<class 'int'>, tags={'module': 'temporary_module_8813b8ba_e321_4b96_9a09_8152c45220e3'}, is_external_input=False, originating_functions=(<function foo at 0x10360e040>,)),
 Variable(name='b', type=<class 'int'>, tags={}, is_external_input=True, originating_functions=None)]
```
Note that you can grab the “originating functions” of the node by doing this. If it is `None`, that means it’s an external input:
```
>> feature = dr.list_available_variables()[0]
>> print(feature.originating_functions)
(<function temporary_module_8813b8ba_e321_4b96_9a09_8152c45220e3.foo(b: int) -> int>,)
```
Then you can grab the code:
```
>> code = inspect.getsource(feature.originating_functions[0])
>> print(code)
@config.when(a=None)
def foo(b: int) -> int:
    return b
```
Which includes the code + all decorators. Finally you can add a hash:
```
>> hashlib.sha256(code.encode('utf-8')).hexdigest()
Out[35]: '06ec36295a2978b5e6298f23f5d9df8f01fff75511e11bfc931a032b81b66713'
```
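Putting those steps together without a driver, a self-contained sketch might look like the following. It assumes nothing from Hamilton: the helper `source_hash` and the two throwaway modules `features_v1` / `features_v2` are invented for the demo, and the modules are written to a temporary directory so that `inspect.getsource` has a real file to read.

```python
import hashlib
import importlib.util
import inspect
import tempfile
from pathlib import Path


def source_hash(fn) -> str:
    """SHA-256 of the function's source text (decorator lines included)."""
    code = inspect.getsource(fn)
    return hashlib.sha256(code.encode("utf-8")).hexdigest()


def load_module(path: Path, name: str):
    """Import a module from an explicit file path."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    # Two hypothetical versions of the same feature function.
    v1_path = Path(tmp) / "features_v1.py"
    v1_path.write_text("def foo(b: int) -> int:\n    return b\n")
    v2_path = Path(tmp) / "features_v2.py"
    v2_path.write_text("def foo(b: int) -> int:\n    return b + 1\n")

    foo_v1 = load_module(v1_path, "features_v1").foo
    foo_v2 = load_module(v2_path, "features_v2").foo

    # Hash while the source files still exist on disk.
    hash_v1 = source_hash(foo_v1)
    hash_v2 = source_hash(foo_v2)
    hash_v1_again = source_hash(foo_v1)
```

One design consequence to keep in mind: the hash is computed over the raw text, so a comment or whitespace edit changes the version even if the behaviour is identical.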
And then you have a unique version! When the code changes, the hash will as well.

Finally, you have:

c. Handle versions yourself
```python
from hamilton.function_modifiers import config

@config.when(foo_version=1)
def foo__v1() -> ...:
    ...

@config.when(foo_version=2)
def foo__v2() -> ...:
    ...

@config.when(foo_version=None)  # default
def foo() -> ...:
    ...
```
It’s verbose, but it allows you to keep around old versions.
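In Hamilton the choice between these would come from the `foo_version` key in the driver's config. To show just the selection idea without depending on the library, here is a plain-Python analogue. The `foo_version` decorator, the `_FOO_VERSIONS` registry and the function bodies are all hypothetical stand-ins, not how Hamilton implements `@config.when`.

```python
from typing import Callable, Dict, Optional

# Registry of implementations keyed by version; None marks the default.
_FOO_VERSIONS: Dict[Optional[int], Callable[[int], int]] = {}


def foo_version(version: Optional[int]):
    """Register the decorated function as the implementation for `version`."""
    def register(fn: Callable[[int], int]) -> Callable[[int], int]:
        _FOO_VERSIONS[version] = fn
        return fn
    return register


@foo_version(1)
def foo__v1(b: int) -> int:
    return b


@foo_version(2)
def foo__v2(b: int) -> int:
    return b * 2


@foo_version(None)  # default
def foo(b: int) -> int:
    return b * 3


def compute_foo(b: int, version: Optional[int] = None) -> int:
    """Run the implementation registered for `version` (default if None)."""
    try:
        impl = _FOO_VERSIONS[version]
    except KeyError:
        raise ValueError(f"unknown foo version: {version!r}") from None
    return impl(b)
```

Because every version stays importable, old rows can be reproduced by passing the version number that was stored with them.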
For (2) (querying versions), it depends which strategy of (1) you choose (nothing’s ever easy of course), but I think the only one with easy querying is (c), otherwise you have to store the git pointers.
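Whichever strategy you pick, the storage side can stay simple: keep the version string next to the features and compare it before recomputing. A rough sketch with SQLite follows. The `file_features` table, its columns and both helper functions are assumptions made for illustration, and the "recompute on any mismatch" rule is a simplification of deciding per type of change.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE file_features (
        file_id TEXT PRIMARY KEY,
        extractor_version TEXT NOT NULL,
        features TEXT NOT NULL  -- JSON-encoded feature dict
    )
    """
)


def needs_recompute(conn: sqlite3.Connection, file_id: str, current_version: str) -> bool:
    """True if the file has no row yet or was computed with another version."""
    row = conn.execute(
        "SELECT extractor_version FROM file_features WHERE file_id = ?",
        (file_id,),
    ).fetchone()
    return row is None or row[0] != current_version


def store_features(conn: sqlite3.Connection, file_id: str, version: str, features: dict) -> None:
    """Insert or overwrite the row for this file, tagging it with the version."""
    conn.execute(
        "INSERT OR REPLACE INTO file_features (file_id, extractor_version, features) "
        "VALUES (?, ?, ?)",
        (file_id, version, json.dumps(features, sort_keys=True)),
    )
    conn.commit()
```

The version column can hold a git hash, a source hash, or an explicit version number; the comparison logic does not care which.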
For the hashing, we’re planning to add the hash as a convenience function onto `Variable`, so you don’t have to worry about it. Then it would be as simple as something like:
```
>> vars = dr.list_available_variables()
>> var = vars[...]
>> var.version_hash()

'06ec36295a2978b5e6298f23f5d9df8f01fff75511e11bfc931a032b81b66713'
```
And now you can see the reason we haven’t added it yet: everyone wants something different, so we’ve given you the lower-level tools to make it pretty easy to build on top 🙂 Happy to chat more about it or hop on a call some time tomorrow if you want to talk through the options.