#ingestion

chilly-potato-57465

09/23/2022, 1:33 PM
Hello! In our case we have huge datasets stored in regular file systems (images) and in HDFS. As far as I could see from previous questions and the source documentation, there are no plugins to ingest metadata from regular file systems (attributes such as created/modified timestamps, ownership, size, access rights, etc., plus the folder structure) or from HDFS. Is this still so? Additionally, I wonder how to ingest metadata (column names) from csv files. I see that this is possible with the S3 source; is it also possible for regular file systems? Thank you!!

gray-shoe-75895

09/27/2022, 12:31 AM
It’s definitely not well-documented, but the s3 source does actually support regular file systems as well
It will automatically inspect csv files and infer the data schema
However, I don’t think we currently capture the filesystem metadata (e.g. created/modified timestamps, owner/group/permissions), but it certainly would make sense to - contributions welcome on that front 🙂
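For anyone reading along, here is a rough sketch of what a recipe pointing the s3 source at a local path might look like, built as a plain Python dict. The field names (`path_specs`, `include`), the example path `/data/exports/*.csv`, and the server URL are assumptions that are not confirmed in this thread; older releases may use a single `path_spec` instead, so check the s3 source docs for your version.

```python
import json


def build_local_recipe(include_glob: str, gms_server: str = "http://localhost:8080") -> dict:
    """Build a recipe-shaped dict that points the s3 source at a local path.

    All field names below are assumptions; verify them against the
    s3 source documentation for the DataHub version you run.
    """
    return {
        "source": {
            "type": "s3",
            "config": {
                # A plain filesystem glob instead of an s3:// URI (assumed to be accepted).
                "path_specs": [{"include": include_glob}],
            },
        },
        "sink": {
            "type": "datahub-rest",
            # Hypothetical local DataHub endpoint.
            "config": {"server": gms_server},
        },
    }


# Hypothetical example path; replace with a real directory of csv files.
recipe = build_local_recipe("/data/exports/*.csv")
print(json.dumps(recipe, indent=2))
```

Once written out as YAML, a recipe like this would normally be run with `datahub ingest -c <recipe file>`.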

chilly-potato-57465

03/15/2023, 2:45 PM
Hi @gray-shoe-75895, great that the S3 source can support regular file systems. What about HDFS? Or is HDFS supported through the Spark plugin? https://datahubproject.io/docs/metadata-integration/java/spark-lineage
Too quick to press enter. Is there any development regarding the file system attributes/metadata? Thank you!

gray-shoe-75895

03/17/2023, 7:08 AM
We don’t really have much support for HDFS. For stuff sitting in s3 we can get ownership/tags, but not from filesystems
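Since the filesystem attributes are not captured by the source, one workaround could be to collect them separately. Below is a minimal sketch using only the Python standard library; it is not part of DataHub, and the helper names `file_attributes` and `walk_attributes` are made up for illustration. How the resulting dicts would be attached to datasets (for example as custom properties) is left open.

```python
import os
import stat
import tempfile
from datetime import datetime, timezone


def file_attributes(path: str) -> dict:
    """Return basic filesystem attributes for one file (hypothetical helper)."""
    st = os.stat(path)
    return {
        "path": os.path.abspath(path),
        "size_bytes": st.st_size,
        # Modification time as an ISO 8601 string in UTC.
        "modified": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc).isoformat(),
        # Symbolic mode string such as "-rw-r--r--".
        "permissions": stat.filemode(st.st_mode),
        # Numeric owner and group ids; creation time is omitted because
        # it is not available portably across platforms.
        "owner_uid": st.st_uid,
        "group_gid": st.st_gid,
    }


def walk_attributes(root: str):
    """Yield attributes for every file below root, following the folder structure."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            yield file_attributes(os.path.join(dirpath, name))


# Small demo on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.csv")
    with open(sample, "w", encoding="utf-8") as fh:
        fh.write("a,b\n1,2\n")
    collected = list(walk_attributes(tmp))
    for attrs in collected:
        print(attrs)
```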

chilly-potato-57465

03/17/2023, 9:03 AM
I see, thank you!

creamy-ram-28134

10/16/2023, 9:06 PM
Hi all - I am just coming back to this thread because we have a similar use case. Does anyone know if we can use a local MinIO S3 instead of AWS S3?

dazzling-judge-80093

10/18/2023, 3:16 PM
MinIO S3 should work with the s3 source
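As a rough illustration, a recipe for a MinIO endpoint might look like the dict below. This is a sketch under assumptions: the `aws_config` field names (`aws_endpoint_url`, `aws_access_key_id`, `aws_secret_access_key`, `aws_region`), the bucket name, and the endpoint URL are not confirmed anywhere in this thread and should be checked against the s3 source docs.

```python
import json


def build_minio_recipe(include_uri: str, endpoint_url: str,
                       access_key: str, secret_key: str) -> dict:
    """Build a recipe-shaped dict for the s3 source against a MinIO endpoint.

    Field names are assumptions; verify them for your DataHub version.
    """
    return {
        "source": {
            "type": "s3",
            "config": {
                "path_specs": [{"include": include_uri}],
                "aws_config": {
                    # Point the client at MinIO instead of AWS (assumed field name).
                    "aws_endpoint_url": endpoint_url,
                    "aws_access_key_id": access_key,
                    "aws_secret_access_key": secret_key,
                    # Placeholder region; MinIO usually does not depend on it.
                    "aws_region": "us-east-1",
                },
            },
        },
        "sink": {
            "type": "datahub-rest",
            # Hypothetical local DataHub endpoint.
            "config": {"server": "http://localhost:8080"},
        },
    }


# Hypothetical bucket, endpoint and credentials, for illustration only.
minio_recipe = build_minio_recipe(
    "s3://my-bucket/exports/*.csv",
    "http://localhost:9000",
    "minio-access-key",
    "minio-secret-key",
)
print(json.dumps(minio_recipe, indent=2))
```

In a real setup the credentials would come from environment variables or a secret store instead of being written into the recipe.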