👋 Hello World! I’m Nico Halecky, and I lead engineering at HumanSignal. I joined HumanSignal because I wanted to deliver the tooling my past applied data and ML teams always needed, but never had. We’ve been heads down and busy this year, and we’re ready to let you all know about this pivotal moment as we bring something new to you 😀.
We’re only halfway through 2023, and yet this year’s advancements across the AI ecosystem have been nothing short of world-impacting. At HumanSignal, we’re obsessed with building tools that collapse the barrier between AI and humans (betting big on humans) to usher in novel workflows for AI development, in particular the creation and maintenance of the datasets that AI learns from. We want to democratize access to these tools so that anyone can build and advance AI. With that, we’re incredibly excited to expand our product offering to the new domain of Data Discovery, which establishes a key pillar of our product vision to Discover, Signal, and Supervise.
While Label Studio currently delivers a best-in-class product for scaling refined human labeling workflows, it does not support the necessary upstream workflows for generation of data samples to be targeted for labeling. These efforts can be significant, particularly when you are trying to ensure samples represent the downstream application well. The workflows and processes that allow human subject-matter experts to identify, understand, and contextualize data—including those to create samples—are referred to as Data Discovery, and they are absolutely critical for the success of any dependent model. We've learned through four years of supporting our users' needs that such Data Discovery workflows are often ad-hoc, involve direct inspection, querying, and basic statistics, and have scaled in complexity with enterprise growth towards the cloud and data.
Data discovery is a set of workflows and processes that allow human SMEs to identify, understand, and contextualize data and involve direct inspection, querying, and basic statistics.
Meanwhile, 2023 has also seen generative AI (GenAI) models make significant progress in both performance and accessibility (OSS 🔥), and they are primed to be introduced as generalized knowledge sources that can deliver real value for users in data context generation. Similarly, advancements in vector databases (OSS 🔥) provide an incredibly optimized way to store and query embeddings (vector representations of unstructured data projected into a latent space), allowing for novel ways to interact with unstructured data at scale. These are huge advancements, and our hats are off to the OSS AI ecosystem that continues to lift all boats. We stand on the shoulders of giants!
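To make the idea of querying embeddings a little more concrete, here is a minimal sketch of similarity search in plain Python. It is not how our product or any particular vector database is implemented. The file names, the three-dimensional vectors, and the `search` helper are all hypothetical and made up for illustration. Real embeddings come from a model and have hundreds or thousands of dimensions, and a vector database would use an optimized index rather than a linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def search(index, query_vector, top_k=2):
    """Return the ids of the top_k items most similar to the query.

    This is a brute-force scan over every stored vector.
    """
    scored = [
        (cosine_similarity(vector, query_vector), item_id)
        for item_id, vector in index.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [item_id for _, item_id in scored[:top_k]]


# Hypothetical toy embeddings (assumed values, not produced by a real model).
index = {
    "cat_photo.jpg": [0.9, 0.1, 0.0],
    "dog_photo.jpg": [0.8, 0.3, 0.1],
    "invoice.txt": [0.0, 0.2, 0.9],
}

# Pretend this vector is the embedding of a natural language query.
query = [1.0, 0.2, 0.0]

results = search(index, query)
print(results)
```

In this toy example the two photo vectors point in nearly the same direction as the query, so they rank ahead of the invoice. The same principle is what makes it possible to search unstructured data by meaning rather than by exact keywords.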
We’re combining both of these key technologies and integrating them with Label Studio Enterprise to deliver a novel data discovery workflow. This new capability lets users easily index their cloud-scale datasets, search them with natural language and similarity, and seamlessly send search results to Label Studio projects to do all sorts of great things in applied AI. This is a cloud-scale solution that serves as the launching pad for data scientists and labeling teams to understand new datasets, bootstrap new models, and refine existing models.
Let’s be clear about the incredible value this functionality provides: within minutes you can take any cloud bucket of unstructured text or image data, build a dataset that you can search using natural language, and then easily export the results to a Label Studio project for a refined labeling pass. It does not require any prior human review, because it leverages the general knowledge of state-of-the-art AI so you can dig in immediately. This functionality is simple but very powerful, and we’re super excited to see what you do with it.
Let’s explore more about specific use cases for data discovery:
What we’re launching today is just the start of our expansion into data discovery and its huge impact for data teams. We see it as a first-class workflow in any applied data context generation effort. In the future, there are many features that we are considering, including deeper integrations with data warehouses and semi-structured data, zero-shot learning to bootstrap pre-annotations, and dataset sharing across teams, projects, models, and organizations to allow for deeper collaboration.
Finally, we’re keenly interested in understanding how data discovery facilitates generative AI fine-tuning, and we’re looking at ways to introduce incredibly powerful feedback loops for enterprises to capture human signals from their subject matter experts, helping you build AI with your company’s DNA. There’s much more on this to come, and you can expect to hear more from our engineering team going forward in the form of monthly blogs on topics we are passionate about.
It’s a really exciting time, and we’re looking forward to building the future of AI workflow and orchestration tooling with you and for you. Also, a shameless plug: if any of the above resonates and you’re interested in building novel AI tooling, please consider joining us. I’m looking for curious, passionate, and empathetic engineering ICs and managers!
I’m grateful for your community, and excited for what comes next!
Nico Halecky VPE
Data discovery is currently in Private Preview, but we’re adding interested people to our waitlist. If you’d like to learn more about this feature and request an invite to the Private Preview, you can sign up here.