nbdatatools

This is an accessory module of the NoSQLBench project. It focuses on test data management, particularly the vector test data used for approximate nearest neighbor (ANN) testing.

Testing tools that require specialized data sometimes need an additional line of defense to ensure that the data is appropriate. This repo is a place to put such tools in support of NoSQLBench and other testing systems.

modules

specs

The specs module documents the vectordata layout standard (dataset.yaml + file facets) used by this repo. The conventions described there are directly supported by the other modules.

The format was initially inspired by that of ann-benchmarks, but has since been extended to support a wide variety of test data configurations.

VectorData

This is an API for working directly with the test data format documented in this repo. It allows multiple testing systems to access the same data easily and consistently.

vectordata Javadoc

VectorData javadocs are graciously hosted by javadoc.io.

nbvectors

This is the executable CLI that ships the vector test data tools. Run java -jar nbvectors.jar --help (or append --help to any command) for full option details.

Current commands and subcommands (run --help on any of these for options):

analyze     Inspect vector datasets
  count_zeros      Count zero vectors
  describe         Summarize dataset structure
  select           Extract vectors by index/range
  slice            Window data by range
  find             Locate vectors matching criteria
  check-endian     Endianness sanity check
  verify_knn       Verify KNN answer-keys for one profile
  verify_profiles  Efficient multi-profile KNN verification
  flamegraph       Profile hotspots during analysis

convert     Convert between vector formats
  file          fvec/ivec/bvec/csv/json ↔ other formats

compute     CPU helpers
  knn           Generate ground-truth neighbors
  sort          External merge sort for vectors

generate    Produce or slice data
  dataset       Create sample dataset with dataset.yaml
  vectors       Generate random vectors
  mktestdata    Build base/query/ground-truth trio
  fvec-extract  Slice float vectors
  ivec-extract  Slice index files
  ivec-shuffle  Reshuffle integer vectors

datasets    Work with catalogs and downloads
  list          Browse catalogs
  download      Pull datasets/profiles
  prebuffer     Warm caches
  plan          Emit nbvectors commands to build missing artifacts
  curlify       Emit curl commands for remote dataset.yaml with ranged reads

vectordata  Explore vectordata layouts
  info          Summarize dataset and profiles
  views         List views per profile
  profiles      List profile names
  size          Show counts/dimensions for a view
  sample        Print sample vectors from a view
  prebuffer     Prebuffer a view or profile
  cat           Stream vectors from a view
  verify        Prebuffer as a verification pass
  repl          Interactive explorer

catalog     Emit catalog.json/yaml for dataset roots

fetch       Download datasets from Hugging Face
  dlhf          API download with parquet support

merkle      Manage Merkle trees for remote integrity
  create        Build Merkle reference
  verify        Verify against reference
  summary       Summarize tree
  diff          Compare trees
  path          Show paths to leaves
  treeview      Render tree view
  spoilbits     Corrupt specific bits
  spoilchunks   Corrupt specific chunks

cleanup     Clean fvec files
  cleanfvec      Drop zero/duplicate vectors

version     Print version/build information
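
To make the kind of check behind a command like analyze count_zeros concrete, here is a minimal Java sketch that counts all-zero vectors in an fvec stream. It is not the nbvectors implementation. It assumes the conventional fvecs record layout (a little-endian 4-byte integer dimension followed by that many little-endian 4-byte floats), and the class and method names are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of nbvectors.
public class FvecZeroCount {

    /**
     * Counts records whose components are all zero.
     * Assumed layout per record: int32 dim (little-endian), then dim float32 values.
     */
    public static long countZeroVectors(InputStream in) throws IOException {
        long zeros = 0;
        byte[] header = new byte[4];
        while (true) {
            int n = in.readNBytes(header, 0, 4);
            if (n == 0) {
                break; // clean end of stream
            }
            if (n < 4) {
                throw new EOFException("truncated dimension header");
            }
            int dim = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN).getInt();
            if (dim <= 0) {
                throw new IOException("invalid dimension: " + dim);
            }
            int bodyLen = Math.multiplyExact(dim, Float.BYTES);
            byte[] body = in.readNBytes(bodyLen);
            if (body.length < bodyLen) {
                throw new EOFException("truncated vector body");
            }
            ByteBuffer buf = ByteBuffer.wrap(body).order(ByteOrder.LITTLE_ENDIAN);
            boolean allZero = true;
            for (int i = 0; i < dim; i++) {
                // -0.0f compares equal to 0.0f, so it is treated as zero here
                if (buf.getFloat() != 0.0f) {
                    allZero = false;
                    break;
                }
            }
            if (allZero) {
                zeros++;
            }
        }
        return zeros;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java FvecZeroCount <file.fvec>");
            System.exit(2);
        }
        try (InputStream in = new BufferedInputStream(Files.newInputStream(Path.of(args[0])))) {
            System.out.println("zero vectors: " + countZeroVectors(in));
        }
    }
}
```

A wrong byte order usually shows up here as an implausible dimension value, which is one reason an endianness sanity check is useful before any deeper analysis.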

nbvectors Javadoc

Nbvectors javadocs are graciously hosted by javadoc.io.

Java Version

This project is built with Java 23, and will tend to track the latest LTS at the very least. Generally speaking, one of the most effective ways to speed up your Java app is to use a modern JVM. The same applies to Java-based testing systems.
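
As a small illustration, a launcher or test harness could check the running JVM before invoking the tools. The sketch below assumes the published artifacts need a Java 23 or newer runtime, which this README does not state explicitly. The class name and the threshold constant are hypothetical.

```java
// Hypothetical pre-flight check, not part of nbdatatools.
public class JavaVersionCheck {

    // Assumption: taken from "built with Java 23"; adjust to the actual bytecode target.
    static final int REQUIRED_FEATURE = 23;

    /** Returns true when the given feature release meets the assumed minimum. */
    public static boolean isSupported(int feature) {
        return feature >= REQUIRED_FEATURE;
    }

    public static void main(String[] args) {
        // Runtime.version().feature() reports the feature release of the running JVM
        int feature = Runtime.version().feature();
        if (!isSupported(feature)) {
            System.err.println("Java " + REQUIRED_FEATURE + "+ expected, found " + feature);
            System.exit(1);
        }
        System.out.println("Running on Java " + feature);
    }
}
```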

Ideally, users of these tools should have an experience like this:

- Consistent ways to find documentation and get CLI help
- Simple parameterization of commands and features
- User-friendly terminal output and status
- Basic quality-of-life features, such as auto-completion
