@geraldcombs
Contributor

Add FTYPE_METADATA_SLICE, CAP_EXTRACT_METADATA, and associated routines so that plugins can provide the location of each field.

What type of PR is this?

Uncomment one (or more) /kind <> lines:

/kind bug

/kind cleanup

/kind design

/kind documentation

/kind failing-test

/kind feature

Any specific area of the project related to this PR?

Uncomment one (or more) /area <> lines:

/area API-version

/area build

/area CI

/area driver-kmod

/area driver-bpf

/area driver-modern-bpf

/area libscap-engine-bpf

/area libscap-engine-gvisor

/area libscap-engine-kmod

/area libscap-engine-modern-bpf

/area libscap-engine-nodriver

/area libscap-engine-noop

/area libscap-engine-source-plugin

/area libscap-engine-savefile

/area libscap

/area libpman

/area libsinsp

/area tests

/area proposals

Does this PR require a change in the driver versions?

/version driver-API-version-major

/version driver-API-version-minor

/version driver-API-version-patch

/version driver-SCHEMA-version-major

/version driver-SCHEMA-version-minor

/version driver-SCHEMA-version-patch

What this PR does / why we need it:

This makes it possible to extend extractor plugins so that they can provide the location of each field in each event or log message.

Which issue(s) this PR fixes:

This is required to fix https://gitlab.com/wireshark/wireshark/-/issues/20449

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@codecov

codecov bot commented Mar 21, 2025

Codecov Report

Attention: Patch coverage is 82.75862% with 10 lines in your changes missing coverage. Please review.

Project coverage is 77.19%. Comparing base (d45ed9c) to head (60420eb).
Report is 35 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| userspace/libsinsp/test/sinsp_with_test_input.cpp | 77.27% | 5 Missing ⚠️ |
| userspace/libsinsp/plugin.cpp | 50.00% | 2 Missing ⚠️ |
| userspace/libsinsp/sinsp_filtercheck.cpp | 83.33% | 2 Missing ⚠️ |
| userspace/libsinsp/test/filter_compiler.ut.cpp | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2322      +/-   ##
==========================================
+ Coverage   77.17%   77.19%   +0.01%     
==========================================
  Files         227      227              
  Lines       30192    30250      +58     
  Branches     4607     4625      +18     
==========================================
+ Hits        23302    23351      +49     
- Misses       6890     6899       +9     

| Flag | Coverage Δ |
|---|---|
| libsinsp | 77.19% <82.75%> (+0.01%) ⬆️ |


@github-actions

github-actions bot commented Mar 21, 2025

Perf diff from master - unit tests

     3.96%     +0.76%  [.] sinsp_thread_manager::find_thread
     6.65%     -0.69%  [.] sinsp::next
     0.46%     +0.46%  [.] sinsp_parser::parse_context_switch
    36.01%     -0.45%  [.] sinsp_thread_manager::create_thread_dependencies
     1.13%     +0.34%  [.] next
     2.28%     +0.26%  [.] sinsp_evt::load_params
     1.20%     -0.26%  [.] libsinsp::sinsp_suppress::process_event
     4.04%     -0.24%  [.] next_event_from_file
     5.49%     -0.18%  [.] sinsp_evt::get_type
     0.36%     +0.17%  [.] sinsp_evt::get_param

Heap diff from master - unit tests

peak heap memory consumption: 0B
peak RSS (including heaptrack overhead): 0B
total memory leaked: 0B

Heap diff from master - scap file

peak heap memory consumption: 0B
peak RSS (including heaptrack overhead): 0B
total memory leaked: 0B

Benchmarks diff from master

Comparing gbench_data.json to /root/actions-runner/_work/libs/libs/build/gbench_data.json
Benchmark                                                         Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------
BM_sinsp_split_mean                                            +0.0105         +0.0106           149           151           149           151
BM_sinsp_split_median                                          +0.0106         +0.0107           149           151           149           151
BM_sinsp_split_stddev                                          -0.4618         -0.4622             0             0             0             0
BM_sinsp_split_cv                                              -0.4674         -0.4679             0             0             0             0
BM_sinsp_concatenate_paths_relative_path_mean                  +0.0494         +0.0496            58            60            58            60
BM_sinsp_concatenate_paths_relative_path_median                +0.0521         +0.0523            57            60            57            60
BM_sinsp_concatenate_paths_relative_path_stddev                -0.6176         -0.6176             0             0             0             0
BM_sinsp_concatenate_paths_relative_path_cv                    -0.6356         -0.6356             0             0             0             0
BM_sinsp_concatenate_paths_empty_path_mean                     +0.0444         +0.0445            24            25            24            25
BM_sinsp_concatenate_paths_empty_path_median                   +0.0439         +0.0440            24            25            24            25
BM_sinsp_concatenate_paths_empty_path_stddev                   +1.5770         +1.5814             0             0             0             0
BM_sinsp_concatenate_paths_empty_path_cv                       +1.4675         +1.4714             0             0             0             0
BM_sinsp_concatenate_paths_absolute_path_mean                  +0.0758         +0.0759            59            63            59            63
BM_sinsp_concatenate_paths_absolute_path_median                +0.0709         +0.0710            59            63            59            63
BM_sinsp_concatenate_paths_absolute_path_stddev                +0.2846         +0.2848             1             1             1             1
BM_sinsp_concatenate_paths_absolute_path_cv                    +0.1940         +0.1941             0             0             0             0

Contributor

@gnosek gnosek left a comment

I know the use case you're aiming for but I'm not sure this is the right approach. You cannot rely on knowing a plugin's "extract metadata" extract field, so you still need explicit per-plugin support. At that point, you don't need the new capability, since you know the plugin supports it.

To me it looks like you could simply:

  • extend ss_plugin_extract_field with:
    bool extract_offsets;
    u32 start_offset;
    u32 end_offset;
    
  • set extract_offsets = true if you want the actual byte offsets (so we don't pay the cost unnecessarily)
  • if extract_offsets == true, fill these fields in the plugin and reset extract_offsets to false (to mark that the offsets are valid; we could also have an enum here)

So then every extraction will also include the byte offsets without introducing new capabilities or new data types and you don't need to support plugins explicitly.

(cc @jasondellaluce for the stuff below)

Though I just realized the comment for ss_plugin_extract_field is wrong and adding new fields at the end will break the ABI, since we pass them as a pointer-to-array, so changing the size will make things explode when num_fields>1.

We never seem to actually submit multiple extract requests (sinsp_filter_check_plugin::extract_nocache just sets num_fields=1) so I guess we might as well deprecate the functionality (and possibly reintroduce it later once we actually have a use case for it)
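
The ABI concern above can be illustrated with a minimal sketch. The two structs below are hypothetical stand-ins, not the real `ss_plugin_extract_field` layout; they only show that once members are appended, a plugin built against the old layout and a framework built against the new one disagree on where every array element after the first begins.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins: NOT the real ss_plugin_extract_field layout. */
typedef struct field_v1 {
    uint32_t field_id;
    uint64_t res;
} field_v1;

typedef struct field_v2 {
    uint32_t field_id;
    uint64_t res;
    uint32_t start_offset; /* members appended at the end */
    uint32_t end_offset;
} field_v2;

/* Byte offset at which code compiled against v1 expects element i. */
size_t old_plugin_offset(size_t i) { return i * sizeof(field_v1); }

/* Byte offset at which a framework using v2 actually placed element i. */
size_t new_framework_offset(size_t i) { return i * sizeof(field_v2); }

/* Returns 1 when both sides agree on where element i starts. */
int layouts_agree(size_t i) {
    return old_plugin_offset(i) == new_framework_offset(i);
}
```

Element 0 is found at the same address by both sides, which is why the problem stays hidden as long as `num_fields` is 1; any later element is read from the wrong place.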

Member

@leogr leogr left a comment

EDIT: I did not notice the above comment from @gnosek. We are proposing a similar thing in the end.

Hey @geraldcombs,

Thank you for this PR. Although I totally agree with fixing the issue of providing the location of each field, I think the approach used in this PR does not fully align with the design of the plugin API. I've also discussed this privately with @jasondellaluce, and we mostly agree.

In detail:

  • Adding these metadata looks to me like an extension of the current capability and not a distinct capability (which would also introduce unnecessary cross-capability dependency)
  • Creating a kind of polymorphism for the extraction function is too cumbersome and has a non-negligible performance impact (the consumer will have to call it twice)
  • Field types are intended to map to field types in libs, not to provide metadata. So adding FTYPE_METADATA_SLICE for this purpose seems misleading to us.
  • More straightforward solutions that align with the API's original design are available.

So, I want to propose a feasible alternative:

ss_plugin_extract_field allows us to add members in a backward-compatible way. So we can either:

  • Add two u32 members to hold the positions
  • Or a more complex datatype to allow generic metadata (I'd not go for this path unless we have a compelling use case since this would introduce another opaque format which both the plugin and the framework need to deal with)

This will not require introducing a new FTYPE. Moreover, plugins that do not support this feature will not populate those members, so there's no need to introduce a new capability either.

In the same struct, we can add another member that signals to the plugin when the consumer requires these metadata (e.g., ss_plugin_bool request_metadata) so that the plugin will produce that result only when requested.

Finally, this will require some additions to sinsp_filter_check_plugin (or perhaps in some other sinsp API) to expose these data to the final consumer.

@leogr
Member

leogr commented Mar 21, 2025

Though I just realized the comment for ss_plugin_extract_field is wrong and adding new fields at the end will break the ABI, since we pass them as a pointer-to-array, so changing the size will make things explode when num_fields>1.

We never seem to actually submit multiple extract requests (sinsp_filter_check_plugin::extract_nocache just sets num_fields=1) so I guess we might as well deprecate the functionality (and possibly reintroduce it later once we actually have a use case for it)

@gnosek You are right. I did not consider that. If we deprecate batch extraction, we will be fine. But I'm not sure of the consequences. @jasondellaluce wdyt?

@geraldcombs
Contributor Author

@gnosek @leogr thanks for the review!

I initially started with the idea that it would be useful to be able to extract multiple types of metadata, but as you point out, just adding offsets is cleaner and more efficient. This and the related PRs should be more in line with your suggestions.

We never seem to actually submit multiple extract requests (sinsp_filter_check_plugin::extract_nocache just sets num_fields=1) so I guess we might as well deprecate the functionality (and possibly reintroduce it later once we actually have a use case for it)

Stratoshark uses it. It's much faster than extracting each field individually, at least for the cloudtrail plugin.

@geraldcombs
Contributor Author

Though I just realized the comment for ss_plugin_extract_field is wrong and adding new fields at the end will break the ABI, since we pass them as a pointer-to-array, so changing the size will make things explode when num_fields>1.

I'm not sure it's something we should draw inspiration from, but the Win32 OPENFILENAME struct has an lStructSize member, presumably to solve the same sort of problem. https://learn.microsoft.com/en-us/windows/win32/api/commdlg/ns-commdlg-openfilenamew

@gnosek
Contributor

gnosek commented Mar 22, 2025

@geraldcombs,

Stratoshark uses it. It's much faster than extracting each field individually, at least for the cloudtrail plugin.

Out of pure ignorance, does performance matter (that much) for Stratoshark? I.e. do you do mass field extraction on entire captures (where I can see how perf would be critical), or only on the current one selected in the GUI (where I'd expect it to be lost in the noise)?

I'm not sure it's something we should draw inspiration from, but the Win32 OPENFILENAME struct has an lStructSize member, presumably to solve the same sort of problem. https://learn.microsoft.com/en-us/windows/win32/api/commdlg/ns-commdlg-openfilenamew

Yup, embedding the struct size as a member is a classic solution to this problem. This is going to be painful for SDKs, since you can no longer treat ss_plugin_field_extract_input.fields as an array of objects (it's going to become an array of objects-with-unknown-padding-in-between).

How about we try replacing

uint32_t num_fields;
ss_plugin_extract_field* fields;

with a union (and clean it up properly for v4):

union {
    struct {
        uint32_t num_fields;
        ss_plugin_extract_field* fields;
    } legacy;
    struct {
        uint16_t size_of_field;
        uint16_t num_fields;
        ss_plugin_extract_field* fields;
    } flexible;
};

This way we can keep the ABI compatible: new plugins will know that if size_of_field==0, then it's whatever we have now and the plugin framework can enforce size_of_field==0 for legacy plugins and num_fields==1 after we change the struct layout.
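
As a rough sketch of what the "flexible" variant would ask of a plugin or SDK: elements can no longer be indexed as `fields[i]` and have to be located by byte stride instead. The `extract_field` type and `field_at` helper below are hypothetical and much smaller than the real `ss_plugin_extract_field`; they only demonstrate the pointer arithmetic, with `size_of_field == 0` treated as the legacy layout as proposed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical minimal element; the real struct has more members. */
typedef struct extract_field {
    uint32_t field_id;
} extract_field;

/* Return element i of an array whose element stride is only known at run time.
 * size_of_field == 0 means "legacy layout", i.e. the stride this code was
 * compiled with. */
extract_field *field_at(void *fields, uint16_t size_of_field, uint16_t i) {
    size_t stride = size_of_field ? size_of_field : sizeof(extract_field);
    /* The caller must never advertise an element smaller than the one we know. */
    assert(stride >= sizeof(extract_field));
    return (extract_field *)((unsigned char *)fields + (size_t)i * stride);
}
```

This is the "array of objects with unknown padding in between" cost mentioned earlier: every access goes through a helper like this instead of plain array indexing.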

For v4, I'd consider using an array of pointers to ss_plugin_extract_field, rather than just an array of objects

Also, as a hardening measure against future changes:

  • add an optional char padding[SIZE] to each struct
  • build the framework with padding
  • build the test suite without padding
  • ensure the tests still work

Also, it feels like we should start working on API v4 soon, there are enough // TODO(v4) comments already :)

cc @leogr @jasondellaluce

EDIT: we'll have to handle the uint16_t ordering based on arch endianness :|
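
The endianness caveat can be made concrete with a small sketch (the `halves` struct and both helpers are hypothetical names). It reinterprets a legacy 32-bit `num_fields` as two consecutive `uint16_t` values, which is what the union would do, and shows that the half in which the count lands depends on the host byte order.

```c
#include <stdint.h>
#include <string.h>

typedef struct halves {
    uint16_t first;  /* the two lowest-addressed bytes */
    uint16_t second; /* the next two bytes */
} halves;

/* Reinterpret a legacy 32-bit num_fields as two consecutive uint16_t values. */
halves split_legacy_count(uint32_t num_fields) {
    halves h;
    memcpy(&h, &num_fields, sizeof h); /* byte copy avoids aliasing problems */
    return h;
}

/* Returns 1 on little-endian hosts, 0 on big-endian ones. */
int is_little_endian(void) {
    uint16_t probe = 1;
    unsigned char low;
    memcpy(&low, &probe, 1);
    return low == 1;
}
```

On a little-endian host a legacy `num_fields` of 1 shows up in `first`, so `size_of_field` would have to be declared second there in order to read as 0 for legacy callers; on a big-endian host the order in the union above already works.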

@leogr
Copy link
Member

leogr commented Mar 24, 2025

Hey @geraldcombs and @gnosek

Although I generally like embedding the size, I'm afraid that introducing it now would be a hybrid solution in the APIs. Thus, I'd propose postponing it to v4 (unless we have no other alternative).

By rethinking it a bit (and apologizing for going back and forth), I came up with a simpler idea (possibly 😅 ).
Basically, since it makes sense to request offset metadata regardless of the value of num_fields (i.e., I don't believe asking for it just for some fields in the same batch is a compelling use case), we can just move this logic up to ss_plugin_field_extract_input and use a distinct struct for holding metadata. For example:

typedef struct ss_plugin_extract_field_metadata {
	uint32_t start_offset;
	uint32_t end_offset;
} ss_plugin_extract_field_metadata;

// Input passed to the plugin when extracting a field from an event for
// the field extraction capability.
typedef struct ss_plugin_field_extract_input {
       ...
	//
	// The length of the fields array (and fields_metadata array, if metadata is requested).
	uint32_t num_fields;
	//
	// An array of ss_plugin_extract_field structs. Each entry
	// contains a single field + optional argument as input, and the corresponding
	// extracted value as output. Memory pointers set as output must be allocated
	// by the plugin and must not be deallocated or modified until the next
	// extract_fields() call.
	ss_plugin_extract_field* fields;

       ...

	// An array of ss_plugin_extract_field_metadata structs. 
	// The array is allocated only if some metadata flags (see below) will be enabled.
	// Indexed like `fields`.
	ss_plugin_extract_field_metadata* fields_metadata;
	
	// If true, signal that the framework wants offsets metadata
	ss_plugin_bool metadata_offsets;
} ss_plugin_field_extract_input;

Appending members to ss_plugin_field_extract_input shouldn't be an issue since this struct is:

  • never used in an array
  • always passed as a pointer-to-struct 👇
    	    ss_plugin_rc (*extract_fields)(ss_plugin_t* s,
    	                                   const ss_plugin_event_input* evt,
    	                                   const ss_plugin_field_extract_input* in);
    

By doing so, we can retain full backward compatibility.

wdyt?

@gnosek
Contributor

gnosek commented Mar 24, 2025

we can just move this logic up to ss_plugin_field_extract_input and use a distinct struct for holding metadata.

Big +1 from me, though you just did the same thing as we originally did with fields -- passing an array of structs across the plugin boundary :)

We can make this less painful by embedding the size from day 1, but it will still involve pointer arithmetic in the plugin. Doing this the next_batch way (ss_plugin_extract_field_metadata***) might be overkill for a pair of u32s though.

@ekoops
Contributor

ekoops commented Mar 24, 2025

I like @leogr's idea but, as pointed out by @gnosek, it introduces the same problem we have now with exposing an array of structs. If solving that problem involves including the size of the struct, there is no big advantage over @gnosek's solution in #2322 (comment), so I would go with that one.

@leogr
Member

leogr commented Mar 24, 2025

@gnosek and @ekoops you're totally right and thank you for helping with brainstorming this. So, considering your point, I'd keep it as simple as possible by going very specific for the offsets use case, thus something like:

typedef struct ss_plugin_extract_field_offsets {
	uint32_t start_offset;
	uint32_t end_offset;
} ss_plugin_extract_field_offsets;

// Input passed to the plugin when extracting a field from an event for
// the field extraction capability.
typedef struct ss_plugin_field_extract_input {
       ...
	//
	// The length of the fields array (and fields_offsets array, if it is requested).
	uint32_t num_fields;
	//
	// An array of ss_plugin_extract_field structs. Each entry
	// contains a single field + optional argument as input, and the corresponding
	// extracted value as output. Memory pointers set as output must be allocated
	// by the plugin and must not be deallocated or modified until the next
	// extract_fields() call.
	ss_plugin_extract_field* fields;

       ...

	// If true, signal that the framework wants offsets to be computed
	ss_plugin_bool request_offsets;

	// An array of ss_plugin_extract_field_offsets structs. 
	// The array is allocated only if `request_offsets` is true.
	// Indexed as `fields`.
	ss_plugin_extract_field_offsets* fields_offsets;
	
} ss_plugin_field_extract_input;

If we need to add new metadata before v4, we will continue with this pattern since I don't expect we will need too many extensions.
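
A minimal sketch of how a plugin might honor this layout, under loud assumptions: `extract_input`, its `needles` member, and `extract_with_offsets` are hypothetical and heavily simplified (the real `ss_plugin_field_extract_input` carries `ss_plugin_extract_field` entries, not strings). The point is only that the offsets array is parallel to the fields array and is written only when requested.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct field_offsets {
    uint32_t start_offset;
    uint32_t end_offset;
} field_offsets;

/* Hypothetical, simplified input: one literal to locate per requested field. */
typedef struct extract_input {
    uint32_t num_fields;
    const char **needles;
    int request_offsets;           /* non-zero: the caller wants offsets */
    field_offsets *fields_offsets; /* caller-allocated, num_fields entries */
} extract_input;

/* Locate each needle in the payload and, only when asked, record its
 * [start, end) byte range at the same index. Returns the number found. */
uint32_t extract_with_offsets(const char *payload, const extract_input *in) {
    uint32_t found = 0;
    for (uint32_t i = 0; i < in->num_fields; i++) {
        const char *hit = strstr(payload, in->needles[i]);
        if (hit == NULL) {
            continue; /* leave whatever the caller initialized */
        }
        found++;
        if (in->request_offsets && in->fields_offsets != NULL) {
            uint32_t start = (uint32_t)(hit - payload);
            in->fields_offsets[i].start_offset = start;
            in->fields_offsets[i].end_offset = start + (uint32_t)strlen(in->needles[i]);
        }
    }
    return found;
}
```

If the caller zero-initializes `fields_offsets`, entries the plugin never touched stay at 0..0, which is one way to tell them apart from filled ones.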

@gnosek
Contributor

gnosek commented Mar 24, 2025

I can work with this :) (but please let's consider v4 some time this century)

How will the plugin indicate that the offsets were actually generated? Setting ->request_offsets to false is one option, or the caller could just zero-init the array and treat 0..0 ranges as explicitly missing

@leogr
Member

leogr commented Mar 24, 2025

I can work with this :) (but please let's consider v4 some time this century)

How will the plugin indicate that the offsets were actually generated? Setting ->request_offsets to false is one option, or the caller could just zero-init the array and treat 0..0 ranges as explicitly missing

I believe 0..0 is ok to signal offsets weren't generated.

@geraldcombs
Contributor Author

Out of pure ignorance, does performance matter (that much) for Stratoshark? I.e. do you do mass field extraction on entire captures (where I can see how perf would be critical), or only on the current one selected in the GUI (where I'd expect it to be lost in the noise)?

We do both. Although we try to avoid mass field extraction (full dissection) as much as possible, we have to do so when initially loading a file so that we can properly render event columns and correlate events. Full redissection also happens any time the user applies a display filter, changes profiles, and opens various analysis windows.

We dissect single events whenever the user selects an event so that we can build the detail tree, and in the background in order to colorize the event list.

@geraldcombs
Contributor Author

// Input passed to the plugin when extracting a field from an event for
// the field extraction capability.
typedef struct ss_plugin_field_extract_input {
       ...
	//
	// The length of the fields array (and fields_offsets array, if it is requested).
	uint32_t num_fields;
	//
	// An array of ss_plugin_extract_field structs. Each entry
	// contains a single field + optional argument as input, and the corresponding
	// extracted value as output. Memory pointers set as output must be allocated
	// by the plugin and must not be deallocated or modified until the next
	// extract_fields() call.
	ss_plugin_extract_field* fields;

       ...

	// If true, signal that the framework wants offsets to be computed
	ss_plugin_bool request_offsets;

	// An array of ss_plugin_extract_field_offsets structs. 
	// The array is allocated only if `request_offsets` is true.
	// Indexed as `fields`.
	ss_plugin_extract_field_offsets* fields_offsets;
	
} ss_plugin_field_extract_input;

Is fields_offsets supposed to be allocated by the caller? If so, can we omit request_offsets and just set fields_offsets to NULL to indicate that we don't want offsets?
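
To make the question concrete, here is a sketch of the pointer-as-flag variant, assuming the caller allocates the array. The types and the `offsets_requested` / `record_offset` helpers are hypothetical; only the idea that a NULL `fields_offsets` means "offsets not wanted" comes from the discussion.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct field_offsets {
    uint32_t start_offset;
    uint32_t end_offset;
} field_offsets;

/* Hypothetical, simplified input with no separate request flag. */
typedef struct extract_input {
    uint32_t num_fields;
    field_offsets *fields_offsets; /* NULL: the caller does not want offsets */
} extract_input;

/* The pointer itself is the request. */
int offsets_requested(const extract_input *in) {
    return in->fields_offsets != NULL;
}

/* Plugin-side helper: store offsets for field i only if they were requested. */
void record_offset(const extract_input *in, uint32_t i, uint32_t start, uint32_t end) {
    if (!offsets_requested(in) || i >= in->num_fields) {
        return;
    }
    in->fields_offsets[i].start_offset = start;
    in->fields_offsets[i].end_offset = end;
}
```

The design choice here is one fewer member to keep in sync: there is no state in which the flag says "yes" while the array is missing.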

@geraldcombs
Contributor Author

I believe 0..0 is ok to signal offsets weren't generated.

Stratoshark and Wireshark use the term "generated" to refer to a field that is present but doesn't correspond to event or packet bytes. Generated fields are shown in the UI with square brackets, e.g. the next TCP sequence number which is computed from the current sequence number + the segment length:

    Sequence Number (raw): 3247163969
    [Next Sequence Number: 90    (relative sequence number)]

For libs field extraction, I'm currently interpreting 0..0 to mean "generated" in the Stratoshark/Wireshark sense.

@geraldcombs
Contributor Author

Maybe I'm missing something, but what about using start and length instead of start and end? Assuming both are unsigned, they help simplify checks (i.e., just a simple start + length < data.size())

I started with the cloudtrail plugin and ended up using Go's slice notation. Switching to start+length isn't a problem if that would be preferred; we would just need to establish a convention for determining whether or not the plugin supports offsets. The current code initializes start+end to an invalid pair (1..0), but it's easy enough to do the same thing with a start+length pair, e.g. (UINT32_MAX, UINT32_MAX).

Stratoshark, Wireshark, and tcpdump all use start+length as a convention for offsets, so I switched to that.
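
A sketch of the start+length convention with the (UINT32_MAX, UINT32_MAX) "not filled in" pair mentioned above. The `field_range` type, the `OFFSET_UNSET` name, and both helpers are hypothetical. One deliberate deviation from the quoted `start + length < data.size()` check: the comparison is arranged so that `start + length` is never computed, since that sum could wrap around for hostile or buggy values.

```c
#include <stdint.h>

#define OFFSET_UNSET UINT32_MAX /* hypothetical sentinel name */

typedef struct field_range {
    uint32_t start;
    uint32_t length;
} field_range;

/* Caller side: mark every slot as "not filled in" before calling the plugin. */
void init_ranges(field_range *ranges, uint32_t n) {
    for (uint32_t i = 0; i < n; i++) {
        ranges[i].start = OFFSET_UNSET;
        ranges[i].length = OFFSET_UNSET;
    }
}

/* Returns 1 if the plugin supplied a range lying entirely inside data_size
 * bytes, 0 if the slot is still unset or the range runs past the end. */
int range_is_valid(field_range r, uint32_t data_size) {
    if (r.start == OFFSET_UNSET && r.length == OFFSET_UNSET) {
        return 0;
    }
    /* Avoid start + length, which could overflow uint32_t. */
    return r.start <= data_size && r.length <= data_size - r.start;
}
```

A plugin that does not support offsets leaves the sentinel pair in place, so the consumer can distinguish "no offsets available" from a genuine zero-length range.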

@FedeDP
Contributor

FedeDP commented Apr 9, 2025

/milestone 0.21.0

@poiana poiana added this to the 0.21.0 milestone Apr 9, 2025
geraldcombs and others added 6 commits April 9, 2025 12:15
Add ss_plugin_extract_field_offsets as a companion struct to
ss_plugin_extract_field.

Signed-off-by: Gerald Combs <gerald@wireshark.org>
Remove field_offsets from ss_plugin_field_extract_input. We can just
check to see if field_offsets is set. Update some comments.

Signed-off-by: Gerald Combs <gerald@wireshark.org>
Add extraction offsets to the filter cache. Add an offset parameter to
the various extract_nocache functions. Implement offset extraction in
sinsp_filter_check_plugin::extract_nocache, and ignore offsets
elsewhere. Add sinsp_filter_check::extract_with_offsets. Add an offsets
test to plugins.ut.cpp.

Signed-off-by: Gerald Combs <gerald@wireshark.org>
Co-authored-by: Federico Di Pierro <nierro92@gmail.com>
Signed-off-by: Gerald Combs <gerald@wireshark.org>
Wireshark and tcpdump both handle offsets using start+length pairs, so
use that convention here.

Signed-off-by: Gerald Combs <gerald@wireshark.org>
Signed-off-by: Gerald Combs <gerald@wireshark.org>
@geraldcombs geraldcombs force-pushed the plugin-slice-ftype branch 2 times, most recently from bf8f202 to 256bada Compare April 9, 2025 19:34
Add support for extracting offsets for each value instead of just the
first one.

Signed-off-by: Gerald Combs <gerald@wireshark.org>
sinsp_filter_extract_cache::offset() was unused, so remove it.

Signed-off-by: Gerald Combs <gerald@wireshark.org>
@leogr leogr requested review from FedeDP, gnosek and leogr April 15, 2025 08:18
@poiana
Contributor

poiana commented Apr 15, 2025

LGTM label has been added.

Git tree hash: 732b14f0b7b277c03f1fd6f4128dc484038014e1

Contributor

@FedeDP FedeDP left a comment

/approve

@github-project-automation github-project-automation bot moved this from Todo to In progress in Falco Roadmap Apr 18, 2025
@poiana
Contributor

poiana commented Apr 18, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: FedeDP, geraldcombs

@poiana poiana merged commit 24539f5 into falcosecurity:master Apr 18, 2025
47 checks passed
@github-project-automation github-project-automation bot moved this from In progress to Done in Falco Roadmap Apr 18, 2025

6 participants