bridge.pipelines.utils.cleaning module
Utilities for cleaning and canonicalizing objects.
The functions are intentionally conservative: they avoid guessing semantics and focus only on reversible, mechanical clean-ups and canonical forms that improve comparison and stability.
- bridge.pipelines.utils.cleaning.canonicalize_shields_url(url)
Canonicalize a shields.io image URL for stable comparison.
This helper focuses on shields.io badge URLs served from img.shields.io. The main goal is to strip out purely cosmetic parameters that shouldn’t affect logical equality (e.g. user-chosen logos) and to provide a stable query parameter ordering.
Behaviour:
  - If the URL does not contain img.shields.io, it is returned unchanged.
  - If it does, the query string is parsed, the logo parameter is removed (case-insensitive), and the remaining parameters are sorted lexicographically and re-encoded.
- Parameters:
url (str) – The shields.io URL to canonicalize.
- Returns:
The canonicalized URL, suitable for equality checks or deduplication.
- Return type:
str
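The behaviour described above can be approximated with the standard library. This is a minimal sketch, not the module's actual source: the name canonicalize_shields_url_sketch is hypothetical, and keeping blank parameter values is an assumption the documentation does not confirm.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonicalize_shields_url_sketch(url: str) -> str:
    """Illustrative stand-in for canonicalize_shields_url (hypothetical name)."""
    if "img.shields.io" not in url:
        return url  # non-shields URLs pass through unchanged

    parts = urlsplit(url)
    # Assumption: blank values such as "?label=" are kept rather than dropped.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Drop the cosmetic "logo" parameter, matching the key case-insensitively,
    # then sort the remaining pairs so the ordering is stable.
    kept = sorted((key, val) for key, val in pairs if key.lower() != "logo")
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Under this sketch, two badge URLs that differ only in their logo parameter or in parameter order compare equal after canonicalization.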
- bridge.pipelines.utils.cleaning.canonicalize_url(url)
Canonicalize a generic URL for stable comparison.
This performs a minimal, well-defined normalization intended to make string-based URL comparisons less fragile without changing semantics for typical HTTP(S) URLs.
Normalizations applied:
  - Lowercase the scheme and netloc (host + port).
  - Strip trailing slashes from the path, but ensure the path is at least “/”.
  - Parse the query string into key/value pairs, sort them, and re-encode (preserving multiplicity via doseq=True).
  - Drop the fragment entirely (anything after ‘#’).
- Parameters:
url (str) – The URL to canonicalize.
- Returns:
The canonicalized URL.
- Return type:
str
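The four normalizations listed above map closely onto urllib.parse. The following is a sketch under assumptions: canonicalize_url_sketch is a hypothetical name, and the use of parse_qsl with keep_blank_values=True is a guess at how the query is parsed, so details such as the ordering of repeated keys may differ from the real function.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonicalize_url_sketch(url: str) -> str:
    """Illustrative stand-in for canonicalize_url (hypothetical name)."""
    parts = urlsplit(url)
    # Strip trailing slashes, but never leave the path empty.
    path = parts.path.rstrip("/") or "/"
    # Sort the key/value pairs; repeated keys are kept as separate pairs.
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    query = urlencode(pairs, doseq=True)
    # An empty final component drops the fragment.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))
```

With this sketch, URLs that differ only in host case, a trailing slash, query ordering, or fragment reduce to the same string.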
- bridge.pipelines.utils.cleaning.escape_shields_part(value)
Prepare a label or message for use in a Shields.io badge path segment.
Shields.io encodes meaning into certain characters in the path portion of the URL. This function escapes a free-form string so that it can be safely embedded in that position without accidentally triggering Shields’ special syntax.
Shields path semantics:
  - - = segment separator
  - -- = literal -
  - _ = space
  - __ = literal _
This function:
  - Converts - to --.
  - Converts _ to __.
  - Percent-encodes anything else that needs escaping, but leaves _ unchanged so that Shields can interpret it as a space.
- Parameters:
value (str) – The label or message to escape.
- Returns:
The escaped value, suitable for direct inclusion in a Shields.io URL path segment.
- Return type:
str
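As a rough illustration of the escaping rules above, the doubling and percent-encoding steps could look like the sketch below. The name escape_shields_part_sketch is hypothetical, and the documentation does not say how the real function treats spaces; this sketch percent-encodes them, which is an assumption.

```python
from urllib.parse import quote


def escape_shields_part_sketch(value: str) -> str:
    """Illustrative stand-in for escape_shields_part (hypothetical name)."""
    # Double the characters that carry meaning in a Shields path segment.
    doubled = value.replace("-", "--").replace("_", "__")
    # Percent-encode everything else; "_" is listed as safe so the doubled
    # underscores are left alone, and "/" is encoded because it is not in safe.
    return quote(doubled, safe="_")
```

A label such as code-style would come out as code--style, so Shields renders a literal hyphen rather than starting a new segment.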
- bridge.pipelines.utils.cleaning.normalize_color(value)
Normalize a color value by stripping a leading ‘#’ and percent-encoding.
Behaviour:
  - Coerces the value to a string and strips surrounding whitespace.
  - Removes a leading # if present.
  - Percent-encodes the remaining value with no safe characters.
- Parameters:
value (str) – The color value to normalize (e.g. “#4c1”, “brightgreen”).
- Returns:
The normalized color string, without a leading ‘#’, and percent-encoded for safe use in URLs.
- Return type:
str
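The three steps above are small enough to sketch directly. The name normalize_color_sketch is hypothetical, and the code is an approximation of the described behaviour rather than the module's source.

```python
from urllib.parse import quote


def normalize_color_sketch(value) -> str:
    """Illustrative stand-in for normalize_color (hypothetical name)."""
    text = str(value).strip()
    if text.startswith("#"):
        text = text[1:]  # remove one leading "#"
    # safe="" means even "/" is percent-encoded.
    return quote(text, safe="")
```

Hex values and named colors such as “#4c1” or “brightgreen” contain only unreserved characters, so for them the percent-encoding step is a no-op.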
- bridge.pipelines.utils.cleaning.normalize_dict_strings(d)
Recursively normalize all string-like values in a dictionary.
This function walks the entire structure of the given dict and applies normalize_text() to any string or None value it encounters. Keys are left unchanged. Non-string scalar values (numbers, booleans, etc.) are preserved as-is.
- Parameters:
d (dict[str, Any]) – The dictionary whose string values (and nested structures) should be normalized.
- Returns:
A new dictionary with the same structure as d, where all nested string/None values have been normalized.
- Return type:
dict[str, Any]
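The recursive walk can be pictured with the sketch below. All names here are hypothetical: _clean is a simplified placeholder for normalize_text() that only collapses whitespace, and the handling of lists, tuples, and sets is an assumption about what “nested structures” covers.

```python
from typing import Any


def _clean(value):
    """Placeholder for normalize_text(): collapse whitespace, pass None through."""
    return None if value is None else " ".join(value.split())


def _walk(obj: Any) -> Any:
    if obj is None or isinstance(obj, str):
        return _clean(obj)
    if isinstance(obj, dict):
        # Keys are left unchanged; only values are visited.
        return {key: _walk(val) for key, val in obj.items()}
    if isinstance(obj, (list, tuple, set)):
        # Rebuild the same container type from the normalized items.
        return type(obj)(_walk(item) for item in obj)
    return obj  # numbers, booleans, and other scalars are preserved


def normalize_dict_strings_sketch(d: dict) -> dict:
    """Illustrative stand-in for normalize_dict_strings (hypothetical name)."""
    return {key: _walk(val) for key, val in d.items()}
```

Because every level is rebuilt, the sketch returns a new dictionary and leaves the input untouched, in line with the documented return value.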
- bridge.pipelines.utils.cleaning.normalize_pydantic_model_strings(model)
Recursively normalize all string fields in a Pydantic model in-place.
For each declared field on the model:
  - If the current value is a string or None, it is passed through normalize_text().
  - If the current value is a container (dict, list, tuple, set), its contents are normalized recursively via _normalize_structure().
  - If the current value is another Pydantic model, it is normalized recursively by calling normalize_pydantic_model_strings on it.
If the object does not look like a Pydantic model (i.e. has neither model_fields nor __fields__), it is returned unchanged.
- Parameters:
model (Any) – The Pydantic model instance to normalize, or any other object.
- Returns:
The same model object, potentially modified in-place if it is a Pydantic model with string fields or nested containers.
- Return type:
Any
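The duck-typed check and in-place update described above might be structured as in the following sketch. It deliberately avoids importing Pydantic: any object exposing model_fields or __fields__ is treated as a model. Every name is hypothetical, and _clean is a simplified placeholder for normalize_text(), so this shows the control flow only, not the module's implementation.

```python
from typing import Any


def _clean(value):
    """Placeholder for normalize_text(): collapse whitespace, pass None through."""
    return None if value is None else " ".join(value.split())


def _normalize_structure_sketch(obj: Any) -> Any:
    if obj is None or isinstance(obj, str):
        return _clean(obj)
    if isinstance(obj, dict):
        return {k: _normalize_structure_sketch(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple, set)):
        return type(obj)(_normalize_structure_sketch(i) for i in obj)
    # Anything else may be a nested model; non-models come back unchanged.
    return normalize_model_strings_sketch(obj)


def normalize_model_strings_sketch(model: Any) -> Any:
    """Illustrative stand-in for normalize_pydantic_model_strings."""
    fields = getattr(model, "model_fields", None)
    if fields is None:
        fields = getattr(model, "__fields__", None)
    if fields is None:
        return model  # does not look like a Pydantic model
    for name in fields:
        # Assign back onto the same instance, i.e. modify in place.
        setattr(model, name, _normalize_structure_sketch(getattr(model, name)))
    return model
```

The sketch returns the same instance it was given, so callers can either use the return value or rely on the mutation.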
- bridge.pipelines.utils.cleaning.normalize_text(value, normalize_multiline=True)
Clean and normalize free-text values.
This is a general-purpose text scrubber intended to remove presentation artefacts (HTML entities/tags, box-drawing characters) and normalize whitespace so that strings are more suitable for storage, comparison, or inclusion in metadata formats.
Steps performed:
  1. Decode HTML entities (e.g. &lt;i&gt; → <i>, &amp; → &).
  2. Strip all remaining HTML tags (e.g. <i>name</i> → name).
  3. Replace the box-drawing dash ─ with a plain ASCII -.
  4. Remove non-printable characters.
  5. Optionally collapse all whitespace, including newlines, into single spaces (normalize_multiline=True).
  6. Strip leading and trailing whitespace.
- Parameters:
value (str | None) – The text to normalize. If None, the function returns None.
normalize_multiline (bool, optional) – If True (default), newlines and runs of whitespace are collapsed into single spaces. If False, existing line breaks are preserved and only non-printable characters and HTML artefacts are removed.
- Returns:
The normalized text, or None if the input was None.
- Return type:
str | None
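The documented steps can be approximated with html and re from the standard library. This is a sketch under assumptions, with a hypothetical name: the tag-stripping regex and the decision to keep newlines and tabs while filtering non-printable characters are guesses, not details taken from the module.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")  # assumption: a simple tag pattern


def normalize_text_sketch(value, normalize_multiline=True):
    """Illustrative stand-in for normalize_text (hypothetical name)."""
    if value is None:
        return None
    text = html.unescape(value)        # 1. decode HTML entities
    text = _TAG_RE.sub("", text)       # 2. strip remaining tags
    text = text.replace("\u2500", "-") # 3. box-drawing dash to ASCII hyphen
    # 4. drop non-printable characters, keeping newlines and tabs so that
    #    line breaks survive when normalize_multiline is False.
    text = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    if normalize_multiline:
        text = re.sub(r"\s+", " ", text)  # 5. collapse all whitespace
    return text.strip()                # 6. trim both ends
```

Decoding entities before stripping tags matters: an escaped tag only becomes a real tag after step 1, so reversing the order would leave it in the output.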