Whitespace cleanup and HTML stripping are the first two steps in FeedPrep's normalization pipeline. They run before any column mapping, value mapping, or transformation, so downstream steps work with clean, consistent text.
Step 1: Whitespace Cleanup
Supplier feeds frequently contain messy whitespace: extra spaces, leading or trailing blanks, and fields that look like they have a value but are actually just empty space. Whitespace cleanup handles all of this automatically.
What it does:
- Trims leading and trailing whitespace — Removes spaces, tabs, and newlines from the start and end of every field value.
- Collapses multiple spaces — Converts runs of multiple spaces within a value into a single space. For example, "Oslo   Dining   Chair" becomes "Oslo Dining Chair".
- Converts empty strings to null — Fields that contain only whitespace (or are empty after trimming) are set to null. This ensures that "blank" fields are consistently represented and that downstream validation correctly identifies them as missing values.
| Raw Value | After Cleanup |
|---|---|
| Oslo Dining Chair (with extra whitespace) | Oslo Dining Chair |
| Black Leather (with extra whitespace) | Black Leather |
| (whitespace only) | null |
| (empty string) | null |
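The three behaviors above can be sketched in a few lines of Python. This is an illustration only, not FeedPrep's actual implementation: the function name `clean_whitespace` is hypothetical, and the sketch assumes that only runs of space characters are collapsed inside a value (interior tabs and newlines are left alone), which is one reading of the description above.

```python
import re
from typing import Optional


def clean_whitespace(value: Optional[str]) -> Optional[str]:
    """Trim, collapse runs of spaces, and map blank values to None (null)."""
    if value is None:
        return None
    # str.strip() removes leading/trailing spaces, tabs, and newlines.
    trimmed = value.strip()
    # Assumption: only runs of two or more spaces are collapsed.
    collapsed = re.sub(r" {2,}", " ", trimmed)
    # An empty result means the field was blank, so report it as missing.
    return collapsed or None


print(clean_whitespace("  Oslo   Dining  Chair \n"))  # Oslo Dining Chair
print(clean_whitespace("   "))                        # None
```

Returning `None` rather than an empty string gives later steps a single representation of "missing" to check for.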
Step 2: HTML Stripping
Many supplier feeds contain HTML markup embedded in field values — especially in product descriptions, but sometimes in unexpected places like titles or specifications. HTML stripping removes all HTML tags from field values, leaving only the plain text content.
Examples:
| Raw Value | After Stripping |
|---|---|
| <p>Solid oak <strong>dining</strong> chair</p> | Solid oak dining chair |
| Color: <span style="color:red">Red</span> | Color: Red |
| <br/>45 cm<br/> | 45 cm |
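One way to get the results in the table is to parse the value and keep only its text nodes. The sketch below uses Python's standard `html.parser` module. The names `_TextExtractor` and `strip_html` are hypothetical, and FeedPrep's own stripping logic may differ, for example in how it treats script or style content.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []  # text fragments in document order

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_html(value: str) -> str:
    """Return the plain-text content of a field value."""
    parser = _TextExtractor()
    parser.feed(value)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_html("<p>Solid oak <strong>dining</strong> chair</p>"))  # Solid oak dining chair
print(strip_html('Color: <span style="color:red">Red</span>'))       # Color: Red
print(strip_html("<br/>45 cm<br/>"))                                 # 45 cm
```

A parser is used here instead of a regular expression because it copes with attributes and self-closing tags without special cases.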
Per-Column "Keep HTML" Option
Not all HTML should be stripped. Product descriptions often contain intentional formatting — bullet lists, bold text, paragraph breaks — that should be preserved for display on your storefront or marketplace listing.
FeedPrep's adapter includes a keep_html setting that lets you specify which fields should retain their HTML content. When a field is marked with keep_html, the HTML stripping step skips that field entirely.
Common use cases for keeping HTML:
- Product descriptions — Rich text formatting that should appear on your website.
- Feature lists — HTML bullet lists or tables describing product specifications.
- Marketing copy — Styled text intended for display rather than data processing.
Fields not marked with keep_html will always have their HTML stripped. This default-strip approach ensures that stray HTML tags don't contaminate structured data fields like color, material, or dimensions.
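The default-strip rule with per-field exceptions can be pictured as follows. This is a sketch under stated assumptions: the source does not show how `keep_html` is written in the adapter configuration, so it is modeled here as a plain set of field names, `strip_html_fields` is a hypothetical helper, and the tag-matching regular expression is a simplistic stand-in for a real HTML stripper.

```python
import re
from typing import Dict, Optional, Set

# Simplistic tag pattern, for illustration only.
_TAG = re.compile(r"<[^>]+>")


def strip_html_fields(
    record: Dict[str, Optional[str]], keep_html: Set[str]
) -> Dict[str, Optional[str]]:
    """Strip tags from every field except those listed in keep_html."""
    cleaned = {}
    for field, value in record.items():
        if value is None or field in keep_html:
            cleaned[field] = value  # skipped entirely, HTML preserved
        else:
            cleaned[field] = _TAG.sub("", value)  # default: strip
    return cleaned


record = {
    "title": "<b>Oslo</b> Dining Chair",
    "description": "<ul><li>Solid oak</li></ul>",
    "color": "<span>Black</span>",
}
result = strip_html_fields(record, keep_html={"description"})
print(result["title"])        # Oslo Dining Chair
print(result["description"])  # <ul><li>Solid oak</li></ul>
print(result["color"])        # Black
```

Because stripping is the default, a newly added supplier column is cleaned unless it is explicitly marked to keep its HTML.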
Pipeline Position
Whitespace cleanup and HTML stripping are deliberately the first steps in the normalization pipeline:
1. Whitespace cleanup
2. HTML stripping
3. Column mapping
4. Value mapping
5. Unit conversion
6. Transform rules
7. Validation
By running first, they ensure that every subsequent step (column mapping, value mapping, unit conversion) works with clean text. A value like <b>Black</b> becomes Black before the value mapper ever sees it, so your mapping rules only need to handle the actual content, not formatting artifacts.
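To see why the order matters, consider a minimal sketch in which each step is a function applied in sequence. Everything here is hypothetical: the step functions are simplified stand-ins, and the `COLOR_CODES` table and its `BLK` code are invented for illustration.

```python
import re
from typing import Callable, List, Optional

Step = Callable[[Optional[str]], Optional[str]]


def clean_whitespace(value: Optional[str]) -> Optional[str]:
    if value is None:
        return None
    return re.sub(r" {2,}", " ", value.strip()) or None


def strip_html(value: Optional[str]) -> Optional[str]:
    if value is None:
        return None
    return re.sub(r"<[^>]+>", "", value)  # simplistic, illustration only


COLOR_CODES = {"black": "BLK"}  # hypothetical value-mapping rule


def map_value(value: Optional[str]) -> Optional[str]:
    if value is None:
        return None
    return COLOR_CODES.get(value.lower(), value)  # unmapped values pass through


# Cleanup steps come first, so the mapper only ever sees plain text.
PIPELINE: List[Step] = [clean_whitespace, strip_html, map_value]


def normalize(value: Optional[str]) -> Optional[str]:
    for step in PIPELINE:
        value = step(value)
    return value


print(normalize("  <b>Black</b> "))  # BLK
print(map_value("<b>Black</b>"))     # <b>Black</b> (rule never matches)
```

If the mapper ran on the raw value, the rule for "black" would need a variant for every way a supplier might wrap the word in markup.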