HTML Stripping & Whitespace Cleanup

HTML stripping and whitespace cleanup are the first two steps in FeedPrep's normalization pipeline. They run before any column mapping, value mapping, or transformation — ensuring that downstream steps work with clean, consistent text.

Step 1: Whitespace Cleanup

Supplier feeds frequently contain messy whitespace: extra spaces, leading or trailing blanks, and fields that look like they have a value but are actually just empty space. Whitespace cleanup handles all of this automatically.

What it does:

Trims leading and trailing whitespace: Removes spaces, tabs, and newlines from the start and end of every field value.
Collapses multiple spaces: Converts runs of multiple spaces within a value into a single space. `Oslo    Dining   Chair` becomes `Oslo Dining Chair`.
Converts empty strings to null: Fields that are empty, or that contain only whitespace and are therefore empty after trimming, are set to null. Blank fields are then represented consistently, and downstream validation correctly identifies them as missing values.

Raw Value	After Cleanup
`  Oslo   Dining Chair  `	`Oslo Dining Chair`
`Black    Leather`	`Black Leather`
(whitespace only)	null
(empty string)	null
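
The three rules above can be sketched in a few lines of Python. This is only an illustration, not FeedPrep's actual implementation: the function name `clean_whitespace` is hypothetical, and the sketch assumes that only runs of space characters are collapsed inside a value, as the description states.

```python
import re
from typing import Optional


def clean_whitespace(value: Optional[str]) -> Optional[str]:
    """Hypothetical helper that mirrors the three cleanup rules."""
    if value is None:
        return None
    # Rule 1: trim spaces, tabs, and newlines from both ends.
    trimmed = value.strip()
    # Rule 2: collapse runs of two or more spaces into one space.
    # Assumption: interior tabs and newlines are left untouched.
    collapsed = re.sub(r" {2,}", " ", trimmed)
    # Rule 3: an empty result becomes None (null).
    return collapsed or None


print(clean_whitespace("  Oslo   Dining  Chair \n"))  # Oslo Dining Chair
print(clean_whitespace("   "))                        # None
```

Returning `None` rather than an empty string means a later validation step has a single representation of "missing" to check for.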

Step 2: HTML Stripping

Many supplier feeds contain HTML markup embedded in field values — especially in product descriptions, but sometimes in unexpected places like titles or specifications. HTML stripping removes all HTML tags from field values, leaving only the plain text content.

Examples:

Raw Value	After Stripping
`<p>Solid oak <strong>dining</strong> chair</p>`	`Solid oak dining chair`
`Color: <span style="color:red">Red</span>`	`Color: Red`
`<br/>45 cm<br/>`	`45 cm`
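
For readers who want to reproduce this behavior outside FeedPrep, the following is a minimal sketch built on Python's standard `html.parser` module. The names `_TextExtractor` and `strip_html` are hypothetical, and the source does not say how FeedPrep implements stripping internally, so treat this as one possible approach rather than the product's method.

```python
from html.parser import HTMLParser
from typing import List, Optional


class _TextExtractor(HTMLParser):
    """Collects text content and ignores every tag."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_html(value: Optional[str]) -> Optional[str]:
    """Hypothetical helper: return the plain text of a field value."""
    if value is None:
        return None
    parser = _TextExtractor()
    parser.feed(value)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_html("<p>Solid oak <strong>dining</strong> chair</p>"))  # Solid oak dining chair
print(strip_html('Color: <span style="color:red">Red</span>'))       # Color: Red
print(strip_html("<br/>45 cm<br/>"))                                 # 45 cm
```

A parser-based approach is used here instead of a regular expression because it copes better with attributes that contain `>` characters.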

Per-Column "Keep HTML" Option

Not all HTML should be stripped. Product descriptions often contain intentional formatting — bullet lists, bold text, paragraph breaks — that should be preserved for display on your storefront or marketplace listing.

FeedPrep's adapter includes a keep_html setting that lets you specify which fields should retain their HTML content. When a field is marked with keep_html, the HTML stripping step skips that field entirely.

Common use cases for keeping HTML:

Product descriptions — Rich text formatting that should appear on your website.
Feature lists — HTML bullet lists or tables describing product specifications.
Marketing copy — Styled text intended for display rather than data processing.

Fields not marked with keep_html will always have their HTML stripped. This default-strip approach ensures that stray HTML tags don't contaminate structured data fields like color, material, or dimensions.
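
The default-strip behavior with per-field exceptions can be illustrated as follows. The exact syntax of the keep_html setting in the adapter is not shown here, so this sketch assumes it amounts to a set of field names; `strip_html_fields` and the regex-based tag removal are hypothetical simplifications.

```python
import re
from typing import Dict, Optional, Set

# Deliberately naive tag pattern, used only to keep the sketch short.
TAG_RE = re.compile(r"<[^>]+>")


def strip_html_fields(
    row: Dict[str, Optional[str]], keep_html: Set[str]
) -> Dict[str, Optional[str]]:
    """Strip tags from every field except those listed in keep_html."""
    result: Dict[str, Optional[str]] = {}
    for field, value in row.items():
        if value is None or field in keep_html:
            result[field] = value  # skipped entirely
        else:
            result[field] = TAG_RE.sub("", value)
    return result


row = {
    "description": "<ul><li>Solid oak</li></ul>",
    "color": "<b>Black</b>",
}
cleaned = strip_html_fields(row, keep_html={"description"})
print(cleaned["description"])  # <ul><li>Solid oak</li></ul>
print(cleaned["color"])        # Black
```

Because stripping is the default, a newly added field is cleaned automatically unless it is explicitly listed.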

Pipeline Position

Whitespace cleanup and HTML stripping are deliberately the first steps in the normalization pipeline:

1. Whitespace cleanup
2. HTML stripping
3. Column mapping
4. Value mapping
5. Unit conversion
6. Transform rules
7. Validation

By running first, they ensure that every subsequent step — column mapping, value matching, unit extraction — works with clean text. A value like <b>Black</b> becomes Black before the value mapper ever sees it, so your mapping rules only need to handle the actual content, not formatting artifacts.
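
To see why the order matters, consider this small sketch of a pipeline as an ordered list of functions. Everything in it is hypothetical: the step functions are simplified stand-ins, `COLOR_MAP` is an invented mapping rule, and the later steps (column mapping, unit conversion, and so on) are omitted.

```python
import re
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[str]]
Step = Callable[[Row], Row]


def whitespace_cleanup(row: Row) -> Row:
    cleaned: Row = {}
    for col, val in row.items():
        if val is not None:
            val = re.sub(r" {2,}", " ", val.strip())
        cleaned[col] = val or None
    return cleaned


def html_stripping(row: Row) -> Row:
    # Naive tag removal, enough for the illustration.
    return {
        col: re.sub(r"<[^>]+>", "", val) if val is not None else None
        for col, val in row.items()
    }


COLOR_MAP = {"Black": "black"}  # hypothetical value-mapping rule


def value_mapping(row: Row) -> Row:
    mapped = dict(row)
    if mapped.get("color") in COLOR_MAP:
        mapped["color"] = COLOR_MAP[mapped["color"]]
    return mapped


def run(row: Row, steps: List[Step]) -> Row:
    for step in steps:
        row = step(row)
    return row


raw = {"color": "  <b>Black</b> "}
print(run(raw, [whitespace_cleanup, html_stripping, value_mapping]))
# {'color': 'black'}
print(run(raw, [value_mapping, whitespace_cleanup, html_stripping]))
# {'color': 'Black'}  (the rule never matched the raw value)
```

In the second call the mapping rule runs against the raw value and finds no match, which is the situation the cleanup-first ordering is meant to avoid.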
