Last updated on 2020-11-24
Dealing with the product data of big online retail stores, marketplaces and price comparison websites is a messy business: sparse, incorrect and redundant records of millions of products need to be aligned with huge category trees and product hierarchies comprising hundreds of thousands of categories across dozens of levels. This is further complicated by predecessor and successor relationships within product families, similar names of products or manufacturers, and data providers who try to maximize their own advantage by deliberately delivering flawed data.
This results in huge manual effort for data cleaning, usability issues such as a bad onsite search experience, poor SEO of the resulting websites and poor conversions. In the end, this situation means a loss of profit for the site operators.
Recently, an internationally operating online marketplace was in exactly this situation: the website was driven by the messy data uploaded by its merchants, and its database of 12 million products was totally out of control. The company was bleeding money and, in an attempt to break free from its dependence on merchant data, was about to sign a contract to have its complete product taxonomy manually cleaned and recreated for an outrageous amount of money.
At this point we were asked whether it would be possible to recreate category trees via technology for less money and in less time, and not only once but as often as needed. In short: it worked! The longer version is: we found a way to transform the messy product data into a machine readable data representation we called “product vectors”, to which we could apply standard AI techniques like clustering, machine learning and similarity detection in order to work out product and category relations. The solution we initially developed as a prototype could be applied to a whole range of issues every operator of product centric websites has to deal with.
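How the product vectors are actually built is not described here. Purely as an illustration of the idea, the sketch below assumes a simple bag-of-words scheme: every field of a product record is tokenized and counted, and two products are compared via cosine similarity. The function names `product_vector` and `cosine`, the record fields and the sample products are all hypothetical; a production system would more likely use weighted or learned embeddings.

```python
import math
import re
from collections import Counter


def product_vector(record: dict) -> Counter:
    """Turn a messy product record into a sparse token-count vector.

    Assumption: a plain bag-of-words over all non-empty fields.
    """
    text = " ".join(str(value) for value in record.values() if value)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical sample records: two messy listings of the same phone and a kettle.
phone_a = {"title": "Acme Phone X 64GB black", "brand": "Acme"}
phone_b = {"title": "ACME phone x, 64 GB (Black)", "brand": None}
kettle = {"title": "Steel water kettle 1.7l", "brand": "Boilco"}

sim_same = cosine(product_vector(phone_a), product_vector(phone_b))
sim_other = cosine(product_vector(phone_a), product_vector(kettle))
print(f"phone vs. phone:  {sim_same:.2f}")
print(f"phone vs. kettle: {sim_other:.2f}")
```

Even with differing spelling, punctuation and a missing brand field, the two phone listings end up far closer to each other than to the kettle, which is the property that clustering and similarity detection build on.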
If you are interested in learning how this could be applied to your situation and help you save money: LET’S TALK.
Broadly speaking this product vectorization approach could be utilized for category related issues, for product related issues and in the overlapping field:
Category Related
- Product taxonomies which are open to 3rd party updates by partners, merchants or consumers often grow in undesired directions over time and tend to degenerate. Once the inconsistency reaches a certain level, it might be an option not only to optimize the given category tree but to cut it back completely to the desired dimensions. By turning product data into vectors and clustering them into a given number of groups, we are able to rebuild a given product catalog into a new and highly consistent category tree. By doing this level by level, we can recreate an entirely new product hierarchy based on the features of the existing products. A leaner and more consistent category tree can help with overall SEO (hence create more traffic) and also with better onsite findability (hence increase conversions).
- Identifying (near) duplicate leaf nodes in a category tree: we can help to clean organically grown catalogues with high category overlap and inconsistency. This helps to make new product listings faster, especially for third party data providers like merchants but it also helps end users to get all products of the same type in one place.
- By looking at the commonalities and clusters among the products of a given category, we are able to identify and / or extract filters, attributes or new sub-categories within leaf nodes.
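The clustering step behind rebuilding one level of a category tree can be sketched as follows. This is a minimal illustration and not the implementation described above: it assumes bag-of-words vectors over product titles, a small spherical k-means variant with deterministic farthest-first seeding, and hypothetical sample titles. The names `vectorize`, `cosine`, `centroid` and `cluster` are made up for this example.

```python
import math
import re
from collections import Counter


def vectorize(title: str) -> Counter:
    """Bag-of-words vector of a product title (assumed vector scheme)."""
    return Counter(re.findall(r"[a-z0-9]+", title.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def centroid(vectors: list) -> Counter:
    """Sum of the member vectors; cosine similarity ignores the scale."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return total


def cluster(vectors: list, k: int, iterations: int = 10) -> list:
    """Group vectors into k clusters and return one cluster label per vector."""
    # Farthest-first seeding: start with the first vector, then repeatedly
    # add the vector that is least similar to all centers chosen so far.
    centers = [vectors[0]]
    while len(centers) < k:
        centers.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centers))
        )
    labels = []
    for _ in range(iterations):
        # Assign every product to its most similar center ...
        labels = [
            max(range(k), key=lambda i: cosine(v, centers[i])) for v in vectors
        ]
        # ... then move each center to the centroid of its members.
        for i in range(k):
            members = [v for v, label in zip(vectors, labels) if label == i]
            if members:
                centers[i] = centroid(members)
    return labels


# Hypothetical flat catalog that should be split into two categories.
titles = [
    "Acme phone X 64GB black",
    "Acme phone X 128GB white",
    "Boilco steel water kettle 1.7l",
    "Boilco glass water kettle 1.5l",
    "Zoomer phone Z 64GB blue",
    "Heatly water kettle steel 2l",
]
labels = cluster([vectorize(t) for t in titles], k=2)
for label, title in sorted(zip(labels, titles)):
    print(label, title)
```

Running the same function again on the members of each resulting cluster would yield the next level of the tree. Comparing the centroids of two existing leaf categories with `cosine` is one possible way to flag near duplicate leaf nodes.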
Product Related
- Matching a product to its best matching category within a given product taxonomy is an important task for merchants, marketplace operators, price comparison websites and everyone who relies on accurate product information. Often this is done on incomplete data or, even worse, by hand. Product vectorization is a fast, reliable and highly accurate automatic way of matching products to categories. It relieves the lister of an item of the need to select a category. Instead, the matching service either proposes the best matching leaf node, which the lister only has to confirm, or lists the item in the best category automatically, without any lister engagement.
- Enriching products automatically at the moment of their creation or initial upload to the infrastructure with the attributes or filters available in their matching category. This enables higher quality listings with better internal linking, which boosts onsite SEO for product taxonomy driven websites. This interconnection also boosts onsite findability, internal search and conversions.
- Identifying highly similar, near duplicate or identical items within an existing pool of products: this also works early on, during the product creation and/or upload process. Our deduplication works in organized product taxonomies typical for highly structured marketplaces, bigger online shops and price comparison vendors, but also with unstructured data repositories more typical for merchant driven marketplaces or data providers. This is a MUST-have feature to automatically ensure high data quality, data consistency and data integrity, to boost conversions and sales, and it is also a necessity for any successful product based Google SEO. Technically it makes no difference whether you want to deduplicate items within the items of a single data provider (e.g. from a single merchant), within a single category or across your complete inventory, even cross taxonomy (e.g. between overlapping or inconsistent categories).
- A sparse product catalogue could be enriched with data from different sources: this could be elements like better and longer descriptions, product IDs (like EANs or ASINs), images, better titles, structured data sheets, price information, reviews etc. The product data could be enriched at the product level but also at the category level in a given product taxonomy. Data sources could be structured data like Icecat
- Product images are often of poor quality (especially consumer uploads), which is problematic for conversions, sales or clicks. Manufacturers and data providers can usually deliver high quality product images. But matching them is often challenging, as there is a high technical hurdle for matching binary image data with text based product data. Vectorizing product catalogues and 3rd party metadata could allow us to match high quality data with existing catalogues in previously unseen ways and, as is often the case with AI applications, in a near magical manner.
- By matching product catalogues with content rich data like product reviews, price information, ratings, category descriptions, manufacturer descriptions etc., our product vector matching approach is able to enrich your sparse product taxonomy. This helps to drive conversions and could even boost SEO.
- Product names are an important factor for conversion but also for SEO. Websites with user generated product content, like marketplaces, often have issues with establishing meaningful, correct and economically useful product titles. By matching product names that fall below certain quality criteria (length or uniqueness) with titles from high quality 3rd party data catalogues, this issue could be solved.
- To create product feeds for listing on price comparison websites and search engines, it is necessary to match product inventory from internal product catalogues, even unstructured inventory and messy data, to the structures of 3rd party catalogs like Google Product. This challenging process can be automated via the product vectorization technique and even fine tuned to individual data input and output restrictions.
- Identifying connected but NOT identical items by clustering similar products into connected groups, like product families and product series within a given product data catalog, is a way to automatically produce semi structured data connections within only weakly structured product catalogues. This enables conversion optimization but also better internal linking, which boosts onsite SEO.
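Two of the product related tasks above, category matching and deduplication, can be illustrated with a short sketch. It is an assumption-laden example rather than the actual matching service: products are represented as bag-of-words vectors of their titles, each category by the sum of its members' vectors, and the `0.8` duplicate threshold, the sample taxonomy and the names `best_category` and `near_duplicates` are all hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(title: str) -> Counter:
    """Bag-of-words vector of a product title (assumed vector scheme)."""
    return Counter(re.findall(r"[a-z0-9]+", title.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def category_centroids(taxonomy: dict) -> dict:
    """Represent each leaf category by the summed vectors of its products."""
    centroids = {}
    for name, titles in taxonomy.items():
        total = Counter()
        for title in titles:
            total.update(vectorize(title))
        centroids[name] = total
    return centroids


def best_category(title: str, centroids: dict) -> tuple:
    """Return (category name, similarity score) of the closest category."""
    vec = vectorize(title)
    name = max(centroids, key=lambda c: cosine(vec, centroids[c]))
    return name, cosine(vec, centroids[name])


def near_duplicates(titles: list, threshold: float = 0.8) -> list:
    """Return index pairs of titles whose similarity reaches the threshold.

    Compares all pairs, which is only practical for small pools.
    """
    vecs = [vectorize(t) for t in titles]
    return [
        (i, j)
        for i in range(len(vecs))
        for j in range(i + 1, len(vecs))
        if cosine(vecs[i], vecs[j]) >= threshold
    ]


# Hypothetical two-category taxonomy and a new listing to be placed.
taxonomy = {
    "Phones": ["Acme phone X 64GB black", "Zoomer phone Z 128GB blue"],
    "Kettles": ["Boilco steel water kettle 1.7l", "Heatly glass water kettle 2l"],
}
centroids = category_centroids(taxonomy)
suggestion, score = best_category("Acme phone X 128GB white", centroids)
print("suggested category:", suggestion)

pool = [
    "Acme phone X 64GB black",
    "ACME Phone X 64GB (black)",
    "Boilco steel water kettle 1.7l",
]
dupes = near_duplicates(pool)
print("near duplicate pairs:", dupes)
```

The returned score could be used to decide between the two modes mentioned above: list automatically when the score is high, and ask the lister to confirm otherwise. For inventories with millions of products, the all-pairs comparison in `near_duplicates` would presumably have to be replaced by some form of indexed or approximate nearest neighbor search.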
There are probably many more ways to apply this technique to unstructured product data! If you are interested in finding out how this method could help you to save money, increase conversions, boost sales and grow SEO traffic: LET’S TALK.


