As developers, we’re not just building AI models to analyze companies; we’re stepping into the shoes of active investors. While AI can process vast amounts of data, it misses the human side—the firsthand experiences we gain by touring production facilities or using a product. These insights help us assess a company in ways AI simply can’t.

Today, we’re asking: How big does a company have to be in order to be a candidate for AI model training?

At some point in the process of preparing the dataset for training an AI model, a crucial decision arises: Which companies should be included for training, and what is the minimum market size for a company to be considered for both training and inference?

Why is this important? The answer is twofold:

On one hand, there’s the issue of data quality. Smaller companies often report less frequently, mostly due to limited resources—for example, on a yearly rather than a quarterly basis, especially for non-US firms. They may also omit non-mandatory observables, resulting in a less complete dataset. This inconsistency can significantly impact the quality of the data used for training the model.

On the other hand, small companies are often younger and may operate differently than more established firms. For instance, they may still be in the early stages of product development and may have gone public through an IPO to raise capital for financing the final stages of development, without having a product on the market yet. In such cases, key financial metrics like revenue are not yet meaningful. Moreover, younger companies typically raise equity capital rather than taking on debt, which leads to financial structures that differ significantly from those of mature companies. These variations can distort the distribution of training samples, making it harder for the AI model to generalize effectively.

Ultimately, the financial metrics of younger companies often follow different distributions compared to more established firms. These differences—such as early-stage revenue patterns or capital structures—can skew the AI’s understanding and predictions. Therefore, selecting companies of an appropriate size is crucial to ensuring the data used for training is relevant, reliable, and consistent. However, excluding smaller companies reduces the size of the training dataset, which is already the single most important bottleneck when applying AI for stock selection.

This presents a dilemma: Data quality versus data quantity. At some point, sacrifices must be made.

The company size landscape

Before we dive into the intricacies of market capitalization, let’s start with a little quiz. In developed markets, what do you think is the typical market capitalization of a listed company? Consider both the mean and the median value, i.e. the figure that sits right in the middle when companies are sorted by size.

Many people anchor on giants like Apple, Amazon, and Microsoft, and therefore assume the typical listed company is far larger than it actually is. When we look at all companies in the “developed markets,” as defined by MSCI, with a market cap exceeding €100 million, we end up with approximately 10,000 companies.

 

The Histogram Reveals the Truth

When we plot these companies on a histogram, the data tells a surprising story. The largest firms are outliers: only a handful reach a market cap in the trillions, and they all but disappear from view amid the sheer number of smaller companies.

The vast majority of companies are much smaller, so much so that they end up in the smallest bin of our histogram. The mean market cap is €6.5 billion, while the median is a mere €710 million. In Germany, a company like Zeal Network fits this profile, while in the US we find lesser-known entities like Ceva or ANI Pharmaceuticals—names recognized mostly by experts in their fields.
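This gap between mean and median is exactly what a heavy right tail produces. A minimal sketch with synthetic data illustrates the effect; the log-normal parameters are illustrative assumptions, not fitted to the real MSCI universe:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic market caps in € million, drawn from a heavy-tailed
# log-normal distribution (parameters are illustrative, not fitted
# to the real developed-markets universe).
caps = rng.lognormal(mean=6.6, sigma=1.6, size=10_000)

# A few giants pull the mean far above the median.
print(f"mean:   €{caps.mean():,.0f}m")
print(f"median: €{np.median(caps):,.0f}m")
```

For a log-normal distribution the mean exceeds the median by a factor of exp(σ²/2), so even a modest tail parameter produces a severalfold gap.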

If we narrow our focus to companies with a market capitalization of less than €2 billion, roughly 70% of all companies remain.

A telling pattern emerges: as market capitalization decreases, the number of companies increases. In fact, over 40% of listed companies have a market capitalization of less than €500 million.

The Implications for Machine Learning in Investment

This presents a challenge for using machine learning in stock selection. Why? Because stock markets are disproportionately influenced by a small number of large companies. When a major player like Apple falters, the market feels the tremors. However, for our purposes, let’s set aside market fluctuations and concentrate on company fundamentals.

The real challenge arises from the use of market capitalization as a filtering criterion. Excluding smaller companies can significantly reduce the dataset available for training machine learning models. For instance, using only companies with a market cap above €700 million—the median size—would cut the training data by nearly half. And remember, that initial set of 10,000 companies was already a less-than-ideal sample for machine learning applications. So, why impose constraints on market capitalization?
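To make the cut concrete, here is a hedged sketch of how a minimum-size filter shrinks a hypothetical universe. The €700 million threshold mirrors the median mentioned above; the synthetic market caps are an assumption, not real data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical universe of 10,000 firms; market caps in € million are
# synthetic (log-normal with a median around €700m), not real data.
universe = rng.lognormal(mean=np.log(700), sigma=1.6, size=10_000)

# Keep only companies at or above the ~€700m median threshold.
kept = universe[universe >= 700]
share = len(kept) / len(universe)
print(f"kept {len(kept)} of {len(universe)} companies ({share:.0%})")
```

By construction, filtering at the median discards about half the universe; any threshold above it discards even more.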

The Trade-off Dilemma

It’s a trade-off: do you choose a smaller, more homogeneous dataset or a larger, potentially more heterogeneous one? The motivation to exclude smaller companies stems from the belief that they often operate under different rules than larger firms. As noted in the Harvard Business Review article “A Small Business Is Not a Little Big Business” by Welsh and White, small businesses are not simply scaled-down versions of their larger counterparts. They inhabit unique conditions, often clustered in fragmented industries where price-cutting is common. Moreover, external forces, like changes in regulations or tax laws, tend to impact them more heavily than large corporations.

From a data science perspective, moreover, the quality of data from smaller companies is often inferior, particularly outside the US. Not all countries mandate that companies publish financial statements quarterly, allowing smaller firms to minimize reporting. This results in missing data and a lack of detail in financial statements—challenges that complicate analysis (for more on missing data in financial applications, see the recent paper on Missing Financial Data).
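One simple way to operationalize this is to measure, per company, the share of missing fundamentals and drop firms above a cutoff. A sketch with toy data; the company names, column names, and the one-third cutoff are all illustrative assumptions:

```python
import pandas as pd

# Toy fundamentals panel; None marks fields that smaller (often non-US)
# firms frequently do not report. All names and figures are illustrative.
df = pd.DataFrame({
    "company": ["BigCo", "MidCo", "SmallCo", "MicroCo"],
    "revenue": [1200.0, 340.0, 55.0, None],
    "ebit":    [210.0, 40.0, None, None],
    "r_and_d": [95.0, None, None, None],
})

# Share of missing observables per company.
missing_share = df.set_index("company").isna().mean(axis=1)
print(missing_share)

# Illustrative rule: keep firms with at most one third of fields missing.
kept = missing_share[missing_share <= 1 / 3].index.tolist()
print(kept)
```

In practice the cutoff itself becomes another quality-versus-quantity dial: tighten it and the sample shrinks, loosen it and imputation has to carry more weight.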

Ultimately, the developer must strike a balance, identifying the ideal threshold for excluding smaller, less representative companies from the training set while still retaining a substantial amount of training data.

A Closer Look at Small Companies

Are small companies truly that different? Let’s explore three emerging French firms in the circular economy: Afyren, Carbios, and Hoffmann Green Cement.

The newcomers of the circular economy – A journey through France:

Afyren, Carbios, Hoffmann Green Cement: circular economy. That could have been the headline of our trip to France in December 2022. The term came up again and again as we visited these three emerging companies.

They all have big ambitions: to become major players in a new, sustainable circular economy. Carbios is currently the largest of the three, with a market capitalisation of €410 million. Afyren and Hoffmann Green Cement are still micro-caps, with market capitalisations of €147 million and €131 million respectively. The three are not necessarily representative of the typical “small” company. That is not because they are all French, but because they are very young companies, still at the beginning of their existence. Carbios went public in 2013, Hoffmann Green Cement in 2019 and Afyren in 2021, all within the last decade and all with the goal of raising capital to scale up their prototypes or proofs of concept to industrial size. So let’s take a quick look at the companies that aim to revolutionise industries by producing petrochemical-free acids, extending the life of plastics and offering low-carbon cement.

 

Afyren

Our journey begins in Clermont-Ferrand, where Afyren, a manufacturer of bio-based organic acids, is making waves. While 99% of organic acids used in everyday products are petrochemical, Afyren harnesses fermentation technology to create bio-based alternatives using sugar beet byproducts. The final acids are the starting point of a wide range of products such as salty chips, fragrances, additives for food preservation, plastics for consumer products, lubricants for aerospace, hair conditioners or battery coolants and refrigerants, to name a few. Their first plant, completed in 2022, allows for industrial-scale production, and they’ve already pre-sold half of their targeted sales. On the left: a small lab sample of the mix of natural, non-genetically modified microorganisms used to produce the organic acids.

Carbios

Next, we visit Carbios, also in Clermont-Ferrand. This innovative company is revolutionising the lifecycle of plastics by developing enzymes that break down PET bottles and textile waste into their basic components (monomers) for reuse in the production of 100% recycled and recyclable PET. Carbios can recycle materials without losing quality—a significant improvement over conventional methods. Their technology is already attracting attention from multinational brands like L’Oréal and PepsiCo.

On the right: the mysterious enzymatic processes run hidden in big reactors. In them, shredded, porous PET plastic is mixed with water and enzymes. After 24 hours at a temperature of just 65 degrees Celsius, the enzymes have broken the long polymer chains down into individual monomers. Once isolated from the residual materials, these can be reused directly for the production of packaging or bottles, in the same quality as the original products.

Hoffmann Green Cement

Finally, we travel to Rives-de-l’Yon, where Hoffmann Green Cement is redefining the cement industry. Conventional cement production is a major greenhouse gas emitter, but Hoffmann’s unique process produces cold decarbonized cement with a significantly lower carbon intensity. However, Hoffmann has yet to generate significant revenue, making it difficult to assess its financial viability based solely on historical data.

Is this the future of cement production?

Eric & Kevin taking the chimney’s lift to the top

 

The Human Element in Investment Decisions

So what is the issue with these businesses? Why would one choose to exclude them from training when building machine learning models based on fundamental data, i.e. data from the income statement, balance sheet and cash flow statement? Why reject these companies and end up with an even smaller pool of already scarce training material? Let’s pick Hoffmann as an example and look at the financial statements. At the time of writing this post, the latest document is the half-year financial report as of June 30, 2022. It is available in French only, but that will not prevent us from extracting the most important figures.

Sometimes figures say more than a thousand words! Hoffmann Green Cement has not yet generated any significant turnover. A meagre €500,000 in turnover compared to a market capitalisation of over €100 million! At the same time, the company is burning cash: almost €5 million in operating activities. But this is understandable: The company needs to invest a lot of money to prove that its cement production works on an industrial scale, and to do that it first needs to build sufficient capacity and hire people (in addition to massive investments). At least there is still enough cash in the bank account left over from the IPO. Hoffmann Green Cement’s technology could be a game changer. It could be the future of cement production. But you wouldn’t know it from its latest financials. The company cannot be valued on the basis of its historical data. An investment is a bet that its technology will succeed and become mainstream. Or a bet that the company will be acquired by one of the big cement producers, such as Switzerland’s Holcim, France’s Lafarge, Germany’s HeidelbergCement, or an exotic company like Mexico’s Cemex.
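As a back-of-the-envelope check on why historical figures cannot support a valuation here, consider the implied price-to-sales multiple. The figures are rounded from the report, and the naive doubling of half-year turnover is a simplifying assumption:

```python
market_cap_eur_m = 100.0         # market capitalisation, € million (rounded)
h1_revenue_eur_m = 0.5           # ~€500k turnover in the first half of 2022
h1_operating_burn_eur_m = 5.0    # ~€5m cash used in operating activities

# Naively annualize the half-year figures.
annualized_revenue = 2 * h1_revenue_eur_m
annualized_burn = 2 * h1_operating_burn_eur_m

price_to_sales = market_cap_eur_m / annualized_revenue
burn_to_revenue = annualized_burn / annualized_revenue

print(f"implied price/sales:        {price_to_sales:.0f}x")
print(f"annualized burn vs revenue: {burn_to_revenue:.0f}x")
```

A triple-digit price-to-sales ratio, with cash burn an order of magnitude above revenue, is the signature of a pre-revenue bet rather than a business a fundamentals-based model can price.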

But we want to stay away from betting when it comes to our quantitative approach. We want to shift the return distribution of our equity positions a little to the right, towards outperforming the broad market. However, we do not aim to identify the rare extreme outliers that have the ability to dominate an industry in the future through disruptive technologies. Machine learning does not (yet) offer this capability. An approach based on historical fundamentals is not appropriate for estimating future total addressable markets, the likelihood of adoption of certain technologies, or the future regulation of markets by governments or public authorities.

Human judgement is still required. I am happy to leave it to discretionary managers to find and evaluate such companies. My toolkit is not designed for that. Nor do I want such companies to distort my set of established business models, so I exclude them from the training set. As developers, we must determine the right threshold for excluding smaller firms without drastically shrinking our training pool. But what is the limit? Well, that’s for everyone to find out for themselves.

Ultimately, understanding the market landscape, especially for small companies, requires nuance and careful consideration—balancing the desire for data with the reality of what that data can tell us.