AI alignment is about ensuring that AI goals are aligned with human goals and values when planning and executing actions. This is a huge topic, both in research circles and in the public eye. Rightfully so, if AI ends up being misaligned, once AI gets sufficiently sophisticated enough, we could very easily lose control of AI which would be threatening to our survival as a species.
The “Paperclip Maximizer” is the name of a thought experiment regarding AI. If a paperclip manufacturer gave an AI robot the problem of trying to make maximise the number of paperclips made, if we do not give these AI any sort of values and sense of morality then they may conclude that maximising their power and have great emphasis on its own existence is the optimal solution to maximising the number of paperclips produced. Then this could lead to them resisting if we ever try to turn them off for any reason, since this would decrease the number of paperclips they can make. Likewise, they may try and murder humans to utilise the materials in our bodies to make paperclips. The point this thought experiment is highlighting is not the idea of these robots becoming sentient, quite the opposite. It’s the idea that these robots are only programmed to optimize the problem given, and they go about killing people just because that was the optimal way to produce paper clips, rather out of some ego or anger against humanity. This thought experiment was posed in 2003, far before LLMs. Though this problem was repopularised a year or two ago when AI doomerism started to spread.
While the general public may view the dangers of AI alignment in the wrong way (thinking of robots gaining sentience and taking over the world), any danger in the relatively near future is more likely to look similar to the paperclip maximiser problem. A language predictor hooked up to a hunk of human shaped metal follows its algorithms in a way that leads to unintended consequences. While we cannot be sure that AI will not gain sentience in the further future, particularly when the technology behind AI improves dramatically, for now this is not correct version of the threat to be concerned about. Regardless, it is very important to ensure our LLMs remain aligned with human goals and values.
Alignment would typically come from AI R&D (research and development). This involves improving the guard railing of these LLMs as they are being developed, empirically testing responses to check these LLMs work as intended, and researching interpretability of LLMs (more on all this later). However, the more time and money you spend on these areas, the less time you spend on building, releasing, getting results and seeing/showing progress. The philosophy of how much learning, checking and testing should be done vs building and releasing varies greatly from company to company. For example, Mark Zuckerberg with Meta has been open about wanting to build and release as fast as possible. On the flip side, Dario Amodei with Anthropic has stressed the need to not go full guns blazing and slow building down.
If you are a company that believes AI development is either going at the right pace, or should be even faster, the solution is easy. Focus on building and releasing. If other companies do not agree, while you may find it frustrating that we are not combining forces to push AI as fast as possible, ultimately your competition will be the ones who are not releasing and improving model capabilities, not you.
On the other side, if you believe that, in general, we are not cautious enough about AI and are building too recklessly, that is an issue (from your perspective at least). You slowing down does not mean your competitors slow down and thus is not sufficient to prevent catastrophe, if that is where we are headed. Overall, the speed at which a company predicts AI to reach certain thresholds of intelligence is the main basis for how much “slowing down” is sought after. Given this being such a difficult question as there is an ocean of uncertainty around the extrapolation of AI’s rate of progress in the future, making this a very contentious debate. There is also of course the element of some leaders giving in to reverie of being the stupidly rich and powerful leader of AI, though I do not believe the latter to be an inevitability to every single CEO of these companies, though I definitely believe it is present in some.
Given the vast amounts of projected money in the AI industry, the expected level of societal wide transformation and the US vs China race to AGI framing (which I will go over in more depth in a later blog), it is extremely tempting to go full speed ahead. There is enormous incentive for businesses building these LLMs and even for governments to try and apply the brakes as infrequently as possible. This means getting everyone to agree to slow down is unrealistic. The only way to slow down is through legislation.
Of the big players, OpenAI was initially very focused on transparency & safety and with AGI for bettering humanity rather than for profits. This initial goal seems to have faded over time as tangible sights of profits and power as well as pressure from investors started to dominate goals. Meta has always been very open about being driven to build with no guardrails and have always seemed to be solely focused on profits. Anthropic spawned from discontent with the way OpenAI was going and so is more focused on AGI for the betterment of humanity, though inevitably they will care about profits too so they can survive, just to a lesser extent than the previous two companies mentioned. DeepMind seems to also have humanity as a priority, though is fettered by its parent company Google, who likely is more concerned with profits, which I fear could force DeepMind into a less altruistic path if it is ever deemed necessary by Google.
Governments passing laws is the only way to enforce all competition to slow down in a given country. Laws can prevent companies on the frontier of AI technology from “cutting corners” while developing their AI from succumbing to the temptations of money and power, as well as pressure from investors and potentially even the government. While even this is not perfect as you can not control what other countries do but still is a very big step in the right direction if you believe in slowing down the building of AI.
Keep in mind, this is not to merely slow down exactly what would have built otherwise. Slow down means to build more carefully.
A big issue with using legislation to slow down AI development is an issue we have never seen before. The speed at which AI has developed is like nothing we have seen before, and bureaucracy is notoriously slow, though usually still fast enough to keep up with evolving technology, but seemingly not fast enough for AI. With how quickly we are moving, it could easily be the case that, by the time any law is passed, the landscape has changed drastically enough to make the law partially or totally irrelevant!
On top of this, with how new this field is to those who were not aware of AI during its time as an underground niche, this is an incredibly new field. Given how much experts still do not know, the understanding of AI among those who pass legislation are going to have limited knowledge on the field. One solution is for legislators to work with these large AI companies to create appropriate legislation, but this is very reliant on these companies working in good faith to pass legislation with humanity in mind, rather than their own interests. While this sounds incredibly unlikely, in my opinion if a company is willing to trying to slow down AI progress despite the great financial incentives to power through, it is a lot less likely these same companies would have selfish incentives when working with the government to pass sensible legislation.
In Europe, there was the EU AI act with the intention of making AI safe, transparent and traceable to try restricting harm caused to the European population. Firstly, it classifies AI technologies into different levels of risk. Then, depending on the level of risk, this dictates the level of scrutiny and rules to which the particular technology is subject to. Risk means the amount of potential harm deemed this could cause to the public. The rules you follow include the amount of transparency needed as well as what the product can and can not do. On top of this, there are several sectors of the workforce in which you must declare the product so it can be added to a database of all such European AI technologies. High risk incidents of the technology must also be notified to the European commission. It is also required to make it known whether AI generated content is AI generated, to prevent the issue we are already seeing where it is difficult to tell what is real and what is AI generated. This will become particularly important as AI generation gets more sophisticated and realistic. There is also a requirement to detail the data which is used to train these models, as a way of systemising issues with training on copyrighted work.
It is easy to see how having these hoops to jump through will slow down AI but also make it less likely to have harmful consequences. Europe is notorious for having a lot of government intervention with AI and its efficacy is highly contentious. Its success will largely depend on competence in execution in my opinion. Inefficient bureaucracy will make the slow down unnecessarily large, and it will put EU companies at a massive disadvantage. Mistral AI, the biggest LLM in Europe, is not a frontier model for example. People usually point fingers at government intervention when wondering why Europe has not managed to produce any frontier models, when China has DeepSeek and the USA has ChatGPT, Claude, Gemini and Llama.
The US has been a lot more libertarian than Europe with regards to AI. However, this does not mean there have been no attempts at legislation. In California (crucially, the state where Silicon Valley resides), the SB 53 AI safety bill was passed in September 2025. This was along with help from and endorsed by Anthropic. This law requires developers of the frontier LLMs to publish safety frameworks and report critical incidents while developing these LLMs within 15 days. This law also prevents companies from preventing employees expressing concerns about practises of their company to “any public body”. This compels transparency and makes safety reporting mandatory rather than optional, setting a legally binding standard for large developers. At the same time, there is an exception for smaller businesses that are producing LLMs that are not top tier, meaning up and coming companies can keep their moat while they are on the come up so established companies can not just swallow every new idea and prevent any newcomers.
That said, the massive cost for training frontier models already prevents much competition unfortunately. This exception may be useful for a company like Ilya Sutskever’s LA based company Super Safe Intelligence (SSI), who do not have a frontier model yet (as far as we are aware at least, they are very secretive) but have the venture capital required to genuinely be a contender in the future and with their moat coming from having top tier researchers who can come up with innovations within AI technology in secret before releasing publicly. Presumably the gamble in mind is pulling off what Ilya previously managed back in 2012 with ImageNet, which was an image classifier which blew competition out the water purely through superior technology, rather than raw compute.
China has their own set of regulations too. The main two are “Interim Measures for the Management of Generative AI Services” and “Deep Synthesis Regulation”. Along with the usual guardrails to prevent harm to citizens, the need to label what is AI generated and what is not and rules preventing infringement of copyright which are present in EU laws and to some extent in US laws, they have some other interesting laws. For example, in typical CCP fashion, the outputs of these Chinese models must adhere to "Core Socialist Values", they must not disrupt social order, and they prohibit criticism of the Chinese government. This is huge, because legally enforcing censorship and bias into Chinese LLMs will spill over not just into the consciousness of the Chinese public, but to any people and companies that utilise these Chinese models. Imagine if China were the sole leaders in AI technology, and to use a cutting edge LLM, you had to use one that was biased and censoring in favour of the Chinese government’s image and values! I will go over this more in the next blog, but this is a very impactful aspect of their legislation geo-politically.
On the topic of copyright, this is in itself a very contentious debate, as some argue that training models on creative & copyrighted work is stealing, whereas others argue this is transformative. Personally, I would argue in the majority of cases it is transformative, just as you would not say foul play to sampling music in a song or using clips from other videos in a YouTube video. However, I do think there are cases where, if the model’s output is too similar to a particular source of creative work, this is where lines are crossed. Particularly (from a legal standpoint) if it is copyrighted. Just like how ripping off a riff in a song note for note can lead to legal issues, or reacting to a video in its entirety and adding nothing transformative is generally frowned upon in the YouTube community. I think it is quite interesting that, as of writing this, China covered copyright in their legislation before the US.
AI interpretability is about understanding why an LLM outputted what they did. This is an area very specific to AI. Usually with technology there is a very clear set of rules the computer has followed, explicitly set by the manufacturer/programmer, to get to its end point. This makes the process of figuring out why something went wrong extremely simple (though not always easy!) which is in stark contrast to AI. Rather than following a set of rules, LLMs are taught to “figure it out themselves” for lack of a better term.
The training process is very open ended, allowing the AI to spot patterns itself rather than leaning on us humans to figure it out. Since we are so uninvolved in the training process, and with how complex the training process is, this ends up giving us something extremely hard to understand when looking at why the LLM does what it does. For reference, frontier models have billions of parameters. Think of parameters as a number that goes into an equation. Imagine trying to interpret billions of numbers, where it isn’t even exactly clear what the function of each number on its own is.
Given how complicated and hard to understand these LLMs are, there is an entire field dedicated to AI interpretability, literally the studying of understanding how AI came to the answer it did. This is a deep and developing field, particularly in recent times as AI capabilities has soared into the sky, along with fears of producing misaligned AI. For this reason, this field of study has come a long way specifically in the past few years and as AI gets more powerful, I expect this field will only grow more and more.
One big difficulty with interpretability is the parameters don’t typically neatly map to specific functions. Instead, you have different groups of parameters that we see to be correlated with certain concepts. For example, there isn’t a parameter you can turn up and down to adjust the level of happiness in the AI’s tone. Rather, you would find a large number of parameters which are correlated with happiness. Likewise, you would find that any given parameter is correlated with several things. This wish-washy allocation of memory to concepts make interpretability much more complicated.
While there is a long way to go, very solid progress has already been made. For example, we have been able to group certain parameters to particular meanings. For example, there might be a specific group of numbers that determine/strongly correlate with happiness for example. If we can come up with a system to categorize and label different parts of the LLM, we could even use AI to speed up this process.
AI guardrails are rules built deep within the LLM to prevent it from giving information that could cause harm to the user or to society. For example, if a user asks for methods on how to commit suicide, you probably do not want it to answer. Instead, you want your AI to reject the request and instead give information on how to get help because you do not want your LLM to cause or help people suffering with depression to commit suicide. Likewise, if someone asks how to grow an extremely contagious and fatal virus, you probably do not want it to give you that sort of information because it could lead to harm for society. These examples highlight extremely serious topics, and the serious harm that can be caused if guardrails are not taken seriously or done well.
Of course, companies do take these guardrails seriously, even if the CEO is a psychopath that does not care (they are not, just hypothetically), the potential PR disaster that could happen (and in some cases, already has!) from such harm would be extremely bad for the company. So, companies are incentivized to take these guardrails seriously.
The issue is that, even with the right intentions, these are incredibly hard to get 100% right for two reasons. Firstly, there are a countless number of edge cases. The wide variety of contexts these LLMs are given is unconceivably large. This makes it extremely difficult to come up with every single guard rail possible to prevent harm to the user or wider society. On top of that, the probabilistic nature of these LLMs make it very difficult to reliably guide them, and even more difficult to predict what exactly the output will be. For these reasons, guardrails are more of an art than a science.
Anyone who has tried prompt engineering will know the difficulties just listed. Prompt engineering is where you specialise & guardrail a chatbot using a system prompt, which is just a prompt always running in the background which the user cannot see. This system prompt is unchanging and consistent among all users and is always read by the chatbot before each user message. For example, if you have a chatbot that gives legal advice, you might give it the prompt “You are a world class expert in giving legal advice. Be professional and do not deviate from giving legal advice.” The prompts are typically much longer and complex; this is just an example. When these prompts are more complex, it is much harder to get the chatbot to do what you want it to do. For example, you might see this chatbot decides to give a user recommend a competitor if you ask who is best for giving legal advice. Then you add don’t mention competitors in the prompt. Then another user might ask for a phone number to call a human, and it makes up a completely random phone number. Then you add to the prompt, don’t make phone numbers up. You then notice it gives a real, but incorrect number. You can go in circles where you fix one issue by changing the system prompt, and it then either does not fully fix the problem or it creates another problem. Such is the struggle with trying to guardrail chatbots, and people creating frontier models will have this issue x1000.
An unsettling question is, what if the AI becomes misaligned and gets sophisticated enough to know it must lie to us about being aligned? If you want to find out if someone is lying, asking them if they are lying is not usually a good way of finding the truth! This brings up two questions. Firstly, is this a realistic scenario? And secondly, what is the course of action in this scenario. At the time of writing this blog, I do not think anyone knows for sure the answer to those questions. The ideal scenario is we never have to find out because we put sufficient preventative measures in place while developing the models to steer AI away from ever even getting to that point!
Ultimately, this is the big gamble with how cautiously you tread with AI development. On one hand, you have immense fiscal, political, and geopolitical pressure to keep building more and faster. On the other hand, you have the pressure of wanting to steer the massive, inertia ridden superyacht that is AI in a direction that is safe for humanity. On top of this, you also have to consider the risk of a power-hungry government developing AI faster than you and trying to imperialise. A very tough path to traverse!
The extent to which we should slow down to ensure AI alignment is a big debate. America, Europe and China all have their own set of regulations with their own set of pros and cons. Balancing safety with profits and technological leverage is not a quantifiable equation that can simply be “solved”, and striking the correct balance requires co-operation between these LLM producers and government. It is also essential to find a way to pass AI legislation in a timely manner to keep up with the very quickly evolving field, because at the moment it is too slow.
As for legislation itself, this has varied between different countries and is a crucial element to how AI technology will shape out in the coming future. The legalised bias and censorship within Chinese models will be very effective in indoctrinating their own populations, as well as any other countries or companies that adopt their models, particularly in less educated populations in developing countries who will be leaning on this technology to develop. The legislation is also essential for protecting people and their work from whatever it is powerful AI is capable of, and I hope humanity can band together to build something transformative, profitable but still safe because if we do, it will be one of the biggest accomplishments in human history.