<aside>
💡 Disclaimer:
I will never be able to cover every domain and open challenge.
I am not an expert in all of these fields; in fact, I am an expert in none of them, and given the amount of knowledge out there I probably never will be.
This list is meant as inspiration to explore the many different domains in ML and to give a feeling for what open challenges there are or could be.
Enjoy! 💛
P.S. If you have any suggestions for improving this list, feel free to email me! ([email protected])
</aside>
<aside>
💡 Reinforcement Learning
Reinforcement learning is a way to teach computers (agents) how to make good choices through trial and error. It’s like a baby learning how to walk. The baby gets a reward (smiling parents) for each step it takes, and it learns to move its body to get the biggest reward. 🤖🧠

There are still many challenges but also possibilities with RL!
- Efficient training: RL is very sample-inefficient, i.e. the agent takes many random steps that teach it little or nothing.
- Different areas of application: RL can be applied to many different problems! Most notably robotics, of course, but also language modelling (RLHF), and even controlling an actual plasma for nuclear fusion! If there is a simulator, we can apply RL.
  - Robotics includes autonomous driving, controlling an arm, very delicate control of finger movement, and pretty much everything that science fiction has come up with.
- Different input modalities: RL can work with a variety of inputs, each of which requires specific care and research!
  - Audio: imagine an autonomous car on the road when, all of a sudden, an ambulance activates its sirens. The car can’t see it, so it has to know how to interpret the audio, i.e. what the sound is, where it is coming from, and how to react accordingly.
  - Vision: Tesla is a very big proponent of driverless cars that rely only on vision (instead of lidar sensors). Can an agent process pure vision data to make decisions? Or do we first have to preprocess it through segmentation?
  - Touch & Pressure: imagine hands trying to pick up something delicate.
  - All other sensory data.
- Developing new science: People used to have very polarised views on RL because it was so inefficient at learning. But now more and more people argue that RL is necessary for AI agents to develop NEW things. E.g. an LLM can only get so far in discovering new mathematical proofs by learning from our existing knowledge. To go beyond that, LLMs need to explore new domains. That is where RL might play a significant role!
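
To make the trial-and-error idea a bit more concrete, here is a minimal sketch of tabular Q-learning on a toy corridor. Everything in it is an assumption for illustration and not taken from any particular library: the environment (six states in a row, reward 1 for reaching the last one), the function name `train_q_learning`, and the hyperparameters `alpha`, `gamma`, and `epsilon`.

```python
import random


def train_q_learning(n_states=6, episodes=300, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a toy corridor (hypothetical environment).

    The agent starts in state 0 and gets a reward of 1 when it reaches the last state.
    """
    rng = random.Random(seed)
    # q[state][action]: action 0 moves left, action 1 moves right.
    q = [[0.0, 0.0] for _ in range(n_states)]
    steps_per_episode = []
    for _ in range(episodes):
        state, steps = 0, 0
        while state != n_states - 1:
            # Explore with probability epsilon (or when both actions look equal), else exploit.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = max(0, state - 1) if action == 0 else state + 1
            reward = 1.0 if next_state == n_states - 1 else 0.0
            # Move the estimate towards reward + discounted best value of the next state.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state, steps = next_state, steps + 1
        steps_per_episode.append(steps)
    return q, steps_per_episode


q_table, steps = train_q_learning()
print(steps[0], steps[-1])  # steps needed in the first and in the last episode
```

In the early episodes the agent mostly wanders around at random before it stumbles on the reward, which is the sample inefficiency from the first bullet in miniature.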
</aside>
<aside>
💡 NLP
Natural Language Processing is a way to teach computers how to understand and talk like humans. It helps computers understand what we write and then respond in a way that makes sense. It’s like teaching a robot how to talk like us! 🤖🗣️🧠

- Hallucinations: This is a very common problem. Large Language Models tend to be confidently wrong at times. That means they come up with facts that are simply false and “are convinced” that they are true. It is still a challenge to teach them to abstain, i.e. to admit that they don’t know something (which is not exactly the same problem as making up facts in general).
- Reliability: Related to hallucinations, reliability often refers to the model not being overly sensitive to its input, i.e. for similar inputs the responses should be the same or at least equally similar.
- Reasoning on Large Sequence Lengths: The input lengths that models can handle are becoming longer and longer, so long that they can process entire books at once! But with such long sequences, they do not always find the information required to follow an instruction, e.g. to answer a question whose answer is indeed in the input context.
- Efficiency: In traditional Transformer-based LLMs, the cost of self-attention scales quadratically with the input length, because every token attends to every other token. There have already been amazing developments in making Transformers more and more efficient, but there is always room for improvement!
- NLP for Low-Resource Scenarios: This is a simple but challenging problem. Current LLMs work very well on resource-rich languages, e.g. English, i.e. languages with a lot of text on the internet. But rarer languages are less present in the training data of LLMs, and thus those LLMs perform poorly on the respective benchmarks.
- Evaluation: Speaking of benchmarks, the ones that LLMs are currently tested on are pretty much all flawed. Even one of the most respected ones, MMLU, has a lot of nonsensical questions and answers. Not only are the benchmarks themselves imperfect, but so are the metrics. Imagine you are trying to evaluate an LLM on question answering tasks. The LLM might give the right answer, but not in exactly the same words as the ground truth label. Then the score might be lower (or higher) than it actually should be.
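
The metric problem from the last bullet can be sketched with two common question answering scores, exact match and token-level F1. This is a simplified sketch and not the official scoring script of any benchmark: the `normalize` helper, the function names, and the example question are my own assumptions.

```python
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation, and split on whitespace."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return cleaned.split()


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between prediction and reference."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "Paris"
prediction = "The capital of France is Paris."

print(exact_match(prediction, reference))         # 0.0
print(round(token_f1(prediction, reference), 3))  # 0.286
```

The answer is correct, yet exact match gives it zero and F1 punishes it for being a full sentence. That is exactly the kind of mismatch between metric and actual quality described above.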
</aside>
<aside>
💡 Computer Vision
Computer Vision is the field of teaching computers to see the world. We want computers to recognise objects and tell them apart. We want them to be able to interact with the world based on what they see and what we want them to do. One of the biggest challenges is performing CV reliably and efficiently. A self-driving car should always be able to recognise a stop sign, no matter whether it is day or night, sunny or raining, or whether the sign is slightly tilted or not.
- Object detection: You might think this is a solved problem: just apply YOLO and we are good. But this is far from true. Depending on the benchmark, current object detection models still struggle. This is especially true for out-of-distribution examples, i.e. examples unlike anything the model was trained on. Furthermore, it is annoying to always have a fixed set of classes to detect. Once we want to add one more, we have to retrain the whole model! There is still much to do!
- Semantic Segmentation: Semantic segmentation is the task of assigning a class label to every pixel of an image, which outlines the exact shape of the objects in it. This is very challenging, especially because it requires a lot of expensive annotations. Developing techniques to make models more efficient, make data generation more efficient, or even create reliable synthetic data are interesting areas to explore!
- Video processing: Now imagine all the tasks that are performed on images, but on video. The sheer increase in compute is a huge challenge. You might want to develop new methods that make use of the consistency present in videos. E.g. object tracking is the task of detecting and tracking an object in a video.
- Depth estimation: This one is very overlooked, although it is one of the most important areas for VR and AR! If you want virtual objects to interact with your surroundings, they have to know the 3D structure of those surroundings, potentially from nothing more than a 2D video stream!
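
To give a feeling for what a semantic segmentation model outputs and why annotations are so expensive, here is a small sketch with a 4x4 "image" where every pixel carries a class label. The masks are made up, and scoring them with intersection over union (IoU) is my own choice of a commonly used metric, not something prescribed by a specific benchmark. The name `per_class_iou` is hypothetical.

```python
def per_class_iou(pred, target, num_classes):
    """Intersection over union for each class, counted pixel by pixel."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                # Both masks agree: the pixel is in the intersection and the union.
                intersection[p] += 1
                union[p] += 1
            else:
                # The masks disagree: the pixel only extends the union of both classes.
                union[p] += 1
                union[t] += 1
    return [i / u if u else float("nan") for i, u in zip(intersection, union)]


# 0 = background, 1 = object. A human has to label every single pixel of the target.
target = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
# A made-up prediction that found the object but shifted it one pixel to the right.
prediction = [
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]

ious = per_class_iou(prediction, target, num_classes=2)
print([round(v, 3) for v in ious])  # [0.714, 0.333]
```

A shift of a single pixel already drops the object IoU to one third, which hints at how precise both the annotations and the predictions need to be.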
</aside>
<aside>
💡 Audio Processing
Audio processing deals with teaching a model how to handle audio data. This includes having audio as an input, but also audio as an output!
What is most interesting, in my opinion, is how to develop techniques that go from audio to audio without parsing the input audio to text (as an intermediate step) and then based on the text generate audio. This might be particularly useful for voice assistants!
- Audio Classification: This task speaks for itself. You want to classify sounds. Nothing too crazy, but depending on the data you have at hand, it can definitely pose a challenge.
- Automatic speech recognition: ASR uses ML to process human speech into readable text.
- Audio generation: Imagine you could write out a text prompt that describes the music you would like to generate! This is already possible (to some extent). Current models perform well, but we can definitely still tell that the music was AI-generated. Also, the models can only generate clips of up to 30, 60, or 120 seconds.
Audio generation can also mean generating voice from text! That field is also quite promising, but also dangerous! As with image generation (see multimodal learning), we also want to develop techniques to watermark AI-generated voices or even music.
- Real time translation: Up until now, most tasks involve audio to text. But what if we want to directly process a voice as input and output another voice? Do we need to first parse the speech into text and then generate the speech in a new voice based on the text? What about directly going from audio to audio, without the intermediate step of text? Think of the magical devices where you can speak into a microphone and the model directly outputs what you said in another language, in real time as far as the language allows it given its grammar.
</aside>
<aside>
💡 Multimodal Learning
Multimodal learning just means teaching a model to understand multiple modalities (text, image, video, audio, depth, different sensor data, etc.) and solve tasks accordingly. Most research focuses on two modalities but there is already progress in teaching models to handle even more modalities. But everything should be tackled step-by-step!
- Text to Image generation: Although this field is rapidly evolving and we are almost at the point of not being able to differentiate a generated image from a real one, we are not quite there yet. Also, a challenge that comes with this progress is exactly that: not being able to tell which image is real and which is fake. Furthermore, it might be interesting to have different ways of conditioning the generation process, e.g. sketching something and then generating it following a further prompt.
- Recognising generated data: As mentioned, we are arriving at the point of being able to generate an image of a famous person doing something he or she has never actually done. That is a problem that needs to be tackled!
- More multimodal generation: You already know the deal, I don’t think I need to elaborate more. Text to video, text to audio, audio to video, depth map to image. Get creative :)
- Image to text: By this I mean a variety of different image to text tasks such as image captioning, image question answering, image instruction following, chatting about an image, and sophisticated few-shot learning. All of these tasks are still open to be properly solved. And once again, once you go to resource-poor languages, the performance of pretty much any large existing model takes a big hit. So that is always an option for research and development.
- Video to text: Same as image to text, but even more challenging, simply because it is completely infeasible to process an entire 2-minute, 30-minute, or 120-minute video and prompt an LLM to answer a question. At least for now. On top of that, there are tasks like video summarisation, chaptering, or (dense) captioning.
- Moment retrieval in videos: You have a video or even a movie and now want to retrieve a moment given a text query. E.g. you have a 2-hour-long recording of your summer vacation and want to find the moments where you were swimming in the pool.
- Cross modal retrieval: Imagine you have a database of images and want to retrieve certain ones given a text query. Now you can take that further and imagine any type of modality as query input and output.
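
One common way to approach cross-modal retrieval is to embed all modalities into a shared vector space and rank by similarity. The sketch below assumes such embeddings already exist. The three-dimensional vectors, the file names, and the `retrieve` helper are invented for illustration, whereas a real system would get the vectors from trained text and image encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, database, top_k=2):
    """Return the names of the top_k database items most similar to the query."""
    scored = [(cosine_similarity(query_embedding, emb), name) for name, emb in database.items()]
    return [name for _, name in sorted(scored, reverse=True)[:top_k]]


# Hypothetical image embeddings, hand-written instead of coming from an image encoder.
image_embeddings = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "pool.jpg": [0.7, 0.6, 0.1],
    "mountain.jpg": [0.0, 0.2, 0.9],
}
# Hypothetical text embedding for the query "swimming in the pool".
text_query_embedding = [0.6, 0.7, 0.0]

print(retrieve(text_query_embedding, image_embeddings))  # ['pool.jpg', 'beach.jpg']
```

Since the ranking only compares vectors, the same code works for any pair of modalities, as long as their encoders map into the same space.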
</aside>
<aside>
💡 Graph Neural Networks
Graph Neural Networks (GNNs) are a type of deep learning model that can learn from graph-structured data, such as social networks, molecular structures, or traffic networks. GNNs have shown great potential for various tasks, such as node classification, link prediction, graph generation, and graph reasoning.
- Scaling of GNNs: How can we design efficient and scalable GNNs that can handle large-scale and dynamic graphs? I recommend reading the linked paper; there you can learn much more than I can possibly cover here.
- Generalization: How can we improve the expressive power and generalization ability of GNNs for complex graph problems? For now, GNNs are usually applied to one specific task and domain. Their generalization to new domains and tasks is still very under-explored!
- GNNs + X: How can we combine GNNs with other deep learning models, such as convolutional neural networks, recurrent neural networks, or reinforcement learning? Many GNNs are applied to static structures that don’t change over time. But traffic networks, for example, change every second, so we would ideally want a GNN that can handle data that changes over time while considering the past to predict the future.
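
For readers who have never seen a GNN, the core idea of learning from graph-structured data can be sketched as one round of message passing: every node averages its own features with those of its neighbours, applies a linear map, and then a ReLU. This is a simplified mean-aggregation layer of my own, not the exact layer of any specific paper or library. The function name, the tiny graph, and the fixed `weight` matrix (which would be learned in practice) are assumptions.

```python
def message_passing_step(features, edges, weight):
    """One simplified GNN layer: mean over neighbours (incl. self), linear map, ReLU."""
    n = len(features)
    neighbours = {i: {i} for i in range(n)}  # every node also listens to itself
    for u, v in edges:  # undirected edges
        neighbours[u].add(v)
        neighbours[v].add(u)

    updated = []
    for i in range(n):
        dim = len(features[i])
        # Aggregate: average the feature vectors of the neighbourhood.
        mean = [sum(features[j][d] for j in neighbours[i]) / len(neighbours[i]) for d in range(dim)]
        # Transform: multiply by the weight matrix (dim x out_dim).
        transformed = [sum(mean[d] * weight[d][k] for d in range(dim)) for k in range(len(weight[0]))]
        # Non-linearity.
        updated.append([max(0.0, x) for x in transformed])
    return updated


# A path graph 0 - 1 - 2 with two features per node.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [(0, 1), (1, 2)]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for a learned weight matrix

updated = message_passing_step(features, edges, identity)
print(updated[0])  # [0.5, 0.5]
```

Stacking several such steps lets information travel further across the graph, which is also where the scaling question from the first bullet starts to bite.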
</aside>
<aside>
💡 Applied AI
This domain of AI focuses more on applying existing frameworks to real-world data. The problems you will be solving mostly involve data engineering, data analysis, and making the system reliable. Reliability is the most challenging differentiator from academic ML research: there it can be okay to have 90% accuracy, but for an autonomous car you really want 99.99%. Those last few percentage points are often the most difficult!
- Chatbots in industry: Imagine you don’t have to wait for 30 minutes on the phone to get through to your doctor to make an appointment or ask for another prescription… Imagine you could just have a call with, or write to, an AI assistant that has access to the patient database and can handle that with you. No more waiting. The same can be applied to literally everything that has something like a customer service. Lawyers can have a legal assistant to help them speed up their work. LLMs for education (personal tutors), for finance, engineering, science… Just be creative here.
- CV in Industry: The easiest example is to think of the possibilities of CV in medicine. How much cheaper and faster do you think a treatment could be if you don’t need a radiologist who looks at dozens of images per patient? CV can be used in agriculture to automate harvesting, reducing labor cost and thus making produce cheaper. Again, creativity is the limit.
- GNNs in industry: Everything that can be represented as a graph needs specific treatment. E.g. the street network is a graph: think of AI for routing, better ETA estimation, estimating traffic flow, or predicting traffic congestion. Molecules are graphs too: think of predicting the toxicity of a molecule, its physical properties, or its effectiveness or properties when reacting with a different molecule. All of that is important in all the stages of drug development.
</aside>
<aside>
💡 Evolutionary Learning
Evolutionary learning is a type of machine learning that uses evolutionary algorithms to optimize the parameters, structure, or behaviour of learning models. Who said gradient descent is the only and right way to learn? Evolutionary algorithms are inspired by natural evolution and use mechanisms such as selection, mutation, crossover, and reproduction. Evolutionary learning can be applied to various machine learning tasks, such as neural networks, reinforcement learning, clustering, and ensemble methods.
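
The mechanisms named above (selection, mutation, crossover, and reproduction) fit into a few lines of code. The following is a minimal sketch of one possible evolutionary algorithm, not a reference implementation: the function `evolve`, the toy fitness function `negative_sphere`, and all hyperparameters are assumptions chosen for illustration.

```python
import random


def evolve(fitness, dim=3, pop_size=20, generations=50, mutation_std=0.1, seed=0):
    """Maximize `fitness` over real-valued vectors without using any gradients."""
    rng = random.Random(seed)
    population = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            # Crossover: every gene comes from one of the two parents.
            child = [rng.choice(pair) for pair in zip(mum, dad)]
            # Mutation: add a little Gaussian noise to every gene.
            child = [gene + rng.gauss(0, mutation_std) for gene in child]
            children.append(child)
        # Reproduction: parents and children form the next generation.
        population = parents + children
    return max(population, key=fitness)


def negative_sphere(x):
    """Toy fitness function with its maximum of 0 at the origin."""
    return -sum(v * v for v in x)


best = evolve(negative_sphere)
print(best, negative_sphere(best))
```

Because the parents always survive into the next generation, the best solution found so far is never lost. Note that `fitness` is only ever evaluated, never differentiated, which is what sets this apart from gradient descent.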
I would love to elaborate more on the open challenges, but I am too unfamiliar with the topic. I recommend reading the article I reference and doing further research.
</aside>
<aside>
💡 Meta Learning
Meta learning is a type of machine learning that learns how to learn. It can use the output or the experience of other machine learning models to improve its own performance. Meta learning can be used for tasks like ensemble learning, model selection, algorithm tuning, and multi-task learning.
The same applies to Meta Learning. I am far from familiar with the topic. The reference above is an article that I can recommend reading!
</aside>