AI Detectors 101: Essential Technical Terms You Should Know
Are you looking to get into the technical side of AI detectors and how they work? Or maybe you just want to put your best foot forward for an interview. It’s not good if you sound generic or talk in circles out of nervousness (been there, hope it doesn’t happen to you too).
Well, then this blog is a good place to start, or at least to get a basic technical grasp of a few concepts. More importantly, you can judge the quality of an AI detector by delving deeper into these concepts. So, let’s start with the first term to keep in mind:
Tokenization
Tokenization is the process by which text is broken down into smaller units called tokens, which may be whole words or subwords. These tokens are then converted into vectors (numerical representations of the words). From here, the next part of the process begins, which is…
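To make this concrete, here is a minimal sketch of the idea: split text into tokens, then map each unique token to an integer ID. (This toy whitespace tokenizer is purely illustrative; real detectors use subword schemes like BPE or WordPiece.)

```python
def tokenize(text: str) -> list[str]:
    # Toy tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def build_vocab(tokens: list[str]) -> dict[str, int]:
    # Map each unique token to an integer ID, preserving first-seen order.
    return {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}

tokens = tokenize("AI detectors break text into tokens")
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]
print(tokens)  # ['ai', 'detectors', 'break', 'text', 'into', 'tokens']
print(ids)     # [0, 1, 2, 3, 4, 5]
```

The integer IDs are what get looked up in an embedding table in the next step.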
Embeddings
In simple words, embeddings help AI understand the nuanced meanings of words, which change according to context. This grasp of nuance helps AI detectors distinguish between AI and human content better. It works by mapping words to vectors in a high-dimensional space, where each dimension captures some aspect of the word’s meaning. Words are converted into these vectors using techniques like Word2Vec, GloVe, and BERT. (What are these? That’s a story for another time.)
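Structurally, an embedding layer is just a lookup table from words to vectors. The sketch below shows that structure with random placeholder vectors; in practice the numbers come from training with methods like Word2Vec, GloVe, or BERT, and the dimension is in the hundreds rather than 4.

```python
import random

random.seed(0)
DIM = 4  # toy dimension; real embeddings use hundreds of dimensions

vocab = ["bank", "river", "money"]

# Random placeholder vectors, just to show the lookup structure.
# Trained embeddings would place "bank" nearer "money" than "river"
# (or vice versa, depending on context).
embeddings = {word: [random.uniform(-1, 1) for _ in range(DIM)] for word in vocab}

vector = embeddings["bank"]
print(len(vector))  # 4
```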
Perplexity
If you were thinking of splitting the word apart to guess its meaning, you are off to a bad start. Perplexity refers to the predictability of the provided text: low perplexity means high predictability, and vice versa. AI detectors check perplexity to determine whether content is AI-generated. AI-generated content usually scores low on perplexity, while human writing, with its varied vocabulary and shifting context, scores high.
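One common way to compute it: perplexity is the exponential of the average negative log-probability a language model assigns to each token. The sketch below takes made-up per-token probabilities (a real detector would get these from a language model) and shows that confidently predicted text yields lower perplexity.

```python
import math

def perplexity(token_probs: list[float]) -> float:
    # Perplexity = exp of the average negative log-probability per token.
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

predictable = [0.9, 0.8, 0.9, 0.85]  # model finds each token very likely
surprising  = [0.2, 0.1, 0.3, 0.15]  # model is repeatedly surprised

print(perplexity(predictable) < perplexity(surprising))  # True
```

A sequence the model predicts perfectly (probability 1.0 for every token) has perplexity exactly 1, the theoretical floor.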
Burstiness
Burstiness refers to the variability of sentence lengths, and consequently sentence structures, in a given text. Human writing varies in sentence length and structure far more than AI-generated text; in other words, human content is more “bursty”. This is another aspect AI detectors look at when flagging whether a given text is AI-generated.
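A rough proxy for burstiness is the spread of sentence lengths. This sketch (my own simplification, not any detector’s actual formula) uses the standard deviation of lengths in words, so uniform sentences score 0 and varied ones score higher.

```python
import statistics

def burstiness(text: str) -> float:
    # Split on sentence-ending punctuation, then measure the spread
    # (population standard deviation) of sentence lengths in words.
    normalized = text.replace("!", ".").replace("?", ".")
    sentences = [s.strip() for s in normalized.split(".") if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    return statistics.pstdev(lengths)

human = "Short one. Then a much longer, winding sentence with many extra words in it. Tiny."
ai = "This sentence has seven words in it. That sentence has seven words too there."

print(burstiness(human) > burstiness(ai))  # True
```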
Overfitting
Overfitting is what happens when an AI detector learns its training data too well and then performs poorly on new, unseen data. If you wondered whether there is a concept of underfitting as well, you are absolutely right: underfitting is when the model is too simple to pick up the patterns at all, so it performs poorly even on the training data. Both concepts are extremely important, because together they determine the quality of an AI detector.
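The extreme case of overfitting is pure memorization: a model that stores its training examples verbatim scores perfectly on them but has nothing to say about anything new. This hypothetical toy (not a real detector) makes the failure mode visible.

```python
# Training data memorized verbatim: the extreme case of overfitting.
train = {"the cat sat": "human", "as an ai language model": "ai"}

def memorizer_predict(text: str) -> str:
    # Perfect recall of training examples, zero generalization.
    return train.get(text, "unknown")

print(memorizer_predict("the cat sat"))  # 'human'   (100% on training data)
print(memorizer_predict("the dog sat"))  # 'unknown' (useless on new data)
```

A good detector sits between the two failure modes: complex enough to capture real patterns (avoiding underfitting), but regularized enough not to memorize its training set (avoiding overfitting).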
Cosine similarity
This metric measures the cosine of the angle between two vectors in the high-dimensional space we spoke of previously. Since AI interprets the world around it mathematically, it uses metrics like this to gauge the semantic closeness of words or phrases: vectors with similar meanings point in similar directions (similarity close to 1), while vectors with unrelated meanings point far apart (similarity near 0 or below).
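The formula is just the dot product of the two vectors divided by the product of their lengths. Here is a self-contained sketch with made-up 3-dimensional vectors (real embedding vectors would be much longer and come from a trained model).

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors, made up for illustration.
king  = [0.9, 0.8, 0.1]
queen = [0.85, 0.75, 0.2]
apple = [0.1, 0.2, 0.9]

print(cosine_similarity(king, queen) > cosine_similarity(king, apple))  # True
```

Identical directions give a similarity of exactly 1, which is why the metric is a handy, length-independent measure of closeness.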
This is just a peek into the world of AI detectors. While human beings understand the meaning of words through their senses, AI needs everything converted into numeric form before it can process the different types of data. If you want to test out how an AI detector works, try HireQuotient’s AI detector; you don’t need to sign up, you can input text of up to 25,000 words, and it is forever free.