<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.1">Jekyll</generator><link href="https://praful932.dev/feed.xml" rel="self" type="application/atom+xml" /><link href="https://praful932.dev/" rel="alternate" type="text/html" /><updated>2026-04-04T08:32:45+00:00</updated><id>https://praful932.dev/feed.xml</id><title type="html">Praful’s Almanac</title><subtitle>Personal Blog where I share about things that I experience and learn</subtitle><author><name>Praful Mohanan</name></author><entry><title type="html">Find better generation parameters for your LLMs using llmsearch</title><link href="https://praful932.dev/blog-4-llmsearch/" rel="alternate" type="text/html" title="Find better generation parameters for your LLMs using llmsearch" /><published>2024-06-09T00:00:00+00:00</published><updated>2024-06-09T00:00:00+00:00</updated><id>https://praful932.dev/blog-4-llmsearch</id><content type="html" xml:base="https://praful932.dev/blog-4-llmsearch/"><![CDATA[<p align="center">
<img src="/assets/images/blog-4-llmsearch/teaser.png" alt="llmsearch teaser" style="width:1000px;" />
</p>

<h2 id="the-backstory">The Backstory</h2>
<p>Back when <a href="https://huggingface.co/EleutherAI/gpt-j-6b">GPT-J from EleutherAI</a> was released, I remember using it for a question-answer extraction task over a span of text using few-shot learning (you provide a few examples in the prompt before the actual question that you want answered). It was a small 6B model and in my initial trials it did not work very well, so I started playing with the generation parameters of the model. I tried many of them manually until I reached a configuration that seemed to do much better than what I originally started with. These are the generation parameters that I manually found for the task.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span>
    <span class="sh">"</span><span class="s">max_new_tokens</span><span class="sh">"</span> <span class="p">:</span> <span class="mi">15</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">min_new_tokens</span><span class="sh">"</span> <span class="p">:</span> <span class="mi">5</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">num_beams</span><span class="sh">"</span> <span class="p">:</span> <span class="mi">3</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">use_cache</span><span class="sh">"</span> <span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">no_repeat_ngram_size</span><span class="sh">"</span> <span class="p">:</span> <span class="mi">4</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>
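<p>For context, the few-shot setup looked roughly like this: a handful of worked examples followed by the actual question, with the model expected to complete the final answer (a reconstructed sketch; the passages and questions here are made up for illustration, not the original prompt):</p>

```python
# hypothetical few-shot prompt for span-based question answering;
# the example passages and questions are invented for illustration
examples = [
    ("Paris is the capital of France.", "What is the capital of France?", "Paris"),
    ("The Nile flows through Egypt.", "Which river flows through Egypt?", "The Nile"),
]

def build_few_shot_prompt(passage, question):
    parts = []
    for ex_passage, ex_question, ex_answer in examples:
        parts.append(f"Text: {ex_passage}\nQuestion: {ex_question}\nAnswer: {ex_answer}")
    # the final block is left open so the model completes the answer
    parts.append(f"Text: {passage}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)

print(build_few_shot_prompt("GPT-J was released by EleutherAI.", "Who released GPT-J?"))
```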
<p>I thought to myself: there should be an easier way to do this.
Generation parameters are more than icing on the cake for a language model, particularly a small one; they can make or break your model. In fact, many of the latest model releases now include a predefined set of generation params that the authors recommend; <a href="https://huggingface.co/meta-llama/Meta-Llama-3-8B/commit/1460c22666392e470910ce3d44ffeb2ab7dbd4df">here</a> is an example for LLAMA 3 8B that was released on huggingface.</p>

<p>This motivated me to build <code class="language-plaintext highlighter-rouge">llmsearch</code>, an easier way of finding generation parameters using the familiar <code class="language-plaintext highlighter-rouge">scikit-learn</code> interface.</p>

<p>Repository - <a href="https://github.com/Praful932/llmsearch"><img src="https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&amp;logo=github&amp;logoColor=white" alt="GitHub" /></a></p>

<p>Documentation - <a href="https://llmsearch.netlify.app"><img src="https://img.shields.io/badge/netlify-%23000000.svg?style=for-the-badge&amp;logo=netlify&amp;logoColor=#00C7B7" alt="Netlify" /></a></p>

<h2 id="main-arc-step-by-step-guide-to-use-llmsearch"><del>Main Arc</del> Step-by-Step Guide to use <code class="language-plaintext highlighter-rouge">llmsearch</code></h2>

<p>The following example demonstrates <code class="language-plaintext highlighter-rouge">llmsearch</code> on a LLAMA-3 model, specifically <code class="language-plaintext highlighter-rouge">casperhansen/llama-3-8b-instruct-awq</code>, on the well-known <code class="language-plaintext highlighter-rouge">samsum</code> dataset. We will use a quantized <code class="language-plaintext highlighter-rouge">AWQ</code> model.</p>

<p>Open the notebook <a href="https://colab.research.google.com/github/Praful932/llmsearch/blob/main/examples/llmsearch_quickstart.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" /></a> if you want to follow along.</p>

<h3 id="install-dependencies">Install dependencies</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># install llmsearch</span>
<span class="o">!</span>pip <span class="nb">install </span>llmsearch[pynvml] <span class="nt">-q</span>

<span class="c"># pinning to specific versions to avoid import issues - https://github.com/casper-hansen/AutoAWQ/issues/374</span>
<span class="c"># only required if using awq model</span>
<span class="o">!</span>pip <span class="nb">install </span><span class="nv">transformers</span><span class="o">==</span>4.38.2 <span class="nt">-q</span>
<span class="o">!</span>pip <span class="nb">install </span>torch@https://download.pytorch.org/whl/cu121/torch-2.2.0%2Bcu121-cp310-cp310-linux_x86_64.whl#sha256<span class="o">=</span>c441021672ebe2e5afbdb34817aa85e6d32130f94df2da9ad4cb78a9d4b81370 <span class="nt">-q</span>
<span class="o">!</span>pip <span class="nb">install </span><span class="nv">autoawq</span><span class="o">==</span>0.2.4 <span class="nv">autoawq_kernels</span><span class="o">==</span>0.0.6 <span class="nt">-q</span>

<span class="c"># install dependencies required for this example</span>
<span class="o">!</span>pip <span class="nb">install </span><span class="nv">accelerate</span><span class="o">==</span>0.30.1 <span class="nv">py7zr</span><span class="o">==</span>0.21.0 <span class="nv">evaluate</span><span class="o">==</span>0.4.0 <span class="nv">rouge_score</span><span class="o">==</span>0.1.2 <span class="nt">-q</span>
</code></pre></div></div>

<h3 id="import-required-libraries">Import required libraries</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Autocompletion
</span><span class="o">%</span><span class="n">config</span> <span class="n">Completer</span><span class="p">.</span><span class="n">use_jedi</span> <span class="o">=</span> <span class="bp">False</span>

<span class="c1"># Autoreload
</span><span class="o">%</span><span class="n">load_ext</span> <span class="n">autoreload</span>
<span class="o">%</span><span class="n">autoreload</span> <span class="mi">2</span>

<span class="kn">import</span> <span class="n">awq</span>
<span class="kn">import</span> <span class="n">torch</span>
<span class="kn">import</span> <span class="n">transformers</span>
<span class="kn">import</span> <span class="n">llmsearch</span>
<span class="kn">import</span> <span class="n">evaluate</span>
<span class="kn">import</span> <span class="n">datasets</span>
<span class="kn">import</span> <span class="n">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="kn">from</span> <span class="n">awq</span> <span class="kn">import</span> <span class="n">AutoAWQForCausalLM</span>
<span class="kn">from</span> <span class="n">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">GridSearchCV</span>
<span class="kn">from</span> <span class="n">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span><span class="p">,</span> <span class="n">AutoModelForCausalLM</span><span class="p">,</span> <span class="n">StoppingCriteriaList</span>

<span class="kn">from</span> <span class="n">llmsearch.tuner</span> <span class="kn">import</span> <span class="n">Tuner</span>
<span class="kn">from</span> <span class="n">llmsearch.scripts.stopping_criteria</span> <span class="kn">import</span> <span class="n">MultiTokenStoppingCriteria</span>
</code></pre></div></div>

<p>Set some variables that we will use later.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">seed</span> <span class="o">=</span> <span class="mi">42</span>
<span class="n">batch_size</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">num_samples</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">device</span> <span class="o">=</span> <span class="sh">"</span><span class="s">cuda:0</span><span class="sh">"</span>
</code></pre></div></div>

<h3 id="load-model--dataset">Load model &amp; dataset</h3>

<p>Load the <code class="language-plaintext highlighter-rouge">casperhansen/llama-3-8b-instruct-awq</code> model with the <code class="language-plaintext highlighter-rouge">refs/pr/6</code> revision. <a href="https://huggingface.co/casperhansen/llama-3-8b-instruct-awq/discussions/6">This revision</a> has the right <code class="language-plaintext highlighter-rouge">EOS</code> token configured as per the <a href="https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/tree/main">official LLAMA 3 repository</a>; not using the correct token mapping produces incorrect output from the model. We will use the <code class="language-plaintext highlighter-rouge">samsum</code> dataset to run the generation hyper-parameter search on.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model_id</span> <span class="o">=</span> <span class="sh">"</span><span class="s">casperhansen/llama-3-8b-instruct-awq</span><span class="sh">"</span>
<span class="n">revision</span> <span class="o">=</span> <span class="sh">"</span><span class="s">refs/pr/6</span><span class="sh">"</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="nf">from_pretrained</span><span class="p">(</span><span class="n">model_id</span><span class="p">,</span><span class="n">revision</span> <span class="o">=</span> <span class="n">revision</span><span class="p">)</span>
<span class="n">tokenizer</span><span class="p">.</span><span class="n">padding_side</span> <span class="o">=</span> <span class="sh">"</span><span class="s">left</span><span class="sh">"</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">AutoAWQForCausalLM</span><span class="p">.</span><span class="nf">from_quantized</span><span class="p">(</span>
        <span class="n">model_id</span><span class="p">,</span> <span class="n">fuse_layers</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">device_map</span><span class="o">=</span><span class="p">{</span><span class="sh">""</span><span class="p">:</span> <span class="n">device</span><span class="p">},</span> <span class="n">revision</span> <span class="o">=</span> <span class="n">revision</span>
    <span class="p">)</span>

<span class="n">dataset</span> <span class="o">=</span> <span class="n">datasets</span><span class="p">.</span><span class="nf">load_dataset</span><span class="p">(</span><span class="sh">"</span><span class="s">samsum</span><span class="sh">"</span><span class="p">)[</span><span class="sh">'</span><span class="s">train</span><span class="sh">'</span><span class="p">]</span>
<span class="n">sample_dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="nf">shuffle</span><span class="p">(</span><span class="n">seed</span> <span class="o">=</span> <span class="n">seed</span><span class="p">).</span><span class="nf">select</span><span class="p">(</span><span class="nf">range</span><span class="p">(</span><span class="n">num_samples</span><span class="p">))</span>

<span class="c1"># These are required to make the model end the sequence correctly - https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct#transformers-automodelforcausallm
</span><span class="n">terminators</span> <span class="o">=</span> <span class="p">[</span>
    <span class="mi">128001</span><span class="p">,</span>
    <span class="mi">128009</span><span class="p">,</span>
<span class="p">]</span>
</code></pre></div></div>

<h3 id="define-dataset-preprocessor-and-metric">Define dataset preprocessor and metric</h3>
<p>For a particular dataset, we can define columns that will be used for evaluation (<code class="language-plaintext highlighter-rouge">eval_cols</code>) and columns that will be used while running inference (<code class="language-plaintext highlighter-rouge">input_cols</code>).
Once you have decided on a metric, an evaluation function needs to be defined that takes in two arguments, <code class="language-plaintext highlighter-rouge">y_true : list</code> &amp; <code class="language-plaintext highlighter-rouge">y_pred : list</code>. <code class="language-plaintext highlighter-rouge">y_pred</code> is what the model predicts; <code class="language-plaintext highlighter-rouge">y_true</code> contains (for each item in the list) the evaluation columns (<code class="language-plaintext highlighter-rouge">eval_cols</code>) defined in your <code class="language-plaintext highlighter-rouge">Tuner</code> object, more on this later.</p>
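<p>As a minimal illustration of this contract, a scorer could look like the toy exact-match metric below (purely hypothetical; the actual example further down uses ROUGE):</p>

```python
# toy scorer matching the (y_true, y_pred) contract: y_true is a list of dicts
# holding the eval_cols, y_pred is a list of model outputs
def exact_match_score(y_true: list, y_pred: list) -> float:
    matches = [
        int(item["summary"].strip() == pred.strip())
        for item, pred in zip(y_true, y_pred)
    ]
    return sum(matches) / len(matches)

score = exact_match_score(
    [{"summary": "Sue doesn't watch JK any more."}, {"summary": "Simon is painting."}],
    ["Sue doesn't watch JK any more.", "Simon paints cupboards."],
)
print(score)  # 0.5 - one of the two predictions matches exactly
```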

<p>Your dataset preprocessor should take in a single item from your dataset and return a <code class="language-plaintext highlighter-rouge">string</code> that is ready to be tokenized and can be passed directly into the model. In this example we convert an item of the dataset into the <a href="https://huggingface.co/docs/transformers/main/en/chat_templating">chat template</a> format. The dataset preprocessor function should take in a <code class="language-plaintext highlighter-rouge">tokenizer</code> and <code class="language-plaintext highlighter-rouge">kwargs</code>; the <code class="language-plaintext highlighter-rouge">kwargs</code> will contain the keys that you defined as <code class="language-plaintext highlighter-rouge">input_cols</code> when creating the <code class="language-plaintext highlighter-rouge">Tuner</code> object, more on this in the next section.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># create a function that can be used for evaluation, should take in y_true (list[dict]), y_pred (list) and return a metric
</span><span class="n">rouge</span> <span class="o">=</span> <span class="n">evaluate</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="sh">'</span><span class="s">rouge</span><span class="sh">'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_rouge_score</span><span class="p">(</span><span class="n">y_true</span> <span class="p">:</span> <span class="nb">list</span><span class="p">,</span> <span class="n">y_pred</span> <span class="p">:</span> <span class="nb">list</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="nf">mean</span><span class="p">(</span><span class="n">rouge</span><span class="p">.</span><span class="nf">compute</span><span class="p">(</span><span class="n">predictions</span><span class="o">=</span><span class="n">y_pred</span><span class="p">,</span> <span class="n">references</span><span class="o">=</span><span class="p">[</span><span class="n">item</span><span class="p">[</span><span class="sh">'</span><span class="s">summary</span><span class="sh">'</span><span class="p">]</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">y_true</span><span class="p">],</span> <span class="n">use_stemmer</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">use_aggregator</span><span class="o">=</span><span class="bp">False</span><span class="p">)[</span><span class="sh">'</span><span class="s">rouge2</span><span class="sh">'</span><span class="p">])</span>

<span class="c1"># Define a dataset preprocessor that is called for every example in the dataset separately - Should take in tokenizer &amp; kwargs and return a string that can be input directly to the model, here we apply chat template which most decoder models use
</span><span class="k">def</span> <span class="nf">sample_to_chat_format</span><span class="p">(</span><span class="n">tokenizer</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
    <span class="n">messages</span> <span class="o">=</span> <span class="p">[</span>
        <span class="p">{</span>
            <span class="sh">'</span><span class="s">role</span><span class="sh">'</span> <span class="p">:</span> <span class="sh">"</span><span class="s">system</span><span class="sh">"</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">content</span><span class="sh">'</span> <span class="p">:</span> <span class="sh">"</span><span class="s">You are a helpful AI assistant.</span><span class="sh">"</span>
        <span class="p">},</span>
        <span class="p">{</span>
            <span class="sh">'</span><span class="s">role</span><span class="sh">'</span> <span class="p">:</span> <span class="sh">"</span><span class="s">user</span><span class="sh">"</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">content</span><span class="sh">'</span> <span class="p">:</span> <span class="sa">f</span><span class="sh">"</span><span class="s">Summarize the following text in less than 50 words: </span><span class="si">{</span><span class="n">kwargs</span><span class="p">[</span><span class="sh">'</span><span class="s">dialogue</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="sh">"</span>
        <span class="p">}</span>
    <span class="p">]</span>
    <span class="k">return</span> <span class="n">tokenizer</span><span class="p">.</span><span class="nf">apply_chat_template</span><span class="p">(</span><span class="n">messages</span><span class="p">,</span> <span class="n">tokenize</span> <span class="o">=</span> <span class="bp">False</span><span class="p">,</span> <span class="n">add_generation_prompt</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
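<p>To sanity-check the preprocessor contract without loading the real model, you can call it with a stub tokenizer (a toy stand-in, purely for illustration; the actual run uses the <code class="language-plaintext highlighter-rouge">AutoTokenizer</code> loaded earlier, and the bracketed role markers are not real LLAMA-3 special tokens):</p>

```python
# toy stand-in for the HF tokenizer, only to show how kwargs flow into the
# preprocessor; the [role] markers are illustrative, not real special tokens
class StubTokenizer:
    def apply_chat_template(self, messages, tokenize=False, add_generation_prompt=True):
        rendered = "".join(f"[{m['role']}] {m['content']}\n" for m in messages)
        return rendered + "[assistant] " if add_generation_prompt else rendered

def sample_to_chat_format(tokenizer, **kwargs):
    messages = [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": f"Summarize the following text in less than 50 words: {kwargs['dialogue']}"},
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# keys listed under `input_cols` (here `dialogue`) arrive as keyword arguments
print(sample_to_chat_format(StubTokenizer(), dialogue="Amanda: hi!"))
```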

<h3 id="define-tuner-object">Define <code class="language-plaintext highlighter-rouge">Tuner</code> object</h3>

<p>This is the central object and where most of the magic happens. It takes in everything you have defined so far and abstracts it into a <code class="language-plaintext highlighter-rouge">Tuner</code> object. It also preprocesses the dataset so that you are ready to run inference. The <code class="language-plaintext highlighter-rouge">column_mapping</code> identifies which columns in the dataset will be used for preprocessing/inference (<code class="language-plaintext highlighter-rouge">input_cols</code>) and which will be used for evaluation (<code class="language-plaintext highlighter-rouge">eval_cols</code>). This is how <code class="language-plaintext highlighter-rouge">Tuner</code> knows which arguments to send to the <code class="language-plaintext highlighter-rouge">sample_preprocessor</code> function (to preprocess the dataset) and which ones to the <code class="language-plaintext highlighter-rouge">scorer</code> (to evaluate the model).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># define tuner object, this preprocesses the dataset and creates an LLMEstimator that can be run with GridSearchCV / RandomizedSearchCV of scikit-learn
</span><span class="n">tuner_ob</span> <span class="o">=</span> <span class="nc">Tuner</span><span class="p">(</span>
    <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
    <span class="n">tokenizer</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">,</span>
    <span class="n">dataset</span><span class="o">=</span><span class="n">sample_dataset</span><span class="p">,</span>
    <span class="n">device</span><span class="o">=</span><span class="sh">"</span><span class="s">cuda:0</span><span class="sh">"</span><span class="p">,</span>
    <span class="c1"># the tuner module automatically reduces the batch size while running inference if it goes OOM
</span>    <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span>
    <span class="n">tokenizer_encode_args</span><span class="o">=</span><span class="p">{</span><span class="sh">"</span><span class="s">padding</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">longest</span><span class="sh">"</span><span class="p">,</span><span class="sh">'</span><span class="s">truncation</span><span class="sh">'</span> <span class="p">:</span> <span class="bp">True</span><span class="p">,</span> <span class="sh">"</span><span class="s">add_special_tokens</span><span class="sh">"</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span> <span class="sh">'</span><span class="s">max_length</span><span class="sh">'</span> <span class="p">:</span> <span class="mi">1024</span><span class="p">},</span>
    <span class="n">tokenizer_decode_args</span><span class="o">=</span><span class="p">{</span><span class="sh">"</span><span class="s">spaces_between_special_tokens</span><span class="sh">"</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span> <span class="sh">'</span><span class="s">skip_special_tokens</span><span class="sh">'</span> <span class="p">:</span> <span class="bp">True</span><span class="p">},</span>
    <span class="c1"># pass in the scorer that we will be used to evaluate (input to this function is a batch)
</span>    <span class="n">scorer</span><span class="o">=</span><span class="n">get_rouge_score</span><span class="p">,</span>
    <span class="c1"># pass in `dataset` preprocessor, this is run on the passed in dataset before feeding into the model, input of this function is a single example
</span>    <span class="n">sample_preprocessor</span><span class="o">=</span><span class="n">sample_to_chat_format</span><span class="p">,</span>
    <span class="n">seed</span><span class="o">=</span><span class="n">seed</span><span class="p">,</span>
    <span class="c1"># column mapping used to identify input and evaluation columns (these columns are passed in to the evaluation function (scorer) &amp; the dataset preprocessor(sample_preprocessor))
</span>    <span class="n">column_mapping</span><span class="o">=</span><span class="p">{</span><span class="sh">"</span><span class="s">input_cols</span><span class="sh">"</span><span class="p">:</span> <span class="p">[</span><span class="sh">"</span><span class="s">dialogue</span><span class="sh">"</span><span class="p">],</span> <span class="sh">"</span><span class="s">eval_cols</span><span class="sh">"</span><span class="p">:</span> <span class="p">[</span><span class="sh">"</span><span class="s">summary</span><span class="sh">"</span><span class="p">]},</span>
<span class="p">)</span>
</code></pre></div></div>

<p>You can examine whether the dataset was preprocessed correctly; <code class="language-plaintext highlighter-rouge">Tuner</code> preprocesses the dataset and stores the inputs and targets at <code class="language-plaintext highlighter-rouge">_X</code> &amp; <code class="language-plaintext highlighter-rouge">_y</code> respectively.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="c1"># Check to see if dataset is processed as expected, `Tuner` populates `_X` with the processed input and `_y` with `column_mapping.eval_cols`
</span><span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Inputs: </span><span class="sh">"</span><span class="p">)</span>
<span class="k">for</span> <span class="n">_x</span><span class="p">,</span> <span class="n">_y</span> <span class="ow">in</span> <span class="nf">zip</span><span class="p">(</span><span class="n">tuner_ob</span><span class="p">.</span><span class="n">dataset</span><span class="p">[</span><span class="sh">'</span><span class="s">_X</span><span class="sh">'</span><span class="p">][:</span><span class="mi">3</span><span class="p">],</span> <span class="n">tuner_ob</span><span class="p">.</span><span class="n">dataset</span><span class="p">[</span><span class="sh">'</span><span class="s">_y</span><span class="sh">'</span><span class="p">][:</span><span class="mi">3</span><span class="p">]):</span>
    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Input: </span><span class="si">{</span><span class="n">_x</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
    <span class="nf">print</span><span class="p">(</span><span class="sh">'</span><span class="se">\n</span><span class="sh">'</span><span class="p">)</span>
    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Output: </span><span class="si">{</span><span class="n">_y</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>

    <span class="nf">print</span><span class="p">(</span><span class="sh">'</span><span class="se">\n\n</span><span class="sh">'</span><span class="p">)</span>
    <span class="nf">print</span><span class="p">(</span><span class="sh">'</span><span class="s">---</span><span class="sh">'</span> <span class="o">*</span> <span class="mi">15</span><span class="p">,</span><span class="sh">'</span><span class="se">\n\n</span><span class="sh">'</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Inputs</span><span class="p">:</span>
<span class="n">Input</span><span class="p">:</span> <span class="o">&lt;|</span><span class="n">begin_of_text</span><span class="o">|&gt;&lt;|</span><span class="n">start_header_id</span><span class="o">|&gt;</span><span class="n">system</span><span class="o">&lt;|</span><span class="n">end_header_id</span><span class="o">|&gt;</span>

<span class="n">You</span> <span class="n">are</span> <span class="n">a</span> <span class="n">helpful</span> <span class="n">AI</span> <span class="n">assistant</span><span class="p">.</span><span class="o">&lt;|</span><span class="n">eot_id</span><span class="o">|&gt;&lt;|</span><span class="n">start_header_id</span><span class="o">|&gt;</span><span class="n">user</span><span class="o">&lt;|</span><span class="n">end_header_id</span><span class="o">|&gt;</span>

<span class="n">Summarize</span> <span class="n">the</span> <span class="n">following</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">less</span> <span class="n">than</span> <span class="mi">50</span> <span class="n">words</span><span class="p">:</span> <span class="n">Lucy</span><span class="p">:</span> <span class="n">omg</span> <span class="n">did</span> <span class="n">you</span> <span class="n">see</span> <span class="n">JK</span> <span class="n">this</span> <span class="n">morning</span><span class="err">?</span>
<span class="n">Sue</span><span class="p">:</span> <span class="n">I</span> <span class="k">try</span> <span class="n">to</span> <span class="n">avoid</span> <span class="n">it</span> <span class="n">lol</span>
<span class="n">Lucy</span><span class="p">:</span> <span class="n">you</span> <span class="n">should</span> <span class="n">have</span> <span class="n">seen</span> <span class="n">it</span> <span class="n">it</span> <span class="n">was</span> <span class="n">disgusting</span>
<span class="n">Sue</span><span class="p">:</span> <span class="n">I</span> <span class="n">cant</span> <span class="n">do</span> <span class="n">it</span> <span class="n">anymore</span> <span class="n">i</span> <span class="k">try</span> <span class="n">to</span> <span class="n">listen</span> <span class="n">to</span> <span class="n">the</span> <span class="n">radio</span> <span class="ow">in</span> <span class="n">the</span> <span class="n">mornings</span><span class="p">..</span> <span class="n">jk</span> <span class="n">makes</span> <span class="n">you</span> <span class="n">think</span> <span class="n">the</span> <span class="n">whole</span> <span class="n">world</span> <span class="ow">is</span> <span class="n">full</span> <span class="n">of</span> <span class="n">idiots</span> <span class="n">lol</span>
<span class="n">Lucy</span><span class="p">:</span> <span class="n">you</span> <span class="n">may</span> <span class="n">be</span> <span class="n">right</span> <span class="n">I</span> <span class="n">dont</span> <span class="n">know</span> <span class="n">how</span> <span class="n">some</span> <span class="n">of</span> <span class="n">them</span> <span class="n">can</span> <span class="n">go</span> <span class="n">on</span> <span class="n">there</span> <span class="ow">in</span> <span class="n">public</span> <span class="k">for</span> <span class="n">the</span> <span class="n">world</span> <span class="n">to</span> <span class="n">see</span>
<span class="n">Sue</span><span class="p">:</span> <span class="n">I</span> <span class="n">would</span> <span class="n">die</span> <span class="k">if</span> <span class="n">I</span> <span class="n">got</span> <span class="n">a</span> <span class="n">call</span> <span class="n">to</span> <span class="n">go</span> <span class="n">on</span> <span class="n">there</span> <span class="n">lol</span>
<span class="n">Sue</span><span class="p">:</span> <span class="n">could</span> <span class="n">you</span> <span class="n">imagine</span> <span class="n">ha</span> <span class="n">ha</span>
<span class="n">Lucy</span><span class="p">:</span> <span class="n">I</span> <span class="n">would</span> <span class="n">piss</span> <span class="n">myself</span> <span class="n">If</span> <span class="n">I</span> <span class="n">saw</span> <span class="n">you</span> <span class="ow">and</span> <span class="n">Andy</span> <span class="n">up</span> <span class="n">there</span>
<span class="n">Sue</span><span class="p">:</span> <span class="n">over</span> <span class="n">my</span> <span class="n">dead</span> <span class="n">body</span> <span class="err">!</span><span class="o">&lt;|</span><span class="n">eot_id</span><span class="o">|&gt;&lt;|</span><span class="n">start_header_id</span><span class="o">|&gt;</span><span class="n">assistant</span><span class="o">&lt;|</span><span class="n">end_header_id</span><span class="o">|&gt;</span>

<span class="n">Output</span><span class="p">:</span> <span class="p">{</span><span class="sh">'</span><span class="s">summary</span><span class="sh">'</span><span class="p">:</span> <span class="sh">"</span><span class="s">Sue doesn</span><span class="sh">'</span><span class="s">t watch JK any more as it</span><span class="sh">'</span><span class="s">s disgusting.</span><span class="sh">"</span><span class="p">}</span>

<span class="o">---------------------------------------------</span>

<span class="n">Input</span><span class="p">:</span> <span class="o">&lt;|</span><span class="n">begin_of_text</span><span class="o">|&gt;&lt;|</span><span class="n">start_header_id</span><span class="o">|&gt;</span><span class="n">system</span><span class="o">&lt;|</span><span class="n">end_header_id</span><span class="o">|&gt;</span>

<span class="n">You</span> <span class="n">are</span> <span class="n">a</span> <span class="n">helpful</span> <span class="n">AI</span> <span class="n">assistant</span><span class="p">.</span><span class="o">&lt;|</span><span class="n">eot_id</span><span class="o">|&gt;&lt;|</span><span class="n">start_header_id</span><span class="o">|&gt;</span><span class="n">user</span><span class="o">&lt;|</span><span class="n">end_header_id</span><span class="o">|&gt;</span>

<span class="n">Summarize</span> <span class="n">the</span> <span class="n">following</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">less</span> <span class="n">than</span> <span class="mi">50</span> <span class="n">words</span><span class="p">:</span> <span class="n">Wendy</span><span class="p">:</span> <span class="n">What</span><span class="sh">'</span><span class="s">s up?
Simon: Nothing much. I</span><span class="sh">'</span><span class="n">m</span> <span class="n">painting</span> <span class="n">my</span> <span class="n">cupboards</span><span class="p">.</span>
<span class="n">Angela</span><span class="p">:</span> <span class="n">Cool</span> <span class="n">what</span> <span class="n">colour</span><span class="err">?</span>
<span class="n">Simon</span><span class="p">:</span> <span class="n">Green</span><span class="p">.</span>
<span class="n">Ben</span><span class="p">:</span> <span class="n">I</span><span class="sh">'</span><span class="s">m just chilling in the garden.
Angela: Nice weekend! I</span><span class="sh">'</span><span class="n">m</span> <span class="n">about</span> <span class="n">to</span> <span class="n">meet</span> <span class="n">Chris</span><span class="p">.</span>
<span class="n">Wendy</span><span class="p">:</span> <span class="n">Say</span> <span class="n">hello</span> <span class="k">from</span> <span class="n">me</span><span class="err">!</span>
<span class="n">Angela</span><span class="p">:</span> <span class="n">Will</span> <span class="n">do</span><span class="err">!</span> <span class="n">And</span> <span class="n">how</span> <span class="ow">is</span> <span class="n">your</span> <span class="n">weekend</span><span class="p">,</span> <span class="n">Wendy</span><span class="err">?</span>
<span class="n">Wendy</span><span class="p">:</span> <span class="n">Very</span> <span class="n">lazy</span><span class="p">...</span> <span class="n">The</span> <span class="n">week</span> <span class="n">was</span> <span class="n">hard</span> <span class="n">at</span> <span class="n">work</span><span class="p">,</span> <span class="n">I</span> <span class="n">really</span> <span class="n">needed</span> <span class="n">some</span> <span class="n">rest</span><span class="p">.</span>
<span class="n">Ben</span><span class="p">:</span> <span class="n">We</span> <span class="n">should</span> <span class="nb">all</span> <span class="n">come</span> <span class="ow">and</span> <span class="n">visit</span> <span class="n">Simon</span> <span class="ow">in</span> <span class="n">his</span> <span class="n">new</span> <span class="n">apartment</span><span class="err">!</span>
<span class="n">Simon</span><span class="p">:</span> <span class="n">You</span> <span class="n">are</span> <span class="n">welcome</span><span class="p">,</span> <span class="n">guys</span><span class="err">!</span> <span class="n">Whenever</span> <span class="n">you</span> <span class="n">wish</span><span class="p">.</span>
<span class="n">Ben</span><span class="p">:</span> <span class="n">I</span> <span class="n">should</span> <span class="n">be</span> <span class="ow">in</span> <span class="n">Bournemouth</span> <span class="nb">next</span> <span class="n">week</span><span class="p">.</span>
<span class="n">Simon</span><span class="p">:</span> <span class="n">I</span><span class="sh">'</span><span class="s">m not going anywhere :-)
Ben: Cool, I</span><span class="sh">'</span><span class="n">ll</span> <span class="n">call</span> <span class="n">you</span> <span class="nb">next</span> <span class="n">week</span><span class="p">.</span><span class="o">&lt;|</span><span class="n">eot_id</span><span class="o">|&gt;&lt;|</span><span class="n">start_header_id</span><span class="o">|&gt;</span><span class="n">assistant</span><span class="o">&lt;|</span><span class="n">end_header_id</span><span class="o">|&gt;</span>

<span class="n">Output</span><span class="p">:</span> <span class="p">{</span><span class="sh">'</span><span class="s">summary</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">This weekend Wendy is very lazy because she worked hard at work, and Angela is meeting Chris. Simon is chilling in the garden and painting his cupboards green. Next week, Ben, Angela, Chris and Wendy will visit him in his new apartament.</span><span class="sh">'</span><span class="p">}</span>

<span class="o">---------------------------------------------</span>
</code></pre></div></div>

<h3 id="evaluation-before-tuning">Evaluation Before Tuning</h3>

<p>Before running a search, you should evaluate the score that the default settings provide, since the objective is to find a better score than the one you start with.</p>

<p>You can get the score by calling <code class="language-plaintext highlighter-rouge">tuner_ob.get_score</code> with the default parameters. I have used three parameters here. <code class="language-plaintext highlighter-rouge">max_new_tokens</code> can be set by estimating the token-length distribution of the expected outputs. <code class="language-plaintext highlighter-rouge">generation_seed</code> seeds the sampler before generation, which becomes important when you are running a hyperparameter search to ensure reproducibility.</p>
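<p>To see why seeding matters, note that sampling-based decoding is stochastic: fixing a seed before each generation makes the sampled output repeatable, so every search candidate is scored on the same footing. A toy illustration with Python’s standard library (not llmsearch internals):</p>

```python
import random

def sample_tokens(seed, vocab, k=5):
    # Toy sampler: fixing the seed before each call makes draws repeatable.
    rng = random.Random(seed)
    return rng.choices(vocab, k=k)

vocab = ["the", "cat", "sat", "on", "mat"]
a = sample_tokens(42, vocab)
b = sample_tokens(42, vocab)
assert a == b  # same seed, same sampled tokens
```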

<p>You also do not want to generate tokens indefinitely until you hit the <code class="language-plaintext highlighter-rouge">max_new_tokens</code> limit; generation should stop when it reaches a certain token or a certain sequence of tokens. You can use either <code class="language-plaintext highlighter-rouge">eos_token_id</code> or a <a href="https://github.com/Praful932/llmsearch/blob/main/llmsearch/scripts/stopping_criteria.py">stopping criteria</a>.</p>
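<p>The effect of a stop token can be sketched with a toy generation loop (illustrative only, not how real decoding is implemented): decoding halts as soon as an id from the terminator set appears, instead of always running to the budget.</p>

```python
# Toy decoding loop: stop on a terminator id or on the max_new_tokens budget.
def generate(step_fn, terminators, max_new_tokens):
    out = []
    for _ in range(max_new_tokens):
        tok = step_fn(out)          # the "model" produces the next token id
        out.append(tok)
        if tok in terminators:      # eos / stop-token hit
            break
    return out

# A fake model that emits 5, 6, 7, then the stop id 0.
script = [5, 6, 7, 0, 8, 9]
tokens = generate(lambda out: script[len(out)], terminators={0}, max_new_tokens=6)
assert tokens == [5, 6, 7, 0]  # stopped early, before max_new_tokens
```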

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Get score &amp; outputs using some generation parameters
</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">pad_token</span> <span class="o">=</span> <span class="sh">"</span><span class="s">&lt;|end_of_text|&gt;</span><span class="sh">"</span>
<span class="n">gen_params</span> <span class="o">=</span> <span class="p">{</span>
    <span class="sh">'</span><span class="s">max_new_tokens</span><span class="sh">'</span> <span class="p">:</span> <span class="mi">70</span><span class="p">,</span>
    <span class="sh">'</span><span class="s">generation_seed</span><span class="sh">'</span> <span class="p">:</span> <span class="mi">42</span><span class="p">,</span>
    <span class="sh">'</span><span class="s">eos_token_id</span><span class="sh">'</span> <span class="p">:</span> <span class="n">terminators</span><span class="p">,</span>
<span class="p">}</span>

<span class="n">score</span><span class="p">,</span> <span class="n">outputs</span> <span class="o">=</span> <span class="n">tuner_ob</span><span class="p">.</span><span class="nf">get_score</span><span class="p">(</span><span class="n">gen_params</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Score - </span><span class="si">{</span><span class="n">score</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="hyperparameter-search">Hyperparameter Search</h3>

<p>Once you have instantiated the Tuner object, it exposes a <code class="language-plaintext highlighter-rouge">tuner_ob.estimator</code> which is a <code class="language-plaintext highlighter-rouge">scikit-learn</code> compatible <code class="language-plaintext highlighter-rouge">BaseEstimator</code> <a href="https://github.com/scikit-learn/scikit-learn/blob/ea1e8c4b216d4b1e21b02bafe75ee1713ad21079/sklearn/base.py#L152">object</a>. This can be used with <code class="language-plaintext highlighter-rouge">scikit-learn</code> methods. We will use it with <code class="language-plaintext highlighter-rouge">GridSearchCV</code>  to run a hyperparameter search over the generation parameters.</p>

<p>First we define a hyperparameter space and a <code class="language-plaintext highlighter-rouge">GridSearchCV</code>/<code class="language-plaintext highlighter-rouge">RandomizedSearchCV</code> object and then fit it.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Define your hyperparameter space here for the search
</span><span class="n">hyp_space</span> <span class="o">=</span> <span class="p">{</span>
    <span class="sh">'</span><span class="s">max_new_tokens</span><span class="sh">'</span> <span class="p">:</span> <span class="p">[</span><span class="mi">70</span><span class="p">],</span>
    <span class="sh">'</span><span class="s">generation_seed</span><span class="sh">'</span> <span class="p">:</span> <span class="p">[</span><span class="mi">42</span><span class="p">],</span>
    <span class="sh">'</span><span class="s">do_sample</span><span class="sh">'</span> <span class="p">:</span> <span class="p">[</span><span class="bp">True</span><span class="p">],</span>
    <span class="sh">'</span><span class="s">eos_token_id</span><span class="sh">'</span> <span class="p">:</span> <span class="p">[</span><span class="n">terminators</span><span class="p">],</span>

    <span class="sh">'</span><span class="s">temperature</span><span class="sh">'</span><span class="p">:</span> <span class="p">[</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">],</span>
    <span class="sh">'</span><span class="s">top_k</span><span class="sh">'</span><span class="p">:</span> <span class="p">[</span><span class="mi">50</span><span class="p">,</span> <span class="mi">60</span><span class="p">,</span> <span class="mi">70</span><span class="p">],</span>
    <span class="sh">'</span><span class="s">no_repeat_ngram_size</span><span class="sh">'</span><span class="p">:</span> <span class="p">[</span><span class="mi">0</span><span class="p">],</span>

<span class="p">}</span>

<span class="c1"># Pass in estimator &amp; scorer as you do with the scikit-learn API
</span><span class="n">clf</span> <span class="o">=</span> <span class="nc">GridSearchCV</span><span class="p">(</span>
    <span class="n">estimator</span> <span class="o">=</span> <span class="n">tuner_ob</span><span class="p">.</span><span class="n">estimator</span><span class="p">,</span>
    <span class="n">param_grid</span><span class="o">=</span><span class="n">hyp_space</span><span class="p">,</span>
    <span class="n">scoring</span> <span class="o">=</span> <span class="n">tuner_ob</span><span class="p">.</span><span class="n">scorer</span><span class="p">,</span>
    <span class="n">cv</span> <span class="o">=</span> <span class="mi">2</span><span class="p">,</span>
    <span class="n">n_jobs</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span> <span class="c1"># we will run this sequentially
</span>    <span class="n">verbose</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div>

<p>The fit will take time depending on the number of fits that are expected to run and the inference time per fit.</p>
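<p>The total number of fits is the number of candidate configurations times the number of CV folds. For the grid above that is 2 temperatures × 3 <code class="language-plaintext highlighter-rouge">top_k</code> values = 6 candidates, and with <code class="language-plaintext highlighter-rouge">cv = 2</code> that means 12 fits. A quick self-contained check (the <code class="language-plaintext highlighter-rouge">eos_token_id</code> entry is a placeholder list standing in for <code class="language-plaintext highlighter-rouge">terminators</code>):</p>

```python
from math import prod

# Mirrors the shape of the hyperparameter space defined above.
hyp_space = {
    "max_new_tokens": [70],
    "generation_seed": [42],
    "do_sample": [True],
    "eos_token_id": [[128001, 128009]],  # placeholder for `terminators`
    "temperature": [0.1, 0.2],
    "top_k": [50, 60, 70],
    "no_repeat_ngram_size": [0],
}

n_candidates = prod(len(v) for v in hyp_space.values())  # 1*1*1*1*2*3*1 = 6
cv = 2
total_fits = n_candidates * cv
print(n_candidates, total_fits)  # 6 candidates, 12 fits
```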

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># fit on the dataset
</span><span class="n">clf</span><span class="p">.</span><span class="nf">fit</span><span class="p">(</span><span class="n">X</span><span class="o">=</span><span class="n">tuner_ob</span><span class="p">.</span><span class="n">dataset</span><span class="p">[</span><span class="sh">"</span><span class="s">_X</span><span class="sh">"</span><span class="p">],</span> <span class="n">y</span><span class="o">=</span><span class="n">tuner_ob</span><span class="p">.</span><span class="n">dataset</span><span class="p">[</span><span class="sh">'</span><span class="s">_y</span><span class="sh">'</span><span class="p">])</span>
</code></pre></div></div>

<p>Once the model is fit, you can view the best generation parameters found by the search:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># print out the best parameters
</span><span class="nf">print</span><span class="p">(</span><span class="n">clf</span><span class="p">.</span><span class="n">best_params_</span><span class="p">)</span>
</code></pre></div></div>
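<p>Beyond <code class="language-plaintext highlighter-rouge">best_params_</code>, <code class="language-plaintext highlighter-rouge">GridSearchCV</code> also exposes <code class="language-plaintext highlighter-rouge">cv_results_</code>, a dict of parallel arrays that lets you rank every configuration, not just the winner. A sketch with made-up scores (the real dict has the same shape):</p>

```python
# Shaped like scikit-learn's cv_results_; the scores here are invented.
cv_results = {
    "params": [
        {"temperature": 0.1, "top_k": 50},
        {"temperature": 0.2, "top_k": 50},
        {"temperature": 0.1, "top_k": 70},
    ],
    "mean_test_score": [0.41, 0.38, 0.44],
}

# Rank candidates from best to worst mean score.
ranked = sorted(
    zip(cv_results["mean_test_score"], cv_results["params"]),
    key=lambda pair: pair[0],
    reverse=True,
)
best_score, best_params = ranked[0]
assert best_params == {"temperature": 0.1, "top_k": 70}
```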

<h3 id="evaluation-after-tuning">Evaluation After Tuning</h3>
<p>Once you have the best parameters, you can evaluate them on the full dataset using the <code class="language-plaintext highlighter-rouge">tuner_ob.get_score</code> method.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">scores</span><span class="p">,</span> <span class="n">outputs</span> <span class="o">=</span> <span class="n">tuner_ob</span><span class="p">.</span><span class="nf">get_score</span><span class="p">(</span><span class="n">clf</span><span class="p">.</span><span class="n">best_params_</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Scores - </span><span class="si">{</span><span class="n">scores</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="additional-utilities">Additional Utilities</h3>

<ul>
  <li>
    <p>Logging Utilities - You can set the logging level of the library using this module</p>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="kn">from</span> <span class="n">llmsearch.utils.logging_utils</span> <span class="kn">import</span> <span class="n">set_verbosity_info</span><span class="p">,</span> <span class="n">set_verbosity_warning</span><span class="p">,</span> <span class="n">set_verbosity_debug</span>

  <span class="c1"># set verbosity to debug, useful to debug model outputs
</span>  <span class="nf">set_verbosity_debug</span><span class="p">()</span>
</code></pre></div>    </div>

    <p>The <code class="language-plaintext highlighter-rouge">DEBUG</code> level is useful for seeing what is happening inside the library, for example the exact text passed in to the model and the output that comes back. Here’s an example:</p>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="c1"># Example Logs from the get score function - Calculate score on a different dataset
</span>  <span class="n">scores</span><span class="p">,</span> <span class="n">outputs</span> <span class="o">=</span> <span class="n">tuner_ob</span><span class="p">.</span><span class="nf">get_score</span><span class="p">(</span><span class="n">gen_params</span><span class="p">,</span> <span class="n">dataset</span> <span class="o">=</span> <span class="n">datasets</span><span class="p">.</span><span class="n">Dataset</span><span class="p">.</span><span class="nf">from_dict</span><span class="p">(</span><span class="n">sample_dataset</span><span class="p">[:</span><span class="mi">2</span><span class="p">]))</span>
</code></pre></div>    </div>
  </li>
  <li>
    <p>Output</p>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="mi">2024</span><span class="o">-</span><span class="mi">06</span><span class="o">-</span><span class="mi">05</span> <span class="mi">18</span><span class="p">:</span><span class="mi">19</span><span class="p">:</span><span class="mf">26.099</span> <span class="o">-</span> <span class="n">llmsearch</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">mem_utils</span><span class="p">:</span><span class="mi">154</span> <span class="o">-</span> <span class="n">INFO</span> <span class="o">-</span> <span class="n">Starting</span> <span class="n">inference</span> <span class="k">with</span> <span class="n">generation</span> <span class="n">parameters</span> <span class="o">-</span> <span class="p">{</span><span class="sh">'</span><span class="s">max_new_tokens</span><span class="sh">'</span><span class="p">:</span> <span class="mi">70</span><span class="p">,</span> <span class="sh">'</span><span class="s">generation_seed</span><span class="sh">'</span><span class="p">:</span> <span class="mi">42</span><span class="p">,</span> <span class="sh">'</span><span class="s">eos_token_id</span><span class="sh">'</span><span class="p">:</span> <span class="p">[</span><span class="mi">128001</span><span class="p">,</span> <span class="mi">128009</span><span class="p">]}</span>
  <span class="mi">2024</span><span class="o">-</span><span class="mi">06</span><span class="o">-</span><span class="mi">05</span> <span class="mi">18</span><span class="p">:</span><span class="mi">19</span><span class="p">:</span><span class="mf">26.101</span> <span class="o">-</span> <span class="n">llmsearch</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">mem_utils</span><span class="p">:</span><span class="mi">158</span> <span class="o">-</span> <span class="n">INFO</span> <span class="o">-</span> <span class="n">Performing</span> <span class="n">inference</span> <span class="k">with</span> <span class="n">batch_size</span> <span class="o">-</span> <span class="mi">2</span>
  <span class="mi">2024</span><span class="o">-</span><span class="mi">06</span><span class="o">-</span><span class="mi">05</span> <span class="mi">18</span><span class="p">:</span><span class="mi">19</span><span class="p">:</span><span class="mf">26.103</span> <span class="o">-</span> <span class="n">llmsearch</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">model_utils</span><span class="p">:</span><span class="mi">98</span> <span class="o">-</span> <span class="n">INFO</span> <span class="o">-</span> <span class="n">Detected</span> <span class="n">generation</span> <span class="nb">type</span> <span class="o">-</span> <span class="n">Greedy</span> <span class="n">Decoding</span>
  <span class="mi">2024</span><span class="o">-</span><span class="mi">06</span><span class="o">-</span><span class="mi">05</span> <span class="mi">18</span><span class="p">:</span><span class="mi">19</span><span class="p">:</span><span class="mf">29.759</span> <span class="o">-</span> <span class="n">llmsearch</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">model_utils</span><span class="p">:</span><span class="mi">149</span> <span class="o">-</span> <span class="n">DEBUG</span> <span class="o">-</span> <span class="n">Input</span> <span class="o">-</span> <span class="sh">'</span><span class="s">&lt;|begin_of_text|&gt;&lt;|start_header_id|&gt;system&lt;|end_header_id|&gt;</span><span class="se">\n\n</span><span class="s">You are a helpful AI assistant.&lt;|eot_id|&gt;&lt;|start_header_id|&gt;user&lt;|end_header_id|&gt;</span><span class="se">\n\n</span><span class="s">Summarize the following text in less than 50 words: Lucy: omg did you see JK this morning?</span><span class="se">\r\n</span><span class="s">Sue: I try to avoid it lol</span><span class="se">\r\n</span><span class="s">Lucy: you should have seen it it was disgusting</span><span class="se">\r\n</span><span class="s">Sue: I cant do it anymore i try to listen to the radio in the mornings.. 
jk makes you think the whole world is full of idiots lol</span><span class="se">\r\n</span><span class="s">Lucy: you may be right I dont know how some of them can go on there in public for the world to see</span><span class="se">\r\n</span><span class="s">Sue: I would die if I got a call to go on there lol</span><span class="se">\r\n</span><span class="s">Sue: could you imagine ha ha </span><span class="se">\r\n</span><span class="s">Lucy: I would piss myself If I saw you and Andy up there</span><span class="se">\r\n</span><span class="s">Sue: over my dead body !&lt;|eot_id|&gt;&lt;|start_header_id|&gt;assistant&lt;|end_header_id|&gt;</span><span class="se">\n\n</span><span class="sh">'</span>
  <span class="mi">2024</span><span class="o">-</span><span class="mi">06</span><span class="o">-</span><span class="mi">05</span> <span class="mi">18</span><span class="p">:</span><span class="mi">19</span><span class="p">:</span><span class="mf">29.763</span> <span class="o">-</span> <span class="n">llmsearch</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">model_utils</span><span class="p">:</span><span class="mi">150</span> <span class="o">-</span> <span class="n">DEBUG</span> <span class="o">-</span> <span class="n">Model</span> <span class="n">Output</span> <span class="o">-</span> <span class="sh">'</span><span class="s">The conversation is about a TV show </span><span class="sh">"</span><span class="s">JK</span><span class="sh">"</span><span class="s"> that Lucy and Sue dislike. They</span><span class="se">\'</span><span class="s">re making fun of the show</span><span class="se">\'</span><span class="s">s content and the people who appear on it, calling them </span><span class="sh">"</span><span class="s">idiots.</span><span class="sh">"</span><span class="s"> They</span><span class="se">\'</span><span class="s">re joking about how they wouldn</span><span class="se">\'</span><span class="s">t want to be on the show themselves.</span><span class="sh">'</span>
  <span class="mi">2024</span><span class="o">-</span><span class="mi">06</span><span class="o">-</span><span class="mi">05</span> <span class="mi">18</span><span class="p">:</span><span class="mi">19</span><span class="p">:</span><span class="mf">29.766</span> <span class="o">-</span> <span class="n">llmsearch</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">model_utils</span><span class="p">:</span><span class="mi">149</span> <span class="o">-</span> <span class="n">DEBUG</span> <span class="o">-</span> <span class="n">Input</span> <span class="o">-</span> <span class="sh">"</span><span class="s">&lt;|begin_of_text|&gt;&lt;|start_header_id|&gt;system&lt;|end_header_id|&gt;</span><span class="se">\n\n</span><span class="s">You are a helpful AI assistant.&lt;|eot_id|&gt;&lt;|start_header_id|&gt;user&lt;|end_header_id|&gt;</span><span class="se">\n\n</span><span class="s">Summarize the following text in less than 50 words: Wendy: What</span><span class="sh">'</span><span class="s">s up?</span><span class="se">\r\n</span><span class="s">Simon: Nothing much. I</span><span class="sh">'</span><span class="s">m painting my cupboards. </span><span class="se">\r\n</span><span class="s">Angela: Cool what colour?</span><span class="se">\r\n</span><span class="s">Simon: Green.</span><span class="se">\r\n</span><span class="s">Ben: I</span><span class="sh">'</span><span class="s">m just chilling in the garden. </span><span class="se">\r\n</span><span class="s">Angela: Nice weekend! I</span><span class="sh">'</span><span class="s">m about to meet Chris.</span><span class="se">\r\n</span><span class="s">Wendy: Say hello from me!</span><span class="se">\r\n</span><span class="s">Angela: Will do! And how is your weekend, Wendy?</span><span class="se">\r\n</span><span class="s">Wendy: Very lazy... The week was hard at work, I really needed some rest. 
</span><span class="se">\r\n</span><span class="s">Ben: We should all come and visit Simon in his new apartment!</span><span class="se">\r\n</span><span class="s">Simon: You are welcome, guys! Whenever you wish.</span><span class="se">\r\n</span><span class="s">Ben: I should be in Bournemouth next week. </span><span class="se">\r\n</span><span class="s">Simon: I</span><span class="sh">'</span><span class="s">m not going anywhere :-)</span><span class="se">\r\n</span><span class="s">Ben: Cool, I</span><span class="sh">'</span><span class="s">ll call you next week.&lt;|eot_id|&gt;&lt;|start_header_id|&gt;assistant&lt;|end_header_id|&gt;</span><span class="se">\n\n</span><span class="sh">"</span>
  <span class="mi">2024</span><span class="o">-</span><span class="mi">06</span><span class="o">-</span><span class="mi">05</span> <span class="mi">18</span><span class="p">:</span><span class="mi">19</span><span class="p">:</span><span class="mf">29.767</span> <span class="o">-</span> <span class="n">llmsearch</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">model_utils</span><span class="p">:</span><span class="mi">150</span> <span class="o">-</span> <span class="n">DEBUG</span> <span class="o">-</span> <span class="n">Model</span> <span class="n">Output</span> <span class="o">-</span> <span class="sh">"</span><span class="s">A group of friends chat about their weekends. Simon is painting his cupboards green, Angela is meeting Chris, and Ben is relaxing in the garden. They discuss visiting Simon</span><span class="sh">'</span><span class="s">s new apartment and make plans to catch up soon.</span><span class="sh">"</span>
  <span class="mi">2024</span><span class="o">-</span><span class="mi">06</span><span class="o">-</span><span class="mi">05</span> <span class="mi">18</span><span class="p">:</span><span class="mi">19</span><span class="p">:</span><span class="mf">30.159</span> <span class="o">-</span> <span class="n">llmsearch</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">mem_utils</span><span class="p">:</span><span class="mi">176</span> <span class="o">-</span> <span class="n">DEBUG</span> <span class="o">-</span> <span class="n">Setting</span> <span class="n">batch_size</span> <span class="n">cache</span> <span class="n">value</span> <span class="o">-</span> <span class="mi">2</span> <span class="k">for</span> <span class="n">this</span> <span class="n">particular</span> <span class="n">configuration</span>
  <span class="mi">2024</span><span class="o">-</span><span class="mi">06</span><span class="o">-</span><span class="mi">05</span> <span class="mi">18</span><span class="p">:</span><span class="mi">19</span><span class="p">:</span><span class="mf">30.161</span> <span class="o">-</span> <span class="n">llmsearch</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">mem_utils</span><span class="p">:</span><span class="mi">188</span> <span class="o">-</span> <span class="n">INFO</span> <span class="o">-</span> <span class="n">Finished</span> <span class="n">running</span> <span class="n">inference</span><span class="p">,</span> <span class="n">took</span> <span class="mf">4.057762</span> <span class="n">secs</span>
</code></pre></div>    </div>
  </li>
  <li>
    <p>Multi Token Stopping Criteria - There could be a use-case where you want to stop your generation at a specific token other than the <code class="language-plaintext highlighter-rouge">eos_token</code>, or when a certain sequence of tokens occurs in the generated output. You can use the <code class="language-plaintext highlighter-rouge">MultiTokenStoppingCriteria</code> available in llmsearch.</p>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="kn">from</span> <span class="n">transformers</span> <span class="kn">import</span> <span class="n">StoppingCriteriaList</span>
  <span class="kn">from</span> <span class="n">llmsearch.scripts.stopping_criteria</span> <span class="kn">import</span> <span class="n">MultiTokenStoppingCriteria</span>

  <span class="c1"># specify what sequence to stop the generation on
</span>  <span class="n">multi_token_stop_criteria_ob</span> <span class="o">=</span> <span class="nc">MultiTokenStoppingCriteria</span><span class="p">(</span><span class="n">sequence_ids</span><span class="o">=</span><span class="p">[</span><span class="mi">32000</span><span class="p">])</span>
  <span class="n">stopping_criteria</span> <span class="o">=</span> <span class="nc">StoppingCriteriaList</span><span class="p">([</span><span class="n">multi_token_stop_criteria_ob</span><span class="p">])</span>
  <span class="n">callbacks_after_inference</span> <span class="o">=</span> <span class="p">[</span><span class="n">multi_token_stop_criteria_ob</span><span class="p">.</span><span class="n">reset</span><span class="p">]</span>

  <span class="n">tuner_ob</span> <span class="o">=</span> <span class="nc">Tuner</span><span class="p">(</span>
  		<span class="bp">...</span>
  		<span class="n">callbacks_after_inference</span><span class="o">=</span><span class="n">callbacks_after_inference</span><span class="p">,</span>
  		<span class="bp">...</span>
  <span class="p">)</span>
</code></pre></div>    </div>

    <p><code class="language-plaintext highlighter-rouge">MultiTokenStoppingCriteria</code> can operate on batches of input. It maintains state for each batch that passes through it, which tells it where to look and which sequences in the batch have already finished. This state is cleared after each inference run via the callback.</p>
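    <p>The core of the matching can be sketched in a few lines (an illustrative re-implementation, not llmsearch’s actual code): for each sequence in the batch, check whether the generated ids now end with the stop sequence.</p>

```python
def ends_with(seq, stop_ids):
    # True when `seq` (a list of token ids) ends with `stop_ids`.
    n = len(stop_ids)
    return len(seq) >= n and seq[-n:] == stop_ids

# Batch of two sequences; only the first has produced the stop sequence.
stop_ids = [7, 9]
batch = [[11, 42, 13, 7, 9], [11, 42, 13, 7, 8]]
finished = [ends_with(seq, stop_ids) for seq in batch]
assert finished == [True, False]
```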
  </li>
</ul>

<h3 id="conclusion-️">Conclusion ☕️</h3>

<p>In this blog you saw how to use <code class="language-plaintext highlighter-rouge">llmsearch</code> to run a hyperparameter search over generation parameters using <code class="language-plaintext highlighter-rouge">scikit-learn</code>. I would love to hear what the community does with it; if you have any feedback, do not hesitate to reach out. <code class="language-plaintext highlighter-rouge">llmsearch</code> has multiple improvements planned as part of v1.0.0. Stay tuned!</p>

<p><a href="https://github.com/Praful932/llmsearch"><img src="https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&amp;logo=github&amp;logoColor=white" alt="GitHub" /></a>
<a href="https://llmsearch.netlify.app"><img src="https://img.shields.io/badge/netlify-%23000000.svg?style=for-the-badge&amp;logo=netlify&amp;logoColor=#00C7B7" alt="Netlify" /></a></p>]]></content><author><name>Praful Mohanan</name></author><category term="llm" /><category term="code" /><summary type="html"><![CDATA[Find better generation parameters for your LLMs 🦾]]></summary></entry><entry><title type="html">Understanding the F1 Score metric for evaluating Grammar Error Correction Systems</title><link href="https://praful932.dev/blog-3-f1-score-gec/" rel="alternate" type="text/html" title="Understanding the F1 Score metric for evaluating Grammar Error Correction Systems" /><published>2023-03-05T00:00:00+00:00</published><updated>2023-03-05T00:00:00+00:00</updated><id>https://praful932.dev/blog-3-f1-score-gec</id><content type="html" xml:base="https://praful932.dev/blog-3-f1-score-gec/"><![CDATA[<p><strong>Grammar Error Correction</strong> (GEC) in NLP is the task of making erroneous/grammatically incorrect sentences correct by performing a certain set of <em>operations</em> on the corrupted sentence.</p>

<p align="center">
<img src="/assets/images/blog-3-f1-score-gec/gec_system_eg.png" alt="Classic GEC System" style="width:200px;" />
</p>
<p style="text-align: center; font-size: 15px;">
    <em>Classic GEC System</em>
</p>

<p>These <em>operations</em> can be:</p>
<ol>
  <li>Replacement - You may replace a corrupted word with a corrected version of it.</li>
  <li>Insertion - You may insert a missing word in the sentence.</li>
  <li>Deletion - You may delete an unwanted word from the sentence.</li>
</ol>
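<p>To make these operations concrete, here is a small sketch of applying them to a tokenized sentence. The <code class="language-plaintext highlighter-rouge">apply_edits</code> helper and its <code class="language-plaintext highlighter-rouge">(start, end, replacement)</code> edit format are illustrative assumptions for this example, not part of any standard GEC toolkit:</p>

```python
# Illustrative sketch of the three GEC operations; apply_edits and the
# (start, end, replacement) edit format are assumptions for this example,
# not part of any standard GEC toolkit.

def apply_edits(tokens, edits):
    """Apply (start, end, replacement) edits to a token list.

    start == end inserts, replacement == "" deletes, otherwise it
    replaces tokens[start:end].
    """
    # Apply edits right-to-left so earlier token offsets stay valid
    for start, end, replacement in sorted(edits, reverse=True):
        tokens[start:end] = replacement.split() if replacement else []
    return tokens

sentence = "This are a a sentence".split()
edits = [
    (1, 2, "is"),  # Replacement: "are" -> "is"
    (2, 3, ""),    # Deletion: drop the duplicated "a"
    (5, 5, "."),   # Insertion: add the missing final period
]
print(" ".join(apply_edits(sentence, edits)))  # → This is a sentence .
```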

<p><strong>F1 Score</strong> is a metric that is generally used to measure the performance of binary classification models. In this article, we will understand how the F1 Score metric can be used to evaluate GEC systems as well. Some terminology:</p>

<ul>
  <li>Input/Corrupted Sentence - This is the corrupted sentence that we want to correct.</li>
  <li>Ground Truth - This is the corrected version of the Input Sentence.</li>
  <li>Hypothesis/Model Prediction  - This is the sentence that your model predicted.</li>
</ul>

<p>Now you want to measure your model’s performance w.r.t. the Ground Truth that you have. Before we jump to the metric, we need to understand what the M2 format is and how it relates to the F1 metric for GEC.</p>

<h3 id="the-m2-format">The M2 Format</h3>

<p>This is a standard data format that is used in GEC tasks. Any annotation/model prediction of GEC can be expressed in this format, which has a corrupted sentence and the corrected version of it in terms of annotations/edits.</p>

<p>The below illustration explains what different parts of the format mean:</p>

<p align="center">
<img src="/assets/images/blog-3-f1-score-gec/m2_format.png" alt="Black formatting" style="width:900px;" />
</p>
<p style="text-align: center; font-size: 15px;">
    <em>M2 Format</em>
</p>

<p><strong>S</strong> - denotes the Source Sentence<br />
<strong>A</strong> - denotes Annotations/Edits; these can be model predictions as well<br />
A sentence can have more than one annotation -&gt; more than one possible way to correct it.<br />
There can be more than one edit in an annotation -&gt; a correction with more than a single edit.</p>

<hr />

<p>A few examples of the M2 Format:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>S This are a sentence .
A 1 2|||R:VERB:SVA|||is|||-REQUIRED-|||NONE|||0
</code></pre></div></div>

<ul>
  <li>Original sentence - This are a sentence.</li>
  <li>Annotations (1 Annotation)
    <ul>
      <li>This <strong>is</strong> a sentence.</li>
    </ul>
  </li>
  <li>The original sentence here is <code class="language-plaintext highlighter-rouge">This are a sentence</code>.</li>
  <li>The corrected version is <code class="language-plaintext highlighter-rouge">This is a sentence .</code>
where <code class="language-plaintext highlighter-rouge">are</code> (at token offset <code class="language-plaintext highlighter-rouge">[1:2]</code>) is replaced by <code class="language-plaintext highlighter-rouge">is</code>.</li>
</ul>

<hr />

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>S A dog over the wall .
A 2 2|||M:ADP|||jumped|||-REQUIRED-|||NONE|||0
A 1 2|||R:ADP|||cat|||-REQUIRED-|||NONE|||1
A 2 2|||M:ADP|||jumped|||-REQUIRED-|||NONE|||1
</code></pre></div></div>

<ul>
  <li>Original Sentence - A dog over the wall</li>
  <li>Annotations (2 Annotations, 1st with 1 edit, 2nd with 2 edits)
    <ul>
      <li>A dog <strong>jumped</strong> over the wall.</li>
      <li>A <strong>cat jumped</strong> over the wall.</li>
    </ul>
  </li>
</ul>

<hr />

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>S The boys played a game .
A 1 2 |||R:NOUN|||girls|||-REQUIRED-|||NONE|||0
A -1 -1|||noop|||-NONE-|||-REQUIRED-|||NONE|||1
</code></pre></div></div>

<ul>
  <li>Original Sentence - The boys played a game.</li>
  <li>Annotations (2 Annotations, 1st with 1 edit, 2nd with noop edit)
    <ul>
      <li>The <code class="language-plaintext highlighter-rouge">girls</code> played a game.</li>
      <li>None (<em>Here this means that the sentence is correct so Annotator/Model 1 annotated it as noop/having no annotation.</em>)</li>
    </ul>
  </li>
</ul>
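<p>As a rough illustration of how an M2 annotation line can be read programmatically, here is a small parser sketch. <code class="language-plaintext highlighter-rouge">parse_m2_line</code> is a hypothetical helper written for this post; tools like ERRANT ship their own, more complete parsers:</p>

```python
# Hypothetical sketch of reading one M2 annotation line; real tools such as
# ERRANT ship their own, more complete parsers.

def parse_m2_line(line):
    """Split an 'A ...' annotation line into its |||-separated fields."""
    span, error_type, correction, _required, _comment, annotator = line[2:].split("|||")
    start, end = (int(x) for x in span.split())
    return {
        "start": start,            # token offset where the edit begins
        "end": end,                # token offset where the edit ends
        "type": error_type,        # e.g. R: replacement, M: missing word
        "correction": correction,  # the suggested corrected text
        "annotator": int(annotator),
    }

edit = parse_m2_line("A 1 2|||R:VERB:SVA|||is|||-REQUIRED-|||NONE|||0")
print(edit["start"], edit["end"], edit["correction"])  # → 1 2 is
```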

<hr />

<p><a href="https://github.com/chrisjbryant/errant">ERRANT</a> (Error Annotation Toolkit) is one of the tools that you can use to get your output into the M2 format. While evaluating your model, you will have <strong>2</strong> M2 format annotations.</p>

<ul>
  <li><em>Ground Truth M2</em> - an M2 Format Annotation between the Corrupted Sentence and the Ground Truth.</li>
  <li><em>Hypothesis M2</em> - an M2 Format Annotation between the Corrupted Sentence and the Hypothesis.</li>
</ul>

<p>Now that you have understood the M2 format, let’s take an example to see how we can calculate the metrics.</p>

<ul>
  <li>Corrupted Sentence - I am not play game .</li>
  <li>Ground Truth - I am not playing games .</li>
  <li>Hypothesis - I am not playing game .</li>
</ul>

<p>Respective M2s (just the annotations):</p>

<ol>
  <li>
    <p>Ground Truth M2</p>

    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> A 3 4 |||R:VERB|||playing|||-REQUIRED-|||NONE|||0
 A 4 5 |||R:NOUN|||games|||-REQUIRED-|||NONE|||0
</code></pre></div>    </div>
  </li>
  <li>
    <p>Hypothesis M2</p>

    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> A 3 4 |||R:VERB|||playing|||-REQUIRED-|||NONE|||0
</code></pre></div>    </div>
  </li>
</ol>

<h3 id="calculating-the-metrics">Calculating the Metrics</h3>

<p>Once we have the Ground Truth &amp; Hypothesis M2s, we can calculate the metric. If you look at evaluation results in Grammar Error Correction, you will notice that the F0.5 metric is often reported. This is the F-Beta score with Beta=0.5 instead of Beta=1 (the regular F1). The lower Beta is, the more you weigh Precision over Recall.</p>

<p align="center">
<img src="/assets/images/blog-3-f1-score-gec/f_beta.png" alt="F-Beta Formula" style="width:400px;" />
</p>
<p style="text-align: center;">
    <em>F-Beta Formula</em>
</p>
<p>In GEC, introducing a false positive (an incorrect correction) is considered worse than missing an error, <em>which is why we give more weight to precision than recall</em>.<br />
Here’s some pseudocode that explains how we calculate the <strong>F0.5</strong> score from the M2s that we have.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># We treat each edit in the ground truth &amp; hypothesis M2 as a category that we want to predict

# Initialize these to 0; there's no TN because we do not care about "non-errors"
tp, fp, fn = 0, 0, 0
beta = 0.5

# For each m2_edit in hypothesis_M2
for m2_edit in hypothesis_M2:
    # Skip noop (no-change) edits; we don't include them in the metric calculation
    if m2_edit == noop_edit:
        continue
    # It's a True Positive if the exact same edit is present in ground_truth_M2
    # Otherwise it's an FP (the edit that the model suggested is incorrect)
    if m2_edit in ground_truth_M2:
        tp += 1
    else:
        fp += 1

# For each m2_edit in ground_truth_M2
for m2_edit in ground_truth_M2:
    # Skip noop (no-change) edits; we don't include them in the metric calculation
    if m2_edit == noop_edit:
        continue
    # Edits that were supposed to be predicted but weren't go in the False Negatives bucket
    if m2_edit not in hypothesis_M2:
        fn += 1

# Now you can calculate the metrics easily
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
# Calculate F-0.5
f_05_score = ((1 + beta**2) * precision * recall) / ((beta**2 * precision) + recall) if precision + recall else 0.0
</code></pre></div></div>
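<p>Plugging the worked example above into this procedure gives the following numbers. This is a minimal sketch where each edit is represented as a plain <code class="language-plaintext highlighter-rouge">(start, end, correction)</code> tuple rather than a full M2 object:</p>

```python
# Minimal sketch: computing F0.5 for the worked example above, with each edit
# represented as a (start, end, correction) tuple rather than a full M2 object.

ground_truth_m2 = {(3, 4, "playing"), (4, 5, "games")}
hypothesis_m2 = {(3, 4, "playing")}

tp = len(hypothesis_m2 & ground_truth_m2)   # edits the model got right
fp = len(hypothesis_m2 - ground_truth_m2)   # edits the model invented
fn = len(ground_truth_m2 - hypothesis_m2)   # edits the model missed

beta = 0.5
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f_05 = ((1 + beta**2) * precision * recall) / ((beta**2 * precision) + recall) \
    if precision + recall else 0.0

print(precision, recall, round(f_05, 3))  # → 1.0 0.5 0.833
```

Note how the single missed edit (<code class="language-plaintext highlighter-rouge">games</code>) halves recall but, because Beta=0.5, the score only drops to about 0.83.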

<h3 id="in-this-article-️">In this article ☕️</h3>

<ul>
  <li>You understood what the M2 format is and how it is used in GEC</li>
  <li>Got to know how the F1 metric is applied to problems like GEC and not just vanilla classification problems.</li>
</ul>

<h3 id="references">References</h3>

<ul>
  <li><a href="https://github.com/chrisjbryant/errant">ERRANT repository</a></li>
  <li><a href="https://www.cl.cam.ac.uk/research/nl/bea2019st/">BEA-2019 Shared Task on GEC</a></li>
</ul>]]></content><author><name>Praful Mohanan</name></author><category term="nlp" /><category term="ml" /><summary type="html"><![CDATA[Primer on GEC and its Evaluation 🛠️]]></summary></entry><entry><title type="html">Using pre-commit hooks to write better code</title><link href="https://praful932.dev/blog-2-pre-commit-hooks/" rel="alternate" type="text/html" title="Using pre-commit hooks to write better code" /><published>2023-01-08T00:00:00+00:00</published><updated>2023-01-08T00:00:00+00:00</updated><id>https://praful932.dev/blog-2-pre-commit-hooks</id><content type="html" xml:base="https://praful932.dev/blog-2-pre-commit-hooks/"><![CDATA[<p>Pre-commit hooks are scripts that run before you commit your code to the codebase.
These hooks can be, for instance, <em>autoformatters</em> - which format &amp; make your code pretty ✨ according to a defined standard; <em>linters</em> - which point out mistakes in your code; or even your very own custom code/unit-test scripts - all of which run every time you run a <code class="language-plaintext highlighter-rouge">git commit</code> command.</p>

<p>These <strong>scripts/hooks</strong> (<em>I’ll use the term hook for consistency’s sake throughout the article</em>) are set up and run in an isolated manner (except for local hooks; more on this later) by the <code class="language-plaintext highlighter-rouge">pre-commit</code> package. So a hook written in another language can be set up and run as well, independent of the development environment. In the context of pre-commit, these hooks are mainly git repositories that expose an executable.
The advantages of having these packages all packed up into the <strong>pre-commit</strong> ecosystem are:</p>
<ul>
  <li>having a <em>single file(the pre-commit config file)</em> which manages the configuration for all of your <strong>hooks</strong>.</li>
  <li>Letting pre-commit itself handle the setup for such hooks; for example, a hook made for some programming language may not itself be written in that language, which may require additional effort to set it up.</li>
</ul>

<p>pre-commit can be installed via <code class="language-plaintext highlighter-rouge">pip</code>, <code class="language-plaintext highlighter-rouge">brew</code> or <code class="language-plaintext highlighter-rouge">conda</code>. Using <code class="language-plaintext highlighter-rouge">pip</code>, the command would be</p>

<p><code class="language-plaintext highlighter-rouge">pip install pre-commit</code></p>

<h2 id="the-pre-commit-config-file-">The pre-commit config file 📃</h2>

<p>Post installation, you will need to set up the config file. Once you have the config file set up, all you need to do is run <code class="language-plaintext highlighter-rouge">pre-commit run</code> to let it do its magic 🪄.
The file which manages the configuration of all your hooks is the <code class="language-plaintext highlighter-rouge">.pre-commit-config.yaml</code> file. The configuration file follows the YAML syntax. There can be more than one hook associated with a pre-commit configuration file. This file describes which hooks the project will be using.</p>

<p>This config file has a total of <strong>3</strong> levels of configuration. This is how a pre-commit config file is structured:</p>

<p align="center">
<img src="/assets/images/blog-2-pre-commit-hooks/pre-commit-config-file-structure.png" alt="Pre-commit config file structure" style="width:700px;" />
</p>
<p style="text-align: center;">
    <em>pre-commit config file structure</em>
</p>

<p><strong>Top level configuration</strong><br />
These are the global-level configurations that apply to your whole pre-commit setup. These settings mainly revolve around the set of files that you want to run pre-commit on and a few knobs on how pre-commit behaves.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">###### Top-level configuration</span>
<span class="na">exclude </span><span class="pi">:</span> <span class="s">^wip</span>                  <span class="c1"># Exclude files from the pre-commit checks which match this pattern</span>
<span class="na">files </span><span class="pi">:</span> <span class="s">\.py$</span>                   <span class="c1"># Only run pre-commit checks on this particular file pattern</span>
<span class="na">fail_fast</span><span class="pi">:</span> <span class="kc">false</span>                <span class="c1"># If true, stops the run at the first failing hook without executing subsequent hooks</span>
<span class="c1">######</span>
<span class="na">repos</span><span class="pi">:</span>
  <span class="s">....</span>
</code></pre></div></div>
<p><br />
<strong>Repo level configuration</strong><br />
This configuration tells pre-commit where (i.e. which repo) to look for the code of the hooks that it will run on the codebase. You define a set of repos that pre-commit will use to set up the hooks. As mentioned earlier, pre-commit hooks are set up and run in an isolated manner. It is certainly possible that you need to run a custom hook (e.g. unit tests, dynamic checks) which is directly/indirectly dependent on the state of the codebase (through the virtual environment, build output, etc.). Setting <code class="language-plaintext highlighter-rouge">repo</code> to <code class="language-plaintext highlighter-rouge">local</code> is a decent hack to achieve this (we will look into this in depth soon).</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">....</span>
<span class="na">fail_fast</span><span class="pi">:</span> <span class="kc">false</span>
<span class="c1">###### Repo-level configuration</span>
<span class="na">repos</span><span class="pi">:</span>                          <span class="c1"># List of repos that contain the hooks</span>
<span class="pi">-</span> <span class="na">repo</span><span class="pi">:</span> <span class="s1">'</span><span class="s">'</span>                      <span class="c1"># Repository URL</span>
  <span class="na">rev</span><span class="pi">:</span> <span class="s">1.0.0</span>
  <span class="na">hooks</span><span class="pi">:</span>                        <span class="c1"># Hooks that we want from the repository (There could be more than one hook in a repo)</span>
    <span class="s">.....</span>
<span class="pi">-</span> <span class="na">repo</span><span class="pi">:</span> <span class="s">local</span>                   <span class="c1"># Local hook</span>
  <span class="na">hooks</span><span class="pi">:</span>
    <span class="s">.....</span>
<span class="c1">######</span>
</code></pre></div></div>
<p><br />
<strong>Hook level configuration</strong><br />
This is where the magic happens, for each of the repo configurations, you’ll define which hooks you want from the repository and the additional parameters that the hook needs.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">....</span>
<span class="pi">-</span> <span class="na">repo</span><span class="pi">:</span> <span class="s1">'</span><span class="s">'</span>                      <span class="c1"># Repository URL</span>
  <span class="na">rev</span><span class="pi">:</span> <span class="s">2.0.0</span>
<span class="c1">######  Hook level configuration</span>
  <span class="na">hooks</span><span class="pi">:</span>                        <span class="c1"># List of hooks to use from the repository</span>
  <span class="pi">-</span>   <span class="na">id</span><span class="pi">:</span> <span class="s">hook2</span>                 <span class="c1"># ID of the hook to use from the repository</span>
      <span class="na">name</span><span class="pi">:</span> <span class="s">hook2-py</span>            <span class="c1"># Name to be shown during hook execution</span>
<span class="c1">######</span>

<span class="pi">-</span> <span class="na">repo</span><span class="pi">:</span> <span class="s">local</span>
<span class="c1">###### Hook-level configuration</span>
  <span class="na">hooks</span><span class="pi">:</span>
  <span class="pi">-</span>   <span class="na">id</span><span class="pi">:</span> <span class="s">my-local-script</span>       <span class="c1"># Random ID for the hook</span>
      <span class="na">name</span><span class="pi">:</span> <span class="s">my-local-script</span>     <span class="c1"># Name to be shown during execution</span>
      <span class="na">entry</span><span class="pi">:</span> <span class="s">python tests.py</span>    <span class="c1"># executable to run the hook</span>
      <span class="na">language</span><span class="pi">:</span> <span class="s">python</span>          <span class="c1"># how to install the hook, could be python, ruby, dart depending upon the nature of the hook</span>
      <span class="na">files </span><span class="pi">:</span> <span class="s">\.py$</span>             <span class="c1"># files to run on</span>
<span class="c1">######</span>
</code></pre></div></div>

<p>Every pre-commit hook (except <code class="language-plaintext highlighter-rouge">repo : local</code> ones) should have an <code class="language-plaintext highlighter-rouge">id</code> attribute; this is what pre-commit uses to determine which hook to use, and it can be found via the <a href="https://github.com/asottile/pyupgrade/blob/97ed6fb3cf2e650d4f762ba231c3f04c41797710/.pre-commit-hooks.yaml#L1">.pre-commit-hooks.yaml</a> file of the respective <code class="language-plaintext highlighter-rouge">repo</code>.</p>

<p>Every hook of a local repo(<code class="language-plaintext highlighter-rouge">repo : local</code>) should have the following attributes:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">id</code> : For a local hook this can be any valid string</li>
  <li><code class="language-plaintext highlighter-rouge">name</code> : Hook name shown during execution</li>
  <li><code class="language-plaintext highlighter-rouge">language</code> : This tells pre-commit how to install the hook; keeping this as <code class="language-plaintext highlighter-rouge">system</code> will not create any isolated environment for this hook and will use the project’s environment instead. <em>This also means that local hooks should have their dependencies as part of the project itself.</em></li>
  <li><code class="language-plaintext highlighter-rouge">entry</code>  : Tells pre-commit which executable to run for the hook; it could be a python script or even something like <code class="language-plaintext highlighter-rouge">pytest tests/test_db.py</code></li>
  <li><code class="language-plaintext highlighter-rouge">files</code> : Pattern of files to run on</li>
</ul>
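<p>Putting the three levels together, a minimal complete config file might look like this. The hook versions, file patterns and the local <code class="language-plaintext highlighter-rouge">pytest</code> hook here are illustrative examples only:</p>

```yaml
# Illustrative .pre-commit-config.yaml; versions and paths are examples only
fail_fast: false
files: \.py$
repos:
- repo: https://github.com/ambv/black
  rev: 22.3.0
  hooks:
  - id: black
    name: black-py
- repo: local
  hooks:
  - id: run-unit-tests
    name: run-unit-tests
    entry: pytest tests/
    language: system
    files: \.py$
```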

<h2 id="tidy-up-your-code">Tidy up your code</h2>

<p>Now that we have looked at the different components of the config file, we’ll look at three of the hooks that I have found useful and how we can use them to tidy up our code</p>

<ul>
  <li>black</li>
  <li>pyupgrade</li>
  <li>pylint</li>
</ul>

<p>All of these are individual python packages that can be installed (<code class="language-plaintext highlighter-rouge">pip install pkg_name</code>) and used separately via their command-line options.
For demonstration, we’ll go through each of the packages and then look at a pre-commit config file that encompasses all of them in one, to avoid the need to run them via the command line.</p>

<h3 id="black">Black</h3>

<p>Black is an automatic code formatting tool for python files. It aims at standardizing the code style for python syntax so that diffs are smaller and code is easier to read and review. Black uses <a href="https://eli.thegreenplace.net/2009/02/16/abstract-vs-concrete-syntax-trees/">concrete syntax trees</a> internally to parse and format the code. The style that Black uses is a strict subset of PEP 8 with a few knobs to turn.</p>

<p>Here is an example of how black formats code</p>
<p align="center">
<img src="/assets/images/blog-2-pre-commit-hooks/black-formatting-example.png" alt="Black formatting" style="width:800px;" />
</p>
<p style="text-align: center;">
    <em>Before Black (Left), After Black formatting (Right)</em>
</p>

<p>You’ll notice how the code got auto-formatted to a uniform structure. This particularly helps in MR review, so the reviewer’s sole focus is on just what changed, not stray commas, newlines and whitespace.
It can be used like so:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">repos</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">repo</span><span class="pi">:</span> <span class="s">https://github.com/ambv/black</span>       <span class="c1"># Repo URL</span>
  <span class="na">rev</span><span class="pi">:</span> <span class="s">22.3.0</span>                               <span class="c1"># Version</span>
  <span class="na">hooks</span><span class="pi">:</span>                                    <span class="c1"># Hooks</span>
    <span class="pi">-</span> <span class="na">id</span><span class="pi">:</span> <span class="s">black</span>                             <span class="c1"># ID of the hook</span>
      <span class="na">name</span><span class="pi">:</span> <span class="s">black-py</span>                        <span class="c1"># Name to display</span>
</code></pre></div></div>

<h3 id="pyupgrade">Pyupgrade</h3>

<p>This is a small &amp; sweet hook that automatically converts syntax to newer versions of the python language.</p>

<p>Few examples:</p>
<ul>
  <li>Dict comprehension
    <ul>
      <li><code class="language-plaintext highlighter-rouge">dict((a, b) for a, b in y)</code> → <code class="language-plaintext highlighter-rouge">{a: b for a, b in y}</code></li>
    </ul>
  </li>
  <li>Set Literals
    <ul>
      <li><code class="language-plaintext highlighter-rouge">set(x for x in y)</code> → <code class="language-plaintext highlighter-rouge">{x for x in y}</code></li>
    </ul>
  </li>
  <li>
    <p>Super Class call</p>

    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> class C(Base):
     def f(self):
-        super(C, self).f()
+        super().f()
</code></pre></div>    </div>
  </li>
</ul>

<p>This hook helps take care of some of the breaking changes in the python API.
It can be used like so:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="pi">-</span> <span class="na">repo</span><span class="pi">:</span> <span class="s">https://github.com/asottile/pyupgrade</span>
  <span class="na">rev</span><span class="pi">:</span> <span class="s">v2.32.0</span>
  <span class="na">hooks</span><span class="pi">:</span>
  <span class="pi">-</span>   <span class="na">id</span><span class="pi">:</span> <span class="s">pyupgrade</span>
      <span class="na">name</span><span class="pi">:</span> <span class="s">pyupgrade-py</span>
</code></pre></div></div>

<h3 id="pylint">Pylint</h3>

<p>This is my favorite; it’s not just a linter but also a static code analyzer. Static code analyzers are tools that check your code without actually executing it.</p>

<p>Pylint has several built-in components which make it powerful enough to even infer actual values from code. After analyzing the code, pylint outputs messages (of 5 types) to inform you how the code can be made better. These 5 types are:</p>

<ol>
  <li><strong>(C)</strong> Convention, for programming standard violation</li>
  <li><strong>(R)</strong> Refactor, for bad code smell</li>
  <li><strong>(W)</strong> Warning, for python specific problems</li>
  <li><strong>(E)</strong> Error, for probable bugs in the code</li>
  <li><strong>(F)</strong> Fatal, if an error occurred which prevented pylint from doing further processing.</li>
</ol>

<p>Let’s look at how pylint does on a sample snippet of python code</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="sh">"""</span><span class="s">script.py</span><span class="sh">"""</span>
<span class="kn">import</span> <span class="n">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">MapFeature</span><span class="p">(</span><span class="n">X1</span><span class="p">,</span> <span class="n">X2</span><span class="p">):</span>
    <span class="n">degree</span> <span class="o">=</span> <span class="mi">6</span>
    <span class="n">out</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nf">ones</span><span class="p">((</span><span class="n">m</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">degree</span> <span class="o">+</span> <span class="mi">1</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">):</span>
            <span class="n">out</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nf">hstack</span><span class="p">(</span>
                <span class="p">(</span><span class="n">out</span><span class="p">,</span> <span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="nf">power</span><span class="p">(</span><span class="n">X1</span><span class="p">,</span> <span class="n">i</span> <span class="o">-</span> <span class="n">j</span><span class="p">)</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="nf">power</span><span class="p">(</span><span class="n">X2</span><span class="p">,</span> <span class="n">j</span><span class="p">))[:,</span> <span class="n">np</span><span class="p">.</span><span class="n">newaxis</span><span class="p">])</span>
            <span class="p">)</span>
    <span class="k">if</span> <span class="n">out</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">out</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="mi">0</span>
    <span class="k">return</span> <span class="n">out</span>

<span class="k">def</span> <span class="nf">get_dict_sum</span><span class="p">():</span>
    <span class="n">data</span> <span class="o">=</span> <span class="p">{</span><span class="sh">"</span><span class="s">a</span><span class="sh">"</span><span class="p">:</span> <span class="mi">10</span><span class="p">,</span> <span class="sh">"</span><span class="s">b</span><span class="sh">"</span><span class="p">:</span> <span class="mi">20</span><span class="p">,</span> <span class="sh">"</span><span class="s">c</span><span class="sh">"</span><span class="p">:</span> <span class="mi">30</span><span class="p">}</span>
    <span class="n">res</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">data</span><span class="p">:</span>
        <span class="n">res</span> <span class="o">+=</span> <span class="n">v</span>

<span class="n">res</span> <span class="o">=</span> <span class="nf">get_dict_sum</span><span class="p">()</span>
</code></pre></div></div>

<p>This is the output that pylint provides when run (via the command line: <code class="language-plaintext highlighter-rouge">pylint script.py</code>) on the above snippet of code</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>************* Module script
script.py:1:0: C0114: Missing module docstring (missing-module-docstring)
script.py:4:0: C0116: Missing function or method docstring (missing-function-docstring)
script.py:4:0: C0103: Function name "MapFeature" doesn't conform to snake_case naming style (invalid-name)
script.py:4:15: C0103: Argument name "X1" doesn't conform to snake_case naming style (invalid-name)
script.py:4:19: C0103: Argument name "X2" doesn't conform to snake_case naming style (invalid-name)
script.py:6:19: E0602: Undefined variable 'm' (undefined-variable)
script.py:12:4: R1705: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it (no-else-return)
script.py:19:0: C0116: Missing function or method docstring (missing-function-docstring)
script.py:22:4: E1141: Unpacking a dictionary in iteration without calling .items() (dict-iter-missing-items)
script.py:22:11: C0103: Variable name "v" doesn't conform to snake_case naming style (invalid-name)
script.py:22:8: W0612: Unused variable 'k' (unused-variable)
script.py:26:0: E1111: Assigning result of a function call, where the function has no return (assignment-from-no-return)
script.py:26:0: C0103: Constant name "res" doesn't conform to UPPER_CASE naming style (invalid-name)

------------------------------------------------------------------
Your code has been rated at 0.00/10 (previous run: 0.00/10, +0.00)
</code></pre></div></div>

<p>The output of pylint is structured in a specific format where each line points to a specific <strong>message code</strong> (one of 5 types). The example below shows a message of type <strong>Warning</strong> (W).</p>

<p align="center">
<img src="/assets/images/blog-2-pre-commit-hooks/pylint-message-structure.png" alt="Pylint message structure" style="width:800px;" />
</p>
<p style="text-align: center;">
    <em>Pylint Message Structure</em>
</p>

<p>You can view in-depth detail of the message code by running:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>pylint <span class="nt">--help-msg</span><span class="o">=</span>W0612
:unused-variable <span class="o">(</span>W0612<span class="o">)</span>: <span class="k">*</span>Unused variable %r<span class="k">*</span>
  Used when a variable is defined but not used. This message belongs to the
  variables checker.
</code></pre></div></div>

<p>You may have noticed how noisy the output of <code class="language-plaintext highlighter-rouge">pylint</code> can sometimes be. For example, you may not always want to name a variable a certain way, or your function may be self-explanatory and not need a docstring. You can silence a specific message code by passing an argument:</p>

<p><code class="language-plaintext highlighter-rouge">pylint --disable=C0114</code></p>

<p>or even disable an entire message category:</p>

<p><code class="language-plaintext highlighter-rouge">pylint --disable=C</code></p>
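<p>Besides the command line, pylint also supports silencing messages inline with a special comment, which is handy when an exception is wanted only in one spot (the function below is just an illustration):</p>

```python
# pylint: disable=invalid-name
# The comment above suppresses C0103 from this point on in the module,
# so the camelCase function name below no longer gets flagged.
def MapFeature(x1, x2):
    """Toy function with a name that would normally violate snake_case."""
    return x1 + x2

print(MapFeature(1, 2))  # 3
```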

<p>pylint can be used as a pre-commit hook by adding it like so:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="pi">-</span> <span class="na">repo</span><span class="pi">:</span> <span class="s">https://github.com/PyCQA/pylint</span>
  <span class="na">rev</span><span class="pi">:</span> <span class="s">v2.15.9</span>
  <span class="na">hooks</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">id</span><span class="pi">:</span> <span class="s">pylint</span>
</code></pre></div></div>

<h3 id="final-pre-commit-configyaml-">Final pre-commit-config.yaml 📝</h3>

<p>Here is the final sample YAML file, which combines all of the hooks we have seen so far, along with some useful tweaks, particularly for <code class="language-plaintext highlighter-rouge">pylint</code>.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># .pre-commit-config.yml</span>
<span class="na">repos</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">repo</span><span class="pi">:</span> <span class="s">https://github.com/ambv/black</span>
  <span class="na">rev</span><span class="pi">:</span> <span class="s">22.3.0</span>
  <span class="na">hooks</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">id</span><span class="pi">:</span> <span class="s">black</span>
      <span class="na">name</span><span class="pi">:</span> <span class="s">black-py</span>
<span class="pi">-</span> <span class="na">repo</span><span class="pi">:</span> <span class="s">https://github.com/asottile/pyupgrade</span>
  <span class="na">rev</span><span class="pi">:</span> <span class="s">v2.32.0</span>
  <span class="na">hooks</span><span class="pi">:</span>
  <span class="pi">-</span>   <span class="na">id</span><span class="pi">:</span> <span class="s">pyupgrade</span>
      <span class="na">name</span><span class="pi">:</span> <span class="s">pyupgrade-py</span>

<span class="pi">-</span> <span class="na">repo</span><span class="pi">:</span> <span class="s">local</span>
  <span class="na">hooks</span><span class="pi">:</span>
  <span class="pi">-</span>   <span class="na">id</span><span class="pi">:</span> <span class="s">pylint</span>
      <span class="na">name</span><span class="pi">:</span> <span class="s">pylint-py</span>
      <span class="c1"># Add project root path</span>
      <span class="na">entry</span><span class="pi">:</span> <span class="s">pylint --init-hook="import sys,os; sys.path.append(os.getcwd())"</span>
      <span class="na">args </span><span class="pi">:</span> <span class="pi">[</span>
        <span class="c1"># black handles this except for string(C0301)</span>
        <span class="c1"># similar lines in multiple files(R0801)</span>
        <span class="c1"># attribute defined outside __init__(W0201)</span>
        <span class="s2">"</span><span class="s">--disable=C0301,R0801,W0201"</span><span class="pi">,</span>
        <span class="c1"># Allow 2-30 char variables</span>
        <span class="s2">"</span><span class="s">--variable-rgx=[a-z_][a-z0-9_]{1,30}$"</span><span class="pi">,</span>
        <span class="c1"># Allow 2-30 char attributes,args</span>
        <span class="s2">"</span><span class="s">--attr-rgx=[a-zA-Z_][a-zA-Z0-9_]{1,30}$"</span><span class="pi">,</span>
        <span class="s2">"</span><span class="s">--argument-rgx=[a-z_][a-z0-9_]{1,30}$"</span><span class="pi">,</span>
        <span class="c1">#  Exclude module member access for E1101</span>
        <span class="s2">"</span><span class="s">--generated-members=torch.*,pandas.*,Levenshtein.*"</span><span class="pi">,</span>
        <span class="c1"># Max local variables</span>
        <span class="s2">"</span><span class="s">--max-locals=25"</span><span class="pi">,</span>
        <span class="c1"># Exclusion for source unavailable pkgs</span>
        <span class="s2">"</span><span class="s">--extension-pkg-whitelist=lxml,pydantic"</span><span class="pi">,</span>
        <span class="c1"># Max Attributes for a class</span>
        <span class="s2">"</span><span class="s">--max-attributes=20"</span><span class="pi">,</span>
      <span class="pi">]</span>
      <span class="na">language</span><span class="pi">:</span> <span class="s">system</span>
      <span class="na">files </span><span class="pi">:</span> <span class="s">\.py$</span>
      <span class="na">require_serial</span><span class="pi">:</span> <span class="kc">true</span>
</code></pre></div></div>

<p><strong>Few Details</strong></p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">repo : local</code>
Defines pylint as a local hook instead of providing a repository URL</li>
  <li><code class="language-plaintext highlighter-rouge">language : system</code>
pre-commit won’t set up a new environment but will use the existing one</li>
  <li><code class="language-plaintext highlighter-rouge">entry: pylint --init-hook="import sys,os; sys.path.append(os.getcwd())"</code>
As we saw earlier, local hooks need to have their entry point defined. Using the <code class="language-plaintext highlighter-rouge">--init-hook</code> option we add the project root to the path. This prevents the import errors <code class="language-plaintext highlighter-rouge">pylint</code> would otherwise throw if the code imports any local modules.</li>
</ul>
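<p>To see why appending the project root matters, here is a minimal sketch of what that <code class="language-plaintext highlighter-rouge">--init-hook</code> accomplishes; the module name <code class="language-plaintext highlighter-rouge">mymodule</code> is hypothetical:</p>

```python
import os
import sys
import tempfile

# Simulate a project containing a local module that an import would
# fail to find without the project root on sys.path
project_root = tempfile.mkdtemp()
with open(os.path.join(project_root, "mymodule.py"), "w") as f:
    f.write("ANSWER = 42\n")

# This mirrors --init-hook="import sys,os; sys.path.append(os.getcwd())",
# with project_root standing in for os.getcwd()
sys.path.append(project_root)

import mymodule  # resolvable now that the root is on the path

print(mymodule.ANSWER)  # 42
```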

<p>Run pre-commit(<code class="language-plaintext highlighter-rouge">pre-commit run</code>) using the above config file to see it work its magic 🪄</p>

<p><strong>Note</strong>: You will need <code class="language-plaintext highlighter-rouge">pylint</code> already installed, since <code class="language-plaintext highlighter-rouge">repo : local</code> &amp; <code class="language-plaintext highlighter-rouge">language : system</code> are defined.</p>

<h2 id="in-this-article-️">In this article ☕️</h2>

<ul>
  <li>You understood why pre-commit is useful</li>
  <li>How a pre-commit config file is structured</li>
  <li>You looked at various hooks (black, pyupgrade and pylint) and how they can be used to tidy up your code.</li>
</ul>

<p>I hope this article was useful; if you have any doubts, do comment below.
Find the snippets of this blog and the config file that I generally use <a href="https://github.com/Praful932/blog/tree/main/blog-artifacts/blog-2-pre-commit-hooks">here</a> : )</p>

<h2 id="references">References</h2>
<ul>
  <li><a href="https://pre-commit.com/">pre-commit Documentation</a></li>
  <li><a href="https://pylint.pycqa.org/en/latest/">Pylint Documentation</a></li>
</ul>]]></content><author><name>Praful Mohanan</name></author><category term="python" /><category term="tech" /><category term="code" /><summary type="html"><![CDATA[Make you code smell less 😵‍💫]]></summary></entry><entry><title type="html">Ordering of set() when dealing with strings in python</title><link href="https://praful932.dev/blog-1-ordered-sets/" rel="alternate" type="text/html" title="Ordering of set() when dealing with strings in python" /><published>2022-12-18T00:00:00+00:00</published><updated>2022-12-18T00:00:00+00:00</updated><id>https://praful932.dev/blog-1-ordered-sets</id><content type="html" xml:base="https://praful932.dev/blog-1-ordered-sets/"><![CDATA[<p>While working on a baseline ML model for a side-project, I found that across different runs 🧪 of my experiments, the results that my model was generating were not exactly reproducible i.e. I was not getting the same performance metrics for the same model configuration, despite having all the knobs in place.</p>

<p>After debugging for quite some time, I found that this snippet was the root of my problems :</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># create list of unique tokens using set
</span><span class="n">unique_tokens</span><span class="p">.</span><span class="nf">extend</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="nf">set</span><span class="p">(</span><span class="n">itertools</span><span class="p">.</span><span class="nf">chain</span><span class="p">(</span><span class="o">*</span><span class="n">train_df</span><span class="p">.</span><span class="n">tokens</span><span class="p">.</span><span class="nf">to_list</span><span class="p">()))))</span>

<span class="n">config</span><span class="p">.</span><span class="n">VOCAB_SIZE</span> <span class="o">=</span> <span class="nf">len</span><span class="p">(</span><span class="n">unique_tokens</span><span class="p">)</span>

<span class="c1"># create tokenizer mapping
</span><span class="n">token2id</span> <span class="o">=</span> <span class="p">{</span><span class="n">token</span> <span class="p">:</span> <span class="n">idx</span> <span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">token</span> <span class="ow">in</span> <span class="nf">enumerate</span><span class="p">(</span><span class="n">unique_tokens</span><span class="p">)}</span>
<span class="n">id2token</span> <span class="o">=</span> <span class="nf">dict</span><span class="p">(</span><span class="nf">enumerate</span><span class="p">(</span><span class="n">unique_tokens</span><span class="p">))</span>
</code></pre></div></div>

<p>I was constructing the tokenizer mapping using the <code class="language-plaintext highlighter-rouge">set()</code> operation, which caused the same model input/output to be encoded &amp; decoded differently on each run.
We’ll see why.</p>

<h3 id="how-set-works">How set() works</h3>

<p>First, we need to understand how <code class="language-plaintext highlighter-rouge">set()</code> is implemented in Python. Internally, a <code class="language-plaintext highlighter-rouge">set()</code> is implemented using a hash table. A hash table, by definition, has a hash function that maps each input to a bucket based on its hash value; this is how it can do membership checks in <code class="language-plaintext highlighter-rouge">O(1)</code>.</p>

<p>When you call <code class="language-plaintext highlighter-rouge">set()</code> on a <code class="language-plaintext highlighter-rouge">list</code> object, it returns the unique values from the input you provided. Internally, to establish this <strong>“uniqueness”</strong>, it uses the hash function we discussed above.
<img src="/assets/images/blog-1-ordered-sets/hash-table.jpeg" alt="Hash Table" /></p>
<p style="text-align: center;">
    <em>Hash Table</em>
</p>
<p>Converting a <code class="language-plaintext highlighter-rouge">list</code> into a <code class="language-plaintext highlighter-rouge">set</code> is straightforward, since two equal values will always map to the same hash bucket. However, this hash function is not always deterministic, particularly when dealing with string objects across two different Python <strong>invocations</strong>. Let’s look at a few examples.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="sh">"""</span><span class="s">snippet1.py</span><span class="sh">"""</span>
<span class="c1"># Snippet to get hash values
</span>
<span class="n">a</span> <span class="o">=</span> <span class="sh">"</span><span class="s">1</span><span class="sh">"</span>
<span class="n">b</span> <span class="o">=</span> <span class="sh">"</span><span class="s">abcde</span><span class="sh">"</span>
<span class="n">c</span> <span class="o">=</span> <span class="mi">1234</span>
<span class="n">d</span> <span class="o">=</span> <span class="mf">6.4512</span>

<span class="n">hv1</span> <span class="o">=</span> <span class="nf">hash</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>
<span class="n">hv2</span> <span class="o">=</span> <span class="nf">hash</span><span class="p">(</span><span class="n">b</span><span class="p">)</span>
<span class="n">hv3</span> <span class="o">=</span> <span class="nf">hash</span><span class="p">(</span><span class="n">c</span><span class="p">)</span>
<span class="n">hv4</span> <span class="o">=</span> <span class="nf">hash</span><span class="p">(</span><span class="n">d</span><span class="p">)</span>

<span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Hash value of </span><span class="si">{</span><span class="n">a</span><span class="si">}</span><span class="s"> - </span><span class="si">{</span><span class="n">hv1</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Hash value of </span><span class="si">{</span><span class="n">b</span><span class="si">}</span><span class="s"> - </span><span class="si">{</span><span class="n">hv2</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Hash value of </span><span class="si">{</span><span class="n">c</span><span class="si">}</span><span class="s"> - </span><span class="si">{</span><span class="n">hv3</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Hash value of </span><span class="si">{</span><span class="n">d</span><span class="si">}</span><span class="s"> - </span><span class="si">{</span><span class="n">hv4</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>

<p>This is what I got from two different invocations of the script</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>python snippet1.py

Hash value of 1 - 1981388520896787279
Hash value of abcde - 4943320557970621589
Hash value of 1234 - 1234
Hash value of 6.4512 - 1040396365757218822
</code></pre></div></div>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>python snippet1.py

Hash value of 1 - <span class="nt">-9001918643517506909</span>
Hash value of abcde - <span class="nt">-757009308147773598</span>
Hash value of 1234 - 1234
Hash value of 6.4512 - 1040396365757218822
</code></pre></div></div>

<p>Notice how the outputs for the <code class="language-plaintext highlighter-rouge">string</code> variables differ across the two <strong>invocations</strong> of the script, while the hash values for the numbers remained constant.</p>

<p>This is because of how the hash function is implemented internally. For <code class="language-plaintext highlighter-rouge">str</code> and <code class="language-plaintext highlighter-rouge">byte</code> objects, the input to the hash function is salted with a random value to protect against certain denial-of-service attacks (<a href="https://docs.python.org/3.8/reference/datamodel.html#object.__hash__">source</a>). Within a single Python invocation the value stays the same, as this <strong>“salting”</strong> happens only once, when the Python executable starts.</p>
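<p>You can observe this salting directly by fixing the seed in fresh interpreters. The sketch below (the helper name is mine) launches child Python processes with an explicit <code class="language-plaintext highlighter-rouge">PYTHONHASHSEED</code>:</p>

```python
import os
import subprocess
import sys

def hash_with_seed(value, seed):
    """Compute hash(value) in a fresh interpreter with a fixed PYTHONHASHSEED."""
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({value!r}))"],
        env={**os.environ, "PYTHONHASHSEED": seed},
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)

# Same seed -> same salt -> same hash, even across separate invocations
assert hash_with_seed("abcde", "0") == hash_with_seed("abcde", "0")

# Different seeds use different salts, so the hashes (almost surely) differ
print(hash_with_seed("abcde", "0") != hash_with_seed("abcde", "1"))
```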

<p><strong>But how do these hash values link to the ordering of the sets 🤔</strong></p>

<p>In the <code class="language-plaintext highlighter-rouge">set()</code> data structure, after an object is hashed, Python takes the last <strong>N</strong> bits of the hash value and uses them as an <strong>index</strong> to place the object in memory. When these values are retrieved, <em>they are yielded in the order in which they sit in memory, <u>not the order in which they were put in</u>.</em></p>
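<p>As a rough illustration, for a set backed by an 8-slot table the initial slot is the last 3 bits of the hash (collisions are then resolved by probing, so the final slot can differ):</p>

```python
# For small ints, hash(n) == n in CPython, so these slots are deterministic
for value in [9, 1, 2, 3, 4, 5]:
    slot = hash(value) & 0b111  # last 3 bits -> index into an 8-slot table
    print(value, "->", slot)

# 9 and 1 both land on slot 1 initially, so one of them gets relocated
# by probing; strings, with randomized hashes, land differently each run
```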

<p><strong>And what happens to the order when you have different hash values across different python invocations?</strong></p>

<p>Here’s an example to make the concept concrete:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="sh">"""</span><span class="s">snippet2.py</span><span class="sh">"""</span>

<span class="n">l1</span> <span class="o">=</span> <span class="p">[</span><span class="mi">9</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">]</span>
<span class="n">l2</span> <span class="o">=</span> <span class="p">[</span><span class="sh">"</span><span class="s">def</span><span class="sh">"</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="sh">"</span><span class="s">abc</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">abc</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">deg</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">xyz</span><span class="sh">"</span><span class="p">]</span>

<span class="n">s1</span> <span class="o">=</span> <span class="nf">set</span><span class="p">(</span><span class="n">l1</span><span class="p">)</span>
<span class="n">s2</span> <span class="o">=</span> <span class="nf">set</span><span class="p">(</span><span class="n">l2</span><span class="p">)</span>

<span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Set 1 - </span><span class="si">{</span><span class="nf">set</span><span class="p">(</span><span class="n">s1</span><span class="p">)</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Set 2 - </span><span class="si">{</span><span class="nf">set</span><span class="p">(</span><span class="n">s2</span><span class="p">)</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>

<p>Output from two different invocations</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">$</span> <span class="n">python</span> <span class="n">snippet2</span><span class="p">.</span><span class="n">py</span>

<span class="n">Set</span> <span class="mi">1</span> <span class="o">-</span> <span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">9</span><span class="p">}</span>
<span class="n">Set</span> <span class="mi">2</span> <span class="o">-</span> <span class="p">{</span><span class="sh">'</span><span class="s">xyz</span><span class="sh">'</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="sh">'</span><span class="s">deg</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">def</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">abc</span><span class="sh">'</span><span class="p">}</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">$</span> <span class="n">python</span> <span class="n">snippet2</span><span class="p">.</span><span class="n">py</span>

<span class="n">Set</span> <span class="mi">1</span> <span class="o">-</span> <span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">9</span><span class="p">}</span>
<span class="n">Set</span> <span class="mi">2</span> <span class="o">-</span> <span class="p">{</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="sh">'</span><span class="s">def</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">abc</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">xyz</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">deg</span><span class="sh">'</span><span class="p">}</span>
</code></pre></div></div>

<p>You’ll notice how for set 2 the ordering is different.</p>

<p>Across the two runs, the strings have different hash values, so they were mapped to different locations in memory, which in turn changed the order in which they were yielded. 💡</p>

<h3 id="can-this-be-fixed">Can this be fixed?</h3>

<p>By design, Python sets are <a href="https://docs.python.org/3/tutorial/datastructures.html#sets">unordered</a>, so it is better to explore alternatives.
As of Python 3.7+, <a href="https://docs.python.org/3.7/library/stdtypes.html#mapping-types-dict">dicts</a> are ordered, so a hack like this works:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sample_list</span> <span class="o">=</span> <span class="p">[</span><span class="sh">"</span><span class="s">def</span><span class="sh">"</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="sh">"</span><span class="s">abc</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">abc</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">deg</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">xyz</span><span class="sh">"</span><span class="p">]</span>

<span class="n">sample_set</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="nb">dict</span><span class="p">.</span><span class="nf">fromkeys</span><span class="p">(</span><span class="n">sample_list</span><span class="p">))</span>
</code></pre></div></div>
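<p>Unlike <code class="language-plaintext highlighter-rouge">set()</code>, this preserves first-seen order, and since dict iteration follows insertion order rather than hash-dependent memory layout, the result is identical on every invocation:</p>

```python
sample_list = ["def", 2, 3, 4, "abc", "abc", "deg", "xyz"]

# Duplicates dropped, first-seen order kept, stable across runs
print(list(dict.fromkeys(sample_list)))
# ['def', 2, 3, 4, 'abc', 'deg', 'xyz']
```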

<p>This is how I modified my code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># create list of unique tokens using dict
</span><span class="n">unique_tokens</span><span class="p">.</span><span class="nf">extend</span><span class="p">(</span>
    <span class="nf">list</span><span class="p">(</span><span class="nb">dict</span><span class="p">.</span><span class="nf">fromkeys</span><span class="p">(</span><span class="n">itertools</span><span class="p">.</span><span class="nf">chain</span><span class="p">(</span><span class="o">*</span><span class="n">_df</span><span class="p">.</span><span class="n">tokens</span><span class="p">.</span><span class="nf">to_list</span><span class="p">())))</span>
<span class="p">)</span>

<span class="n">config</span><span class="p">.</span><span class="n">VOCAB_SIZE</span> <span class="o">=</span> <span class="nf">len</span><span class="p">(</span><span class="n">unique_tokens</span><span class="p">)</span>

<span class="c1"># create tokenizer mapping
</span><span class="n">token2id</span> <span class="o">=</span> <span class="p">{</span><span class="n">token</span><span class="p">:</span> <span class="n">idx</span> <span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">token</span> <span class="ow">in</span> <span class="nf">enumerate</span><span class="p">(</span><span class="n">unique_tokens</span><span class="p">)}</span>
<span class="n">id2token</span> <span class="o">=</span> <span class="nf">dict</span><span class="p">(</span><span class="nf">enumerate</span><span class="p">(</span><span class="n">unique_tokens</span><span class="p">))</span>
</code></pre></div></div>

<p>If you still need to use <code class="language-plaintext highlighter-rouge">set</code> and preserve ordering across different runs (<a href="https://docs.python.org/3.8/reference/datamodel.html#object.__hash__">not recommended</a>), the env variable <code class="language-plaintext highlighter-rouge">PYTHONHASHSEED</code> can be <a href="https://docs.python.org/3.5/using/cmdline.html#envvar-PYTHONHASHSEED">set</a> to <code class="language-plaintext highlighter-rouge">'0'</code> to disable randomization.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">os</span>
<span class="kn">import</span> <span class="n">sys</span>
<span class="n">hash_seed</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="nf">getenv</span><span class="p">(</span><span class="sh">'</span><span class="s">PYTHONHASHSEED</span><span class="sh">'</span><span class="p">)</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">hash_seed</span><span class="p">:</span>
    <span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="sh">'</span><span class="s">PYTHONHASHSEED</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="sh">'</span><span class="s">0</span><span class="sh">'</span>
    <span class="c1"># Spawn a new/child process and run the same file
</span>    <span class="n">os</span><span class="p">.</span><span class="nf">execv</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">executable</span><span class="p">,</span> <span class="p">[</span><span class="n">sys</span><span class="p">.</span><span class="n">executable</span><span class="p">]</span> <span class="o">+</span> <span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">)</span>

<span class="c1"># Your code below
</span>
<span class="n">l1</span> <span class="o">=</span> <span class="p">[</span><span class="mi">9</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">]</span>
<span class="n">l2</span> <span class="o">=</span> <span class="p">[</span><span class="sh">"</span><span class="s">def</span><span class="sh">"</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="sh">"</span><span class="s">abc</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">abc</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">deg</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">xyz</span><span class="sh">"</span><span class="p">]</span>

<span class="n">s1</span> <span class="o">=</span> <span class="nf">set</span><span class="p">(</span><span class="n">l1</span><span class="p">)</span>
<span class="n">s2</span> <span class="o">=</span> <span class="nf">set</span><span class="p">(</span><span class="n">l2</span><span class="p">)</span>

<span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Set 1 - </span><span class="si">{</span><span class="nf">set</span><span class="p">(</span><span class="n">s1</span><span class="p">)</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Set 2 - </span><span class="si">{</span><span class="nf">set</span><span class="p">(</span><span class="n">s2</span><span class="p">)</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>
<p>This snippet turns off the randomization/salting. It does so by setting an <code class="language-plaintext highlighter-rouge">env</code> variable and then spawning a child process that runs the same Python file again, so the new invocation picks up the value of the <code class="language-plaintext highlighter-rouge">env</code> variable.
Running this snippet will give you the same ordering each time. Try it out : )</p>

<h3 id="in-this-article-️">In this article ☕️</h3>

<ul>
  <li>You understood how &amp; why sets are unordered</li>
  <li>How you can make them ordered</li>
  <li>Alternatives to preserve ordering and get unique values</li>
</ul>

<p>Find the snippets from this blog over <a href="https://github.com/Praful932/blog/tree/main/blog-artifacts/blog-1-ordered-sets">here</a> : )</p>

<h3 id="references">References</h3>
<ol>
  <li><a href="https://docs.python.org/3.4/reference/datamodel.html#object.__hash__">Documentation on hash</a></li>
</ol>]]></content><author><name>Praful Mohanan</name></author><category term="python" /><category term="tech" /><category term="code" /><summary type="html"><![CDATA[Why sets are unordered 🤔 and alternatives to order them]]></summary></entry></feed>