<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.1">Jekyll</generator><link href="https://praful932.dev/feed.xml" rel="self" type="application/atom+xml" /><link href="https://praful932.dev/" rel="alternate" type="text/html" /><updated>2026-04-04T08:32:45+00:00</updated><id>https://praful932.dev/feed.xml</id><title type="html">Praful’s Almanac</title><subtitle>Personal Blog where I share about things that I experience and learn</subtitle><author><name>Praful Mohanan</name></author><entry><title type="html">Find better generation parameters for your LLMs using llmsearch</title><link href="https://praful932.dev/blog-4-llmsearch/" rel="alternate" type="text/html" title="Find better generation parameters for your LLMs using llmsearch" /><published>2024-06-09T00:00:00+00:00</published><updated>2024-06-09T00:00:00+00:00</updated><id>https://praful932.dev/blog-4-llmsearch</id><content type="html" xml:base="https://praful932.dev/blog-4-llmsearch/"><![CDATA[<p align="center">
<img src="/assets/images/blog-4-llmsearch/teaser.png" alt="llmsearch teaser" style="width:1000px;" />
</p>

<h2 id="the-backstory">The Backstory</h2>
<p>Back when <a href="https://huggingface.co/EleutherAI/gpt-j-6b">GPT-J from EleutherAI</a> was released, I remember using it for a question-answer extraction task over a span of text using few-shot learning (you provide a few examples in the prompt before the actual question that you want answered). It was a small 6B model and in my initial trials it did not work very well, so I started playing with the generation parameters of the model. I tried many of them manually until I reached a configuration that seemed to do much better than what I originally started with. These are the generation parameters that I manually found for the task.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span>
    <span class="sh">"</span><span class="s">max_new_tokens</span><span class="sh">"</span> <span class="p">:</span> <span class="mi">15</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">min_new_tokens</span><span class="sh">"</span> <span class="p">:</span> <span class="mi">5</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">num_beams</span><span class="sh">"</span> <span class="p">:</span> <span class="mi">3</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">use_cache</span><span class="sh">"</span> <span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">no_repeat_ngram_size</span><span class="sh">"</span> <span class="p">:</span> <span class="mi">4</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>
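<p>For context, the few-shot setup looked roughly like this: a handful of worked examples followed by the actual question, with the model expected to complete the final answer (a reconstructed sketch; the passages and questions here are made up for illustration, not the original prompt):</p>

```python
# hypothetical few-shot prompt for span-based question answering;
# the example passages and questions are invented for illustration
examples = [
    ("Paris is the capital of France.", "What is the capital of France?", "Paris"),
    ("The Nile flows through Egypt.", "Which river flows through Egypt?", "The Nile"),
]

def build_few_shot_prompt(passage, question):
    parts = []
    for ex_passage, ex_question, ex_answer in examples:
        parts.append(f"Text: {ex_passage}\nQuestion: {ex_question}\nAnswer: {ex_answer}")
    # the final block is left open so the model completes the answer
    parts.append(f"Text: {passage}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)

print(build_few_shot_prompt("GPT-J was released by EleutherAI.", "Who released GPT-J?"))
```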
<p>I thought to myself: there should be an easier way to do this.
Generation parameters are more than icing on the cake for a language model, particularly a small one; they can make or break your model. In fact, many of the latest model releases now include a predefined set of generation params that the authors recommend; <a href="https://huggingface.co/meta-llama/Meta-Llama-3-8B/commit/1460c22666392e470910ce3d44ffeb2ab7dbd4df">here</a> is an example for LLAMA 3 8B that was released on huggingface.</p>

<p>This motivated me to build <code class="language-plaintext highlighter-rouge">llmsearch</code>, an easier way of finding generation parameters using the familiar <code class="language-plaintext highlighter-rouge">scikit-learn</code> interface.</p>

<p>Repository - <a href="https://github.com/Praful932/llmsearch"><img src="https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&amp;logo=github&amp;logoColor=white" alt="GitHub" /></a></p>

<p>Documentation - <a href="https://llmsearch.netlify.app"><img src="https://img.shields.io/badge/netlify-%23000000.svg?style=for-the-badge&amp;logo=netlify&amp;logoColor=#00C7B7" alt="Netlify" /></a></p>

<h2 id="main-arc-step-by-step-guide-to-use-llmsearch"><del>Main Arc</del> Step-by-Step Guide to use <code class="language-plaintext highlighter-rouge">llmsearch</code></h2>

<p>The following example demonstrates <code class="language-plaintext highlighter-rouge">llmsearch</code> on a LLAMA-3 model, specifically <code class="language-plaintext highlighter-rouge">casperhansen/llama-3-8b-instruct-awq</code>, on the well-known <code class="language-plaintext highlighter-rouge">samsum</code> dataset. We will use a quantized <code class="language-plaintext highlighter-rouge">AWQ</code> model.</p>

<p>Open the notebook <a href="https://colab.research.google.com/github/Praful932/llmsearch/blob/main/examples/llmsearch_quickstart.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" /></a> if you want to follow along.</p>

<h3 id="install-dependencies">Install dependencies</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># install llmsearch</span>
<span class="o">!</span>pip <span class="nb">install </span>llmsearch[pynvml] <span class="nt">-q</span>

<span class="c"># pinning to specific versions to avoid import issues - https://github.com/casper-hansen/AutoAWQ/issues/374</span>
<span class="c"># only required if using awq model</span>
<span class="o">!</span>pip <span class="nb">install </span><span class="nv">transformers</span><span class="o">==</span>4.38.2 <span class="nt">-q</span>
<span class="o">!</span>pip <span class="nb">install </span>torch@https://download.pytorch.org/whl/cu121/torch-2.2.0%2Bcu121-cp310-cp310-linux_x86_64.whl#sha256<span class="o">=</span>c441021672ebe2e5afbdb34817aa85e6d32130f94df2da9ad4cb78a9d4b81370 <span class="nt">-q</span>
<span class="o">!</span>pip <span class="nb">install </span><span class="nv">autoawq</span><span class="o">==</span>0.2.4 <span class="nv">autoawq_kernels</span><span class="o">==</span>0.0.6 <span class="nt">-q</span>

<span class="c"># install dependencies required for this example</span>
<span class="o">!</span>pip <span class="nb">install </span><span class="nv">accelerate</span><span class="o">==</span>0.30.1 <span class="nv">py7zr</span><span class="o">==</span>0.21.0 <span class="nv">evaluate</span><span class="o">==</span>0.4.0 <span class="nv">rouge_score</span><span class="o">==</span>0.1.2 <span class="nt">-q</span>
</code></pre></div></div>

<h3 id="import-required-libraries">Import required libraries</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Autocompletion
</span><span class="o">%</span><span class="n">config</span> <span class="n">Completer</span><span class="p">.</span><span class="n">use_jedi</span> <span class="o">=</span> <span class="bp">False</span>

<span class="c1"># Autoreload
</span><span class="o">%</span><span class="n">load_ext</span> <span class="n">autoreload</span>
<span class="o">%</span><span class="n">autoreload</span> <span class="mi">2</span>

<span class="kn">import</span> <span class="n">awq</span>
<span class="kn">import</span> <span class="n">torch</span>
<span class="kn">import</span> <span class="n">transformers</span>
<span class="kn">import</span> <span class="n">llmsearch</span>
<span class="kn">import</span> <span class="n">evaluate</span>
<span class="kn">import</span> <span class="n">datasets</span>
<span class="kn">import</span> <span class="n">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="kn">from</span> <span class="n">awq</span> <span class="kn">import</span> <span class="n">AutoAWQForCausalLM</span>
<span class="kn">from</span> <span class="n">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">GridSearchCV</span>
<span class="kn">from</span> <span class="n">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span><span class="p">,</span> <span class="n">AutoModelForCausalLM</span><span class="p">,</span> <span class="n">StoppingCriteriaList</span>

<span class="kn">from</span> <span class="n">llmsearch.tuner</span> <span class="kn">import</span> <span class="n">Tuner</span>
<span class="kn">from</span> <span class="n">llmsearch.scripts.stopping_criteria</span> <span class="kn">import</span> <span class="n">MultiTokenStoppingCriteria</span>
</code></pre></div></div>

<p>Set some variables that we will use later.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">seed</span> <span class="o">=</span> <span class="mi">42</span>
<span class="n">batch_size</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">num_samples</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">device</span> <span class="o">=</span> <span class="sh">"</span><span class="s">cuda:0</span><span class="sh">"</span>
</code></pre></div></div>

<h3 id="load-model--dataset">Load model &amp; dataset</h3>

<p>Load the <code class="language-plaintext highlighter-rouge">casperhansen/llama-3-8b-instruct-awq</code> model with the <code class="language-plaintext highlighter-rouge">refs/pr/6</code> revision. <a href="https://huggingface.co/casperhansen/llama-3-8b-instruct-awq/discussions/6">This revision</a> has the right <code class="language-plaintext highlighter-rouge">EOS</code> token configured as per the <a href="https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/tree/main">official LLAMA 3 repository</a>; not using the correct token mapping produces incorrect output from the model. We will use the <code class="language-plaintext highlighter-rouge">samsum</code> dataset to run the generation hyper-parameter search on.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model_id</span> <span class="o">=</span> <span class="sh">"</span><span class="s">casperhansen/llama-3-8b-instruct-awq</span><span class="sh">"</span>
<span class="n">revision</span> <span class="o">=</span> <span class="sh">"</span><span class="s">refs/pr/6</span><span class="sh">"</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="nf">from_pretrained</span><span class="p">(</span><span class="n">model_id</span><span class="p">,</span><span class="n">revision</span> <span class="o">=</span> <span class="n">revision</span><span class="p">)</span>
<span class="n">tokenizer</span><span class="p">.</span><span class="n">padding_side</span> <span class="o">=</span> <span class="sh">"</span><span class="s">left</span><span class="sh">"</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">AutoAWQForCausalLM</span><span class="p">.</span><span class="nf">from_quantized</span><span class="p">(</span>
        <span class="n">model_id</span><span class="p">,</span> <span class="n">fuse_layers</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">device_map</span><span class="o">=</span><span class="p">{</span><span class="sh">""</span><span class="p">:</span> <span class="n">device</span><span class="p">},</span> <span class="n">revision</span> <span class="o">=</span> <span class="n">revision</span>
    <span class="p">)</span>

<span class="n">dataset</span> <span class="o">=</span> <span class="n">datasets</span><span class="p">.</span><span class="nf">load_dataset</span><span class="p">(</span><span class="sh">"</span><span class="s">samsum</span><span class="sh">"</span><span class="p">)[</span><span class="sh">'</span><span class="s">train</span><span class="sh">'</span><span class="p">]</span>
<span class="n">sample_dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="nf">shuffle</span><span class="p">(</span><span class="n">seed</span> <span class="o">=</span> <span class="n">seed</span><span class="p">).</span><span class="nf">select</span><span class="p">(</span><span class="nf">range</span><span class="p">(</span><span class="n">num_samples</span><span class="p">))</span>

<span class="c1"># These are required to make the model end the sequence correctly - https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct#transformers-automodelforcausallm
</span><span class="n">terminators</span> <span class="o">=</span> <span class="p">[</span>
    <span class="mi">128001</span><span class="p">,</span>
    <span class="mi">128009</span><span class="p">,</span>
<span class="p">]</span>
</code></pre></div></div>

<h3 id="define-dataset-preprocessor-and-metric">Define dataset preprocessor and metric</h3>
<p>For a particular dataset, we can define columns that will be used for evaluation (<code class="language-plaintext highlighter-rouge">eval_cols</code>) and columns that will be used while running inference (<code class="language-plaintext highlighter-rouge">input_cols</code>).
Once you have decided on a metric, an evaluation function needs to be defined that takes in two arguments, <code class="language-plaintext highlighter-rouge">y_true : list</code> &amp; <code class="language-plaintext highlighter-rouge">y_pred : list</code>. <code class="language-plaintext highlighter-rouge">y_pred</code> is what the model predicts; <code class="language-plaintext highlighter-rouge">y_true</code> contains (for each item in the list) the evaluation columns (<code class="language-plaintext highlighter-rouge">eval_cols</code>) defined in your <code class="language-plaintext highlighter-rouge">Tuner</code> object, more on this later.</p>
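<p>As a minimal illustration of this contract, a scorer could look like the toy exact-match metric below (purely hypothetical; the actual example further down uses ROUGE):</p>

```python
# toy scorer matching the (y_true, y_pred) contract: y_true is a list of dicts
# holding the eval_cols, y_pred is a list of model outputs
def exact_match_score(y_true: list, y_pred: list) -> float:
    matches = [
        int(item["summary"].strip() == pred.strip())
        for item, pred in zip(y_true, y_pred)
    ]
    return sum(matches) / len(matches)

score = exact_match_score(
    [{"summary": "Sue doesn't watch JK any more."}, {"summary": "Simon is painting."}],
    ["Sue doesn't watch JK any more.", "Simon paints cupboards."],
)
print(score)  # 0.5 - one of the two predictions matches exactly
```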

<p>Your dataset preprocessor should take in a single item from your dataset and return a <code class="language-plaintext highlighter-rouge">string</code> that is ready to be tokenized and can be passed directly into the model. In this example we convert an item of the dataset into the <a href="https://huggingface.co/docs/transformers/main/en/chat_templating">chat template</a> format. The dataset preprocessor function should take in a <code class="language-plaintext highlighter-rouge">tokenizer</code> and <code class="language-plaintext highlighter-rouge">kwargs</code>; the <code class="language-plaintext highlighter-rouge">kwargs</code> will contain the keys that you defined as <code class="language-plaintext highlighter-rouge">input_cols</code> when creating the <code class="language-plaintext highlighter-rouge">Tuner</code> object, more on this in the next section.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># create a function that can be used for evaluation, should take in y_true (list[dict]), y_pred (list) and return a metric
</span><span class="n">rouge</span> <span class="o">=</span> <span class="n">evaluate</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="sh">'</span><span class="s">rouge</span><span class="sh">'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_rouge_score</span><span class="p">(</span><span class="n">y_true</span> <span class="p">:</span> <span class="nb">list</span><span class="p">,</span> <span class="n">y_pred</span> <span class="p">:</span> <span class="nb">list</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="nf">mean</span><span class="p">(</span><span class="n">rouge</span><span class="p">.</span><span class="nf">compute</span><span class="p">(</span><span class="n">predictions</span><span class="o">=</span><span class="n">y_pred</span><span class="p">,</span> <span class="n">references</span><span class="o">=</span><span class="p">[</span><span class="n">item</span><span class="p">[</span><span class="sh">'</span><span class="s">summary</span><span class="sh">'</span><span class="p">]</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">y_true</span><span class="p">],</span> <span class="n">use_stemmer</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">use_aggregator</span><span class="o">=</span><span class="bp">False</span><span class="p">)[</span><span class="sh">'</span><span class="s">rouge2</span><span class="sh">'</span><span class="p">])</span>

<span class="c1"># Define a dataset preprocessor that is called for every example in the dataset separately - Should take in tokenizer &amp; kwargs and return a string that can be input directly to the model, here we apply chat template which most decoder models use
</span><span class="k">def</span> <span class="nf">sample_to_chat_format</span><span class="p">(</span><span class="n">tokenizer</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
    <span class="n">messages</span> <span class="o">=</span> <span class="p">[</span>
        <span class="p">{</span>
            <span class="sh">'</span><span class="s">role</span><span class="sh">'</span> <span class="p">:</span> <span class="sh">"</span><span class="s">system</span><span class="sh">"</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">content</span><span class="sh">'</span> <span class="p">:</span> <span class="sh">"</span><span class="s">You are a helpful AI assistant.</span><span class="sh">"</span>
        <span class="p">},</span>
        <span class="p">{</span>
            <span class="sh">'</span><span class="s">role</span><span class="sh">'</span> <span class="p">:</span> <span class="sh">"</span><span class="s">user</span><span class="sh">"</span><span class="p">,</span>
            <span class="sh">'</span><span class="s">content</span><span class="sh">'</span> <span class="p">:</span> <span class="sa">f</span><span class="sh">"</span><span class="s">Summarize the following text in less than 50 words: </span><span class="si">{</span><span class="n">kwargs</span><span class="p">[</span><span class="sh">'</span><span class="s">dialogue</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="sh">"</span>
        <span class="p">}</span>
    <span class="p">]</span>
    <span class="k">return</span> <span class="n">tokenizer</span><span class="p">.</span><span class="nf">apply_chat_template</span><span class="p">(</span><span class="n">messages</span><span class="p">,</span> <span class="n">tokenize</span> <span class="o">=</span> <span class="bp">False</span><span class="p">,</span> <span class="n">add_generation_prompt</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
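<p>To sanity-check the preprocessor contract without loading the real model, you can call it with a stub tokenizer (a toy stand-in, purely for illustration; the actual run uses the <code class="language-plaintext highlighter-rouge">AutoTokenizer</code> loaded earlier, and the bracketed role markers are not real LLAMA-3 special tokens):</p>

```python
# toy stand-in for the HF tokenizer, only to show how kwargs flow into the
# preprocessor; the [role] markers are illustrative, not real special tokens
class StubTokenizer:
    def apply_chat_template(self, messages, tokenize=False, add_generation_prompt=True):
        rendered = "".join(f"[{m['role']}] {m['content']}\n" for m in messages)
        return rendered + "[assistant] " if add_generation_prompt else rendered

def sample_to_chat_format(tokenizer, **kwargs):
    messages = [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": f"Summarize the following text in less than 50 words: {kwargs['dialogue']}"},
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# keys listed under `input_cols` (here `dialogue`) arrive as keyword arguments
print(sample_to_chat_format(StubTokenizer(), dialogue="Amanda: hi!"))
```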

<h3 id="define-tuner-object">Define <code class="language-plaintext highlighter-rouge">Tuner</code> object</h3>

<p>This is the central object and where most of the magic happens. It takes in everything you have defined so far and abstracts it into a <code class="language-plaintext highlighter-rouge">Tuner</code> object. It also preprocesses the dataset so that you are ready to run inference. The <code class="language-plaintext highlighter-rouge">column_mapping</code> identifies which columns in the dataset will be used for preprocessing/inference (<code class="language-plaintext highlighter-rouge">input_cols</code>) and which will be used for evaluation (<code class="language-plaintext highlighter-rouge">eval_cols</code>). This is how <code class="language-plaintext highlighter-rouge">Tuner</code> knows which arguments to send to the <code class="language-plaintext highlighter-rouge">sample_preprocessor</code> function (to preprocess the dataset) and which ones to the <code class="language-plaintext highlighter-rouge">scorer</code> (to evaluate the model).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># define tuner object, this preprocesses the dataset and creates an LLMEstimator that can be run with GridSearchCV / RandomizedSearchCV of scikit-learn
</span><span class="n">tuner_ob</span> <span class="o">=</span> <span class="nc">Tuner</span><span class="p">(</span>
    <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
    <span class="n">tokenizer</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">,</span>
    <span class="n">dataset</span><span class="o">=</span><span class="n">sample_dataset</span><span class="p">,</span>
    <span class="n">device</span><span class="o">=</span><span class="sh">"</span><span class="s">cuda:0</span><span class="sh">"</span><span class="p">,</span>
    <span class="c1"># the tuner module automatically reduces the batch size while running inference if it goes OOM
</span>    <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span>
    <span class="n">tokenizer_encode_args</span><span class="o">=</span><span class="p">{</span><span class="sh">"</span><span class="s">padding</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">longest</span><span class="sh">"</span><span class="p">,</span><span class="sh">'</span><span class="s">truncation</span><span class="sh">'</span> <span class="p">:</span> <span class="bp">True</span><span class="p">,</span> <span class="sh">"</span><span class="s">add_special_tokens</span><span class="sh">"</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span> <span class="sh">'</span><span class="s">max_length</span><span class="sh">'</span> <span class="p">:</span> <span class="mi">1024</span><span class="p">},</span>
    <span class="n">tokenizer_decode_args</span><span class="o">=</span><span class="p">{</span><span class="sh">"</span><span class="s">spaces_between_special_tokens</span><span class="sh">"</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span> <span class="sh">'</span><span class="s">skip_special_tokens</span><span class="sh">'</span> <span class="p">:</span> <span class="bp">True</span><span class="p">},</span>
    <span class="c1"># pass in the scorer that we will be used to evaluate (input to this function is a batch)
</span>    <span class="n">scorer</span><span class="o">=</span><span class="n">get_rouge_score</span><span class="p">,</span>
    <span class="c1"># pass in `dataset` preprocessor, this is run on the passed in dataset before feeding into the model, input of this function is a single example
</span>    <span class="n">sample_preprocessor</span><span class="o">=</span><span class="n">sample_to_chat_format</span><span class="p">,</span>
    <span class="n">seed</span><span class="o">=</span><span class="n">seed</span><span class="p">,</span>
    <span class="c1"># column mapping used to identify input and evaluation columns (these columns are passed in to the evaluation function (scorer) &amp; the dataset preprocessor(sample_preprocessor))
</span>    <span class="n">column_mapping</span><span class="o">=</span><span class="p">{</span><span class="sh">"</span><span class="s">input_cols</span><span class="sh">"</span><span class="p">:</span> <span class="p">[</span><span class="sh">"</span><span class="s">dialogue</span><span class="sh">"</span><span class="p">],</span> <span class="sh">"</span><span class="s">eval_cols</span><span class="sh">"</span><span class="p">:</span> <span class="p">[</span><span class="sh">"</span><span class="s">summary</span><span class="sh">"</span><span class="p">]},</span>
<span class="p">)</span>
</code></pre></div></div>

<p>You can examine whether the dataset was preprocessed correctly; <code class="language-plaintext highlighter-rouge">Tuner</code> preprocesses the dataset and stores the inputs and targets at <code class="language-plaintext highlighter-rouge">_X</code> &amp; <code class="language-plaintext highlighter-rouge">_y</code> respectively.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="c1"># Check to see if dataset is processed as expected, `Tuner` populates `_X` with the processed input and `_y` with `column_mapping.eval_cols`
</span><span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Inputs: </span><span class="sh">"</span><span class="p">)</span>
<span class="k">for</span> <span class="n">_x</span><span class="p">,</span> <span class="n">_y</span> <span class="ow">in</span> <span class="nf">zip</span><span class="p">(</span><span class="n">tuner_ob</span><span class="p">.</span><span class="n">dataset</span><span class="p">[</span><span class="sh">'</span><span class="s">_X</span><span class="sh">'</span><span class="p">][:</span><span class="mi">3</span><span class="p">],</span> <span class="n">tuner_ob</span><span class="p">.</span><span class="n">dataset</span><span class="p">[</span><span class="sh">'</span><span class="s">_y</span><span class="sh">'</span><span class="p">][:</span><span class="mi">3</span><span class="p">]):</span>
    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Input: </span><span class="si">{</span><span class="n">_x</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
    <span class="nf">print</span><span class="p">(</span><span class="sh">'</span><span class="se">\n</span><span class="sh">'</span><span class="p">)</span>
    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Output: </span><span class="si">{</span><span class="n">_y</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>

    <span class="nf">print</span><span class="p">(</span><span class="sh">'</span><span class="se">\n\n</span><span class="sh">'</span><span class="p">)</span>
    <span class="nf">print</span><span class="p">(</span><span class="sh">'</span><span class="s">---</span><span class="sh">'</span> <span class="o">*</span> <span class="mi">15</span><span class="p">,</span><span class="sh">'</span><span class="se">\n\n</span><span class="sh">'</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Inputs</span><span class="p">:</span>
<span class="n">Input</span><span class="p">:</span> <span class="o">&lt;|</span><span class="n">begin_of_text</span><span class="o">|&gt;&lt;|</span><span class="n">start_header_id</span><span class="o">|&gt;</span><span class="n">system</span><span class="o">&lt;|</span><span class="n">end_header_id</span><span class="o">|&gt;</span>

<span class="n">You</span> <span class="n">are</span> <span class="n">a</span> <span class="n">helpful</span> <span class="n">AI</span> <span class="n">assistant</span><span class="p">.</span><span class="o">&lt;|</span><span class="n">eot_id</span><span class="o">|&gt;&lt;|</span><span class="n">start_header_id</span><span class="o">|&gt;</span><span class="n">user</span><span class="o">&lt;|</span><span class="n">end_header_id</span><span class="o">|&gt;</span>

<span class="n">Summarize</span> <span class="n">the</span> <span class="n">following</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">less</span> <span class="n">than</span> <span class="mi">50</span> <span class="n">words</span><span class="p">:</span> <span class="n">Lucy</span><span class="p">:</span> <span class="n">omg</span> <span class="n">did</span> <span class="n">you</span> <span class="n">see</span> <span class="n">JK</span> <span class="n">this</span> <span class="n">morning</span><span class="err">?</span>
<span class="n">Sue</span><span class="p">:</span> <span class="n">I</span> <span class="k">try</span> <span class="n">to</span> <span class="n">avoid</span> <span class="n">it</span> <span class="n">lol</span>
<span class="n">Lucy</span><span class="p">:</span> <span class="n">you</span> <span class="n">should</span> <span class="n">have</span> <span class="n">seen</span> <span class="n">it</span> <span class="n">it</span> <span class="n">was</span> <span class="n">disgusting</span>
<span class="n">Sue</span><span class="p">:</span> <span class="n">I</span> <span class="n">cant</span> <span class="n">do</span> <span class="n">it</span> <span class="n">anymore</span> <span class="n">i</span> <span class="k">try</span> <span class="n">to</span> <span class="n">listen</span> <span class="n">to</span> <span class="n">the</span> <span class="n">radio</span> <span class="ow">in</span> <span class="n">the</span> <span class="n">mornings</span><span class="p">..</span> <span class="n">jk</span> <span class="n">makes</span> <span class="n">you</span> <span class="n">think</span> <span class="n">the</span> <span class="n">whole</span> <span class="n">world</span> <span class="ow">is</span> <span class="n">full</span> <span class="n">of</span> <span class="n">idiots</span> <span class="n">lol</span>
<span class="n">Lucy</span><span class="p">:</span> <span class="n">you</span> <span class="n">may</span> <span class="n">be</span> <span class="n">right</span> <span class="n">I</span> <span class="n">dont</span> <span class="n">know</span> <span class="n">how</span> <span class="n">some</span> <span class="n">of</span> <span class="n">them</span> <span class="n">can</span> <span class="n">go</span> <span class="n">on</span> <span class="n">there</span> <span class="ow">in</span> <span class="n">public</span> <span class="k">for</span> <span class="n">the</span> <span class="n">world</span> <span class="n">to</span> <span class="n">see</span>
<span class="n">Sue</span><span class="p">:</span> <span class="n">I</span> <span class="n">would</span> <span class="n">die</span> <span class="k">if</span> <span class="n">I</span> <span class="n">got</span> <span class="n">a</span> <span class="n">call</span> <span class="n">to</span> <span class="n">go</span> <span class="n">on</span> <span class="n">there</span> <span class="n">lol</span>
<span class="n">Sue</span><span class="p">:</span> <span class="n">could</span> <span class="n">you</span> <span class="n">imagine</span> <span class="n">ha</span> <span class="n">ha</span>
<span class="n">Lucy</span><span class="p">:</span> <span class="n">I</span> <span class="n">would</span> <span class="n">piss</span> <span class="n">myself</span> <span class="n">If</span> <span class="n">I</span> <span class="n">saw</span> <span class="n">you</span> <span class="ow">and</span> <span class="n">Andy</span> <span class="n">up</span> <span class="n">there</span>
<span class="n">Sue</span><span class="p">:</span> <span class="n">over</span> <span class="n">my</span> <span class="n">dead</span> <span class="n">body</span> <span class="err">!</span><span class="o">&lt;|</span><span class="n">eot_id</span><span class="o">|&gt;&lt;|</span><span class="n">start_header_id</span><span class="o">|&gt;</span><span class="n">assistant</span><span class="o">&lt;|</span><span class="n">end_header_id</span><span class="o">|&gt;</span>

<span class="n">Output</span><span class="p">:</span> <span class="p">{</span><span class="sh">'</span><span class="s">summary</span><span class="sh">'</span><span class="p">:</span> <span class="sh">"</span><span class="s">Sue doesn</span><span class="sh">'</span><span class="s">t watch JK any more as it</span><span class="sh">'</span><span class="s">s disgusting.</span><span class="sh">"</span><span class="p">}</span>

<span class="o">---------------------------------------------</span>

<span class="n">Input</span><span class="p">:</span> <span class="o">&lt;|</span><span class="n">begin_of_text</span><span class="o">|&gt;&lt;|</span><span class="n">start_header_id</span><span class="o">|&gt;</span><span class="n">system</span><span class="o">&lt;|</span><span class="n">end_header_id</span><span class="o">|&gt;</span>

<span class="n">You</span> <span class="n">are</span> <span class="n">a</span> <span class="n">helpful</span> <span class="n">AI</span> <span class="n">assistant</span><span class="p">.</span><span class="o">&lt;|</span><span class="n">eot_id</span><span class="o">|&gt;&lt;|</span><span class="n">start_header_id</span><span class="o">|&gt;</span><span class="n">user</span><span class="o">&lt;|</span><span class="n">end_header_id</span><span class="o">|&gt;</span>

<span class="n">Summarize</span> <span class="n">the</span> <span class="n">following</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">less</span> <span class="n">than</span> <span class="mi">50</span> <span class="n">words</span><span class="p">:</span> <span class="n">Wendy</span><span class="p">:</span> <span class="n">What</span><span class="sh">'</span><span class="s">s up?
Simon: Nothing much. I</span><span class="sh">'</span><span class="n">m</span> <span class="n">painting</span> <span class="n">my</span> <span class="n">cupboards</span><span class="p">.</span>
<span class="n">Angela</span><span class="p">:</span> <span class="n">Cool</span> <span class="n">what</span> <span class="n">colour</span><span class="err">?</span>
<span class="n">Simon</span><span class="p">:</span> <span class="n">Green</span><span class="p">.</span>
<span class="n">Ben</span><span class="p">:</span> <span class="n">I</span><span class="sh">'</span><span class="s">m just chilling in the garden.
Angela: Nice weekend! I</span><span class="sh">'</span><span class="n">m</span> <span class="n">about</span> <span class="n">to</span> <span class="n">meet</span> <span class="n">Chris</span><span class="p">.</span>
<span class="n">Wendy</span><span class="p">:</span> <span class="n">Say</span> <span class="n">hello</span> <span class="k">from</span> <span class="n">me</span><span class="err">!</span>
<span class="n">Angela</span><span class="p">:</span> <span class="n">Will</span> <span class="n">do</span><span class="err">!</span> <span class="n">And</span> <span class="n">how</span> <span class="ow">is</span> <span class="n">your</span> <span class="n">weekend</span><span class="p">,</span> <span class="n">Wendy</span><span class="err">?</span>
<span class="n">Wendy</span><span class="p">:</span> <span class="n">Very</span> <span class="n">lazy</span><span class="p">...</span> <span class="n">The</span> <span class="n">week</span> <span class="n">was</span> <span class="n">hard</span> <span class="n">at</span> <span class="n">work</span><span class="p">,</span> <span class="n">I</span> <span class="n">really</span> <span class="n">needed</span> <span class="n">some</span> <span class="n">rest</span><span class="p">.</span>
<span class="n">Ben</span><span class="p">:</span> <span class="n">We</span> <span class="n">should</span> <span class="nb">all</span> <span class="n">come</span> <span class="ow">and</span> <span class="n">visit</span> <span class="n">Simon</span> <span class="ow">in</span> <span class="n">his</span> <span class="n">new</span> <span class="n">apartment</span><span class="err">!</span>
<span class="n">Simon</span><span class="p">:</span> <span class="n">You</span> <span class="n">are</span> <span class="n">welcome</span><span class="p">,</span> <span class="n">guys</span><span class="err">!</span> <span class="n">Whenever</span> <span class="n">you</span> <span class="n">wish</span><span class="p">.</span>
<span class="n">Ben</span><span class="p">:</span> <span class="n">I</span> <span class="n">should</span> <span class="n">be</span> <span class="ow">in</span> <span class="n">Bournemouth</span> <span class="nb">next</span> <span class="n">week</span><span class="p">.</span>
<span class="n">Simon</span><span class="p">:</span> <span class="n">I</span><span class="sh">'</span><span class="s">m not going anywhere :-)
Ben: Cool, I</span><span class="sh">'</span><span class="n">ll</span> <span class="n">call</span> <span class="n">you</span> <span class="nb">next</span> <span class="n">week</span><span class="p">.</span><span class="o">&lt;|</span><span class="n">eot_id</span><span class="o">|&gt;&lt;|</span><span class="n">start_header_id</span><span class="o">|&gt;</span><span class="n">assistant</span><span class="o">&lt;|</span><span class="n">end_header_id</span><span class="o">|&gt;</span>

<span class="n">Output</span><span class="p">:</span> <span class="p">{</span><span class="sh">'</span><span class="s">summary</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">This weekend Wendy is very lazy because she worked hard at work, and Angela is meeting Chris. Simon is chilling in the garden and painting his cupboards green. Next week, Ben, Angela, Chris and Wendy will visit him in his new apartament.</span><span class="sh">'</span><span class="p">}</span>

<span class="o">---------------------------------------------</span>
</code></pre></div></div>

<h3 id="evaluation-before-tuning">Evaluation Before Tuning</h3>

<p>Before running a search, you should evaluate the score that the default settings provide, since the objective is to find a better score than the one you start with.</p>

<p>You can get the score by calling <code class="language-plaintext highlighter-rouge">tuner_ob.get_score</code> with the default parameters. I have used three parameters here. <code class="language-plaintext highlighter-rouge">max_new_tokens</code> can be set by estimating the token-length distribution of the expected outputs. <code class="language-plaintext highlighter-rouge">generation_seed</code> seeds the sampler before generation, which becomes important when you are running a hyperparameter search to ensure reproducibility.</p>
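<p>To see why seeding matters, note that sampling-based decoding is stochastic: fixing a seed before each generation makes the sampled output repeatable, so every search candidate is scored on the same footing. A toy illustration with Python’s standard library (not llmsearch internals):</p>

```python
import random

def sample_tokens(seed, vocab, k=5):
    # Toy sampler: fixing the seed before each call makes draws repeatable.
    rng = random.Random(seed)
    return rng.choices(vocab, k=k)

vocab = ["the", "cat", "sat", "on", "mat"]
a = sample_tokens(42, vocab)
b = sample_tokens(42, vocab)
assert a == b  # same seed, same sampled tokens
```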

<p>You also do not want to generate tokens indefinitely until you hit the <code class="language-plaintext highlighter-rouge">max_new_tokens</code> limit; generation should stop when it reaches a certain token or a certain sequence of tokens. You can use either <code class="language-plaintext highlighter-rouge">eos_token_id</code> or a <a href="https://github.com/Praful932/llmsearch/blob/main/llmsearch/scripts/stopping_criteria.py">stopping criteria</a>.</p>
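<p>The effect of a stop token can be sketched with a toy generation loop (illustrative only, not how real decoding is implemented): decoding halts as soon as an id from the terminator set appears, instead of always running to the budget.</p>

```python
# Toy decoding loop: stop on a terminator id or on the max_new_tokens budget.
def generate(step_fn, terminators, max_new_tokens):
    out = []
    for _ in range(max_new_tokens):
        tok = step_fn(out)          # the "model" produces the next token id
        out.append(tok)
        if tok in terminators:      # eos / stop-token hit
            break
    return out

# A fake model that emits 5, 6, 7, then the stop id 0.
script = [5, 6, 7, 0, 8, 9]
tokens = generate(lambda out: script[len(out)], terminators={0}, max_new_tokens=6)
assert tokens == [5, 6, 7, 0]  # stopped early, before max_new_tokens
```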

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Get score &amp; outputs using some generation parameters
</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">pad_token</span> <span class="o">=</span> <span class="sh">"</span><span class="s">&lt;|end_of_text|&gt;</span><span class="sh">"</span>
<span class="n">gen_params</span> <span class="o">=</span> <span class="p">{</span>
    <span class="sh">'</span><span class="s">max_new_tokens</span><span class="sh">'</span> <span class="p">:</span> <span class="mi">70</span><span class="p">,</span>
    <span class="sh">'</span><span class="s">generation_seed</span><span class="sh">'</span> <span class="p">:</span> <span class="mi">42</span><span class="p">,</span>
    <span class="sh">'</span><span class="s">eos_token_id</span><span class="sh">'</span> <span class="p">:</span> <span class="n">terminators</span><span class="p">,</span>
<span class="p">}</span>

<span class="n">score</span><span class="p">,</span> <span class="n">outputs</span> <span class="o">=</span> <span class="n">tuner_ob</span><span class="p">.</span><span class="nf">get_score</span><span class="p">(</span><span class="n">gen_params</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Score - </span><span class="si">{</span><span class="n">score</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="hyperparameter-search">Hyperparameter Search</h3>

<p>Once you have instantiated the Tuner object, it exposes a <code class="language-plaintext highlighter-rouge">tuner_ob.estimator</code> which is a <code class="language-plaintext highlighter-rouge">scikit-learn</code> compatible <code class="language-plaintext highlighter-rouge">BaseEstimator</code> <a href="https://github.com/scikit-learn/scikit-learn/blob/ea1e8c4b216d4b1e21b02bafe75ee1713ad21079/sklearn/base.py#L152">object</a>. This can be used with <code class="language-plaintext highlighter-rouge">scikit-learn</code> methods. We will use it with <code class="language-plaintext highlighter-rouge">GridSearchCV</code>  to run a hyperparameter search over the generation parameters.</p>

<p>First we define a hyperparameter space and a <code class="language-plaintext highlighter-rouge">GridSearchCV</code>/<code class="language-plaintext highlighter-rouge">RandomizedSearchCV</code> object and then fit it.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Define your hyperparameter space here for the search
</span><span class="n">hyp_space</span> <span class="o">=</span> <span class="p">{</span>
    <span class="sh">'</span><span class="s">max_new_tokens</span><span class="sh">'</span> <span class="p">:</span> <span class="p">[</span><span class="mi">70</span><span class="p">],</span>
    <span class="sh">'</span><span class="s">generation_seed</span><span class="sh">'</span> <span class="p">:</span> <span class="p">[</span><span class="mi">42</span><span class="p">],</span>
    <span class="sh">'</span><span class="s">do_sample</span><span class="sh">'</span> <span class="p">:</span> <span class="p">[</span><span class="bp">True</span><span class="p">],</span>
    <span class="sh">'</span><span class="s">eos_token_id</span><span class="sh">'</span> <span class="p">:</span> <span class="p">[</span><span class="n">terminators</span><span class="p">],</span>

    <span class="sh">'</span><span class="s">temperature</span><span class="sh">'</span><span class="p">:</span> <span class="p">[</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">],</span>
    <span class="sh">'</span><span class="s">top_k</span><span class="sh">'</span><span class="p">:</span> <span class="p">[</span><span class="mi">50</span><span class="p">,</span> <span class="mi">60</span><span class="p">,</span> <span class="mi">70</span><span class="p">],</span>
    <span class="sh">'</span><span class="s">no_repeat_ngram_size</span><span class="sh">'</span><span class="p">:</span> <span class="p">[</span><span class="mi">0</span><span class="p">],</span>

<span class="p">}</span>

<span class="c1"># Pass in estimator &amp; scorer as you do with the scikit-learn API
</span><span class="n">clf</span> <span class="o">=</span> <span class="nc">GridSearchCV</span><span class="p">(</span>
    <span class="n">estimator</span> <span class="o">=</span> <span class="n">tuner_ob</span><span class="p">.</span><span class="n">estimator</span><span class="p">,</span>
    <span class="n">param_grid</span><span class="o">=</span><span class="n">hyp_space</span><span class="p">,</span>
    <span class="n">scoring</span> <span class="o">=</span> <span class="n">tuner_ob</span><span class="p">.</span><span class="n">scorer</span><span class="p">,</span>
    <span class="n">cv</span> <span class="o">=</span> <span class="mi">2</span><span class="p">,</span>
    <span class="n">n_jobs</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span> <span class="c1"># we will run this sequentially
</span>    <span class="n">verbose</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div>

<p>The fit will take time depending on the number of fits that are expected to run and the inference time per fit.</p>
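<p>The total number of fits is the number of candidate configurations times the number of CV folds. For the grid above that is 2 temperatures × 3 <code class="language-plaintext highlighter-rouge">top_k</code> values = 6 candidates, and with <code class="language-plaintext highlighter-rouge">cv = 2</code> that means 12 fits. A quick self-contained check (the <code class="language-plaintext highlighter-rouge">eos_token_id</code> entry is a placeholder list standing in for <code class="language-plaintext highlighter-rouge">terminators</code>):</p>

```python
from math import prod

# Mirrors the shape of the hyperparameter space defined above.
hyp_space = {
    "max_new_tokens": [70],
    "generation_seed": [42],
    "do_sample": [True],
    "eos_token_id": [[128001, 128009]],  # placeholder for `terminators`
    "temperature": [0.1, 0.2],
    "top_k": [50, 60, 70],
    "no_repeat_ngram_size": [0],
}

n_candidates = prod(len(v) for v in hyp_space.values())  # 1*1*1*1*2*3*1 = 6
cv = 2
total_fits = n_candidates * cv
print(n_candidates, total_fits)  # 6 candidates, 12 fits
```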

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># fit on the dataset
</span><span class="n">clf</span><span class="p">.</span><span class="nf">fit</span><span class="p">(</span><span class="n">X</span><span class="o">=</span><span class="n">tuner_ob</span><span class="p">.</span><span class="n">dataset</span><span class="p">[</span><span class="sh">"</span><span class="s">_X</span><span class="sh">"</span><span class="p">],</span> <span class="n">y</span><span class="o">=</span><span class="n">tuner_ob</span><span class="p">.</span><span class="n">dataset</span><span class="p">[</span><span class="sh">'</span><span class="s">_y</span><span class="sh">'</span><span class="p">])</span>
</code></pre></div></div>

<p>Once the model is fit, you can view the best generation parameters found by the search:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># print out the best parameters
</span><span class="nf">print</span><span class="p">(</span><span class="n">clf</span><span class="p">.</span><span class="n">best_params_</span><span class="p">)</span>
</code></pre></div></div>
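<p>Beyond <code class="language-plaintext highlighter-rouge">best_params_</code>, <code class="language-plaintext highlighter-rouge">GridSearchCV</code> also exposes <code class="language-plaintext highlighter-rouge">cv_results_</code>, a dict of parallel arrays that lets you rank every configuration, not just the winner. A sketch with made-up scores (the real dict has the same shape):</p>

```python
# Shaped like scikit-learn's cv_results_; the scores here are invented.
cv_results = {
    "params": [
        {"temperature": 0.1, "top_k": 50},
        {"temperature": 0.2, "top_k": 50},
        {"temperature": 0.1, "top_k": 70},
    ],
    "mean_test_score": [0.41, 0.38, 0.44],
}

# Rank candidates from best to worst mean score.
ranked = sorted(
    zip(cv_results["mean_test_score"], cv_results["params"]),
    key=lambda pair: pair[0],
    reverse=True,
)
best_score, best_params = ranked[0]
assert best_params == {"temperature": 0.1, "top_k": 70}
```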

<h3 id="evaluation-after-tuning">Evaluation After Tuning</h3>
<p>Once you have the best parameters, you can evaluate them on the full dataset using the <code class="language-plaintext highlighter-rouge">tuner_ob.get_score</code> method.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">scores</span><span class="p">,</span> <span class="n">outputs</span> <span class="o">=</span> <span class="n">tuner_ob</span><span class="p">.</span><span class="nf">get_score</span><span class="p">(</span><span class="n">clf</span><span class="p">.</span><span class="n">best_params_</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Scores - </span><span class="si">{</span><span class="n">scores</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="additional-utilities">Additional Utilities</h3>

<ul>
  <li>
    <p>Logging Utilities - You can set the logging level of the library using this module</p>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="kn">from</span> <span class="n">llmsearch.utils.logging_utils</span> <span class="kn">import</span> <span class="n">set_verbosity_info</span><span class="p">,</span> <span class="n">set_verbosity_warning</span><span class="p">,</span> <span class="n">set_verbosity_debug</span>

  <span class="c1"># set verbosity to debug, useful to debug model outputs
</span>  <span class="nf">set_verbosity_debug</span><span class="p">()</span>
</code></pre></div>    </div>

    <p>The <code class="language-plaintext highlighter-rouge">DEBUG</code> level is useful for seeing what is happening inside the library, for example the exact text passed in to the model and the output that comes back. Here’s an example:</p>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="c1"># Example Logs from the get score function - Calculate score on a different dataset
</span>  <span class="n">scores</span><span class="p">,</span> <span class="n">outputs</span> <span class="o">=</span> <span class="n">tuner_ob</span><span class="p">.</span><span class="nf">get_score</span><span class="p">(</span><span class="n">gen_params</span><span class="p">,</span> <span class="n">dataset</span> <span class="o">=</span> <span class="n">datasets</span><span class="p">.</span><span class="n">Dataset</span><span class="p">.</span><span class="nf">from_dict</span><span class="p">(</span><span class="n">sample_dataset</span><span class="p">[:</span><span class="mi">2</span><span class="p">]))</span>
</code></pre></div>    </div>
  </li>
  <li>
    <p>Output</p>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="mi">2024</span><span class="o">-</span><span class="mi">06</span><span class="o">-</span><span class="mi">05</span> <span class="mi">18</span><span class="p">:</span><span class="mi">19</span><span class="p">:</span><span class="mf">26.099</span> <span class="o">-</span> <span class="n">llmsearch</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">mem_utils</span><span class="p">:</span><span class="mi">154</span> <span class="o">-</span> <span class="n">INFO</span> <span class="o">-</span> <span class="n">Starting</span> <span class="n">inference</span> <span class="k">with</span> <span class="n">generation</span> <span class="n">parameters</span> <span class="o">-</span> <span class="p">{</span><span class="sh">'</span><span class="s">max_new_tokens</span><span class="sh">'</span><span class="p">:</span> <span class="mi">70</span><span class="p">,</span> <span class="sh">'</span><span class="s">generation_seed</span><span class="sh">'</span><span class="p">:</span> <span class="mi">42</span><span class="p">,</span> <span class="sh">'</span><span class="s">eos_token_id</span><span class="sh">'</span><span class="p">:</span> <span class="p">[</span><span class="mi">128001</span><span class="p">,</span> <span class="mi">128009</span><span class="p">]}</span>
  <span class="mi">2024</span><span class="o">-</span><span class="mi">06</span><span class="o">-</span><span class="mi">05</span> <span class="mi">18</span><span class="p">:</span><span class="mi">19</span><span class="p">:</span><span class="mf">26.101</span> <span class="o">-</span> <span class="n">llmsearch</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">mem_utils</span><span class="p">:</span><span class="mi">158</span> <span class="o">-</span> <span class="n">INFO</span> <span class="o">-</span> <span class="n">Performing</span> <span class="n">inference</span> <span class="k">with</span> <span class="n">batch_size</span> <span class="o">-</span> <span class="mi">2</span>
  <span class="mi">2024</span><span class="o">-</span><span class="mi">06</span><span class="o">-</span><span class="mi">05</span> <span class="mi">18</span><span class="p">:</span><span class="mi">19</span><span class="p">:</span><span class="mf">26.103</span> <span class="o">-</span> <span class="n">llmsearch</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">model_utils</span><span class="p">:</span><span class="mi">98</span> <span class="o">-</span> <span class="n">INFO</span> <span class="o">-</span> <span class="n">Detected</span> <span class="n">generation</span> <span class="nb">type</span> <span class="o">-</span> <span class="n">Greedy</span> <span class="n">Decoding</span>
  <span class="mi">2024</span><span class="o">-</span><span class="mi">06</span><span class="o">-</span><span class="mi">05</span> <span class="mi">18</span><span class="p">:</span><span class="mi">19</span><span class="p">:</span><span class="mf">29.759</span> <span class="o">-</span> <span class="n">llmsearch</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">model_utils</span><span class="p">:</span><span class="mi">149</span> <span class="o">-</span> <span class="n">DEBUG</span> <span class="o">-</span> <span class="n">Input</span> <span class="o">-</span> <span class="sh">'</span><span class="s">&lt;|begin_of_text|&gt;&lt;|start_header_id|&gt;system&lt;|end_header_id|&gt;</span><span class="se">\n\n</span><span class="s">You are a helpful AI assistant.&lt;|eot_id|&gt;&lt;|start_header_id|&gt;user&lt;|end_header_id|&gt;</span><span class="se">\n\n</span><span class="s">Summarize the following text in less than 50 words: Lucy: omg did you see JK this morning?</span><span class="se">\r\n</span><span class="s">Sue: I try to avoid it lol</span><span class="se">\r\n</span><span class="s">Lucy: you should have seen it it was disgusting</span><span class="se">\r\n</span><span class="s">Sue: I cant do it anymore i try to listen to the radio in the mornings.. 
jk makes you think the whole world is full of idiots lol</span><span class="se">\r\n</span><span class="s">Lucy: you may be right I dont know how some of them can go on there in public for the world to see</span><span class="se">\r\n</span><span class="s">Sue: I would die if I got a call to go on there lol</span><span class="se">\r\n</span><span class="s">Sue: could you imagine ha ha </span><span class="se">\r\n</span><span class="s">Lucy: I would piss myself If I saw you and Andy up there</span><span class="se">\r\n</span><span class="s">Sue: over my dead body !&lt;|eot_id|&gt;&lt;|start_header_id|&gt;assistant&lt;|end_header_id|&gt;</span><span class="se">\n\n</span><span class="sh">'</span>
  <span class="mi">2024</span><span class="o">-</span><span class="mi">06</span><span class="o">-</span><span class="mi">05</span> <span class="mi">18</span><span class="p">:</span><span class="mi">19</span><span class="p">:</span><span class="mf">29.763</span> <span class="o">-</span> <span class="n">llmsearch</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">model_utils</span><span class="p">:</span><span class="mi">150</span> <span class="o">-</span> <span class="n">DEBUG</span> <span class="o">-</span> <span class="n">Model</span> <span class="n">Output</span> <span class="o">-</span> <span class="sh">'</span><span class="s">The conversation is about a TV show </span><span class="sh">"</span><span class="s">JK</span><span class="sh">"</span><span class="s"> that Lucy and Sue dislike. They</span><span class="se">\'</span><span class="s">re making fun of the show</span><span class="se">\'</span><span class="s">s content and the people who appear on it, calling them </span><span class="sh">"</span><span class="s">idiots.</span><span class="sh">"</span><span class="s"> They</span><span class="se">\'</span><span class="s">re joking about how they wouldn</span><span class="se">\'</span><span class="s">t want to be on the show themselves.</span><span class="sh">'</span>
  <span class="mi">2024</span><span class="o">-</span><span class="mi">06</span><span class="o">-</span><span class="mi">05</span> <span class="mi">18</span><span class="p">:</span><span class="mi">19</span><span class="p">:</span><span class="mf">29.766</span> <span class="o">-</span> <span class="n">llmsearch</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">model_utils</span><span class="p">:</span><span class="mi">149</span> <span class="o">-</span> <span class="n">DEBUG</span> <span class="o">-</span> <span class="n">Input</span> <span class="o">-</span> <span class="sh">"</span><span class="s">&lt;|begin_of_text|&gt;&lt;|start_header_id|&gt;system&lt;|end_header_id|&gt;</span><span class="se">\n\n</span><span class="s">You are a helpful AI assistant.&lt;|eot_id|&gt;&lt;|start_header_id|&gt;user&lt;|end_header_id|&gt;</span><span class="se">\n\n</span><span class="s">Summarize the following text in less than 50 words: Wendy: What</span><span class="sh">'</span><span class="s">s up?</span><span class="se">\r\n</span><span class="s">Simon: Nothing much. I</span><span class="sh">'</span><span class="s">m painting my cupboards. </span><span class="se">\r\n</span><span class="s">Angela: Cool what colour?</span><span class="se">\r\n</span><span class="s">Simon: Green.</span><span class="se">\r\n</span><span class="s">Ben: I</span><span class="sh">'</span><span class="s">m just chilling in the garden. </span><span class="se">\r\n</span><span class="s">Angela: Nice weekend! I</span><span class="sh">'</span><span class="s">m about to meet Chris.</span><span class="se">\r\n</span><span class="s">Wendy: Say hello from me!</span><span class="se">\r\n</span><span class="s">Angela: Will do! And how is your weekend, Wendy?</span><span class="se">\r\n</span><span class="s">Wendy: Very lazy... The week was hard at work, I really needed some rest. 
</span><span class="se">\r\n</span><span class="s">Ben: We should all come and visit Simon in his new apartment!</span><span class="se">\r\n</span><span class="s">Simon: You are welcome, guys! Whenever you wish.</span><span class="se">\r\n</span><span class="s">Ben: I should be in Bournemouth next week. </span><span class="se">\r\n</span><span class="s">Simon: I</span><span class="sh">'</span><span class="s">m not going anywhere :-)</span><span class="se">\r\n</span><span class="s">Ben: Cool, I</span><span class="sh">'</span><span class="s">ll call you next week.&lt;|eot_id|&gt;&lt;|start_header_id|&gt;assistant&lt;|end_header_id|&gt;</span><span class="se">\n\n</span><span class="sh">"</span>
  <span class="mi">2024</span><span class="o">-</span><span class="mi">06</span><span class="o">-</span><span class="mi">05</span> <span class="mi">18</span><span class="p">:</span><span class="mi">19</span><span class="p">:</span><span class="mf">29.767</span> <span class="o">-</span> <span class="n">llmsearch</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">model_utils</span><span class="p">:</span><span class="mi">150</span> <span class="o">-</span> <span class="n">DEBUG</span> <span class="o">-</span> <span class="n">Model</span> <span class="n">Output</span> <span class="o">-</span> <span class="sh">"</span><span class="s">A group of friends chat about their weekends. Simon is painting his cupboards green, Angela is meeting Chris, and Ben is relaxing in the garden. They discuss visiting Simon</span><span class="sh">'</span><span class="s">s new apartment and make plans to catch up soon.</span><span class="sh">"</span>
  <span class="mi">2024</span><span class="o">-</span><span class="mi">06</span><span class="o">-</span><span class="mi">05</span> <span class="mi">18</span><span class="p">:</span><span class="mi">19</span><span class="p">:</span><span class="mf">30.159</span> <span class="o">-</span> <span class="n">llmsearch</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">mem_utils</span><span class="p">:</span><span class="mi">176</span> <span class="o">-</span> <span class="n">DEBUG</span> <span class="o">-</span> <span class="n">Setting</span> <span class="n">batch_size</span> <span class="n">cache</span> <span class="n">value</span> <span class="o">-</span> <span class="mi">2</span> <span class="k">for</span> <span class="n">this</span> <span class="n">particular</span> <span class="n">configuration</span>
  <span class="mi">2024</span><span class="o">-</span><span class="mi">06</span><span class="o">-</span><span class="mi">05</span> <span class="mi">18</span><span class="p">:</span><span class="mi">19</span><span class="p">:</span><span class="mf">30.161</span> <span class="o">-</span> <span class="n">llmsearch</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">mem_utils</span><span class="p">:</span><span class="mi">188</span> <span class="o">-</span> <span class="n">INFO</span> <span class="o">-</span> <span class="n">Finished</span> <span class="n">running</span> <span class="n">inference</span><span class="p">,</span> <span class="n">took</span> <span class="mf">4.057762</span> <span class="n">secs</span>
</code></pre></div>    </div>
  </li>
  <li>
    <p>Multi Token Stopping Criteria - There could be a use-case where you want to stop your generation at a specific token other than the <code class="language-plaintext highlighter-rouge">eos_token</code>, or when a certain sequence of tokens occurs in the generated output. You can use the <code class="language-plaintext highlighter-rouge">MultiTokenStoppingCriteria</code> available in llmsearch.</p>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="kn">from</span> <span class="n">transformers</span> <span class="kn">import</span> <span class="n">StoppingCriteriaList</span>
  <span class="kn">from</span> <span class="n">llmsearch.scripts.stopping_criteria</span> <span class="kn">import</span> <span class="n">MultiTokenStoppingCriteria</span>

  <span class="c1"># specify what sequence to stop the generation on
</span>  <span class="n">multi_token_stop_criteria_ob</span> <span class="o">=</span> <span class="nc">MultiTokenStoppingCriteria</span><span class="p">(</span><span class="n">sequence_ids</span><span class="o">=</span><span class="p">[</span><span class="mi">32000</span><span class="p">])</span>
  <span class="n">stopping_criteria</span> <span class="o">=</span> <span class="nc">StoppingCriteriaList</span><span class="p">([</span><span class="n">multi_token_stop_criteria_ob</span><span class="p">])</span>
  <span class="n">callbacks_after_inference</span> <span class="o">=</span> <span class="p">[</span><span class="n">multi_token_stop_criteria_ob</span><span class="p">.</span><span class="n">reset</span><span class="p">]</span>

  <span class="n">tuner_ob</span> <span class="o">=</span> <span class="nc">Tuner</span><span class="p">(</span>
  		<span class="bp">...</span>
  		<span class="n">callbacks_after_inference</span><span class="o">=</span><span class="n">callbacks_after_inference</span><span class="p">,</span>
  		<span class="bp">...</span>
  <span class="p">)</span>
</code></pre></div>    </div>

    <p><code class="language-plaintext highlighter-rouge">MultiTokenStoppingCriteria</code> can operate on batches of input. It maintains state for each batch that passes through it, which tells it where to look and which sequences in the batch have already finished. This state is cleared after each inference run via the callback.</p>
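    <p>The core of the matching can be sketched in a few lines (an illustrative re-implementation, not llmsearch’s actual code): for each sequence in the batch, check whether the generated ids now end with the stop sequence.</p>

```python
def ends_with(seq, stop_ids):
    # True when `seq` (a list of token ids) ends with `stop_ids`.
    n = len(stop_ids)
    return len(seq) >= n and seq[-n:] == stop_ids

# Batch of two sequences; only the first has produced the stop sequence.
stop_ids = [7, 9]
batch = [[11, 42, 13, 7, 9], [11, 42, 13, 7, 8]]
finished = [ends_with(seq, stop_ids) for seq in batch]
assert finished == [True, False]
```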
  </li>
</ul>

<h3 id="conclusion-️">Conclusion ☕️</h3>

<p>In this blog you saw how to use <code class="language-plaintext highlighter-rouge">llmsearch</code> to run a hyperparameter search over generation parameters using <code class="language-plaintext highlighter-rouge">scikit-learn</code>. I would love to hear what the community does with it; if you have any feedback, do not hesitate to reach out. <code class="language-plaintext highlighter-rouge">llmsearch</code> has multiple improvements planned as part of v1.0.0. Stay tuned!</p>

<p><a href="https://github.com/Praful932/llmsearch"><img src="https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&amp;logo=github&amp;logoColor=white" alt="GitHub" /></a>
<a href="https://llmsearch.netlify.app"><img src="https://img.shields.io/badge/netlify-%23000000.svg?style=for-the-badge&amp;logo=netlify&amp;logoColor=#00C7B7" alt="Netlify" /></a></p>]]></content><author><name>Praful Mohanan</name></author><category term="llm" /><category term="code" /><summary type="html"><![CDATA[Find better generation parameters for your LLMs 🦾]]></summary></entry><entry><title type="html">Understanding the F1 Score metric for evaluating Grammar Error Correction Systems</title><link href="https://praful932.dev/blog-3-f1-score-gec/" rel="alternate" type="text/html" title="Understanding the F1 Score metric for evaluating Grammar Error Correction Systems" /><published>2023-03-05T00:00:00+00:00</published><updated>2023-03-05T00:00:00+00:00</updated><id>https://praful932.dev/blog-3-f1-score-gec</id><content type="html" xml:base="https://praful932.dev/blog-3-f1-score-gec/"><![CDATA[<p><strong>Grammar Error Correction</strong> (GEC) in NLP is the task of making erroneous/grammatically incorrect sentences correct by performing a certain set of <em>operations</em> on the corrupted sentence.</p>

<p align="center">
<img src="/assets/images/blog-3-f1-score-gec/gec_system_eg.png" alt="Classic GEC System" style="width:200px;" />
</p>
<p style="text-align: center; font-size: 15px;">
    <em>Classic GEC System</em>
</p>

<p>These <em>operations</em> can be:</p>
<ol>
  <li>Replacement - You may replace a corrupted word with a corrected version of it.</li>
  <li>Insertion - You may insert a missing word in the sentence.</li>
  <li>Deletion - You may delete an unwanted word from the sentence.</li>
</ol>
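<p>To make these operations concrete, here is a small sketch of applying them to a tokenized sentence. The <code class="language-plaintext highlighter-rouge">apply_edits</code> helper and its <code class="language-plaintext highlighter-rouge">(start, end, replacement)</code> edit format are illustrative assumptions for this example, not part of any standard GEC toolkit:</p>

```python
# Illustrative sketch of the three GEC operations; apply_edits and the
# (start, end, replacement) edit format are assumptions for this example,
# not part of any standard GEC toolkit.

def apply_edits(tokens, edits):
    """Apply (start, end, replacement) edits to a token list.

    start == end inserts, replacement == "" deletes, otherwise it
    replaces tokens[start:end].
    """
    # Apply edits right-to-left so earlier token offsets stay valid
    for start, end, replacement in sorted(edits, reverse=True):
        tokens[start:end] = replacement.split() if replacement else []
    return tokens

sentence = "This are a a sentence".split()
edits = [
    (1, 2, "is"),  # Replacement: "are" -> "is"
    (2, 3, ""),    # Deletion: drop the duplicated "a"
    (5, 5, "."),   # Insertion: add the missing final period
]
print(" ".join(apply_edits(sentence, edits)))  # → This is a sentence .
```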

<p><strong>F1 Score</strong> is a metric that is generally used to measure the performance of binary classification models. In this article, we will understand how the F1 Score metric can be used to evaluate GEC systems as well. Some terminology:</p>

<ul>
  <li>Input/Corrupted Sentence - This is the corrupted sentence that we want to correct.</li>
  <li>Ground Truth - This is the corrected version of the Input Sentence.</li>
  <li>Hypothesis/Model Prediction  - This is the sentence that your model predicted.</li>
</ul>

<p>Now you want to measure your model’s performance w.r.t. the Ground Truth that you have. Before we jump to the metric, we need to understand what the M2 format is and how it relates to the F1 metric for GEC.</p>

<h3 id="the-m2-format">The M2 Format</h3>

<p>This is a standard data format that is used in GEC tasks. Any annotation/model prediction of GEC can be expressed in this format, which has a corrupted sentence and the corrected version of it in terms of annotations/edits.</p>

<p>The below illustration explains what different parts of the format mean:</p>

<p align="center">
<img src="/assets/images/blog-3-f1-score-gec/m2_format.png" alt="Black formatting" style="width:900px;" />
</p>
<p style="text-align: center; font-size: 15px;">
    <em>M2 Format</em>
</p>

<p><strong>S</strong> - denotes the Source Sentence<br />
<strong>A</strong> - denotes Annotations/Edits; these can be model predictions as well<br />
A sentence can have more than one annotation -&gt; more than one possible way to correct it.<br />
There can be more than one edit in an annotation -&gt; a correction with more than a single edit.</p>

<hr />

<p>A few examples of the M2 Format:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>S This are a sentence .
A 1 2|||R:VERB:SVA|||is|||-REQUIRED-|||NONE|||0
</code></pre></div></div>

<ul>
  <li>Original sentence - This are a sentence.</li>
  <li>Annotations (1 Annotation)
    <ul>
      <li>This <strong>is</strong> a sentence.</li>
    </ul>
  </li>
  <li>The original sentence here is <code class="language-plaintext highlighter-rouge">This are a sentence</code>.</li>
  <li>The corrected version is <code class="language-plaintext highlighter-rouge">This is a sentence .</code>
where <code class="language-plaintext highlighter-rouge">are</code> (at token offset <code class="language-plaintext highlighter-rouge">[1:2]</code>) is replaced by <code class="language-plaintext highlighter-rouge">is</code>.</li>
</ul>

<hr />

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>S A dog over the wall .
A 2 2|||M:ADP|||jumped|||-REQUIRED-|||NONE|||0
A 1 2|||R:ADP|||cat|||-REQUIRED-|||NONE|||1
A 2 2|||M:ADP|||jumped|||-REQUIRED-|||NONE|||1
</code></pre></div></div>

<ul>
  <li>Original Sentence - A dog over the wall</li>
  <li>Annotations (2 Annotations, 1st with 1 edit, 2nd with 2 edits)
    <ul>
      <li>A dog <strong>jumped</strong> over the wall.</li>
      <li>A <strong>cat jumped</strong> over the wall.</li>
    </ul>
  </li>
</ul>

<hr />

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>S The boys played a game .
A 1 2 |||R:NOUN|||girls|||-REQUIRED-|||NONE|||0
A -1 -1|||noop|||-NONE-|||-REQUIRED-|||NONE|||1
</code></pre></div></div>

<ul>
  <li>Original Sentence - The boys played a game.</li>
  <li>Annotations (2 Annotations, 1st with 1 edit, 2nd with noop edit)
    <ul>
      <li>The <code class="language-plaintext highlighter-rouge">girls</code> played a game.</li>
      <li>None (<em>Here this means that the sentence is correct so Annotator/Model 1 annotated it as noop/having no annotation.</em>)</li>
    </ul>
  </li>
</ul>
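<p>As a rough illustration of how an M2 annotation line can be read programmatically, here is a small parser sketch. <code class="language-plaintext highlighter-rouge">parse_m2_line</code> is a hypothetical helper written for this post; tools like ERRANT ship their own, more complete parsers:</p>

```python
# Hypothetical sketch of reading one M2 annotation line; real tools such as
# ERRANT ship their own, more complete parsers.

def parse_m2_line(line):
    """Split an 'A ...' annotation line into its |||-separated fields."""
    span, error_type, correction, _required, _comment, annotator = line[2:].split("|||")
    start, end = (int(x) for x in span.split())
    return {
        "start": start,            # token offset where the edit begins
        "end": end,                # token offset where the edit ends
        "type": error_type,        # e.g. R: replacement, M: missing word
        "correction": correction,  # the suggested corrected text
        "annotator": int(annotator),
    }

edit = parse_m2_line("A 1 2|||R:VERB:SVA|||is|||-REQUIRED-|||NONE|||0")
print(edit["start"], edit["end"], edit["correction"])  # → 1 2 is
```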

<hr />

<p><a href="https://github.com/chrisjbryant/errant">ERRANT</a> (Error Annotation Toolkit) is one of the tools that you can use to get your output into the M2 format. While evaluating your model, you will have <strong>2</strong> M2 format annotations.</p>

<ul>
  <li><em>Ground Truth M2</em> - an M2 Format Annotation between the Corrupted Sentence and the Ground Truth.</li>
  <li><em>Hypothesis M2</em> - an M2 Format Annotation between the Corrupted Sentence and the Hypothesis.</li>
</ul>

<p>Now that you have understood the M2 format, let’s take an example to see how we can calculate the metrics.</p>

<ul>
  <li>Corrupted Sentence - I am not play game .</li>
  <li>Ground Truth - I am not playing games .</li>
  <li>Hypothesis - I am not playing game .</li>
</ul>

<p>Respective M2s (just the annotations):</p>

<ol>
  <li>
    <p>Ground Truth M2</p>

    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> A 3 4 |||R:VERB|||playing|||-REQUIRED-|||NONE|||0
 A 4 5 |||R:NOUN|||games|||-REQUIRED-|||NONE|||0
</code></pre></div>    </div>
  </li>
  <li>
    <p>Hypothesis M2</p>

    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> A 3 4 |||R:VERB|||playing|||-REQUIRED-|||NONE|||0
</code></pre></div>    </div>
  </li>
</ol>

<h3 id="calculating-the-metrics">Calculating the Metrics</h3>

<p>Once we have the Ground Truth &amp; Hypothesis M2s, we can calculate the metric. If you look at evaluation results in Grammar Error Correction, you will notice that the F0.5 metric is often reported. This is the F-Beta score with Beta=0.5 instead of Beta=1 (the regular F1). The lower Beta is, the more you weigh Precision over Recall.</p>

<p align="center">
<img src="/assets/images/blog-3-f1-score-gec/f_beta.png" alt="F-Beta Formula" style="width:400px;" />
</p>
<p style="text-align: center;">
    <em>F-Beta Formula</em>
</p>
<p>In GEC, introducing a false positive (an incorrect correction) is considered worse than missing an error, <em>which is why we give more weight to precision than recall</em>.<br />
Here’s some pseudocode that explains how we calculate the <strong>F0.5</strong> score from the M2s that we have.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># We treat each edit in the ground truth &amp; hypothesis M2 as a category that we want to predict

# Initialize these to 0; there's no TN because we do not care about "non-errors"
tp, fp, fn = 0, 0, 0
beta = 0.5

# For each m2_edit in hypothesis_M2
for m2_edit in hypothesis_M2:
    # Skip noop (no-change) edits; we don't include them in the metric calculation
    if m2_edit == noop_edit:
        continue
    # It's a True Positive if the exact same edit is present in ground_truth_M2
    # Otherwise it's an FP (the edit that the model suggested is incorrect)
    if m2_edit in ground_truth_M2:
        tp += 1
    else:
        fp += 1

# For each m2_edit in ground_truth_M2
for m2_edit in ground_truth_M2:
    # Skip noop (no-change) edits; we don't include them in the metric calculation
    if m2_edit == noop_edit:
        continue
    # Edits that were supposed to be predicted but weren't go in the False Negatives bucket
    if m2_edit not in hypothesis_M2:
        fn += 1

# Now you can calculate the metrics easily
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
# Calculate F-0.5
f_05_score = ((1 + beta**2) * precision * recall) / ((beta**2 * precision) + recall) if precision + recall else 0.0
</code></pre></div></div>
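<p>Plugging the worked example above into this procedure gives the following numbers. This is a minimal sketch where each edit is represented as a plain <code class="language-plaintext highlighter-rouge">(start, end, correction)</code> tuple rather than a full M2 object:</p>

```python
# Minimal sketch: computing F0.5 for the worked example above, with each edit
# represented as a (start, end, correction) tuple rather than a full M2 object.

ground_truth_m2 = {(3, 4, "playing"), (4, 5, "games")}
hypothesis_m2 = {(3, 4, "playing")}

tp = len(hypothesis_m2 & ground_truth_m2)   # edits the model got right
fp = len(hypothesis_m2 - ground_truth_m2)   # edits the model invented
fn = len(ground_truth_m2 - hypothesis_m2)   # edits the model missed

beta = 0.5
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f_05 = ((1 + beta**2) * precision * recall) / ((beta**2 * precision) + recall) \
    if precision + recall else 0.0

print(precision, recall, round(f_05, 3))  # → 1.0 0.5 0.833
```

Note how the single missed edit (<code class="language-plaintext highlighter-rouge">games</code>) halves recall but, because Beta=0.5, the score only drops to about 0.83.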

<h3 id="in-this-article-️">In this article ☕️</h3>

<ul>
  <li>You understood what the M2 format is and how it is used in GEC</li>
  <li>Got to know how the F1 metric is applied to problems like GEC and not just vanilla classification problems.</li>
</ul>

<h3 id="references">References</h3>

<ul>
  <li><a href="https://github.com/chrisjbryant/errant">ERRANT repository</a></li>
  <li><a href="https://www.cl.cam.ac.uk/research/nl/bea2019st/">BEA-2019 Shared Task on GEC</a></li>
</ul>]]></content><author><name>Praful Mohanan</name></author><category term="nlp" /><category term="ml" /><summary type="html"><![CDATA[Primer on GEC and its Evaluation 🛠️]]></summary></entry><entry><title type="html">Using pre-commit hooks to write better code</title><link href="https://praful932.dev/blog-2-pre-commit-hooks/" rel="alternate" type="text/html" title="Using pre-commit hooks to write better code" /><published>2023-01-08T00:00:00+00:00</published><updated>2023-01-08T00:00:00+00:00</updated><id>https://praful932.dev/blog-2-pre-commit-hooks</id><content type="html" xml:base="https://praful932.dev/blog-2-pre-commit-hooks/"><![CDATA[<p>Pre-commit hooks are scripts that run before you commit your code to the codebase.
These hooks can be, for instance, <em>autoformatters</em> - which format &amp; make your code pretty ✨ according to a defined standard; <em>linters</em> - which point out mistakes in your code; or even your very own custom code/unit-test scripts - all of which run every time you run a <code class="language-plaintext highlighter-rouge">git commit</code> command.</p>

<p>These <strong>scripts/hooks</strong> (<em>I’ll use the term hook for consistency’s sake throughout the article</em>) are set up and run in an isolated manner (except for local hooks; more on this later) by the <code class="language-plaintext highlighter-rouge">pre-commit</code> package. So a hook written in another language can be set up and run as well, independent of the development environment. In the context of pre-commit, these hooks are mainly git repositories that expose an executable.
The advantages of having these packages all packed up into the <strong>pre-commit</strong> ecosystem are:</p>
<ul>
  <li>having a <em>single file(the pre-commit config file)</em> which manages the configuration for all of your <strong>hooks</strong>.</li>
  <li>Letting pre-commit itself handle the setup for such hooks; for example, a hook made for some programming language may not itself be written in that language, which may require additional effort to set it up.</li>
</ul>

<p>pre-commit can be installed via <code class="language-plaintext highlighter-rouge">pip</code>, <code class="language-plaintext highlighter-rouge">brew</code> or <code class="language-plaintext highlighter-rouge">conda</code>. Using <code class="language-plaintext highlighter-rouge">pip</code>, the command would be</p>

<p><code class="language-plaintext highlighter-rouge">pip install pre-commit</code></p>

<h2 id="the-pre-commit-config-file-">The pre-commit config file 📃</h2>

<p>Post installation, you will need to set up the config file. Once you have the config file set up, all you need to do is run <code class="language-plaintext highlighter-rouge">pre-commit run</code> to let it do its magic 🪄.
The file which manages the configuration of all your hooks is the <code class="language-plaintext highlighter-rouge">.pre-commit-config.yaml</code> file. The configuration file follows the YAML syntax. There can be more than one hook associated with a pre-commit configuration file. This file describes which hooks the project will be using.</p>

<p>This config file has a total of <strong>3</strong> levels of configuration. This is how a pre-commit config file is structured:</p>

<p align="center">
<img src="/assets/images/blog-2-pre-commit-hooks/pre-commit-config-file-structure.png" alt="Pre-commit config file structure" style="width:700px;" />
</p>
<p style="text-align: center;">
    <em>pre-commit config file structure</em>
</p>

<p><strong>Top level configuration</strong><br />
These are the global-level configurations that apply to your whole pre-commit setup. These settings mainly revolve around the set of files that you want to run pre-commit on and a few knobs on how pre-commit behaves.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">###### Top-level configuration</span>
<span class="na">exclude </span><span class="pi">:</span> <span class="s">^wip</span>                  <span class="c1"># Exclude files from the pre-commit checks which match this pattern</span>
<span class="na">files </span><span class="pi">:</span> <span class="s">\.py$</span>                   <span class="c1"># Only run pre-commit checks on this particular file pattern</span>
<span class="na">fail_fast</span><span class="pi">:</span> <span class="kc">false</span>                <span class="c1"># If true, stops the run at the first failing hook without executing subsequent hooks</span>
<span class="c1">######</span>
<span class="na">repos</span><span class="pi">:</span>
  <span class="s">....</span>
</code></pre></div></div>
<p><br />
<strong>Repo level configuration</strong><br />
This configuration tells pre-commit where (i.e. which repo) to look for the code of the hooks that it will run on the codebase. You define a set of repos that pre-commit will use to set up the hooks. As mentioned earlier, pre-commit hooks are set up and run in an isolated manner. It is certainly possible that you need to run a custom hook (e.g. unit tests, dynamic checks) which is directly/indirectly dependent on the state of the codebase (through the virtual environment, build output, etc.). Setting <code class="language-plaintext highlighter-rouge">repo</code> to <code class="language-plaintext highlighter-rouge">local</code> is a decent hack to achieve this (we will look into this in depth soon).</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">....</span>
<span class="na">fail_fast</span><span class="pi">:</span> <span class="kc">false</span>
<span class="c1">###### Repo-level configuration</span>
<span class="na">repos</span><span class="pi">:</span>                          <span class="c1"># List of repos that contain the hooks</span>
<span class="pi">-</span> <span class="na">repo</span><span class="pi">:</span> <span class="s1">'</span><span class="s">'</span>                      <span class="c1"># Repository URL</span>
  <span class="na">rev</span><span class="pi">:</span> <span class="s">1.0.0</span>
  <span class="na">hooks</span><span class="pi">:</span>                        <span class="c1"># Hooks that we want from the repository (There could be more than one hook in a repo)</span>
    <span class="s">.....</span>
<span class="pi">-</span> <span class="na">repo</span><span class="pi">:</span> <span class="s">local</span>                   <span class="c1"># Local hook</span>
  <span class="na">hooks</span><span class="pi">:</span>
    <span class="s">.....</span>
<span class="c1">######</span>
</code></pre></div></div>
<p><br />
<strong>Hook level configuration</strong><br />
This is where the magic happens, for each of the repo configurations, you’ll define which hooks you want from the repository and the additional parameters that the hook needs.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">....</span>
<span class="pi">-</span> <span class="na">repo</span><span class="pi">:</span> <span class="s1">'</span><span class="s">'</span>                      <span class="c1"># Repository URL</span>
  <span class="na">rev</span><span class="pi">:</span> <span class="s">2.0.0</span>
<span class="c1">######  Hook level configuration</span>
  <span class="na">hooks</span><span class="pi">:</span>                        <span class="c1"># List of hooks to use from the repository</span>
  <span class="pi">-</span>   <span class="na">id</span><span class="pi">:</span> <span class="s">hook2</span>                 <span class="c1"># ID of the hook to use from the repository</span>
      <span class="na">name</span><span class="pi">:</span> <span class="s">hook2-py</span>            <span class="c1"># Name to be shown during hook execution</span>
<span class="c1">######</span>

<span class="pi">-</span> <span class="na">repo</span><span class="pi">:</span> <span class="s">local</span>
<span class="c1">###### Hook-level configuration</span>
  <span class="na">hooks</span><span class="pi">:</span>
  <span class="pi">-</span>   <span class="na">id</span><span class="pi">:</span> <span class="s">my-local-script</span>       <span class="c1"># Random ID for the hook</span>
      <span class="na">name</span><span class="pi">:</span> <span class="s">my-local-script</span>     <span class="c1"># Name to be shown during execution</span>
      <span class="na">entry</span><span class="pi">:</span> <span class="s">python tests.py</span>    <span class="c1"># executable to run the hook</span>
      <span class="na">language</span><span class="pi">:</span> <span class="s">python</span>          <span class="c1"># how to install the hook, could be python, ruby, dart depending upon the nature of the hook</span>
      <span class="na">files </span><span class="pi">:</span> <span class="s">\.py$</span>             <span class="c1"># files to run on</span>
<span class="c1">######</span>
</code></pre></div></div>

<p>Every pre-commit hook (except <code class="language-plaintext highlighter-rouge">repo : local</code> ones) should have an <code class="language-plaintext highlighter-rouge">id</code> attribute; this is what pre-commit uses to determine which hook to use, and it can be found via the <a href="https://github.com/asottile/pyupgrade/blob/97ed6fb3cf2e650d4f762ba231c3f04c41797710/.pre-commit-hooks.yaml#L1">.pre-commit-hooks.yaml</a> file of the respective <code class="language-plaintext highlighter-rouge">repo</code>.</p>

<p>Every hook of a local repo(<code class="language-plaintext highlighter-rouge">repo : local</code>) should have the following attributes:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">id</code> : For a local hook this can be any valid string</li>
  <li><code class="language-plaintext highlighter-rouge">name</code> : Hook name shown during execution</li>
  <li><code class="language-plaintext highlighter-rouge">language</code> : This tells pre-commit how to install the hook; keeping this as <code class="language-plaintext highlighter-rouge">system</code> will not create any isolated environment for this hook and will use the project’s environment instead. <em>This also means that local hooks should have their dependencies as part of the project itself.</em></li>
  <li><code class="language-plaintext highlighter-rouge">entry</code>  : Tells pre-commit which executable to run for the hook; it could be a python script or even something like <code class="language-plaintext highlighter-rouge">pytest tests/test_db.py</code></li>
  <li><code class="language-plaintext highlighter-rouge">files</code> : Pattern of files to run on</li>
</ul>
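<p>Putting the three levels together, a minimal complete config file might look like this. The hook versions, file patterns and the local <code class="language-plaintext highlighter-rouge">pytest</code> hook here are illustrative examples only:</p>

```yaml
# Illustrative .pre-commit-config.yaml; versions and paths are examples only
fail_fast: false
files: \.py$
repos:
- repo: https://github.com/ambv/black
  rev: 22.3.0
  hooks:
  - id: black
    name: black-py
- repo: local
  hooks:
  - id: run-unit-tests
    name: run-unit-tests
    entry: pytest tests/
    language: system
    files: \.py$
```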

<h2 id="tidy-up-your-code">Tidy up your code</h2>

<p>Now that we have looked at the different components of the config file, we’ll look at three of the hooks that I have found useful and how we can use them to tidy up our code</p>

<ul>
  <li>black</li>
  <li>pyupgrade</li>
  <li>pylint</li>
</ul>

<p>All of these are individual python packages that can be installed (<code class="language-plaintext highlighter-rouge">pip install pkg_name</code>) and used separately via their command-line options.
For demonstration, we’ll go through each of the packages and then look at a pre-commit config file that encompasses all of them in one, to avoid the need to run them via the command line.</p>

<h3 id="black">Black</h3>

<p>Black is an automatic code formatting tool for python files. It aims at standardizing the code style for python syntax so that diffs are smaller and code is easier to read and review. Black uses <a href="https://eli.thegreenplace.net/2009/02/16/abstract-vs-concrete-syntax-trees/">concrete syntax trees</a> internally to parse and format the code. The style that Black uses is a strict subset of PEP 8 with a few knobs to turn.</p>

<p>Here is an example of how black formats code</p>
<p align="center">
<img src="/assets/images/blog-2-pre-commit-hooks/black-formatting-example.png" alt="Black formatting" style="width:800px;" />
</p>
<p style="text-align: center;">
    <em>Before Black (Left), After Black formatting (Right)</em>
</p>

<p>You’ll notice how the code got auto-formatted to a uniform structure. This particularly helps in MR review, so the reviewer’s sole focus is on just what changed, not stray commas, newlines and whitespace.
It can be used like so:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">repos</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">repo</span><span class="pi">:</span> <span class="s">https://github.com/ambv/black</span>       <span class="c1"># Repo URL</span>
  <span class="na">rev</span><span class="pi">:</span> <span class="s">22.3.0</span>                               <span class="c1"># Version</span>
  <span class="na">hooks</span><span class="pi">:</span>                                    <span class="c1"># Hooks</span>
    <span class="pi">-</span> <span class="na">id</span><span class="pi">:</span> <span class="s">black</span>                             <span class="c1"># ID of the hook</span>
      <span class="na">name</span><span class="pi">:</span> <span class="s">black-py</span>                        <span class="c1"># Name to display</span>
</code></pre></div></div>

<h3 id="pyupgrade">Pyupgrade</h3>

<p>This is a small &amp; sweet hook that automatically converts syntax to newer versions of the python language.</p>

<p>Few examples:</p>
<ul>
  <li>Dict comprehension
    <ul>
      <li><code class="language-plaintext highlighter-rouge">dict((a, b) for a, b in y)</code> → <code class="language-plaintext highlighter-rouge">{a: b for a, b in y}</code></li>
    </ul>
  </li>
  <li>Set Literals
    <ul>
      <li><code class="language-plaintext highlighter-rouge">set(x for x in y)</code> → <code class="language-plaintext highlighter-rouge">{x for x in y}</code></li>
    </ul>
  </li>
  <li>
    <p>Super Class call</p>

    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> class C(Base):
     def f(self):
-        super(C, self).f()
+        super().f()
</code></pre></div>    </div>
  </li>
</ul>

<p>This hook helps take care of some of the breaking changes in the python API.
It can be used like so:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="pi">-</span> <span class="na">repo</span><span class="pi">:</span> <span class="s">https://github.com/asottile/pyupgrade</span>
  <span class="na">rev</span><span class="pi">:</span> <span class="s">v2.32.0</span>
  <span class="na">hooks</span><span class="pi">:</span>
  <span class="pi">-</span>   <span class="na">id</span><span class="pi">:</span> <span class="s">pyupgrade</span>
      <span class="na">name</span><span class="pi">:</span> <span class="s">pyupgrade-py</span>
</code></pre></div></div>

<h3 id="pylint">Pylint</h3>

<p>This is my favorite; it’s not just a linter but also a static code analyzer. Static code analyzers are tools that check your code without actually executing it.</p>

<p>Pylint has several built-in components which make it powerful enough to even infer actual values from code. After analyzing the code, pylint outputs messages (of 5 types) to inform you how the code can be made better. These 5 types are:</p>

<ol>
  <li><strong>(C)</strong> Convention, for programming standard violation</li>
  <li><strong>(R)</strong> Refactor, for bad code smell</li>
  <li><strong>(W)</strong> Warning, for python specific problems</li>
  <li><strong>(E)</strong> Error, for probable bugs in the code</li>
  <li><strong>(F)</strong> Fatal, if an error occurred which prevented pylint from doing further processing.</li>
</ol>

<p>Let’s look at how pylint does on a sample snippet of python code</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="sh">"""</span><span class="s">script.py</span><span class="sh">"""</span>
<span class="kn">import</span> <span class="n">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">MapFeature</span><span class="p">(</span><span class="n">X1</span><span class="p">,</span> <span class="n">X2</span><span class="p">):</span>
    <span class="n">degree</span> <span class="o">=</span> <span class="mi">6</span>
    <span class="n">out</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nf">ones</span><span class="p">((</span><span class="n">m</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">degree</span> <span class="o">+</span> <span class="mi">1</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">):</span>
            <span class="n">out</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nf">hstack</span><span class="p">(</span>
                <span class="p">(</span><span class="n">out</span><span class="p">,</span> <span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="nf">power</span><span class="p">(</span><span class="n">X1</span><span class="p">,</span> <span class="n">i</span> <span class="o">-</span> <span class="n">j</span><span class="p">)</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="nf">power</span><span class="p">(</span><span class="n">X2</span><span class="p">,</span> <span class="n">j</span><span class="p">))[:,</span> <span class="n">np</span><span class="p">.</span><span class="n">newaxis</span><span class="p">])</span>
            <span class="p">)</span>
    <span class="k">if</span> <span class="n">out</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">out</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="mi">0</span>
    <span class="k">return</span> <span class="n">out</span>

<span class="k">def</span> <span class="nf">get_dict_sum</span><span class="p">():</span>
    <span class="n">data</span> <span class="o">=</span> <span class="p">{</span><span class="sh">"</span><span class="s">a</span><span class="sh">"</span><span class="p">:</span> <span class="mi">10</span><span class="p">,</span> <span class="sh">"</span><span class="s">b</span><span class="sh">"</span><span class="p">:</span> <span class="mi">20</span><span class="p">,</span> <span class="sh">"</span><span class="s">c</span><span class="sh">"</span><span class="p">:</span> <span class="mi">30</span><span class="p">}</span>
    <span class="n">res</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">data</span><span class="p">:</span>
        <span class="n">res</span> <span class="o">+=</span> <span class="n">v</span>

<span class="n">res</span> <span class="o">=</span> <span class="nf">get_dict_sum</span><span class="p">()</span>
</code></pre></div></div>

<p>This is the output that pylint provides when run (via the command line: <code class="language-plaintext highlighter-rouge">pylint script.py</code>) on the above snippet of code</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>************* Module script
script.py:1:0: C0114: Missing module docstring (missing-module-docstring)
script.py:4:0: C0116: Missing function or method docstring (missing-function-docstring)
script.py:4:0: C0103: Function name "MapFeature" doesn't conform to snake_case naming style (invalid-name)
script.py:4:15: C0103: Argument name "X1" doesn't conform to snake_case naming style (invalid-name)
script.py:4:19: C0103: Argument name "X2" doesn't conform to snake_case naming style (invalid-name)
script.py:6:19: E0602: Undefined variable 'm' (undefined-variable)
script.py:12:4: R1705: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it (no-else-return)
script.py:19:0: C0116: Missing function or method docstring (missing-function-docstring)
script.py:22:4: E1141: Unpacking a dictionary in iteration without calling .items() (dict-iter-missing-items)
script.py:22:11: C0103: Variable name "v" doesn't conform to snake_case naming style (invalid-name)
script.py:22:8: W0612: Unused variable 'k' (unused-variable)
script.py:26:0: E1111: Assigning result of a function call, where the function has no return (assignment-from-no-return)
script.py:26:0: C0103: Constant name "res" doesn't conform to UPPER_CASE naming style (invalid-name)

------------------------------------------------------------------
Your code has been rated at 0.00/10 (previous run: 0.00/10, +0.00)
</code></pre></div></div>

<p>The output of pylint is structured in a specific format where each line points to a specific <strong>message code</strong> (one of 5 types). The example below shows a message of type <strong>Warning</strong> (W).</p>

<p align="center">
<img src="/assets/images/blog-2-pre-commit-hooks/pylint-message-structure.png" alt="Pylint message structure" style="width:800px;" />
</p>
<p style="text-align: center;">
    <em>Pylint Message Structure</em>
</p>

<p>You can view in-depth detail of the message code by running:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>pylint <span class="nt">--help-msg</span><span class="o">=</span>W0612
:unused-variable <span class="o">(</span>W0612<span class="o">)</span>: <span class="k">*</span>Unused variable %r<span class="k">*</span>
  Used when a variable is defined but not used. This message belongs to the
  variables checker.
</code></pre></div></div>

<p>You may have noticed how noisy the output of <code class="language-plaintext highlighter-rouge">pylint</code> can sometimes be. For example, you may not always want to name a variable a certain way, or your function may be self-explanatory and not need a docstring. You can silence a specific message code by passing an argument:</p>

<p><code class="language-plaintext highlighter-rouge">pylint --disable=C0114</code></p>

<p>or even disable an entire message category:</p>

<p><code class="language-plaintext highlighter-rouge">pylint --disable=C</code></p>
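<p>Besides the command line, pylint also supports silencing messages inline with a special comment, which is handy when an exception is wanted only in one spot (the function below is just an illustration):</p>

```python
# pylint: disable=invalid-name
# The comment above suppresses C0103 from this point on in the module,
# so the camelCase function name below no longer gets flagged.
def MapFeature(x1, x2):
    """Toy function with a name that would normally violate snake_case."""
    return x1 + x2

print(MapFeature(1, 2))  # 3
```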

<p>pylint can be used as a pre-commit hook by adding it like so:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="pi">-</span> <span class="na">repo</span><span class="pi">:</span> <span class="s">https://github.com/PyCQA/pylint</span>
  <span class="na">rev</span><span class="pi">:</span> <span class="s">v2.15.9</span>
  <span class="na">hooks</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">id</span><span class="pi">:</span> <span class="s">pylint</span>
</code></pre></div></div>

<h3 id="final-pre-commit-configyaml-">Final pre-commit-config.yaml 📝</h3>

<p>Here is the final sample YAML file, which combines all of the hooks we have seen so far, along with some useful tweaks, particularly for <code class="language-plaintext highlighter-rouge">pylint</code>.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># .pre-commit-config.yml</span>
<span class="na">repos</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">repo</span><span class="pi">:</span> <span class="s">https://github.com/ambv/black</span>
  <span class="na">rev</span><span class="pi">:</span> <span class="s">22.3.0</span>
  <span class="na">hooks</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">id</span><span class="pi">:</span> <span class="s">black</span>
      <span class="na">name</span><span class="pi">:</span> <span class="s">black-py</span>
<span class="pi">-</span> <span class="na">repo</span><span class="pi">:</span> <span class="s">https://github.com/asottile/pyupgrade</span>
  <span class="na">rev</span><span class="pi">:</span> <span class="s">v2.32.0</span>
  <span class="na">hooks</span><span class="pi">:</span>
  <span class="pi">-</span>   <span class="na">id</span><span class="pi">:</span> <span class="s">pyupgrade</span>
      <span class="na">name</span><span class="pi">:</span> <span class="s">pyupgrade-py</span>

<span class="pi">-</span> <span class="na">repo</span><span class="pi">:</span> <span class="s">local</span>
  <span class="na">hooks</span><span class="pi">:</span>
  <span class="pi">-</span>   <span class="na">id</span><span class="pi">:</span> <span class="s">pylint</span>
      <span class="na">name</span><span class="pi">:</span> <span class="s">pylint-py</span>
      <span class="c1"># Add project root path</span>
      <span class="na">entry</span><span class="pi">:</span> <span class="s">pylint --init-hook="import sys,os; sys.path.append(os.getcwd())"</span>
      <span class="na">args </span><span class="pi">:</span> <span class="pi">[</span>
        <span class="c1"># black handles this except for string(C0301)</span>
        <span class="c1"># similar lines in multiple files(R0801)</span>
        <span class="c1"># attribute defined outside __init__(W0201)</span>
        <span class="s2">"</span><span class="s">--disable=C0301,R0801,W0201"</span><span class="pi">,</span>
        <span class="c1"># Allow 2-30 char variables</span>
        <span class="s2">"</span><span class="s">--variable-rgx=[a-z_][a-z0-9_]{1,30}$"</span><span class="pi">,</span>
        <span class="c1"># Allow 2-30 char attributes,args</span>
        <span class="s2">"</span><span class="s">--attr-rgx=[a-zA-Z_][a-zA-Z0-9_]{1,30}$"</span><span class="pi">,</span>
        <span class="s2">"</span><span class="s">--argument-rgx=[a-z_][a-z0-9_]{1,30}$"</span><span class="pi">,</span>
        <span class="c1">#  Exclude module member access for E1101</span>
        <span class="s2">"</span><span class="s">--generated-members=torch.*,pandas.*,Levenshtein.*"</span><span class="pi">,</span>
        <span class="c1"># Max local variables</span>
        <span class="s2">"</span><span class="s">--max-locals=25"</span><span class="pi">,</span>
        <span class="c1"># Exclusion for source unavailable pkgs</span>
        <span class="s2">"</span><span class="s">--extension-pkg-whitelist=lxml,pydantic"</span><span class="pi">,</span>
        <span class="c1"># Max Attributes for a class</span>
        <span class="s2">"</span><span class="s">--max-attributes=20"</span><span class="pi">,</span>
      <span class="pi">]</span>
      <span class="na">language</span><span class="pi">:</span> <span class="s">system</span>
      <span class="na">files </span><span class="pi">:</span> <span class="s">\.py$</span>
      <span class="na">require_serial</span><span class="pi">:</span> <span class="kc">true</span>
</code></pre></div></div>

<p><strong>Few Details</strong></p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">repo : local</code>
Defines pylint as a local hook instead of providing a repository URL</li>
  <li><code class="language-plaintext highlighter-rouge">language : system</code>
pre-commit won’t set up a new environment but will use the existing one</li>
  <li><code class="language-plaintext highlighter-rouge">entry: pylint --init-hook="import sys,os; sys.path.append(os.getcwd())"</code>
As we saw earlier, local hooks need to have their entry point defined. Using the <code class="language-plaintext highlighter-rouge">--init-hook</code> option we add the project root to the path. This prevents the import errors <code class="language-plaintext highlighter-rouge">pylint</code> would otherwise throw if the code imports any local modules.</li>
</ul>
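<p>To see why appending the project root matters, here is a minimal sketch of what that <code class="language-plaintext highlighter-rouge">--init-hook</code> accomplishes; the module name <code class="language-plaintext highlighter-rouge">mymodule</code> is hypothetical:</p>

```python
import os
import sys
import tempfile

# Simulate a project containing a local module that an import would
# fail to find without the project root on sys.path
project_root = tempfile.mkdtemp()
with open(os.path.join(project_root, "mymodule.py"), "w") as f:
    f.write("ANSWER = 42\n")

# This mirrors --init-hook="import sys,os; sys.path.append(os.getcwd())",
# with project_root standing in for os.getcwd()
sys.path.append(project_root)

import mymodule  # resolvable now that the root is on the path

print(mymodule.ANSWER)  # 42
```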

<p>Run pre-commit(<code class="language-plaintext highlighter-rouge">pre-commit run</code>) using the above config file to see it work its magic 🪄</p>

<p><strong>Note</strong>: You will need <code class="language-plaintext highlighter-rouge">pylint</code> already installed, since <code class="language-plaintext highlighter-rouge">repo : local</code> &amp; <code class="language-plaintext highlighter-rouge">language : system</code> are defined.</p>

<h2 id="in-this-article-️">In this article ☕️</h2>

<ul>
  <li>You understood why pre-commit is useful</li>
  <li>How a pre-commit config file is structured</li>
  <li>You looked at various hooks (black, pyupgrade and pylint) and how they can be used to tidy up your code.</li>
</ul>

<p>I hope this article was useful; if you have any doubts, do comment below.
Find the snippets of this blog and the config file that I generally use <a href="https://github.com/Praful932/blog/tree/main/blog-artifacts/blog-2-pre-commit-hooks">here</a> : )</p>

<h2 id="references">References</h2>
<ul>
  <li><a href="https://pre-commit.com/">pre-commit Documentation</a></li>
  <li><a href="https://pylint.pycqa.org/en/latest/">Pylint Documentation</a></li>
</ul>]]></content><author><name>Praful Mohanan</name></author><category term="python" /><category term="tech" /><category term="code" /><summary type="html"><![CDATA[Make you code smell less 😵‍💫]]></summary></entry><entry><title type="html">Ordering of set() when dealing with strings in python</title><link href="https://praful932.dev/blog-1-ordered-sets/" rel="alternate" type="text/html" title="Ordering of set() when dealing with strings in python" /><published>2022-12-18T00:00:00+00:00</published><updated>2022-12-18T00:00:00+00:00</updated><id>https://praful932.dev/blog-1-ordered-sets</id><content type="html" xml:base="https://praful932.dev/blog-1-ordered-sets/"><![CDATA[<p>While working on a baseline ML model for a side-project, I found that across different runs 🧪 of my experiments, the results that my model was generating were not exactly reproducible i.e. I was not getting the same performance metrics for the same model configuration, despite having all the knobs in place.</p>

<p>After debugging for quite some time, I found that this snippet was the root of my problems :</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># create list of unique tokens using set
</span><span class="n">unique_tokens</span><span class="p">.</span><span class="nf">extend</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="nf">set</span><span class="p">(</span><span class="n">itertools</span><span class="p">.</span><span class="nf">chain</span><span class="p">(</span><span class="o">*</span><span class="n">train_df</span><span class="p">.</span><span class="n">tokens</span><span class="p">.</span><span class="nf">to_list</span><span class="p">()))))</span>

<span class="n">config</span><span class="p">.</span><span class="n">VOCAB_SIZE</span> <span class="o">=</span> <span class="nf">len</span><span class="p">(</span><span class="n">unique_tokens</span><span class="p">)</span>

<span class="c1"># create tokenizer mapping
</span><span class="n">token2id</span> <span class="o">=</span> <span class="p">{</span><span class="n">token</span> <span class="p">:</span> <span class="n">idx</span> <span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">token</span> <span class="ow">in</span> <span class="nf">enumerate</span><span class="p">(</span><span class="n">unique_tokens</span><span class="p">)}</span>
<span class="n">id2token</span> <span class="o">=</span> <span class="nf">dict</span><span class="p">(</span><span class="nf">enumerate</span><span class="p">(</span><span class="n">unique_tokens</span><span class="p">))</span>
</code></pre></div></div>

<p>I was constructing the tokenizer mapping using the <code class="language-plaintext highlighter-rouge">set()</code> operation, which caused the same model input/output to be encoded &amp; decoded differently on each run.
We’ll see why.</p>

<h3 id="how-set-works">How set() works</h3>

<p>First, we need to understand how <code class="language-plaintext highlighter-rouge">set()</code> is implemented in Python. Internally, a <code class="language-plaintext highlighter-rouge">set()</code> is implemented using a hash table. A hash table, by definition, has a hash function that maps each input to a bucket based on its hash value; this is how it can do membership checks in <code class="language-plaintext highlighter-rouge">O(1)</code>.</p>

<p>When you call <code class="language-plaintext highlighter-rouge">set()</code> on a <code class="language-plaintext highlighter-rouge">list</code> object, it returns the unique values from the input you provided. Internally, to establish this <strong>“uniqueness”</strong>, it uses the hash function we discussed above.
<img src="/assets/images/blog-1-ordered-sets/hash-table.jpeg" alt="Hash Table" /></p>
<p style="text-align: center;">
    <em>Hash Table</em>
</p>
<p>Converting a <code class="language-plaintext highlighter-rouge">list</code> into a <code class="language-plaintext highlighter-rouge">set</code> is straightforward, since two equal values will always map to the same hash bucket. However, this hash function is not always deterministic, particularly when dealing with string objects across two different Python <strong>invocations</strong>. Let’s look at a few examples.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="sh">"""</span><span class="s">snippet1.py</span><span class="sh">"""</span>
<span class="c1"># Snippet to get hash values
</span>
<span class="n">a</span> <span class="o">=</span> <span class="sh">"</span><span class="s">1</span><span class="sh">"</span>
<span class="n">b</span> <span class="o">=</span> <span class="sh">"</span><span class="s">abcde</span><span class="sh">"</span>
<span class="n">c</span> <span class="o">=</span> <span class="mi">1234</span>
<span class="n">d</span> <span class="o">=</span> <span class="mf">6.4512</span>

<span class="n">hv1</span> <span class="o">=</span> <span class="nf">hash</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>
<span class="n">hv2</span> <span class="o">=</span> <span class="nf">hash</span><span class="p">(</span><span class="n">b</span><span class="p">)</span>
<span class="n">hv3</span> <span class="o">=</span> <span class="nf">hash</span><span class="p">(</span><span class="n">c</span><span class="p">)</span>
<span class="n">hv4</span> <span class="o">=</span> <span class="nf">hash</span><span class="p">(</span><span class="n">d</span><span class="p">)</span>

<span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Hash value of </span><span class="si">{</span><span class="n">a</span><span class="si">}</span><span class="s"> - </span><span class="si">{</span><span class="n">hv1</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Hash value of </span><span class="si">{</span><span class="n">b</span><span class="si">}</span><span class="s"> - </span><span class="si">{</span><span class="n">hv2</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Hash value of </span><span class="si">{</span><span class="n">c</span><span class="si">}</span><span class="s"> - </span><span class="si">{</span><span class="n">hv3</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Hash value of </span><span class="si">{</span><span class="n">d</span><span class="si">}</span><span class="s"> - </span><span class="si">{</span><span class="n">hv4</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>

<p>This is what I got from two different invocations of the script</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>python snippet1.py

Hash value of 1 - 1981388520896787279
Hash value of abcde - 4943320557970621589
Hash value of 1234 - 1234
Hash value of 6.4512 - 1040396365757218822
</code></pre></div></div>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>python snippet1.py

Hash value of 1 - <span class="nt">-9001918643517506909</span>
Hash value of abcde - <span class="nt">-757009308147773598</span>
Hash value of 1234 - 1234
Hash value of 6.4512 - 1040396365757218822
</code></pre></div></div>

<p>Notice how the outputs for the <code class="language-plaintext highlighter-rouge">string</code> variables differ across the two <strong>invocations</strong> of the script, while the hash values for the numbers remained constant.</p>

<p>This is because of how the hash function is implemented internally. For <code class="language-plaintext highlighter-rouge">str</code> and <code class="language-plaintext highlighter-rouge">byte</code> objects, the input to the hash function is salted with a random value to protect against certain denial-of-service attacks (<a href="https://docs.python.org/3.8/reference/datamodel.html#object.__hash__">source</a>). Within a single Python invocation the value stays the same, as this <strong>“salting”</strong> happens only once, when the Python executable starts.</p>
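<p>You can observe this salting directly by fixing the seed in fresh interpreters. The sketch below (the helper name is mine) launches child Python processes with an explicit <code class="language-plaintext highlighter-rouge">PYTHONHASHSEED</code>:</p>

```python
import os
import subprocess
import sys

def hash_with_seed(value, seed):
    """Compute hash(value) in a fresh interpreter with a fixed PYTHONHASHSEED."""
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({value!r}))"],
        env={**os.environ, "PYTHONHASHSEED": seed},
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)

# Same seed -> same salt -> same hash, even across separate invocations
assert hash_with_seed("abcde", "0") == hash_with_seed("abcde", "0")

# Different seeds use different salts, so the hashes (almost surely) differ
print(hash_with_seed("abcde", "0") != hash_with_seed("abcde", "1"))
```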

<p><strong>But how do these hash values link to the ordering of the sets 🤔</strong></p>

<p>In the <code class="language-plaintext highlighter-rouge">set()</code> data structure, after an object is hashed, Python takes the last <strong>N</strong> bits of the hash value and uses them as an <strong>index</strong> to place the object in memory. When these values are retrieved, <em>they are yielded in the order in which they sit in memory, <u>not the order in which they were put in</u>.</em></p>
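<p>As a rough illustration, for a set backed by an 8-slot table the initial slot is the last 3 bits of the hash (collisions are then resolved by probing, so the final slot can differ):</p>

```python
# For small ints, hash(n) == n in CPython, so these slots are deterministic
for value in [9, 1, 2, 3, 4, 5]:
    slot = hash(value) & 0b111  # last 3 bits -> index into an 8-slot table
    print(value, "->", slot)

# 9 and 1 both land on slot 1 initially, so one of them gets relocated
# by probing; strings, with randomized hashes, land differently each run
```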

<p><strong>And what happens to the order when you have different hash values across different python invocations?</strong></p>

<p>Here’s an example to make the concept concrete:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="sh">"""</span><span class="s">snippet2.py</span><span class="sh">"""</span>

<span class="n">l1</span> <span class="o">=</span> <span class="p">[</span><span class="mi">9</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">]</span>
<span class="n">l2</span> <span class="o">=</span> <span class="p">[</span><span class="sh">"</span><span class="s">def</span><span class="sh">"</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="sh">"</span><span class="s">abc</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">abc</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">deg</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">xyz</span><span class="sh">"</span><span class="p">]</span>

<span class="n">s1</span> <span class="o">=</span> <span class="nf">set</span><span class="p">(</span><span class="n">l1</span><span class="p">)</span>
<span class="n">s2</span> <span class="o">=</span> <span class="nf">set</span><span class="p">(</span><span class="n">l2</span><span class="p">)</span>

<span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Set 1 - </span><span class="si">{</span><span class="nf">set</span><span class="p">(</span><span class="n">s1</span><span class="p">)</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Set 2 - </span><span class="si">{</span><span class="nf">set</span><span class="p">(</span><span class="n">s2</span><span class="p">)</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>

<p>Output from two different invocations</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">$</span> <span class="n">python</span> <span class="n">snippet2</span><span class="p">.</span><span class="n">py</span>

<span class="n">Set</span> <span class="mi">1</span> <span class="o">-</span> <span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">9</span><span class="p">}</span>
<span class="n">Set</span> <span class="mi">2</span> <span class="o">-</span> <span class="p">{</span><span class="sh">'</span><span class="s">xyz</span><span class="sh">'</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="sh">'</span><span class="s">deg</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">def</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">abc</span><span class="sh">'</span><span class="p">}</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">$</span> <span class="n">python</span> <span class="n">snippet2</span><span class="p">.</span><span class="n">py</span>

<span class="n">Set</span> <span class="mi">1</span> <span class="o">-</span> <span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">9</span><span class="p">}</span>
<span class="n">Set</span> <span class="mi">2</span> <span class="o">-</span> <span class="p">{</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="sh">'</span><span class="s">def</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">abc</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">xyz</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">deg</span><span class="sh">'</span><span class="p">}</span>
</code></pre></div></div>

<p>You’ll notice how for set 2 the ordering is different.</p>

<p>Across the two runs, the strings have different hash values, so they were mapped to different locations in memory, which in turn changed the order in which they were yielded. 💡</p>

<h3 id="can-this-be-fixed">Can this be fixed?</h3>

<p>By design, Python sets are <a href="https://docs.python.org/3/tutorial/datastructures.html#sets">unordered</a>, so it is better to explore alternatives.
As of Python 3.7+, <a href="https://docs.python.org/3.7/library/stdtypes.html#mapping-types-dict">dicts</a> are ordered, so a hack like this works:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sample_list</span> <span class="o">=</span> <span class="p">[</span><span class="sh">"</span><span class="s">def</span><span class="sh">"</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="sh">"</span><span class="s">abc</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">abc</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">deg</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">xyz</span><span class="sh">"</span><span class="p">]</span>

<span class="n">sample_set</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="nb">dict</span><span class="p">.</span><span class="nf">fromkeys</span><span class="p">(</span><span class="n">sample_list</span><span class="p">))</span>
</code></pre></div></div>
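<p>Unlike <code class="language-plaintext highlighter-rouge">set()</code>, this preserves first-seen order, and since dict iteration follows insertion order rather than hash-dependent memory layout, the result is identical on every invocation:</p>

```python
sample_list = ["def", 2, 3, 4, "abc", "abc", "deg", "xyz"]

# Duplicates dropped, first-seen order kept, stable across runs
print(list(dict.fromkeys(sample_list)))
# ['def', 2, 3, 4, 'abc', 'deg', 'xyz']
```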

<p>This is how I modified my code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># create list of unique tokens using dict
</span><span class="n">unique_tokens</span><span class="p">.</span><span class="nf">extend</span><span class="p">(</span>
    <span class="nf">list</span><span class="p">(</span><span class="nb">dict</span><span class="p">.</span><span class="nf">fromkeys</span><span class="p">(</span><span class="n">itertools</span><span class="p">.</span><span class="nf">chain</span><span class="p">(</span><span class="o">*</span><span class="n">_df</span><span class="p">.</span><span class="n">tokens</span><span class="p">.</span><span class="nf">to_list</span><span class="p">())))</span>
<span class="p">)</span>

<span class="n">config</span><span class="p">.</span><span class="n">VOCAB_SIZE</span> <span class="o">=</span> <span class="nf">len</span><span class="p">(</span><span class="n">unique_tokens</span><span class="p">)</span>

<span class="c1"># create tokenizer mapping
</span><span class="n">token2id</span> <span class="o">=</span> <span class="p">{</span><span class="n">token</span><span class="p">:</span> <span class="n">idx</span> <span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">token</span> <span class="ow">in</span> <span class="nf">enumerate</span><span class="p">(</span><span class="n">unique_tokens</span><span class="p">)}</span>
<span class="n">id2token</span> <span class="o">=</span> <span class="nf">dict</span><span class="p">(</span><span class="nf">enumerate</span><span class="p">(</span><span class="n">unique_tokens</span><span class="p">))</span>
</code></pre></div></div>

<p>If you still need to use <code class="language-plaintext highlighter-rouge">set</code> and preserve ordering across different runs (<a href="https://docs.python.org/3.8/reference/datamodel.html#object.__hash__">not recommended</a>), the env variable <code class="language-plaintext highlighter-rouge">PYTHONHASHSEED</code> can be <a href="https://docs.python.org/3.5/using/cmdline.html#envvar-PYTHONHASHSEED">set</a> to <code class="language-plaintext highlighter-rouge">'0'</code> to disable randomization.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">os</span>
<span class="kn">import</span> <span class="n">sys</span>
<span class="n">hash_seed</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="nf">getenv</span><span class="p">(</span><span class="sh">'</span><span class="s">PYTHONHASHSEED</span><span class="sh">'</span><span class="p">)</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">hash_seed</span><span class="p">:</span>
    <span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="sh">'</span><span class="s">PYTHONHASHSEED</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="sh">'</span><span class="s">0</span><span class="sh">'</span>
    <span class="c1"># Spawn a new/child process and run the same file
</span>    <span class="n">os</span><span class="p">.</span><span class="nf">execv</span><span class="p">(</span><span class="n">sys</span><span class="p">.</span><span class="n">executable</span><span class="p">,</span> <span class="p">[</span><span class="n">sys</span><span class="p">.</span><span class="n">executable</span><span class="p">]</span> <span class="o">+</span> <span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">)</span>

<span class="c1"># Your code below
</span>
<span class="n">l1</span> <span class="o">=</span> <span class="p">[</span><span class="mi">9</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">]</span>
<span class="n">l2</span> <span class="o">=</span> <span class="p">[</span><span class="sh">"</span><span class="s">def</span><span class="sh">"</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="sh">"</span><span class="s">abc</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">abc</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">deg</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">xyz</span><span class="sh">"</span><span class="p">]</span>

<span class="n">s1</span> <span class="o">=</span> <span class="nf">set</span><span class="p">(</span><span class="n">l1</span><span class="p">)</span>
<span class="n">s2</span> <span class="o">=</span> <span class="nf">set</span><span class="p">(</span><span class="n">l2</span><span class="p">)</span>

<span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Set 1 - </span><span class="si">{</span><span class="nf">set</span><span class="p">(</span><span class="n">s1</span><span class="p">)</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Set 2 - </span><span class="si">{</span><span class="nf">set</span><span class="p">(</span><span class="n">s2</span><span class="p">)</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>
<p>This snippet turns off the randomization/salting. It does so by setting an <code class="language-plaintext highlighter-rouge">env</code> variable and then spawning a child process that runs the same Python file again, so the new invocation picks up the value of the <code class="language-plaintext highlighter-rouge">env</code> variable.
Running this snippet will give you the same ordering each time. Try it out : )</p>

<h3 id="in-this-article-️">In this article ☕️</h3>

<ul>
  <li>You understood how &amp; why sets are unordered</li>
  <li>How you can make them ordered</li>
  <li>Alternatives to preserve ordering and get unique values</li>
</ul>

<p>Find the snippets from this blog over <a href="https://github.com/Praful932/blog/tree/main/blog-artifacts/blog-1-ordered-sets">here</a> : )</p>

<h3 id="references">References</h3>
<ol>
  <li><a href="https://docs.python.org/3.4/reference/datamodel.html#object.__hash__">Documentation on hash</a></li>
</ol>]]></content><author><name>Praful Mohanan</name></author><category term="python" /><category term="tech" /><category term="code" /><summary type="html"><![CDATA[Why sets are unordered 🤔 and alternatives to order them]]></summary></entry></feed>