UCL IR Coursework I

What will happen to
Zipfian distribution plot if we remove stop words from our vocabulary?

Deadline: February 26, 2025 at 4pm

Data set: 200 search queries, for each one ≤1,000 passages that were returned

Tasks: text processing and analysis, inverted index implementa,on, (re-)rank the passages for each query based on basic retrieval and query likelihood language models

Give extra attention the following

marking will be par,ally automated

please follow instruction to the latter!

following the insturction also means no peanlties

Python (recommended), Java (permiXed), no notebook submissions, each task asks for specific output

filename/type, a submission will consist of 10 or 11 files exactly

your answers will have a level of stochas,city

do not use external func,ons that can solve end-to-end the tasks of building an inverted index, retrieval and language models

only use unigram (1-gram) text representa,ons

use the ACL LaTeX template for your report

Support Sessions:

Weekly support: Thursdays 16:30-17:30
Q&A session: February 5 at 11am (1st lecture hour)

Communication Guidelines:

Maximum 2 emails per student to v.lampos@ucl.ac.uk
Contact TAs for additional support
No posting about Coursework 1 on forum or public media
Important updates will be posted on Moodle
No questions about Coursework 2

Technical Notes:

Reference runtime: 12 minutes (M1 MacBook Pro) / 6 minutes (M3 Max)
Inverted index implementation requires memory optimization and parallel processing

Understanding

what should I have to do from task 1-4 for each

Coding

can i use numpy or pandas and torch

preprocessing

Experiments

task1

task2

task3

no report

task4

Related lectures

text processing and indexing (tfidf)

Probabilistic models

BM25

candidate-passages-top1000.tsv

72530.2KB

passage-collection.txt

60779.1KB

Prompt


Information Retrieval and Data Mining (COMP0084)
Coursework 1
Tutor: Vasileios Lampos, v.lampos@ucl.ac.uk
Marking penalties specific to this coursework
Marking penalties are denoted with bold font throughout the specification of the coursework.
The maximum amount of marks that can be lost due to penalties (= 20) is the sum of the
following:
• 5 marks for malformed submission file names and / or imposing a directory structure,
• 5 marks for a report that did not follow the formatting guidelines,
• 1 mark for having figures that are not in vector graphics,
• 5 marks for incorrectly formed output files (1 mark per output file), and
• 4 marks for very slow to execute source code files (1 mark per source code file).
Penalty marks are subtracted from your final mark. To avoid the majority of the aforementioned
penalties, please follow the coursework’s guidelines as described below.
Task definition
An information retrieval model is an essential component for many applications such as search,
question answering, and recommendation. In this coursework you are going to develop information retrieval models that solve the problem of passage retrieval, i.e. given a text query, we need
to return a ranked list of short texts (passages). More specifically, in this assignment you are
going to build a passage re-ranking system: given a candidate list of returned passages for a text
query, you should re-rank these returned passages based on an information retrieval model.
qid pid query passage
523270 2818345 toyota of plano plano tx DART’s Red Line runs along North Central…
527433 1537731 types of dysarthria from cerebral palsy In some cases, further testing will also be…
1113437 5194230 what is physical description of spruce Source: *U.S. Rehab Aide Job Description…
833860 5043973 what is the most popular food in switzerland The national currency in Switzerland is the…
1056204 2328890 who was the first steam boat operator a relatively small usually open craft of a…
Figure 1: Sample rows from the candidate-passages-top1000.tsv file.
1
Data
Data can be downloaded from this online repository.
1 The data set consists of 3 files:
• test-queries.tsv is a tab separated file, where each row contains a test query identifier
(qid) and the actual query text.
• passage-collection.txt is a collection of passages, one per row.
• candidate-passages-top1000.tsv is a tab separated file with an initial selection of at
most 1000 passages for each of the queries in test-queries.tsv. The format of this file is
<qid pid query passage>, where pid is the identifier of the passage retrieved, query is
the query text, and passage is the passage text (all tab separated). The passages contained
in this file are the same as the ones in passage-collection.txt. However, there might be
some repetitions, i.e. the same passage could have been returned for more than 1 query.
Figure 1 shows some sample rows from that file.
It is important that you do not change the above filenames, i.e. your source code must use
the above filenames, otherwise we might not be able to assess your code automatically and a
penalty of 5 marks will be applied.
Coursework tasks and deliverables
Please read carefully as all deliverables (D1 to D12) are described within the tasks. If a deliverable requests an answer in a specific format, follow these guidelines carefully as parts of the
marking will be automated. Although we strongly suggest using Python, you could also use
Java. Please make one consistent choice for this coursework. Do not submit your source code as
a Jupyter Notebook or similar. All source code must be your own. You are not allowed to
reuse any code that is available from someone or somewhere else. Do not use external functions or
libraries that can solve (end-to-end) the tasks of building an inverted index, TF-IDF, the vector
space model, BM25, or any of the query likelihood language models. You could use functions or
libraries that enhance numeric operations, data storage, and retrieval when this does not violate
the aforementioned conditions.
However, do make sure that your code is not very inefficient. Very inefficient submissions
will be penalised. The expected completion time also depends on computing power. As an
approximate rule of thumb, the maximum runtime per task should not exceed 40 minutes when
using a fairly modern laptop (e.g. a 2020 MacBook Pro with a 1st generation M1 processor and
16 GB RAM). The equivalent runtime for very efficient implementations could be 5 minutes or
less per task. If we are not able to timely run your source code on our machines (i.e. the code
for a task is still running after 40 minutes, noting that our machines have stronger specifications
than an M1 MacBook Pro), a penalty of 1 mark for each task that is slow will be applied.
Use only unigram (1-gram) text representations to solve this coursework’s tasks.2 Write
your report using the ACL LATEX template,3 and submit 1 PDF file named report.pdf. All
plots/figures in your report must be in vector graphics format (e.g. .pdf).4
If they are not, a
penalty of 1 mark will be applied. Your report should not exceed 6 pages. If there are formatting
1Link to the data, dropbox.com/s/z15xt8og1a7bt3q/coursework-1-data.zip?dl=0
2n-gram representations with n > 1 can improve performance, but are out-of-scope for this coursework.
3The ACL LATEX template is available as an OverLeaf project, overleaf.com/read/crtcwgxzjskr.
4Vector graphics, en.wikipedia.org/wiki/Vector_graphics
2
issues (ACL LATEX template not used, reported exceeds 6 pages etc.) a penalty of 5 marks will
be applied.
In total, your submission should have 4 source code files (task[1-4].py/java), an optional
requirements.txt file, 5 output CSV files (tfidf.csv, bm25.csv, laplace.csv, lidstone.csv,
dirichlet.csv), and 1 PDF file with the report (report.pdf). Do not deploy any internal
directory structure in your submission, i.e. just submit the aforementioned 10 or 11 files. Do
not compress (e.g. zip) your submission files. The data files should not be included in your
submission. You should assume that the data files will be in the same directory as the source
code (again no directory structure). If you impose a directory structure, then we will need to
remove this by editing your source code to then be able to assess your submission automatically.
This will result in a 5 mark penalty. Your source code may generate intermediate files when it
runs (e.g. to transfer an outcome of a previous step to the next one). These should be removed
when submitting your solution.
Task 1 – Text statistics (25 marks). Use the data in passage-collection.txt for this
task. Extract terms (1-grams) from the raw text. In doing so, you can also perform basic text
preprocessing steps. However, you can also choose not to.
D1 Do not remove stop words in the first instance. Describe and justify any other text preprocessing choice(s) if any [4 marks], and report the size of the identified index of terms
(vocabulary) [1 mark].
Then, implement a function (or a block of source code) that counts the number of occurrences of terms in the provided data set, plot (Figure 1) their probability of occurrence
(normalised frequency) against their frequency ranking [1 mark], and qualitatively justify
that these terms follow Zipf’s law [1 mark].
As a reminder, Zipf’s law for text sets s = 1 in the Zipfian distribution defined by
f(k; s, N) = k
−s
PN
i=1 i−s
, (1)
where f(·) denotes the normalised frequency of a term, k denotes the term’s frequency
rank in our corpus (with k = 1 being the highest rank), N is the number of terms in
our vocabulary, and s is a distribution parameter. How does your empirical distribution
compare to the actual Zipf’s law distribution? Use Eq. 1 to explain and to quantify where
their difference is coming from [15 marks], and also compare the two in a log-log plot
(Figure 2) [1 mark].
How will the difference between the two distributions be affected if we also remove stop
words? Justify your answer by depicting (Figure 3) and quantifying this difference [2
marks].
D2 Submit your source code for Task 1 as a single Python or Java file titled task1.py or
task1.java, respectively.
Task 2 – Inverted index (10 marks). Use candidate-passages-top1000.tsv for this task
(unique instances of column pairs pid and passage). Using the vocabulary of terms identified
in Task 1 (you will need to choose between removing or keeping stop words), build an inverted
index for the collection so that you can retrieve passages in an efficient way [3 marks].
3
D3 Provide a description of your approach to generating an inverted index, report the information that you decided to store in it, and justify this choice [7 marks].
Hint: Implementing Tasks 3 and 4 should inform this choice.
D4 Submit the source code for Task 2 as a single Python or Java file titled task2.py or
task2.java, respectively.
Task 3 – Retrieval models (35 marks). For this task, please use test-queries.tsv and
candidate-passages-top1000.tsv. There is no need to describe your approach in the report.
You will be marked based on your output only. If the output does not follow the description (see
below), we will apply a penalty of 1 mark per problematic output file and attempt to fix minor
issues manually (e.g. remove headers, correct spacing etc.). However, if after fixing minor issues,
the output file does not have the exact number of rows as requested (this means that there is a
fundamental flow in your solution), you will receive 0 marks for that part of the coursework. We
will not be able to re-arrange the order of rows that you have provided.
D5 Extract the TF-IDF vector representations of the passages using the inverted index you
have constructed. Using the IDF representation from the corpus of the passages, extract
the TF-IDF representation of the queries. Use a basic vector space model with TF-IDF and
cosine similarity to retrieve at most 100 passages from within the 1000 candidate passages
for each query (some queries have fewer candidate passages). Store the outcomes in a file
named tfidf.csv [20 marks]. Each row of tfidf.csv must have the following format:
qid,pid,score
where qid denotes the query’s identifier, pid denotes the passage identifier, and score is
their cosine similarity score. Note that there should be no spaces between commas. The
output .csv files should not have headers. Strictly report the identifiers from the raw data
(do not replace them with your own). qid,pid pairs should be ranked by similarity score
in descending order, i.e. starting with the highest similarity score. You should first report
the top qid,pid pairs for one qid, then move on to the next, following the same query
order as in the text-queries.tsv file. The same file format should be used for reporting
the outputs of all the retrieval or language models that you will implement as part of
this coursework (i.e. this applies to Task 4 as well). The last column should report the
corresponding similarity/relevance score of each model. Each output .csv file is expected
to have exactly 19,290 rows.
D6 Use the inverted index to also implement BM25 while setting k1 = 1.2, k2 = 100, and
b = 0.75. Retrieve at most 100 passages from within the 1000 candidate passages for each
test query. Store the outcomes in a file named bm25.csv [15 marks].
D7 Submit the source code for Task 3 as a single Python or Java file titled task3.py or
task3.java, respectively.
Task 4 – Query likelihood language models (30 marks). Use test-queries.tsv and
candidate-passages-top1000.tsv for this task. Implement the query likelihood language
model with (a) Laplace smoothing, (b) Lidstone correction with ϵ = 0.1, and (c) Dirichlet
smoothing with µ = 50, and retrieve at most 100 passages from within the 1000 candidate passages for each test query. There is no need to describe your approach in implementing these
language models in the report; you will be marked based on your output only. However, please
4
note that D11 needs to be answered in the report. Similarly to Task 3, if the output does not
follow the description (see Task 3), we will apply a penalty of 1 mark per problematic output
file and attempt to fix minor issues manually (e.g. remove headers, correct spacing etc.). If after
fixing minor issues, the output file does not have the exact number of rows as requested (this
means that there is a fundamental flow in your solution), you will receive 0 marks for that part
of the coursework. We will not be able to re-arrange the order of rows that you have provided.
D8-10 Store the outcomes of (a), (b), and (c) in the files laplace.csv, lidstone.csv, and
dirichlet.csv, respectively. In these files, the 3rd column should report the natural
logarithm of the probability score [15 marks].
D11 Which language model do you expect to work better [1 mark] and why [2 marks]? Give
a few empirical examples based on your data and experiments [2 marks]. The examples
need to be comprehensive to a reader of the report that does not have access to the data
(i.e. do not refer to passage or query IDs).
Which language models are expected to be more similar [1 mark] and why [2 marks]? Give
a few empirical examples based on your data and experiments [2 marks].
Comment on the value of ϵ = 0.1 in the Lidstone correction. Is this a good choice [1 mark],
would there be a better setting (if so, please provide a range of values) [1 mark], and why
[1 mark]?
If we set µ = 5000 in Dirichlet smoothing, would this be a more appropriate value [1 mark],
and why [1 mark]?
D12 Submit the source code for Task 4 as a single Python or Java file titled task4.py or
task4.java, respectively.



Information Retrieval & Data Mining [COMP0084]
Coursework 1 (CW1) Q & A
Vasileios Lampos
Computer Science, UCL
� lampos.net
Preliminaries — About this lecture
2
‣ In this lecture:
— brief overview of Coursework 1 (CW1)
— provide answers to some of your questions
‣ NB: Let’s all try to keep in mind that we cannot reveal the direct solutions of the
coursework! Be mindful as you make your questions!
‣ Textbook references about TF-IDF and cosine similarity
— nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf (search for these terms)
— web.stanford.edu/~jurafsky/slp3/ed3book_Jan25.pdf (Chapter 6; my preferred resource)
COMP0084 - CW1 Q & A
About Coursework 1
3
‣ Link: moodle.ucl.ac.uk/mod/assign/view.php?id=6116763
‣ 50% of the final mark (this is probably the easier half)
‣ Some statistics from previous year
• mean mark 75% (sd: 11%)
• 3% of the submissions were below 50%
‣ As always, start working on this coursework early
‣ Aim: To build a passage (document) re-ranking system: given a candidate list of passages
that are returned for a text query, we should re-rank (reorder from the most to the least
relevant) these candidate passages based on 5 basic retrieval models.
‣ Data set: 200 search queries, for each one ≤1,000 passages that were returned, i.e. in total
at most 200,000 small texts (is this large scale? NO)
COMP0084 - CW1 Q & A
About Coursework 1 — Ques ons, support, basic code of conduct
4
‣ Deadline: February 26, 2024 at 4pm
‣ Office hours: Please send me an email to set up a meeting (v.lampos@ucl.ac.uk)
‣ You can send me up to 2 emails for CW1. Please be concise with your questions.
‣ Do not post anything about Coursework 1 on the course’s forum or in any public medium
(to avoid unintentional “spoilers”)
‣ I might send announcements with clarifications to the entire class. These will be posted
on the forum, so please check your emails and the COMP0084 forum regularly.
‣ Marks are expected to be released by the end of March. Please note that if there is a
substantial amount of EC extensions, the mark release date might be delayed.
COMP0084 - CW1 Q & A
CW1 — Task 1 (25 marks)
5
‣ Text preprocessing steps to form a vocabulary
‣ Perform a basic statistical analysis of text
‣ Comparison between empirical term frequency distribution and Zipf’s law — visual
comparison (plots), but you should also use the equations to quantify the expected
difference (and where it is coming from)
‣ How the difference will be affected if we remove stop words (justify by quantifying the
difference)
‣ Overall, this is the least demanding task from a coding perspective
‣ Might be good to revisit your analysis after performing the next tasks, i.e. inform text
preprocessing choices based on the language models.
COMP0084 - CW1 Q & A
CW1 — Task 2 (10 marks)
6
‣ Inverted index
‣ NB: If implemented inefficiently, it might take hours
‣ What information have you decided to store in the inverted index and why?
COMP0084 - CW1 Q & A
CW1 — Task 3 (35 marks)
7
‣ Retrieval models (TF-IDF & cosine similarity, BM25)
‣ Be very careful in the formatting of your outputs
‣ For this task, there is nothing I expect you to write in the report; you will be marked based
on the output you are going to submit.
COMP0084 - CW1 Q & A
CW1 — Task 4 (30 marks)
8
‣ Query likelihood language models with Laplace, Lidstone correction, and Dirichlet
smoothing
‣ No description of the language models is required; you will be marked based on your
output
‣ In addition, there are a few more things that you need to do:
— Which language model do you expect to work better / similarly
— Provide evidence by reporting examples from your data and experiments
— Impact of language model hyperparameters / different settings / justification
COMP0084 - CW1 Q & A
CW1 — Common ques ons
9
1. What is the desirable structure and style / format of the report?
2. Can we use external libraries for some of the tasks in the coursework?
3. Can our source code produce intermediate files? Can operations be shared among
different tasks?
4. What is “vector graphics”?
5. I have little programming experience. Will I cope with this coursework?
6. What will we mark automatically?
7. Are we going to run your source code? How?
8. Should we include comprehensive comments in our source code?
9. Should we compute language model parameters based on one query’s passages or the
entire corpus of passages? The latter.
10. How should we define term frequency (TF) in TF-IDF? Normalised or binary count are my
general suggestions, but using the raw count should also have the same effect (with the
normalised approach) when deploying the cosine similarity.
COMP0084 - CW1 Q & A
Next lectures with me
10
Introduction to machine learning and data mining
‣ February 12 (3 hours)
Topic models and vector semantics (word embeddings)
‣ March 12 (2 hours)
Modelling COVID-19 using web search activity
‣ March 26? (1 or 2 hours — TBD)
COMP0084 - CW1 Q & A

UCL IR Coursework I

What will happen to Zipfian distribution plot if we remove stop words from our vocabulary?

Understanding

Coding

Experiments

Related lectures

Prompt

Recommendations

What will happen to
Zipfian distribution plot if we remove stop words from our vocabulary?