UCL IR Coursework I

Creator
Creator
Seonglae ChoSeonglae Cho
Created
Created
2025 Jan 22 10:55
Editor
Edited
Edited
2025 Feb 26 0:47
Refs
Refs

What will happen to
Zipfian distribution
plot if we remove stop words from our vocabulary?

  • Deadline: February 26, 2025 at 4pm
  • Data set: 200 search queries, for each one ≤1,000 passages that were returned
  • Tasks: text processing and analysis, inverted index implementa,on, (re-)rank the passages for each query based on basic retrieval and query likelihood language models
Give extra attention the following
  • marking will be par,ally automated
  • please follow instruction to the latter!
  • following the insturction also means no peanlties
  • Python (recommended), Java (permiXed), no notebook submissions, each task asks for specific output
  • filename/type, a submission will consist of 10 or 11 files exactly
  • your answers will have a level of stochas,city
  • do not use external func,ons that can solve end-to-end the tasks of building an inverted index, retrieval and language models
  • only use unigram (1-gram) text representa,ons
  • use the ACL LaTeX template for your report
  • Support Sessions:
    • Weekly support: Thursdays 16:30-17:30
    • Q&A session: February 5 at 11am (1st lecture hour)
  • Communication Guidelines:
    • Maximum 2 emails per student to v.lampos@ucl.ac.uk
    • Contact TAs for additional support
    • No posting about Coursework 1 on forum or public media
    • Important updates will be posted on Moodle
    • No questions about Coursework 2
  • Technical Notes:
    • Reference runtime: 12 minutes (M1 MacBook Pro) / 6 minutes (M3 Max)
    • Inverted index implementation requires memory optimization and parallel processing
 
 

Understanding

what should I have to do from task 1-4 for each

Coding

can i use numpy or pandas and torch
preprocessing

Experiments

task1
task2
task3
no report
task4

Related lectures

  • text processing and indexing (tfidf)
  • Probabilistic models
  • BM25
 
 
 
 

Prompt

Information Retrieval and Data Mining (COMP0084) Coursework 1 Tutor: Vasileios Lampos, v.lampos@ucl.ac.uk Marking penalties specific to this coursework Marking penalties are denoted with bold font throughout the specification of the coursework. The maximum amount of marks that can be lost due to penalties (= 20) is the sum of the following: • 5 marks for malformed submission file names and / or imposing a directory structure, • 5 marks for a report that did not follow the formatting guidelines, • 1 mark for having figures that are not in vector graphics, • 5 marks for incorrectly formed output files (1 mark per output file), and • 4 marks for very slow to execute source code files (1 mark per source code file). Penalty marks are subtracted from your final mark. To avoid the majority of the aforementioned penalties, please follow the coursework’s guidelines as described below. Task definition An information retrieval model is an essential component for many applications such as search, question answering, and recommendation. In this coursework you are going to develop information retrieval models that solve the problem of passage retrieval, i.e. given a text query, we need to return a ranked list of short texts (passages). More specifically, in this assignment you are going to build a passage re-ranking system: given a candidate list of returned passages for a text query, you should re-rank these returned passages based on an information retrieval model. qid pid query passage 523270 2818345 toyota of plano plano tx DART’s Red Line runs along North Central… 527433 1537731 types of dysarthria from cerebral palsy In some cases, further testing will also be… 1113437 5194230 what is physical description of spruce Source: *U.S. Rehab Aide Job Description… 833860 5043973 what is the most popular food in switzerland The national currency in Switzerland is the… 1056204 2328890 who was the first steam boat operator a relatively small usually open craft of a… Figure 1: Sample rows from the candidate-passages-top1000.tsv file. 1 Data Data can be downloaded from this online repository. 1 The data set consists of 3 files: • test-queries.tsv is a tab separated file, where each row contains a test query identifier (qid) and the actual query text. • passage-collection.txt is a collection of passages, one per row. • candidate-passages-top1000.tsv is a tab separated file with an initial selection of at most 1000 passages for each of the queries in test-queries.tsv. The format of this file is <qid pid query passage>, where pid is the identifier of the passage retrieved, query is the query text, and passage is the passage text (all tab separated). The passages contained in this file are the same as the ones in passage-collection.txt. However, there might be some repetitions, i.e. the same passage could have been returned for more than 1 query. Figure 1 shows some sample rows from that file. It is important that you do not change the above filenames, i.e. your source code must use the above filenames, otherwise we might not be able to assess your code automatically and a penalty of 5 marks will be applied. Coursework tasks and deliverables Please read carefully as all deliverables (D1 to D12) are described within the tasks. If a deliverable requests an answer in a specific format, follow these guidelines carefully as parts of the marking will be automated. Although we strongly suggest using Python, you could also use Java. Please make one consistent choice for this coursework. Do not submit your source code as a Jupyter Notebook or similar. All source code must be your own. You are not allowed to reuse any code that is available from someone or somewhere else. Do not use external functions or libraries that can solve (end-to-end) the tasks of building an inverted index, TF-IDF, the vector space model, BM25, or any of the query likelihood language models. You could use functions or libraries that enhance numeric operations, data storage, and retrieval when this does not violate the aforementioned conditions. However, do make sure that your code is not very inefficient. Very inefficient submissions will be penalised. The expected completion time also depends on computing power. As an approximate rule of thumb, the maximum runtime per task should not exceed 40 minutes when using a fairly modern laptop (e.g. a 2020 MacBook Pro with a 1st generation M1 processor and 16 GB RAM). The equivalent runtime for very efficient implementations could be 5 minutes or less per task. If we are not able to timely run your source code on our machines (i.e. the code for a task is still running after 40 minutes, noting that our machines have stronger specifications than an M1 MacBook Pro), a penalty of 1 mark for each task that is slow will be applied. Use only unigram (1-gram) text representations to solve this coursework’s tasks.2 Write your report using the ACL LATEX template,3 and submit 1 PDF file named report.pdf. All plots/figures in your report must be in vector graphics format (e.g. .pdf).4 If they are not, a penalty of 1 mark will be applied. Your report should not exceed 6 pages. If there are formatting 1Link to the data, dropbox.com/s/z15xt8og1a7bt3q/coursework-1-data.zip?dl=0 2n-gram representations with n > 1 can improve performance, but are out-of-scope for this coursework. 3The ACL LATEX template is available as an OverLeaf project, overleaf.com/read/crtcwgxzjskr. 4Vector graphics, en.wikipedia.org/wiki/Vector_graphics 2 issues (ACL LATEX template not used, reported exceeds 6 pages etc.) a penalty of 5 marks will be applied. In total, your submission should have 4 source code files (task[1-4].py/java), an optional requirements.txt file, 5 output CSV files (tfidf.csv, bm25.csv, laplace.csv, lidstone.csv, dirichlet.csv), and 1 PDF file with the report (report.pdf). Do not deploy any internal directory structure in your submission, i.e. just submit the aforementioned 10 or 11 files. Do not compress (e.g. zip) your submission files. The data files should not be included in your submission. You should assume that the data files will be in the same directory as the source code (again no directory structure). If you impose a directory structure, then we will need to remove this by editing your source code to then be able to assess your submission automatically. This will result in a 5 mark penalty. Your source code may generate intermediate files when it runs (e.g. to transfer an outcome of a previous step to the next one). These should be removed when submitting your solution. Task 1 – Text statistics (25 marks). Use the data in passage-collection.txt for this task. Extract terms (1-grams) from the raw text. In doing so, you can also perform basic text preprocessing steps. However, you can also choose not to. D1 Do not remove stop words in the first instance. Describe and justify any other text preprocessing choice(s) if any [4 marks], and report the size of the identified index of terms (vocabulary) [1 mark]. Then, implement a function (or a block of source code) that counts the number of occurrences of terms in the provided data set, plot (Figure 1) their probability of occurrence (normalised frequency) against their frequency ranking [1 mark], and qualitatively justify that these terms follow Zipf’s law [1 mark]. As a reminder, Zipf’s law for text sets s = 1 in the Zipfian distribution defined by f(k; s, N) = k −s PN i=1 i−s , (1) where f(·) denotes the normalised frequency of a term, k denotes the term’s frequency rank in our corpus (with k = 1 being the highest rank), N is the number of terms in our vocabulary, and s is a distribution parameter. How does your empirical distribution compare to the actual Zipf’s law distribution? Use Eq. 1 to explain and to quantify where their difference is coming from [15 marks], and also compare the two in a log-log plot (Figure 2) [1 mark]. How will the difference between the two distributions be affected if we also remove stop words? Justify your answer by depicting (Figure 3) and quantifying this difference [2 marks]. D2 Submit your source code for Task 1 as a single Python or Java file titled task1.py or task1.java, respectively. Task 2 – Inverted index (10 marks). Use candidate-passages-top1000.tsv for this task (unique instances of column pairs pid and passage). Using the vocabulary of terms identified in Task 1 (you will need to choose between removing or keeping stop words), build an inverted index for the collection so that you can retrieve passages in an efficient way [3 marks]. 3 D3 Provide a description of your approach to generating an inverted index, report the information that you decided to store in it, and justify this choice [7 marks]. Hint: Implementing Tasks 3 and 4 should inform this choice. D4 Submit the source code for Task 2 as a single Python or Java file titled task2.py or task2.java, respectively. Task 3 – Retrieval models (35 marks). For this task, please use test-queries.tsv and candidate-passages-top1000.tsv. There is no need to describe your approach in the report. You will be marked based on your output only. If the output does not follow the description (see below), we will apply a penalty of 1 mark per problematic output file and attempt to fix minor issues manually (e.g. remove headers, correct spacing etc.). However, if after fixing minor issues, the output file does not have the exact number of rows as requested (this means that there is a fundamental flow in your solution), you will receive 0 marks for that part of the coursework. We will not be able to re-arrange the order of rows that you have provided. D5 Extract the TF-IDF vector representations of the passages using the inverted index you have constructed. Using the IDF representation from the corpus of the passages, extract the TF-IDF representation of the queries. Use a basic vector space model with TF-IDF and cosine similarity to retrieve at most 100 passages from within the 1000 candidate passages for each query (some queries have fewer candidate passages). Store the outcomes in a file named tfidf.csv [20 marks]. Each row of tfidf.csv must have the following format: qid,pid,score where qid denotes the query’s identifier, pid denotes the passage identifier, and score is their cosine similarity score. Note that there should be no spaces between commas. The output .csv files should not have headers. Strictly report the identifiers from the raw data (do not replace them with your own). qid,pid pairs should be ranked by similarity score in descending order, i.e. starting with the highest similarity score. You should first report the top qid,pid pairs for one qid, then move on to the next, following the same query order as in the text-queries.tsv file. The same file format should be used for reporting the outputs of all the retrieval or language models that you will implement as part of this coursework (i.e. this applies to Task 4 as well). The last column should report the corresponding similarity/relevance score of each model. Each output .csv file is expected to have exactly 19,290 rows. D6 Use the inverted index to also implement BM25 while setting k1 = 1.2, k2 = 100, and b = 0.75. Retrieve at most 100 passages from within the 1000 candidate passages for each test query. Store the outcomes in a file named bm25.csv [15 marks]. D7 Submit the source code for Task 3 as a single Python or Java file titled task3.py or task3.java, respectively. Task 4 – Query likelihood language models (30 marks). Use test-queries.tsv and candidate-passages-top1000.tsv for this task. Implement the query likelihood language model with (a) Laplace smoothing, (b) Lidstone correction with ϵ = 0.1, and (c) Dirichlet smoothing with µ = 50, and retrieve at most 100 passages from within the 1000 candidate passages for each test query. There is no need to describe your approach in implementing these language models in the report; you will be marked based on your output only. However, please 4 note that D11 needs to be answered in the report. Similarly to Task 3, if the output does not follow the description (see Task 3), we will apply a penalty of 1 mark per problematic output file and attempt to fix minor issues manually (e.g. remove headers, correct spacing etc.). If after fixing minor issues, the output file does not have the exact number of rows as requested (this means that there is a fundamental flow in your solution), you will receive 0 marks for that part of the coursework. We will not be able to re-arrange the order of rows that you have provided. D8-10 Store the outcomes of (a), (b), and (c) in the files laplace.csv, lidstone.csv, and dirichlet.csv, respectively. In these files, the 3rd column should report the natural logarithm of the probability score [15 marks]. D11 Which language model do you expect to work better [1 mark] and why [2 marks]? Give a few empirical examples based on your data and experiments [2 marks]. The examples need to be comprehensive to a reader of the report that does not have access to the data (i.e. do not refer to passage or query IDs). Which language models are expected to be more similar [1 mark] and why [2 marks]? Give a few empirical examples based on your data and experiments [2 marks]. Comment on the value of ϵ = 0.1 in the Lidstone correction. Is this a good choice [1 mark], would there be a better setting (if so, please provide a range of values) [1 mark], and why [1 mark]? If we set µ = 5000 in Dirichlet smoothing, would this be a more appropriate value [1 mark], and why [1 mark]? D12 Submit the source code for Task 4 as a single Python or Java file titled task4.py or task4.java, respectively. Information Retrieval & Data Mining [COMP0084] Coursework 1 (CW1) Q & A Vasileios Lampos Computer Science, UCL � lampos.net Preliminaries — About this lecture 2 ‣ In this lecture: — brief overview of Coursework 1 (CW1) — provide answers to some of your questions ‣ NB: Let’s all try to keep in mind that we cannot reveal the direct solutions of the coursework! Be mindful as you make your questions! ‣ Textbook references about TF-IDF and cosine similarity — nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf (search for these terms) — web.stanford.edu/~jurafsky/slp3/ed3book_Jan25.pdf (Chapter 6; my preferred resource) COMP0084 - CW1 Q & A About Coursework 1 3 ‣ Link: moodle.ucl.ac.uk/mod/assign/view.php?id=6116763 ‣ 50% of the final mark (this is probably the easier half) ‣ Some statistics from previous year • mean mark 75% (sd: 11%) • 3% of the submissions were below 50% ‣ As always, start working on this coursework early ‣ Aim: To build a passage (document) re-ranking system: given a candidate list of passages that are returned for a text query, we should re-rank (reorder from the most to the least relevant) these candidate passages based on 5 basic retrieval models. ‣ Data set: 200 search queries, for each one ≤1,000 passages that were returned, i.e. in total at most 200,000 small texts (is this large scale? NO) COMP0084 - CW1 Q & A About Coursework 1 — Ques ons, support, basic code of conduct 4 ‣ Deadline: February 26, 2024 at 4pm ‣ Office hours: Please send me an email to set up a meeting (v.lampos@ucl.ac.uk) ‣ You can send me up to 2 emails for CW1. Please be concise with your questions. ‣ Do not post anything about Coursework 1 on the course’s forum or in any public medium (to avoid unintentional “spoilers”) ‣ I might send announcements with clarifications to the entire class. These will be posted on the forum, so please check your emails and the COMP0084 forum regularly. ‣ Marks are expected to be released by the end of March. Please note that if there is a substantial amount of EC extensions, the mark release date might be delayed. COMP0084 - CW1 Q & A CW1 — Task 1 (25 marks) 5 ‣ Text preprocessing steps to form a vocabulary ‣ Perform a basic statistical analysis of text ‣ Comparison between empirical term frequency distribution and Zipf’s law — visual comparison (plots), but you should also use the equations to quantify the expected difference (and where it is coming from) ‣ How the difference will be affected if we remove stop words (justify by quantifying the difference) ‣ Overall, this is the least demanding task from a coding perspective ‣ Might be good to revisit your analysis after performing the next tasks, i.e. inform text preprocessing choices based on the language models. COMP0084 - CW1 Q & A CW1 — Task 2 (10 marks) 6 ‣ Inverted index ‣ NB: If implemented inefficiently, it might take hours ‣ What information have you decided to store in the inverted index and why? COMP0084 - CW1 Q & A CW1 — Task 3 (35 marks) 7 ‣ Retrieval models (TF-IDF & cosine similarity, BM25) ‣ Be very careful in the formatting of your outputs ‣ For this task, there is nothing I expect you to write in the report; you will be marked based on the output you are going to submit. COMP0084 - CW1 Q & A CW1 — Task 4 (30 marks) 8 ‣ Query likelihood language models with Laplace, Lidstone correction, and Dirichlet smoothing ‣ No description of the language models is required; you will be marked based on your output ‣ In addition, there are a few more things that you need to do: — Which language model do you expect to work better / similarly — Provide evidence by reporting examples from your data and experiments — Impact of language model hyperparameters / different settings / justification COMP0084 - CW1 Q & A CW1 — Common ques ons 9 1. What is the desirable structure and style / format of the report? 2. Can we use external libraries for some of the tasks in the coursework? 3. Can our source code produce intermediate files? Can operations be shared among different tasks? 4. What is “vector graphics”? 5. I have little programming experience. Will I cope with this coursework? 6. What will we mark automatically? 7. Are we going to run your source code? How? 8. Should we include comprehensive comments in our source code? 9. Should we compute language model parameters based on one query’s passages or the entire corpus of passages? The latter. 10. How should we define term frequency (TF) in TF-IDF? Normalised or binary count are my general suggestions, but using the raw count should also have the same effect (with the normalised approach) when deploying the cosine similarity. COMP0084 - CW1 Q & A Next lectures with me 10 Introduction to machine learning and data mining ‣ February 12 (3 hours) Topic models and vector semantics (word embeddings) ‣ March 12 (2 hours) Modelling COVID-19 using web search activity ‣ March 26? (1 or 2 hours — TBD) COMP0084 - CW1 Q & A
 
 
 
 
 

Recommendations