A first PR fixing a batch of bugs reported by Nicolas (#630)

linogaliana · web-flow · commit 7006f605a804 · 2025-07-28T14:20:47.000+02:00
* link to pandas manipulation close #617
* numpy random number generation
* numbers
* multidimensional
* close #623
* close #624
* close #630
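
Reviewer note: most hunks below replace calls to the global `np.random.*` functions with methods of an explicit generator. A minimal sketch of that pattern, assuming NumPy 1.17 or later (where `default_rng` is available); the names `legacy_draw`, `u`, `rng_again` and `x_again` are illustrative and do not appear in the PR:

```python
import numpy as np

# Legacy pattern (removed by this PR): a hidden, global random state.
np.random.seed(12345)
legacy_draw = np.random.normal(size=3)

# Pattern adopted by this PR: an explicit, seeded Generator object.
rng = np.random.default_rng(seed=12345)
x = rng.normal(0, size=(3, 4))   # standard normal draws, shape (3, 4)
u = rng.uniform(size=(10, 2))    # replaces np.random.rand(10, 2)

# Re-creating the generator with the same seed reproduces the same draws.
rng_again = np.random.default_rng(seed=12345)
x_again = rng_again.normal(0, size=(3, 4))
```

The legacy functions and `default_rng` rely on different bit generators (MT19937 and PCG64 respectively), so the same seed does not yield the same numbers before and after this change.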
diff --git a/_quarto.yml b/_quarto.yml
@@ -3,46 +3,8 @@ project:
   render:
     - index.qmd
     - 404.qmd
-    - content/getting-started/index.qmd
-    - content/getting-started/01_environment.qmd
-    - content/getting-started/02_data_analysis.qmd
-    - content/getting-started/03_revisions.qmd
     - content/manipulation/index.qmd
     - content/manipulation/01_numpy.qmd
-    - content/manipulation/02_pandas_intro.qmd
-    - content/manipulation/02_pandas_suite.qmd
-    - content/manipulation/02a_pandas_tutorial.qmd
-    - content/manipulation/02b_pandas_TP.qmd
-    - content/manipulation/03_geopandas_intro.qmd
-    - content/manipulation/03_geopandas_tutorial.qmd
-    - content/manipulation/03_geopandas_TP.qmd
-    - content/manipulation/04a_webscraping_TP.qmd
-    - content/manipulation/04c_API_TP.qmd
-    - content/manipulation/04b_regex_TP.qmd
-    - content/manipulation/05_parquet_s3.qmd
-    - content/visualisation/index.qmd
-    - content/visualisation/matplotlib.qmd
-    - content/visualisation/maps.qmd
-    - content/modelisation/index.qmd
-    - content/modelisation/0_preprocessing.qmd
-    - content/modelisation/1_modelevaluation.qmd
-    - content/modelisation/2_classification.qmd
-    - content/modelisation/3_regression.qmd
-    - content/modelisation/4_featureselection.qmd
-    - content/modelisation/5_clustering.qmd
-    - content/modelisation/6_pipeline.qmd
-    - content/modelisation/7_mlapi.qmd
-    - content/NLP/index.qmd
-    - content/NLP/01_intro.qmd
-    - content/NLP/02_exoclean.qmd
-    - content/NLP/03_embedding.qmd
-    - content/modern-ds/s3.qmd
-    - content/git/index.qmd
-    - content/git/introgit.qmd
-    - content/git/exogit.qmd
-    - content/annexes/about.qmd
-    - content/annexes/evaluation.qmd
-    - content/annexes/corrections.qmd
 
 profile:
   default: fr
diff --git a/content/manipulation/01_numpy.qmd b/content/manipulation/01_numpy.qmd
@@ -12,17 +12,13 @@ categories:
 description: |
   `Numpy` constitue la brique de base de l'écosystème de la _data science_ en
   `Python`. Toutes les librairies de manipulation de données, de modélisation
-  et de visualisation reposent, de manière plus ou moins directe, sur `Numpy`.
-  Il est donc indispensable de revoir quelques notions sur ce package avant
-  d'aller plus loin.
+  et de visualisation reposent, de manière plus ou moins directe, sur `Numpy`. Il est donc indispensable de revoir quelques notions sur ce package avant d'aller plus loin.
 description-en: |
   `Numpy` is the cornerstone of the _data science_ ecosystem in `Python`. All data manipulation, modeling, and visualization libraries rely, directly or indirectly, on `Numpy`. It is therefore essential to review some concepts of this package before moving forward.
 image: scatter_numpy.png
 echo: false
 ---
 
-fffffffff
-
 {{< badges
     printMessage="true"
 >}}
@@ -38,8 +34,7 @@ fffffffff
 Ce chapitre constitue une introduction à _Numpy_ pour
 s'assurer que les bases du calcul vectoriel avec `Python`
 soient maîtrisées. La première partie du chapitre
-présente des petits exercices pour pratiquer
-quelques fonctions basiques de `Numpy`. La fin du chapitre présente
+présente des petits exercices pour pratiquer quelques fonctions basiques de `Numpy`. La fin du chapitre présente
 des exercices pratiques d'utilisation de `Numpy` plus approfondis.
 
 Il est recommandé de régulièrement se référer à
@@ -79,36 +74,34 @@ We will also set the seed of the random number generator to obtain reproducible
 
 ```{python}
 #| echo: true
-np.random.seed(12345)
+import numpy as np
+rng = np.random.default_rng(seed=12345)
 ```
 
 ::: {.content-visible when-profile="fr"}
 
-::: {.callout-note}
-
-Les auteurs de `numpy` [préconisent désormais](https://numpy.org/doc/stable/reference/random/index.html)
-de privilégier l'utilisation de générateurs via la fonction `default_rng()` plutôt que la simple utilisation de `numpy.random`.
+::: {.callout-caution}
 
-Afin d'être en phase avec les codes qu'on peut trouver partout sur internet, nous conservons encore `np.random.seed` mais cela peut être amené à évoluer.
+Historiquement, la génération de nombres aléatoires se faisait par le biais des fonctions globales de `numpy.random` (par exemple `np.random.seed`). Néanmoins, les auteurs de `Numpy` [recommandent maintenant](https://numpy.org/doc/stable/reference/random/index.html) d'utiliser plutôt un générateur explicite, créé avec `default_rng()`. Les exemples de ce tutoriel adoptent donc cette pratique.
 
 :::
 
 :::
 
 :::: {.content-visible when-profile="en"}
 
-::: {.callout-note}
+::: {.callout-caution}
 
-The authors of `numpy` [now recommend](https://numpy.org/doc/stable/reference/random/index.html) using generators via the `default_rng()` function rather than simply using `numpy.random`.
-
-To stay consistent with the codes found everywhere on the internet, we still use `np.random.seed`, but this may change in the future.
+Historically, random numbers were generated through the global functions of `numpy.random` (for instance `np.random.seed`). However, the authors of `Numpy` [now recommend](https://numpy.org/doc/stable/reference/random/index.html) using an explicit generator, created with `default_rng()`, instead. The examples in this tutorial adopt this practice.
 
 :::
 
 ::::
 
 ::: {.content-visible when-profile="fr"}
 
+
+
 # Le concept d'_array_
 
 Dans le monde de la science des données, comme cela sera évoqué
@@ -227,10 +220,10 @@ np.array([["a","z","e"],["r","t"],["y"]], dtype="object")
 
 Il existe aussi des méthodes pratiques pour créer des array:
 
-* séquences logiques : `np.arange` (suite) ou `np.linspace` (interpolation linéaire entre deux bornes)
-* séquences ordonnées : _array_ rempli de zéros, de 1 ou d'un nombre désiré : `np.zeros`, `np.ones` ou `np.full`
-* séquences aléatoires : fonctions de génération de nombres aléatoires : `np.rand.uniform`, `np.rand.normal`, etc. 
-* tableau sous forme de matrice identité : `np.eye`
+* séquences logiques : `np.arange` (suite) ou `np.linspace` (interpolation linéaire entre deux bornes) ;
+* séquences ordonnées : _array_ rempli de zéros, de 1 ou d'un nombre désiré : `np.zeros`, `np.ones` ou `np.full` ;
+* séquences aléatoires : fonctions de génération de nombres aléatoires : `rng.uniform`, `rng.normal`, etc. où `rng` est un générateur de nombres aléatoires ;
+* tableau sous forme de matrice identité : `np.eye`.
 
 Ceci donne ainsi, pour les séquences logiques:
 
@@ -239,10 +232,10 @@ Ceci donne ainsi, pour les séquences logiques:
 ::: {.content-visible when-profile="en"}
 There are also practical methods for creating arrays:
 
-* Logical sequences: `np.arange` (sequence) or `np.linspace` (linear interpolation between two bounds)
-* Ordered sequences: array filled with zeros, ones, or a desired number: `np.zeros`, `np.ones`, or `np.full`
-* Random sequences: random number generation functions: `np.rand.uniform`, `np.rand.normal`, etc.
-* Matrix in the form of an identity matrix: `np.eye`
+* Logical sequences: `np.arange` (sequence) or `np.linspace` (linear interpolation between two bounds);
+* Ordered sequences: array filled with zeros, ones, or a desired number: `np.zeros`, `np.ones`, or `np.full`;
+* Random sequences: random number generation functions: `rng.uniform`, `rng.normal`, etc. where `rng` is a random number generator;
+* Matrix in the form of an identity matrix: `np.eye`.
 
 This gives, for logical sequences:
 :::
@@ -432,6 +425,51 @@ The syntax for selecting particular indices from a list also works with arrays.
 {{< include "01_numpy_exercises/_exo2_solution.qmd" >}}
 :::
 
+::: {.content-visible when-profile="fr"}
+
+La logique se généralise pour les _array_ multidimensionnels. L'indexation se fait alors à plusieurs niveaux. Prenons par exemple un _array_ à 2 dimensions (une matrice, en quelque sorte) :
+
+```{python}
+x = np.array([[1, 2, 3], [4, 5, 6]], np.int32)
+```
+
+Si on veut sélectionner la 2e ligne, 3e colonne (l'élément de valeur 6), on fait
+
+```{python}
+x[1, 2]
+```
+
+Maintenant, pour sélectionner une colonne complète (par exemple la 2e), on peut utiliser le 2e index pour spécifier celle-ci (index 1 en Python puisque l'indexation part de 0) puis `:` sur la première dimension (version raccourcie de `0:N`) pour ne pas discriminer selon cette dimension: 
+
+```{python}
+x[:,1]
+```
+
+Le principe se généralise, mais se complexifie, pour des _array_ imbriqués. Heureusement, ce sont des objets qu'on manipule assez rarement directement, la plupart de nos données numériques étant des tableaux plats (une valeur - l'observation - est le croisement d'une ligne - l'individu - et d'une colonne - la variable). 
+
+:::
+::: {.content-visible when-profile="en"}
+The same logic applies to multidimensional _arrays_. Indexing then takes place at several levels. Take, for example, a 2-dimensional array (a matrix of sorts):
+
+```{python}
+x = np.array([[1, 2, 3], [4, 5, 6]], np.int32)
+```
+
+If we want to select the 2nd row, 3rd column (the element with value 6), we write:
+
+```{python}
+x[1, 2]
+```
+
+Now, to select a complete column (e.g. the 2nd), we use the second index to specify it (index 1, since indexing starts from 0 in Python) and `:` on the first dimension (shorthand for `0:N`) so that no rows are filtered out:
+
+```{python}
+x[:,1]
+```
+
+The principle generalizes, but becomes more complex, for nested _arrays_. Fortunately, we rarely manipulate such objects directly, as most of our numerical data are flat tables (a value - the observation - sits at the intersection of a row - the individual - and a column - the variable).
+:::
+
 ::: {.content-visible when-profile="fr"}
 
 ## Sur la performance
@@ -543,7 +581,8 @@ Let's create `x` a multidimensional array and `y` a one-dimensional array with a
 
 ```{python}
 #| echo: true
-x = np.random.normal(0, size=(3, 4))
+# Assuming rng has been created beforehand
+x = rng.normal(0, size=(3, 4))
 y = np.array([np.nan, 0, 1])
 ```
 
@@ -605,7 +644,7 @@ To combine arrays, you can use, depending on the case, the functions `np.concate
 
 ```{python}
 #| echo: true
-x = np.random.normal(size = 10)
+x = rng.normal(size = 10)
 ```
 
 ::: {.content-visible when-profile="fr"}
@@ -650,7 +689,7 @@ For classical descriptive statistics, `Numpy` offers a number of already impleme
 
 ```{python}
 #| echo: true
-x = np.random.normal(0, size=(3, 4))
+x = rng.normal(0, size=(3, 4))
 ```
 
 
diff --git a/content/manipulation/01_numpy_exercises/_exo1_solution.qmd b/content/manipulation/01_numpy_exercises/_exo1_solution.qmd
@@ -1,7 +1,11 @@
 ```{python}
 #| output: false
-X = np.random.uniform(0,1,1000)
-Y = np.random.normal(0,np.sqrt(2),1000)
+
+# If rng has not been created beforehand:
+# rng = np.random.default_rng()
+
+X = rng.uniform(0,1,1000)
+Y = rng.normal(0,np.sqrt(2),1000)
 
 np.var(Y)
 ```
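
Reviewer note: the rewritten solution can be sanity-checked without the rest of the chapter. A minimal sketch, assuming a seeded generator (the seed, `var_Y` and `mean_X` are illustrative and not part of the PR):

```python
import numpy as np

# Seed chosen only to make this check reproducible.
rng = np.random.default_rng(seed=12345)

X = rng.uniform(0, 1, 1000)            # uniform draws on [0, 1)
Y = rng.normal(0, np.sqrt(2), 1000)    # normal draws with standard deviation sqrt(2)

var_Y = np.var(Y)     # empirical variance, expected to be close to sqrt(2)**2 = 2
mean_X = np.mean(X)   # empirical mean, expected to be close to 0.5
```

With 1000 draws the empirical variance only approximates the theoretical value of 2, so any comparison should use a tolerance rather than strict equality.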
diff --git a/content/manipulation/01_numpy_exercises/_exo3_solution.qmd b/content/manipulation/01_numpy_exercises/_exo3_solution.qmd
@@ -1,8 +1,10 @@
 ```{python}
 #| output: false
 
-# Solution
-x = np.random.normal(size=10000)
+# If rng has not been created beforehand:
+# rng = np.random.default_rng()
+
+x = rng.normal(size=10000)
 
 x2 = x[np.abs(x)>=1.96]
 
diff --git a/content/manipulation/01_numpy_exercises/_exo4_solution.qmd b/content/manipulation/01_numpy_exercises/_exo4_solution.qmd
@@ -1,8 +1,10 @@
 ```{python}
 #| output: false
 
-# Correction
-x = np.random.normal(0, size=(3, 4))
+# If rng has not been created beforehand:
+# rng = np.random.default_rng()
+
+x = rng.normal(0, size=(3, 4))
 y = np.array([np.nan, 0, 1])
 
 print(x)
diff --git a/content/manipulation/01_numpy_exercises/_exo6_en.qmd b/content/manipulation/01_numpy_exercises/_exo6_en.qmd
@@ -1,14 +1,14 @@
 ::: {.callout-tip}
-## Exercise (a bit more challenging)
+## Exercise 6 (a bit more challenging)
 
 1. Create `X`, a two-dimensional array (i.e., a matrix) with 10 rows and 2 columns. The numbers in the array are random.
 2. Import the `matplotlib.pyplot` module as `plt`. Use `plt.scatter` to plot the data as a scatter plot.
 3. Construct a 10x10 matrix storing, at element $(i,j)$, the Euclidean distance between points $X[i,]$ and $X[j,]$. To do this, you will need to work with dimensions by creating nested arrays using `np.newaxis`:
-  + First, use `X1 = X[:, np.newaxis, :]` to transform the matrix into a nested array. Check the dimensions.
-  + Create `X2` of dimension `(1, 10, 2)` using the same logic.
-  + Deduce, for each point, the distance with other points for each coordinate. Square this distance.
-  + At this stage, you should have an array of dimension `(10, 10, 2)`. The reduction to a matrix is obtained by summing over the last axis. Check the help of `np.sum` on how to sum over the last axis.
-  + Finally, apply the square root to obtain a proper Euclidean distance.
+   1. First, use `X1 = X[:, np.newaxis, :]` to transform the matrix into a nested array. Check the dimensions.
+   2. Create `X2` of dimension `(1, 10, 2)` using the same logic.
+   3. Deduce, for each point, the distance with other points for each coordinate. Square this distance.
+   4. At this stage, you should have an array of dimension `(10, 10, 2)`. The reduction to a matrix is obtained by summing over the last axis. Check the help of `np.sum` on how to sum over the last axis.
+   5. Finally, apply the square root to obtain a proper Euclidean distance.
 4. Verify that the diagonal elements are zero (distance of a point to itself...).
 5. Now, sort for each point the points with the most similar values. Use `np.argsort` to get the ranking of the closest points for each row.
 6. We are interested in the k-nearest neighbors. For now, set k=2. Use `argpartition` to reorder each row so that the 2 closest neighbors of each point come first, followed by the rest of the row.
diff --git a/content/manipulation/01_numpy_exercises/_exo6_fr.qmd b/content/manipulation/01_numpy_exercises/_exo6_fr.qmd
@@ -1,17 +1,17 @@
 ::: {.callout-tip}
-## Exercice (un peu plus corsé)
+## Exercice 6 (un peu plus corsé)
 
 
 1. Créer `X` un tableau à deux dimensions (i.e. une matrice) comportant 10 lignes
 et 2 colonnes. Les nombres dans le tableau sont aléatoires.
 2. Importer le module `matplotlib.pyplot` sous le nom `plt`. Utiliser
 `plt.scatter` pour représenter les données sous forme de nuage de points. 
 3. Constuire une matrice 10x10 stockant, à l'élément $(i,j)$, la distance euclidienne entre les points $X[i,]$ et $X[j,]$. Pour cela, il va falloir jouer avec les dimensions en créant des tableaux emboîtés à partir par des appels à `np.newaxis` :
-  + En premier lieu, utiliser `X1 = X[:, np.newaxis, :]` pour transformer la matrice en tableau emboîté. Vérifier les dimensions
-  + Créer `X2` de dimension `(1, 10, 2)` à partir de la même logique
-  + En déduire, pour chaque point, la distance avec les autres points pour chaque coordonnées. Elever celle-ci au carré
-  + A ce stade, vous devriez avoir un tableau de dimension `(10, 10, 2)`. La réduction à une matrice s'obtient en sommant sur le dernier axe. Regarder dans l'aide de `np.sum` comme effectuer une somme sur le dernier axe. 
-  + Enfin, appliquer la racine carrée pour obtenir une distance euclidienne en bonne et due forme. 
+   1. En premier lieu, utiliser `X1 = X[:, np.newaxis, :]` pour transformer la matrice en tableau emboîté. Vérifier les dimensions.
+   2. Créer `X2` de dimension `(1, 10, 2)` à partir de la même logique.
+   3. En déduire, pour chaque point, la distance avec les autres points pour chaque coordonnée. Élever celle-ci au carré.
+   4. À ce stade, vous devriez avoir un tableau de dimension `(10, 10, 2)`. La réduction à une matrice s'obtient en sommant sur le dernier axe. Regarder dans l'aide de `np.sum` comment effectuer une somme sur le dernier axe.
+   5. Enfin, appliquer la racine carrée pour obtenir une distance euclidienne en bonne et due forme.
 4. Vérifier que les termes diagonaux sont bien nuls (distance d'un point à lui-même...)
 5. Il s'agit maintenant de classer, pour chaque point, les points dont les valeurs sont les plus similaires. Utiliser `np.argsort` pour obtenir, pour chaque ligne, le classement des points les plus proches
 6. On va s'intéresser aux k-plus proches voisins. Pour le moment, fixons k=2. Utiliser `argpartition` pour réordonner chaque ligne de manière à avoir les 2 plus proches voisins de chaque point d'abord et le reste de la ligne ensuite
diff --git a/content/manipulation/01_numpy_exercises/_exo6_solution.qmd b/content/manipulation/01_numpy_exercises/_exo6_solution.qmd
@@ -1,13 +1,15 @@
 ```{python}
 #| output: false
 
-# Correction
+# If rng has not been created beforehand:
+# rng = np.random.default_rng()
+
+import matplotlib.pyplot as plt
 
 # Question 1
-X = np.random.rand(10, 2)
+X = rng.uniform(size = (10, 2))
 
 # Question 2 
-import matplotlib.pyplot as plt
 print(X[:,0])
 print(X[:,1])
 plt.scatter(X[:, 0], X[:, 1], s=100)
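
Reviewer note: the hunk above only shows questions 1 and 2 of the solution. For question 3 onwards, the steps listed in the exercise statement can be sketched as follows. This is one possible reading of the statement, not necessarily the code in the solution file; the seed and the names `sq_diff`, `dist`, `ranking`, `K` and `nearest` are assumptions:

```python
import numpy as np

rng = np.random.default_rng(seed=12345)   # seed chosen only for reproducibility
X = rng.uniform(size=(10, 2))             # 10 points in the plane

X1 = X[:, np.newaxis, :]                  # shape (10, 1, 2)
X2 = X[np.newaxis, :, :]                  # shape (1, 10, 2)

# Broadcasting expands both operands to (10, 10, 2): one squared
# difference per pair of points and per coordinate.
sq_diff = (X1 - X2) ** 2

# Sum over the last axis, then take the square root: Euclidean distances.
dist = np.sqrt(sq_diff.sum(axis=-1))      # shape (10, 10)

# Full ranking of neighbours for each row (column 0 is the point itself).
ranking = np.argsort(dist, axis=1)

# Partial ordering: the first K + 1 columns hold the point itself and its
# K nearest neighbours, in no guaranteed order.
K = 2
nearest = np.argpartition(dist, K, axis=1)
```

`np.argpartition` avoids a full sort, which matters when only the first few neighbours are needed.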
diff --git a/content/manipulation/01_numpy_exercises/_exo7_fr.qmd b/content/manipulation/01_numpy_exercises/_exo7_fr.qmd
@@ -3,11 +3,9 @@
 
 `Google` est devenu célèbre grâce à son algorithme `PageRank`. Celui-ci permet, à partir
 de liens entre sites _web_, de donner un score d'importance à un site _web_ qui va
-être utilisé pour évaluer sa centralité dans un réseau. 
-L'objectif de cet exercice est d'utiliser `Numpy` pour mettre en oeuvre un tel
-algorithme à partir d'une matrice d'adjacence qui relie les sites entre eux. 
+être utilisé pour évaluer sa centralité dans un réseau. L'objectif de cet exercice est d'utiliser `Numpy` pour mettre en oeuvre un tel algorithme à partir d'une matrice d'adjacence qui relie les sites entre eux.
 
-1. Créer la matrice suivante avec `numpy`. L'appeler `M`:
+1. Créer la matrice suivante avec `Numpy`. L'appeler `M`:
 
 $$
 \begin{bmatrix}
diff --git a/content/manipulation/01_numpy_exercises/_exo7_solution.qmd b/content/manipulation/01_numpy_exercises/_exo7_solution.qmd
@@ -29,6 +29,8 @@ ranking of nodes (pages) in the adjacency matrix
 """
 
 import numpy as np
+rng = np.random.default_rng()
+
 
 def pagerank(M, num_iterations: int = 100, d: float = 0.85):
     """PageRank: The trillion dollar algorithm.
@@ -51,7 +53,7 @@ def pagerank(M, num_iterations: int = 100, d: float = 0.85):
 
     """
     N = M.shape[1]
-    v = np.random.rand(N, 1)
+    v = rng.uniform(size = (N, 1))
     v = v / np.linalg.norm(v, 1)
     M_hat = (d * M + (1 - d) / N)
     for i in range(num_iterations):
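
Reviewer note: the hunk above stops at the loop header, so the update step is not visible in this diff. A minimal sketch of how the new generator fits into the function, assuming the loop applies the usual power-iteration update `v = M_hat @ v`; the function name `pagerank_sketch`, the seed and the 3-page matrix `M_demo` are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(seed=12345)   # seed chosen only for reproducibility


def pagerank_sketch(M, num_iterations: int = 100, d: float = 0.85):
    """Power iteration on a column-stochastic adjacency matrix M."""
    N = M.shape[1]
    v = rng.uniform(size=(N, 1))          # random starting vector
    v = v / np.linalg.norm(v, 1)          # normalise so the entries sum to 1
    M_hat = d * M + (1 - d) / N           # damped transition matrix
    for _ in range(num_iterations):
        v = M_hat @ v                     # assumed update step
    return v


# Hypothetical 3-page web; each column sums to 1.
M_demo = np.array([
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0],
])
scores = pagerank_sketch(M_demo)
```

Because the starting vector is normalised and every column of `M_hat` sums to 1, the scores keep summing to 1 at each iteration, whatever the random start.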
diff --git a/content/manipulation/02_pandas_intro.qmd b/content/manipulation/02_pandas_intro.qmd
@@ -1590,8 +1590,12 @@ Technically, `Pandas` handles missing values without issues (except for `int` va
 
 ```{python}
 #| echo: true
+
+import numpy as np
+rng = np.random.default_rng()
+
 ventes = pd.DataFrame(
-    {'prix': np.random.uniform(size = 5),
+    {'prix': rng.uniform(size = 5),
      'client1': [i+1 for i in range(5)],
      'client2': [i+1 for i in range(4)] + [np.nan],
      'produit': [np.nan] + ['yaourt','pates','riz','tomates']
diff --git a/content/manipulation/index.qmd b/content/manipulation/index.qmd
diff --git a/content/modelisation/01_preprocessing/_exo1.qmd b/content/modelisation/01_preprocessing/_exo1.qmd
diff --git a/content/modelisation/index.qmd b/content/modelisation/index.qmd
diff --git a/content/visualisation/matplotlib.qmd b/content/visualisation/matplotlib.qmd