====== XML: Extracting the POS from the Congrès TEI file ======
The goal is to extract the necessary information from a huge TEI/XML file (18.4 MB) that contains all the data of the French/Languedocien-Gascon "basicòt" lexicon. The dictionary is interesting because its terms carry POS tags.\\
The "basicòt" contains 13,610 terms.
The "xml:lang" attribute did not work during parsing with the lxml library, so it was replaced by "language" throughout the file.
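As an aside, namespaced attributes such as ''xml:lang'' can also be read without renaming them, using the Clark notation ''{namespace-URI}local-name''. A minimal sketch with the standard-library ElementTree (lxml accepts the same notation):

<code python>
import xml.etree.ElementTree as ET

# The "xml" prefix is predefined, so xml:lang needs no namespace declaration.
XML_NS = "http://www.w3.org/XML/1998/namespace"

elem = ET.fromstring('<orth xml:lang="oc">abaishar</orth>')
lang = elem.get("{%s}lang" % XML_NS)
print(lang)  # -> oc
</code>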
===== Extract of the file basicunif_complet_language.xml =====
<code xml>
...
...
diminuer
...
</code>
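The markup of the extract above was lost. Judging from the XPath expressions used in basic.py, an entry plausibly has a structure like the following hypothetical reconstruction (values borrowed from entry 12 of the output; the standard-library ElementTree stands in for lxml here):

<code python>
import xml.etree.ElementTree as ET

# Hypothetical entry, reconstructed from the paths queried in basic.py;
# the real file's content may differ.
SAMPLE = """<TEI><text><body>
  <entry n="12">
    <form>
      <orth>abaisser</orth>
      <gramGrp><gram norm="V000050000000"/></gramGrp>
    </form>
    <sense n="I-A-1-a">
      <cit>
        <form dialect="oc-gascon">
          <orth>abaishar</orth>
          <gramGrp><gram norm="V000050000000"/></gramGrp>
        </form>
      </cit>
    </sense>
  </entry>
</body></text></TEI>"""

root = ET.fromstring(SAMPLE)
entry = root.find("text/body/entry")
cit = entry.find("sense/cit")
print(entry.get("n"), entry.find("form/orth").text)                 # 12 abaisser
print(cit.find("form").get("dialect"), cit.find("form/orth").text)  # oc-gascon abaishar
</code>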
===== The basic.py program =====
The TEI code is not always well formed. When an error occurs, the program must not stop: the error is flagged in the output so that the row can be filtered out later.
<code python>
#!/usr/bin/env python3
# coding: utf8
from lxml import etree

tree = etree.parse("basicunif_complet_language.xml")
f = open("basicunif_complet_extract.txt", 'w', encoding='utf-8')
f.write("id,fr,posfr,sens,dialect,oc,posoc\n")
for entry in tree.xpath("/TEI/text/body/entry"):
    entryId = entry.get('n')
    orth = entry.find("form/orth").text
    gramNorm = entry.find("form/gramGrp/gram").get('norm')
    for sense in entry.xpath("sense"):
        for cit in sense.xpath("cit"):
            # dialect of the citation; flag the row if the element is missing
            try:
                langcit = cit.find("form").get('dialect')
            except AttributeError:
                langcit = "langcit error"
            # orthcit: the Occitan form itself
            orthcit = cit.find("form/orth")
            try:
                orthcitText = orthcit.text
            except AttributeError:
                orthcitText = "orthcit error"
            # gramNormcit: the POS tag of the Occitan form
            try:
                gramNormCit = cit.find("form/gramGrp/gram").get('norm')
            except AttributeError:
                gramNormCit = "gramNormCit error"
            s = "\"%s\", \"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\"\n" % (entryId, orth, gramNorm, sense.get('n'), langcit, orthcitText, gramNormCit)
            f.write(s)
f.close()
</code>
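Note that the manual quoting above breaks if a value ever contains a double quote; the standard ''csv'' module handles quoting and escaping automatically. A minimal sketch, with the row values borrowed from the output below:

<code python>
import csv
import io

# One sample row from the extraction, quoted by the csv module itself.
header = ["id", "fr", "posfr", "sens", "dialect", "oc", "posoc"]
row = ["12", "abaisser", "V000050000000", "I-A-1-a",
       "oc-gascon", "abaishar", "V000050000000"]

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
writer.writerow(header)
writer.writerow(row)
print(buf.getvalue())
</code>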
===== Result =====
The output is in CSV format so that it can be loaded into an SQL or NoSQL database.
<code>
id,fr,posfr,sens,dialect,oc,posoc
"1", "à","AP1","I-A-1-a","oc","a","AP1"
"1", "à","AP1","I-A-2-a","oc","per","AP1"
"1", "à","AP1","I-A-2-a","oc-gascon","entà","AP1"
"1", "à","AP1","I-A-3-a","oc","a","AP1"
"1", "à","AP1","I-A-4-a","oc","de","AP1"
"1", "à","AP1","I-A-5-a","oc","de","AP1"
"1", "à","AP1","I-A-6-a","oc-gascon","entà","AP1"
"1", "à","AP1","I-A-6-a","oc-lengadoc","a","AP1"
"2", "à cloche-pied","AV0000","I-A-1-a","oc-gascon","a sautapè","AV0000"
"2", "à cloche-pied","AV0000","I-A-1-a","oc-lengadoc","a pè-ranquet","AV0000"
"3", "à contrecoeur","AV0000","I-A-1-a","oc","a contracòr","AV0000"
"3", "à contrecoeur","AV0000","I-A-1-a","oc-gascon","d’arrèrcòr","AV0000"
"3", "à contrecoeur","AV0000","I-A-1-a","oc-lengadoc","de rèirecòr","AV0000"
"4", "a fortiori","AV0000","I-A-1-a","oc","a fortiori","AV0000"
"5", "à jeun","AV0000","I-A-1-a","oc-gascon","dejun","AJ0110"
"5", "à jeun","AV0000","I-A-1-a","oc-lengadoc","dejun","AJ0110"
"6", "à peu près","AP1","I-A-1-a","oc-gascon","haut o baish","AP1"
"6", "à peu près","AP1","I-A-1-a","oc-gascon","mei o mensh","AP1"
"6", "à peu près","AP1","I-A-1-a","oc-lengadoc","a pauc près","AP1"
"6", "à peu près","AP1","I-A-1-a","oc-lengadoc","pauc se'n manca","AP1"
"7", "à rebours","AV0000","I-A-1-a","oc-gascon","au reboish","AV0000"
"7", "à rebours","AV0000","I-A-1-a","oc-lengadoc","a revèrs","AV0000"
"8", "à reculons","AV0000","I-A-1-a","oc-gascon","d'arreculas","AV0000"
"8", "à reculons","AV0000","I-A-1-a","oc-gascon","de reculas","AV0000"
"8", "à reculons","AV0000","I-A-1-a","oc-lengadoc","de reculons","AV0000"
"9", "à tâtons","AV0000","I-A-1-a","oc-gascon","a paupas","AV0000"
"9", "à tâtons","AV0000","I-A-1-a","oc-lengadoc","a palpas","AV0000"
"10", "à-coup","N1110","I-A-1-a","oc-gascon","sacat","N1110"
"10", "à-coup","N1110","I-A-1-a","oc-lengadoc","bassacada","N1210"
"11", "à-peu-près","N1110","I-A-1-a","oc","a-pauc-près","N1110"
"11", "à-peu-près","N1110","I-A-2-a","oc","aproximacion","N1110"
"12", "abaisser","V000050000000","I-A-1-a","oc-gascon","abaishar","V000050000000"
"12", "abaisser","V000050000000","I-A-1-a","oc-gascon","baishar","V000050000000"
"12", "abaisser","V000050000000","I-A-1-a","oc-lengadoc","abaissar","V000050000000"
"12", "abaisser","V000050000000","I-A-1-a","oc-lengadoc","baissar","V000050000000"
"12", "abaisser","V000050000000","I-A-2-a","oc","amermar","V000050000000"
...
</code>
===== Terms with errors =====
^ id ^ fr ^ posfr ^ sens ^ dialect ^ oc ^ posoc ^
| 3034 | de | AT3000 | I-A-1-a | langcit error | orthcit error | gramNormCit error |
| 3720 | du | AT3100 | I-A-1-a | langcit error | orthcit error | gramNormCit error |
| 13591 | je | PD312011500002 | I-A-1-a | langcit error | orthcit error | gramNormCit error |
| 13593 | tu | PD211011500022 | I-A-1-a | langcit error | orthcit error | gramNormCit error |
| 13595 | il | PD311011500002 | I-A-1-a | langcit error | orthcit error | gramNormCit error |
| 13596 | je | PD312011500002 | I-A-1-a | langcit error | orthcit error | gramNormCit error |
===== Loading the CSV into a MySQL table =====
The table is named 'cplo_basic_unificat'.
===== Aggregating the POS tags for a given sense =====
<code sql>
SELECT idterm, fr, posfr,
       group_concat( sens ) AS sens,
       group_concat( dialect ) AS dialectes,
       group_concat( oc ) AS occitans,
       group_concat( posoc ) AS posoccitans
FROM `cplo_basic_unificat`
WHERE dialect != 'langcit error'
GROUP BY idterm, sens;
</code>
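The same aggregation can be checked locally with the standard-library ''sqlite3'' module (a stand-in for MySQL; the table and column names follow the query, and the rows are taken from the CSV output above):

<code python>
import sqlite3

# In-memory stand-in for the MySQL table; idterm corresponds to
# the CSV's id column.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE cplo_basic_unificat
    (idterm TEXT, fr TEXT, posfr TEXT, sens TEXT,
     dialect TEXT, oc TEXT, posoc TEXT)""")
rows = [
    ("8", "à reculons", "AV0000", "I-A-1-a", "oc-gascon", "d'arreculas", "AV0000"),
    ("8", "à reculons", "AV0000", "I-A-1-a", "oc-gascon", "de reculas", "AV0000"),
    ("8", "à reculons", "AV0000", "I-A-1-a", "oc-lengadoc", "de reculons", "AV0000"),
]
con.executemany("INSERT INTO cplo_basic_unificat VALUES (?,?,?,?,?,?,?)", rows)

result = con.execute("""SELECT idterm, fr, posfr,
       group_concat(sens) AS sens,
       group_concat(dialect) AS dialectes,
       group_concat(oc) AS occitans,
       group_concat(posoc) AS posoccitans
FROM cplo_basic_unificat
WHERE dialect != 'langcit error'
GROUP BY idterm, sens""").fetchall()
print(result)
</code>

The three rows for term 8 collapse into a single record whose dialect, Occitan form, and POS columns are comma-concatenated lists.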
===== CSV extraction =====