XML: extracting POS tags from the Congrès TEI file
The goal is to extract the necessary information from a very large TEI XML file (18.4 MB) that holds the whole French/Languedocian-Gascon Basic dictionary (the "basicòt"). The dictionary is of interest because its terms carry POS (part-of-speech) tags.
The basicòt contains 13,610 terms.
The “xml:lang” attribute could not be handled when parsing with the lxml library, so it was replaced by “language” throughout the file.
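As a possible alternative to renaming the attribute, ElementTree-style APIs (lxml included) expose namespaced attributes under their fully qualified name, written `{namespace}name`. The sketch below reads `xml:lang` that way on a small hand-written sample; it is not what was done on this page, and it has not been tried on the full file. The names `XML_NS`, `XML_LANG` and `sample` are illustrative only.

```python
try:
    from lxml import etree
except ImportError:
    # Fallback: the standard library offers the same fromstring()/get() calls.
    import xml.etree.ElementTree as etree

# The "xml" prefix is always bound to this namespace URI.
XML_NS = "http://www.w3.org/XML/1998/namespace"
XML_LANG = "{%s}lang" % XML_NS

# Hand-written sample (assumption), modelled on the dictionary excerpt.
sample = '<form type="main" xml:lang="oc-gascon"><orth>abaishar</orth></form>'
form = etree.fromstring(sample)

# The attribute is stored under its qualified name, not under "xml:lang".
lang = form.get(XML_LANG)
print(lang)
```

With this approach the source file would stay unmodified, at the cost of a longer attribute name in every lookup.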
Excerpt from the file basicunif_complet_language.xml
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<TEI>
  <teiHeader> ... </teiHeader>
  <text>
    <body>
      ...
      <entry n="12">
        <form type="main" language="fr">
          <orth>abaisser</orth>
          <gramGrp>
            <pos norm="verb">v.</pos>
            <gram type="eaglescat" norm="V000050000000"></gram>
          </gramGrp>
        </form>
        <sense n="I-A-1-a">
          <cit type="translation" language="oc-gascon">
            <form type="main" language="oc-gascon">
              <orth>abaishar</orth>
              <gramGrp>
                <gram type="eaglescat" norm="V000050000000"></gram>
              </gramGrp>
            </form>
          </cit>
          <cit type="translation" language="oc-gascon">
            <form type="main" language="oc-gascon">
              <orth>baishar</orth>
              <gramGrp>
                <gram type="eaglescat" norm="V000050000000"></gram>
              </gramGrp>
            </form>
          </cit>
          <cit type="translation" language="oc-lengadoc">
            <form type="main" language="oc-lengadoc">
              <orth>abaissar</orth>
              <gramGrp>
                <gram type="eaglescat" norm="V000050000000"></gram>
              </gramGrp>
            </form>
          </cit>
          <cit type="translation" language="oc-lengadoc">
            <form type="main" language="oc-lengadoc">
              <orth>baissar</orth>
              <gramGrp>
                <gram type="eaglescat" norm="V000050000000"></gram>
              </gramGrp>
            </form>
          </cit>
        </sense>
        <sense n="I-A-2-a">
          <usg type="hint">diminuer</usg>
          <cit type="translation" language="oc">
            <form type="main" language="oc">
              <orth>amermar</orth>
              <gramGrp>
                <gram type="eaglescat" norm="V000050000000"></gram>
              </gramGrp>
            </form>
          </cit>
        </sense>
      </entry>
      ...
    </body>
  </text>
</TEI>
Program basic.py
The TEI markup is not always well formed. When an element is missing, the program must not stop: the error is marked in the output so that the row can be discarded later.
#!/usr/bin/env python3
# coding: utf8
from lxml import etree

tree = etree.parse("basicunif_complet_language.xml")

# Text mode with an explicit encoding: in Python 3, write() takes str, not bytes.
with open("basicunif_complet_extract.txt", "w", encoding="utf-8") as f:
    f.write("id,fr,posfr,sens,dialect,oc,posoc\n")
    for entry in tree.xpath("/TEI/text/body/entry"):
        # French headword and its POS code
        entryId = entry.get('n')
        orth = entry.find("form/orth").text
        gramNorm = entry.find("form/gramGrp/gram").get('norm')
        for sense in entry.xpath("sense"):
            for cit in sense.xpath("cit"):
                # Dialect of the translation: the attribute is named
                # "language" in the modified file (see the excerpt).
                try:
                    langcit = cit.find("form").get('language')
                except AttributeError:
                    langcit = "langcit error"
                # Occitan form
                orthcit = cit.find("form/orth")
                try:
                    orthcitText = orthcit.text
                except AttributeError:
                    orthcitText = "orthcit error"
                # POS code of the Occitan form
                try:
                    gramNormCit = cit.find("form/gramGrp/gram").get('norm')
                except AttributeError:
                    gramNormCit = "gramNormCit error"
                s = "\"%s\", \"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\"\n" % (entryId, orth, gramNorm, sense.get('n'), langcit, orthcitText, gramNormCit)
                f.write(s)
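The error-marking idea can also be written without try/except, by testing whether `find()` returned `None`. The following is only a sketch of that variant on a hand-written sample containing one empty `<cit>`; the helpers `attr_or_marker` and `text_or_marker` are hypothetical names, not part of basic.py.

```python
try:
    from lxml import etree
except ImportError:
    # Fallback: the standard library offers the same find()/findall()/get() calls.
    import xml.etree.ElementTree as etree

# Hand-written sample (assumption): the second <cit> has no <form> child.
SAMPLE = """<sense n="I-A-2-a">
  <cit type="translation" language="oc">
    <form type="main" language="oc">
      <orth>amermar</orth>
      <gramGrp><gram type="eaglescat" norm="V000050000000"></gram></gramGrp>
    </form>
  </cit>
  <cit type="translation" language="oc"></cit>
</sense>"""


def attr_or_marker(parent, path, attr, marker):
    """Return attribute `attr` of the element at `path`, or `marker` if that element is missing."""
    element = parent.find(path)
    return marker if element is None else element.get(attr)


def text_or_marker(parent, path, marker):
    """Return the text of the element at `path`, or `marker` if that element is missing."""
    element = parent.find(path)
    return marker if element is None else element.text


sense = etree.fromstring(SAMPLE)
rows = []
for cit in sense.findall("cit"):
    rows.append((
        attr_or_marker(cit, "form", "language", "langcit error"),
        text_or_marker(cit, "form/orth", "orthcit error"),
        attr_or_marker(cit, "form/gramGrp/gram", "norm", "gramNormCit error"),
    ))

for row in rows:
    print(row)
```

The markers are the same strings the program writes, so rows produced either way can be filtered identically afterwards.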
Result
The output is in CSV format so that it can be loaded into an SQL or NoSQL database.
id,fr,posfr,sens,dialect,oc,posoc
"1", "à","AP1","I-A-1-a","oc","a","AP1"
"1", "à","AP1","I-A-2-a","oc","per","AP1"
"1", "à","AP1","I-A-2-a","oc-gascon","entà","AP1"
"1", "à","AP1","I-A-3-a","oc","a","AP1"
"1", "à","AP1","I-A-4-a","oc","de","AP1"
"1", "à","AP1","I-A-5-a","oc","de","AP1"
"1", "à","AP1","I-A-6-a","oc-gascon","entà","AP1"
"1", "à","AP1","I-A-6-a","oc-lengadoc","a","AP1"
"2", "à cloche-pied","AV0000","I-A-1-a","oc-gascon","a sautapè","AV0000"
"2", "à cloche-pied","AV0000","I-A-1-a","oc-lengadoc","a pè-ranquet","AV0000"
"3", "à contrecoeur","AV0000","I-A-1-a","oc","a contracòr","AV0000"
"3", "à contrecoeur","AV0000","I-A-1-a","oc-gascon","d’arrèrcòr","AV0000"
"3", "à contrecoeur","AV0000","I-A-1-a","oc-lengadoc","de rèirecòr","AV0000"
"4", "a fortiori","AV0000","I-A-1-a","oc","a fortiori","AV0000"
"5", "à jeun","AV0000","I-A-1-a","oc-gascon","dejun","AJ0110"
"5", "à jeun","AV0000","I-A-1-a","oc-lengadoc","dejun","AJ0110"
"6", "à peu près","AP1","I-A-1-a","oc-gascon","haut o baish","AP1"
"6", "à peu près","AP1","I-A-1-a","oc-gascon","mei o mensh","AP1"
"6", "à peu près","AP1","I-A-1-a","oc-lengadoc","a pauc près","AP1"
"6", "à peu près","AP1","I-A-1-a","oc-lengadoc","pauc se'n manca","AP1"
"7", "à rebours","AV0000","I-A-1-a","oc-gascon","au reboish","AV0000"
"7", "à rebours","AV0000","I-A-1-a","oc-lengadoc","a revèrs","AV0000"
"8", "à reculons","AV0000","I-A-1-a","oc-gascon","d'arreculas","AV0000"
"8", "à reculons","AV0000","I-A-1-a","oc-gascon","de reculas","AV0000"
"8", "à reculons","AV0000","I-A-1-a","oc-lengadoc","de reculons","AV0000"
"9", "à tâtons","AV0000","I-A-1-a","oc-gascon","a paupas","AV0000"
"9", "à tâtons","AV0000","I-A-1-a","oc-lengadoc","a palpas","AV0000"
"10", "à-coup","N1110","I-A-1-a","oc-gascon","sacat","N1110"
"10", "à-coup","N1110","I-A-1-a","oc-lengadoc","bassacada","N1210"
"11", "à-peu-près","N1110","I-A-1-a","oc","a-pauc-près","N1110"
"11", "à-peu-près","N1110","I-A-2-a","oc","aproximacion","N1110"
"12", "abaisser","V000050000000","I-A-1-a","oc-gascon","abaishar","V000050000000"
"12", "abaisser","V000050000000","I-A-1-a","oc-gascon","baishar","V000050000000"
"12", "abaisser","V000050000000","I-A-1-a","oc-lengadoc","abaissar","V000050000000"
"12", "abaisser","V000050000000","I-A-1-a","oc-lengadoc","baissar","V000050000000"
"12", "abaisser","V000050000000","I-A-2-a","oc","amermar","V000050000000"
...
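Rows in this layout have a space after the first comma, which matters when reading the file back: Python's `csv` module only treats the following quote as a field delimiter if `skipinitialspace=True` is set. A minimal sketch, using two rows copied from the output above as an in-memory sample (the variable names are illustrative):

```python
import csv
import io

# Two rows copied from the output above, plus the header line.
sample = (
    'id,fr,posfr,sens,dialect,oc,posoc\n'
    '"1", "à","AP1","I-A-1-a","oc","a","AP1"\n'
    '"2", "à cloche-pied","AV0000","I-A-1-a","oc-gascon","a sautapè","AV0000"\n'
)

# skipinitialspace=True drops the blank that follows the first comma,
# so the second field is parsed as a quoted string.
reader = csv.DictReader(io.StringIO(sample), skipinitialspace=True)
rows = list(reader)

for row in rows:
    print(row["id"], row["fr"], row["dialect"], row["oc"])
```

For the real file, `io.StringIO(sample)` would be replaced by `open("basicunif_complet_extract.txt", encoding="utf-8", newline="")`.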
Terms in error
id | fr | posfr | sens | dialect | oc | posoc |
3034 | de | AT3000 | I-A-1-a | langcit error | orthcit error | gramNormCit error |
3720 | du | AT3100 | I-A-1-a | langcit error | orthcit error | gramNormCit error |
13591 | je | PD312011500002 | I-A-1-a | langcit error | orthcit error | gramNormCit error |
13593 | tu | PD211011500022 | I-A-1-a | langcit error | orthcit error | gramNormCit error |
13595 | il | PD311011500002 | I-A-1-a | langcit error | orthcit error | gramNormCit error |
13596 | je | PD312011500002 | I-A-1-a | langcit error | orthcit error | gramNormCit error |
CSV loaded into a MySQL table
The database is named 'cplo_basic_unificat'.
Aggregating the POS tags for a given sense
SELECT idterm, fr, posfr,
       group_concat( sens ) AS sens,
       group_concat( dialect ) AS dialectes,
       group_concat( oc ) AS occitans,
       group_concat( posoc ) AS posoccitans
FROM `cplo_basic_unificat`
WHERE dialect != 'langcit error'
GROUP BY idterm, sens
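To see what this aggregation returns without a MySQL server, the query can be tried on a few rows in an in-memory SQLite database, which also provides `group_concat`. This is only a stand-in: SQLite is an assumption made for the sketch, the column types are guessed, and the order of values inside each concatenated list is not guaranteed. The sample rows are copied from the results and the error table above.

```python
import sqlite3

# Sample rows copied from the results above, plus one row from the error table.
rows = [
    ("12", "abaisser", "V000050000000", "I-A-1-a", "oc-gascon", "abaishar", "V000050000000"),
    ("12", "abaisser", "V000050000000", "I-A-1-a", "oc-gascon", "baishar", "V000050000000"),
    ("12", "abaisser", "V000050000000", "I-A-2-a", "oc", "amermar", "V000050000000"),
    ("3034", "de", "AT3000", "I-A-1-a", "langcit error", "orthcit error", "gramNormCit error"),
]

conn = sqlite3.connect(":memory:")
# Column names follow the query; TEXT types are an assumption.
conn.execute(
    "CREATE TABLE cplo_basic_unificat "
    "(idterm TEXT, fr TEXT, posfr TEXT, sens TEXT, dialect TEXT, oc TEXT, posoc TEXT)"
)
conn.executemany("INSERT INTO cplo_basic_unificat VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

query = """
SELECT idterm, fr, posfr,
       group_concat(sens) AS sens,
       group_concat(dialect) AS dialectes,
       group_concat(oc) AS occitans,
       group_concat(posoc) AS posoccitans
FROM cplo_basic_unificat
WHERE dialect != 'langcit error'
GROUP BY idterm, sens
"""

# One result row per (term, sense) pair; the error row is filtered out.
results = conn.execute(query).fetchall()
for result in results:
    print(result)
conn.close()
```

Each result row gathers all translations of one sense of one French term, so the Occitan forms and their POS codes line up position by position in the concatenated columns.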